|
#1
|
|||
|
|||
|
Something fishy with Google library project
Something fishy is going on.
In the NYT on December 14, 2004, "Google Is Adding Major Libraries to Its Database," by John Markoff and Edward Wyatt:

"Each agreement with a library is slightly different. Google plans to digitize nearly all the eight million books in Stanford's collection and the seven million at Michigan." ... "At Stanford, Google hopes to be able to scan 50,000 pages a day within the month, eventually doubling that rate, according to a person involved in the project."

____________

50,000 pages a day is 2,083 pages per hour. Let's double this rate, as Google will do "eventually," and call it 4,167 pages per hour.

How many years will it take to do 8 million x 200 pages per volume?

8 million x 200 = 1,600,000,000 pages to be scanned.
1,600,000,000 / 4,167 = 383,969 hours to scan Stanford's library at the speed they hope to attain "eventually."

Let's run 24-hours a day (three shifts of temp workers at minimum wage!) and assume that the wizards at the Googleplex will never have any down time.

How many days is this? 383,969 / 24 = 15,999 days.
How many years is this? 15,999 / 365.25 = 43.8 years.

Even their cookie won't last that long!

________________

But there's another army of temp workers at Michigan. Let's look at the Michigan figures. According to University of Michigan librarian John Wilkin, as reported in the Detroit Free Press on December 14 by columnist Mike Wendland:

7,000,000: Volumes in the U-M library to be digitized.
2,380,000,000: Estimated number of pages.

Hold it right there, Mr. Librarian! Are you saying that each volume has an average of 340 pages? Well okay, you're the librarian! I have to adjust my Stanford figures. I assumed 200 pages per volume for 8 million volumes. If it's really 340 pages per volume, then the Stanford project will take 1.7 times longer. Instead of 43.8 years, Stanford will take 74.46 years! (Two back-to-back cookies are needed!)

Then Mr. Wilkin goes on to say, "Going as fast as we can with the traditional means of doing this, it would take us about 1,600 years to do all 7 million volumes," he said. "Google will do it in six years."

Wow, I'm impressed. Google really is God. What's the scan rate for 7 million volumes over 6 years, if you run around the clock?

7 million x 340 pages per volume = 2,380,000,000 pages
6 years = 365.25 x 24 x 6 = 52,596 hours
scan rate = 2,380,000,000 / 52,596 = 45,251 per hour
For 24 hours, that comes to 1,086,024 pages per day.

Now remember at Stanford, Google will "eventually" double the rate of 50,000 per day, which means 100,000 per day when they do this. Recall from above that this means 4,167 pages per hour.

In other words, even running full-speed 24 hours per day, the scan rate Google will have to achieve at Michigan in order to pull it off in six years, is 10.86 times greater than the rate they will "eventually" achieve at Stanford.

But of course, the Mike Wendland column also says this:

"The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail -- the task would be impossible, says John Wilkin, the U-M associate librarian who is heading the project."

Wait a minute, the NYT piece said this:

"At least initially, Google's digitizing task will be labor intensive, with people placing the books and documents on sophisticated scanners whose high-resolution cameras capture an image of each page and convert it to a digital file." ... "The company refused to comment on the technology that it was using to digitize books, except to say that it was nondestructive. But according to a person who has been briefed on the project, Google's technology is more labor-intensive than systems that are already commercially available."

So their secret sauce isn't even ready for tasting! Better hurry, the clock is ticking....

Is it possible that the NYT piece dropped a zero and the rate is really ten times the figure they reported? I doubt it, from what I know about the technology. If anyone thinks this is possible, the NYT will probably be happy to check out their source again and run a correction if they goofed. |
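
For anyone who wants to re-run the arithmetic in this post, here is a minimal Python sketch. The function and variable names are made up for illustration; the inputs are the figures quoted above (8 million and 7 million volumes, 200 or 340 pages per volume, 100,000 pages per day, six years), and it assumes non-stop scanning with no down time, same as the post does.

```python
DAYS_PER_YEAR = 365.25
HOURS_PER_YEAR = DAYS_PER_YEAR * 24  # 8,766 hours, running around the clock


def years_to_scan(volumes, pages_per_volume, pages_per_day):
    """Years needed to scan a collection at a constant daily rate, 7 days a week."""
    total_pages = volumes * pages_per_volume
    return total_pages / pages_per_day / DAYS_PER_YEAR


def required_pages_per_hour(volumes, pages_per_volume, years):
    """Hourly rate needed to finish a collection in the given number of years."""
    total_pages = volumes * pages_per_volume
    return total_pages / (years * HOURS_PER_YEAR)


# Stanford at the "eventual" doubled rate of 100,000 pages per day
stanford_years_200 = years_to_scan(8_000_000, 200, 100_000)
stanford_years_340 = years_to_scan(8_000_000, 340, 100_000)

# Michigan: rate needed to finish 7 million volumes in six years
michigan_pages_per_hour = required_pages_per_hour(7_000_000, 340, 6)
stanford_pages_per_hour = 100_000 / 24
rate_ratio = michigan_pages_per_hour / stanford_pages_per_hour

print(f"Stanford, 200 pages/volume: {stanford_years_200:.1f} years")
print(f"Stanford, 340 pages/volume: {stanford_years_340:.1f} years")
print(f"Michigan required rate: {michigan_pages_per_hour:,.0f} pages/hour")
print(f"Michigan rate vs. Stanford rate: {rate_ratio:.2f}x")
```

Rounding aside, this gives the same picture as the hand calculation: roughly 44 and 74 years for Stanford, and a Michigan rate close to eleven times the Stanford rate.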
|
#2
|
|||
|
|||
|
Interesting article in the Sunday Times today, says that the "first phase" will take 10 years.
The article also talks about the possible dangers to the public from the move: Quote:
|
|
#3
|
|||
|
|||
|
That's an interesting link, NFFC. The libraries have said that they will get copies of the scanned images. I'm assuming at this point, and attempting to confirm, that there will be an uncorrected OCR layer that is invisible to the user. No proofreading required. It's probably 90 percent clean, which is good enough for keyword searching. The searcher always sees the raster image, which is error-free, of course. But searching won't be 100 percent accurate.
I'm also assuming that this OCR layer, and the software for searching it, will remain proprietary to Google. The libraries may have to run their own OCR on the images if they want their own search engine. This is speculation at this point, but it seems reasonable to me that Google would do it this way.

I've gotten some feedback from John Markoff at the NYT and from librarian John Wilkin at the University of Michigan. Markoff said "My numbers are probably incomplete. I assume they will scale beyond the 100,000 number, or else it will be a long long time..." In other words, it's not a typo in the NYT. Markoff confirms that this is what his source told him about the Stanford project.

Markoff's number is 100,000 pages per day, or 4,167 pages per hour if you run three shifts. A Kirtas APT BookScan 1200 can do 1,200 pages per hour. This means four Kirtas machines running around the clock for the Stanford project. The books will be trucked from Stanford to the Googleplex. One employee can run two machines simultaneously, according to Kirtas.

I think Google will be using Kirtas because the only other automated page-turner weighs ten times more (1,600 pounds), costs twice as much ($240,000) and is only slightly faster at 1,500 pages per hour. It's made by 4DigitalBooks in Switzerland. The Kirtas weighs 170 pounds and costs $120,000 for a single unit. I'd be surprised if Google had their own machine that was better than this. Kirtas and Google were both finalists for the prestigious 2004 World Technology Awards. The Kirtas machine has only been in production for a year.

The Kirtas machines require two employees on duty around the clock for these presumed four machines for the Stanford project. People need sleep, so this means about 8 employees per 4 machines.

I'm willing to accept Markoff's number of 100,000 pages per day, because it sounds reasonable. However, John P. Wilkin at the University of Michigan confirmed to me that in his negotiations and experiments with Google, they arrived at a figure of 750,000 volumes a year. At 340 pages per volume, Wilkin's number is 6.98 times greater than Markoff's, and would require 25 Kirtas machines running around the clock at the University of Michigan, and 50 employees to tend them. Wilkin said that Google was willing to scale this up to double the number of stations.

I don't buy Wilkin's number. If I did buy it, then his estimate of six years to do the 7 million volumes at Michigan is okay, assuming that Google does indeed double the operation at some point to 50 machines and 100 employees. But if I don't buy Wilkin's number and go with Markoff's number, then I'd have to insist that the Stanford project will take 74 years, and the simultaneous Michigan project will take 65 years.

Yes, technology will improve in coming years. The pages will probably get turned faster with better robots. But already I'm assuming that there is zero proofreading of the OCR layer used for keyword searching, and I'm not assuming any great amounts of time for camera work and processing. Improvements in speed would be all mechanical, and perhaps at the cost of damaging some books -- which Google cannot afford to do because these libraries are rightfully protective of their collections. |
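
The machine counts in this post can be checked with a short Python sketch. This is only an illustration under the post's own assumptions: 1,200 pages per hour per Kirtas machine, 340 pages per volume, and machines running 24 hours a day, every day. The helper name `machines_needed` is hypothetical.

```python
import math

KIRTAS_PAGES_PER_HOUR = 1_200   # rate quoted for the Kirtas APT BookScan 1200
PAGES_PER_VOLUME = 340          # average implied by the Michigan figures
DAYS_PER_YEAR = 365.25


def machines_needed(pages_per_day, machine_pages_per_hour=KIRTAS_PAGES_PER_HOUR):
    """Whole machines required to sustain a daily page rate around the clock."""
    return math.ceil(pages_per_day / 24 / machine_pages_per_hour)


# Markoff's figure: 100,000 pages per day
markoff_pages_per_day = 100_000

# Wilkin's figure: 750,000 volumes per year, converted to pages per day
wilkin_pages_per_day = 750_000 * PAGES_PER_VOLUME / DAYS_PER_YEAR

markoff_machines = machines_needed(markoff_pages_per_day)
wilkin_machines = machines_needed(wilkin_pages_per_day)
wilkin_vs_markoff = wilkin_pages_per_day / markoff_pages_per_day

# Michigan's 7 million volumes at Markoff's rate instead of Wilkin's
michigan_years_at_markoff_rate = (
    7_000_000 * PAGES_PER_VOLUME / markoff_pages_per_day / DAYS_PER_YEAR
)

print(f"Machines at Markoff's rate: {markoff_machines}")
print(f"Machines at Wilkin's rate: {wilkin_machines}")
print(f"Wilkin's rate vs. Markoff's: {wilkin_vs_markoff:.2f}x")
print(f"Michigan at Markoff's rate: {michigan_years_at_markoff_rate:.0f} years")
```

The employee counts follow from the machine counts if you accept the post's rule of thumb of about 8 employees per 4 machines.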
|
#4
|
|||
|
|||
|
I heard back from Mr. Wilkin.
1) The University of Michigan signed a non-disclosure agreement with Google.
2) The OCR layer is as I described it.
3) I am wrong in the sense that the University of Michigan will also get a copy of the OCR for the images.
4) I am wrong about the page-turning process. Google is using something completely unlike the Kirtas machine.
Quote:
Okay, does anyone know what sort of technology is being used? This is like a mystery now! |
|
#5
|
|||
|
|||
|
Quote:
That's guessing. Google's simply not saying. My feeling has been to sit back for a few weeks and watch what leaks out of the libraries. I have no doubt we'll see them discuss the project in much more depth, NDAs regardless. The universities may have signed those NDAs. Those in the libraries, from workers to researchers, did not. They'll hear and leak details. Quote:
What Google has NOT said are any specifics of the format. My assumption has been that this material will be indexed in a way that only Google's own technology can deal with. So, great that you get a copy, but can you read it? In other words, say all the books were scanned to be made into Microsoft Word documents. Fine for the library, as long as they have Microsoft Word. In Google's case, say the books are scanned into Google Library Format. Can you get a reader for that down at CompUSA? No. But I suspect that if you purchase a copy of Google's enterprise appliance, that might be able to help you with your needs.

So is this perhaps a nice way for Google to get libraries into its own technology? And to give those using libraries a stronger tie to Google web search? After all, I can imagine that in the future, when someone wants to search the collections of a particular library, they hit a page that lets them search the library itself or perhaps the entire web with, I dunno, Google?

Overall, I'm glad the project is going ahead -- but I'd feel better if it were happening with more public disclosure and perhaps some overall plan or some open source effort so that efforts to digitize print aren't going to be duplicated. |
|
#6
|
||||
|
||||
|
Quote:
Quote:
Quote:
Quote:
|
|
#7
|
|||
|
|||
|
Google is primarily using humans to turn pages, it would appear from this San Francisco Chronicle article.
Production is already underway at Michigan. Wilkin estimates that an operator gets through about 50 books in a workday. If you operate seven days a week, in order to meet Michigan's six-year time frame for 7 million volumes, this would mean 64 operator workstations at Michigan. If you work three shifts, you could cut this down to 22 workstations. This scan rate for a single operator is 2,125 pages per hour. The Kirtas robot does 1,200 per hour. There must be some robotics involved to at least help out the operator, but it still sounds like a sweatshop. Anyone want a job with Google in Michigan? |
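
Here is a rough Python sketch of the workstation arithmetic above, for anyone who wants to try other numbers. It assumes 340 pages per volume and an 8-hour workday (the 2,125 pages per hour figure implies both), and work on every day of the year. The function name is hypothetical.

```python
import math

DAYS_PER_YEAR = 365.25
PAGES_PER_VOLUME = 340   # assumed average, from the Michigan page estimate
HOURS_PER_SHIFT = 8      # assumed length of one operator's workday


def workstations_needed(volumes, years, books_per_shift, shifts_per_day=1):
    """Workstations required to finish the collection on time, 7 days a week."""
    books_per_day = volumes / (years * DAYS_PER_YEAR)
    return math.ceil(books_per_day / (books_per_shift * shifts_per_day))


one_shift = workstations_needed(7_000_000, 6, books_per_shift=50)
three_shifts = workstations_needed(7_000_000, 6, books_per_shift=50, shifts_per_day=3)

# Implied page rate for a single operator doing 50 books per workday
pages_per_operator_hour = 50 * PAGES_PER_VOLUME / HOURS_PER_SHIFT

print(f"Workstations, one shift a day: {one_shift}")
print(f"Workstations, three shifts a day: {three_shifts}")
print(f"Pages per operator hour: {pages_per_operator_hour:,.0f}")
```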
|
#8
|
|||
|
|||
|
This is a massive job and well worth the effort. Good on ya Google!
|
|
#9
|
||||
|
||||
|
Aren't most books kept on microfiche?
DaveN |
|
#10
|
|||
|
|||
|
Quote:
If google is using some sort of photography, and it takes a human a second to turn the page, and the "camera" can see 2 pages at a time, then the max throughput would be 7,200 pages an hour, right? I'm thinking that google would "offload" the processing of the images - they do run a very "large" computer - with software written by some of the best minds in CS. |
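
The throughput guess above is easy to put into a few lines of Python. This is just a sketch of that estimate: one second per page turn and two pages per camera shot are the assumptions from the post, and the 45,251 pages per hour target is the Michigan six-year figure worked out in post #1. The function name is made up.

```python
import math


def max_pages_per_hour(seconds_per_turn, pages_per_shot):
    """Upper bound on throughput when a human turns pages for a camera."""
    turns_per_hour = 3600 / seconds_per_turn
    return turns_per_hour * pages_per_shot


station_rate = max_pages_per_hour(seconds_per_turn=1, pages_per_shot=2)

# Stations needed to hit the Michigan six-year rate from post #1,
# if every station ran flat out around the clock (an optimistic assumption)
michigan_target_pages_per_hour = 45_251
stations = math.ceil(michigan_target_pages_per_hour / station_rate)

print(f"Max pages per hour per station: {station_rate:,.0f}")
print(f"Stations needed for Michigan target: {stations}")
```

This is a ceiling, not a realistic rate: it ignores loading and unloading books, rest breaks, and rescans.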
|
#11
|
|||
|
|||
|
Technical sweatshop
Quote:
It's all human, at least at the headquarters. The scan rate for de-bound books is around 700,000 pages a day. As far as it being a sweatshop, that's exactly what it has turned into. The temps are no longer allowed in main Google buildings, nor allowed to attend any of the functions. You'd think an 80 billion dollar company would treat its pawns a little better. |
|
#12
|
|||
|
|||
|
snitchybitch, thanks for your input. The only way the public will find out what they have a right to know about Google will be from whistleblowers.
I had a reporter tell me months ago that he didn't get to square one in trying to coax information out of Harvard about their Google indexing. The Freedom of Information office at the U. of Michigan says that they will respond to my request on June 17. I expect to be refused on the grounds that they signed a nondisclosure agreement, and then I'll have to appeal to the president of the University. I just read that Google is looking for lots of office space in Michigan in connection with the project. If they were allowed to ship the books to El Salvador, I'm sure Google would prefer to do it there for the cheap labor. Throw up a factory in the garment district and hire a bunch of children. For the world's largest media/information company, you'd think that by now the public sector would know more about what's going on at the Googleplex, and be in a position to ask questions. But no, all we get is hype from Wall Street. |
|
#13
|
|||
|
|||
|
Everyman's request comes through, and the agreement is published. Closing this thread so fresh discussion about that can begin over here: Google Library Agreement With University Of Michigan Published
|