Search Engine Watch Forums
Old 12-19-2004   #1
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
Something fishy with Google library project

Something fishy is going on.

In the NYT on December 14, 2004, "Google Is Adding Major Libraries to Its Database," by John Markoff and Edward Wyatt:

"Each agreement with a library is slightly different. Google plans to digitize nearly all the eight million books in Stanford's collection and the seven million at Michigan."
...
"At Stanford, Google hopes to be able to scan 50,000 pages a day within the month, eventually doubling that rate, according to a person involved in the project."
____________

50,000 pages a day is 2,083 pages per hour.

Let's double this rate, as Google will do "eventually," and call it 4,167 pages per hour. How many years will it take to do 8 million volumes at, say, 200 pages per volume?

8 million x 200 = 1,600,000,000 pages to be scanned.

1,600,000,000 / 4,167 = 383,969 hours to scan Stanford's library at the speed they hope to attain "eventually."

Let's run 24 hours a day (three shifts of temp workers at minimum wage!) and assume that the wizards at the Googleplex will never have any downtime. How many days is this? 383,969 / 24 = 15,999 days.

How many years is this? 15,999 / 365.25 = 43.8 years. Even their cookie won't last that long!
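
For anyone who wants to check the arithmetic, here it is as a quick Python script (the 200 pages per volume is my own assumption, not Google's figure):

Code:
# Stanford back-of-the-envelope, assuming 200 pages per volume
volumes = 8_000_000
pages = volumes * 200                  # 1,600,000,000 pages to scan

pages_per_day = 100_000                # the "eventually" doubled rate
pages_per_hour = pages_per_day / 24    # ~4,167 pages per hour

hours = pages / pages_per_hour         # ~384,000 hours
years = hours / 24 / 365.25
print(round(years, 1))                 # 43.8 years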
________________

But there's another army of temp workers at Michigan. Let's look at the Michigan figures. According to University of Michigan librarian John Wilkin, as reported in the Detroit Free Press on December 14 by columnist Mike Wendland:

7,000,000: Volumes in the U-M library to be digitized.
2,380,000,000: Estimated number of pages.

Hold it right there, Mr. Librarian! Are you saying that each volume has an average of 340 pages? Well okay, you're the librarian!

I have to adjust my Stanford figures. I assumed 200 pages per volume for 8 million volumes. If it's really 340 pages per volume, then the Stanford project will take 1.7 times longer. Instead of 43.8 years, Stanford will take 74.46 years! (Two back-to-back cookies are needed!)

Then Mr. Wilkin goes on to say: "Going as fast as we can with the traditional means of doing this, it would take us about 1,600 years to do all 7 million volumes. Google will do it in six years."

Wow, I'm impressed. Google really is God. What's the scan rate for 7 million volumes over 6 years, if you run around the clock?

7 million x 340 pages per volume = 2,380,000,000 pages
6 years = 365.25 x 24 x 6 = 52,596 hours
scan rate = 2,380,000,000 / 52,596 = 45,251 per hour

Over 24 hours, that comes to 1,086,024 pages per day. Now remember, at Stanford Google will "eventually" double the rate of 50,000 per day to 100,000 per day, which, as we saw above, works out to 4,167 pages per hour.

In other words, even running full speed 24 hours per day, the scan rate Google will have to achieve at Michigan in order to pull it off in six years is 10.86 times greater than the rate they will "eventually" achieve at Stanford.
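
The same check in Python, working backward from the six-year promise:

Code:
# What scan rate does Michigan's six-year promise imply?
pages = 7_000_000 * 340                        # 2,380,000,000 pages
hours = 6 * 365.25 * 24                        # 52,596 hours in six years
per_hour = pages / hours                       # ~45,251 pages per hour
per_day = per_hour * 24                        # ~1,086,000 pages per day

stanford_per_day = 100_000                     # Markoff's "eventually" rate
print(round(per_day / stanford_per_day, 2))    # 10.86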

But of course, the Mike Wendland column also says this:

"The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail -- the task would be impossible, says John Wilkin, the U-M associate librarian who is heading the project."

Wait a minute, the NYT piece said this:

"At least initially, Google's digitizing task will be labor intensive, with people placing the books and documents on sophisticated scanners whose high-resolution cameras capture an image of each page and convert it to a digital file."
...
"The company refused to comment on the technology that it was using to digitize books, except to say that it was nondestructive. But according to a person who has been briefed on the project, Google's technology is more labor-intensive than systems that are already commercially available."

So their secret sauce isn't even ready for tasting! Better hurry, the clock is ticking....

Is it possible that the NYT piece dropped a zero and the rate is really ten times the figure they reported? I doubt it, from what I know about the technology. If anyone thinks this is possible, the NYT will probably be happy to check out their source again and run a correction if they goofed.
Old 12-19-2004   #2
NFFC
"One wants to have, you know, a little class." DianeV
 
Join Date: Jun 2004
Posts: 468
Interesting article in the Sunday Times today; it says that the "first phase" will take 10 years.

The article also talks about the possible dangers to the public from the move:

Quote:
There is, of course, a more worrying possibility. By the act of converting printed books to digital form Google will be creating a new copyright.

Works in the public domain will effectively be privatised. Whether or not Google chooses to exercise its rights, it and its library partners will be owners of the newly processed property. So the vast reservoir of material in the out-of-copyright public domain will become “proprietary”, or pay-per-view. If we get access, it will be because we are “allowed”, not because we have the right.
http://www.timesonline.co.uk/printFr...408268,00.html [seems to require registration for non-UK IPs]
Old 12-19-2004   #3
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
That's an interesting link, NFFC. The libraries have said that they will get copies of the scanned images. I'm assuming at this point, and attempting to confirm, that there will be an uncorrected OCR layer that is invisible to the user. No proofreading required. It's probably 90 percent clean, which is good enough for keyword searching. The searcher always sees the raster image, which is error-free, of course. But searching won't be 100 percent accurate.

I'm also assuming that this OCR layer, and the software for searching it, will remain proprietary to Google. The libraries may have to run their own OCR on the images if they want their own search engine. This is speculation at this point, but it seems reasonable to me that Google would do it this way.

I've gotten some feedback from John Markoff at the NYT and from librarian John Wilkin at the University of Michigan. Markoff said "My numbers are probably incomplete. I assume they will scale beyond the 100,000 number, or else it will be a long long time..."

In other words, it's not a typo in the NYT. Markoff confirms that this is what his source told him about the Stanford project.

Markoff's number is 100,000 pages per day, or 4,167 pages per hour if you run three shifts.

A Kirtas APT BookScan 1200 can do 1,200 pages per hour. This means four Kirtas machines running around the clock for the Stanford project. The books will be trucked from Stanford to the Googleplex. One employee can run two machines simultaneously, according to Kirtas. The reason I think Google will be using Kirtas is that the only other automated page-turner, made by 4DigitalBooks in Switzerland, weighs almost ten times more (1,600 pounds), costs twice as much ($240,000), and is only slightly faster at 1,500 pages per hour. The Kirtas weighs 170 pounds and costs $120,000 for a single unit. I'd be surprised if Google had their own machine that was better than this. Kirtas and Google were both finalists for the prestigious 2004 World Technology Awards. The Kirtas machine has only been in production for a year.

Four machines for the Stanford project would mean two operators on duty around the clock. People need sleep, so that comes to about eight employees for the four machines.

I'm willing to accept Markoff's number of 100,000 pages per day, because it sounds reasonable.

However, John P. Wilkin at the University of Michigan confirmed to me that in his negotiations and experiments with Google, they arrived at a figure of 750,000 volumes a year. Wilkin's number is 6.98 times greater than Markoff's, and would require 25 Kirtas machines running around the clock at the University of Michigan, and 50 employees to tend them. Wilkin said that Google was willing to scale this up to double the number of stations.
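
To check the machine counts, here is the arithmetic in Python (340 pages per volume, from Wilkin's own figures above):

Code:
KIRTAS_PAGES_PER_HOUR = 1_200

# Stanford at Markoff's 100,000 pages per day, around the clock:
stanford_per_hour = 100_000 / 24                    # ~4,167 pages per hour
print(stanford_per_hour / KIRTAS_PAGES_PER_HOUR)    # ~3.5, so 4 machines

# Michigan at Wilkin's 750,000 volumes per year, 340 pages each:
michigan_per_hour = 750_000 * 340 / (365.25 * 24)   # ~29,089 pages per hour
print(michigan_per_hour / KIRTAS_PAGES_PER_HOUR)    # ~24.2, so 25 machines
print(michigan_per_hour / stanford_per_hour)        # ~6.98, Wilkin vs. Markoff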

I don't buy Wilkin's number. If I did buy it, then his estimate of six years to do the 7 million volumes at Michigan is okay, assuming that Google does indeed double the operation at some point to 50 machines and 100 employees.

But if I don't buy Wilkin's number and go with Markoff's number, then I'd have to insist that the Stanford project will take 74 years, and the simultaneous Michigan project will take 65 years.

Yes, technology will improve in coming years. The pages will probably get turned faster with better robots. But already I'm assuming that there is zero proofreading of the OCR layer used for keyword searching, and I'm not assuming any great amounts of time for camera work and processing. Improvements in speed would be all mechanical, and perhaps at the cost of damaging some books -- which Google cannot afford to do because these libraries are rightfully protective of their collections.
Old 12-19-2004   #4
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
I heard back from Mr. Wilkin.

1) The University of Michigan signed a non-disclosure agreement with Google.

2) The OCR layer is as I described it.

3) I am wrong in the sense that the University of Michigan will also get a copy of the OCR for the images.

4) I am wrong about the page-turning process. Google is using something completely unlike the Kirtas machine.

Quote:
The technology that Google is using is nothing like the Kirtas and is entirely their own. I'm not able to provide any details on the nature of the technology, but because we did provide extensive review going into the project, and have occasional checks of systems and processes, I know it well enough (i.e., very well) to say that it's nothing like the Kirtas.
5) Markoff may be right about Stanford, but his numbers are more speculative. "No production is taking place at Stanford (or any of the other institutions) at the present time."

Okay, does anyone know what sort of technology is being used? This is like a mystery now!
Old 12-20-2004   #5
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
 
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
Quote:
Okay, does anyone know what sort of technology is being used? This is like a mystery now!
I guessed it would involve highly-skilled Google interns scanning the books so it would be "non-destructive" in nature -- or buying a copy of a book already in the collection, then debinding that for scanning, leaving the original intact.

That's guessing. Google's simply not saying. My feeling has been to sit back for a few weeks and watch what leaks out of the libraries. I have no doubt we'll see them discuss the project in much more depth, NDAs regardless. The universities may have signed those NDAs. Those in the libraries, from workers to researchers, did not. They'll hear and leak details.

Quote:
I am wrong in the sense that the University of Michigan will also get a copy of the OCR for the images.
Oh, every library is supposed to get a copy of what's indexed. Google has said as much, such as in our own story on the project.

What Google has NOT said are any specifics of the format. My assumption has been that this material will be indexed in a way that only Google's own technology can deal with. So great that you get a copy, but can you read it?

In other words, say all the books were scanned to be made into Microsoft Word documents. Fine for the library, as long as they have Microsoft Word.

In Google's case, say the books are scanned into Google Library Format. Can you get a reader for that down at CompUSA? No. But I suspect that if you purchase a copy of Google's enterprise appliance, that might be able to help you with your needs.

So is this perhaps a nice way for Google to get libraries onto its own technology? And to give library users a stronger tie to Google web search? After all, I can imagine that in the future, when someone wants to search the collections of a particular library, they'll hit a page that lets them search the library itself or perhaps the entire web with, I dunno, Google?

Overall, I'm glad the project is going ahead -- but I'd feel better if it were happening with more public disclosure and perhaps some overall plan or some open source effort so that efforts to digitize print aren't going to be duplicated.
Old 12-20-2004   #6
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
Quote:
...buying a copy of a book already in the collection, then debinding that for scanning, leaving the original intact
Not with 7 million volumes, almost all of which are out of print. The main issue in nondestructive book scanning is that the book cannot be opened flat because it hurts the binding. Robot scanners and page-turners work with partially-opened books, and use mirrors and automated platform adjustments before snapping the camera. Google is not using Kirtas, so the real question is, "How is Google going to turn the pages?" Maybe with more temp help and a slower scan rate than Kirtas? I don't know.


Quote:
Oh, every library is supposed to get a copy of what's indexed. Google has said that such as in our own story on the project.
The OCR file is distinct from the raster-scanned image file that preceded it. Michigan will get both. I wasn't sure until Wilkin said so that Michigan would get both.


Quote:
What Google has NOT said are any specifics of the format. My assumption has been this material will be indexed in a way that only Google's own technology can deal with. So great that you get a copy, but can you read it.
Whether there are any proprietary APIs that require licensing from Google before a library can efficiently use its copies is, of course, a burning question. Despite the non-disclosure agreement, this is of paramount importance to groups such as the American Library Assn. We will find out, and sooner rather than later.


Quote:
...I'd feel better if it were happening with more public disclosure and perhaps some overall plan or some open source effort so that efforts to digitize print aren't going to be duplicated.
Amen to that.
Old 12-20-2004   #7
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
It would appear from this San Francisco Chronicle article that Google is primarily using humans to turn pages.

Production is already underway at Michigan. Wilkin estimates that an operator gets through about 50 books in a workday. If you operate seven days a week, in order to meet Michigan's six-year time frame for 7 million volumes, this would mean 64 operator workstations at Michigan. If you work three shifts, you could cut this down to 22 workstations.

That works out to a scan rate for a single operator of 2,125 pages per hour (50 books x 340 pages over an eight-hour day). The Kirtas robot does 1,200 per hour. There must be some robotics involved to at least help out the operator, but it still sounds like a sweatshop.
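
Here's the workstation arithmetic in Python (the eight-hour workday is my assumption):

Code:
import math

# Workstations needed for 7 million volumes in six years,
# at Wilkin's estimate of 50 books per operator per workday
volumes = 7_000_000
days = 6 * 365.25                  # seven days a week for six years

stations = volumes / (50 * days)   # ~63.9
print(math.ceil(stations))         # 64 workstations, one shift a day
print(math.ceil(stations / 3))     # 22 workstations with three shifts

# Implied per-operator rate, at 340 pages per volume
# over an assumed eight-hour day:
print(50 * 340 / 8)                # 2,125 pages per hour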

Anyone want a job with Google in Michigan?
Old 12-21-2004   #8
Dave Hawley
Please remove heart from sleeve before replying
 
Join Date: Nov 2004
Location: Australia
Posts: 573
This is a massive job and well worth the effort. Good on ya Google!
Old 12-21-2004   #9
DaveN
 
 
Join Date: Jun 2004
Location: North Yorkshire
Posts: 434
Aren't most books kept on microfiche?

DaveN
Old 12-27-2004   #10
femtoid
 
Posts: n/a
Quote:
Production is already underway at Michigan. Wilkin estimates that an operator gets through about 50 books in a workday. If you operate seven days a week, in order to meet Michigan's six-year time frame for 7 million volumes, this would mean 64 operator workstations at Michigan. If you work three shifts, you could cut this down to 22 workstations.

This scan rate for a single operator is 2,125 pages per hour. The Kirtas robot does 1,200 per hour. There must be some robotics involved to at least help out the operator, but it still sounds like a sweatshop.
There are 3,600 seconds in an hour.
If Google is using some sort of photography, and it takes a human a second to turn the page, and the "camera" can see two pages at a time, then the max throughput would be 7,200 pages an hour, right?
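
In Python terms (the one second per page turn is just my guess):

Code:
SECONDS_PER_HOUR = 3_600
seconds_per_turn = 1     # a guess at human page-turning speed
pages_per_shot = 2       # the camera captures both facing pages

print(SECONDS_PER_HOUR // seconds_per_turn * pages_per_shot)   # 7,200 pages/hour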

I'm thinking that Google would "offload" the processing of the images -- they do run a very "large" computer -- with software written by some of the best minds in CS.
 
Old 06-10-2005   #11
snitchybitch
 
Posts: n/a
Technical sweatshop

Quote:
Originally Posted by Everyman

This scan rate for a single operator is 2,125 pages per hour. The Kirtas robot does 1,200 per hour. There must be some robotics involved to at least help out the operator, but it still sounds like a sweatshop.

It's all human, at least at headquarters. The scan rate for debound books is around 700,000 pages a day.

As for it being a sweatshop -- that's exactly what it has turned into. The temps are no longer allowed in the main Google buildings, nor allowed to attend any of the functions. You'd think an 80-billion-dollar company would treat its pawns a little better.
 
Old 06-10-2005   #12
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
snitchybitch, thanks for your input. The only way the public will find out what it has a right to know about Google is from whistleblowers.

I had a reporter tell me months ago that he didn't get to square one in trying to coax information out of Harvard about their Google indexing.

The Freedom of Information office at the U. of Michigan says that they will respond to my request on June 17. I expect to be refused on the grounds that they signed a nondisclosure agreement, and then I'll have to appeal to the president of the University. I just read that Google is looking for lots of office space in Michigan in connection with the project. If they were allowed to ship the books to El Salvador, I'm sure Google would prefer to do it there for the cheap labor. Throw up a factory in the garment district and hire a bunch of children.

For the world's largest media/information company, you'd think that by now the public sector would know more about what's going on at the Googleplex, and be in a position to ask questions. But no, all we get is hype from Wall Street.
Old 06-16-2005   #13
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
 
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
Everyman's request comes through, and the agreement is published. Closing this thread so fresh discussion about that can begin over here: Google Library Agreement With University Of Michigan Published