Building the Universal Library

What will it take for Google or another search engine to truly assemble a library of all of the world's information? A thought-provoking essay by Wired magazine's 'senior maverick' takes a fascinating look at the challenges.

The various book scanning projects underway throughout the world don’t snare as much media coverage as higher-profile products and services introduced by the search engines, but they’re nonetheless important initiatives. As Wired co-founder Kevin Kelly writes in a recent New York Times Magazine article, “The dream is an old one: to have in one place all knowledge, past and present. All books, all documents, all conceptual works, in all languages.”

Building a Universal Library is a huge undertaking, and not just because the physical effort of scanning tens of millions of books is in itself such a massive task. Once scanned, the books must be indexed and made searchable, all the while respecting the copyrights of books not yet in the public domain.

Kelly offers some interesting stats about the current progress of various large-scale book scanning projects that we’ve written about at Search Engine Watch, such as Google Print, the Yahoo- and Microsoft-backed Open Content Alliance, the Internet Archive’s Million Books Project and others.

He says these projects are scanning about a million books a year. Although this sounds like an impressive pace, it amounts to just 5% of all books currently in print. Fortunately, much of the new information created by humans is now in digital format, so it can more easily be included in the Universal Library without the extensive physical effort of scanning books.

And let’s not forget the web. Although the search engines have become fairly proficient at creating comprehensive indexes of the surface web, they’re still missing massive amounts of content located in databases or other dynamic sources (the Invisible web)—not to mention web pages that have disappeared.

“The grand library naturally needs a copy of the billions of dead Web pages no longer online and the tens of millions of blog posts now gone—the ephemeral literature of our time.”

Including this “ephemeral literature” could prove to be a major challenge. Various studies have put the “half-life” of an average web page at just under two years, with the half-life of a typical web site being just over two years.
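To get a feel for what a two-year half-life implies, here is a minimal sketch in Python. It assumes pages disappear according to a simple exponential decay model and rounds the half-life to exactly 2.0 years; both are simplifying assumptions made for illustration, not figures taken from the studies mentioned above, and the function name is hypothetical.

```python
def surviving_fraction(years: float, half_life_years: float = 2.0) -> float:
    """Fraction of pages still online after `years`.

    Assumes simple exponential decay; the 2.0-year default is a rounded,
    illustrative stand-in for "just under two years".
    """
    return 0.5 ** (years / half_life_years)


if __name__ == "__main__":
    for years in (1, 2, 4, 10):
        share = surviving_fraction(years)
        print(f"after {years:>2} years: {share:.1%} of pages remain")
```

Under these assumptions, half of today's pages would be gone in two years and roughly 97 percent in ten, which is why an archive has to capture pages continuously rather than after the fact.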

The most complete publicly accessible archive of the web, the Internet Archive, contains just a fraction of all content that has been posted to the web—some 55 billion pages in all.

But I think it’s a fair bet that Google and Yahoo haven’t thrown away the pages they’ve crawled through the years. And there’s a precedent for digital restoration on a massive scale: Google’s painstaking effort to build an archive of Usenet.

Assembling archives stored on magnetic tape, CD-ROM and other sources, Google restored a comprehensive archive of Usenet, dating back to 1981, and made this available to users in December 2001. Although still not totally complete, the renamed Google Groups now likely contains more than 99 percent of all Usenet postings ever made.

It’s not unthinkable that Google and Yahoo, the longest-surviving crawler-based engines, could collaborate to restore a comprehensive archive of the web. Surely there are data archives from search engines now long gone that could also be mined to build out an archive.

Apart from the challenges of simply creating the Universal Library and making it searchable, Kelly thinks the entire paradigm of how we consume information must change. He envisions the emergence of Wikipedia-like directories where fans of particular types of information can write reviews, or create pointers to obscure works for other fans. In essence, we will all become librarians in the Universal Library, helping each other navigate the vast amount of information that’s difficult for us to cope with today.

And, just as we do with our digital music now, we’ll be able to mix and mash content to create “playlists” (Kelly calls them “bookshelves”) to share with others.

Ah, but what about copyright? How can we create mashups without violating existing laws? Kelly spends a lot of time analyzing the current state of copyright law and how it poses a major barrier to the creation and fluid operation of the Universal Library.

These are just a few of the topics Kelly touches on in “Scan This Book!”, his terrific intellectual romp through the issues surrounding a Universal Library. It’s a fascinating and thoughtful read, well worth the attention of anyone who spends a lot of time consuming digital information and is impatiently awaiting the arrival of the Universal Library.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication’s search facility, which most have, and search for the headline.

Yahoo: Our ads are better
CNET News.com May 17 2006 9:31PM GMT
The next Yahoo: social search, user content
InfoWorld May 17 2006 9:28PM GMT
Yahoo sees no financial gain from ad system in ’06
Reuters May 17 2006 7:11PM GMT
ShopWiki Launches Mobile Shopping Search Engine
Electronic Commerce Guide May 17 2006 6:55PM GMT
Google, Microsoft and Adobe – The battle for the new operating system
ZDNet May 17 2006 4:58PM GMT
Yahoo to Personalize Search
Red Herring May 17 2006 4:48PM GMT
DoD to use Microsoft Virtual Earth; Google China controversy continues – 05/12/2006
ITworld.com May 17 2006 2:29PM GMT
Google is most loved brand
Guardian Unlimited reg May 17 2006 10:32AM GMT
Riya heading into Web search
ZDNet May 17 2006 6:17AM GMT
Digging into Google Notebook javascript
ZDNet May 17 2006 5:36AM GMT
What’s Next for Yahoo?
iMedia Connection May 17 2006 5:05AM GMT
Google fine-tunes video service
CNET News.com May 17 2006 4:50AM GMT
Study Finds RSS Ad CTR Leveling Off
ClickZ Today May 17 2006 4:08AM GMT
Is Google God?
Time May 17 2006 12:44AM GMT
The Aristocracy Of Relevance
Media Post May 16 2006 9:57PM GMT
