How Will Wikia Grow The Index?
When Wikia Search was released last night, Jimmy Wales explained that it currently uses a “placeholder index” for search. While this may be appropriate for an alpha release, I’d like to ask Jimmy exactly how Wikia plans to crawl and index a significant portion of the web.
The Grub distributed crawler, which was acquired from LookSmart, appeared to provide most of the solution. At the O’Reilly Open Source Convention, Wales announced that he would immediately release the crawler to the open source community.
Grub’s downloadable client allows “the site owners the option of crawling their own data, with their own bandwidth. The client…is designed to connect to a central coordinating server, grab a batch of URLs, and then proceed to crawl them.” Grub claims 20:1 savings in bandwidth for both Wikia and the hosting website.
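To make the quoted workflow concrete, here is a minimal in-memory sketch of that loop: a client asks a coordinator for a batch of URLs, crawls each one, and reports back. This is not Grub's actual protocol or code. `Coordinator`, `run_client`, and `fake_fetch` are hypothetical names invented for illustration, and no network traffic is involved.

```python
from collections import deque


class Coordinator:
    """Hypothetical stand-in for the central coordinating server."""

    def __init__(self, urls):
        self._queue = deque(urls)   # URLs still waiting to be crawled
        self.results = {}           # url -> summary reported by a client

    def get_batch(self, size):
        """Hand out up to `size` URLs; an empty list means no work is left."""
        batch = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch

    def report(self, url, summary):
        """Record what a client found for one URL."""
        self.results[url] = summary


def run_client(coordinator, fetch, batch_size=2):
    """Grab batches until the coordinator runs dry; return URLs crawled."""
    crawled = 0
    while True:
        batch = coordinator.get_batch(batch_size)
        if not batch:
            break
        for url in batch:
            coordinator.report(url, fetch(url))
            crawled += 1
    return crawled


def fake_fetch(url):
    """Assumed placeholder for a real HTTP fetch; returns a tiny summary."""
    return {"url_length": len(url)}


coordinator = Coordinator([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])
count = run_client(coordinator, fake_fetch)
print(count, "URLs crawled")
```

In a real deployment the fetch would run on the site owner's own machine, which is presumably where the claimed bandwidth savings come from; the sketch only shows the batch-and-report shape of the loop.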
I’m not sure how much progress Wikia has made here since the summer. Within Grub’s site stats, there’s a “Wikia Search” team that has crawled around 918k URLs so far. That seems far too low.
The site stats for individual Grub members tell a more complete story: the top 100 members have crawled 350 million URLs so far. The remaining 293 members aren’t shown, but if we assume they average 250k each, that adds about 73 million, for a total of roughly 423 million URLs crawled.
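The back-of-the-envelope estimate can be checked in a few lines. The 250k average for the unlisted members is the assumption made above, not a published figure, and the variable names are only illustrative.

```python
# Figures taken from the Grub site stats as described in the post.
TOP_100_URLS = 350_000_000        # URLs crawled by the top 100 members
REMAINING_MEMBERS = 293           # members not shown in the stats

# Assumption: unlisted members average 250k URLs each.
ASSUMED_AVG_PER_MEMBER = 250_000


def estimate_total_crawled(top_urls, remaining_members, avg_per_member):
    """Listed total plus an assumed average for every unlisted member."""
    return top_urls + remaining_members * avg_per_member


remaining_urls = REMAINING_MEMBERS * ASSUMED_AVG_PER_MEMBER
total_urls = estimate_total_crawled(
    TOP_100_URLS, REMAINING_MEMBERS, ASSUMED_AVG_PER_MEMBER
)

print(f"Unlisted members: {remaining_urls:,}")   # 73,250,000
print(f"Estimated total:  {total_urls:,}")       # 423,250,000
```

The result lands close to the total quoted above, and it is dominated by the top 100 members, so the estimate is not very sensitive to the assumed average.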
There are other planning considerations too, regarding what belongs in the index. Will they be able to include the “right” domains or exclude the “wrong” domains? Will they be able to crawl some domains more or less frequently? Will video, images or other media be included?
I would be interested in knowing the game plan for developing a substantial index over time. It’s not just about numbers, although a billion or two URLs could help with a 2009 launch.