Hack Your Own Search Engine Crawler
Want to build your own customized search tool that can search the web, explore online databases, and mine virtually any other type of internet resource? Spidering Hacks shows you how.
Search engines rely on spiders (also called crawlers or web robots) to discover web pages for indexing. Spiders are one of the three fundamental technologies underlying all search engines.
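To make the idea of "discovering" pages concrete, here is a minimal sketch of the core loop a spider runs: fetch a page, pull out its links, and queue any it has not seen before. It is written in Python rather than the Perl used in the book, and it is not taken from the book. The `PAGES` dictionary, the `example.com` URLs, and the `fetch` callback are hypothetical stand-ins for real HTTP requests, so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments removed) linked from the HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first discovery. `fetch` maps a URL to HTML text, or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory (an assumption for the demo).
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

if __name__ == "__main__":
    for page in crawl("http://example.com/", PAGES.get):
        print(page)
```

A real spider would replace `PAGES.get` with a function that performs an HTTP request, and would also need error handling and the politeness measures discussed later in the review.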
Spidering Hacks, by Kevin Hemenway and Tara Calishain, offers “100 Industrial Strength Tips and Tools” for creating and running your own spiders. Among these tips and tools, of course, are instructions for creating your own personal web crawler that works much like those used by the major search engines.
But there are dozens of other “hacks” that allow you to go far beyond the simple discovery and retrieval of web pages. Among the more interesting hacks are those that allow you to combine and aggregate information from multiple resources, including invisible web databases that search engines have problems accessing.
These hacks let you build some unusual, specialized search tools. Want your own media library of audio, video, or images? Hacks 33-42 show you how. Other hacks show you how to automatically find weblogs of interest, work with Amazon’s database, aggregate results from multiple search engines… the list is wide and varied.
Like co-author Calishain’s earlier book, Google Hacks, this book is well written, and the examples use code that has already been tested. Most of these hacks require a decent understanding of the Perl programming language to use effectively, but if you’re technically inclined that’s not a major obstacle. In fact, the first few chapters serve as a respectable introduction to web programming.
Importantly, the book leads off with a chapter called “Walking Softly” that stresses the importance of using best practices and respectful coding — in other words, making sure that your hacks do their intended job without causing unintended negative consequences once your spiders are unleashed on the web.
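As an illustration of what respectful coding can look like in practice, the sketch below checks a site's robots.txt rules before fetching and pauses between requests. It is a Python example built on the standard library's `urllib.robotparser`, not code from the book (which uses Perl). The robots.txt content, the `ExampleSpider/0.1` user agent, the URLs, and the one-second fallback delay are all assumptions made for the demonstration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the target site before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def crawl_politely(urls, rules, user_agent, fetch,
                   sleep=time.sleep, default_delay=1.0):
    """Fetch only permitted URLs, pausing between consecutive requests."""
    # Use the site's requested delay if it states one (assumed fallback: 1s).
    delay = rules.crawl_delay(user_agent) or default_delay
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # the site asked robots to stay out of this path
        if results:
            sleep(delay)  # wait before every request after the first
        results[url] = fetch(url)
    return results


URLS = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
    "http://example.com/about.html",
]

if __name__ == "__main__":
    rules = build_rules(ROBOTS_TXT)
    pages = crawl_politely(
        URLS, rules, "ExampleSpider/0.1",
        fetch=lambda url: "<html>placeholder</html>",  # stand-in for HTTP
        sleep=lambda s: print(f"(would pause {s}s)"),  # avoid real waiting
    )
    for url in pages:
        print("fetched", url)
```

The `fetch` and `sleep` parameters are injectable so the sketch can be tried without network access or real delays; in actual use you would leave `sleep` at its default and supply a function that makes the HTTP request with an identifying User-Agent header.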
These introductory hacks also provide useful insight into how crawlers run by the major search engines do their job. Understanding crawler technology — even at this rudimentary level — can help improve your searching skills, by showing you both the strengths and limitations of the technology.
Personally, I found it fascinating to see the wide range of creative tasks you can accomplish with only a little programming effort. If you really get into this sort of “hacking,” you can find even more examples on O’Reilly’s Hacks web site.
Spidering Hacks
100 Industrial-Strength Tips & Tools
By Kevin Hemenway, Tara Calishain
O’Reilly, ISBN: 0-596-00577-6
424 pages, $24.95 US, $38.95 CA, £17.50 UK
We’re looking for one panelist to participate on the Cashing Out: The Preparation and Implications panel at the upcoming Search Engine Strategies conference in New York. The panel is from 3:45 – 5:15 pm on Monday, March 1. To be selected, you must have sold your company to another company, and be willing to discuss the implications of the sale to yourself, colleagues and the company’s customers. If you’re interested, please send an email with the subject “Cashing Out Panelist” to Chris Elwell no later than the end of business Friday.
Yesterday’s SearchDay omitted the URL for downloading the free NeedlePoint toolbar. You can get more information and download the toolbar by clicking here.