Hack Your Own Search Engine Crawler

Want to build your own customized search tool that can search the web, explore online databases, and mine virtually any other type of internet resource? Spidering Hacks shows you how.

Search engines rely on spiders (also called crawlers or web robots) to discover web pages for indexing. Spiders are one of the three fundamental technologies underlying all search engines.
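To make the idea of "discovering" pages concrete, here is a minimal sketch of breadth-first link discovery. It is written in Python rather than the Perl used in the book, and it is not taken from the book. The `crawl` function, the `LinkExtractor` class, and the three-page `FAKE_WEB` site are all hypothetical names invented for illustration; the fake site lives in memory so the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first discovery; returns URLs in the order they were visited.

    `fetch` is any callable mapping a URL to HTML text (or None if missing).
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory (an assumption for this sketch).
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
}

order = crawl("http://example.test/", FAKE_WEB.get)
print(order)
```

A real spider would swap `FAKE_WEB.get` for a function that downloads pages, and would hand each visited page to an indexer; the queue-and-seen-set pattern is the part that corresponds to discovery.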

Spidering Hacks, by Kevin Hemenway and Tara Calishain, offers “100 Industrial Strength Tips and Tools” for creating and running your own spiders. Among these tips and tools, of course, are instructions for creating your own personal web crawler that works much like those used by the major search engines.

But there are dozens of other “hacks” that allow you to go far beyond the simple discovery and retrieval of web pages. Among the more interesting hacks are those that allow you to combine and aggregate information from multiple resources, including invisible web databases that search engines have problems accessing.

These hacks let you build some really interesting, unique search tools. Want your own media library of audio, video, or images? Hacks 33-42 show you how. Other hacks show you how to automatically find weblogs of interest, do interesting things with Amazon’s database, aggregate multiple search engine results… the list of hacks is wide and varied.

Like co-author Calishain’s earlier book, Google Hacks, the book is well written, and the examples use code that has already been tested for your use. Most of these hacks require a decent understanding of the Perl programming language to use effectively, but if you’re technically inclined that’s not a major obstacle. In fact, the first few chapters serve as a respectable introduction to web programming.

Importantly, the book leads off with a chapter called “Walking Softly” that stresses the importance of using best practices and respectful coding — in other words, making sure that your hacks do their intended job without causing unintended negative consequences once your spiders are unleashed on the web.
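One common form of respectful coding is to honor a site's robots.txt file before fetching anything. The sketch below is an illustration in Python, not code from the book (whose examples are in Perl). It uses the standard library's `urllib.robotparser`; the user-agent name `ExampleSpider`, the sample robots.txt rules, and the `polite_fetch_plan` helper are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"  # hypothetical spider name

# Sample rules (an assumption); a real spider would download
# the site's own /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def polite_fetch_plan(urls, default_delay=1.0):
    """Return the URLs the rules permit and the pause to use between requests."""
    # Use the site's requested delay if it states one, else our own default.
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    allowed = [u for u in urls if rules.can_fetch(USER_AGENT, u)]
    return allowed, delay


candidates = [
    "http://example.test/index.html",
    "http://example.test/private/report.html",
]
allowed, delay = polite_fetch_plan(candidates)
print(allowed, delay)
```

A spider built on this would sleep for `delay` seconds between requests to the same host and skip anything not in `allowed`.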

These introductory hacks also provide useful insight into how crawlers run by the major search engines do their job. Understanding crawler technology — even at this rudimentary level — can help improve your searching skills, by showing you both the strengths and limitations of the technology.

Personally, I found it fascinating to see the wide range of creative tasks you can accomplish with only a little programming effort. If you really get into this sort of “hacking,” you can find even more examples on O’Reilly’s Hacks web site.

Spidering Hacks
100 Industrial-Strength Tips & Tools
By Kevin Hemenway, Tara Calishain
O’Reilly, ISBN: 0-596-00577-6
424 pages, $24.95 US, $38.95 CA, £17.50 UK

Search Engine Strategies Speaking Opportunity

We’re looking for one panelist to participate on the Cashing Out: The Preparation and Implications panel at the upcoming Search Engine Strategies conference in New York. The panel is from 3:45 – 5:15 pm on Monday, March 1. To be selected, you must have sold your company to another company, and be willing to discuss the implications of the sale to yourself, colleagues and the company’s customers. If you’re interested, please send an email with the subject “Cashing Out Panelist” to Chris Elwell no later than the end of business Friday.

NeedlePoint Toolbar Download

Yesterday’s SearchDay omitted the URL for downloading the free NeedlePoint toolbar. You can get more information and download the toolbar by clicking here.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication’s search facility, which most have, and search for the headline.

Search Engine Keyphrases and the Power of the Modifier…
Search Engine Guide Feb 4 2004 12:01PM GMT
Preserving peer-to-peer networks is essential…
SiliconValley.com Feb 4 2004 12:00PM GMT
P2P vs RIAA heads back to court…
vnunet.com Feb 4 2004 11:49AM GMT
How Many Pop-ups Can a Pop-up Stopper Stop?…
Internet.com Feb 4 2004 0:44AM GMT
Spam haters are shopping less online, says a new consumer group survey…
InternetRetailer.com Feb 3 2004 11:02PM GMT
Yahoo composing music download plan…
CNET Feb 3 2004 9:42PM GMT
Is Ad-Supported RSS the Next Big Thing?…
Internet News Feb 3 2004 8:56PM GMT
Morpheus upgrade aims for P2P unity…
CNET Feb 3 2004 3:26PM GMT
Google crowned highest-impact brand…
CNET Feb 3 2004 2:36PM GMT
Chinese Internet portal Sohu reports profit in 2003…
SiliconValley.com Feb 3 2004 2:08PM GMT
Google plans Swiss R&D centre, taps European skill…
Forbes Feb 3 2004 1:29PM GMT
Major ISPs Ponder ‘Postage’ To Stem Spam…
Internet News Feb 3 2004 1:04PM GMT
Study Shows Web Searches Getting More Complex…
dmnews.com Feb 3 2004 7:52AM GMT
powered by Moreover.com
