Hack Your Own Search Engine Crawler

Author: Chris Sherman

Date published: February 4, 2004

Categories: Industry

Want to build your own customized search tool that can search the web, explore online databases, and mine virtually any other type of internet resource? Spidering Hacks shows you how.

Search engines rely on spiders (also called crawlers or web robots) to discover web pages for indexing. Spiders are one of the three fundamental technologies underlying all search engines.
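To make the discovery step concrete, here is a minimal sketch of how a spider might pull links out of a page it has already fetched. It is written in Python rather than the Perl used in the book, and it is not taken from Spidering Hacks. The names `LinkExtractor`, `extract_links`, and `SAMPLE_PAGE`, and the example.com URLs, are assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs so they can be queued.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from one page of HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would download this first.
SAMPLE_PAGE = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://example.org/">Elsewhere</a></p>'
)

if __name__ == "__main__":
    for url in extract_links(SAMPLE_PAGE, "http://example.com/index.html"):
        print(url)
```

A real crawler would repeat this for each newly found URL, keeping track of pages it has already visited so it does not fetch them twice.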

Spidering Hacks, by Kevin Hemenway and Tara Calishain, offers “100 Industrial-Strength Tips & Tools” for creating and running your own spiders. Among these tips and tools, of course, are instructions for creating your own personal web crawler that works much like those used by the major search engines.

But there are dozens of other “hacks” that allow you to go far beyond the simple discovery and retrieval of web pages. Among the more interesting hacks are those that allow you to combine and aggregate information from multiple resources, including invisible web databases that search engines have problems accessing.

These hacks let you build some really interesting, unique search tools. Want your own media library of audio, video, or images? Hacks 33-42 show you how. Other hacks show you how to automatically find weblogs of interest, do interesting things with Amazon’s database, aggregate multiple search engine results… the list of hacks is wide and varied.

Like co-author Calishain’s earlier book, Google Hacks, the book is well written, and the examples use code that has already been tested for your use. Most of these hacks require a decent understanding of the Perl programming language to use effectively, but if you’re technically inclined that’s not a major obstacle. In fact, the first few chapters serve as a respectable introduction to web programming.

Importantly, the book leads off with a chapter called “Walking Softly” that stresses the importance of using best practices and respectful coding — in other words, making sure that your hacks do their intended job without causing unintended negative consequences once your spiders are unleashed on the web.
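One common piece of respectful spidering is checking a site's robots.txt before fetching a page. The sketch below shows that idea using Python's standard `urllib.robotparser` module. It is an illustration under stated assumptions, not code from the book: the robots.txt contents, the `example-spider` agent name, and the `build_rules` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real spider would download the
# file from the site it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user-agent name for this sketch.
AGENT = "example-spider"


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


if __name__ == "__main__":
    rules = build_rules(ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        verdict = "allowed" if rules.can_fetch(AGENT, url) else "blocked"
        print(url, verdict)
    # Seconds the site asks spiders to wait between requests, if stated.
    print("crawl delay:", rules.crawl_delay(AGENT))
```

A spider built this way would skip any blocked URL and pause for the requested delay between requests to the same site.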

These introductory hacks also provide useful insight into how crawlers run by the major search engines do their job. Understanding crawler technology — even at this rudimentary level — can help improve your searching skills, by showing you both the strengths and limitations of the technology.

Personally, I found it fascinating to see the wide range of creative tasks you can accomplish with only a little programming effort. If you really get into this sort of “hacking,” you can get even more examples from O’Reilly’s Hacks web site.

Spidering Hacks
100 Industrial-Strength Tips & Tools
By Kevin Hemenway, Tara Calishain
O’Reilly, ISBN: 0-596-00577-6
424 pages, $24.95 US, $38.95 CA, £17.50 UK

Search Engine Strategies Speaking Opportunity

We’re looking for one panelist to participate on the Cashing Out: The Preparation and Implications panel at the upcoming Search Engine Strategies conference in New York. The panel is from 3:45 – 5:15 pm on Monday, March 1. To be selected, you must have sold your company to another company, and be willing to discuss the implications of the sale to yourself, colleagues and the company’s customers. If you’re interested, please send an email with the subject “Cashing Out Panelist” to Chris Elwell no later than the end of business Friday.

NeedlePoint Toolbar Download

Yesterday’s SearchDay omitted the URL for downloading the free NeedlePoint toolbar. You can get more information and download the toolbar by clicking here.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication’s search facility, which most have, and search for the headline.

Search Engine Keyphrases and the Power of the Modifier…
Search Engine Guide Feb 4 2004 12:01PM GMT

Preserving peer-to-peer networks is essential…
SiliconValley.com Feb 4 2004 12:00PM GMT

P2P vs RIAA heads back to court…
vnunet.com Feb 4 2004 11:49AM GMT

How Many Pop-ups Can a Pop-up Stopper Stop?…
Internet.com Feb 4 2004 0:44AM GMT

Spam haters are shopping less online, says a new consumer group survey…
InternetRetailer.com Feb 3 2004 11:02PM GMT

Yahoo composing music download plan…
CNET Feb 3 2004 9:42PM GMT

Is Ad-Supported RSS the Next Big Thing?…
Internet News Feb 3 2004 8:56PM GMT

Morpheus upgrade aims for P2P unity…
CNET Feb 3 2004 3:26PM GMT

Google crowned highest-impact brand…
CNET Feb 3 2004 2:36PM GMT

Chinese Internet portal Sohu reports profit in 2003…
SiliconValley.com Feb 3 2004 2:08PM GMT

Google plans Swiss R&D centre, taps European skill…
Forbes Feb 3 2004 1:29PM GMT

Major ISPs Ponder ‘Postage’ To Stem Spam…
Internet News Feb 3 2004 1:04PM GMT

Study Shows Web Searches Getting More Complex…
dmnews.com Feb 3 2004 7:52AM GMT
