View Full Version : Create custom search engine spider?


adamclark
12-01-2004, 06:39 AM
Hi,

I'm toying with the idea of creating my own search engine spider. All I'm interested in using it for is checking my client sites for spider barriers, so I'm not interested in indexing huge volumes of results. I just want to know whether the spider can get through the site and, if not, where it fails. I know that there are a number of search engine spider simulator sites out there, but I guess I just want more :)

Is there a program already out there that can be used? I'm really only looking for something to use in house, so anything open source would be great. (Or, if there is a tutorial out there on how to build your own search engine crawler, even better.)
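For illustration, here is a minimal Python sketch of the kind of in-house check described above: crawl one site breadth-first and record which URLs could not be fetched and which page linked to them. It is only a starting point under stated assumptions, not a full spider simulator. The names `crawl`, `http_fetch` and `LinkParser` are made up for this example, it follows only plain `<a href>` links on the same host, and it ignores robots.txt, JavaScript, frames and redirect details.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag
import urllib.error
import urllib.request


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url):
    """Return (status, html). Errors are reported as a status, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        return exc.code, ""
    except (urllib.error.URLError, OSError):
        return None, ""  # DNS failure, timeout, refused connection, ...


def crawl(start_url, fetch=http_fetch, max_pages=200):
    """Breadth-first crawl of one host.

    Returns (reached, failures), where failures is a list of
    (url, status, referrer) tuples showing where the crawl broke down.
    """
    host = urlparse(start_url).netloc
    queue = deque([(start_url, None)])
    seen = {start_url}
    reached, failures = [], []
    while queue and len(reached) + len(failures) < max_pages:
        url, referrer = queue.popleft()
        status, html = fetch(url)
        if status != 200:
            failures.append((url, status, referrer))
            continue
        reached.append(url)
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, url))
    return reached, failures
```

The `fetch` parameter is there so the crawl logic can be tried against a canned set of pages before it is pointed at a real client site.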

Thanks

Adam

Mikkel deMib Svendsen
12-01-2004, 07:33 AM
The problem is not so much building a simple spider. There are plenty of decent spiders around and also some Open Source projects, I believe. However, turning one into a tool that can detect most indexing barriers is not so easy. There are so many things to check for, and some of them are pretty complex if you want to automate them.

I have found that a combination of manual testing and a few different tools to check specific things (such as grabbing all titles and META-tags off an entire site) works the best.
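As a rough illustration of the "grab all titles and META tags" kind of tool, here is a small Python sketch that pulls the title and named META tags out of one page's HTML. It is not any particular existing tool. `HeadTagParser` and `summarize_page` are names invented for this example, and it assumes the HTML has already been downloaded by some other means.

```python
from html.parser import HTMLParser


class HeadTagParser(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Lower-case the name so "Description" and "description" match.
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_page(html):
    """Return the title and named META tags found in one HTML document."""
    parser = HeadTagParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}
```

Run over every page of a site, the results make missing or duplicated titles and stray robots META tags easy to spot by eye.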

adamclark
12-01-2004, 08:35 AM
Hi Mikkel,

Thanks for the reply. I guess at this point I'm really after a starting block to grow from. I have seen a raft of poorly designed, spider-incapable client sites recently. Rather than trawl through every page of every site, it would really help if I could construct a search engine spider to do a crawl and point out problems such as security issues that bar crawling, spider traps, etc.

From your post are you implying that it may just be too much hassle for the benefits it would provide?

Just out of interest, you mentioned that there were plenty of decent spiders out there, including some open source ones, but I have only really managed to come across one. Would you be kind enough to name a couple that I could at least look at?

Thanks again

Adam

orion
12-01-2004, 04:01 PM
I agree with Mikkel. You don’t need a fancy crawler. This is more about security strategies than about designing a crawler. Check the Robots.txt & Security Issues (http://forums.searchenginewatch.com/showthread.php?t=2786) thread.

Use a reverse crawling approach.

1. Check the robots.txt file of a target web property. This should give you a crude idea of their root tree, levels and architecture.

2. Discover secondary and hidden paths as described in the above thread.

Often, not always, secondary paths lead you to other paths not blocked or exposed to crawls. Often these are paths forgotten or intentionally left on the web by webmasters. These could also be paths appended during upgrades or by new applications, toolbars, plug-ins, etc. See the above thread for details.

Use the discovered paths as seeds for your crawler. If you can grab links in this way, chances are you could grab more links from upper levels of the architecture. Bingo!
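Step 1 and the seeding idea can be sketched in Python roughly as follows, for a client site you are checking. This is only an assumption-laden illustration: `robots_paths` and `seed_urls` are hypothetical names, the robots.txt text is assumed to be downloaded already, and the discovery of secondary and hidden paths from step 2 is not covered here.

```python
from urllib.parse import urljoin


def robots_paths(robots_txt):
    """List the paths named in Allow/Disallow lines, in order, without repeats."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        if field.strip().lower() in ("disallow", "allow") and value and value not in paths:
            paths.append(value)
    return paths


def seed_urls(base_url, robots_txt):
    """Turn concrete robots.txt paths into absolute seed URLs.

    Wildcard patterns (* and $) are skipped because they are not real paths.
    """
    return [
        urljoin(base_url, path)
        for path in robots_paths(robots_txt)
        if "*" not in path and "$" not in path
    ]
```

The resulting list gives the crude picture of the directory tree mentioned in step 1, and each URL in it can be handed to a crawler as a starting point.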

This “backward crawling” (you start at a low-level path and end at an upper level) is well known to hackers. However, there are many ways of preventing backward crawls. Unfortunately, not all webmasters are that savvy or really care about security issues.

Orion