Search Engine Watch
Old 12-11-2009   #1
sitetruth
Member
 
Join Date: Feb 2008
Posts: 138
Technical question: know any sites with restrictive "robots.txt" files?

For testing purposes, I'd like the URLs of some sites whose very restrictive "robots.txt" files disallow crawling of their home page by almost any crawler. Such pages are unindexed, of course, so you can't find them by searching.

(I'm working on the SiteTruth web crawler, and need some real-world test cases. It obeys robots.txt, but takes it so literally that it lists blocked sites as nonexistent.)
Old 12-31-2009   #2
NewKidOnTheBlock
Member
 
Join Date: Oct 2006
Location: Germany
Posts: 563
Re: Technical question: know any sites with restrictive "robots.txt" files?

Maybe try searching for related terms, such as people complaining that they can't view the home page of a certain website?
Old 08-12-2010   #3
BrianCosgrove
Hire the agency I work for, TPG Direct!
 
Join Date: Mar 2007
Location: Philadelphia
Posts: 85
Re: Technical question: know any sites with restrictive "robots.txt" files?

Just looking for a follow-up. How did your testing and work turn out?
Old 08-12-2010   #4
sitetruth
Member
 
Join Date: Feb 2008
Posts: 138
Re: Technical question: know any sites with restrictive "robots.txt" files?

Fixed that months ago.

There are a few interesting cases. One is where "example.com" and "www.example.com" have the same content, but the site operator wants only one of them crawled. Sometimes "www.example.com" redirects to "example.com"; sometimes the redirect goes the other way. Sometimes one has a restrictive "robots.txt" file and the other doesn't. We've also seen "robots.txt" files that are themselves redirected, raising the question of which site the "robots.txt" file controls.

Our current solution is to read the HEAD part of the site's home page regardless of "robots.txt", looking only for redirects. If we get a redirect, then we know which site to examine. Other than that, we strictly obey "robots.txt".
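
For anyone who wants to try the same idea, here is a rough Python sketch of the redirect check. It is not SiteTruth's code. It assumes the check is an HTTP HEAD request on the home page and that only the Location header matters; the names resolve_site and http_head, the hop limit, and the plain "http://" starting URL are all made up for illustration.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)

def http_head(url, timeout=10):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def resolve_site(host, head_request=http_head, max_hops=5):
    """Follow home-page redirects and return the host that should be examined.

    head_request is injectable so the logic can be exercised without a network.
    """
    url = f"http://{host}/"
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            break  # redirect loop: stop where we are
        seen.add(url)
        status, location = head_request(url)
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            break
    return urlsplit(url).hostname

# Offline example: "www.example.com" redirects to "example.com".
fake_responses = {
    "http://www.example.com/": (301, "http://example.com/"),
    "http://example.com/": (200, None),
}
canonical = resolve_site("www.example.com", lambda u: fake_responses[u])
```

Once the host is known, its "robots.txt" is the one to fetch and obey for everything else.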

We rate some sites as "blocked", because their "robots.txt" file doesn't allow us to read any pages. We also down-rate some sites because their "robots.txt" file keeps us from reading the information that identifies the business behind the site. (So don't block your "about" page with "robots.txt".)
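
A rough sketch of that rating step, using Python's standard urllib.robotparser. Again, this is not the actual SiteTruth logic: the function name rate_site, the "ExampleBot" user agent, the "/about" path, and the three labels are assumptions, and it treats a disallowed home page as a stand-in for "can't read any pages".

```python
from urllib.robotparser import RobotFileParser

def rate_site(robots_txt, host, user_agent="ExampleBot", about_paths=("/about",)):
    """Classify a site from the text of its robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    base = f"http://{host}"
    # Simplification: a blocked home page is treated as a blocked site.
    if not parser.can_fetch(user_agent, base + "/"):
        return "blocked"
    # Pages that identify the business must be readable too.
    if any(not parser.can_fetch(user_agent, base + path) for path in about_paths):
        return "down-rated"
    return "ok"

blocked = rate_site("User-agent: *\nDisallow: /", "example.com")
hidden_about = rate_site("User-agent: *\nDisallow: /about", "example.com")
open_site = rate_site("User-agent: *\nDisallow:", "example.com")
```

Here blocked comes out as "blocked", hidden_about as "down-rated", and open_site as "ok".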

The search engine behind Google Maps seems to look much more closely at site ownership information than the main Google web search engine does. Try doing some business searches in Google Maps. Google may be testing some new strategies on the Maps side before deploying them on the main engine.