Google fails to abide by robots.txt
PhilC
09-16-2005, 06:08 PM
I have a site that was penalised almost 2 months ago. It had some grey bits in it which I cleaned out completely. Part of the clean-up was to disallow Googlebot from indexing the content of a particular sub-directory, and the modified robots.txt file was uploaded on the 8th of August. It is now the 16th of September (5 weeks later), and Googlebot has fetched the robots.txt file many times since then, so all the pages that they had indexed from that sub-directory should have been long gone from the index by now.
The penalty caused the claimed ~60k pages in the index to steadily decrease until ~700 were left, and all in the normal index. Almost all of them were from the denied sub-directory. The next time I looked, maybe a week later, there were no pages left in the normal index, but there were 5,900 in the Supplemental index for a site: search with "www.", and 17,800 in the Supplemental index for a non-www. site: search. And they have stayed around that number for a while.
There shouldn't be any of those pages in any index. Google's site states that the way to have parts of a site removed from the index is to use the robots.txt file, but they ignore that file for the Supplemental index, and yet they do use Supplemental pages for obscure search terms. That's not on.
projectphp
09-16-2005, 09:47 PM
Phil, robots.txt says "Do not download these pages". It says nothing about indexing. Read Alan Perkins' excellent article on the difference between robots.txt and the robots meta tag for the semantic difference. Google is doing nothing wrong, IMHO, as you have not shown that they actually downloaded said pages; they just haven't removed them. If they only remove pages after the time has come to recrawl them, then pages may stay in the index for significant periods of time.
Oh, and to remove them absolutely, use this: http://services.google.com:8882/urlconsole/controller?cmd=reload&lastcmd=login
It is a tool that will, within 24 hours, remove pages based upon a robots.txt file. I have used it several times, and each time it has worked a treat! This works because it is active, rather than passively waiting to crawl a page, only to find it is disallowed, and then removing it.
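The download-versus-index distinction described above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical robots.txt and a made-up /private/ directory on www.example.com (neither comes from the thread). It uses Python's standard urllib.robotparser to answer only the question robots.txt addresses: may this user agent download this URL? It says nothing about whether a URL is, or stays, in any index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /private/ directory name is an assumption
# made for illustration only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to download url.

    This models a crawl (download) permission only. It does not model
    whether a search engine keeps or drops the URL from its index.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCKED_URL = "http://www.example.com/private/page1.html"
OPEN_URL = "http://www.example.com/index.html"
```

Calling may_fetch(ROBOTS_TXT, "Googlebot", BLOCKED_URL) reports that the download is refused, while the same call for OPEN_URL, or for a user agent the file does not name, reports that it is allowed.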
PhilC
09-16-2005, 10:57 PM
Huh. All these years I'd understood it the other way round - that robots.txt disallows files from being indexed, but not from being downloaded. We live and learn - thanks :)
I used the removal form. It states 5 days, but that doesn't matter.
It's possible that the site's thousands of pages in the index would never have been removed because I don't think they would come up for spidering again, but I may be wrong about that as well.
projectphp
09-17-2005, 05:56 AM
It says 5 days now? I am sure the first time I used it it said 24 hours, and then 48 hours...
That tool is something I wish all the Crawler based engines had. Would really help when files are "accidentally" indexed, or in the case of legal disputes (the "take my page down or feel the Wrath of Lawyer" stuff).
PhilC
09-17-2005, 08:34 AM
It's fortunate that all the pages are dynamic. Imagine having to add the necessary meta tag to thousands of different pages :(
I just had a thought. My robots.txt file still bans Googlebot from getting any of the pages. Google's instructions don't suggest that it should be changed to allow the pages to be accessed, so that the meta tags can be seen. Any thoughts on that?
I've removed the disallow line, just in case.
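The concern in this post, that a robots meta tag cannot be read on a page the crawler is barred from downloading, can be sketched as follows. This is only a model of a well-behaved crawler under stated assumptions, not a description of how Google actually processes pages. The names RobotsMetaFinder and crawler_sees_noindex, the sample robots.txt, the URL and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",")
            )


def crawler_sees_noindex(robots_txt, user_agent, url, html):
    """Return True only if the page may be downloaded AND carries noindex.

    Assumption: a compliant crawler never reads the HTML of a URL that
    robots.txt disallows, so any meta tag on that page goes unseen.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return False  # page is never downloaded, so the tag is never read
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical inputs for illustration.
BLOCKING_ROBOTS = "User-agent: Googlebot\nDisallow: /private/\n"
PAGE_URL = "http://www.example.com/private/page1.html"
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body>x</body></html>"
)
```

With BLOCKING_ROBOTS in place the function returns False for PAGE_URL even though the page carries the tag; with an empty robots.txt it returns True. In this model, that is the case for removing the disallow line before relying on the meta tag.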
PhilC
09-17-2005, 08:46 AM
It says 5 days now? I am sure the first time I used it it said 24 hours, and then 48 hours...
That tool is something I wish all the Crawler based engines had. Would really help when files are "accidentally" indexed, or in the case of legal disputes (the "take my page down or feel the Wrath of Lawyer" stuff).
Yep - "within 5 days".
There was a legal dispute discussed here recently, where the lawyer was insisting that a snippet was removed/changed from the search results. It can be done with this tool, but nobody suggested it in the thread.
projectphp
09-17-2005, 09:41 AM
It can be done with this tool, but nobody suggested it in the thread.
Good point! Ppl are usually obsessed with ranking and getting indexed, not getting stuff removed, so I often forget that tool is there, and exactly what it does. As I said, a VERY helpful set of tools, and one that, IMHO, could and should be expanded upon (with some of the notification stuff one would hope...)
PhilC
09-17-2005, 12:44 PM
... could and should be expanded upon (with some of the notification stuff one would hope...)
Now that would be a really nice addition.
PhilC
09-18-2005, 10:23 PM
They removed all the pages from the Supplemental index, and it took 1 day. The only problem is that they removed the few pages that weren't in the Supplemental index, and that weren't requested to be removed. Strange. Still, I can now work towards getting the site back in the index.