robots.txt and Google's supplemental index


kool aussie
06-21-2005, 01:59 AM
Does anyone know a way to stop Google adding pages to its supplemental index?

Google is adding pages from directories which have been specifically disallowed via my robots.txt file to its supplemental index, and it now says I have over 80,000 pages in my site (site:domain.com) when it should be only about half that.

In many cases these pages are database-generated duplicates of pages which no longer exist on my site but are still showing in Google's supplemental index.

I suspect this may be causing a 'duplicate page' penalty for the rest of my site, as it suffered a major drop in the Bourbon update.

A "Disallow: /directory/" line in robots.txt should mean DO NOT index at all, not add to the supplemental index.

Chris_D
06-21-2005, 03:27 AM
Hi Kool Aussie, and welcome to the SEW forums!

Using a robots.txt file won't stop a URL being displayed in a search engine index. In practice, it only stops the content of the page being indexed, i.e. the engines still pick up the URL (for example from links pointing to it) and list it. They just don't fetch and parse the page contents.

Try it - you should find that no pages come up for a search on a unique content string from a page which is listed in the robots.txt exclusion. The 'other' pages should come up - but not the pages restricted by robots.txt.
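
To see what the robots.txt rule itself controls, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration, modelled on the "Disallow: /directory/" rule mentioned earlier in the thread. It only answers "may a compliant robot fetch this URL?" and says nothing about whether the URL gets listed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, modelled on the "Disallow: /directory/" rule
# quoted in the thread. Not the poster's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def may_crawl(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a compliant robot is allowed to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example URLs are placeholders on example.com.
for url in ("http://www.example.com/directory/page.html",
            "http://www.example.com/other/page.html"):
    verdict = "may be fetched" if may_crawl(ROBOTS_TXT, url) else "blocked from fetching"
    print(url, "->", verdict)
```

A "blocked from fetching" result here is exactly the case where the URL can still show up as a bare listing, because the rule stops the crawl, not the listing.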

If the pages no longer exist - i.e. return a 404 - you can remove them: http://www.google.com/remove.html

If you don't want pages indexed or listed, a meta robots 'noindex' tag on the template for the dynamically generated pages you don't want indexed is often a better option - but it often means much more work.

BTW - you can't usefully combine robots.txt and meta robots 'noindex' on the same pages - a compliant robot following the robots.txt directive won't fetch the restricted pages (and therefore won't read the meta robots 'noindex' tag...), so the page will still end up in the index as a 'URL only' result.
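
The interaction can be sketched in a few lines of Python. This is only a toy model of a compliant robot, not how any search engine is actually implemented: the function name, the three result labels, the robots.txt content and the sample page are all assumptions made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical inputs for the illustration.
ROBOTS_TXT = "User-agent: *\nDisallow: /directory/\n"
NOINDEX_PAGE = ('<html><head><meta name="robots" content="noindex, follow">'
                '</head><body>old page</body></html>')


class MetaRobotsFinder(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def outcome(robots_txt, url, html, user_agent="Googlebot"):
    """Toy model: what a compliant robot could end up doing with this URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # The page is never fetched, so any noindex tag in it is never read.
        return "url-only"
    finder = MetaRobotsFinder()
    finder.feed(html)
    return "noindex" if "noindex" in finder.directives else "indexed"
```

With these inputs, a URL under /directory/ comes back as "url-only" even though the page carries a noindex tag, while the same page with an empty robots.txt comes back as "noindex". That is the reason to pick one mechanism per page rather than both.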

kool aussie
06-21-2005, 03:56 AM
thanks... that clears things up a bit.

Any idea how long it should take for Google to naturally drop old pages from its supplemental index? These pages haven't existed for about 6 weeks, and Google did a deep crawl of my site a couple of weeks ago, but the old nonexistent pages still show up with a description as supplemental results and are probably causing a duplicate penalty.

Unfortunately it would take a while for me to remove several thousand pages by hand using Google's manual removal tool.

Chris_D
06-21-2005, 05:20 AM
Are the old pages definitely returning a 404 HTTP header? Check the returned header to make sure that it is a 404, and not e.g. a 200 OK HTTP header with a 'custom' page that says the page can't be found.
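
One way to check this without a browser is a short Python script using the standard urllib module. This is a rough sketch: the function names are invented for the example, and the "not found" phrase test is only a crude guess at what a custom error page might say, so adjust it to whatever your own error page contains.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10.0):
    """Return (status_code, body_text) for a URL, including error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code and body are on the exception.
        return err.code, err.read().decode("utf-8", "replace")


def classify(status, body_text=""):
    """Label a response as a real 404, a likely 'soft' 404, or something else."""
    if status == 404:
        return "real 404"
    # Assumption: a custom error page served with 200 OK mentions "not found".
    if status == 200 and "not found" in body_text.lower():
        return "soft 404"
    return "other (%d)" % status


# Usage (placeholder URL, replace with one of the removed pages):
# status, body = fetch("http://www.example.com/directory/old-page.html")
# print(classify(status, body))
```

Note that urlopen follows redirects, so a removed page that redirects to the home page will show up as "other (200)" rather than as a 404.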

If the pages are returning a 404, Google will keep trying for a while - if it keeps getting a 404, it will 'eventually' delist them - usually in weeks rather than days...

kool aussie
06-21-2005, 08:50 AM
Thanks Chris, definitely 404.
Guess I'll just have to be patient and hope they get dropped soon.

AussieWebmaster
06-21-2005, 11:45 AM
Quote from Chris_D: "BTW - you can't use both robots.txt and meta robots 'noindex' - as a compliant robot, following the robots.txt directive - won't parse the restricted pages (and therefore won't read the meta robots 'noindex' tag...) so it will still end up in the index as a 'url only' result."
Kool Aussie, welcome aboard. I am definitely jealous of your location... my family has a place on the beach in Southport but I am in NYC.

The above is the biggest tip to keep in mind when doing no index etc. Many people miss this and waste a lot of time until someone puts them wise.