Search Engine Watch
Old 02-19-2006   #1
critter
Underpromise; Overdeliver
 
Join Date: Jun 2004
Location: London, UK
Posts: 286
What's The Point of A Robots.txt File If Google Ignores It?

Hello All,

A few days ago, Rand Fishkin of SEOMOZ pointed out how Google and other engines often ignore the robots.txt files we place in the root of our sites.

My question today revolves around this same issue.

I noticed today that Google is indexing my images folder, even though I explicitly block ALL SEARCH ENGINE SPIDERS from that folder for various reasons. I have had this robots.txt file in the root of my site since the day it launched, and I am quite annoyed and frustrated with Google for ignoring it and indexing the contents of the folder anyway.

First of all, I am curious whether this might amount to a violation of copyright law, as some of the images in the folder are the website owner's copyrighted material and he does not want them displayed in Google Images.

Furthermore, does anyone have any idea why Google continues to do this, and how one can actually prevent the spiders from indexing the folders specified in the robots.txt file?

Cheers

Critter
Old 02-19-2006   #2
Papadoc
Member
 
Join Date: Jun 2004
Posts: 79
They obey every one that I construct. Perhaps take another look at yours and make sure that you have done it properly: www.robotstxt.org
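
One way to act on that suggestion is to run the rules through a parser before blaming the crawler. The sketch below uses Python's standard `urllib.robotparser`; the file contents and the example.com URLs are assumptions for illustration, since the actual robots.txt was never posted in this thread. It only reports how this particular parser reads the rules, which is not necessarily how Googlebot reads them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file from the thread was not posted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/images/logo.gif",
                "http://www.example.com/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

If the images URL comes back as allowed, the problem is in the file (a typo, a missing trailing slash, the wrong location) rather than in the crawler.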
Old 02-19-2006   #3
critter
Underpromise; Overdeliver
 
Join Date: Jun 2004
Location: London, UK
Posts: 286
Quote:
Originally Posted by Papadoc
They obey every one that I construct. Perhaps take another look at yours and make sure that you have done it properly: www.robotstxt.org
Not to be too blunt, but I've been in the SEO industry for over 5 years now and, along with many SEO professionals, I seem to encounter this quite often.

My robots.txt file is perfect. It's just amazing how Google does what it wants.

Critter
Old 02-20-2006   #4
Wail
Another member
 
Join Date: Jun 2004
Posts: 247
Google doesn't check the robots.txt file every day. If you change it on them, it can take some time for them to notice.

Also, are you sure it's actually Google on your site and not a user-agent spoofer?
Old 02-20-2006   #5
AussieWebmaster
Forums Editor, SearchEngineWatch
 
Join Date: Jun 2004
Location: NYC
Posts: 8,153
I have to agree that Googlebot has a tendency to ignore the robots.txt file... and when you are trying to save bandwidth and stop it from going through image directories etc., it gets hard to have that exclusion included in a head tag.
Old 02-20-2006   #6
critter
Underpromise; Overdeliver
 
Join Date: Jun 2004
Location: London, UK
Posts: 286
Quote:
Originally Posted by AussieWebmaster
I have to agree that Googlebot has a tendency to ignore the robots.txt file... and when you are trying to save bandwidth and stop it from going through image directories etc., it gets hard to have that exclusion included in a head tag.
Agreed!

They are indexing my Flash files, which I don't want.

:-(

Critter
Old 02-20-2006   #7
Alan Perkins
Member
 
Join Date: Jun 2004
Location: UK
Posts: 155
robots.txt prevents compliant robots from reading content at the URL you specify. It does not, in and of itself, stop the URL being indexed - just the content at that URL.

Google supports the indexing of URLs without content. So you will sometimes see results in SERPs that contain no title, no snippet, no cache, no date, no size ... just a link to a URL. robots.txt may prevent the content at these URLs from being read, but it does not prevent the URLs being indexed.

In practice, Google will remove a protected URL shortly after it tries to retrieve the content at that URL and is prevented from doing so by robots.txt.

Googlebot is fully in compliance with the robots.txt protocol when behaving like this. Robots.txt actually has nothing to do with stopping search engine robots from indexing URLs, and everything to do with preventing robots of all kinds (search engine or otherwise) from reading content. HTH.
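
The crawl-versus-index distinction described here can be pictured with a small sketch. It assumes a toy crawler that has discovered some links and must decide what to do with each one; `plan_crawl` and the example URLs are hypothetical names made up for illustration, and nothing here is claimed to be how Google's pipeline is actually built.

```python
from urllib.robotparser import RobotFileParser


def plan_crawl(robots_txt, user_agent, discovered_urls):
    """Split discovered URLs into ones whose content may be fetched and
    ones that can only be known by URL (for example, from inbound links)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    fetchable, url_only = [], []
    for url in discovered_urls:
        if parser.can_fetch(user_agent, url):
            fetchable.append(url)   # content may be downloaded and indexed
        else:
            url_only.append(url)    # content is off limits, but the URL is still known
    return fetchable, url_only


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    links = ["http://www.example.com/",
             "http://www.example.com/private/report.html"]
    print(plan_crawl(rules, "Googlebot", links))
```

The second list is where the title-less, snippet-less results come from: a compliant robot never read those pages, yet it still knows the URLs exist.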
Old 02-20-2006   #8
projectphp
What The World, Needs Now, Is Love, Sweet Love
 
Join Date: Jun 2004
Location: Sydney, Australia
Posts: 449
To take Alan's comments a step further:

robots.txt says "don't download content that matches this", and a robots meta tag says "don't have this content in your index". I think we need to be clear on whether Google shows a link to a file, or actually comes and downloads the file. A robots.txt disallow makes the former acceptable, but not the latter.

Google does have, however, a tool that allows you to delete URLs you don't want indexed by reusing your robots.txt file: http://services.google.com:8882/urlc...&lastcmd=login (may have to hit refresh).
Old 02-21-2006   #9
smallfry
 
Posts: n/a
Quote:
Originally Posted by Wail
Google doesn't check the robots.txt file every day. If you change it on them, it can take some time for them to notice.

Also, are you sure it's actually Google on your site and not a user-agent spoofer?
How could you tell the difference? Most of us just assume that if cPanel says it's Google, then it is Google. If I were to guess from the little website I play with, I would say Google might come more often than once a day.
"Googlebot 419+37 15.94 MB 21 Feb 2006 - 10:44"
What determines how often Googlebot shows up anyway? Is this spoofed?
Old 02-21-2006   #10
BradBristol
Has-Been "SEO Expert"
 
Join Date: Jan 2006
Posts: 107
Quote:
How could you tell the difference?
Try resolving the IP.
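
A minimal sketch of what "resolving the IP" can look like: do a reverse DNS lookup on the visiting address, check that the hostname sits under a Google crawler domain, then resolve that hostname forward and confirm it maps back to the same address. The function name and the injectable lookup parameters are assumptions added for illustration and testing; the sketch handles IPv4 only and treats any lookup failure as "not verified".

```python
import socket
import sys

# Hostname suffixes accepted as Google crawler hosts in this sketch.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_probably_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the domain, then forward-confirm the hostname."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record or DNS failure
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # hostname is not under a Google crawler domain
    try:
        # Anyone can fake a PTR record for their own IP range, so the
        # hostname must also resolve back to the original address.
        return ip in forward_lookup(host)
    except OSError:
        return False


if __name__ == "__main__":
    for address in sys.argv[1:]:
        print(address, is_probably_googlebot(address))
```

A visitor that only sends a Googlebot user-agent string fails this check, because the user-agent header never enters into it.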
Old 02-22-2006   #11
azhar5i
Member
 
Join Date: Feb 2006
Posts: 5
I still think it works

You guys are talking a lot about robots.txt files and speaking ill of them, but I still think it works for me. It has been 3 months since I launched my robots.txt file and it is working very well, as Googlebot hasn't crawled the forbidden directories at all.

Old 02-22-2006   #12
BradBristol
Has-Been "SEO Expert"
 
Join Date: Jan 2006
Posts: 107
Quote:
Originally Posted by azhar5i
...Googlebot hasn't crawled the forbidden directories at all...
WRONG!

robots.txt does not restrict access. A robots.txt file can be ignored by bots, and it very often is.

After you have more than three months' experience, you will figure this out.
Old 02-22-2006   #13
projectphp
What The World, Needs Now, Is Love, Sweet Love
 
Join Date: Jun 2004
Location: Sydney, Australia
Posts: 449
Hey? Can be ignored? Under what circumstances? What bots? Good or bad?

IMHO, the fact bots do ignore it is not evidence they can, just evidence that some programmers are no good at their job!

That said, there really isn't a need to make pointed comments to a noobie. This isn't Search Engine 1337 Watch, and I think you can make a point well without the need to resort to personal remarks.
Old 02-22-2006   #14
BradBristol
Has-Been "SEO Expert"
 
Join Date: Jan 2006
Posts: 107
Quote:
there really isn't a need to make pointed comments to a noobie.
Not aware I did. Just pointing out that a little experience does not go a long way in the webmaster field.

php check your PMs
Old 02-27-2006   #15
feltonSEO
ugmSEO
 
Join Date: Oct 2005
Posts: 2
From Google:

To save bandwidth, Googlebot only downloads the robots.txt file once a day or whenever we've fetched many pages from the server. So, it may take a while for Googlebot to learn of changes to your robots.txt file. Also, Googlebot is distributed on several machines. Each of these keeps its own record of your robots.txt file.
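
As an aside to the quoted text, the once-a-day behaviour can be pictured as a cache with a 24-hour lifetime. The sketch below is only an illustration of that staleness window: `RobotsCache`, its `fetch` callback and the injectable clock are hypothetical names, and it leaves out both the "many pages fetched" trigger and the per-machine copies that the quote mentions.

```python
import time

ONE_DAY = 24 * 60 * 60  # seconds


class RobotsCache:
    """Keep one copy of each host's robots.txt and refetch it once it is a day old."""

    def __init__(self, fetch, max_age=ONE_DAY, clock=time.time):
        self._fetch = fetch      # callable: host -> robots.txt body
        self._max_age = max_age
        self._clock = clock      # injectable so the example can be tested
        self._entries = {}       # host -> (fetched_at, body)

    def get(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]     # possibly stale: edits made since are not seen yet
        body = self._fetch(host)
        self._entries[host] = (now, body)
        return body
```

Under this model, a Disallow line added an hour after the last fetch goes unnoticed for up to 23 more hours, which is one plausible reason a freshly edited robots.txt can appear to be ignored.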

We always suggest verifying that your syntax is correct against the standard at http://www.robotstxt.org/wc/exclusion.html#robotstxt. A common source of problems is that the robots.txt file isn't placed in the top directory of the server (e.g., www.myhost.com/robots.txt); placing the file in a subdirectory won't have any effect.

Also, there's a small difference between the way Googlebot handles the robots.txt file and the way the robots.txt standard says we should (keeping in mind the distinction between "should" and "must"). The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule. This more intuitive practice matches what people actually do, and what they expect us to do. For example, consider the following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin
It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
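
The difference between the two readings can be sketched in a few lines. This is a simplified model that assumes plain prefix matching with no wildcards and breaks ties in favour of the earlier rule; the function names are made up for illustration and this is not Googlebot's code.

```python
# The rules from the example above, in file order.
RULES = [("allow", "/"), ("disallow", "/cgi-bin")]


def first_match(rules, path):
    """Original-standard reading: the first rule whose prefix matches wins."""
    for verdict, prefix in rules:
        if path.startswith(prefix):
            return verdict == "allow"
    return True  # no rule applies, so the path is allowed


def longest_match(rules, path):
    """Reading described in the quote: the longest matching prefix wins."""
    best = None
    for verdict, prefix in rules:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[1])):
            best = (verdict, prefix)
    return True if best is None else best[0] == "allow"


if __name__ == "__main__":
    for path in ("/index.html", "/cgi-bin/search.pl"):
        print(path, first_match(RULES, path), longest_match(RULES, path))
```

For `/cgi-bin/search.pl`, the first-match reading stops at `Allow: /` and permits the fetch, while the longest-match reading picks `Disallow: /cgi-bin` and blocks it. The same file can therefore behave differently from one crawler to another.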
Old 02-27-2006   #16
BradBristol
Has-Been "SEO Expert"
 
Join Date: Jan 2006
Posts: 107
Hi feltonSEO,

Do you have a URL where you got this information from?
Old 02-27-2006   #17
mick g
Member
 
Join Date: Sep 2004
Posts: 126
It's from this page:
Google Information for Webmasters
Old 02-27-2006   #18
BradBristol
Has-Been "SEO Expert"
 
Join Date: Jan 2006
Posts: 107
Thanks Mick, I thought I had seen that info before, just could not remember where.

In the context of this thread: what Google said in public several years ago has to be taken with a grain of salt...

When that Google quote was written, BigDaddy was just a gleam in Larry's eye...

I don't know for sure yet, but I think BD handles Googlebot differently than Google did in the past.
Old 06-18-2013   #19
webpartner
 
Posts: n/a
Re: What's The Point of A Robots.txt File If Google Ignores It?

Like most things, it may appear useless if you don't know what it's for or how to use it!

You can block CRAWLING by using a Disallow in the robots.txt.

You can very easily and reliably block INDEXING by using a robots meta "noindex" in the head section of your document.

You can also use X-Robots-Tag in the .htaccess file to block indexing, e.g. of PDF files or other non-HTML resources (as they do not have a head section like an HTML page does).
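
To check which of those two signals a given response actually carries, something like the sketch below can help. It assumes you already have the response headers as a dict and the HTML body as a string; `is_noindex` is a hypothetical helper, and it only understands the plain comma-separated form of the directives (no user-agent prefixes in the header).

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def is_noindex(html="", headers=None):
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex."""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            if "noindex" in [token.strip().lower() for token in value.split(",")]:
                return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(is_noindex(html=page))
    print(is_noindex(headers={"X-Robots-Tag": "noindex"}))
```

One caveat: a crawler can only see either signal if it is allowed to fetch the URL, so a page carrying noindex must not also be disallowed in robots.txt.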
Old 06-18-2013   #20
Jazajay
 
Join Date: Jul 2007
Location: Leicester, England
Posts: 713
Re: What's The Point of A Robots.txt File If Google Ignores It?

Yeah, the robots.txt won't necessarily stop a URL from being indexed; it will stop caching of the page. You just need to, as webpartner says, use the page-level meta tag to stop indexing fully.
__________________
Connect with the amazing Jaz at: http://uk.linkedin.com/in/jamesjohnsonshapecreative