Is My Robots.txt Block Right?


onedodd
11-22-2005, 08:13 PM
Google Robots not following directions.

User-agent: *

Disallow: /angler_files/angler_print.html
Disallow: /resources/general/etiquette_print.html
Disallow: /resources/fishingtips/lure_tips_pf.html
Disallow: /business/print_me.php


I am trying to block a few printer-friendly pages and some "print me" forms.
This robots.txt has been in my root for over a solid month. Today I checked and found that all of these pages were crawled and cached yesterday.

Also, my Sitemaps control panel does say "URL blocked by Robots.txt". Why does it follow some directions and not others?

Why would the bots not honor my robots.txt? It must be wrong.

Just so you know, those files are one level from root, like this:
*ww.mysite.com/business/print_me.php
*ww.mysite.com/resources/fishingtips/lure_tips_pf.html


Thank You, Joe

maildeepak
11-22-2005, 10:51 PM
What have you put in the robots meta tags?

Suppose someone gives contradictory rules in robots.txt and in the robots meta tags: which one will the spider obey? Can anyone clarify this?

onedodd
11-22-2005, 11:38 PM
My meta tags say index, follow, but my header is an include file, so it's the same for the rest of my directory. It's the same header for every page.
In order to remove that from my header on the pages I wish to block, I'd have to remove it from the pages I want crawled regularly.

I can take the 'index, follow' out of some pages but that directory of mine is just too many headers and would take forever and a day.

So I can't really go in and change the header because we are talking a thousand or more pages.

I was under the impression that robots.txt was always followed first, or that it was the default set of instructions for the bots. But it sounds like that may be my problem.

Anybody have solid info on this?

maildeepak
11-23-2005, 01:15 AM
I was breaking my head over this issue for the past hour and could not find the solution.

So I'm waiting for anyone who knows about this to help out.

I believe the whole problem has arisen because you are using an include file. :D

All I can do now is wait until someone good with robots gives us a reply.

vincentBrown
11-23-2005, 01:54 AM
I am facing the same problem.

i) My robots.txt has been in existence ever since the creation of my site.

ii) I don't have meta tags that override what's specified in the robots.txt.

But still Google doesn't obey the robots.txt and crawls all "Disallowed URLs".

mcanerin
11-23-2005, 04:06 AM
Here is what happens:

When a search engine first visits a site, it loads the robots.txt file as the basis for crawling the rest of the site for that session.

Now, you would think that would cover you, but it doesn't. Google apparently follows the last instruction it was given, or the most explicit (granular) one, which would be the metatag in this case.

This should only happen if there is a direct link from somewhere to that page from outside your site. It might not be, but in theory that's the only time.

As a general rule, it's best to use robots.txt for directories, robots metatag for pages, and nofollow for links, in terms of granularity of results.

My suggestion would be to:

a) remove the explicit allow in the metatag. For one thing, it's a completely useless command: all robots index and follow by default. Its only use is to override other disallow directives, and you can see that it works fine for that.

b) Then do one of two things: put all the pages you don't want spidered in a single directory and disallow it using robots.txt, OR change the index, follow to noindex, nofollow (you can also use "none" as a substitute for "noindex, nofollow")

There is almost never an SEO reason to have an index, follow tag. There is only one exception and that's an override of a general directive.

More information (and examples): http://www.mcanerin.com/search-engine/robots-meta-tag.htm

Ian

mcanerin
11-23-2005, 04:20 AM
> i) My robots.txt has been in existence ever since the creation of my site
>
> ii) I don't have meta tags that override what's specified in the robots.txt
>
> But still Google doesn't obey the robots.txt and crawls all "Disallowed URLs".

Can you PM me an example page? I just quickly checked your robots.txt and site, and didn't find any instances of this.

Ian

maildeepak
11-23-2005, 06:38 AM
Dear mcanerin,

A small doubt: I also checked his robots.txt file.

But how are you able to say that Google did not visit the files? Will Googlebot make an entry there whenever it hits the file, or am I missing something?

regards
deepak

mcanerin
11-23-2005, 03:48 PM
Without access to his logs, it's impossible to know for certain (and just because a robot lands on a page doesn't mean it indexes it). So what I did was look at the pages that were excluded by the robots.txt (that I could find) and check Google's cache. Nothing.

Ian

maildeepak
11-23-2005, 10:06 PM
Thanks, that's what I wanted to know.

with regards,
deepak.c

maildeepak
11-23-2005, 10:28 PM
Hi onedodd,

I just went through your site, and I feel your robots.txt is fine.

But when I checked your source code, I found the following lines:

<META NAME="robots" CONTENT="index, follow">
<META NAME="GOOGLEBOT" CONTENT="INDEX, FOLLOW">

As mcanerin said in the posts above, remove those lines: they are useless, because by default most robots follow and index pages. Also, after giving a rule for all robots you have specified googlebot again, which is redundant.

So for now, my suggestion is to remove these tags from all the pages, as this is the only option I can find. Maybe giving those two meta tags and then giving the opposite rules in robots.txt is the reason (not sure, though).

If you can't remove them from all of the roughly 46k pages that you have, at least remove them from the pages that you have listed in robots.txt.

I will get back to you when I find some more.

Your turn to reply to this. :)

PhilC
11-23-2005, 10:28 PM
hmm...

I'm not suggesting that you are mistaken, Ian, but that isn't the way it should work. A spider shouldn't even look at a page if it's excluded from being indexed in the robots.txt file. There's no reason to examine the page for any overriding instructions.

Google gets the robots.txt file once a day, so they say, and they should obey it immediately. I know that the protocol is only concerned with indexing, and not spidering, but if someone has excluded a page in the robots.txt file, then there's no need to look any further for overriding instructions.

Something similar came up in the WMW banning spiders thread. It seemed to me that no engine should be listing linked URLs, as Google does with its URL only listings, without checking the site's robots.txt file first. The robots.txt file protocol is for people to *not* have pages listed in the serps, and URL only listings that send people to disallowed pages aren't in the spirit of the protocol, even though they are technically allowed. I don't know that Google doesn't check first - it came up in one of Danny's articles about WMW.

Sorry....I digressed a bit.

Chris_D
11-23-2005, 10:28 PM
Hi onedodd

I think a much better solution is to forget robots.txt, and solve your specific issue a different way. It looks like it only affects a few pages....

For example, on the page:

http://www.fintalk.com/angler.html make the link to http://www.fintalk.com/angler_files/angler_print.html a JavaScript link.

E.g. make the link on http://www.fintalk.com/angler.html take the form:

<a href="javascript:void(0)" onclick="openLookup('/angler_files/angler_print.html','appt',700,500)">Printer Friendly Version</a>

and then on the page http://www.fintalk.com/angler_files/angler_print.html add a print link like:

<a href="javascript:void(0)" onclick="window.print();return false">Print</a>

Another alternative is to just use CSS to make a printer friendly page.

That should solve the issue.... without complicating things too much.

best

Chris

maildeepak
11-23-2005, 10:37 PM
> hmm...
>
> I'm not suggesting that you are mistaken, Ian, but that isn't the way it should work. A spider shouldn't even look at a page if it's excluded from being indexed in the robots.txt file. There's no reason to examine the page for any overriding instructions.

Exactly my thoughts, PhilC.

But I could not find any hard evidence yesterday when I was searching the web for documentation. That is why I didn't post it here; I thought I might be wrong.

PhilC
11-23-2005, 10:42 PM
Technically, the robots.txt protocol disallows the indexing of files, and not the spidering of them, but the intention is to disallow it all, imo.

mcanerin
11-23-2005, 10:57 PM
> I'm not suggesting that you are mistaken, Ian, but that isn't the way it should work. A spider shouldn't even look at a page if it's excluded from being indexed in the robots.txt file. There's no reason to examine the page for any overriding instructions.

You are correct - but there is one exception. If there is a link from outside the site directly to that page, then what happens is that the spider lands on it, then, now that it knows the domain, grabs the robots.txt. But since it also knows that there is an explicit instruction to index (and in this case, it's very explicit - it names the googlebot directly and allows index and follow) it obeys the most granular one.

This may also explain why only some of the pages excluded are indexed - only the ones Google landed on from the outside should be hit.

This is actually proper behavior, though a weird result. The idea is that if you did this on a page (ignoring the robots.txt for now):

<META NAME="robots" CONTENT="noindex, nofollow">
<META NAME="GOOGLEBOT" CONTENT="INDEX, FOLLOW">

The result would not be that googlebot goes away as soon as it sees the directive for all robots to leave. It loads the rest of the page and, upon seeing a more specific directive that applies, obeys it instead. The above code would result in all robots except Google being excluded.

Some people actually do this with all 4 major search engines, and exclude everything else. This is proper and expected behavior.

The robots.txt would be treated exactly like that first line of code - a general directive. This is an odd case because if the robots.txt prevents following, then Google normally would never be in a position to visit and therefore see the more granular directive. The only thing I can think of is a direct link from the outside, or something similar.

Some really weird things can happen when you mess around with various levels of permissions - I once locked myself out of an NT box totally (had to re-format the drive) by accidentally setting an express "disallow" for the system on "users", never thinking that it would (in that one case) override the express "allow" for "Administrators" (who were also "users" in this case). It's been around 10 years and I still feel my chest hurt when I think about it. It's an NT 4 thing due to Netware compatibility vs standard MS file permissions - I don't think it would happen on a *nix box, but I've been careful not to try to find out ever since ;)

Ian

PhilC
11-23-2005, 11:24 PM
I like the anecdote :)

Technically, that's not quite what happens. Google's spider doesn't look at the pages - it just gets and stores them for another programme to examine later. But when getting the first page of a site, the spider also gets the robots.txt file, so both files are in Google's 'possession' before either of them are examined. That being the case, the robots.txt file should be examined first. It may not be, but it should be.

I may be mistaken about Google getting the robots.txt file immediately when the first file in a site is fetched, but I'm pretty sure it does. In fact, I'm pretty sure it's the first thing that it gets. It certainly should be the first thing.

In this case, the pages had previously been allowed, and it could be that Google is following the on-page directives rather than the robots.txt ones. I don't think they should, but they may be. Removing the on-page directives should fix it. As you pointed out, they are useless, because following and indexing is the default for all pages.

maildeepak
11-23-2005, 11:58 PM
> In this case, the pages had previously been allowed, and it could be that Google is following the on-page directives rather than the robots.txt ones. I don't think they should, but they may be. Removing the on-page directives should fix it. As you pointed out, they are useless, because following and indexing is the default for all pages.

I agree with you. As of now, we don't have anything else to work on. Remove the on-page robots meta tags and wait until Googlebot comes again.

mcanerin
11-24-2005, 01:59 AM
I definitely agree to removing the on-page directives :)

I was reading the Yahoo crawler information recently and they said something that caught my attention, and relates to your mention of "getting" the documents and having them in possession:

Yahoo! Slurp obeys the Robot Exclusion Standard. Specifically, Yahoo! Slurp adheres to the 1994 Robots Exclusion Standard (RES).
Yahoo! Slurp will obey the first entry in the robots.txt file with a User-agent containing "Slurp". If there is no such record, it will obey the first entry with a User-agent of "*".

Disallowed documents, including slash (the home page of the site), are not indexed, nor are links in those documents followed. Yahoo! Slurp does read the home page at each site and uses it internally, but if it is disallowed it is neither indexed nor followed. If a page has robots.txt standards disallowing it to be crawled, Yahoo! will not read or use the contents of that page. The URL of a disallowed page may be included in Yahoo! Search Technology as a "thin" document with no text content. Links and reference text from other public web pages may provide identifiable information about a URL and may be indexed as part of web search coverage.

source: http://help.yahoo.com/help/us/ysearch/slurp/slurp-02.html
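
The record-selection rule in the Yahoo! text quoted above can be sketched in a few lines of Python. This is only an illustration of the quoted wording, not Yahoo!'s code: the record layout, the select_record and is_disallowed names, and the sample paths are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# One record = (User-agent values, Disallow prefixes), kept in file order.
Record = Tuple[List[str], List[str]]

# Hypothetical robots.txt contents, already split into records.
RECORDS: List[Record] = [
    (["*"], ["/cgi-bin/", "/images/"]),
    (["Slurp"], ["/private/"]),
]


def select_record(records: List[Record], token: str = "slurp") -> Optional[Record]:
    """Pick the single record a crawler obeys: the first record whose
    User-agent contains the crawler's token, else the first "*" record."""
    for agents, rules in records:
        if any(token in agent.lower() for agent in agents):
            return agents, rules
    for agents, rules in records:
        if "*" in agents:
            return agents, rules
    return None


def is_disallowed(records: List[Record], path: str, token: str = "slurp") -> bool:
    """True if the selected record has a Disallow prefix matching the path."""
    record = select_record(records, token)
    if record is None:
        return False
    # An empty Disallow value means "nothing is disallowed".
    return any(rule and path.startswith(rule) for rule in record[1])
```

With RECORDS as above, a crawler whose token is "slurp" is bound only by the Slurp record, so /cgi-bin/ stays open to it even though the "*" record disallows it. Records are chosen, not merged.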

The point about needing to read (but not use) the home page, and the reference to "thin" documents leads me to believe (I have to guess, not having a Yahoo engineer handy) that the search engine lands on the page and reads it, by necessity of making the call to the URL in the first place. It would have to, since with all the redirects (301, 302, meta-refresh etc) that are possible the search engine would have to fully follow the page until the trail ends before knowing what robots.txt it should use.

Not having a Google engineer handy, I can only hazard a guess and say that Google would do something similar. Not a rock solid proof, I'll grant, but the best I've got at the moment :)

Imagine a site with a disallow all in its robots.txt, with a page that is 301'd to a site that had index, follow on everything. You now have 2 possible robots.txt files to use - one from the source domain and one from the target domain.

Since the way a 301 is handled is that the source domain is not indexed, it would seem that the proper robots.txt to use would be the one from the target, since the page is technically there. But if the first robots.txt kicks the spider out, then that could not happen.

This implies that pages can be read and directives in them acted upon independently of the robots.txt. They would have to. A robots.txt file can't be absolute by definition, else the search engine would not be able to access the robots.txt file in the first place - technically, "disallow: /" excludes the robots.txt file itself, which causes a bit of a chicken/egg issue.

Ian

maildeepak
11-24-2005, 03:14 AM
> This implies that pages can be read and directives in them acted upon independently of the robots.txt. They would have to. A robots.txt file can't be absolute by definition, else the search engine would not be able to access the robots.txt file in the first place - technically, "disallow: /" excludes the robots.txt file itself, which causes a bit of a chicken/egg issue.
>
> Ian

chicken...

egg...

umm... ok... I give up. :D

Chris_D
11-24-2005, 03:42 AM
There has always been a huge degree of misinterpretation of what the robots.txt file and noindex, nofollow actually do, and what people would like them to do.

People want these to make their sites invisible and think that's what these files/ tags do.

Reality is - there is no 'invisibility' tag/ file. Google records the URL if there are links to it - but Google respects the metatags and robots.txt by not 'indexing' the content. The confusion arises because Google still puts the url in the Google index (noun) even if there is a robots.txt disallow.

The confusion is that 'index' can be a noun or a verb - this may sound really nitpicky - but it is crucial to understanding how this all works.

In summary:

1. Robots.txt won't stop a URL being displayed in Google serps.

2. meta 'noindex' will stop a URL being displayed in Google serps.

The Standard for Robot Exclusion says that robots.txt is intended to control robots' fetching/visiting of pages: "Disallow - The value of this field specifies a partial URL that is not to be visited" http://www.robotstxt.org/wc/norobots.html

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links.
http://www.robotstxt.org/wc/exclusion.html

The basic idea is that if you include a tag like: <META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't be indexed.
If you do: <META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot.
http://www.robotstxt.org/wc/faq.html#noindex

If you 'disallow' a page in robots.txt, then an on-page <meta name="robots" content="noindex"> tag can have no effect on a robot which obeys the robots.txt exclusion, because such a robot won't visit/fetch the page.

Logically - if it does not visit the page, it can't parse the metatag.......
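
That ordering can be shown with a small Python sketch built on the standard library. It is an illustration only, under stated assumptions: the directives_seen function, the fetch_page callback, the Googlebot agent string used as a default, and the sample page are all made up for the example and are not how any real crawler is written.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical inputs for the example.
ROBOTS_TXT = "User-agent: *\nDisallow: /business/print_me.php\n"
PAGE = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head></html>'


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = {name.lower(): (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def directives_seen(robots_txt, path, fetch_page, agent="Googlebot"):
    """Return the meta directives an obedient robot would see for `path`,
    or None when robots.txt keeps it from fetching the page at all."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, path):
        return None  # never fetched, so the meta tag is never parsed
    parser = RobotsMetaParser()
    parser.feed(fetch_page(path))
    return parser.directives
```

With ROBOTS_TXT in force the function returns None without ever calling fetch_page; with an empty robots.txt the same page yields the noindex directive.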

maildeepak
11-24-2005, 05:47 AM
> If you 'disallow' a page in robots.txt, then your on page metatags <meta name="robots" content="noindex"> tag can have no effect on a robot which obeys a robots.txt exclusion, because such a robot won't visit/ fetch the page.
>
> Logically - if it does not visit the page, it can't parse the metatag.......

Chris,

Do you mean that, with respect to the first post of this thread, wherever he doesn't want a page indexed he should remove that entry from the robots.txt file and then add a META robots 'noindex' to those particular pages?

Is that correct? Because as you said, if the page is in robots.txt then it will not be crawled, and hence it will show up in the index. That is what you intend to say, right? Now my doubt is: if a particular page is not being crawled by a bot because of robots.txt, then how will it show up in the serps, as you said in your post?

Or am I confusing two issues?

PhilC
11-24-2005, 08:03 AM
Ian.

My thoughts whilst reading that Yahoo! extract threw up another problem with Yahoo!. They say they follow the first Yahoo!-specific entry of the robots file (I'm fed up with typing that with ".txt" at the end), and only if there is no such entry do they follow an entry that encompasses all spiders. That's got to be wrong. For Yahoo!, you can't 'globally' disallow a certain directory, and also disallow Yahoo! from individual pages and directories. E.g. you can't globally disallow the cgi-bin and images directories, and also disallow a specific directory to Yahoo!. I don't think that that's the way the robots protocol was intended to be. It can be easily handled, but I think they do it wrong.

301s and 302s muddy it. I only got out of bed a few minutes ago and my brain hasn't warmed up to the day yet, so I'll skip that aspect for now.

I don't agree that an engine has to get the linked-to page before it requests the robots file from the site. The programming is such that it arbitrarily gets the robots file, because it is never (rarely) linked to, so, imo there's no reason not to get it first if the engine hasn't yet crawled any of the site's pages. The logfiles show that fetching the robots file is programmatically part of the spidering operations each day, so I can imagine the programme to be written something like:-

1. Get the next file to fetch from the pile (the pile of URLs from millions of sites)

2. Have we got the robots file for this site today? If yes, fetch the file, else fetch the robots file and act on it.

Alternatively:-

1. Get the next file to fetch from the pile (the pile of URLs from millions of sites)

2. Fetch the file.

3. Have we got the robots file for this site today? If no, fetch the robots file and act on it.


I don't see any reason why the first alternative shouldn't be done. The robots file has to be fetched at some stage, and the logfiles show that it is done at the same time (part of the spider's programming), so I don't see any reason why it shouldn't be the first file to fetch. Remember that the engines don't immediately follow new links that they find. The links are simply placed on the pile of links to crawl sometime in the future.
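
The first alternative above can be sketched as a short Python loop. Everything named here is invented for illustration (the crawl function, the ExampleBot agent name, the example.com URLs and the in-memory SITE dictionary standing in for the network), and it says nothing about how any engine's spider is actually written.

```python
from collections import deque
from datetime import date
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical site, used instead of real HTTP requests.
SITE = {
    "http://example.com/robots.txt": "User-agent: *\nDisallow: /print/\n",
    "http://example.com/a.html": "<html>a</html>",
    "http://example.com/print/a.html": "<html>print a</html>",
}


def crawl(queue, fetch, today=None, agent="ExampleBot"):
    """Fetch URLs from the pile, getting each site's robots file first
    if it has not already been fetched today. Returns the request log."""
    today = today or date.today()
    robots_cache = {}  # host -> (date fetched, parsed rules)
    log = []
    pending = deque(queue)
    while pending:
        url = pending.popleft()                       # 1. next URL from the pile
        parts = urlsplit(url)
        cached = robots_cache.get(parts.netloc)
        if cached is None or cached[0] != today:      # 2. robots file got today?
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            rules = RobotFileParser()
            rules.parse(fetch(robots_url).splitlines())
            robots_cache[parts.netloc] = (today, rules)
            log.append(robots_url)
        rules = robots_cache[parts.netloc][1]
        if rules.can_fetch(agent, url):               # act on it before fetching
            fetch(url)
            log.append(url)
    return log
```

Called as crawl(["http://example.com/print/a.html", "http://example.com/a.html"], SITE.__getitem__), the loop requests robots.txt once, skips the disallowed print page and fetches only a.html.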

I disagree with Yahoo! and Google about "thin" results in the serps. I just read in another thread that GG said:-

> First, I think there's a definite value to returning a link to a page even if we can't crawl that page. Quick example: the New York Times used to disable all bots from crawling them. That's fine, and we respected their robots.txt. But if a user comes to Google and types "ny times" into our search box, the best result to give them is nytimes.com.

If a site says 'don't index that page', it means don't put it in the search results. That's what we mean when we disallow something in the robots file. I believe that that was the intention of the protocol, and imo, Google and Yahoo! are going against the protocol when they do things like that. They are out of order by going against the protocol just because they think it's more useful to searchers.

PhilC
11-24-2005, 08:25 AM
> In summary:
>
> 1. Robots.txt won't stop a URL being displayed in Google serps.
>
> 2. meta 'noindex' will stop a URL being displayed in Google serps.

I don't follow your reasoning, Chris. In both cases, the engine has the URL of the disallowed page, and from what both Yahoo! and Google have said, they will display it in the serps. I don't see how disallowing a page by a meta tag stops the URL from being displayed as a "thin" result.

Chris_D
11-24-2005, 07:58 PM
maildeepak - yes.

Here's my interpretation.

I think the difference is that if you have a robots.txt disallow then, as GG said (the earlier post), Google will list the domain name - the 'thin' result.

The actual wording from Robots.txt (see my earlier reference) is
Disallow - The value of this field specifies a partial URL that is not to be visited

It only says 'don't visit'. Doesn't use the word 'index'. That's the important part.

If you have a meta "noindex" - Google won't even list it as a thin.

The actual wording is (see earlier reference)

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links.

The difference is - it says 'don't index'

If you have both - the robots.txt trumps the meta noindex - as the bot won't parse/ visit the page, to be able to read the meta noindex.... so you get a 'thin' result.....

Robots.txt = thin result
Meta noindex = no thin result - nothing
Use both = thin result

At least that's what my testing of these two methods has shown over the past 4 years....... Try it and see.
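
The three-line table above can be written as a tiny Python function. This is just a restatement of the reported test results, not documented engine behavior; the serp_outcome name and the "full" label for the case where neither method is used are assumptions added for the example.

```python
def serp_outcome(blocked_by_robots_txt: bool, has_meta_noindex: bool) -> str:
    """Map the two exclusion methods onto the listing described above."""
    if blocked_by_robots_txt:
        # The page is never fetched, so any meta noindex on it goes unseen.
        return "thin"
    if has_meta_noindex:
        return "nothing"
    return "full"  # assumed label for an ordinary, fully indexed page
```

The robots.txt check comes first because it decides whether the meta tag can be read at all.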

:)

PhilC
11-24-2005, 08:22 PM
Now I see what you mean, Chris. I'd never noticed the difference between them before - don't visit and don't index.

So my reasoning that the robots.txt file should always be the first file to fetch (I'm sure it is) results in a thin listing, regardless of the meta tag instruction. I suppose that's the way it is, if the quotes you provided are strictly followed by the engines. The robots protocol really does need to be updated in view of the links-based engines. I've been looking at it from the point of view that disallows and nofollows meant don't show in the search results, and I'm sure that's what most of us intend by both of them.

Marcia
11-25-2005, 03:21 AM
Thanks, Chris_D. Very good explanation!

Alan Perkins
11-27-2005, 03:05 PM
> Google Robots not following directions.
>
> User-agent: *
>
> Disallow: /angler_files/angler_print.html
> Disallow: /resources/general/etiquette_print.html
> Disallow: /resources/fishingtips/lure_tips_pf.html
> Disallow: /business/print_me.php

There should not be a blank line between the User-agent field and the Disallow field. The blank line indicates the end of a record. So you have an invalid robots.txt file there. You need:

User-agent: *
Disallow: /angler_files/angler_print.html
Disallow: /resources/general/etiquette_print.html
Disallow: /resources/fishingtips/lure_tips_pf.html
Disallow: /business/print_me.php
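
The effect of that blank line can be checked with Python's standard-library urllib.robotparser, which happens to treat a blank line as the end of a record. The sketch below is only a way to test a file locally; the is_allowed helper and the Googlebot agent string are assumptions, and whether Googlebot's own parser is equally strict is not something this can show.

```python
from urllib.robotparser import RobotFileParser

# The file as originally posted, with a blank line after User-agent.
POSTED = """\
User-agent: *

Disallow: /angler_files/angler_print.html
Disallow: /resources/general/etiquette_print.html
Disallow: /resources/fishingtips/lure_tips_pf.html
Disallow: /business/print_me.php
"""

# The corrected file: same rules, no blank line inside the record.
CORRECTED = POSTED.replace("User-agent: *\n\n", "User-agent: *\n")


def is_allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text in memory and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for label, text in (("posted", POSTED), ("corrected", CORRECTED)):
        print(label, "allows print_me.php:", is_allowed(text, "/business/print_me.php"))
```

Under this parser the posted file blocks nothing, because the Disallow lines no longer belong to any record, while the corrected file blocks the four listed pages and leaves everything else open.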

For the avoidance of doubt:

1. robots.txt takes precedence over the robots meta tag, since a resource has to be read in order for the robots meta tag to be seen, and a resource won't be read if robots.txt prevents it being read.
2. Theoretically robots.txt does not prevent a URL from being indexed, it only prevents the content at a URL from being indexed. Some search engines (e.g. Google, Yahoo) index URLs without indexing the content at those URLs. In practice such search engines will remove the indexed URL once they see that it is protected by robots.txt, but it may reappear later. In a nutshell, you can't stop URLs being indexed, you can only stop content being indexed.
3. You can't use the robots.txt protocol to prevent a robots.txt file being read.

There have been lots of errors in this thread. I will clear up a few, just for the sake of posting accurate information. No argument intended. :)

> Now, you would think that would cover you, but it doesn't. Google apparently follows the last instruction it was given, or most explicit (granular) which would be the metatag, in this case.

Wrong. Google will not read a metatag if the content in which it resides is protected by robots.txt.

> Technically, the robots.txt protocal disallows the indexing of files, and not the spidering of them, but the intention is to disallow it all, imo.

Wrong. The robots.txt protocol disallows the retrieval of content. It's actually nothing to do with indexing - it's designed for all types of robots, not just search engine spiders.

> Not having a Google engineer handy, I can only hazard a guess and say that Google would do something similar. Not a rock solid proof, I'll grant, but the best I've got at the moment.

No it does not.

> Imagine a site with a disallow all in it's robots.txt, with a page that is 301'd to a site that had index, follow on everything. You now have 2 possible robots.txt files to use - one from the source domain and one from the target domain.

No you don't. "index, follow" is a meta tag directive, not a robots.txt directive. One site cannot have two robots.txt files. There can be only one. ;) The 301 in this example would not be read and would not be followed, unless by a robot that breaches the 1994 protocol. Slurp breaches this protocol by reading "/" even if "/" is protected, but Googlebot does not - and the original question was about Googlebot.

Fort Cake
11-27-2005, 05:42 PM
> The robots.txt protocol disallows the retrieval of content. It's actually nothing to do with indexing - it's designed for all types of robots, not just search engine spiders.

Precisely - this can be seen in log files if you keep a history. Google used to spider everything on my site until I put a robots file in / about a year ago.

Within a day or 2 Google stopped visiting everything listed in my robots file and continues to obey the robots file - Google does not even get the content if it's in the robots file.

On the other hand, for those into log analysis, it's interesting to see which bots don't obey the robots file... And it's interesting to watch site scrapers and such come and go.

PhilC
11-27-2005, 07:22 PM
Some of that had already been cleared up by Chris_D, Alan.

vincentBrown
11-28-2005, 06:49 AM
Missed out on the thread for the last 3-4 days due to some urgent personal work. But when I read through all your posts, especially Alan Perkins's, my doubts got cleared. (Thanks Alan)

My problem was that when I did a site:www.rankquest.com search on Google I found links that I had disallowed Google from indexing (using robots.txt). But Alan's post makes it clear that it is not the links that we disallow but the content at those links. That kinda explains why the site: search displays those links.

cheers,
Vincent S Brown ;-)

Alan Perkins
11-29-2005, 08:45 AM
> Thanks Alan

You are welcome. :)

GoogleGuy
12-02-2005, 08:30 PM
Alan knocked that one out of the park; if you have a robots.txt, you may see url references (just the url), but you won't see snippets or fully crawled urls.

You can make it so that the url doesn't even show up in a couple ways:
1. (well-known) Let Google crawl the page, but add noindex as a meta tag on that page, and nofollow as a meta tag if you don't want Google to follow any outlinks from that page.
2. (not well-known) Forbid the pages in the robots.txt and then use the url removal tool that we provide. The url removal tool will remove pages for six months, but you should be very careful with it (because the pages will be gone for six months). In particular, if you have a www. site and you're trying to get rid of non-www urls, I wouldn't use the url removal tool. If Google is smart enough to know that the www and non-www urls are the same, requesting to remove your own non-www urls could also remove your www urls, for example.

Anyway, very few people bother with #2, but it's good to know about it.

onedodd
12-02-2005, 08:46 PM
Alan helped me via Private Messages and I thanked Alan but would like to thank him publicly.

Alan helped me, a novice at best, understand quite a few things going on with my robots.txt and website. He explained some items to me that I was worried about in easy to understand terms and didn't make me feel like the amateur I am.
Kudos Alan and Thanks Again, Joe

Alan Perkins
12-04-2005, 06:30 PM
Kudos Alan and Thanks Again
You're welcome, Joe. You have a nice site there. I hope all goes well for you. :)

Cortney
12-08-2005, 02:02 PM
I created a Google sitemap, and it worked with every one of my customers, except 1. Here is the error and description that I'm getting:

We've detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.
This configuration presents a security risk for site verification and therefore, we can't verify your site. If your web server is configured to return a status of 200 in the header of 404 pages, and we enabled you to verify your site with this configuration, others would be able to take advantage of this and verify your site as well. This would allow others to see your site statistics. To ensure that no one can take advantage of this configuration to view statistics to sites they don't own, we only verify sites that return a status of 404 in the header of 404 pages.

Please modify your web server configuration to return a status of 404 in the header of 404 pages. Note that we do a HEAD request (and not a GET request) when we check for this. Once your web server is configured correctly, try to verify the site again. If your web server is configured this way and you receive this error, click Check Status again and we'll recheck your configuration.

Apparently Google support is too busy to answer my support question directly. Therefore, I was wondering if anyone has experienced the same issue, or knew how to resolve it?

Thanks.

Alan Perkins
12-08-2005, 04:08 PM
Hi Cortney

It sounds like you have implemented a 404 handler, and the 404 handler never issues a 404 response (even when resources truly do not exist). This is a bad idea, as it makes your site infinitely large.

Get your developer(s) to take a look at your 404 handler and modify it so that it will return an HTTP 404 response when the request is for a resource that truly does not exist, such as /gvhbfvhbvdcjhb. At the moment, I would guess that a request for /gvhbfvhbvdcjhb generates an HTTP 200 response on your site, and that's what Google is complaining about.
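
As a rough illustration of what a well-behaved handler looks like, here is a minimal sketch using Python's standard http.server. It is not the configuration of any site in this thread: the PAGES table, the resolve helper and the page bodies are hypothetical. The point is only that the custom "not found" page is still sent with a 404 status, for HEAD as well as GET requests.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical site content: the only resource that truly exists.
PAGES = {"/": b"<h1>Home</h1>"}
NOT_FOUND_BODY = b"<h1>Sorry, that page does not exist</h1>"


def resolve(path: str) -> tuple[int, bytes]:
    """Return (status, body): 200 for known pages, 404 plus a custom page otherwise."""
    if path in PAGES:
        return 200, PAGES[path]
    return 404, NOT_FOUND_BODY


class SiteHandler(BaseHTTPRequestHandler):
    def _respond(self, include_body: bool) -> None:
        status, body = resolve(self.path)
        self.send_response(status)  # the custom page keeps its 404 status
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self) -> None:
        self._respond(include_body=True)

    def do_HEAD(self) -> None:
        # HEAD gets the same status and headers as GET, without the body.
        self._respond(include_body=False)
```

To try it locally, pass SiteHandler to http.server.HTTPServer and call serve_forever(), then request a path such as /gvhbfvhbvdcjhb.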

PhilC
12-08-2005, 04:15 PM
The resolution is to get the server to return a 404 header when requested files don't exist. It's not uncommon for servers to return other codes instead, such as a 403 (forbidden) or a 200 when a custom 404 page is returned.

Google needs to know that the person who is requesting the verification actually has authority in the site. They do it by asking for a file with a particular filename to be uploaded to the site. Then they request the file, and if they get a 200 code returned, it is assumed that the file exists, and the site is verified.

But there's a problem when servers don't return 404 codes when a file doesn't exist. If a 200 is returned when a file doesn't exist, as in the case of your customer's server, then Google would get a 200 (ok) even if the special file didn't exist. Because of that, they can't trust a 200 code without also checking that the server actually returns 404s for non-existent files. And that's what they do. They also request a file that couldn't be in the site, and they are looking for a 404 to be returned. If they don't get a 404, they can't verify that the person requesting verification has authority in the site.

Another solution is to use the .htaccess file to return a 404 when the 404-checking file request is made, but it's better to get the server to return 404s.
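
The two-request reasoning above can be written down as a tiny sketch. This is only the logic as described in this post, not Google's actual implementation, and the function name can_verify is made up for illustration.

```python
def can_verify(verification_file_status: int, bogus_file_status: int) -> bool:
    """Decide whether a 200 for the verification file can be trusted.

    verification_file_status: status returned for the uploaded verification file.
    bogus_file_status: status returned for a file that could not exist on the site.
    """
    # A 200 for the verification file only proves anything if the server
    # answers 404 for a file that is known not to exist.
    return verification_file_status == 200 and bogus_file_status == 404
```

A server that answers 200 for everything gives can_verify(200, 200), which is False, and that is the situation the error message describes.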

Fort Cake
12-08-2005, 04:18 PM
How can you check to see that a 404 header is being returned?

PhilC
12-08-2005, 04:21 PM
Request a file that you know doesn't exist, and see what page is shown in the browser. Also the returned code is in the Title and is shown in the blue bar at the top of the browser.

Alan Perkins
12-08-2005, 04:49 PM
How can you check to see that a 404 header is being returned?
Here's something that you can do on any PC... Create your HTTP request. Here is an example for searchenginewatch.com:
telnet searchenginewatch.com 80
HEAD /grihwdofdw HTTP/1.1
Host: searchenginewatch.com

Copy all of that to the clipboard (including the blank lines at the end)
Open a DOS prompt (Start->All Programs->Accessories->Command Prompt)
Right click in the DOS prompt and select Paste
Your request will be sent and you'll get a response something like this:
HTTP/1.1 404 Not Found
Date: Thu, 08 Dec 2005 21:35:15 GMT
Server: Apache
Content-Type: text/html
You are looking for that "404" on the first line of the response - that means the server is sending the correct "Not Found" response for the request
If you see any other response, e.g. "HTTP/1.1 200 OK" then your 404 handler is not returning a 404 when you want it to
To test this on your own site, substitute your domain for both instances of "searchenginewatch.com" (note: include the www. if your domain is referenced with that, as most are - unlike searchenginewatch.com) and substitute the URL you want to test, relative to your root URL (i.e. beginning with a "/") for /grihwdofdw
Request a file that you know doesn't exist, and see what page is shown in the browser. Also the returned code is in the Title and is shown in the blue bar at the top of the browser.
That may not work if you have a 404 handler on your site, as the 404 handler may not output the standard response. e.g. the above example from searchenginewatch.com gives the correct 404 response, but if you check the URL (http://searchenginewatch.com/grihwdofdw) you'll see that 404 response is not mentioned in title or text.
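
If telnet is not handy, the same HEAD check can be scripted. The following is a minimal sketch using Python's standard http.client; the helper name head_status is an assumption for illustration, and the path should be one you know does not exist on the site being tested.

```python
import http.client


def head_status(host: str, path: str, port: int = 80, timeout: float = 10.0) -> int:
    """Send a HEAD request for path to host and return the numeric status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # http.client adds the Host header automatically.
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


# Example call (performs a real network request, so it is left commented out):
# print(head_status("searchenginewatch.com", "/grihwdofdw"))
```

A return value of 404 corresponds to the "HTTP/1.1 404 Not Found" first line in the telnet output; a 200 for a path that does not exist means the 404 handler is not returning a 404.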

Fort Cake
12-08-2005, 05:13 PM
Ah - OK, thanks. I did that in my Mac terminal window and got this returned:
HTTP/1.1 404 Not Found
Date: Thu, 08 Dec 2005 22:07:04 GMT
Server: Apache/2.0.53 (FreeBSD) PHP/4.3.10
Accept-Ranges: bytes
Content-Length: 11953
Content-Type: text/html; charset=ISO-8859-1

Connection closed by foreign host.
EV-ESR1:~ marcsmith$

So my server is set up correctly.

I asked because I set a custom 404 page and wanted to make sure I didn't screw up something when I made that change to the httpd.conf

Alan Perkins
12-08-2005, 05:25 PM
So my server is set up correctly.
Correct. :)

PhilC
12-08-2005, 06:11 PM
That's an excellent tip Alan. I've never come across it before.

Yes - I'd forgotten that a custom 404 won't show the code in the browser.

Cortney
12-09-2005, 10:46 AM
You all gave very good information. I will pass it along. Thank you very much for your time and assistance.