301 on robots.txt


Portran
08-07-2008, 07:50 PM
Hi

Hope someone can help. I have an old website redirected to a new one through page by page 301 .htaccess redirects. That's all fine, with the exception of the old robots.txt. The redirects include one for the root page / and this is effectively causing Google to follow the redirect when looking to download robots.txt.

This results in them displaying a 301 status in Webmaster Tools for robots.txt. The URL they are sent to is "domain.comrobots.txt", which is not a valid domain, so they never even get a 404. I don't think I can redirect from one robots.txt to the other; it doesn't seem rational, and they are not exactly the same anyway, as the new site is similar but not identical.

There seems to be no way to stop them looking for a robots.txt on the old site. I can't even try URL removal with the 301, although this may not apply anyway. So what are you supposed to do about robots.txt when you redirect from one site to another? No idea if the 301 on the old robots.txt will cause a problem, but Google have been known to simply not crawl a site when there is a doubt or oddity in robots.txt, which would make the redirects pointless.

I appreciate this must be a common situation but I have searched pretty well, including this forum and can't find an answer. Hope you can help.

JohnW
08-07-2008, 11:42 PM
>The redirects include one for the root page / and this is effectively causing Google to follow the redirect when looking to download robots.txt.

I'm really not sure what you mean, but it sounds wrong. Anyhow, Google will be using the robots.txt at the new site, regardless of what is or isn't redirected to it. As long as you have redirected every page of your site that Google knows about, G will eventually stop looking for the old robots.txt.

Portran
08-08-2008, 08:53 AM
Hi JohnW

Thanks very much for the reply. What I meant by / is the redirection of the home page e.g.

Redirect 301 / http://www.newdomain.com

The home page has always been just the root and indexed as such, with index.htm redirected to the root to avoid canonicalisation problems. So there's no option I know of to avoid that line, which then also redirects any pages that are not themselves individually redirected. This all works fine but I can't figure out what to do about the robots.txt.

As you said, the only relevant one is the robots.txt at the new site, but I imagined there must be a way to stop Google constantly trying to download a file they can never access, an unchangeable 301. Hope you are right that they'll just get bored in the end, although I'm still surprised there isn't a standard solution for something that must occur quite a bit. Webmaster Tools gives the option to remove a Sitemap but not a robots.txt.

Thanks again

AussieWebmaster
08-08-2008, 02:00 PM
I would replace the file or delete it

Portran
08-08-2008, 04:00 PM
Hi AussieWebmaster

Thanks very much for your reply. The trouble is, they are not looking for the robots.txt file on the site it relates to. I have tried deleting it for a sufficient time and putting up a new file, just in case, but the .htaccess directives are taking priority, as they should I guess. Neither is it possible to add a directive for the robots.txt; where would I direct this to?

So without a directive, e.g. Redirect 301 /robots.txt http://...., and because the server will not add a trailing slash in this circumstance, the 301 redirect sends the bot to look for domain.comrobots.txt. There is no such domain suffix, so no response is returned; the file just shows as 301 and this apparently means they will carry on looking. I do not know whether this will interfere with reindexing the old URLs, but bearing in mind Google's recent utterances on robots.txt, that is possible.

Others must have seen this before, although I can't find any relevant posts anywhere. Will keep on looking and post back if I find an answer, thanks again for your logical suggestions, just seems they are not logical to Google.

JohnW
08-08-2008, 04:31 PM
>they are not looking for the robots.txt file on the site it relates to

Have you checked the log files? If they are spidering the new site they should be pulling the new robots file.

Portran
08-08-2008, 05:21 PM
Hi JohnW

Kind of you to reply again. I agree that just redirecting one robots.txt to the other would have made sense, although the two sites are not exactly the same, so this would not be correct. They would also have the wrong sitemap entry. This could of course be taken out, as it's not vital, but I avoided that because of the only relevant section I did find in Google's documentation, where they state that a robots.txt is site specific and should not be drawn from another site. So the only robots.txt that can be used is the original one, or none at all, as there is no obligation to have one.

I have checked the logs and there are no errors at the new site, as they are not looking there. They are searching for a file at domain.comrobots.txt which can't exist, or return a 404. The new site is being crawled well by Google and Yahoo, probably due to links to the new site but there is no evidence of crawling the old site.

Your original statement that in the end they'll give up trying may well be right. I'm just concerned that they may shy away from the site because they regard the robots.txt situation as impossible to decipher. As it happens, a friend had recently redirected an old site to a new one in a similar way. He's not someone to bother with checking Webmaster Tools, or much else, but I went round to have a look this afternoon. He is in exactly the same situation, with the identical error. I've no idea if this is something new; maybe Google have made changes, but I cannot find anything on their site that deals with this. It may well be that there's no harm, but it's strange the situation is not covered.

JohnW
08-09-2008, 09:44 PM
G will keep looking for robots.txt on the old domain as long as it sees non-redirected pages existing on that domain. You could do a site: check in WMT and see if you missed any 301s. Likewise, G should be requesting robots.txt from the new domain each time it crawls; if not, then you need to track this down and deal with it.

>301 redirect sends the bot to look for http://www.domain.comrobots.txt

I don’t understand why you can’t do the proper redirect for this. A robots.txt file can be redirected just like any other page. Something is not right in your htaccess. A redirect is not really needed, but who knows what else could be messed up, so I suggest you get someone who knows how to fix it.
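
A possible cause, as a sketch only and assuming the catch-all rule is the one quoted earlier in the thread: mod_alias's Redirect matches on a URL-path prefix and appends whatever follows the matched prefix to the target URL. With / as the prefix and no trailing slash on the target, a request for /robots.txt would be sent to http://www.newdomain.comrobots.txt. The newdomain.com host name is the placeholder from that earlier post, not the real site.

```apache
# Catch-all as quoted earlier: the target has no trailing slash, so the rest
# of the requested path ("robots.txt") is glued straight onto the host name.
#   /robots.txt -> http://www.newdomain.comrobots.txt
# Redirect 301 / http://www.newdomain.com

# Same rule with a trailing slash on the target.
#   /robots.txt -> http://www.newdomain.com/robots.txt
Redirect 301 / http://www.newdomain.com/
```

With the slash in place the old robots.txt would redirect to the new site's robots.txt, which may or may not be what is wanted, so this only explains the malformed URL rather than settling what the old file should return.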

>I agree that just redirecting one robots.txt to the other would have made sense, although the two sites are not exactly the same, so this would not be correct.

I don’t necessarily think it does make sense - I have never considered 301 for a robots file for sites that have moved. Just 404 the file and fix your htaccess.

Just so you know, when a page is redirected via a 301 Google no longer reads the code of the old page because it will never see the old page code again. What's found at the new location is what gets indexed and there is no requirement that it be the same exact content as a page(s) that redirects to it.

>They would also have the wrong sitemap entry.

This simply can’t be right. See above. Just create an XML sitemap for the new domain and enter its location in the new domain's robots.txt.

Hope this will get you heading in the right direction.

Portran
08-09-2008, 10:49 PM
Hi JohnW

Thanks for coming back yet again. I can confirm that according to WMT, there are no problems with 404s or 301s. I think my main problem is that I have failed to explain the problem.

I no longer have access to the server, only to a .htaccess file in the public folder. Agreed, there is a problem but I think this is server side. Past problems with the server/host played their part in deciding to change/move the site. Two people have separately checked the .htaccess file and this is working fine for every URL.

The new sitemap is in the new robots.txt file, both are being downloaded and show no errors, neither is there any problem with the new site being crawled, 95% of pages already indexed.

The problem is the old site: robots.txt is being requested regularly but nothing else. A site that was crawled at worst every couple of days hasn't seen a bot for 10/11 days. Every other possibility has been checked, and it does seem that the 301 on robots.txt is keeping Googlebot away. The server has a flaw I can't override whatever I try, so there is no way to get a / added before the robots.txt. That's why they are looking for domain.comrobots.txt, which can never be found.

I accept that changing the old robots.txt to a 404 would be ideal but the only access I have is ftp to .htaccess in the public folder. I equally appreciate what you said about bots not going near the old code, so the only route I can think of would be a line in the .htaccess making the robots.txt return a 404.

To be honest, I don't know how to do this. I have searched on the apache site, along with quite a few other places but not found the answer, most entries seem to relate to sending 404s to the custom error page, which isn't possible. Did occur to me to simply redirect the old robots.txt to a non existent page on the new site, which would presumably return a 404 but that didn't feel right. If you can help with how this could be achieved, that would be great.

JohnW
08-11-2008, 11:09 AM
>line in the .htaccess making the robots.txt return a 404.

That's one solution.

It sounds to me like you are at the point of hiring someone to deal with this for you. Having someone on a forum give you the snippet of htaccess code for this without seeing what else is in there (and looking at a few other things) may not be a good idea.

Portran
08-11-2008, 05:45 PM
Hi JohnW

No, I'm never on the point of hiring anyone. Can handle most things and have generally found that good forums are able to offer help, as I do when I can and you kindly chose to here.

In any event, the problem is solved. As you can imagine, I did a lot of searching on this but only came up with half a dozen cases of a similar problem, and no solution. I had also sent quite a few emails to people I hoped would help and got lucky, with an admittedly short reply from a conclusive source.

Whilst my situation was unusual, this has proved not unique, others will perhaps lose full server access at the wrong time. So I will lay out what I was told. Unusual responses from a robots.txt, such as a redirect that never resolves, or apparently a 500, may result in a search engine feeling they should not crawl a site. A 404 response, however achieved, is taken to indicate that no robots.txt applies.

All I did was insert:

Redirect 301 /robots.txt http://www.mydomain.com/nilpage.html

in the .htaccess file. May be better ways but I was lucky in terms of the cycle and within hours, the rogue robots.txt file was showing as 404 in Webmaster Tools. The reason I am now replying here, is because the site is finally being visited again.
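
One of those possibly better ways, as an untested sketch rather than what was actually done here: mod_alias accepts a status code outside the 3xx range if the target URL is left off, so the old robots.txt can be made to return a 404 directly instead of being bounced to a page that does not exist. This assumes the host allows Redirect in .htaccess (it evidently does here) and that the line sits above the catch-all rule, since mod_alias uses the first matching Redirect in file order.

```apache
# Answer 404 for the old robots.txt. No target URL is given, because the
# status is not a 3xx redirect code.
# Keep this above the catch-all "Redirect 301 /" line so it is matched first.
Redirect 404 /robots.txt

# ... existing page-by-page and catch-all Redirect lines follow here ...
```

Either way the crawler ends up with a 404 for the old robots.txt; this version just gets there in one request and does not depend on the new site lacking a particular page.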

Hope this helps someone else in the same situation and to JohnW plus others, thanks very much for all your help.