Search Engines are not Obeying Robots.txt?
azhariqbal5i
08-26-2006, 09:09 AM
Hello,
I have set up a robots.txt such as:
useragent: *
Disallow: /a
Disallow: /b/home.php
I really wanted all the big ones (Google, Yahoo and MSN) to stay away from my PHP version, but damn! They keep accessing my PHP pages.
The question is: I have read somewhere that to block dynamically generated pages we must use:
useragent: *
Disallow: /a
Disallow: /b/home.php/*?
Is (/*?) any better? Would it help me stop the search engines?
I am really confused. Please share your views and answers, I am desperately waiting.
Thanks
jimbeetle
08-26-2006, 01:19 PM
If your robots.txt file is exactly how you have it in your post then this is simply the result of a malformed record. It should read:
User-Agent: *
Disallow: /a/
Disallow: /b/home.php
Also, I would always hesitate about using wild-card characters such as "/*?" in any Disallow directive as Google is the only SE that recognizes them.
Look to the Robots Exclusion Protocol (http://www.robotstxt.org/wc/exclusion-admin.html) for the standard usage.
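To see why the missing hyphen matters, here is a minimal sketch using Python's standard `urllib.robotparser`. It is only one parser, so it shows how a standards-following reader is likely to treat the file, not how any particular search engine behaves. The host `example.com`, the agent name `ExampleBot`, and the helper `allowed` are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The file as originally posted: "useragent" is not a recognized field name,
# so this parser never starts a record and ignores the Disallow lines.
MALFORMED = """\
useragent: *
Disallow: /a
Disallow: /b/home.php
"""

# The corrected record from the reply above.
CORRECTED = """\
User-Agent: *
Disallow: /a/
Disallow: /b/home.php
"""

URL = "http://example.com/b/home.php"  # hypothetical URL
print("malformed file allows it:", allowed(MALFORMED, URL))
print("corrected file allows it:", allowed(CORRECTED, URL))
```

With this parser the malformed file blocks nothing, while the corrected one blocks /b/home.php. The trailing slash in "/a/" also limits that rule to the directory, so a page such as /about.html is no longer caught by it.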
azhariqbal5i
08-28-2006, 03:31 AM
I have read somewhere that the (/*?) wildcard characters are especially helpful for disallowing dynamically generated pages, and yes, my site has dynamically generated pages.
But the main PHP pages are home.php, products.php etc., which should be disallowed when a bot accesses the site.
What if I include (/*?) with my PHP pages? Like:
Useragent: Googelbot
Disallow: /b/home.php/*?
Disallow: /b/products.php/*?
etc ?
jimbeetle
08-28-2006, 11:15 AM
Well, first you have to fix this: "useragent" should be "User-Agent".
Then you have to decide if you want your robots.txt directives to only apply to Googlebot or to all bots. Keep in mind that the wildcard notation (*, $, ?) only applies to Googlebot. If you use it no other bots will obey it and the results might not be what you want.
As for the "/*?", Google says (http://www.google.com/support/webmasters/bin/answer.py?answer=35303):
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*?
I'm not sure exactly what that would do, but to me it looks dangerous. And again, as Googlebot would be the only one to obey it I really don't see the overall usefulness.
If you do decide to use it, instead of disallowing "/b/home.php/*?" I think maybe what the entry should be is "/b/home.php*?" (without the third slash, which would indicate another directory). But, as I am not sure, don't use that unless somebody else comes along to either confirm or correct it.
g1smd
08-28-2006, 07:31 PM
Fix "useragent" to be "User-Agent" instead. Add the hyphen.
The Disallow: /b/home.php
stops access to any URL that begins with exactly :
/ b / h o m e . p h p
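The prefix rule described above can be sketched in a few lines of Python. This is a simplified model of the original, wildcard-free matching, not any search engine's actual code; the function name `is_disallowed` and the sample paths are made up for the illustration.

```python
def is_disallowed(path, disallow_prefixes):
    """Plain prefix matching: a path is blocked if it starts with any rule.

    Empty rules are skipped, since an empty Disallow value blocks nothing.
    The comparison is case-sensitive.
    """
    return any(path.startswith(prefix) for prefix in disallow_prefixes if prefix)


RULES = ["/b/home.php"]

for path in [
    "/b/home.php",
    "/b/home.php?id=3",
    "/b/home.php/extra",
    "/b/products.php",
]:
    print(path, "->", "blocked" if is_disallowed(path, RULES) else "allowed")
```

Under this model the plain rule already covers every query-string variant of /b/home.php, so no wildcard is needed to block them.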
azhariqbal5i
08-29-2006, 02:34 AM
Thanks Jim,
you are 100% right. Thumbs up.
http://www.highrankings.com/forum/index.php?showtopic=24887&st=0&p=218390&#entry218390
jimbeetle
08-29-2006, 11:14 AM
Thanks for the link to the highrankings thread; it cleared up the Disallow: /*? for me. I was looking at it as if the ? was an operator or qualifier of some sort and not as a literal character. Things are just too simple sometimes.
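For anyone who wants to check the patterns from this thread, here is a rough Python sketch of the wildcard matching as it is understood above: "*" stands for any run of characters, a trailing "$" anchors the end of the URL, and "?" is just a literal question mark. It is an approximation for experimenting, not Googlebot's actual implementation, and the helper names `rule_to_regex` and `is_blocked` are made up.

```python
import re


def rule_to_regex(rule):
    """Translate a wildcard Disallow value into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else (including '?') is literal.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule, path_and_query):
    """True if the rule matches from the start of the path plus query string."""
    return rule_to_regex(rule).match(path_and_query) is not None


url = "/b/home.php?id=1"
for rule in ["/*?", "/b/home.php/*?", "/b/home.php*?"]:
    print(rule, "blocks", url, "->", is_blocked(rule, url))
```

In this model "/*?" blocks any URL containing a question mark, "/b/home.php*?" blocks home.php with a query string, and "/b/home.php/*?" does not, because it requires an extra slash after home.php.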