Poll: Should there be an update to the robots.txt rules?
No - it's working fine, and it's too difficult to fix: 5 votes (41.67%)
Yes - the robots.txt needs to reflect modern issues: 7 votes (58.33%)
Voters: 12.
#1
WebmasterWorld Off Of Google & Others Due To Banning Spiders
I have had a few emails about WebmasterWorld and Google.
OK: WebmasterWorld.com is not banned from Google. WebmasterWorld.com has not had all its PageRank taken away for being naughty. If you look at WebmasterWorld's robots.txt file, you will find that Brett has blocked the search engines in a bold move to stop bad bots. Yeah, I know: baby, bathwater, blah blah. I also think that a site removal request was submitted to Google and they promptly took action. Now, if you are thinking "cool, this should get rid of my supplemental pages, and if I ask for a reinclusion it should come back in 5 days"... WRONG: it will be out for 180 days, so don't do it.
DaveN

#2
Wow.
The biggest issue with that move is that the only way I can ever seem to find stuff I've read at WmW is via search engines. You know the drill: site: blah blah. So unless Brett has a plan to fix the internal site search, which hasn't ever found what I've been looking for, a lot of good info has now been buried...
<edit>Hey Dave - stop 'bad bots' with robots.txt? Isn't that a bit optimistic? Bad bots don't obey... that's what makes 'em bad.
Last edited by Chris_D : 11-23-2005 at 10:49 AM. Reason: bad bots..

#3
"WebmasterWorld Out Of Google & MSN" and "WebmasterWorld Bans Spiders From Crawling" on the SEW Blog have a lot of background on the ban WMW put into place and some of the fallout.

#4
Hmm, I can't believe how slow Yahoo are with this.
Added: found it: http://help.yahoo.com/help/us/ysearc...dexing-13.html
Now this is interesting: Yahoo have to cycle through everything. It seems that they don't have the ability to do what Google and MSN do, so when banning the spiders you need to give Yahoo a little time to delete your site completely. So if you had a site that someone posted a slanderous comment on, and Yahoo cached it, you would need some time to get rid of the evidence.
DaveN
Last edited by DaveN : 11-23-2005 at 11:14 AM.

#5
Blocking bad bots?
So the reasoning is to block bad bots? There are many better ways to do it. Robots.txt alone certainly won't help. Most bad bots don't even obey it.

#6

#7
Fair point: Matt's buttons must be bigger than Tim's.
Hehehe. DaveN
Tim, I'm only joking, mate.

#8
WMW doesn't need Google's traffic.
Must be nice...

#9
Quote:
Seems to me that Yahoo already has something called Yahoo Subscriptions that pretty much does just this. Google has patents and is surely developing technology to do the same thing.
http://blog.searchenginewatch.com/blog/050616-000001
Quote:

#10
Brett was talking about doing it about 3 years ago, but it wasn't because of bad spiders, which won't take any notice anyway. If I remember correctly, it was to do with getting too much unwanted traffic from the engines.
I've read your articles, Danny, and I disagree with you on one point. You said that Google could list the URL of a page because it knows about the page from links to it. They do that all the time, but it raises an issue. The robots.txt protocol is that it disallows files from being indexed. Technically, a URL only page isn't indexed - that is, its content isn't indexed - and so listing it in the serps, and providing a link for people to visit the page, isn't going against the robots.txt protocol. But that's only technically. The meaning of the protocol is "leave this page alone", and it means that because there are some very good reasons why people don't want other people to view certain pages - sites under development, for instance. So putting a link to it in the serps is against the robots.txt protocol, imo, and I disagree with you that Google could have listed WMW's homepage URL because of the DMOZ link to it. Search engines shouldn't show any URL only links if they haven't checked with the site's robots.txt file.

#11
Late last night I found this thread: http://www.webmasterworld.com/forum9/9593.htm
Strange that it's in Foo - I thought Community Centre was the place to discuss WmW itself?

#12
If you are only trying to get rid of "bad" bots, then this is a really dumb idea: as has been mentioned before, bad bots don't obey robots.txt directives.
But if you were optimistic and wanted to, this would work: Code:
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /

Ian
__________________
International SEO
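
One way to sanity-check rules like the ones Ian posted is to feed them to a robots.txt parser and ask which user agents may fetch a page. The sketch below uses Python's standard `urllib.robotparser`; the `allowed` helper, the `www.example.com` URL and the `SomeOtherBot` name are made up for illustration. It only shows what a well-behaved client would conclude. Nothing here forces a bot to run the check, which is the point several posters make about bad bots.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above: an empty Disallow means "nothing is
# disallowed" for that bot, while "Disallow: /" blocks everything for
# every user agent not named explicitly.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, url="http://www.example.com/forum/"):
    """Return True if a polite client with this user agent may fetch url.

    The URL is a placeholder; any path on the site gives the same answer
    with these rules.
    """
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


# "SomeOtherBot" is a hypothetical name standing in for any unlisted bot.
for bot in ("Googlebot", "MSNBot", "Slurp", "Teoma", "SomeOtherBot"):
    print(bot, allowed(bot))
```

The four named spiders come back as allowed and the unlisted one as blocked, but only because this client chose to ask.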

#13
Quote:
Problem with Googlebot and robots.txt? Google indexing links to blocked urls even though it's not following them
http://www.webmasterworld.com/forum3/11621.htm
On this thread, GoogleGuy responded with the following...
Quote:
Quote:
When Does Google Really Index a Page?
http://forums.searchenginewatch.com/...8876#post28876
Note that if you try any of the WMW links, you may have to paste them into your address bar after you log in.
Quote:

#14
Robert_Charlton,
I'm impressed that you could find such an old WmW post, without an SE to help. I'll see you - and raise you a 2002 GG WmW post....
Quote:

#15
Chris_D - Great link. I'm impressed too. Thanks. Fascinating to see the precursor of other such discussions, and to see where everybody was just a year before.
In light of the current discussion about WmW, by the way, I think the next two sentences from the GoogleGuy post you quoted are really apropos... Quote:

#16
So WMW is banned then!!
Because when I search for webmasterworld or webmasterworld.com, I don't get WebmasterWorld as a clickable link.
DaveN

#17
It seems that I'm being very disagreeable in this thread, but I disagree completely with GG about the New York Times example.
Technically he is correct, but we all know that the disallow and noindex directives mean "don't store this page in your database, and don't list it in the serps". That's what the robots.txt/meta tag protocol was always intended to mean. When the protocol was written, search engines didn't use off-page factors to rank pages, so it wasn't necessary to include the words "don't list it in the search results".
Since Google came along, all the engines have become links-based, and they are able to include URL only listings because of the link text data. But it doesn't mean they are right to do it. Imo, they are going against the protocol with those URL only listings, unless they've checked the page's meta tags and the site's robots.txt file.
Searchers undoubtedly want to see the NYT page listed when they search on "ny times", but the NYT site instructed the engines not to list it, and Google is wrong to ignore the protocol in favour of the desires of searchers.
Last edited by PhilC : 11-24-2005 at 10:57 AM.

#18
Actually a bit of common sense comes into play.
You all seem to forget the Internet is based on link popularity, as well as an algorithm that favors links and uses the links as a basis for voting pages to the top of the search results. Also, when the bot is following links it goes to the linked-to page (most often the index page) and then follows the page instructions. The bot then pulls the robots.txt file if there is one. Since it follows a link to the page, it will index and cache the page unless the page has instructions not to. And all of this is only if you believe robots follow robots.txt rules 100% of the time. I do not.
Clint
PS: And that's only if you are not using the Google sitemap implementation.

#19
You are mistaken, SEO1. Search engine spiders don't follow links to anywhere. They fetch files and put them in a store for another programme to look at later, and that's all they do. They don't actually follow anything. They get the next URL from the pile to be fetched, and they fetch and store the file - that's all they do.
Another programme comes along later and finds links on the page. It puts the links on the pile for the spider to fetch later. The engine has knowledge of the URLs, therefore it knows about the file, and it has some link text/alt text data that relates to the file. So it can include the URL in the serps without ever needing to visit the page or site - and that's what both Google and Yahoo! do - wrongly, imo.
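
The two-stage arrangement described in this post can be sketched in a few lines of Python. This is a toy model of that description only, not a claim about how any real engine is built: the `FAKE_WEB` pages, the `DISALLOWED` set (standing in for URLs blocked by robots.txt) and all function names are hypothetical, and the "web" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical pages; the keys and contents are invented for this sketch.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">B site</a>',
    "http://b.example/": '<a href="http://c.example/private">secret page</a>',
}
# Stand-in for URLs that robots.txt tells the spider not to fetch.
DISALLOWED = {"http://c.example/private"}


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from a stored page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch_all(frontier, store):
    """Spider: take the next URL off the pile, fetch it, store it."""
    while frontier:
        url = frontier.popleft()
        if url in DISALLOWED or url in store or url not in FAKE_WEB:
            continue  # never fetched, so no content is stored
        store[url] = FAKE_WEB[url]


def extract_links(store, known, frontier, processed):
    """Separate pass: read stored pages, keep link text, queue new URLs."""
    for url, html in list(store.items()):
        if url in processed:
            continue
        processed.add(url)
        collector = LinkCollector()
        collector.feed(html)
        for href, text in collector.links:
            if href not in known:
                frontier.append(href)
            known.setdefault(href, []).append(text)


def crawl(seed):
    frontier, store, known, processed = deque([seed]), {}, {seed: []}, set()
    while frontier:
        fetch_all(frontier, store)
        extract_links(store, known, frontier, processed)
    return store, known


store, known = crawl("http://a.example/")
print(sorted(store))   # pages whose content was fetched and stored
print(known)           # every URL the engine has heard of, with link text
```

In this model the blocked URL never reaches `store`, yet it still appears in `known` together with its link text, which is all an engine would need to show a URL only listing.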

#20
Phil,
Prove it! Show me a 3rd party source that verifies your misconstrued thoughts and I may pay attention. Otherwise the bot has to follow a link to find the page.
Clint