Search Engine Watch
SEO News


View Poll Results: Should there be an update to the robots.txt rules?
No - it's working fine, and it's too difficult to fix: 5 votes (41.67%)
Yes - the robots.txt needs to reflect modern issues: 7 votes (58.33%)
Voters: 12

Old 11-23-2005   #1
DaveN
 
Join Date: Jun 2004
Location: North Yorkshire
Posts: 434
WebmasterWorld Off Of Google & Others Due To Banning Spiders

I have had a few emails about WebmasterWorld and Google.

OK:

Webmasterworld.com is not banned from Google.
Webmasterworld.com has not had all its PageRank taken away for being naughty.

If you look at WebmasterWorld's robots.txt file, you will find that Brett has blocked the search engines in a bold move to stop bad bots. Yeah, I know: baby, bathwater, blah blah.

I also think that a site removal request was submitted to Google and they promptly took action. Now, if you are thinking "cool, this should get rid of my supplemental pages, and if I ask for reinclusion it should come back in 5 days"...

WRONG: it will be out for 180 days, so don't do it.


DaveN
Old 11-23-2005   #2
Chris_D
 
Join Date: Jun 2004
Location: Sydney Australia
Posts: 1,099
Wow.

The biggest issue with that move is that the only way I can ever seem to find stuff I've read at WmW is via search engines. You know the drill: site: blah blah.

So unless Brett has a plan to fix the internal site search - which hasn't ever found what I've been looking for - then a lot of good info has now been buried......

Edit: Hey Dave - stop 'bad bots' with robots.txt? Isn't that a bit optimistic? Bad bots don't obey... that's what makes 'em bad.

Last edited by Chris_D : 11-23-2005 at 09:49 AM. Reason: bad bots..
Old 11-23-2005   #3
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
 
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
The SEW Blog posts "WebmasterWorld Out Of Google & MSN" and "WebmasterWorld Bans Spiders From Crawling" have a lot of background on the ban WMW put into place and some of the fallout.
Old 11-23-2005   #4
DaveN
 
Join Date: Jun 2004
Location: North Yorkshire
Posts: 434
Hmm, I can't believe how slow Yahoo are with this.

Added:

Found it: http://help.yahoo.com/help/us/ysearc...dexing-13.html Now this is interesting: Yahoo have to cycle through everything. It seems that they don't have the ability to do what Google and MSN do.

So when banning the spiders, you need to give Yahoo a little time to delete your site completely. If you had a site that someone posted a slanderous comment on, and Yahoo cached it, you would need some time to get rid of the evidence.

DaveN

Last edited by DaveN : 11-23-2005 at 10:14 AM.
Old 11-23-2005   #5
volatilegx
Newbie
 
Join Date: Jun 2004
Posts: 3
Blocking bad bots?

So the reasoning is to block bad bots? There are many better ways to do it. Robots.txt alone certainly won't help. Most bad bots don't even obey it.
Old 11-23-2005   #6
rcjordan
There are a lot of truths out there. Just choose one that suits you. -Wes Allison
 
Join Date: Jun 2004
Posts: 279
> being naughty

well -technically- wmw was being a little naughty, dave
Old 11-23-2005   #7
DaveN
 
Join Date: Jun 2004
Location: North Yorkshire
Posts: 434
Fair point: Matt's buttons must be bigger than Tim's, hehehe.

DaveN

Tim, I'm only joking, mate.
Old 11-23-2005   #8
neatorama
Member
 
Join Date: Oct 2005
Posts: 40
WMW doesn't need Google's traffic. Must be nice...
Old 11-23-2005   #9
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Quote:
Originally Posted by Chris_D

Wow.

The biggest issue with that move is that the only way I can ever seem to find stuff I've read at WmW is via search engines. You know the drill: site: blah blah.

So unless Brett has a plan to fix the internal site search - which hasn't ever found what I've been looking for - then a lot of good info has now been buried......
Yeah, wow. I've been thinking about this a bit. How do you block all the bad bots and have the material accessible to the engines?

Seems to me that Yahoo already has something called Yahoo Subscriptions that pretty much does just this. Google has patents and is surely developing technology to do the same thing.

http://blog.searchenginewatch.com/blog/050616-000001

Quote:
Originally Posted by SearchEngineWatch Blog

Yahoo Search Subscriptions Brings Premium Content Into Web Search
Yahoo has released a new Yahoo Search Subscriptions (beta) service that unites regular web search results found from crawling the open web with listings from free and fee-based database services and publishers such as Factiva, LexisNexis, and Consumer Reports.

These databases have content typically "invisible" to web crawlers. The move should help many people who assume the open web has all the research material they need discover additional content they'd otherwise miss.

To view the full text of premium content, searchers will either have to have a subscription to the fee-based database providing it or take advantage of pay-per-article options, when offered.
What could Brett be thinking? Could this be a trial balloon?
Old 11-23-2005   #10
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
Brett was talking about doing it about 3 years ago, but it wasn't because of bad spiders, which won't take any notice anyway. If I remember correctly, it was to do with getting too much unwanted traffic from the engines.

I've read your articles, Danny, and I disagree with you on one point. You said that Google could list the URL of a page because it knows about the page from links to it. They do that all the time, but it raises an issue. The robots.txt protocol disallows files from being indexed. Technically, a URL-only page isn't indexed (that is, its content isn't indexed), so listing it in the serps, and providing a link for people to visit the page, isn't going against the robots.txt protocol. But that's only technically. The meaning of the protocol is "leave this page alone", and it means that because there are some very good reasons why people don't want other people to view certain pages: sites under development, for instance. So putting a link to it in the serps is against the robots.txt protocol, imo, and I disagree with you that Google could have listed WMW's homepage URL because of the DMOZ link to it.

Search engines shouldn't show any URL only links if they haven't checked with the site's robots.txt file.
Old 11-23-2005   #11
Chris_D
 
Join Date: Jun 2004
Location: Sydney Australia
Posts: 1,099
Late last night I found this thread http://www.webmasterworld.com/forum9/9593.htm

Strange that it's in Foo - I thought Community Centre was the place to discuss WmW itself?
Old 11-23-2005   #12
mcanerin
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
If you are only trying to get rid of "bad" bots, then this is a really dumb idea: as has been mentioned before, bad bots don't obey robots.txt directives.

But if you were optimistic and wanted to, this would work:

Code:
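# Let the four major crawlers fetch everything (an empty Disallow allows all).
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

# Every other compliant bot is shut out of the whole site.
User-agent: *
Disallow: /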
This gets rid of all bots except the big 4.
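
One way to sanity-check a file like the one above before deploying it is to feed it to a parser and ask who gets in. Below is a minimal sketch using Python's standard `urllib.robotparser`. The example.com URL and the "SomeScraper" agent name are made up for illustration, and the result only tells you what a bot that actually honours robots.txt would do.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, with a blank line between records.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str = "http://www.example.com/forum/thread.htm") -> bool:
    """Return True if a compliant bot with this user-agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "SomeScraper" is a hypothetical bot name that matches none of the named records.
    for agent in ("Googlebot", "MSNBot", "Slurp", "Teoma", "SomeScraper"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

A bot that ignores robots.txt never runs a check like this, which is the whole objection raised earlier in the thread.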

Ian
__________________
International SEO
Old 11-24-2005   #13
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Quote:
Originally Posted by PhilC
So putting a link to it in the serps is against the robots.txt protocol, imo, and I disagree with you that Google could have listed WMW's homepage URL because of the DMOZ link to it.
Phil - If WMW were using only robots.txt to block spiders, but links existed on spiderable pages elsewhere, then Google might in fact list the home page. I'd raised the same question on this 2003 thread on WMW

Problem with Googlebot and robots.txt?
Google indexing links to blocked urls even though it's not following them
http://www.webmasterworld.com/forum3/11621.htm

On this thread, GoogleGuy responded with the following...

Quote:
Originally Posted by GoogleGuy
If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.
Jim Morgan clarified with this...

Quote:
Originally Posted by jdMorgan
I went around and around with this, trying to find a way to tell them "don't mention my contact forms pages at all, please", and here's what I ended up with:

For Google, don't Disallow the page in robots.txt, but place a <meta name="robots" content="noindex"> tag in the head section of the page itself.
The topic again got considerable discussion on WMW when Google doubled the size of its index and started returning "references" to all sorts of pages that webmasters had thought were hidden. Here's a post on SEWF where I lay out the history of all these discussions...

When Does Google Really Index a Page?
http://forums.searchenginewatch.com/...8876#post28876

Note that if you try any of the WMW links, you may have to paste them into your address bar after you log in.
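
To make the noindex workaround quoted above concrete: a robots.txt Disallow stops a compliant spider from fetching the page at all, so it never gets to see a noindex tag. The tag can only work if the page may be crawled. Here is a rough Python sketch of the kind of check an indexer could run on a page it has fetched. The class and function names are made up for illustration, and this is not a description of how any particular engine does it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content may hold several comma-separated directives
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def has_noindex(html: str) -> bool:
    """Return True if the page asks not to be indexed via a robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(has_noindex(page))
```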


Quote:
Originally Posted by mcanerin
If you are trying to only get rid of "bad" bots, then this is a really dumb idea, as has been mentioned before, bad bots don't obey robots.txt directives.

But if you were optimistic...
Ian - Brett isn't being optimistic. The site now requires a log in for access. In fact, he's apparently just reset all the cookies. If you try the above WMW links, they may not work until you first log in again.
Old 11-24-2005   #14
Chris_D
 
Join Date: Jun 2004
Location: Sydney Australia
Posts: 1,099
Robert_Charlton,

I'm impressed that you could find such an old WmW post, without an SE to help.

I'll see you - and raise you a 2002 GG WmW post....

Quote:
First, I think there's a definite value to returning a link to a page even if we can't crawl that page. Quick example: the New York Times used to disable all bots from crawling them. That's fine, and we respected their robots.txt. But if a user comes to Google and types "ny times" into our search box, the best result to give them is nytimes.com.
http://www.webmasterworld.com/forum3/4008-5-10.htm Msg 43

Old 11-24-2005   #15
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Chris_D - Great link. I'm impressed too. Thanks. Fascinating to see the precursor of other such discussions, and to see where everybody was just a year before.

In light of the current discussion about WmW, by the way, I think the next two sentences from the GoogleGuy post you quoted are really apropos...

Quote:
Originally Posted by GoogleGuy
By returning the link to nytimes.com--even though we never actually crawled that page--we were giving a better search result to users. Luckily, most sites have realized that being visible in search engines is a good thing.
Old 11-24-2005   #16
DaveN
 
Join Date: Jun 2004
Location: North Yorkshire
Posts: 434
So WMW is banned then!!

Because when I search for webmasterworld or webmasterworld.com, I don't get webmasterworld as a clickable link.

DaveN
Old 11-24-2005   #17
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
It seems that I'm being very disagreeable in this thread, but I disagree completely with GG about the New York Times example.

Technically he is correct, but we all know that the disallow and noindex directives mean "don't store this page in your database, and don't list it in the serps". That's what the robots.txt/meta tag protocol was always intended to mean.

When the protocol was written, search engines didn't use off-page factors to rank pages, so it wasn't necessary to include the words "don't list it in the search results". Since Google came along, all the engines have become links-based, and they are able to include URL only listings because of the link text data. But it doesn't mean they are right to do it. Imo, they are going against the protocol with those URL only listings, unless they've checked the page's meta tags and the site's robots.txt file.

Searchers undoubtedly want to see the NYT page listed when they search on "ny times", but the NYT site instructed the engines not to list it, and Google is wrong to ignore the protocol in favour of the desires of searchers.

Last edited by PhilC : 11-24-2005 at 09:57 AM.
Old 11-24-2005   #18
SEO1
Member
 
Join Date: Jan 2005
Location: Philadelphia, PA. USA
Posts: 221
Actually a bit of common sense comes into play.

You all seem to forget the Internet is based on link popularity, as well as an algorithm that favors links and uses the links as a basis for voting pages to the top of their search results.

Also, when the bot is following links, it goes to the linked-to page (most often the index page) and then follows the page instructions. The bot then pulls the robots.txt file if there is one.

Since it follows a link to the page, it will index and cache the page, unless the page has instructions not to.

And all of this is only if you believe robots follow robots.txt rules 100% of the time.

I do not.

Clint

PS: and that's only if you are not using the Google sitemap implementation.
Old 11-24-2005   #19
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
You are mistaken, SEO1. Search engine spiders don't follow links to anywhere. They fetch files and put them in a store for another programme to look at later, and that's all they do. They don't actually follow anything. They get the next URL from the pile to be fetched, and they fetch and store the file - that's all they do.

Another programme comes along later and finds links on the page. It puts the links on the pile for the spider to fetch later. The engine has knowledge of the URLs, therefore it knows about the file, and it has some link text/alt text data that relates to the file. So it can include the URL in the serps without ever needing to visit the page or site - and that's what both Google and Yahoo! do - wrongly, imo.
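
The fetch-and-store model described in this post can be sketched in a few lines of Python. Everything here is a toy: the two example URLs, the in-memory `PAGES` "web" and the `DISALLOWED` set are assumptions made up for illustration, and a real engine would run the fetcher and the link extractor as separate programmes rather than in one loop. The point it shows is that a URL nobody ever fetched can still end up known to the engine, along with the link text pointing at it.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. Nothing here touches the network.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">Webmaster forum</a>',
    "http://b.example/": "<p>members only</p>",
}
# Hypothetical set of URLs that robots.txt tells the spider not to fetch.
DISALLOWED = {"http://b.example/"}


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from one stored page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed):
    frontier = deque([seed])  # the "pile" of URLs waiting to be fetched
    store = {}                # pages the spider actually fetched
    anchor_text = {}          # URL -> link texts seen on other pages
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        if url in DISALLOWED or url not in PAGES:
            continue  # never fetched, but it may still be known via anchor_text
        store[url] = PAGES[url]      # the spider fetches and stores the file
        extractor = LinkExtractor()  # a later pass finds the links in it
        extractor.feed(store[url])
        for href, text in extractor.links:
            anchor_text.setdefault(href, []).append(text)
            if href not in seen:
                seen.add(href)
                frontier.append(href)
    return store, anchor_text


if __name__ == "__main__":
    store, anchor_text = crawl("http://a.example/")
    print("fetched:", sorted(store))
    print("known only from links:", {u: t for u, t in anchor_text.items() if u not in store})
```

Whether the engine should then show such a URL in the serps is exactly the disagreement in this thread; the sketch only shows that it has the data to do so.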
Old 11-24-2005   #20
SEO1
Member
 
Join Date: Jan 2005
Location: Philadelphia, PA. USA
Posts: 221
phil

Prove it! Show me a 3rd party source that verifies your misconstrued thoughts and I may pay attention.

Otherwise the bot has to follow a link to find the page.

Clint