
Nocrawl instead Nofollow Pros and Cons


Webnauts
03-09-2009, 02:13 PM
I would like to ask your opinion about a possible alternative to the "nofollow" attribute, which I will call "bots=nocrawl" here.

I have for example a page linking to a page called example.html

The URL looks like this:
http:// www. whateveryouwantocallthat. com/example.html?bots=nocrawl

In the robots.txt I add this:

User-agent: Googlebot
Disallow: *bots=nocrawl
Noindex: *bots=nocrawl
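
To make the wildcard rule above concrete, here is a minimal Python sketch of how a Googlebot-style pattern such as `*bots=nocrawl` could be matched against a URL path and query string. The function names are hypothetical, and the matching rules ('*' matches any run of characters, a trailing '$' anchors the end, otherwise the pattern is a prefix match) are an assumption based on Google's documented wildcard extensions, not a copy of any crawler's real code.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a Googlebot-style robots.txt path pattern into a regex.

    Assumed rules: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, otherwise the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path_and_query, pattern):
    """Return True if the path (plus query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


if __name__ == "__main__":
    for url in ("/example.html?bots=nocrawl", "/example.html"):
        print(url, is_blocked(url, "*bots=nocrawl"))
```

Under these assumptions, only URLs that carry the `bots=nocrawl` parameter are matched; the plain `/example.html` stays crawlable.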

In addition, I add X-Robots-Tag directives in the .htaccess file to prevent robots.txt itself from being indexed, followed, etc.:

<FilesMatch "\.(txt)$">
Header set X-Robots-Tag "noindex,nofollow,noarchive,nosnippet"
</FilesMatch>
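
One thing worth checking in the block above is which files the `\.(txt)$` expression actually covers. The short Python sketch below applies the same regular expression to a few hypothetical file names; it only approximates how Apache evaluates `FilesMatch`, and the file names are made up for illustration.

```python
import re

# Same expression as in the <FilesMatch> block above.
FILESMATCH_PATTERN = re.compile(r"\.(txt)$")


def gets_x_robots_header(filename):
    """Return True if the FilesMatch expression would apply to this file name."""
    return FILESMATCH_PATTERN.search(filename) is not None


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ("robots.txt", "notes.txt", "robots.txt.bak", "index.html"):
        print(name, gets_x_robots_header(name))
```

The expression matches every file ending in `.txt`, not only robots.txt, so any other text files on the site would receive the same header.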

What difference do you see between using the "nofollow" attribute and using "bots=nocrawl" set up this way?

What are the possible pros and cons of using "bots=nocrawl" instead of the "nofollow" attribute?

---
P.S. To go a step further, I was wondering what would happen if I used "bots=nocrawl" in destination URLs and added the new "canonical element" to the targeted web pages (where applicable, i.e. duplicated pages or pages with similar content).
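
As a rough illustration of the canonical idea in the P.S., the following Python sketch strips the `bots` parameter from a destination URL and builds the matching canonical element. The function names and the example.com URL are assumptions for illustration. Note also that a crawler that obeys the Disallow rule never fetches the `bots=nocrawl` URL, so it may never see the canonical element served there.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, marker="bots"):
    """Return the URL with the marker query parameter (and any fragment) removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != marker
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url, marker="bots"):
    """Build the <link rel="canonical"> element for the cleaned URL."""
    href = escape(canonical_url(url, marker), quote=True)
    return '<link rel="canonical" href="%s" />' % href


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(canonical_link_tag("http://www.example.com/example.html?bots=nocrawl"))
```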

AussieWebmaster
03-09-2009, 03:54 PM
You may want to look at this from Matt Cutts http://www.mattcutts.com/blog/googlebot-keep-out/

and his caveat there:
Obscure note #1: using the ‘googlebot=nocrawl’ technique would not be the preferred method in my mind. Why? Because it might still show ‘googlebot=nocrawl’ urls as uncrawled urls. You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point.

Webnauts
03-12-2009, 06:35 AM
Thanks for the quick reply. I was just looking around and I read this:

Cutts stated explicitly that Google does not crawl nofollow links in July 2006, in his Bot Obedience: Herding Googlebot post: "At a link level, you can add a nofollow tag on the granularity of individual links to prevent Googlebot from crawling individual links (you could also make the link redirect through a page that is forbidden by robots.txt).
But I do that already; instead of using the "nofollow" attribute, I use the "keep out Googlebot" method. Where do you see a difference?

Bear in mind that if other pages link to a url, Googlebot may find the url through those other paths."
Because I bear that in mind, I implement the "noindex" so that will not happen.

Lasnik stepped in again last night to further clarify the issue in another post, If rel="nofollow" is becoming the norm. He notes that "nofollow links aren't listed any differently than other links in our Webmaster Tools backlinks section," and said that nofollow links will show up in search results using the "link:" operator.
Exactly! The "nofollow" links aren't listed any differently than other links in Webmaster Tools. But with "bots=nocrawl", the pages will never show up, provided "bots=nocrawl" is set up before the new pages are added. If Googlebot has already picked up the pages before you implement the method, you have to request their removal in Webmaster Tools; once they are removed, they will never show up again.

Now, taking this to another level: because, as I said, the targeted pages carry the meta robots directives "noindex,nofollow,noarchive,nosnippet" (or the same is achieved with X-Robots-Tag directives), you might say that they are dangling pages. But the same happens if you use the "nofollow" attribute.

So am I missing something again?

---
Quotes of Matt Cutts and Adam Lasnik are found here: http://blog.searchenginewatch.com/070215-123945

rainborick
03-12-2009, 10:47 AM
It depends on what your goal is. If your goal is just to keep a document out of the search engines' index, then a robots <meta> tag set to noindex, a Disallow instruction in the robots.txt file, or an X-Robots-Tag directive in your .htaccess file is sufficient. The search engines will ignore those instructions only in extraordinary circumstances, as Aussie noted. Which one is better probably depends on whether or not you need the pages to be crawled but not indexed. Beyond that, the distinctions seem pretty much without a difference. It's been such a long time since I've seen a URL-only entry appear in the search results that I think it's pretty innocuous, unless you don't even want the URL to appear in the results of the site: operator.

I wouldn't be concerned that the URL for these pages is in the link database which is only used for calculating PageRank etc., and for feeding the crawl queue. Once you've blocked them from being in the index, using appropriate <meta> tags on the blocked documents and/or including a nofollow on any links that point to them will allow you to control the flow of PageRank to and from these documents as needed.

tpraja
12-16-2009, 01:08 PM
In my opinion, nofollow is better than nocrawl.

Palanivel Raja