The Daily Con Job: Rogue Spiders in the Wild

If you thought that only humans were capable of pulling a fast one on you, think again: quite a few crawlers out there are not at all what they make themselves out to be.

Unless your website has been hiding under some virtual rock for an extended period of time, chances are you’ve got your work cut out for you analyzing your daily traffic stats. That’s especially true when you’re concerned with SEO: you want to know if, when, and how the crawler-based search engines spider your pages, which search queries (a.k.a. keywords) are pulling visitors to your site, whether someone has linked to you out of the goodness of their heart, how frequently the engines are paying your pages a visit, and lots more.

In an ideal world, these crawlers would be quite open about what they’re doing: just like anyone with a modicum of good manners, they’d say something like “hello, I’m Googlebot from Google.com and I’ve come to check out what exciting new stuff you’ve put up recently.”

Regrettably, there’s more than a slight disconnect between this scenario and what’s going on in the real world: there’s a veritable onslaught of search engine spiders being all sneaky about what they actually are. And it would be utterly naive to assume that they aren’t doing it intentionally: since the engines are run by some of the smartest people in the world, you can safely bet the farm that they’re doing it on purpose.

So let’s look at some of the most common techniques you’ll encounter when exploring the Wild West that spiderland currently is.

Robotic Fakes: Search Engine Spiders Pretending to be Human Browsers

1. Bing

Here is one of the many run-of-the-mill Bing spiders with its typical User Agent:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)
msnbot-207-46-12-236.search.msn.com
207.46.12.236

In this example, the User Agent is: “msnbot/2.0b (+http://search.msn.com/msnbot.htm)”

Don’t rely on their sticking to this format religiously, though: recently, we’ve detected an increasing number of MSN bots featuring the following, slightly modified User Agent:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)._
msnbot-65-55-3-138.search.msn.com
65.55.3.138

Notice the trailing “._” characters? Conceivably, Bing is leveraging this rudimentary change to trick simplistic scripts of the “poor man’s cloaking” kind. Such scripts will conduct their cloaking activities based on a visitor’s “User Agent” signature. Nothing sophisticated about this: if a given website is on the lookout for an exact User Agent match, the spider in question won’t be recognized anymore.
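To make the point concrete, here is a minimal sketch of why an exact User Agent comparison breaks on the modified string, and how a looser token check would still catch it. The two strings are the ones quoted above; the function names and the regular expression are illustrative assumptions, not anything Bing or any particular cloaking script is known to use.

```python
import re

# The two User Agent strings quoted in the log samples above.
UA_PLAIN = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
UA_MODIFIED = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)._"


def exact_match(user_agent: str) -> bool:
    """Naive check: only the exact, previously known string counts as msnbot."""
    return user_agent == UA_PLAIN


# Hypothetical, more tolerant pattern: look for the "msnbot/<digit>" token
# anywhere in the string instead of comparing the whole string.
MSNBOT_TOKEN = re.compile(r"\bmsnbot/\d", re.IGNORECASE)


def token_match(user_agent: str) -> bool:
    """Looser check: accepts any User Agent that contains the msnbot token."""
    return MSNBOT_TOKEN.search(user_agent) is not None
```

With these definitions, exact_match() accepts the plain string but rejects the one with the trailing “._”, while token_match() accepts both. Neither check helps against a spider sending a browser User Agent, of course.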

Nor is this the limit of Bing’s shenanigans. There are plenty of MSN bots featuring an entirely unobtrusive Internet Explorer User Agent, which means, of course, that they cannot be identified as the crawlers they are via their User Agent alone: if that’s what you’re going by (e.g. when analyzing your traffic stats), there’s no way you can tell whether it’s actually a human visitor you’re dealing with or merely some bot.

Some examples:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)
msnbot-207-46-199-27.search.msn.com
207.46.199.27

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
msnbot-207-46-12-163.search.msn.com
207.46.12.163

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
msnbot-207-46-204-209.search.msn.com
207.46.204.209

Bing deploys these bots to spider both your standard HTML pages and your “.js” and “.css” files.
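One way to spot such disguised visits in your own stats is to compare the logged hostname with the User Agent. The sketch below assumes log entries have already been parsed into the three fields shown in the samples (User Agent, hostname, IP); the Visit type, the function name, and the hostname suffix taken from those samples are assumptions for illustration, not an official list.

```python
from typing import NamedTuple


class Visit(NamedTuple):
    """One parsed log entry in the three-line format used in this article."""
    user_agent: str
    hostname: str
    ip: str


# Hostname suffix seen in the log samples above. Treat it as an assumption
# drawn from those samples, not as an authoritative or complete list.
MSN_SUFFIX = ".search.msn.com"


def is_disguised_msnbot(visit: Visit) -> bool:
    """True if the hostname looks like an MSN crawler but the User Agent does not say so."""
    host = visit.hostname.lower().rstrip(".")
    claims_msnbot = "msnbot" in visit.user_agent.lower()
    return host.endswith(MSN_SUFFIX) and not claims_msnbot
```

Keep in mind that a logged hostname comes from a reverse DNS lookup, which the owner of an IP block controls, so this is a first filter for traffic stats rather than proof of identity.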

2. Google

There used to be a time when every man and his dog believed that things were perfectly straightforward with Google. After all, you know that Googlebot from Mountain View, California, inside out, right?

Wrong: there are plenty of Googlebots out there sporting a nondescript browser User Agent. More specifically, they’ll simply mimic a Firefox browser.

Here’s a real world example:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.9) Gecko/20100315 Firefox/3.5.9 GTB7.1 (.NET CLR 3.5.30729)
74.125.63.33

For comparison, here is the regular Googlebot User Agent, as logged from the very same IP:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
74.125.63.33

Here’s another Google spider, this time pretending to be Internet Explorer:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
74.125.74.195

Google uses this spider to crawl AdWords campaign Quality Pages.

Thus, if you’re cloaking your Quality Page by looking out for a regular Googlebot User Agent, more likely than not you’ll soon be found out and possibly penalized.

Human Visitors or Non-Search-Engine Spiders Pretending to Be Search Engine Crawlers

Frequently, you’ll hit on entries in your webserver’s log file that appear to be regular search engine spiders. Once you take a long, hard look at their respective IPs, however, it becomes obvious that things aren’t quite what they seem to be.

An example:

Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
208.76.240.226.reverse.crucialx.net
208.76.240.226

Without going off on a forensic tangent here, suffice it to say that you’ll need specialized services such as DomainTools and OS-based tools such as nslookup or dig to find out that “crucialx.net” in this example is either a fake entry altogether or shares its IP with surf-anon.com. Doesn’t look like Yahoo’s Slurp after all, does it?

Finally, here’s a visitor from an entirely ordinary IP pretending to be Googlebot:

Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)
S0106c80aa9952f80.cg.shawcable.net
68.147.56.189
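The kind of check you would otherwise run by hand with nslookup or dig can be sketched in a few lines: look up the hostname for the IP, make sure it belongs to a domain the search engine actually uses, then resolve that hostname again and confirm it points back to the same IP. The function names are hypothetical, and the list of acceptable hostname suffixes is an assumption the caller has to supply and keep current from each engine’s own documentation.

```python
import socket
from typing import Callable, List, Sequence


def reverse_lookup(ip: str) -> str:
    """Return the primary hostname that reverse DNS reports for an IP."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses that a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a visitor claiming to be a spider.

    allowed_suffixes is supplied by the caller, e.g. (".googlebot.com",);
    which suffixes are valid for which engine is an assumption to verify.
    """
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:  # no reverse DNS entry, or the lookup failed
        return False
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    try:
        # The hostname must resolve back to the IP we started from.
        return ip in forward(host)
    except OSError:
        return False
```

The two resolver arguments exist so the function can be tried out offline with stub lookups. Run against the second example above, the check would be expected to fail at the suffix test, since the hostname ends in shawcable.net rather than in anything belonging to Google.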

Non-search-engine spiders hiding behind search engine User Agents can crawl loads of webpages and generally remain undetected, eating up your bandwidth and server resources without giving anything back in return. Not good!

Perhaps you’re thinking, “I’m not cloaking my pages, so why should I be concerned?” Well, consider this: even if you’re not into black hat SEO, these spiders can muck up your traffic stats royally — something that actually affects every webmaster under the sun.

Having your cloaked pages detected will only be an issue if you’re operating with poor man’s cloaking, i.e. cloaking based on the User Agent. (Let’s not forget that your competitors can easily find you out in this manner and snitch on you to the search engines.) So for all serious, heavy-duty deployment, you should always go with IP delivery rather than User Agent-based cloaking. Like it or not, it’s the only reliable approach.
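The core of any IP-based approach is a lookup of the visitor’s address against a maintained list of spider networks, ignoring the User Agent entirely. Here is a minimal sketch of that lookup. The networks listed are placeholders from the ranges reserved for documentation, not real crawler ranges, and the names are hypothetical; a real list has to be compiled and kept up to date separately.

```python
import ipaddress

# Placeholder networks taken from the address ranges reserved for
# documentation. They are NOT real search engine ranges; a real deployment
# would load a maintained list instead.
KNOWN_SPIDER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_known_spider_ip(ip: str) -> bool:
    """True if the address falls inside one of the listed spider networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:  # not a valid IP address at all
        return False
    return any(addr in net for net in KNOWN_SPIDER_NETWORKS)
```

Because the decision never looks at the User Agent, a spider posing as Firefox or Internet Explorer is classified the same way as one announcing itself openly, which is the whole point. The approach is only as good as the list behind it.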

Bottom Line

Not everything is what it pretends to be on the Web, with even the big respected players going for stealth tactics on a grand scale. Your traffic stats will poorly reflect what’s really happening on your website unless you apply plenty of effort and expertise to analyzing them. Simplistic stats tools, whether free or commercial, won’t normally be of much help and may arguably even make things worse by indicating certainty where some serious doubting would be far more to the point.
