Search Engine Watch
Old 09-11-2006   #1
mgcre
Member
 
Join Date: Mar 2006
Posts: 14
Scrape prevention vs. SEO-friendly

I'm looking into adding some scrape-prevention tactics (e.g., randomly changing DIV tags and other HTML). However, I don't want to sacrifice any SEO measures I have in place, or even worse, get pages dropped or blacklisted.

Does anyone have any experience/war stories on this?
Old 09-11-2006   #2
evilgreenmonkey
 
evilgreenmonkey's Avatar
 
Join Date: Feb 2006
Location: London, UK
Posts: 703
Randomly changing CSS class names will not negatively affect SEO. A scraper could still simply count down the div tags, though, or convert the HTML file into XML and extract the attributes. You'd need to use absolute positioning in CSS and then start changing the order in which content appears in the HTML (which may slightly affect SEO). A determined page scraper will not be put off, though, and obfuscating or JavaScript-encoding your code is really not worth it.
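As a quick illustration of the "count down the div tags" point, here is a hypothetical sketch (the markup and function names are made up): even when class names change on every request, a scraper can extract content by position rather than by attribute.

```python
# Hypothetical sketch: two renderings of the same page with randomized
# class names. A scraper that selects by document position, not by
# class, is unaffected by the randomization.
import re

page_v1 = '<div class="x9f2"><div class="q71k">price: $10</div></div>'
page_v2 = '<div class="m3zz"><div class="aa01">price: $10</div></div>'

def innermost_div_text(html):
    """Return the text of the innermost <div>, ignoring class names."""
    texts = re.findall(r'<div[^>]*>([^<]+)</div>', html)
    return texts[0] if texts else None
```

Both pages yield the same extracted text, which is why randomizing attributes alone buys very little.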

Is your data really that valuable that such protection is needed?
Old 09-11-2006   #3
SanDiegoSEO
 
SanDiegoSEO's Avatar
 
Join Date: Oct 2004
Location: San Diego, CA
Posts: 174
There is no 100% way to beat scraping software while keeping a site SEO-friendly. If the spiders can read it, the scraping software will be able to as well.
Old 09-11-2006   #4
mgcre
Member
 
Join Date: Mar 2006
Posts: 14
Quote:
Originally Posted by evilgreenmonkey
Randomly changing CSS class names will not negatively affect SEO. A scraper could still simply count down the div tags, though, or convert the HTML file into XML and extract the attributes. You'd need to use absolute positioning in CSS and then start changing the order in which content appears in the HTML (which may slightly affect SEO). A determined page scraper will not be put off, though, and obfuscating or JavaScript-encoding your code is really not worth it.

Is your data really that valuable that such protection is needed?
Yes, it is that valuable. Granted, there is no surefire way to prevent scraping, but you can make it difficult. The challenge is to keep it acceptable for spiders, though.
Old 09-11-2006   #5
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
The issue with scrapers and bots isn't just stolen content (which really can lead to problems). It's also bandwidth and privacy. There are companies, not just scraper sites, mining the web for information.

There was an excellent session at SES San Jose called The Bot Obedience Course that discussed how to "teach good bots to behave better and send bad bots away for good." I got a feeling that the engines are seeing scrapers as enough of a problem that they might be talking among themselves about some sort of opt-in standards for bots.

Bill Atchison (aka incrediBILL on various webmaster and SEO forums) gave what I thought was an exciting presentation about "a firewall for webpages" he's developing called CrawlWall... which is opt-in on steroids. He's using a combination of methods to separate the good bots from the bad bots. His technology page is fascinating reading, very easy to understand, and is an education in itself. I'm hoping the link is OK....

http://www.crawlwall.com/technology.html

Beta version coming soon.
Old 09-12-2006   #6
mgcre
Member
 
Join Date: Mar 2006
Posts: 14
Quote:
Originally Posted by Robert_Charlton
Bill Atchison (aka incrediBILL on various webmaster and SEO forums) gave what I thought was an exciting presentation about "a firewall for webpages" he's developing called CrawlWall... which is opt-in on steroids. He's using a combination of methods to separate the good bots from the bad bots. His technology page is fascinating reading, very easy to understand, and is an education in itself.
That sounds almost too good to be true; I'm looking forward to the beta version. Thanks!

Hope he does another presentation at SES Chicago this year.
Old 09-12-2006   #7
leonus
Member
 
Join Date: Sep 2006
Posts: 9
When the whole world is providing RSS feeds for their content, content that needs to be protected that badly should be password-protected, which means you can let spiders in while keeping visitors out. Though I'd rather just provide RSS.
Old 09-12-2006   #8
pokersearch
Member
 
Join Date: Apr 2006
Posts: 23
Wow, that CrawlWall looks like the thing. Will he be doing it for IIS/.NET and other technologies?
Why didn't someone think of this before?
Old 09-13-2006   #9
evilgreenmonkey
 
evilgreenmonkey's Avatar
 
Join Date: Feb 2006
Location: London, UK
Posts: 703
Quote:
Originally Posted by Robert_Charlton
Bill Atchison (aka as incrediBILL on various webmaster and SEO forums) gave what I thought was an exciting presentation about "a firewall for webpages" he's developing called CrawlWall
I was thinking of creating something similar to this in PHP, although not as in-depth as Bill is suggesting CrawlWall will be. Better not start the project now; the last thing I want is to get sued.



Rob
Old 09-13-2006   #10
mgcre
Member
 
Join Date: Mar 2006
Posts: 14
Quote:
Originally Posted by pokersearch
Wow, that CrawlWall looks like the thing. Will he be doing it for IIS/.NET and other technologies?
Why didn't someone think of this before?
I second that!
Old 09-22-2006   #11
IncrediBILL
a legend in my own mind
 
Join Date: Jul 2005
Posts: 53
Glad to hear there's an audience waiting for this thing.

While I'm still working on it, here are some tips you can use today!

You can block many bots by simply changing your .htaccess files to OPT-IN instead of OPT-OUT, basically whitelisting instead of blacklisting. You let in Google, Yahoo, MSN, etc. and IE, Opera, Firefox, Netscape and bounce EVERYTHING else by default. The beauty here is you don't have to keep looking for bots anymore as anything that identifies itself as a bot will be bounced.

Changing to an OPT-IN whitelist alone sends a lot of nonsense away; just make sure to check your log files to see where all your traffic is coming from, so that all valid crawlers sending you traffic are whitelisted.
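A sketch of that log audit (hypothetical helper; this assumes the Apache "combined" log format, where the user agent is the last quoted field — adjust the parsing for other formats):

```python
# Tally user-agent strings in an access log so you know which crawlers
# actually send you traffic before switching to an OPT-IN whitelist.
from collections import Counter

def top_user_agents(log_lines, n=10):
    """Return the n most common user agents in combined-format log lines."""
    counts = Counter()
    for line in log_lines:
        # In combined format the line ends: ... "referer" "user agent"
        parts = line.rsplit('"', 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts.most_common(n)
```

Anything high on that list that isn't a browser or a search engine you care about is a candidate for bouncing.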

Then, to fill in the gaps where your webserver filters (.htaccess) can't help you, you can deploy something like AntiCrawl (it's free) to stop some stealth crawlers/spoofers in real time. AntiCrawl has a lot of limitations as well, but it's better than no protection at all.

Hope that helps for now.
Old 09-22-2006   #12
SanDiegoSEO
 
SanDiegoSEO's Avatar
 
Join Date: Oct 2004
Location: San Diego, CA
Posts: 174
Could you post an example of an .htaccess file that is OPT-IN instead of the other?
Old 09-22-2006   #13
seomike
Mod_Rewrite Guru
 
Join Date: Jun 2004
Location: Dallas, Texas but forever a Floridian!
Posts: 627
I have htaccess protection code on my site http://www.webforgers.net/code-libra...apers-bots.php

I can post it here as well.

.htaccess code
Code:
RewriteEngine on
# (testing purposes) RewriteCond %{HTTP_USER_AGENT} ^.*Mozilla.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pockey.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*NetMechanic.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*SuperBot.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^QRVA.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebMiner.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebCopier.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Web\ Downloader.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebMirror.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Offline.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebReaper.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Anarchie.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Mass\ Down.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*BlackWidow.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebStripper.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Wget.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*WebHook.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Scooter.* [OR,NC]
RewriteCond %{HTTP_USER_AGENT} ^.*Teleport.* [NC]
RewriteRule ^.*$ - [F,L]
All you really need to do is hunt down the user agents that show up in your logs after a scrape and add them to the list at the top.

For the OPT-IN approach, this is roughly what it would look like:

Code:
RewriteEngine on
# (testing purposes) RewriteCond %{HTTP_USER_AGENT} ^.*Mozilla.* [NC,OR]
# No [OR] flag here: the conditions are ANDed, so the request is
# forbidden only when the user agent matches NONE of the allowed patterns.
RewriteCond %{HTTP_USER_AGENT} !^.*mozilla.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^.*google.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^.*slurp.* [NC]
RewriteCond %{HTTP_USER_AGENT} !^.*msn.* [NC]
RewriteRule ^.*$ - [F,L]

Last edited by seomike : 09-22-2006 at 05:22 PM.
Old 09-22-2006   #14
IncrediBILL
a legend in my own mind
 
Join Date: Jul 2005
Posts: 53
Here's a simple SAMPLE of an OPT-IN .htaccess that gives anything not listed below a nice 403 Forbidden response.

You'll need to add everything else you want to let into your site, because this is very unforgiving; nothing slips past that isn't listed.

Code:
#allow just search engines we like, we're OPT-IN only

#a catch-all for Google
BrowserMatchNoCase Googlebot good_pass
BrowserMatchNoCase Mediapartners-Google good_pass

#a couple for Yahoo
BrowserMatchNoCase Slurp good_pass
BrowserMatchNoCase Yahoo-MMCrawler good_pass

#looks like all MSN starts with MSN or Sand
BrowserMatchNoCase ^msnbot good_pass
BrowserMatchNoCase SandCrawler good_pass

#don't forget ASK/Teoma
BrowserMatchNoCase Teoma good_pass
BrowserMatchNoCase Jeeves good_pass

#allow Firefox, MSIE, Opera etc., will punt Lynx, cell phones and PDAs, don't care
BrowserMatchNoCase ^Mozilla good_pass
BrowserMatchNoCase ^Opera good_pass

#Let just the good guys in, punt everyone else to the curb
#which includes blank user agents as well


<Limit GET POST PUT HEAD>
order deny,allow
deny from all
allow from env=good_pass
</Limit>
Old 09-25-2006   #15
mgcre
Member
 
Join Date: Mar 2006
Posts: 14
Thanks for the .htaccess advice. However, does anyone have suggestions for the IIS crowd that uses third-party ISAPI filters like ISAPI_Rewrite?
Old 09-29-2006   #16
mandarseo
http://www.e-zest.net
 
Join Date: Jul 2006
Location: Pune, India
Posts: 10
This is great learning. I had very little idea about .htaccess, but now I know its strength. I am thinking of starting a site related to MBA education. Using .htaccess I can now block the automated scrapers, but what about the humans who can pick up the content by personally visiting the pages? I don't want to make it a password-protected site, as I want to keep the content free for all.

Is there any method through which I will be able to keep content copiers away? I know it is a very ridiculous question, but I want to make a full effort from my side. Even if everything else is blocked (right click, Ctrl+C, no-select, etc.), there still remains the PrintScreen command: anybody can use it and then retrieve the content from the image simply by using any OCR software. The process is tedious, but still. Is there any way to keep content copiers away?

With regards,
Mandar Thosar
Old 09-29-2006   #17
JeremyL
Member
 
Join Date: Feb 2005
Location: Dallas, TX
Posts: 6
I also see some scrapers using these kinds of tricks to prevent services like Copyscape from finding them. Not sure how effective that is.
Old 10-01-2006   #18
Lucifer
The Prince of Darkness
 
Join Date: Feb 2006
Posts: 2
Spoofed user agent strings?

Quote:
Originally Posted by seomike
I have htaccess protection code on my site http://www.webforgers.net/code-libra...apers-bots.php

[.htaccess code snipped; see seomike's post above]

Correct me if I'm wrong, but can a crawler not spoof the user agent string, thereby bypassing .htaccess protection or CrawlWall?

-Lucifer
Old 10-01-2006   #19
evilgreenmonkey
 
evilgreenmonkey's Avatar
 
Join Date: Feb 2006
Location: London, UK
Posts: 703
Quote:
Originally Posted by Lucifer
Correct me if I'm wrong but can a crawler not spoof the user agent string, thereby bypassing htaccess protection or crawlwalls?
That's correct; a user agent can be set to anything (including Googlebot), and an IP's reverse DNS record can also be spoofed.

Google have come up with the following solution, which could potentially be used for authenticating other major bots as well:
http://googlewebmastercentral.blogsp...googlebot.html

Code to implement this in PHP and Perl:
http://tinyurl.com/g5uo4
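The same check could be sketched in Python as well (function names are my own, and a real deployment needs network access plus caching of results):

```python
# Forward-confirmed reverse DNS, as Google's post describes:
# 1) reverse-look-up the IP, 2) check the domain it resolves to,
# 3) forward-look-up that name and confirm it maps back to the IP.
import socket

def is_google_name(hostname):
    """True if a reverse-DNS name falls under googlebot.com or google.com."""
    return hostname.rstrip('.').endswith(('.googlebot.com', '.google.com'))

def verify_googlebot(ip):
    """True only if the IP passes the full forward-confirmed rDNS check."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not is_google_name(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

The forward-confirmation step is what defeats spoofed reverse DNS: an attacker can make their IP reverse-resolve to a googlebot.com name, but they can't make Google's forward DNS point that name back at their IP.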



Rob

Last edited by evilgreenmonkey : 10-01-2006 at 05:26 AM.
Old 10-01-2006   #20
IncrediBILL
a legend in my own mind
 
Join Date: Jul 2005
Posts: 53
Quote:
Correct me if I'm wrong, but can a crawler not spoof the user agent string, thereby bypassing .htaccess protection or CrawlWall?
That's why anti-crawl technology doesn't rely on the user agent.

The user agent is just an easy way to stop the garbage that still uses old-school tricks; blocking stealth crawlers claiming to be Internet Explorer is beyond the ability of an .htaccess file.

That doesn't mean they can't be detected and stopped; it just means it isn't possible with only the tools the webserver provides. It needs additional scripts to analyze and filter.
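A minimal sketch of the kind of script-level analysis that can catch a crawler spoofing a browser user agent (the class name and thresholds here are made up for illustration):

```python
# Flag any IP requesting pages faster than a human plausibly would,
# regardless of what its user agent claims to be. Thresholds are
# hypothetical; tune them against real traffic.
from collections import defaultdict, deque

class RateWatcher:
    def __init__(self, max_hits=30, window=10.0):
        self.max_hits = max_hits          # requests allowed ...
        self.window = window              # ... per this many seconds
        self.hits = defaultdict(deque)    # ip -> recent request times

    def is_suspect(self, ip, now):
        """Record a hit; report whether this IP exceeds the rate limit."""
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have aged out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits
```

Request rate is only one signal; a real filter would combine it with others (ignoring robots.txt, no image/CSS fetches, sequential URL patterns), since a patient scraper can simply slow down.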