Old 11-08-2005   #1
rustybrick
 
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Link Spam Detection Based on Mass Estimation

Sir Gary Price has the SEW blog coverage of a new "technical research paper from the Stanford InfoLab that takes a look at link spam."

http://blog.searchenginewatch.com/blog/051108-144815

Quote:
Abstract: Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.
http://dbpubs.stanford.edu:8090/pub/...me=2005-33.pdf
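
For anyone who doesn't want to wade through the whole PDF: here's my back-of-the-napkin reading of the spam mass idea as a Python sketch. Compare a page's ordinary PageRank with the PageRank it earns when the random jump is restricted to a hand-picked core of reputable hosts; whatever isn't explained by the core is "spam mass." The toy graph, the two-host good core, and the 0.5 threshold are all mine for illustration; the paper itself works at the host level on the Yahoo! web graph.

Code:
import networkx as nx

G = nx.DiGraph()
# a tiny "good core" plus a target page it endorses
G.add_edges_from([("dmoz", "stanford"), ("stanford", "dmoz"),
                  ("stanford", "target")])
# a spam farm pointing at the same target
G.add_edges_from([(f"spam{i}", "target") for i in range(50)])

core = {"dmoz", "stanford"}

# ordinary PageRank: the random jump lands on any node uniformly
p = nx.pagerank(G, alpha=0.85)
# core-biased PageRank: the random jump lands only on reputable nodes
p_core = nx.pagerank(G, alpha=0.85,
                     personalization={n: 1.0 if n in core else 0.0 for n in G})

for node in ("stanford", "target"):
    spam_mass = p[node] - p_core[node]   # PageRank not explained by the core
    relative = spam_mass / p[node]       # relative spam mass
    verdict = "suspicious" if relative > 0.5 else "looks fine"
    print(f"{node}: relative spam mass {relative:.2f} -> {verdict}")

A page that collects most of its PageRank from outside the reputable core ends up with a high relative spam mass, which is what flags it.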


mmmmm...spam detection...mmmmm
Old 11-09-2005   #2
orion
 
Join Date: Jun 2004
Posts: 1,044

The paper reaffirms the notion that PageRank follows a power-law distribution, which, when replicated across scales, is fractal in nature:

"A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, 17 search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Bencz´ur et al. [Bencz´ur et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with “unnatural” link patterns. However, they fail to recognize more sophisticated forms of spam,
when spammers mimic reputable web content. "
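
To make the Fetterly et al. observation concrete, here is a rough sketch (my own construction, not their code): count how many pages share each exact indegree, fit a power law to the well-populated degrees, and flag degree values with far more pages than the fit predicts. The synthetic data and the 10x threshold are invented.

Code:
import math
import random
from collections import Counter

# Synthetic "web": 10,000 pages with heavy-tailed indegrees, plus a
# link farm that gives 500 pages the exact same indegree of 37.
random.seed(0)
indegrees = [int(random.paretovariate(1.1)) for _ in range(10000)]
indegrees += [37] * 500

counts = Counter(indegrees)

# Crude least-squares fit of log(count) = logC + k*log(degree), k < 0,
# using only well-populated degrees so the sparse tail doesn't skew it.
pts = [(math.log(d), math.log(c)) for d, c in counts.items() if c >= 5]
n = len(pts)
sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
k = (n * sxy - sx * sy) / (n * sxx - sx * sx)
logC = (sy - k * sx) / n

for d, c in sorted(counts.items()):
    predicted = math.exp(logC + k * math.log(d))
    if c >= 50 and c > 10 * predicted:   # ignore the sparse tail
        print(f"indegree {d}: {c} pages, only ~{predicted:.0f} expected")

The link farm shows up as a huge spike at one exact indegree, which is precisely the kind of outlier the quoted methods catch.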


A possible solution to the failure mentioned above is to add a time component to their model. In most cases, temporal detection should help in the discrimination process.
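
To make the time component concrete, here is a toy formulation (strictly mine, not anything from the paper): keep periodic inlink snapshots per host and flag hosts whose latest link growth dwarfs their own history. The host names, counts, and the z > 3 cutoff are invented.

Code:
from statistics import mean, stdev

# hypothetical monthly inlink counts per host, oldest first
history = {
    "steady-site.com":  [10, 14, 18, 25, 31, 38],
    "spammed-site.com": [5, 6, 6, 7, 8, 900],
}

for host, counts in history.items():
    growth = [b - a for a, b in zip(counts, counts[1:])]
    baseline, latest = growth[:-1], growth[-1]
    mu, sigma = mean(baseline), stdev(baseline)
    z = (latest - mu) / sigma if sigma else float("inf")
    if z > 3:  # arbitrary cutoff: growth way outside the host's own history
        print(f"{host}: {latest} new links this month (z = {z:.1f}), flag it")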

I met Dr. Berkhin at SES San Jose and spent some quality time with him. Pavel is such a great man. I'm happy to see Yahoo is well aware of the link spamming techniques around the web, and of who is using them.


Orion

Old 11-09-2005   #3
Marcia
 
Join Date: Jun 2004
Location: Los Angeles, CA
Posts: 5,476
Quote:
A possible solution to the failure mentioned above is to add a time component to their model.
On a practical level, couldn't that help alleviate the problem caused to some sites by a huge surge of inbound links from scraper sites before they have enough normal, quality links? There's no controlling it: scraper pages are cranked out by the dozens before there's really a chance to get regular links, so the percentage is hugely weighted toward those garbage links.
Old 11-09-2005   #4
NuevoJefe
Member
 
Join Date: Feb 2005
Location: San Diego, CA
Posts: 18
Quote:
...whether the distribution of PageRank scores of pages pointing to x conforms a power law...
When they say "distribution of PageRank scores," do you know off-hand what they assume is normal, if they're not already evaluating this in an age-based manner?

If possible, can you give a brief example of what a flagged deviation might look like?

Thanks!
Old 11-10-2005   #5
physics
Newbie
 
Join Date: Feb 2005
Posts: 2
Just adding a time component in a simplistic way could lead to a sandbox effect, and also, as mentioned before, to penalizing sites linked to by scrapers.
But adding a time component, studying the data, and looking for 'outliers' while identifying sites that are the targets of scrapers (and _not_ penalizing them) could definitely help.
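
Roughly what I have in mind, with a completely made-up scraper heuristic (the duplicate_ratio field and both thresholds are hypothetical): once a surge is detected, look at who the new links come from before deciding anything.

Code:
def looks_like_scraper(source):
    # e.g. the linking page's body is a near-duplicate of content found
    # elsewhere on the web (a typical scraper fingerprint)
    return source["duplicate_ratio"] > 0.9

def classify_surge(target, new_sources):
    scraper_share = sum(looks_like_scraper(s) for s in new_sources) / len(new_sources)
    if scraper_share > 0.8:
        return f"{target}: scraper surge -> no penalty, just discount the links"
    return f"{target}: suspicious surge -> review as possible link spam"

# a site hammered by scrapers: 40 scraper links, 2 organic ones
sources = [{"duplicate_ratio": 0.95}] * 40 + [{"duplicate_ratio": 0.10}] * 2
print(classify_surge("victim-site.com", sources))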