#1
Link Spam Detection Based on Mass Estimation
Sir Gary Price has SEW blog coverage of a new "technical research paper from the Stanford InfoLab that takes a look at link spam."
http://blog.searchenginewatch.com/blog/051108-144815
Quote:
mmmmm...spam detection...mmmmm
#2
The paper reaffirms the notion that PageRank follows a power-law distribution, which, when replicated across scales, is fractal in nature:
"A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with “unnatural” link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content."
One possible fix for the failure mentioned at the end of that passage is to add a time component to their model. In most cases, temporal detection should help discriminate spam from reputable content.
I met Dr. Berkhin at SES San Jose and spent some quality time with him. Pavel is such a great man. Happy to see Yahoo is well aware of the link spamming techniques around the web and of who is using them.
Orion
Last edited by orion : 11-09-2005 at 12:55 PM.
#3
Quote:
#4
Quote:
If possible, can you give a brief example of what a flagged deviation might look like? Thanks!
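For illustration only, here is a rough sketch of the kind of deviation the Fetterly et al. passage quoted earlier describes: far more pages sharing one exact indegree than a power law would predict. Everything in it is an assumption made for the example, not the method from the paper: the data is synthetic, the names `fit_power_law` and `flag_outlier_degrees` are made up, the fit is a plain least-squares line in log-log space, and the `ratio=5.0` cutoff is arbitrary.

```python
import math
from collections import Counter


def fit_power_law(hist):
    """Least-squares fit of log(count) = a + b * log(degree).

    A simple stand-in for a proper power-law fit (assumption for this sketch).
    """
    xs = [math.log(d) for d in hist]
    ys = [math.log(c) for c in hist.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def flag_outlier_degrees(indegrees, ratio=5.0):
    """Return {degree: (observed, predicted)} for degrees whose page count
    exceeds `ratio` times what the fitted power law predicts."""
    hist = Counter(d for d in indegrees if d > 0)
    a, b = fit_power_law(hist)
    flagged = {}
    for degree, observed in hist.items():
        predicted = math.exp(a + b * math.log(degree))
        if observed > ratio * predicted:
            flagged[degree] = (observed, predicted)
    return flagged


# Synthetic web: the number of pages with indegree d falls off roughly as d^-2.
indegrees = []
for d in range(1, 51):
    indegrees.extend([d] * round(10000 / d ** 2))

# Hypothetical link farm: 300 generated pages that all have exactly 37 inlinks.
indegrees.extend([37] * 300)

flagged = flag_outlier_degrees(indegrees)
for degree, (observed, predicted) in flagged.items():
    print(f"indegree {degree}: {observed} pages observed, about {predicted:.0f} predicted")
```

The spike at one exact indegree is the flagged deviation. As the quoted passage notes, a spammer who mimics natural link patterns would not stand out this way.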
#5
Adding a time component in a simplistic way could produce a sandbox effect and, as mentioned before, a penalty for sites that scrapers link to.
But adding a time component, studying the data, and looking for 'outliers' while identifying sites that are the target of scrapers (and _not_ putting a penalty on them) could definitely help.