Old 05-31-2005   #1
orion
 
 
Join Date: Jun 2004
Posts: 1,044
Identifying Link Farm Spam Pages

This thread is inspired by a similar one by Rusty: New Technical Report from Stanford Discusses Link Spam

At the recent WWW 2005 conference, my good colleague Brian Davison presented the paper Identifying Link Farm Spam Pages.

Brian writes:

"In this paper, we
present algorithms for detecting these link farms automatically
by first generating a seed set based on the common link
set between incoming and outgoing links of Web pages and
then expanding it. Links between identified pages are reweighted,
providing a modified web graph to use in ranking
page importance. Experimental results show that we can
identify most link farm spam pages and the final ranking
results are improved for almost all tested queries."

Among other things, an analytical description of the so-called BadRank algorithm is presented. This paragraph is enlightening (emphasis added):

"In many SEO discussion boards, participants discuss the
latest ranking and spam-finding techniques employed by
commercial search engines. One approach, called BadRank, is believed by some to be used by a commercial engine to combat link farms. The idea is similar in spirit to
our mechanism. BadRank is based on propagating negative
value among pages. The philosophy of BadRank is that a
page will get high BadRank value if it points to some pages
with high BadRank value. So it is an inversion of the PageRank
algorithm which believes that good pages will transfer
their PageRank value to its outgoing links. The formula of
BadRank is given as...." [see equation(s) in the pdf document]

"where BR(A) is the BadRank of Page A. BR(Ti) is the
BadRank of page Ti, which is the outbound link of page A.
C(Ti) is the number of inbound links of page Ti and d is
the damping factor. E(A) is the initial BadRank value for
page A and can be assigned by some spam filters. Since no
algorithms of how to calculate E(A) and how to combine
BadRank value with other ranking methods such as PageRank
are given in [1], we cannot tell the effectiveness of this
approach."
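
For readers who don't want to dig into the PDF right away: written out from the variable definitions in the quote above, the formula as it is commonly reproduced (do check the paper for the exact form) is

BR(A) = E(A) * (1 - d) + d * ( BR(T_1)/C(T_1) + ... + BR(T_n)/C(T_n) )

where T_1 ... T_n are the pages that A links out to. It mirrors PageRank with the roles of in-links and out-links swapped.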

Brian then explains the following:

the TKC Effect
Link farm effect on HITS
Comparison of ParentPenalty and BadRank

and presents the ParentPenalty algorithm:

"In this paper, we present ideas of generating a seed set
of spam pages and then expanding it to identify link farms.
First, we will use a simple but effective method based on the
common link sets within the incoming and outgoing links
of Web pages for selecting the seed set. Then an expansion
step, ParentPenalty, can expand the seed set to include more
pages within certain link farms. This spamming page set can
be used together with ranking algorithms such as HITS or
PageRank to generate new ranking lists. The experiments
we have done show that this combination is quite resistant
to link farms and the TKC effect."
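
For those who think better in code, here is a toy sketch (Python) of the seed-selection step as I read it. The function name, data structures and threshold below are mine, not the paper's, so treat it purely as illustration:

# Toy sketch of the seed-selection idea quoted above: a page whose incoming
# and outgoing link sets overlap heavily looks like it is trading reciprocal
# links, the signature of a link farm. The threshold here is arbitrary.
def select_seed_pages(in_links, out_links, threshold=3):
    """in_links / out_links map each page to the set of pages that
    link to it / that it links to."""
    seeds = set()
    for page in in_links:
        common = in_links.get(page, set()) & out_links.get(page, set())
        if len(common) >= threshold:
            seeds.add(page)
    return seeds

The seed set produced this way is then expanded (that is the ParentPenalty step) before the link graph is reweighted and handed to HITS or PageRank.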

Excellent research work, Brian. I'm happy that finally someone dissected BadRank and suggested some alternatives.


I invite members of SEWF to read this work. Its Conclusion and Discussion section is worth some thought.


Cheers


Orion

Old 05-31-2005   #2
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Orion - Thanks for pointing to this. I actually like it a bit better than the paper by the Stanford folks, but I do have some questions:

1. At what level of sophistication does the BadRank system fail? It seems that most link farms I see succeeding are using a large variety of inbound and outbound links in order to effectively "hide" themselves from being simply identified.

2. Couldn't this easily incorrectly tag sites in a close-knit community as being spam? It seems that reliance on a level of interlinking will kill a lot of legitimate sites - just think of the SEO community websites (we all link to each other like crazy!)?

I'm glad to see progress being made in these arenas and a real focus on it. I don't think you can solve spam any other way than technologically in the SE algos, and, being a white-hat, it can only help me.
Old 06-01-2005   #3
I, Brian
Whitehat on...Whitehat off...Whitehat on...Whitehat off...
 
Join Date: Jun 2004
Location: Scotland
Posts: 940
I can't help but wonder if any IR system that devotes itself to penalising the use of information isn't somehow getting its priorities mixed up. Shouldn't IR be about getting the best possible information for users, through identification of the best sources of information, rather than identification of the worst?
Old 06-01-2005   #4
Marketing Guy
Can't think of a witty title.
 
Join Date: Apr 2005
Posts: 193
Good point, I, Brian, but it could also be said that web design should be about presenting the best possible information for users as well! Some folks work this way and others don't, so SE's have to counteract.

The proliferation of poor-quality information within SERPs lowers the quality of the overall "SERP" product for the user - which is the bottom line that SE's are responsible for. So it can be argued that removing "bad" (and I use this term loosely, as it's obviously up for debate) information from SERPs would result in a better overall product on average.

Like tending a garden, it could be argued that if you ignored the weeds and spent time improving your flower bed, then it's better for visitors to the garden. But there would come a point when the weeds would grow out of control and negatively impact the overall quality of the garden.

Taking this analogy further, some people could argue that the odd weed is a natural occurrence and should be left to grow, etc. - add any number of varying arguments on "garden maintenance".

It depends where you are coming from really - are you bothered about who visits your garden or do you want to have the biggest, most popular garden there is?

MG
Old 06-02-2005   #5
xan
Member
 
Join Date: Feb 2005
Posts: 238
Quote:
Originally Posted by I, Brian
I can't help but wonder if any IR system that devotes itself to penalising the use of information isn't somehow getting its priorities mixed up. Shouldn't IR be about getting the best possible information for users, through identification of the best sources of information, rather than identification of the worst?
I'm agreeing with you there, Brian. It's a self-defeating approach. IR is about extracting relevant information from an index, a corpus, a library... whatever.
Old 06-02-2005   #6
massa
Member
 
Join Date: Jun 2004
Location: home
Posts: 160
>any IR system that devotes itself to penalising the use of information isn't somehow getting its priorities mixed up.<


It's kind of sad, but that is the way it's been from the beginning with spidering engines. Logic would dictate that the purpose of an algorithm is to put the "good" stuff at the top, BUT, a spider does not know "good". Instead, it relies on filters: rather than putting the good stuff at the top, it tries to put the bad stuff at the bottom.

A spider can't apply any emotion. Good/bad and/or relevancy is completely subjective and as such relies on human emotion. Good to me may be bad to you. You and I can get mad, sad, happy, jubilant, excited, depressed. We see red and feel one way. Blue, another.

A spider sees numbers. It doesn't get mad. It doesn't feel sorry for someone. It doesn't have a need for love and it sees red as FF0000.

If you need information about cancer and find (somehow) a page that has pleasing colors, easy, clear navigation and answers the question you asked, you would likely say, "this page is good".

A spider sees the same page and can only build a numerical system of points added or subtracted based on digital information. No emotion. Without emotion, there is no good or bad. Only numbers.

Again, you would think that, even with limited to no emotion, an algo would still attempt to define good and assign added points, thereby placing "good" at the top. That is not how it works. Take just the basics that most of us reading here know. Are we familiar with bonus points, awards and accolades, or are we more familiar with penalties and bans? Many of us chase links while thinking of that chase in terms of making the engines think our site is more relevant than it really is. That is a very common perception but completely bass-ackward. In the first place, we can't "make" any engine think anything. In the second place, engines don't think. And finally, no page (or anything else for that matter) can possibly be more or less relevant than it really is.

My point is that we tend to see getting links or hitting the KW density just right as getting plus points or awards from search engines for our page or site. In other words, we're trying to build sites that make the engines think the site is better than another site. That is assuming they are trying to put the "good" stuff at the top.

The reality is, there is such a thing as trusted sites and getting the right link from the right site for the right keyword/phrase to the right page can help move your page higher in the SERPS for specific keywords/phrase. But those trusted sites can only support so many links before they are no longer trusted and they can get a penalty. It is far more likely that many of us here would run a much greater risk of getting too many links from the wrong sites and getting a penalty. Getting too many links too fast and getting "put in the sandbox". Getting a lot of links from within our own network of sites and NOT passing PR. Or stuffing keywords and getting hit with a penalty.

Without using emotion (and possibly experience as well) to determine "good", it is simply much easier to tell the program: hidden text is bad, subtract. Too many links too fast is bad, subtract. All links from one IP address is bad, subtract. See what I mean? Since you can't tell a program that blue is good and red is bad (that depends on the personal preferences of the programmer), creating filters to put the bad at the bottom becomes the path of least resistance.
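
To put that in concrete terms, a filter of that kind is really just a list of "bad" signals with points subtracted for each one. Everything in this little sketch is made up purely for illustration; none of the rules or weights come from any real engine:

# Made-up illustration of the "bad thing -> subtract" logic described above.
PENALTY_RULES = [
    ("hidden_text", 10),
    ("links_gained_too_fast", 5),
    ("all_links_from_one_ip", 8),
]

def score_page(base_score, page_signals):
    """Subtract a penalty for every 'bad' signal the page triggers."""
    score = base_score
    for signal, penalty in PENALTY_RULES:
        if page_signals.get(signal):
            score -= penalty
    return score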

It is changing. I'm no Orion by a long shot, but to me, spidering engines have always been about trying to duplicate the behavior of humans without having to acknowledge or compensate those humans for the contribution. There is no doubt that programs are getting better at that. But, for the time being, the majority of mechanical IR systems operate by trying to put the bad at the bottom rather than the "good" at the top. This whole "BadRank" thing illustrates my point.

This is what's wrong with the cliché:
"Just keep building good content and eventually, you'll get there"

To my mind, a more accurate statement would be:

"just keep building content that avoids the most filters and converts the best"
Old 06-04-2005   #7
orion
 
 
Join Date: Jun 2004
Posts: 1,044

Good and practical post, Bob, as usual.

Regarding the concerns some have raised about the goal of this type of research that attempts to fight abuse: as Bob mentioned, such attempts are not new, and it has been this way since the beginning. The author actually covers many of these attempts, among others:

Lempel and Moran (SALSA)
Soumen Chakrabarti (Document Object Models)
Li’s small-in-large-out revision of HITS
Gyöngyi's TrustRank


Quote:
It's kind of sad, but that is the way it's been from the beginning with spidering engines. Logic would dictate that the purpose of an algorithm is to put the "good" stuff at the top, BUT, a spider does not know "good". Instead, it relies on filters: rather than putting the good stuff at the top, it tries to put the bad stuff at the bottom.

Indeed, the algorithm described by Professor Davison is just that: an algorithm. It is not an IR system. It is an algorithm that works as a kind of filter and could be incorporated into IR systems or combined with other algorithms (such as PageRank) in which link weights are propagated.


Quote:
1. At what level of sophistication does the BadRank system fail? It seems that most link farms I see succeeding are using a large variety of inbound and outbound links in order to effectively "hide" themselves from being simply identified.

Rand, I think the effectiveness and validity of BadRank are addressed in the paper:

“E(A) is the initial BadRank value for
page A and can be assigned by some spam filters. Since no
algorithms of how to calculate E(A) and how to combine
BadRank value with other ranking methods such as PageRank
are given in [1], we cannot tell the effectiveness of this
approach."


Quote:
Couldn't this easily incorrectly tag sites in a close-knit community as being spam? It seems that reliance on a level of interlinking will kill a lot of legitimate sites - just think of the SEO community websites (we all link to each other like crazy!)?

Good question. In my view, there may be few such cases, since the goal of this research is precisely to address TKC effects:

“The link farm is one example of the tightly-knit
community (TKC) [20]. Since TKCs can have
significant impact on ranking results [20, 7, 23], it is necessary
to detect link farms and ameliorate their effect on the
ranking process.”


How many innocent sites could be affected? That's a different question. How far could it go? Obviously one would need a threshold.


When comparing BadRank with their ParentPenalty algorithm, they use this example:


“As we mentioned before, BadRank uses the following philosophy:
a page should be penalized for pointing to bad
pages. However, it does not specify how far it should go. If
page A points to page B, and B points to some known bad
pages, it is intuitive to consider B to be bad, but should A
be penalized just for pointing to B? For example, a computer
science department's homepage points to a student's
homepage and the student may join some link exchange program
by adding some links within his homepage. It makes
sense that the student's homepage should be penalized, but
the department's homepage is innocent. In the BadRank algorithm,
the department's homepage will also realize some
non-zero badness value by propagation upward from one or
more other pages with non-zero badness values.

Our ParentPenalty idea is more resistant to this issue. A
threshold is used in Section 4.2 to decide whether the badness
of the child pages should be propagated to a parent.
If the number of bad children is equal to or larger than the
threshold, then the parent should be penalized. Further,
if the number of parents that should be penalized meets
or exceeds the threshold, then the grandparents should be
penalized. This also makes sense in real life. So the threshold
plays an important role in preventing the badness value
from propagating upwards to too many generations.”
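
In code, the expansion step they describe boils down to something like the sketch below. The names and the threshold value are mine (Section 4.2 of the paper has the real details), so again this is only an illustration of the idea:

# Minimal sketch of the threshold-based ParentPenalty expansion quoted above.
# A parent is only marked bad once it points to `threshold` or more pages that
# are already marked bad, which keeps one stray link from condemning an
# innocent page; repeated passes let the badness climb to grandparents.
def expand_parent_penalty(bad_pages, out_links, threshold=3):
    """out_links maps each page to the set of pages it links to."""
    bad = set(bad_pages)
    changed = True
    while changed:
        changed = False
        for page, targets in out_links.items():
            if page in bad:
                continue
            if len(set(targets) & bad) >= threshold:
                bad.add(page)
                changed = True
    return bad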


Overall they have also considered the problem of collateral damage.


“One issue to discuss is that it is appropriate to punish links
among link farms, but not good to punish some web pages.
For example, most of the URLs in Table 2(c) are well-known
companies. They are not good for the query IBM research
center, but they are quite good for a query like jupiter media.
So, we should not remove these web sites from the index
just because they link with each other, but it makes sense
to delete links among the pages in Table 2(c) to prevent
them from getting authority value from collaborators. These
pages should only get votes from pages outside their business
in the ranking system. If many other pages point to them,
they will still be ranked high and be hit for queries like
jupiter media.”
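
That "delete the links, keep the pages" idea is simple to picture in code. Again, the function and variable names below are mine, just to show the shape of it:

# Drop edges that run between two identified farm pages before ranking, so
# farm members can only receive votes from pages outside their clique.
def reweight_graph(out_links, farm_pages):
    """Return a copy of the link graph with intra-farm edges removed."""
    cleaned = {}
    for page, targets in out_links.items():
        if page in farm_pages:
            cleaned[page] = {t for t in targets if t not in farm_pages}
        else:
            cleaned[page] = set(targets)
    return cleaned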



Obviously, the algorithm is not a silver bullet and is far from perfect.


“One weakness of the algorithms in this paper is that duplicate
pages cannot be detected. For example, often pages
in dmoz.com will be copied many times, but these pages
do not connect with each other so we do not mark them
as spam. As a result of the duplication, the targets cited
by these pages will be ranked highly. In general, duplicate
pages should be eliminated before using our algorithm to
generate better ranking (e.g., using methods from [9, 5]).”


I believe this type of research against spam and link abuse is a good step in the right direction and a hint at things to come.


Orion
