Deceiving Relevancy


orion
04-08-2005, 02:49 PM
One of the AIRWeb papers on spamming, courtesy of Garcia-Molina's group (Stanford), which is no stranger to Google, is
Web Spam Taxonomy (http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=2005-9&format=pdf&compression=&name=2005-9.pdf)

Here are some interesting lines

ABOUT WHAT IS SPAMMING

"We use the term spamming (also, spamdexing) to refer
to any deliberate human action that is meant to
trigger an unjustifiably favorable relevance or importance
for some web page, considering the page’s true
value. We will use the adjective spam to mark all those
web objects (page content items or links) that are the
result of some form of spamming. People who perform
spamming are called spammers."

"One can locate on the World Wide Web a handful of
other definitions of web spamming. For instance, some
of the definitions (e.g., [13]) are close to ours, stating
that any modification done to a page solely because
search engines exist is spamming. Specific organizations
or web user groups (e.g., [9]) define spamming by
enumerating some of the techniques that we present in
Sections 3 and 4."

ABOUT THE PERCEPTION OF OUR SEO INDUSTRY


"An important voice in the web spam arena
is that of search engine optimizers (SEOs), such
as SEO Inc. (***//***.seoinc.com) or Bruce Clay
(****//***.bruceclay.com). The activity of some SEOs
benefits the whole web community, as they help authors
create well-structured, high-quality pages. However,
most SEOs engage in practices that we call spamming.
For instance, there are SEOs who define spamming
exclusively as increasing relevance for queries not
related to the topic(s) of the page. These SEOs endorse
and practice techniques that have an impact on importance
scores, to achieve what they call “ethical” web
page positioning or optimization. Please note that according
to our definition, all types of actions intended
to boost ranking (either relevance, or importance, or
both), without improving the true value of a page, are
considered spamming."

ABOUT DECEIVING SEARCH ENGINES THAT IGNORE IDF TERM VECTOR MODELS

(Mostly the poorly programmed ones.)

"With TFIDF scores in mind, spammers can have two
goals: either to make a page relevant for a large number
of queries (i.e., to receive a non-zero TFIDF score), or
to make a page very relevant for a specific query (i.e.,
to receive a high TFIDF score). The first goal can be
achieved by including a large number of distinct terms
in a document. The second goal can be achieved by repeating
some “targeted” terms. (We can assume that
spammers cannot have real control over the IDF scores
of terms. Moreover, some search engines ignore IDF
scores altogether. Thus, the primary way of increasing
the TFIDF scores is by increasing the frequency of
terms within specific text fields of a page.)"
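To make the two goals in that passage concrete, here is a minimal sketch in Python. It is not the paper's code: the toy corpus, the function names, raw term counts for TF, and log(N/df) for IDF are all my own assumptions, picked only for illustration. The `use_idf=False` switch stands in for the engines that "ignore IDF scores altogether".

```python
import math
from collections import Counter

def idf(term, corpus):
    """Assumed IDF: log(N / number of documents containing the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tfidf_score(page, query, corpus, use_idf=True):
    """Sum TF * IDF over the query terms; TF is the raw count in the page."""
    counts = Counter(page)
    score = 0.0
    for term in set(query):
        weight = idf(term, corpus) if use_idf else 1.0
        score += counts[term] * weight
    return score

# Hypothetical toy corpus; the spammer cannot change it, so IDF is fixed.
corpus = [
    "cheap flights to rome".split(),
    "rome travel guide".split(),
    "garden tools and seeds".split(),
]

honest = "rome travel guide".split()
# Goal 2: repeat a targeted term to get a high score for one query.
stuffed = honest + ["rome"] * 5
# Goal 1: add many distinct terms to get a non-zero score for many queries.
dumped = honest + "cheap flights garden tools seeds".split()

print(tfidf_score(honest, ["rome"], corpus))
print(tfidf_score(stuffed, ["rome"], corpus))
print(tfidf_score(honest, ["garden"], corpus))
print(tfidf_score(dumped, ["garden"], corpus))
print(tfidf_score(stuffed, ["rome"], corpus, use_idf=False))
```

Under these assumptions the stuffed page outscores the honest one for "rome" only because its term count went up, and the dumped page goes from zero to non-zero for "garden". With IDF switched off, the score is nothing but the term count.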

OTHER SPAM TACTICS RELATED TO...

BODY
TITLES
META TAGS
URLS
ANCHOR TEXT/TAGS
DRIVE-BY TERMS DROPPING

etc.

Enjoy it.

I was in communication with Baeza-Yates and Brian Davidson. More outreach activities are coming. These may develop into a better understanding of the perceptions on both sides of the fence (IR and SEO colleagues). Sorry for the redundancy and wordiness.

Orion

orion
04-08-2005, 03:04 PM
Forgot to mention the part about deceiving PageRank, and about link farms, now renamed as "link building" strategies.

Now this Acknowledgment is hilarious

"This paper is the result of many interesting discussions
with one of our collaborators at a major search engine
company, who wishes to remain anonymous. We would
like to thank this person for the explanations and examples
that helped us shape the presented taxonomy
of web spam."

HUM!, Stanford, Garcia-Molina, and some of the reference papers...?

C'mon, children.

Orion

NFFC
04-08-2005, 03:58 PM
I'm not sure what you think of SEOs, but that paper is way old; it has been read and dissected at many meets.

>HUM!, Stanford, Garcia-Molina, and some of the reference papers...?

It was Yahoo.

orion
04-08-2005, 05:11 PM
FYI, NFFC

The paper in question was taken from Gary Price

http://blog.searchenginewatch.com/blog/050407-190947

Gary writes

Updated Research Paper: A Taxonomy of Web Spam

A week ago, Chris blogged about the First International Workshop on Adversarial Information Retrieval on the Web that will be part of the WWW2005 Conference next month in Japan.

One of the papers that will be presented at the conference: Web Spam Taxonomy, by Zoltán Gyöngyi and Hector Garcia-Molina from the Stanford Database Group has been updated and is now available full text (9 pages; PDF) online.

It's a very interesting read.

From the abstract:


Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.


Posted by Gary Price on Apr. 7, 2005 |


Apr 7, 2005 does not sound old to me. The material and issues at hand are indeed as old as the SEO industry.

Orion

PS

Furthermore, check the dates of the following references in the paper.

Monica Bianchini, Marco Gori, and Franco
Scarselli. Inside PageRank. ACM Transactions
on Internet Technology, 5(1), 2005.

Zoltán Gyöngyi and Hector Garcia-Molina. Link
spam alliances. Technical report, Stanford University,
2005.

rcjordan
04-08-2005, 05:22 PM
Page 7: The second data set (DS2) was the result of a single breadth-first search started at the Yahoo! home page, conducted between July and September 2002.

orion
04-08-2005, 05:27 PM
This paper, like many to be presented at AIRWeb, is a recap of how IRs perceive SEOs and these issues, which are old, of course.

I have spent good of my time last few weeks discussing with AIRWeb folks some of the material to be presented at the activity. They got stuck with IRs submitting old things to a new event. That explains everything.

Orion

NFFC
04-08-2005, 05:58 PM
>Apr 7, 2005 does not sound old to me.

In the SEO world it's ancient history; tomorrow is all that counts.

Point taken though, it is a modern-day rehash of an old paper, one I've read many times. Nothing in your summation suggested anything new.

>I have spent good of my time last few weeks discussing with AIRWeb folks some of the material to be presented at the activity.

I've spent a good part of mine discussing similar things, although we are looking at what we *think* will be in the 2006 one. Chess and draughts?

orion
04-08-2005, 06:40 PM
Ok, I let you save some face.

Going back to AIRWeb, the problem is that those papers were supposed to address new issues. Nothing new has been presented yet. We agree on that.

One more thing, what makes you think the Acknowledgement part of the paper quoted by Gary Price refers to Yahoo and not Google, or who knows? On the other hand, Jordan has a good point with the Yahoo data.

Orion

NFFC
04-08-2005, 07:13 PM
>Ok, I let you save some face.

I have no face, I only have rank. That is what defines me; learn to love it, as that is the SEO mindset... I rank, therefore I am.

>One more thing, what makes you think the Acknowledgement part of the paper quoted by Gary Price refers to Yahoo and not Google

I just know.

orion
04-08-2005, 07:14 PM
Ok.

Orion