orion
04-08-2005, 02:49 PM
One of the AIRWeb papers on spamming, courtesy of Garcia-Molina's group (Stanford), who are no strangers to Google, is
Web Spam Taxonomy (http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=2005-9&format=pdf&compression=&name=2005-9.pdf)
Here are some interesting lines:
ABOUT WHAT IS SPAMMING
"We use the term spamming (also, spamdexing) to refer
to any deliberate human action that is meant to
trigger an unjustifiably favorable relevance or importance
for some web page, considering the page’s true
value. We will use the adjective spam to mark all those
web objects (page content items or links) that are the
result of some form of spamming. People who perform
spamming are called spammers."
"One can locate on the World Wide Web a handful of
other definitions of web spamming. For instance, some
of the definitions (e.g., [13]) are close to ours, stating
that any modification done to a page solely because
search engines exist is spamming. Specific organizations
or web user groups (e.g., [9]) define spamming by
enumerating some of the techniques that we present in
Sections 3 and 4."
ABOUT THE PERCEPTION OF OUR SEO INDUSTRY
"An important voice in the web spam arena
is that of search engine optimizers (SEOs), such
as SEO Inc. (***//***.seoinc.com) or Bruce Clay
(****//***.bruceclay.com). The activity of some SEOs
benefits the whole web community, as they help authors
create well-structured, high-quality pages. However,
most SEOs engage in practices that we call spamming.
For instance, there are SEOs who define spamming
exclusively as increasing relevance for queries not
related to the topic(s) of the page. These SEOs endorse
and practice techniques that have an impact on importance
scores, to achieve what they call “ethical” web
page positioning or optimization. Please note that according
to our definition, all types of actions intended
to boost ranking (either relevance, or importance, or
both), without improving the true value of a page, are
considered spamming."
ABOUT DECEIVING SEARCH ENGINES THAT IGNORE IDF IN TERM VECTOR MODELS
(Mostly the poorly programmed ones.)
"With TFIDF scores in mind, spammers can have two
goals: either to make a page relevant for a large number
of queries (i.e., to receive a non-zero TFIDF score), or
to make a page very relevant for a specific query (i.e.,
to receive a high TFIDF score). The first goal can be
achieved by including a large number of distinct terms
in a document. The second goal can be achieved by repeating
some “targeted” terms. (We can assume that
spammers cannot have real control over the IDF scores
of terms. Moreover, some search engines ignore IDF
scores altogether. Thus, the primary way of increasing
the TFIDF scores is by increasing the frequency of
terms within specific text fields of a page.)"
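To make the two goals in that quote concrete, here is a rough Python sketch. It is my own toy illustration, not code from the paper: the three-page corpus, the names (idf, score, honest, dumped, stuffed) and the plain tf * log(N/df) weighting are all assumptions, and real engines use more elaborate scoring. IDF is computed from a fixed background corpus, which mirrors the paper's remark that spammers cannot really control IDF.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Inverse document frequency of `term` over a list of token lists."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0

def score(query, page, corpus, use_idf=True):
    """Sum of tf * idf over the query terms (or raw tf if IDF is ignored)."""
    counts = Counter(page)
    return sum(counts[t] * (idf(t, corpus) if use_idf else 1.0) for t in query)

# Hypothetical background corpus; the spammer has no influence on it.
background = [
    "cheap flights to paris".split(),
    "hotel deals in rome".split(),
    "train tickets to berlin".split(),
]
honest = background[0]

# Goal 1: add many distinct terms so more queries get a non-zero score.
dumped = honest + "rome berlin hotel train".split()

# Goal 2: repeat a targeted term so one query gets a high score.
stuffed = honest + ["paris"] * 9

print(score(["rome"], honest, background))    # page is invisible for "rome"
print(score(["rome"], dumped, background))    # now non-zero
print(score(["paris"], honest, background))
print(score(["paris"], stuffed, background))  # ten times the honest score
print(score(["paris"], stuffed, background, use_idf=False))  # raw count only
```

With use_idf=False the score is just the term count, so repetition is the only thing that matters. That is the case of the engines that ignore IDF altogether.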
OTHER SPAM TACTICS RELATED TO...
BODY
TITLES
META TAGS
URLS
ANCHOR TEXT/TAGS
DRIVE-BY TERMS DROPPING
etc.
Enjoy it.
I have been in communication with Baeza-Yates and Brian Davidson. More outreach activities are coming. These may develop into a better understanding of the perceptions on both sides of the fence (IR and SEO colleagues). Sorry for the redundancy and wordiness.
Orion