randfish
02-08-2005, 05:27 PM
In running some searches today, I came across a paper that I found fascinating, although I admit to having a very difficult time fully understanding it. The paper is Lexical and Semantic Clustering by Web Links (http://www.informatics.indiana.edu/fil/Papers/JASIST-04.pdf) by Filippo Menczer.
The paper tests two conjectures about what the web's link structure tells a search engine about page content.
The first, the link-content conjecture, proposes that any given page is similar in content to the pages that link to it.
The second, the link-cluster conjecture, says that pages on similar topics will be clustered together on the web. Which is to say (I think) that web pages on a particular topic will be separated by very few links. Essentially, as a spider crawls the links from a specific page, it will encounter very similar pages 1 and 2 links away, and more diverse and dissimilar pages 3, 4 and 5+ links away.
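To make the two conjectures a bit more concrete, here is a rough Python sketch of how one might check them on a tiny crawl. It is not the paper's method: the pages, their text, and the function names are all made up for illustration, and plain word-count cosine similarity stands in for whatever similarity measure the paper actually uses.

```python
import math
from collections import Counter, deque

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link_distances(graph, start):
    """Breadth-first search: fewest link hops from start to each reachable page."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

def similarity_by_distance(graph, text, start):
    """Average similarity to the start page, grouped by link distance."""
    vecs = {p: Counter(t.lower().split()) for p, t in text.items()}
    buckets = {}
    for page, d in link_distances(graph, start).items():
        if d > 0:
            buckets.setdefault(d, []).append(cosine(vecs[start], vecs[page]))
    return {d: sum(s) / len(s) for d, s in sorted(buckets.items())}

# Hypothetical four-page chain: a -> b -> c -> d (invented data)
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
text = {
    "a": "seo link building tips",
    "b": "link building guide seo",
    "c": "seo news and recipes",
    "d": "chocolate cake recipes",
}
result = similarity_by_distance(graph, text, "a")
print(result)
```

On this toy data the similarity to page "a" drops with each extra hop, which is the pattern both conjectures predict. A real test would need a large crawl and a better text model than raw word counts.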
Some interesting conclusions (regarding the link-content conjecture):
"academic Web pages are better connected to each other than commercial pages in that they do a better job at pointing to other similar pages."
Not very surprising, but actually, the 'content-decay' across commercial pages was not that different - even on the commercial web, pages that link generally link on-topic.
"This result can be useful in the design of general crawlers... as well as topical crawling algorithms that prioritize links based on the textual context in which they appear; one could weight a link’s context based on its site domain."
This is more interesting to SEOs. It suggests to me that IR research has empirically shown that links from subject-specific URLs should be more highly weighted in a search engine's algo. This isn't to say you should choose on-topic pages for links, but rather, on-topic TLDs.
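For anyone wondering what "weight a link's context based on its site domain" might look like in a crawler, here is a minimal Python sketch. The paper gives no weights or formula that I can see, so everything here is my own assumption: the `DOMAIN_WEIGHTS` values, the `context_score` measure, and the example URLs are all invented.

```python
from urllib.parse import urlparse

# Hypothetical weights per top-level domain; the numbers are made up.
DOMAIN_WEIGHTS = {"edu": 1.0, "org": 0.9, "com": 0.8}
DEFAULT_WEIGHT = 0.8

def tld(url):
    """Last label of the URL's hostname, e.g. 'edu' for example.edu."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]

def context_score(context, topic_terms):
    """Fraction of the topic terms that appear in the text around the link."""
    words = set(context.lower().split())
    return len(words & topic_terms) / len(topic_terms) if topic_terms else 0.0

def link_priority(source_url, context, topic_terms):
    """Context relevance scaled by the (assumed) weight of the linking site's TLD."""
    weight = DOMAIN_WEIGHTS.get(tld(source_url), DEFAULT_WEIGHT)
    return context_score(context, topic_terms) * weight

topic = {"semantic", "clustering"}
links = [
    ("http://example.com/blog", "holiday photos"),
    ("http://example.com/shop", "semantic clustering software sale"),
    ("http://example.edu/papers", "semantic clustering of web pages"),
]
# Crawl the most promising links first.
ranked = sorted(links, key=lambda l: link_priority(l[0], l[1], topic), reverse=True)
for url, context in ranked:
    print(url, link_priority(url, context, topic))
```

With these invented weights, the .edu link with on-topic context is crawled first, the equally on-topic .com link second, and the off-topic link last. Whether any search engine does anything like this is pure speculation on my part.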
More (regarding the link-cluster conjecture):
"The link-cluster conjecture has been empirically validated by showing that two pages are significantly more likely to be related if they are within a few links from each other. Relevance probability is preserved within a radius of about three links, then it decays rapidly."
Link-clustering apparently is also valid. The closer you are in terms of links to a web page on a topic, the more likely you are to be on that same topic. This suggests that SEOs should continue the practice of getting links from sites & pages that link to their top competition.
I hope this is somewhat informative for everyone here. Perhaps someone more versed in the math & language of this paper can correct me if I'm wrong and point out any important points I may have missed.
P.S. This also suggests to me that when you link externally on your site/page, you should link on-topic. My guess is that SEs wouldn't like pages/sites that don't fit the link-content pattern. I'm not so sure about link-clustering though. I like linking to pages on-topic that no one has found before (or that are at least unpopular). It's hard for me to imagine the SEs would punish this...