Search Engine Watch
02-08-2005   #1
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Lexical & Semantic Clustering by Web Links

In running some searches today, I came across a paper that I found fascinating, although I admit to having a very difficult time fully understanding it. The paper is Lexical and Semantic Clustering by Web Links by Filippo Menczer.

The paper describes a test of two popular methods for analyzing the information search engines retrieve and use from links.

The first, the link-content conjecture, proposes that any given page is similar in content to the pages that link to it.

The second, the link-cluster conjecture, says that pages on similar topics will be clustered together on the web. Which is to say (I think) that web pages on a particular topic will be separated by very few links. Essentially, as a spider crawls the links from a specific page, it will encounter very similar pages 1 and 2 links away, and more diverse and dissimilar pages 3, 4 and 5+ links away.
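
To make the link-cluster idea concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the toy link graph, the page word sets, and the helper names (`link_distances`, `jaccard`, `similarity_by_distance`). Jaccard word overlap is only a stand-in for whatever similarity measure the paper actually uses.

```python
from collections import deque

# Hypothetical toy web: page -> pages it links to (not data from the paper).
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}

# Hypothetical page texts, reduced to sets of words.
WORDS = {
    "a": {"search", "engine", "ranking", "links"},
    "b": {"search", "engine", "links", "crawler"},
    "c": {"search", "ranking", "links", "anchor"},
    "d": {"links", "crawler", "hosting", "servers"},
    "e": {"hosting", "servers", "pricing", "coupons"},
}


def link_distances(links, start):
    """Breadth-first search: fewest links needed to reach each page from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def jaccard(x, y):
    """Word-overlap similarity: shared words divided by all distinct words."""
    return len(x & y) / len(x | y)


def similarity_by_distance(links, words, start):
    """Average similarity to the start page, grouped by link distance."""
    buckets = {}
    for page, d in link_distances(links, start).items():
        if d > 0:
            buckets.setdefault(d, []).append(jaccard(words[start], words[page]))
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}


if __name__ == "__main__":
    for d, sim in similarity_by_distance(LINKS, WORDS, "a").items():
        print(f"{d} link(s) away: mean similarity {sim:.2f}")
```

In this made-up data the mean similarity to page "a" drops as the link distance grows, which is the pattern the conjecture predicts. A real test would need a crawled graph and a better similarity measure.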

Some interesting conclusions (regarding link-content conjecture):
Quote:
academic Web pages are better connected to each other than commercial pages in that they do a better job at pointing to other similar pages.
Not very surprising, but actually, the 'content-decay' across commercial pages was not that different - even on the commercial web, pages that link generally link on-topic.
Quote:
This result can be useful in the design of general crawlers... as well as topical crawling algorithms that prioritize links based on the textual context in which they appear; one could weight a link’s context based on its site domain.
This is more interesting to SEOs. It suggests to me that IR research has empirically shown that links from subject-specific URLs should be more highly weighted in a search engine's algo. This isn't to say you should choose on-topic pages for links, but rather, on-topic TLDs.
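
As a rough illustration of what "weight a link's context based on its site domain" could look like in a topical crawler, here is a small Python sketch. The weights in `DOMAIN_WEIGHT`, the scoring formula, and the function names are all assumptions made up for this example; the paper does not give these numbers, and no search engine is known to use them.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical weights per top-level domain (invented for illustration).
DOMAIN_WEIGHT = {"edu": 1.0, "org": 0.8, "com": 0.6}
DEFAULT_WEIGHT = 0.5


def context_score(context, topic_terms):
    """Fraction of the topic terms that appear in the text around the link."""
    words = set(context.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def link_priority(url, context, topic_terms):
    """Context relevance scaled by a weight for the link's top-level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return context_score(context, topic_terms) * DOMAIN_WEIGHT.get(tld, DEFAULT_WEIGHT)


def build_frontier(candidates, topic_terms):
    """Min-heap of (negative priority, url), so the best link pops first."""
    heap = []
    for url, context in candidates:
        heapq.heappush(heap, (-link_priority(url, context, topic_terms), url))
    return heap


if __name__ == "__main__":
    topic = {"search", "ranking"}
    candidates = [
        ("http://example.com/shop", "search ranking tools"),
        ("http://example.edu/ir", "search ranking research"),
        ("http://example.com/deals", "cheap deals today"),
    ]
    frontier = build_frontier(candidates, topic)
    while frontier:
        neg_priority, url = heapq.heappop(frontier)
        print(f"{-neg_priority:.2f}  {url}")
```

With equally relevant link text, the crawler in this sketch visits the link on the more heavily weighted domain first.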

More (regarding link-clustering conjecture):
Quote:
The link-cluster conjecture has been empirically validated by showing that two pages are significantly more likely to be related if they are within a few links from each other. Relevance probability is preserved within a radius of about three links, then it decays rapidly.
Link-clustering apparently is also valid. The closer you are in terms of links to a web page on a topic, the more likely you are to be on that same topic. This suggests that SEOs should continue the practice of getting links from sites & pages that link to their top competition.

I hope this is somewhat informative for everyone here. Perhaps someone more versed in the math & language of this paper can correct me if I'm wrong and point out any important points I may have missed.

P.S. This also suggests to me that when you link externally on your site/page, you should link on-topic. My guess is that SEs wouldn't like pages/sites that don't fit the link-content pattern. I'm not so sure about link-clustering though. I like linking to pages on-topic that no one has found before (or that are at least unpopular). It's hard for me to imagine the SEs would punish this...

Last edited by randfish : 02-08-2005 at 05:45 PM.
02-09-2005   #2
xan
Member
 
Join Date: Feb 2005
Posts: 238
Hi again Randfish,

seems to me that you did a good job at figuring out what the paper was going on about!

Menczer basically says that the semantic content of a web page can be inferred from the pages that point to it. His model relates cosine-based similarity to the shortest link distance between pages. Maybe look at Davison's link-hop work and the TF*IDF cosine similarity observations, which suggest that pages adjacent to a given page in hyperlink space are semantically related.
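
For anyone who has not seen TF*IDF cosine similarity before, here is a minimal Python sketch. The weighting used (raw term count times log(N/df)) is one common textbook variant and is an assumption here, not necessarily the exact scheme used in the paper; the sample documents and function names are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "link analysis for search engines",
        "search engines use link analysis",
        "cheap hosting and server coupons",
    ]
    v = tf_idf_vectors(docs)
    print("doc0 vs doc1:", round(cosine(v[0], v[1]), 3))
    print("doc0 vs doc2:", round(cosine(v[0], v[2]), 3))
```

Two pages that share many distinctive terms score close to 1, and pages with no terms in common score 0.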

Basically, for the link-content conjecture: the probability of a content word appearing more than once in a document, given that it already appears once, is significantly higher than the probability of its first occurrence. Pages that link to each other are similar.

The link-cluster conjecture states that pages on the same topic are clustered together.

(This brings Hubs and Authorities to mind. Hubs have links to lots of relevant pages (authorities), authorities have meaningful, significant, etc... information and link to other authorities. )

Anyway... it has some drawbacks: distance is not a good indicator of relevance, given the nature of the web today. You have pages being created on the fly and links pointing all over the place regardless of topic. The idea of hubs was also brought up in HITS, a search algo that came out in 1998, the same time as PR. Look at the outgoing links on some sites and you may find that a good number do not point to relevant sites. They will link to some but not all. Some don't link to anything relevant.
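
Since HITS came up, here is a bare-bones Python sketch of the hub/authority iteration. It is a simplified illustration: the real algorithm runs on a query-specific subgraph, and the toy graph, page names, and iteration count below are assumptions invented for this example.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a {page: [linked pages]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(s * s for s in auth.values())) or 1.0
        h_norm = math.sqrt(sum(s * s for s in hub.values())) or 1.0
        auth = {p: s / a_norm for p, s in auth.items()}
        hub = {p: s / h_norm for p, s in hub.items()}
    return hub, auth


if __name__ == "__main__":
    # Hypothetical graph: two hub-like pages and one stray page.
    graph = {
        "hub1": ["a1", "a2"],
        "hub2": ["a1", "a2", "a3"],
        "stray": ["a3"],
    }
    hub, auth = hits(graph)
    print("best hub:", max(hub, key=hub.get))
    print("authority scores:", {p: round(s, 3) for p, s in sorted(auth.items())})
```

The drawback described above shows up directly in this model: if pages link all over the place regardless of topic, the hub and authority scores stop saying much about relevance.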

This would work with a corpus of documents that have bibliographies etc., like a digital library.
Yes, you are quite right: getting links that are relevant will always help you, and it certainly can't hurt. If everybody did that, there wouldn't be a problem from my end of things. The problem is that it's not the case, so these hub techniques do have some mileage, but it's limited for now.

To be honest what have we got to go on? Text and links, and markup like html, xml. There's your answer.

(Menczer is well known for that paper, but it was 2002, which is a bit old considering that techniques advance and improve so quickly - good find though because it gives you some answers)

Forgive me if it all sounds like babble, it's late here!

Here's the original paper:

Menczer

Last edited by xan : 02-09-2005 at 08:51 PM.
02-09-2005   #3
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Xan,

I'm glad I was able to accurately understand the general gist of things. I've been reading white papers/reports all day and am definitely struggling.

To me, the most valuable parts of this paper were the discussions of web communities and clustering, and I think they still hold true. I notice that many of the top-ranked documents at Google are only 1-3 links away from each other, creating a little link neighborhood. It's probably a good idea for SEOs to recognize these neighborhoods in their own SERPs and try to become part of them.

The other item that struck me as interesting was the idea of linking externally to pages that are not in the neighborhood but are on-topic or on a related topic. I'm not sure the paper really gives us any idea whether this behavior should be, or is, rewarded in ranking, but it's something to think about.

Thanks again for your comments and help!
02-09-2005   #4
xan
Member
 
Join Date: Feb 2005
Posts: 238
Papers are notoriously dull to read and sometimes extra difficult. With practice it gets easier, but you forget how to spell and can only write equations...

You did pretty well here, so that should be really encouraging for you.

Papers are a major, major part of research. Writing them takes forever as well.
02-09-2005   #5
orion
 
 
Join Date: Jun 2004
Posts: 1,044
Filippo Menczer Research

Prof. Filippo Menczer is a great guy and colleague, though we haven't yet met in person. Back in Oct 2004, he and I shared some research on semantic projects. See post 20 of the On-Topic Analysis thread.

Back then, he kindly sent me this Nov 2004 research paper which interfaces with on-topic and semantics. Our results compare well.
http://informatics.indiana.edu/fil/P...ikm-04-326.pdf



Orion
02-10-2005   #6
xan
Member
 
Join Date: Feb 2005
Posts: 238
Hey Orion,

Interesting, thanks. What environment do you test in?

(oh and without being rude...how old are you?)
02-10-2005   #7
orion
 
 
Join Date: Jun 2004
Posts: 1,044

1. Proprietary software environment.
2. Big 5 coming in ~ 5 crawling weeks. Ready to FTP myself in binary or ASCII. Then try to find a c-index for my soul and EF-ratio for my spirit. Already done many find-and-replace in life. Still working on an EXACT match. Presents accepted, which will then be ranked based on relevancy.

Orion
02-10-2005   #8
xan
Member
 
Join Date: Feb 2005
Posts: 238
I hope you will be using a state-of-the-art ranking method; after all, the big 5 warrants it!

Do you think there is something such as the exact match???
I'm not convinced. If there is no such thing as "random", is anything "exact"?

Perhaps you should have a real party and go to sit in a google server. I hear zip files can hurt though.
02-10-2005   #9
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
I like the Pac-10 over the Big-5...Wait... What's the Big 5?

Orion,

I read the paper you pointed to about the CmapTools from IHMC and Indiana University. It was an interesting read, although the meat of the math and algos was way over my head. I feel like I was able to take away some items from the paper, but I'm wondering - does this have a practical application for commercial search engines (using the web as their index)?

Also, could you see search optimization specialists using the CmapTools to analyze their own sites and possibly identify weaknesses in architecture? Or, even use the site they described - http://www.cmex.arc.nasa.gov - as a good "role-model" of sorts for architecting information?

Some notable items from the paper that did make sense to me:
Quote:
...central question addressed in this paper is how to formulate topic descriptors and discriminators to guide this search process
Quote:
(these) methods find new descriptors by searching for terms that tend to occur often in relevant documents and find good discriminators by identifying terms that tend to occur only in the context of the given topic.
Quote:
Terms are good topic descriptors if they answer the question "What is this topic about?"
Terms are good topic discriminators if they answer the question "What are good query terms to access similar information?"
There is certainly a lot to understand beyond what I've pulled from this paper, but I'm not sure how much of it is related to web search. Thanks for pointing it out, though. I feel like I have a better and better grasp on the tasks of IR researchers the more I read.
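
The descriptor/discriminator distinction quoted above can be sketched in a few lines of Python. These scoring formulas are a deliberately crude simplification invented for illustration, not the formulas from the paper: a descriptor score here is a term's share of all words in the topic documents, and a discriminator score is the fraction of documents containing the term that belong to the topic. The sample documents are made up.

```python
from collections import Counter


def descriptor_scores(topic_docs):
    """Share of all words in the topic documents taken up by each term."""
    counts = Counter()
    for doc in topic_docs:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def discriminator_scores(topic_docs, other_docs):
    """Of the documents containing a term, the fraction that are on-topic."""
    in_topic = Counter()
    for doc in topic_docs:
        in_topic.update(set(doc.lower().split()))
    elsewhere = Counter()
    for doc in other_docs:
        elsewhere.update(set(doc.lower().split()))
    return {term: n / (n + elsewhere[term]) for term, n in in_topic.items()}


if __name__ == "__main__":
    topic_docs = [
        "mars rover mission science",
        "mars mission landing site",
        "mars rover science data",
    ]
    other_docs = [
        "science fair data project",
        "space mission budget news",
    ]
    desc = descriptor_scores(topic_docs)
    disc = discriminator_scores(topic_docs, other_docs)
    for term in sorted(desc, key=desc.get, reverse=True):
        print(f"{term:8s} descriptor={desc[term]:.2f} discriminator={disc[term]:.2f}")
```

In this toy data "science" describes the topic about as well as "rover" does, but it also turns up in off-topic documents, so it is the weaker query term of the two.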
02-10-2005   #10
orion
 
 
Join Date: Jun 2004
Posts: 1,044

Yeah, xan. Just don't spam me or I'll remote-search you

Orion

Last edited by orion : 02-10-2005 at 03:59 PM.
02-10-2005   #11
xan
Member
 
Join Date: Feb 2005
Posts: 238
Ouch!!!

I'd better get programming then...
02-20-2005   #12
xan
Member
 
Join Date: Feb 2005
Posts: 238
There's been so much talk of semantics that I wrote a primer on the search science blog. It's not horrendously hard to understand, I hope. It just describes what computational semantics are and how they work; formal semantics are very different. There's a lot of confusion all around!