|
#1
|
|||
|
|||
|
In running some searches today, I came across a paper that I found fascinating, although I admit to having a very difficult time fully understanding it. The paper is Lexical and Semantic Clustering by Web Links by Filippo Menczer.
The paper describes a test of two popular conjectures about how search engines can use the information they retrieve from links. The first, the link-content conjecture, proposes that any given page is similar in content to the pages that link to it. The second, the link-cluster conjecture, says that pages on similar topics will be clustered together on the web. Which is to say (I think) that web pages on a particular topic will be separated by very few links. Essentially, as a spider crawls the links from a specific page, it will encounter very similar pages 1 and 2 links away, and more diverse and dissimilar pages 3, 4 and 5+ links away. Some interesting conclusions (regarding the link-content conjecture): Quote:
Quote:
More (regarding link-clustering conjecture): Quote:
I hope this is somewhat informative for everyone here. Perhaps someone more versed in the math & language of this paper can correct me if I'm wrong and point out any important points I may have missed.
P.S. This also suggests to me that when you link externally on your site/page, you should link on-topic. My guess is that SEs wouldn't like pages/sites that don't fit the link-content pattern. I'm not so sure about link-clustering though. I like linking to pages on-topic that no one has found before (or that are at least unpopular). It's hard for me to imagine the SEs would punish this...
Last edited by randfish : 02-08-2005 at 04:45 PM. |
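For anyone who wants to experiment with the "1 and 2 links away" idea, here is a minimal sketch of how link distance could be measured with a breadth-first search. It is not taken from the paper; the page names and the `toy_graph` structure are made up purely for illustration.

```python
from collections import deque


def link_distances(graph, start):
    """Shortest number of link hops from `start` to every reachable page."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in distances:  # first visit is the shortest path in BFS
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Hypothetical toy link graph: page -> pages it links to.
toy_graph = {
    "seo-guide": ["link-building", "keyword-research"],
    "link-building": ["anchor-text"],
    "keyword-research": ["anchor-text"],
    "anchor-text": ["recipes"],
    "recipes": [],
}

distances = link_distances(toy_graph, "seo-guide")
for page, hops in sorted(distances.items(), key=lambda item: item[1]):
    print(hops, page)
```

If the link-cluster conjecture holds, pages with a small hop count from the start page should on average be closer to it in topic than pages with a large one. In this toy graph the off-topic page is simply placed furthest away by construction.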
|
#2
|
|||
|
|||
|
Hi again Randfish,
it seems to me that you did a good job of figuring out what the paper was going on about! Menczer basically says that the semantic content of a web page can be inferred from the pages that point to it. His semantic model involves connecting cosine-based semantic similarity and the shortest link distance. Maybe look at Davison's link-hop work and the TF*IDF cosine similarity observations, which suggest that pages adjacent in hyperlink space to a given page are semantically related. The basic idea behind the conjecture is that the probability of a content word appearing more than once in a document, given that it already appears once, is significantly higher than the probability of its first occurrence. Pages that link to each other are similar.
The link-cluster conjecture states that pages on the same topic are clustered together. (This brings Hubs and Authorities to mind: hubs link to lots of relevant pages (authorities), while authorities have meaningful, significant information and link to other authorities.)
Anyway... it has some drawbacks: distance is not a good indicator of relevance, given the nature of the web today. You have pages being created on the fly and links pointing all over the place regardless of topic. The idea of hubs was also brought up in HITS, a search algo that came out in 1998, the same time as PR. Look at the outgoing links on some sites and you may find a good number not linking to relevant sites. They will link to some but not all. Some don't link to anything relevant. This would work with a corpus of documents that have bibliographies etc., like a digital library.
Yes, you are quite right, getting links that are relevant will always help you, and it certainly can't hurt anyway. If everybody did that, there wouldn't be a problem from my end of things. The problem is that not everybody does, so these hub techniques do have some mileage, but it's limited for now. To be honest, what have we got to go on? Text and links, and markup like HTML and XML.
There's your answer. (Menczer is well known for that paper, but it was 2002, which is a bit old considering that techniques advance and improve so quickly. Good find though, because it gives you some answers.) Forgive me if it all sounds like babble, it's late here! Here's the original paper: Menczer Last edited by xan : 02-09-2005 at 07:51 PM. |
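To make the "TF*IDF cosine similarity" part a bit more concrete, here is a rough sketch of how two pages could be compared. It is only an illustration under simple assumptions (whitespace tokenization, a common smoothed IDF variant, three invented page texts), not the exact weighting used in the paper.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption here) keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page texts: the first two stand in for pages that link to each other.
pages = [
    "link analysis for search engine ranking",
    "search engine ranking and link analysis",
    "chocolate cake recipe with frosting",
]
vectors = tf_idf_vectors(pages)
sim_linked = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(sim_linked, sim_unrelated)
```

Under the link-content conjecture you would expect the score between a page and the pages linking to it to be higher, on average, than the score against a random page. Here the third page shares no words with the first, so their similarity is zero.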
|
#3
|
|||
|
|||
|
Xan,
I'm glad I was able to accurately understand the general gist of things. I've been reading white papers/reports all day and am definitely struggling. To me, the most valuable parts of this paper were the discussion about web communities and clustering, and I think it still holds true. I notice that a lot of the top n-ranked documents at Google are often only 1-3 links away from each other, creating a little link neighborhood. It's probably a good idea for SEOs to recognize these in their own SERPs and try to become part of them. The other item that struck me as interesting was the idea of linking externally to pages that are not in the neighborhood but are on-topic or on a related topic. I'm not sure if the paper really gives us any idea about whether this behavior should be or is rewarded in ranking, but it's something to think about. Thanks again for your comments and help! |
|
#4
|
|||
|
|||
|
Papers are notoriously dull to read and sometimes extra difficult. With practice it gets easier, but you forget how to spell and can only write equations...
You did pretty well here, so that should be really encouraging for you. Papers are a major, major part of research. Writing them takes forever as well. |
|
#5
|
||||
|
||||
|
Prof Filippo Menczer is a great guy and colleague, though we haven't yet met in person. Back in Oct 2004, he and I shared some research on semantic projects. See post 20 of the On-Topic Analysis thread.
Back then, he kindly sent me this Nov 2004 research paper, which interfaces with on-topic analysis and semantics. Our results compare well. http://informatics.indiana.edu/fil/P...ikm-04-326.pdf Orion |
|
#6
|
|||
|
|||
|
Hey Orion,
Interesting, thanks. What environment do you test in? (Oh, and without being rude... how old are you?) |
|
#7
|
||||
|
||||
|
1. Proprietary software environment.
2. Big 5 coming in ~ 5 crawling weeks. Ready to FTP myself in binary or ASCII. Then try to find a c-index for my soul and EF-ratio for my spirit. Already done many find-and-replace in life. Still working on an EXACT match. Presents accepted, which will then be ranked based on relevancy. Orion |
|
#8
|
|||
|
|||
|
I hope you will be using a state-of-the-art ranking method, after all the big 5 warrants it!
Do you think there is such a thing as the exact match??? I'm not convinced. If there is no such thing as "random", is anything "exact"? Perhaps you should have a real party and go to sit in a Google server. I hear zip files can hurt though. |
|
#9
|
|||
|
|||
|
I like the Pac-10 over the Big-5...Wait... What's the Big 5?
Orion, I read the paper you pointed to about the CmapTools from IHMC and Indiana University. It was an interesting read, although the meat of the math and algos was way over my head. I feel like I was able to take away some items from the paper, but I'm wondering - does this have a practical application for commercial search engines (using the web as their index)? Also, could you see search optimization specialists using the CmapTools to analyze their own sites and possibly identify weaknesses in architecture? Or, even use the site they described - http://www.cmex.arc.nasa.gov - as a good "role-model" of sorts for architecting information? Some notable items from the paper that did make sense to me: Quote:
Quote:
Quote:
|
|
#10
|
||||
|
||||
|
Yeah, xan. Just don't spam me or I'll remote-search you
Orion Last edited by orion : 02-10-2005 at 02:59 PM. |
|
#11
|
|||
|
|||
|
Ouch!!!
I'd better get programming then... |
|
#12
|
|||
|
|||
|
There's been so much talk of semantics that I wrote a primer on the search science blog. It's not horrendously hard to understand, I hope. It just describes what computational semantics are and how they work. Formal semantics are very different. There's a lot of confusion all around!
|