#1
Block-level Link Analysis
A recently published technical report from Microsoft Research (how timely!), 8 pages: Block-level Link Analysis
#2
Excellent finding, Gary.

I did a quick reading. It is very enlightening. I see they are still using tf values and the underlying ideas introduced by the Robertson/Sparck Jones term weights. What they call "blocks" is what I call "passages" in pages. This is where we are all heading: to semantic blocks, or passages of information semantically connected across the web as a graph of (semantic) nodes.

Now my two cents. The article destroys one of the fallacies PageRank has had from the get-go: that a link citation is, or can be considered, a vote of citation importance. The general analogy that link citation is something like literature citation (Impact Factors) was always a carefully marketed idea, which may work in controlled laboratory conditions (a noise-free IR system), but not on the noisy commercial web full of vested interests of all kinds. Literature citation is driven by peer review and editorial policies. On the commercial web, where anyone can say anything at any time or add/change/remove links at will (or buy links), linkage is mostly driven by commercial and vested interests.

I never bought the above marketing line embedded in the PageRank metric, purely for theoretical reasons. Semantic connectivity arguments are one of the reasons. I welcome the Microsoft work... Still, I wonder: how long will it take for marketers to dilute this algo, too?

Orion

Last edited by orion : 07-29-2004 at 11:18 PM.
#3
I read the first few pages; a very interesting angle. But how hard will it be to abuse? Segmentation of the copy on the page is based on what they call 'blocks'. If something appears to be in a block but is not (CSS positioning), then what?

I didn't read it that carefully, but I would think it doesn't look at the anchor text itself to determine whether it belongs to the block, but at the location of the text within the source. Of course, it's not wise for me to comment yet, but I did anyway. More reading needed on my side.
#4
Orion:

I'm a librarian (MLIS) and have been using citation analysis tools (from the Institute for Scientific Information) for many years. I haven't had a chance to finish the paper. That said, I've had many of the same thoughts (link analysis vs. citation analysis) that you write about. You (and others) might be interested in browsing/reading some of the writings of Eugene Garfield, the founder/developer of citation analysis.

"Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas."
http://www.garfield.library.upenn.ed...p108y1955.html

Dr. Garfield's Home Page and Links to Almost All of His Publications
http://www.garfield.library.upenn.edu/

The FUTURE of Citation Indexing: An Interview with Eugene Garfield
http://hypatia.slis.hawaii.edu/~jacs...-interview.pdf

More Interviews with Dr. Garfield
http://www.garfield.library.upenn.edu/interviews.html

Last edited by garyp : 07-29-2004 at 11:44 PM.
#5
The one thing this type of algorithm will do is increase the space required to rent links, i.e. you will need to put them in context instead of putting 20 in a banner. But it also means that pages would be able to influence off-topic pages more easily than under some of the other algorithms (like Teoma, for example). I really wish I knew more about search engine algorithms...
__________________
The SEO Book
#6
Hi Rustybrick,

"I read the first few pages; a very interesting angle. But how hard will it be to abuse? Segmentation of the copy on the page is based on what they call 'blocks'. If something appears to be in a block but is not (CSS positioning), then what?"

Good point. It is not clear from the article whether they are talking about blocks as perceived by humans or blocks as perceived by a crawler (I would assume the latter, since they are talking about web nodes). I use the concept of "passages" (or blocks, if you wish) as interpreted by crawlers and as part of self-similar (fractal-like) web subgraphs.

Hi Gary,

Gary, you are awesome at researching literature. Happy to hear I'm not alone. I'm quite familiar with Dr. Eugene Garfield's research and with the arguments for and against impact factors in the literature. Still, they are what most Research I university administrators look at to allocate resources.

Looking back, the analogy that a link citation is a vote of importance similar to a literature citation was a carefully crafted lie, and many marketers bought it because it ran along with their vested interests. This is not the only fallacy embedded in the PageRank metric; there are more from the theoretical standpoint, but those are another twenty bucks (otros veinte dolares).

Orion

Last edited by orion : 07-30-2004 at 10:45 AM.
#7
That this research is a collaborative work between Microsoft and Yahoo makes me think about the future of search.

Yesterday Microsoft released its hard drive search technology to compete with Google and to wipe out a dozen small desktop clustered search technologies. Now this, published today:
http://www.marketwatch.com/news/yhoo...B77A46BD6AE%7D

Microsoft not targeting Yahoo? I don't believe in coincidences, but I don't want to speculate either. What's going on here? What do you think, guys?

Orion
#8
Orion:

I posted this and a couple of other links (including a list of MS search patents and other technical papers) in this thread in the MSN Forum last night at about 8pm.
#9
Great! I'll take a quick look at the links.

Looking back, I may have to take back my concerns, before giving the impression that I stated Microsoft and Yahoo are becoming research buddies. I don't have all the facts before me and I could be wrong. (Lord, protect us.)

Orion
#10
Passage analysis is definitely valuable
I've been thinking about the problem of long documents and free-text search for a while now, because it just doesn't work very well. There are problems with distance between search terms, even for those engines which can get you to the correct page for the first match. Just being confronted with a long document is less useful than a shorter, more focused chunk of text, not to mention the SEO advantages.
I very much agree that breaking long documents into topical passages works better for search on all levels and, where there are links, would improve the citation-based relevance weights.
#11
Hi, Avi.

Welcome to the thread. Yes, this is where the next generation of search engines is heading (finally). You would enjoy the term vector thread and, when we get there, the discussion on passages in the keywords co-occurrence thread (both SEW threads).

Orion
#12
First: What do you understand by "Semantic connectivity"? Is this what you have in mind as a better alternative to "link citation"?

Second: As far as I've understood the article, it still relies on inlinks and linkpops - but now based on a smaller unit than a page. So, where is the fundamental difference?
#13
Hi, sfk. Welcome to the thread. Did you get my reply to the private message? Let's take that through regular email.

About this post, let's start with the second point first.

1. "As far as I've understood the article, it still relies on inlinks and linkpops - but now based on a smaller unit than a page. So, where is the fundamental difference?"

Yes, Microsoft's research article relies on links. (I'm now researching some of their background work in this area.) After reading the original work several times, it is clear they use a block approach to documents. Their approach consists of dissecting a document into blocks and conducting link and semantic analyses of each block. Then they construct block-to-page and page-to-block subgraphs. What they call blocks (effectively portions of documents) I call passages. The difference is that I use the term passages in the readability sense, i.e. the OKAPI and Dale-Chall sense (with some modifications).

2. "What do you understand by "Semantic connectivity"?"

As explained in terms of standard Latent Semantic Indexing (LSI) theory.

3. "Is this what you have in mind as a better alternative to "link citation"?"

No. I'm working on research toward new theories on term co-occurrence, sequencing and semantic connectivity. The goal is to understand how incidents of term co-occurrence and term sequences affect semantics and information retrieval. I use self-similar concepts (e.g., fractal distributions of frequencies) to analyze documents and collections of documents. A hint is given in my last page on term vector theory at my research site. I believe it can be incorporated into a modified vector model. If we can eliminate the main cause of failure of link citation models (these models fail miserably in the presence of commercial link noise), the framework could be incorporated into a link model, but those are another twenty bucks.

You may want to check the two threads I have initiated (the term vector and semantic connectivity threads). I'm trying to run the threads very slowly and explain things in non-technical terms, since most posters are marketers and not scientists like you, me and others. My goal is to help SEO/SEM marketers become more aware of scientific tools and of IR concepts. I believe it is time for them to use less speculative approaches and more of the Scientific Method. I hope this helps.

Orion

PS. It's getting late here. I will check the geo-tag subject tomorrow, sfk. It sounds pretty interesting. Take care.

Last edited by orion : 08-02-2004 at 10:51 PM.
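For readers who want to see what page-to-block and block-to-page subgraphs could look like in practice, here is a minimal Python sketch. It is an illustration only, not the method from the Microsoft report: the names X and Z, the toy pages and blocks, the uniform block weights, and the choice to multiply the two matrices are all assumptions made for this example.

```python
# Toy data (hypothetical): each block belongs to one page and links to some pages.
pages = ["p0", "p1", "p2"]
blocks = {            # block id -> (owning page, pages it links to)
    "b0": ("p0", ["p1"]),
    "b1": ("p0", ["p2", "p1"]),
    "b2": ("p1", ["p2"]),
    "b3": ("p2", []),
}
block_ids = sorted(blocks)

def page_to_block():
    """X[i][j]: weight of block j inside page i (assumed uniform per page)."""
    X = [[0.0] * len(block_ids) for _ in pages]
    for i, page in enumerate(pages):
        owned = [j for j, b in enumerate(block_ids) if blocks[b][0] == page]
        for j in owned:
            X[i][j] = 1.0 / len(owned)
    return X

def block_to_page():
    """Z[j][k]: share of block j's out-links that point to page k."""
    Z = [[0.0] * len(pages) for _ in block_ids]
    for j, b in enumerate(block_ids):
        targets = blocks[b][1]
        for target in targets:
            Z[j][pages.index(target)] += 1.0 / len(targets)
    return Z

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X, Z = page_to_block(), block_to_page()
page_graph = matmul(X, Z)    # page -> page weights, routed through blocks
block_graph = matmul(Z, X)   # block -> block weights, routed through pages
```

In this toy setup a page's outgoing weight is split across its blocks first, so a link sitting in one of two blocks carries half the page's weight rather than all of it. Any link analysis (PageRank-style or otherwise) could then be run on either graph.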
#14
More More More
Greetings everyone. I was doing a bit more research and discovered that another closely related technical report was published in June. All of the authors but one are the same.
This MS Tech Report paper is titled: Block-based Web Search
#15
Hi, Gary.

We got the same finding (the PDF version of the paper). Enlightening, isn't it? They use passage segmentation to narrow down semantics, too. This is where we are heading. Avi (Searchtools), check it out. They state a well-known truth you have also pointed out: the problem with long documents. The known fact that cosine similarity measures work well with short documents and with passages, but not with very long, multi-topic documents, is one reason why we need to look at passage strategies. They did not use stemming, and with good reason. It will be interesting to see how, aside from TREC collections, their method works with noisy (commercial, link-farm) collections.

Orion
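The point about cosine similarity and long, multi-topic documents can be seen with a small Python sketch. This is only an illustration with raw term counts and made-up text; the passages, the query and the helper names are assumptions for the example, and it is not how the Microsoft paper scores passages.

```python
import math
from collections import Counter

def tf(text: str) -> Counter:
    """Raw term-frequency vector from whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A hypothetical multi-topic document made of three unrelated passages.
passages = [
    "cheap car insurance quotes for arizona drivers",
    "hotel discounts and travel deals in europe",
    "recipes for summer salads and grilled fish",
]
query = tf("car insurance arizona")

doc_score = cosine(query, tf(" ".join(passages)))       # whole document as one unit
passage_scores = [cosine(query, tf(p)) for p in passages]
best_passage_score = max(passage_scores)
```

Here the on-topic passage scores higher against the query than the whole document does, because the off-topic passages lengthen the document vector without adding any matching terms.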
#16
Hi, there.

Most definitely the concept of link/semantic segmentation of documents into passages affects anchor links. Accordingly, the following two threads may be interrelated.

Thread A: Block-level Link Analysis: http://forums.searchenginewatch.com/...read.php?t=832
Thread B: Anchor Texts: http://forums.searchenginewatch.com/...=7773#post7773

How are these threads related? Let's see.

In Thread A we are discussing how Microsoft researchers have developed a new framework in which documents are segmented into portions. They use the term "blocks". I use the term "passages". In another research paper Microsoft uses the term "passages", in a particular sense, to describe four models. One of the goals is to succeed where current link analysis models such as PageRank have failed, i.e. with multi-topic documents, especially long documents. The idea is to analyze links and semantics within portions of the same document. More information is given in the research papers discussed in Thread A.

Now with regard to Thread B: anchor links historically have been reserved for local navigation within a page (see the W3C site). That is, anchor links enhance accessibility and usability within a document. Consequently, if the anchors are semantically relevant, this will be algorithm-friendly to block-based IR models. If Microsoft's algorithm and similar ones others are working on reach the mainstream, the ball will be in the optimizer's court, as he/she would have to learn how to optimize anchor links that not only serve the purpose of local navigation but also maximize the semantic connectivity within passages of the same document. He/she would also have to learn how to optimize links in a given portion pointing to another portion in another document.

Now a recent poster (sfk) brought to my attention a new element in the picture: geo-tagging. If we include geo-tags in the picture, then we have a vast area for research, optimization and marketing to exploit.

Imagine searching for car insurance in Arizona, with the query related to specific passages in different documents. Or imagine hotel discount topics specifically placed in a block in several portals from different geographic locations. I know these examples are not quite the best. The point is that I can see the value of geo-tagging when we deal with anchor links pointing to different geo-organized segments, whether passages within the same document or particular passages in other documents. Thanks, sfk. (I keep researching the topic.)

For more information on geo-tagging feel free to visit sfk's thread "Anyone using geo-tagging?" at http://forums.searchenginewatch.com/...read.php?t=636

What do you think of the above? Feel free to comment at either thread (A or B).

Orion

Last edited by orion : 08-03-2004 at 06:14 PM.
#17
Thoughts on Blocks/Anchors and Geo-Tags
Orion, this is a very noteworthy and outstanding idea you mention above about block-level analysis using anchors and geo-tags (seems to me quite a bit more than 2 cents… :->)!

I’ll begin with some glossary work: You emphasized the different names ‘passage’ and ‘block’; I’ve also found “location/place/spot” in “pages/documents”. I’d personally prefer ‘block’ in the context of text documents because this is mentioned in the HTML reference. In the context of a graphical map document, ‘location’ or ‘spot’ is clearly more adequate. Still, I understand that location/spot/place are nice words for explaining the ‘block’ notion of text documents.

Now, I want to drop some thoughts on the ‘link cardinality’ (or ‘multiplicity’) of structured (text) documents. As of now it seems that the link (analysis) model assumes that a document is one consistent/homogeneous/unbreakable unit of observation. Consequently the model assumes a 1-to-1 relationship between two documents; more precisely, it’s (0..1)-to-(0..1). Introducing blocks in text documents extends this cardinality to (0..*)-to-(0..*).

Besides: It is true that long documents exist. It is said that Google only indexes the first 100 KB. But are there hints in the literature as to how much of a difference block-structuring long documents makes? How many pages are longer than those 100 KB?

Regarding geo-tags: Where do they link to, and what about ‘link cardinality’ there? In short: geo-tags are pointers to geographical locations/places, not documents, and geo-tags also seem to have (0..*)-to-(0..*) cardinality. See the other thread about this here.
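The cardinality point above can be sketched as a small data structure in Python. Everything here (the `BlockLink` record, the file names, the block labels) is hypothetical and only meant to show how several block-level links can sit behind what the page-level model sees as a single edge.

```python
from collections import defaultdict
from typing import NamedTuple

class BlockLink(NamedTuple):
    """One link at block granularity (hypothetical record layout)."""
    src_page: str
    src_block: str
    dst_page: str
    dst_block: str   # "" when the link targets the page as a whole

block_links = [
    BlockLink("a.html", "nav", "b.html", ""),
    BlockLink("a.html", "body", "b.html", "pricing"),
    BlockLink("a.html", "body", "b.html", "faq"),
    BlockLink("a.html", "footer", "c.html", ""),
]

def page_level(links):
    """Collapse to the page model: at most one edge per ordered page pair."""
    return {(link.src_page, link.dst_page) for link in links}

def multiplicity(links):
    """Count how many block-level links sit behind each page-level edge."""
    counts = defaultdict(int)
    for link in links:
        counts[(link.src_page, link.dst_page)] += 1
    return dict(counts)

page_edges = page_level(block_links)
edge_counts = multiplicity(block_links)
```

In this toy data the page model keeps two edges, while the block model keeps all four links, three of them between the same pair of pages.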
#18
Hi, sfk.

Superb thread you are running. Excellent. Actually, in the PDF version of the second Microsoft research paper referenced by Gary, they push for the term passages in a special sense. My use of the term passages is different. The term is standard in readability studies (Dale-Chall and Okapi-based measures), and this is how I use it, but with variable window sizes. The length of the passages seems to matter when varied according to scaling concepts. If one encounters documents in which scaling distributions are observed, then we must look at semantics and term co-occurrence in a different way. Many long documents appear to have random distributions at long, fixed length scales. However, at different length scales, a pattern becomes apparent. We have found such cases (though not in every document) in long documents with a lot of commercial noise. In well-structured documents, especially short ones, we do not observe that scaling matters. I'm trying to understand whether or not this is an artifact of the methods and restrictions employed.

Orion
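As a rough illustration of looking at term co-occurrence at several length scales, here is a minimal Python sketch. It is not Orion's method: the fixed, non-overlapping windows, the window sizes and the toy token list are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tokens, window):
    """Count unordered term pairs that share a fixed-size, non-overlapping window."""
    counts = Counter()
    for start in range(0, len(tokens), window):
        terms = sorted(set(tokens[start:start + window]))
        counts.update(combinations(terms, 2))
    return counts

def profile(tokens, pair, windows=(2, 4, 8, 16)):
    """Co-occurrence count of one term pair at each window size."""
    key = tuple(sorted(pair))
    return {w: cooccurrence(tokens, w)[key] for w in windows}

tokens = (
    "car loans insurance quotes car rental insurance claims "
    "hotel deals car insurance"
).split()
by_scale = profile(tokens, ("car", "insurance"))
```

With this toy text the count for the pair changes with the window size instead of staying flat, which is the kind of scale dependence the post describes. Sliding (overlapping) windows or sentence-based passages would be equally reasonable choices and would give different numbers.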
#19
Quote:
Quote:
#20
Searchtools, I most definitely agree with you on the subject of short documents and not-too-heavy usage of stemming. The problem I keep seeing with these link citation models (PageRank, HITS, and now Microsoft's block model) is that they work well under controlled laboratory conditions but keep breaking down on the commercial Web.

Once one tests them with documents of all sorts of lengths and topics, and with lots of noisy data as found on the commercial Web, they just keep breaking down. It is clear that document segmentation into passages may solve, at least in part, the problem of noisy collections, but it would also create other problems. Still, Microsoft's block model is a step in the right direction and I can see why it can succeed where PageRank has failed. I cannot wait to see their model in action on the Web.

On other, unrelated matters: these link citation models should take into consideration all sorts of link structures that are out there on the Web. I don't buy the "patch" approach used by Google, in which once in a while they come up with external penalty functions and banning actions to purify results. Many view this as more or less arbitrary in nature. From the operational side, it is not cost-effective either. The bar keeps getting higher for them month after month.

A valid link citation model should

1. account for all sorts of link structures on the Web (random linkage, cross pollination, link farms, web rings, link patterns within link patterns or fractal-like structures, etc.);
2. penalize the required link structures, but as a natural consequence of applying the scoring system.

To me the Web is a dynamical system, and Web traffic has two components: random and deterministic. Any link citation model should account for this duality in users' behavior. Penalizing pre-patterned link structures just because the structures break down the link model is like cooking the books (or the model).

Two years ago we wrote a review (http://www.miislita.com/searchito/wpssreview.html) of Diligenti's work on WPSS, in which the Italian group proposed a model based on web surfing dynamics. In this review, we identified three types of traffic interactions relevant to link citation models:

1. Surfer-surfer interactions
2. Surfer-structure interactions
3. Structure-structure interactions

Their link citation model effectively pushes relevant documents to the top of the search results without the need for arbitrary penalty functions. Apparently their model is still in the lab, since it has not reached the mainstream. The model was presented at WWW2002: http://www2002.org/CDROM/refereed/629/

I would love to incorporate a passage component into their model.

Orion

Last edited by orion : 08-05-2004 at 09:39 PM.