Search Engine Watch Forums > General Search Issues > Search Technology & Relevancy
Old 07-29-2004   #1
garyp
 
Join Date: Jun 2004
Posts: 265
Block-level Link Analysis

A recently published technical report from Microsoft Research (how timely!)
8 pages.

Block-level Link Analysis

Quote:
Link Analysis has shown great potential in improving the performance of web search. PageRank and HITS are two of the most popular algorithms. Most of the existing link analysis algorithms treat a web page as a single node in the web graph. However, in most cases, a web page contains multiple semantics and hence the web page might not be considered as the atomic node. In this paper, the web page is partitioned into blocks using the vision-based page segmentation algorithm. By extracting the page-to-block, block-to-page relationships from link structure and page layout analysis, we can construct a semantic graph over the WWW such that each node exactly represents a single semantic topic. This graph can better describe the semantic structure of the web. Based on block-level link analysis, we proposed two new algorithms, Block Level PageRank and Block Level HITS, whose performances we study extensively using web data.
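For readers who want a concrete picture, here is a minimal sketch (my own illustration, not the authors' code) of the core idea: a link's vote is weighted by the importance of the block it sits in, rather than split evenly across the whole page. The graph, block weights, and damping value below are all invented for the example.

```python
# Illustrative block-level PageRank sketch (hypothetical data, not the paper's code).
# Each page is split into blocks; each link belongs to a block, and each block
# carries an importance weight (e.g., from layout analysis). A link's vote is
# scaled by its block's weight instead of being split evenly across the page.

DAMPING = 0.85

# page -> list of (block weight, links out of that block); weights sum to 1 per page
pages = {
    "A": [(0.7, ["B"]), (0.3, ["C"])],   # main-content block links to B, sidebar to C
    "B": [(1.0, ["A", "C"])],
    "C": [(1.0, ["A"])],
}

def block_level_pagerank(pages, iters=50):
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - DAMPING) / n for p in pages}
        for p, blocks in pages.items():
            for weight, links in blocks:
                if not links:
                    continue
                share = rank[p] * weight / len(links)   # block weight scales the vote
                for q in links:
                    new[q] += DAMPING * share
        rank = new
    return rank

ranks = block_level_pagerank(pages)
```

Note the only difference from vanilla PageRank is the `weight` factor: a link in a low-weight block (a sidebar, a banner) passes less rank than one in the main content block.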
Old 07-29-2004   #2
orion

Join Date: Jun 2004
Posts: 1,044
Excellent finding, Gary.

I did a quick reading. It is very enlightening. I see they are still using tf values and the underlying ideas introduced by the Robertson/Sparck Jones term weights. What they call "blocks" is what I call "passages" in pages. This is where we are all heading: to semantic blocks, or passages of information, semantically connected across the web as a graph of (semantic) nodes.

Now my two cents.

The article destroys one of the fallacies PageRank has carried from the get-go: that link citation is, or can be considered, a vote of citation importance. The general analogy that link citation is something like literature citation (Impact Factors) was always a carefully marketed idea, one which may work under controlled laboratory conditions (an IR system free from noise), but not on the noisy commercial web, full of vested interests of all kinds.

Literature citation is driven by peer review and editorial policies. On the commercial web, where anyone can say anything at any time, or add/change/remove links at will (or buy links), linkage is mostly driven by commercial and vested interests. I never bought the marketing line embedded in the PageRank metric, purely for theoretical reasons.

Semantic connectivity arguments are one of the reasons.

I welcome the Microsoft work... Still I wonder: how long may it take for marketers to dilute this algo, too?

Orion

Last edited by orion : 07-29-2004 at 10:18 PM.
Old 07-29-2004   #3
rustybrick

Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
I read the first few pages; a very interesting angle. But how hard will it be to abuse? Segments of copy on the page are based on what they call 'blocks'. If something appears to be in a block but is not (CSS positioning), then what?

I didn't read it that carefully, but I would think it doesn't look at the anchor text itself to determine whether it belongs to the block, but at the location of the text within the source.

Of course, it's not wise for me to comment yet, but I did anyway. More reading needed on my side.
Old 07-29-2004   #4
garyp

Join Date: Jun 2004
Posts: 265
Orion:
I'm a librarian/MLIS and have been using citation analysis tools (from the Institute for Scientific Information) for many years.

I haven't had a chance to finish the paper. That said, I've had many of the same thoughts (link analysis vs. citation analysis) that you write about.

You (and others) might be interested in browsing/reading some of the writings of Eugene Garfield, the founder/developer of citation analysis.

"Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas."
http://www.garfield.library.upenn.ed...p108y1955.html

Dr. Garfield's Home Page and Links to Almost All of His Publications
http://www.garfield.library.upenn.edu/

The FUTURE of Citation Indexing: An Interview with Eugene Garfield
http://hypatia.slis.hawaii.edu/~jacs...-interview.pdf

More Interviews with Dr. Garfield
http://www.garfield.library.upenn.edu/interviews.html

Last edited by garyp : 07-29-2004 at 10:44 PM.
Old 07-29-2004   #5
seobook
I'm blogging this

Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
Quote:
Originally Posted by orion
I welcome the Microsoft work... Still I wonder: how long may it take for marketers to dilute this algo, too?
Believe it or not, many people are renting links in a manner that already dilutes this algorithm.

The one thing this type of algorithm will do is increase the space required to rent links, i.e., you will need to put them in context versus putting 20 in a banner. But it also means that pages would be able to influence off-topic pages more easily than with some of the other algorithms (like Teoma, for example).

I really wish I knew more about search engine algorithms...
__________________
The SEO Book
Old 07-29-2004   #6
orion

Join Date: Jun 2004
Posts: 1,044

Hi Rustybrick

"I read the first few pages, a very interesting angle. But how hard will it be able to abuse? Segments of copy on the page is based on what they call 'blocks'. If something appears to be in a block, but is not (css positioning) then what?"

Good point. It is not clear from the article if they are talking about blocks as perceived by humans or blocks as perceived by a crawler (I would assume this will be the case since they are talking about web nodes)

I use the concept of "passages" (or blocks if you wish) as interpreted by crawlers and part of self-similar web subgraphs (fractal-like).

Hi Gary

Gary, you are awesome at researching the literature. Happy to hear I'm not alone.

I'm well familiar with Dr. Eugene Garfield's research and with the arguments for and against impact factors in the literature. Still, it is what most Research I university administrators look at to allocate resources.

Looking back, the analogy, in the sense that link citation is a vote of importance similar to literature citation, was a carefully crafted lie, and many marketers bought it because it ran along with their vested interests. This is not the only fallacy embedded in the PageRank metric; there are more from the theoretical standpoint, but those are another twenty bucks (otros veinte dólares).

Orion

Last edited by orion : 07-30-2004 at 09:45 AM.
Old 07-30-2004   #7
orion

Join Date: Jun 2004
Posts: 1,044

Considering that this research is a collaborative work between Microsoft and Yahoo makes me think about the future of search.

Yesterday Microsoft released its hard-drive search technology to compete with Google and to wipe out a dozen small desktop clustered-search technologies. Now this, published today:

http://www.marketwatch.com/news/yhoo...B77A46BD6AE%7D

Microsoft not targeting Yahoo? I don't believe in coincidences, but I don't want to speculate either. What's going on here? What do you think, guys?

Orion
Old 07-30-2004   #8
garyp

Join Date: Jun 2004
Posts: 265
Orion:
I posted this and a couple of other links (including a list of MS search patents and other technical papers) in this thread in the MSN Forum last night at about 8pm.
Old 07-30-2004   #9
orion

Join Date: Jun 2004
Posts: 1,044

Great! I'll take a quick look at the links.

Looking back, I may have to take back my concerns, before giving the impression that I stated Microsoft and Yahoo are becoming research buddies. I don't have all the facts before me and I could be wrong. (Lord, protect us.)

Orion
Old 08-02-2004   #10
searchtools
enterprise search analyst

Join Date: Jun 2004
Posts: 4
Passage analysis is definitely valuable

I've been thinking about the problem of long documents and free-text search for a while now, because it just doesn't work very well. There are problems with distance between search terms, even for those engines which can get you to the correct page for the first match. Just being confronted with a long document is less useful than a shorter, more focused chunk of text, not to mention the SEO advantages.

I very much agree that breaking long documents into topical passages works better for search on all levels, and where there are links, would improve the citation-based relevance weights.
Old 08-02-2004   #11
orion

Join Date: Jun 2004
Posts: 1,044

Hi, Avi.

Welcome to the thread. Yes, this is where the next-generation search engine is heading (finally). You would enjoy the term vector thread and, when we get there, the discussion on passages in the keyword co-occurrence thread (both SEW threads).

Orion
Old 08-02-2004   #12
sfk
search engine researcher

Join Date: Jul 2004
Posts: 18
Quote:
Originally Posted by orion
The article destroys one of the fallacies PageRank has carried from the get-go: that link citation is, or can be considered, a vote of citation importance.
(...)
Literature citation is driven by peer review and editorial policies. On the commercial web, where anyone can say anything at any time, or add/change/remove links at will (or buy links), linkage is mostly driven by commercial and vested interests. I never bought the marketing line embedded in the PageRank metric, purely for theoretical reasons.

Semantic connectivity arguments are one of the reasons.
Orion
I am always looking for better algorithms than link citation... but I can't follow you here, Orion:

First: What do you understand by "semantic connectivity"? Is this what you have in mind as a better alternative to "link citation"?

Second: As far as I've understood the article, it still relies on inlinks and link popularity, but now based on a smaller unit than a page. So where is the fundamental difference?
Old 08-02-2004   #13
orion

Join Date: Jun 2004
Posts: 1,044

Hi, sfk. Welcome to the thread. Did you get my reply to the private message? Let's take that through regular email.

About this post, let's start with your second point first.

1. "As far as I've understood the article, it still relies on inlinks and linkpops - but now based on a smaller unit than a page. So, where is the fundamental difference?"

Yes, Microsoft's research article relies on links. (I'm now researching some of their background work in this area.) After reading the original work several times, it is clear they use a block approach to documents. Their approach consists of dissecting a document into blocks and conducting link and semantic analyses of each block. Then they construct block-to-page and page-to-block subgraphs.

What they call blocks (effectively portions of documents) I call passages. The difference is that I use the term passages in the readability sense, i.e., the Okapi and Dale-Chall sense (with some modifications).
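For illustration, here is a toy sketch of passage extraction as sliding word windows. The window and stride sizes are arbitrary choices of mine, not Okapi or Dale-Chall parameters.

```python
# Toy passage extractor: overlapping word windows of variable size.
# The window/stride values are arbitrary illustrations only.

def passages(text, window=50, stride=25):
    words = text.split()
    if len(words) <= window:
        return [" ".join(words)]
    out = []
    for start in range(0, len(words) - window + 1, stride):
        out.append(" ".join(words[start:start + window]))
    # make sure the tail of the document is covered
    if (len(words) - window) % stride:
        out.append(" ".join(words[-window:]))
    return out

# a 120-word dummy document yields four overlapping 50-word passages
chunks = passages(" ".join(str(i) for i in range(120)))
```

Each passage, rather than the whole page, would then become the unit that is scored, linked from, or linked to.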


2. "What do you understand with "Semantic connectivity"?

As explained in terms of standard Latent Semantic (LSI) theory.


3. "Is this what you have in mind as a better alternative to "link citation"?"

No. I'm working on research toward new theories on terms co-occurrence, sequencing and semantic connectivity. The goal is to understand how incidents of terms co-occurrence and term sequences affect semantics and IR retrieval.

I use self-similar concepts (eg., fractal distributions of frequencies) to analyze documents and collection of documents. A hint is given in my last page on term vector theory at my research site. I believe it can be incorporated into a modified vector model. If we can eliminate the main cause of failure of link citation models (these link citation models fail miserably in the presence of commercial link noise) the framework could be incorporated into a link model, but these are another twenty bucks.


You may want to check my two threads I have initiated -the term vector and semantic connectivity threads- I'm trying to run the threads very slow and explain things in non technical terms since most posters are marketers and not scientists, like you and me and others. My goal is to help SEO/SEM marketers to become more aware of scientific tools and of IR concepts. I believe it is time for them to use less speculative approaches and more the Scientific Method.


I hope this help.


Orion


PS. It's getting late here. I will check the geo-tag subject tomorrow, sfk. It sounds pretty interesting. Take care.

Last edited by orion : 08-02-2004 at 09:51 PM.
Old 08-02-2004   #14
garyp

Join Date: Jun 2004
Posts: 265
More More More

Greetings, everyone. I was doing a bit more research and discovered that another closely related tech report was published in June. All of the authors but one are the same.

This MS Tech Report paper is titled:


Block-based Web Search


Quote:
Multiple-topic and varying-length of web pages are two negative factors significantly affecting the performance of web search. In this paper, we explore the use of page segmentation algorithms to partition web pages into blocks and investigate how to take advantage of block-level evidence to improve retrieval performance in the web context. Because of the special characteristics of web pages, different page segmentation method will have different impact on web search performance. We compare four types of methods, including fixed-length page segmentation, DOM-based page segmentation, vision-based page segmentation, and a combined method which integrates both semantic and fixed-length properties. Experiments on block-level query expansion and retrieval are performed. Among the four approaches, the combined method achieves the best performance for web search. Our experimental results also show that such a semantic partitioning of web pages effectively deals with the problem of multiple drifting topics and mixed lengths, and thus has great potential to boost up the performance of current web search engines.
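Of the four approaches the abstract compares, DOM-based segmentation is the easiest to sketch. The following is a much-simplified illustration of my own (not the paper's implementation): treat the text under each block-level HTML element as one candidate block.

```python
# Rough DOM-based segmentation sketch: collect the text under each
# block-level element (<p>, <div>, <td>, ...) as one candidate block.
# A simplification for illustration, not the paper's algorithm.
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}

class BlockSegmenter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_data(self, data):
        self._buf.append(data)

def segment(html):
    seg = BlockSegmenter()
    seg.feed(html)
    seg._flush()          # capture any trailing text
    return seg.blocks

blocks = segment("<div><p>one topic here</p><p>another topic</p></div>")
```

Fixed-length segmentation would instead cut every N words regardless of markup; the paper's combined method mixes both signals.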
Old 08-02-2004   #15
orion

Join Date: Jun 2004
Posts: 1,044

Hi, Gary.

We got the same finding (the PDF version of the paper). Enlightening, isn't it? They use passage segmentation to narrow down semantics, too. This is where we are heading.

Avi (Searchtools), check it out. They state a well-known truth you have also pointed out: the problem with long documents. The known fact that cosine similarity measures work well with short documents and with passages, but not with very long, multi-topic documents, is one reason why we need to look at passage strategies.
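The cosine point can be made concrete with a toy term-frequency example (all data invented by me): a long multi-topic document dilutes the query terms, while its best passage still scores well.

```python
# Toy demonstration: cosine similarity of raw term-frequency vectors.
# A long multi-topic document dilutes the query terms; scoring its best
# passage instead recovers the match. Example data is invented.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = Counter("block level link analysis".split())

passage_on_topic = Counter("block level link analysis of web pages".split())
off_topic_filler = Counter(("stock market weather sports recipes travel " * 10).split())
whole_document = passage_on_topic + off_topic_filler   # one long multi-topic page

best_passage_score = cosine(query, passage_on_topic)
whole_doc_score = cosine(query, whole_document)
```

The on-topic passage scores an order of magnitude higher against the query than the whole page that contains it, which is exactly the dilution problem passage strategies address.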

They did not use stemming, and with good reason. It will be interesting to see how, aside from TREC collections, their method works with noisy (commercial, link-farm) collections.


Orion
Old 08-03-2004   #16
orion

Join Date: Jun 2004
Posts: 1,044

Hi, there.


Most definitely, the concept of link/semantic segmentation of documents into passages affects anchor links. Accordingly, the following two threads may be interrelated.


Thread A: Block-level Link Analysis: http://forums.searchenginewatch.com/...read.php?t=832

Thread B: Anchor Texts: http://forums.searchenginewatch.com/...=7773#post7773

How are these threads related? Let's see. In thread A we are discussing how Microsoft researchers have developed a new framework in which documents are segmented into portions. They use the term "blocks"; I use the term "passages". In another research paper, Microsoft uses the term "passages" in a particular sense to describe four models. One of the goals is to succeed where current link analysis models such as PageRank have failed, i.e., with multi-topic documents, especially long ones. The idea is to analyze links and semantics within portions of the same document. More information is given in the research papers discussed in thread A.

Now, with regard to thread B: anchor links have historically been reserved for local navigation within a page (see the W3C site). That is, anchor links enhance accessibility and usability within a document. Consequently, if the anchors are semantically relevant, this will be algorithm-friendly to block-based IR models.

If Microsoft's algorithm, and similar algorithms others are working on, reach the mainstream, the ball will be in the optimizer's court, as he/she would have to learn how to optimize anchor links that not only serve the purpose of local navigation but also maximize the semantic connectivity within passages of the same document. He/she would also have to learn how to optimize links in a given portion pointing to another portion in another document.

Now a recent poster (sfk) brought to my attention a new element in the picture: geo-tagging. If we include geo-tags in the picture, then we have a vast area for research, optimization and marketing to exploit. Imagine searching for car insurance in Arizona and having it related to specific passages in different documents. Or imagine hotel discount topics specifically placed in a block in several portals from different geographic locations. I know these examples are not quite the best. The point is that I can see the value of geo-tagging when we deal with anchor links pointing to different geo-organized segments, pointing to passages within the same document or to particular passages in other documents. Thanks, sfk. (I keep researching the topic.)

For more information on geo-tagging, feel free to visit sfk's thread "Anyone using geo-tagging?" at http://forums.searchenginewatch.com/...read.php?t=636

What do you think of the above? Feel free to comment in either thread (A or B).

Orion

Last edited by orion : 08-03-2004 at 05:14 PM.
Old 08-04-2004   #17
sfk
search engine researcher

Join Date: Jul 2004
Posts: 18
Thoughts on Blocks/Anchors and Geo-Tags

Orion, this is a very noteworthy and outstanding idea you are mentioning above about block-level analysis using anchors and geo-tags (it seems to me quite a bit more than 2 cents… :->)!

I'll begin with some glossary work: you emphasized the different names 'passage' and 'block'; I've also found "location/place/spot" used for the "pages/documents". I'd personally prefer 'block' in the context of text documents, because it is mentioned in the HTML reference. In the context of a graphical map document, 'location' or 'spot' is clearly more adequate. Still, I understand that location/spot/place are nice words for explaining the 'block' notion of text documents.

Now I want to drop some thoughts on the 'link cardinality' (or multiplicity) of structured (text) documents.

Until now, the link (analysis) model has assumed that a document is one consistent/homogeneous/unbreakable unit of observation. Consequently, the model assumes a 1-to-1 relationship between two documents; more precisely, it's (0..1)-to-(0..1). Introducing blocks in text documents extends this cardinality to (0..*)-to-(0..*).
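The cardinality observation can be written down as data structures (hypothetical example of mine): in the page-level model a link relates page to page, while in the block-level model the unit of observation becomes a (page, block) pair, giving a many-to-many relation.

```python
# Page-level model: one node per page, so links are simply page -> pages.
page_links = {
    "A": {"B"},            # each link endpoint is a whole page
}

# Block-level model: the unit of observation is (page, block_id), giving a
# many-to-many, i.e. (0..*)-to-(0..*), relation. Example data is invented.
block_links = {
    ("A", 0): {("B", 2), ("B", 3)},   # block 0 of A cites two blocks of B
    ("A", 1): {("C", 0)},             # block 1 of A cites a block of C
    ("B", 2): {("A", 0), ("C", 0)},
}

def inlinks(block_links, target):
    """All (page, block) sources pointing at `target`."""
    return {src for src, outs in block_links.items() if target in outs}

sources = inlinks(block_links, ("C", 0))
```

A single pair of pages can now be connected by several distinct block-to-block links, which is precisely what the (0..*)-to-(0..*) cardinality expresses.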

Besides: it is true that long documents exist. It is said that Google only indexes the first 100 KB. But are there hints in the literature as to how much of a difference block-structuring long documents makes? How many pages are longer than those 100 KB?

Regarding geo-tags: to where are they linking, and what about 'link cardinality' there? In short: geo-tags are pointers to geographical locations/places, not documents, and the link cardinality of geo-tags also seems to be (0..*)-to-(0..*). See the other thread about this.
Old 08-04-2004   #18
orion

Join Date: Jun 2004
Posts: 1,044

Hi, sfk

Superb thread you are running. Excellent.

Actually, in the PDF version of Microsoft's second research paper referenced by Gary, they push for the term passages in a special sense. My use of the term passages is different: the term is standard in readability studies (Dale-Chall- and Okapi-based measures), and this is how I use it, but with variable window sizes.

The length of the passages seems to matter when varied according to scaling concepts. If one encounters documents in which scaling distributions are observed, then we must look at semantics and term co-occurrence in a different way. Many long documents appear to have random distributions at long, fixed length scales. However, at different length scales, a pattern becomes apparent.

We have found such cases (not all of them) in long documents with a lot of commercial noise. In well-structured documents, especially short ones, we do not observe that scaling matters. I'm trying to understand whether or not this is an artifact of the methods and restrictions employed.

Orion
Old 08-05-2004   #19
searchtools
enterprise search analyst

Join Date: Jun 2004
Posts: 4
Quote:
Originally Posted by orion
They state a well known truth you have also pointed out: the problem with long documents. The known fact that cosine similarity measures work well with short documents and in passages, but not with very long, multi-topic documents is one reason as to why we need to look at passage strategies.
Orion
Short passages are great for focus and for finding similarities (cosine, vector, whatever), but there is always the problem of very, very short queries (one to four words).


Quote:
Originally Posted by orion
They did not use stemming, and with good reason. It will be interesting to see how, aside from TREC collections, their method works with noisy (commercial, link-farm) collections.
Orion
In small passages, you desperately need lightweight pluralization. Not big, fancy stemming, and not at index time, but human-oriented searching of both singular and plural forms. Google can mostly get away without it because they use incoming links to get the other form. But compare looking up "blocks links analyses" (153,000 hits) vs. "block link analysis" (1,260,000 hits).
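A lightweight pluralization pass of the kind described here could be as small as the sketch below (naive English rules of my own; a real engine would need a much larger exception table):

```python
# Naive singular/plural expansion for query terms at search time.
# Handles only regular English patterns; irregular words like
# "analyses" -> "analysis" come from a tiny exception table.
# A sketch, not production code.

IRREGULAR = {"analyses": "analysis", "indices": "index", "corpora": "corpus"}

def variants(term):
    forms = {term}
    if term in IRREGULAR:
        forms.add(IRREGULAR[term])
    elif term.endswith("ies") and len(term) > 3:
        forms.add(term[:-3] + "y")        # queries -> query
    elif term.endswith("es") and term[-3] in "sxz":
        forms.add(term[:-2])              # boxes -> box
    elif term.endswith("s") and not term.endswith("ss"):
        forms.add(term[:-1])              # blocks -> block
    else:
        forms.add(term + "s")             # block -> blocks
    return forms

def expand_query(query):
    return [variants(t) for t in query.lower().split()]
```

Running each query term through `variants` at search time gives the "both singular and plural" behavior without any index-time stemming.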
Old 08-05-2004   #20
orion

Join Date: Jun 2004
Posts: 1,044

Searchtools, I most definitely agree with you on the subject of short documents and not-too-heavy usage of stemming. The problem I keep seeing with these link citation models (PageRank, HITS, and now Microsoft's block model) is that they work well under controlled laboratory conditions but keep breaking down on the commercial Web.

Once one tests them with documents of all sorts of lengths and topics, with lots of noisy data as found on the commercial Web, they just keep breaking down. It is clear that document segmentation into passages may solve, at least in part, the problem of noisy collections, but it would also create other problems. Still, Microsoft's block model is a step in the right direction, and I can see why it can succeed where PageRank has failed. I cannot wait to see their model in action on the Web.

On other, unrelated matters:

These link citation models should take into consideration all sorts of link structures that are out there on the Web.

I don't buy the "patch" approach used by Google, in which once in a while they come up with external penalty functions and banning actions to purify results. Many view this as more or less arbitrary in nature. From the operational side, it is not cost-effective either. The bar keeps getting higher for them month after month. A valid link citation model should

1. account for all sorts of link structures on the Web (random linkage, cross-pollination, link farms, web rings, link patterns within link patterns or fractal-like structures, etc.), and

2. penalize the required link structures, but as a natural consequence of applying the scoring system.

To me the Web is a dynamical system, and so Web traffic has two components: random and deterministic. Any link citation model should account for this duality in users' behavior. Penalizing pre-patterned link structures just because the structures break down the link model is like cooking the books (or the model).
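Incidentally, this random/deterministic duality is exactly what the damping factor in the classic random-surfer model interpolates between: with probability d the surfer deterministically follows the link structure, and with probability 1 - d they jump at random. A minimal simulation on a toy graph of my own:

```python
# Monte Carlo random surfer on a toy graph: each step mixes a deterministic
# component (follow an outlink, probability d) with a random component
# (teleport to any page, probability 1 - d). Visit frequencies approximate
# PageRank. Graph and parameters are illustrative only.
import random

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def surf(links, steps=200_000, d=0.85, seed=42):
    rng = random.Random(seed)
    nodes = list(links)
    visits = {n: 0 for n in nodes}
    page = nodes[0]
    for _ in range(steps):
        if rng.random() < d and links[page]:
            page = rng.choice(links[page])     # deterministic: follow structure
        else:
            page = rng.choice(nodes)           # random: teleport anywhere
        visits[page] += 1
    return {n: visits[n] / steps for n in nodes}

freq = surf(links)
```

Tuning d shifts the balance between the two traffic components, which is one way a model could "account for this duality" instead of bolting on external penalties.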

Two years ago we wrote a review (http://www.miislita.com/searchito/wpssreview.html) of Diligenti's work on WPSS, in which the Italian group proposed a model based on web surfing dynamics.

In this review, we identified three types of traffic interactions relevant to link citation models:

1. Surfer-surfer interactions
2. Surfer-structure interactions
3. Structure-structure interactions

Their link citation model effectively pushes relevant documents to the top of the search results without the need for arbitrary penalty functions. Apparently their model is still in the lab, since it has not reached the mainstream. The model was presented at WWW2002: http://www2002.org/CDROM/refereed/629/

I would love to incorporate a passage component into their model.


Orion

Last edited by orion : 08-05-2004 at 08:39 PM.