Block Analysis 101


orion
10-11-2004, 09:54 PM
Thanks, Nacho, for the phone conversation. I'm stealing the "101" part from one of your threads, buddy.

In these threads, we already discussed Microsoft's new block level technology for extracting semantics from web pages:

block-level link analysis (forums.searchenginewatch.com/showthread.php?t=832&highlight=orion)
Themed Sites Level of Importance (forums.searchenginewatch.com/showthread.php?t=1288&highlight=orion)

However, I found those threads quite technical. I was looking for a simple way to describe their technology, so we can all understand it and discuss it. I came across this news link

http://www.technologyreview.com/articles/04/10/rnb_100604.asp

which explains the block-level algo in simple terms.

Rather than a rehash of these threads, let's discuss the practical aspects of the block-level algorithm. Wearing the mod hat, I'll present some challenging questions and let others comment. First, the article. The news says in part

"Researchers from the University of Chicago and Microsoft Research Asia have devised a system that analyzes Web content at the level of blocks of information on a page rather than coarser page-level. This allows for a model of the relationships between Web pages that shows the intrinsic semantic structure of the Web. The method could lead to more accurate search engines, according to the researchers. The researchers use their previously-developed Vision-Based Page Segmentation algorithm to delineate different parts of a Web page based on how a human views a page. The algorithm segments pages by horizontal and vertical lines, and blocks of content are weighted by page position. Advertisement links, for example, count for less than links from central content blocks."

Questions

Assuming someone wants to target MSN

1. How would this impact your web design habits?
2. Do you think CSS will impact MSN's algo?
3. Do you think the algo is susceptible to gaming strategies? If so, which one?
4. How would block level impact the building of theme sites?
5. What's your take on advertisement links and links from central content?

Orion

Nick W
10-12-2004, 03:13 AM
Nice post Orion, I'll take no. 2:

Yes. If they are going to be analysing links based on where they appear in the HTML code, then I for one can place them wherever I wish on the page but have them exactly where they need to be in the code. Absolute child's play.

I'll make a comment on no.3 too:

Sure it is, it would just raise the bar a little, and for me personally that will be a good thing.

Nick

Kali
10-12-2004, 06:49 AM
A few answers

1. It won't change the way I design pages

2. CSS creates blocks of code which would be easy to identify and evaluate.

3. Yes - but I'm not going to tell as to which ones I think will work.

4. Block level analysis shouldn't impact the building of themed sites at all - might have a big impact on non-themed directories though.

rustybrick
10-12-2004, 09:35 AM
I'll take a shot at #5.

I bet SEOs will figure out a way to obtain the link popularity and weight they are looking for through text ad links. It might take some time, but first MSN needs to deploy it before it can be broken.

dannysullivan
10-12-2004, 10:17 AM
Last year, in one of my articles about various Google things, they acknowledged the idea that they could do block-level style analysis as well. Didn't say they WERE doing it -- just the usual "that's one of the things that could always be possible" statements.

I actually hate the term block level analysis. For whatever reason, it doesn't suggest to my ear the idea that a page is going to be analyzed in parts. Of course, I don't have any great ideas on another term.

1. How would this impact your web design habits?
That implies that the only way this is done is to try and see a page visually, as a human might. I suspect it is already happening and is not visually-tied.

For example, take all the little links that run in the navigation here in the forums. They don't read as natural copy, in the way the "content" of a forums discussion might. I suspect part of link analysis might be to discount links that appear to be navigational in nature, on the basis of not being near natural language content. I also suspect you might reinforce this if you see common cues across a series of pages -- the links always in the same place, same font size, etc.

2. Do you think CSS will impact MSN's algo?
I.e., CSS could make things seem visually appealing to a user while the underlying non-CSS HTML code might try to paint another picture for the search engine. First thought is that I don't think it will be only visual cues that are used. Second thought is that I think search engines will eventually grow up to understand CSS better.

3. Do you think the algo is susceptible to gaming strategies? If so, which one?
Sure, as Nick W notes, he and others would look at ways to game it. Any system that essentially invites the webmaster into the process and gives feedback in the form of how a page ranks is going to be susceptible. But it will make it a bit harder, let link analysis be nursed out a bit longer -- plus it still won't be the only thing used.

4. How would block level impact the building of theme sites?
Since no major search engine I know of is currently trying to give a particular page a ranking boost because other pages within a "site" are on the same theme, it shouldn't have any impact.

5. What's your take on advertisement links and links from central content?
Back to my earlier reply -- if these can be seen as "unnatural" or not part of the core content -- using both visual and language cues -- I think they'll be discounted. Not banned, not ignored -- just not weighted as highly.

Interestingly, I got the impression that Google's "named entities" might be a similar thing, but not tied to links. This is something they talked about recently: Google Demos Word Clustering (http://blog.searchenginewatch.com/blog/041008-246). It sounded to me like they were, by analyzing language rather than visual cues, trying to understand what the core content of a page is.

orion
10-12-2004, 01:00 PM
Excellent observations.

I don't like their block term either but now most are using it. I prefer to stick to the notion of passages in the old readability sense when thinking of blocks (with some modifications).

Back in 2003 I was looking for grad textbooks and came across Andy King's masterpiece Speed Up Your Site - Web Site Optimization (www.websiteoptimization.com) (1st Edition, 2003, New Riders). In chapter 8, page 188 under "Raising Relevance", he discusses a simple CSS technique for reverse-positioning coded content in order to raise relevance.

One approach/component of Microsoft's algo consists of trying to take into consideration human readability and visual positioning, i.e., the way users see page content. From their papers, it is hard to tell whether they have considered CSS positioning or usability issues. I find this important for their block algo.

It will be interesting to see how this model performs in the presence of commercial noise or external interests.


Orion

orion
10-12-2004, 05:44 PM
Hi, Mikkel. I got your PM, thanks. I'll probably be attending one of the SES next year, not sure yet.

Hi, guys. I found the following information very interesting.

This early work about Microsoft's VIPS (VIsion-based Page Segmentation)
"ImageSeer: Clustering and Searching WWW Images
Using Link and Page Layout Analysis"
ftp://ftp.research.microsoft.com/pub/tr/TR-2004-38.pdf

provides more clues about the block-level model. Note the difference between link structure and page layout. The graphs of this paper are very revealing.

In the paper, they used the HTML structure to identify horizontal and vertical segments but not to build a DOM-based link model. Although in this work they applied the model to image retrieval, the basics of their model are clearer than in recent papers.

The main purpose of the original work was not to score the importance value of pages but to construct better subgraphs of the Web that are faster to crawl and to mine than PageRank and similar models. To do this they considered three different relationships

a. block-to-page (link structure); i.e., a block linking to a document
b. page-to-block (page layout); i.e., a document linking to a block
c. block-to-image (inclusion); i.e., a block linking to an image

Based on these "jumps" we end up walking three different graphs

a. page-to-page graph
b. block-to-block graph
c. image-to-image graph

See Figure 3 of the VIPS paper.
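To make the "jumps" a bit more concrete, here is a rough sketch of how two of the basic relations could be composed into the page-to-page and block-to-block graphs. The matrix names X (page-to-block) and Z (block-to-page), the toy numbers, and the idea of taking the products X·Z and Z·X are my own reading of the block-level link analysis papers, not something spelled out in this thread, so treat all of it as an assumption. The image-to-image graph is left out to keep it short.

```python
# Toy example: 2 pages, 3 blocks. All numbers are made up for illustration.
# X[p][b]: page-to-block (layout) weight, how important block b is within page p.
# Z[b][p]: block-to-page (link) weight, how strongly block b links to page p.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Page 0 contains blocks 0 and 1; page 1 contains block 2.
X = [[0.75, 0.25, 0.0],
     [0.0, 0.0, 1.0]]

# Block 0 links to page 1; block 1 links to both pages; block 2 links to page 0.
Z = [[0.0, 1.0],
     [0.5, 0.5],
     [1.0, 0.0]]

page_graph = matmul(X, Z)    # page -> its blocks -> the pages those blocks link to
block_graph = matmul(Z, X)   # block -> the page it links to -> that page's blocks

for row in page_graph:
    print(row)
```

In this toy setup, page 0 passes most of its weight to page 1 because its big block (weight 0.75) links there, while the small block contributes much less.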

Block-to-Page (Link Structure)

In the particular case of image retrieval (block-to-page), links outside the image blocks are considered noisy links. Consider this: images co-occurring in a given block are likely to be topically-related or on-topic. Therefore, links inside image blocks are relevant while links outside are more likely to be irrelevant. Thus, it is possible to discriminate between links within an image block and noisy links outside the block; e.g., links inside navigation menus, advertisement, etc.
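A minimal sketch of that discrimination, assuming the page has already been segmented into blocks (the VIPS step itself is not shown). The Block class, the sample page, and the all-or-nothing rule "a block with images is relevant, anything else is noisy" are hypothetical simplifications of mine; the papers weight links rather than simply dropping them.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """A hypothetical page block: its name, the images in it, and its outgoing links."""
    name: str
    images: list = field(default_factory=list)
    links: list = field(default_factory=list)

def split_links(blocks):
    """Separate links found in image blocks from links found elsewhere."""
    relevant, noisy = [], []
    for block in blocks:
        target = relevant if block.images else noisy
        target.extend(block.links)
    return relevant, noisy

# Made-up page with a navigation menu, an image gallery, and an ad block.
page = [
    Block("nav", links=["/home", "/about"]),
    Block("gallery", images=["cat1.jpg", "cat2.jpg"], links=["/cats/siamese"]),
    Block("ads", links=["http://ads.example.com/buy"]),
]

relevant, noisy = split_links(page)
print("relevant:", relevant)
print("noisy:", noisy)
```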

Page-to-Block (Layout)

Page-to-block could be viewed as an attempt to quantify intuition and perception. For average users, big, centered blocks are perceived as more important than those with small size and placed at the margins or corners.

Let

S = size of block b in page p
D = distance from the center of b to the center of screen
a = normalization factor

Then the importance of every block b in page p is given by the working expression

fp(b) = a*(S/D)

Thus, fp(b) can be taken for the probability that a user is focused on the block b when looking at page p.

(Note. They use different symbols in the VIPS paper.)
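Here is a small sketch of the working expression fp(b) = a*(S/D), using the simplified symbols from this post rather than the paper's. A few things are my own assumptions: blocks are plain rectangles given as (x, y, width, height), the normalization factor a is chosen so the values on a page sum to 1 (which fits the "probability" reading), and a min_distance guard avoids dividing by zero when a block sits exactly at the screen center.

```python
import math

def block_importance(blocks, screen_center, min_distance=1.0):
    """Return fp(b) = a * (S / D) per block, with a chosen so the values sum to 1."""
    raw = {}
    for name, (x, y, width, height) in blocks.items():
        size = width * height                       # S: area of the block
        cx, cy = x + width / 2, y + height / 2      # center of the block
        dist = math.hypot(cx - screen_center[0], cy - screen_center[1])  # D
        raw[name] = size / max(dist, min_distance)  # guard against D = 0
    a = 1.0 / sum(raw.values())                     # normalization factor
    return {name: a * value for name, value in raw.items()}

# Made-up layout on an 800 x 650 page, screen center assumed at (400, 300).
blocks = {
    "content": (200, 100, 600, 500),
    "sidebar": (0, 100, 200, 500),
    "footer": (0, 600, 800, 50),
}

scores = block_importance(blocks, screen_center=(400, 300))
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

With these toy numbers the big block near the center takes most of the weight, and the thin footer far from the center gets the least, which matches the intuition described above.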

I will stop here for now.

Challenging questions

1. Do you think this model would produce faster graphs to crawl? If so, why?
2. What's your take on noisy links from the marketing standpoint?
3. What do you think of their attempt at quantifying block importance, etc.?

Orion

PS. Some references

D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “Extracting content structure for web pages based on visual representation”, Proc. 5th Asia Pacific Web Conference, Xi’an, China, 2003.

D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “VIPS: a vision-based page segmentation algorithm”, Microsoft Technical Report, MSR-TR-2003-79, 2003.

Nacho
10-13-2004, 03:52 AM
I’m not sure if it’s too late to take a shot at the first set of questions, but here goes . . .

1. How would this impact your web design habits?
Yes, I believe SEOs will always pay attention to everything it takes (ethically of course) to make pages #1 on the SERPs, but still keep them likeable to users. There is nothing wrong with finding new ways, if this is what it takes. In a sense it’s kind of like the search engines are educating us on how to build better pages that will then help organize the www’s documents more efficiently.

2. Do you think CSS will impact MSN's algo?
I agree with Danny on “search engines will eventually grow up to understand CSS better”. Web developers/designers will tend to stop forcing tags for SEO purposes and focus on a more balanced design that makes sense for the user to see. My personal opinion is, I hate to see websites that look like this:

The other day I took my blue dog down the street, where my friends with other blue dogs like to meet. It’s always nice to see Joe’s blue dog because it has spots all over him. Funny to say, John’s HUGE BLUE DOG had a cold and sneezed all over us.
You get the picture . . . haven’t we all seen sites like this? Web developers/designers also need to grow up.

3. Do you think the algo is susceptible to gaming strategies? If so, which one?
YES, but it will become harder in the future, to a point that only if you deserve to be #1 you’ll get it. I don’t think link popularity algorithms in terms of quantity will be the key though, even if they will always be important. I think the weight is going to shift more toward on-page factors and content.

4. How would block level impact the building of theme sites?
In the future, this has to be an important element of the link algorithm to make it more challenging for websites to gain true relevancy. Otherwise, we are stuck with spammers taking advantage ridiculously easily.

5. What's your take on advertisement links and links from central content?
As users tend to ignore advertisements in this type of fashion, they will need to be devalued in a link algorithm to increase relevancy for links on central content that might be more on target to the anchor text used. Search engines eventually will be able to analyze the entire paragraph to be on topic and not just a few words to the left and a few to the right.

Nacho
10-13-2004, 04:15 AM
And I'll take a shot at the last three as well :)

1. Do you think this model would produce faster graphs to crawl? If so, why?
Yes, because if the crawlers are replicating the users' movements, then there is no need to crawl the entire page (i.e., advertisement links at the bottom left navigation) and they can go from one link to the next quicker.

2. What's your take on noisy links from the marketing standpoint?
The user learns to avoid them, therefore crawlers will need to do the same and algorithms will need to devalue them. Sometimes, even on targeted content pages. For example, take the “Search Engine Watch Marketplace” (sorry Danny) and ask yourselves how many times have you clicked on ANY of them, but we spend almost every day here, right?

3. What do you think of their attempt at quantifying block importance, etc.?
This is the hardest question of all because not enough testing has been done in the real world on block analysis. It sounds great and I’m fascinated by it. However, it’s not the same to perform a perfectly executed experiment, and then have 200 million search queries depending on it. That’s where luck can be on your side or slap you in the face if sh*t hits the fan.

orion
10-13-2004, 11:47 AM
Hi, Nacho. Happy to read your comments.

I forgot to include two more questions (4, 5) in the last post. Here they are:

4. What's your take on Microsoft's efforts to build some form of AI (artificial intelligence) into a search engine; e.g., their attempt at trying to emulate human readability and perception of importance of portions (blocks) of pages? (Sorry for the long sentence.)

5. From the client side, how would usability be affected or play into the picture?

Orion

Nacho
10-13-2004, 12:26 PM
Thank you Orion, these are all very interesting questions.

4. What's your take on Microsoft's efforts to build some form of AI (artificial intelligence) into a search engine; e.g., their attempt at trying to emulate human readability and perception of importance of portions (blocks) of pages?
As a concept I think it’s brilliant, but maybe a little too overboard. (E.g., maybe adding a Porsche 911 engine into a VW Beetle is a little too much. I like the VW Bug just the way it is.) The problem I see is that all humans may have a different perception of a page, thus making it very subjective to each human’s opinion. Microsoft would be taking a few elements that are (I guess) totally necessary to try and make it objective, for example extremes: clean vs. dirty, loud vs. quiet, tall vs. short, etc.

It must be a very careful concept to implement, because what if two pages are identical but one is embedded into images 100% and the other has an 80% text / 20% image combination. However to the human eye, the pages are identical, so the engine must play an objective role and score them the same, right?

5. From the client side, how would usability be affected or play into the picture?
If web developers / designers are building pages that are a little more organized, less noisy and more balanced, then users will benefit from more efficient usability. IMO, good designers will always have the ability to lead the user to the right movements (e.g., a sign-up page, information to read, a buy button, etc.) and users will be pleased to find exactly what either met or exceeded their expectations. Otherwise, they are one click away from the great Back Button.

rustybrick
10-13-2004, 01:06 PM
Image search is fascinating. I was speaking with someone over at IBM a few weeks ago, and he was letting me know about a technology they developed or are developing to read images like a human would and associate them with keywords.

I did not get into the details. I was hoping Orion knew the name of the technology (he told me but it slipped my mind) and whether you had any reference papers on it.

Thanks.

orion
10-13-2004, 09:25 PM
I don't remember either. Not sure if this does a soft recall, but just in case check the links.

Intelligent Miner Visualization (http://www.findarticles.com/p/articles/mi_m0ISJ/is_4_42/ai_111505387)

WebFountain (http://www.findarticles.com/p/articles/mi_m0ISJ/is_1_43/ai_114367551)

The WebFountain is a large scale project. Coincidentally, one WebFountain expert is Laurent Chavet, the same Chavet that went to work for Microsoft and was charged a few months ago in the AV case.

About the Intelligent Miner Visualization

The Intelligent Miner presents the results of data-mining functions and statistical functions. Customized visualizers are available for depicting clustering, tree classification, or association analyses. Each visualizer deploys various types of diagrams and color-coding techniques to facilitate the comprehension of complex data and relationships.

About WebFountain

WebFountain processes and analyzes billions of documents and hundreds of terabytes of information by using an efficient and scalable software and hardware architecture.

Orion

orion
10-14-2004, 12:28 AM
In this recent work, Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information (http://www.cen.uiuc.edu/~dengcai2/p1568934657-cai.pdf) the authors expand on image visualization. They discuss applications to image searches.

"By using a vision-based page segmentation algorithm, a web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image. By using block-level link analysis techniques, an image graph can be constructed. We then apply spectral techniques to find a Euclidean embedding of the images which respects the graph structure. Thus for each image, we have three kinds of representations, i.e. visual feature based representation, textual feature based representation and graph based representation. Using spectral clustering techniques, we can cluster the search results into different semantic clusters. An image search example illustrates the potential of these techniques."

Orion

Nacho
10-14-2004, 12:52 AM
Sounds exciting! Any predictions as to when we might see this (6 months - year, more)?

rustybrick
10-14-2004, 01:08 AM
Thanks Orion, I will try to touch base with my IBM contact this week and get the research papers. Very interesting area in search, very...

rustybrick
10-14-2004, 06:59 PM
Got the name, it is called "Masala (http://www-306.ibm.com/software/data/integration/search_qa.html)". It is not limited to image search, but according to this C|Net Article (http://news.com.com/2100-7343_3-5198086.html) Masala will "help people retrieve foreign-language documents, 3D and 2D drawings, old e-mails and other hard-to-find material from the nether regions of their hard drives."

Based on my discussion with the IBM individual (who is not a technical guy by the way) he said Masala searches images by associating the shapes, colors, etc. with known objects in its image collection.

rustybrick
10-14-2004, 08:51 PM
This recent article (http://searchenginewatch.com/searchday/article.php/3421021) by Andrew Goodman at SEW has links at the bottom under "Search Headlines"; one of them was to an article named IBM Masala heats up search sector (http://uk.news.yahoo.com/041013/175/f4h3s.html).

orion
10-14-2004, 11:09 PM
Excellent findings, Rusty.

From what I have, Masala (http://www.internetnews.com/ent-news/article.php/3364191) (Indian word meaning a "mixture of spices") provides a single view of information assets, independent of data type and location. It enables users to grab data from products of such vendors as Oracle, Microsoft, Documentum and others.

According to http://news.oreillynet.com/pub/n/Masala
Masala is a "new version of its DB2 Information Integrator software that will let corporate employees retrieve information from databases, applications and the Web at the same time. Subsequent improvements will include a data-mining component code-named Criollo."

While Masala is a retrieval software solution, the block-level technology developed by Microsoft is aimed mostly at generating Web graphs that are faster to crawl. (At least this is where their technology is now. Tomorrow, who knows.)

Block level, when applied to image searches, is used to construct image graphs from the Web. Thus, image search is just one application of the block-level technology.

Orion

hiero
10-15-2004, 02:55 PM
What I like the most about the "block level technology" concept, if I understand it correctly, is that you could have a web page with multiple topics and each topic within the page could do well in the SERPs based on its content. Am I seeing that right?

orion
10-16-2004, 12:44 PM
I hope this helps.

I would say, let's wait and see. One can see many speculations flying around in several forums and newsletters regarding the use of this technology.

So far the technology has been used for image searching/clustering/ranking and in a limited fashion, not for SERPs on the Web. Even these fine young researchers (and brilliant grad students) have stated

"Due to the lack of sufficient resources, we are currently not able to perform image search on the whole Internet which is our ultimate goal." ImageSeer: Clustering and Searching WWW Images (http://www.cen.uiuc.edu/~dengcai2/tr-2004-38.pdf)


Most definitely this may change and based on their published papers, I don't see why block level analysis could not be used for scoring textual topics -not just on-topic images- in the near future. (If one can separate on-topic blocks of links from discourse text, I don't see why not.) The key is that -either as links or images- if they can be recognized as blocks of on-topic data, why not.

Let's review Figure 4 of ImageSeer: Clustering and Searching WWW Images (http://www.cen.uiuc.edu/~dengcai2/tr-2004-38.pdf)


They have stated:

"When the user submits a query, the system first computes the relevance score for every image and the images are ranked according to their relevance scores. For top N images, we re-rank them according to the combined scores. The re-ranked top N images are then presented to the user. Figure 4 shows the design of our system."

When a user queries their system, both Web pages and images are parsed by their VIPS parser and sent to an Indexer and Graph Constructor. The Indexer is a typical indexer that inverts the documents and assigns a relevance ranking.

The Graph Constructor builds the page-to-page, block-to-block and image-to-image graphs. The page-to-page graph is used to compute a PageRank, while an ImageRank is computed from the block-to-block and image-to-image graphs. These ranks are combined with the relevance ranking score. The image-to-image graph is also used to cluster images. Finally, the re-ranked top N images are presented to the user.
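The re-ranking step can be sketched in a few lines. The quoted text only says the top N images are re-ranked "according to the combined scores", so the linear mix with weight alpha, the function name rerank_top_n, and all the scores below are assumptions of mine, not the formula from the paper.

```python
def rerank_top_n(relevance, image_rank, n=3, alpha=0.5):
    """Take the top n images by relevance, then reorder them by a combined score."""
    # Step 1: rank every image by its relevance score and keep the top n.
    top = sorted(relevance, key=relevance.get, reverse=True)[:n]
    # Step 2: combine relevance with the link-based rank (assumed linear mix).
    combined = {
        img: alpha * relevance[img] + (1 - alpha) * image_rank.get(img, 0.0)
        for img in top
    }
    # Step 3: present the top n images in order of combined score.
    return sorted(top, key=combined.get, reverse=True)

# Made-up scores for four images.
relevance = {"a.jpg": 0.9, "b.jpg": 0.8, "c.jpg": 0.7, "d.jpg": 0.1}
image_rank = {"a.jpg": 0.1, "b.jpg": 0.6, "c.jpg": 0.2, "d.jpg": 0.9}

result = rerank_top_n(relevance, image_rank)
print(result)
```

Note that d.jpg never makes the list even though its link-based rank is the highest: the link score only reorders images that were already relevant enough to reach the top N.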

Since a given page may contain image blocks for different topics, most definitely different on/off-topic images will be clustered and ranked differently. Therefore, it will be hard for off-topic images to rank in the top N results served to the user.

BTW, in the ACM Multimedia Conference that is just ending, Microsoft's research team presented a new image clustering algorithm named Locality Preserving Clustering for Image Database (http://www.cen.uiuc.edu/~dengcai2/p1568934827-zheng.pdf)

[Note: Microsoft has submitted the patent

Method and System for Identifying Image Relatedness Using Link and page Layout Analysis (Filed, May 2004, Microsoft)]



Orion

rustybrick
10-16-2004, 08:43 PM
Orion,

This thread has made me really think. Over the weekend I have read all the papers listed here. My head is spinning (good spinning) with ideas. I am thinking of working up a little treat for this SEM industry to better understand the threats and opportunities involved in the block analysis topic.

Thanks for this thread.

Nacho
10-16-2004, 09:20 PM
I second that. The key learnings from this thread and the Block Analysis topic are up in my top 3 for this year.

Mil gracias Orion!

orion
10-16-2004, 11:58 PM
Hi, Rusty and Nacho.

Thanks for those kind words. We are all here to learn new things that would make this industry stronger and more educated.

I believe this technology is a huge advance. Although it has been applied to images in blocks, I don't see why it could not be applied to positioned passages (with minor modifications).

I got a demo of their software, but when I try to open it, it gives me an error. I contacted one of the authors to ask him about it. He emailed back that he was busy with the ACM conference. I'm waiting to contact him this week.

Orion