Search Engine Watch
SEO News

Go Back   Search Engine Watch Forums > General Search Issues > Search Technology & Relevancy
Old 10-11-2004   #1
orion
 
Join Date: Jun 2004
Posts: 1,044
Block Analysis 101

Thanks, Nacho for the phone conversation. I'm stealing the "101 part" from one of your threads, buddy.

In these threads, we already discussed Microsoft's new block level technology for extracting semantics from web pages:

block-level link analysis
Themed Sites Level of Importance

However, I found these threads quite technical. I was looking for a simple way to describe the technology, so we can all understand and discuss it. I came across this news link

http://www.technologyreview.com/arti...rnb_100604.asp

which explains the block-level algo in simple terms.

Rather than a rehash of those threads, let's discuss the practical aspects of the block-level algorithm. Wearing the mod hat, I'll present some challenging questions and let others comment. First, the article. The news item says, in part:

"Researchers from the University of Chicago and Microsoft Research Asia have devised a system that analyzes Web content at the level of blocks of information on a page, rather than at the coarser page level. This allows for a model of the relationships between Web pages that shows the intrinsic semantic structure of the Web. The method could lead to more accurate search engines, according to the researchers. The researchers use their previously-developed Vision-Based Page Segmentation algorithm to delineate different parts of a Web page based on how a human views a page. The algorithm segments pages by horizontal and vertical lines, and blocks of content are weighted by page position. Advertisement links, for example, count for less than links from central content blocks."

Questions

Assuming someone wants to target MSN

1. How would this impact your web design habits?
2. Do you think CSS will impact MSN's algo?
3. Do you think the algo is susceptible to gaming strategies? If so, which ones?
4. How would block-level analysis impact the building of themed sites?
5. What's your take on advertisement links and links from central content?

Orion

Last edited by orion : 10-11-2004 at 08:57 PM. Reason: typo
Old 10-12-2004   #2
Nick W
Member

Join Date: Jun 2004
Posts: 593
Nice post Orion, I'll take no. 2:

Yes. If they are going to be analysing links based on where they appear in the HTML code, then I for one can place them wherever I wish on the page but have them exactly where they need to be in the code. Absolute child's play.

I'll make a comment on no. 3 too:

Sure it is; it would just raise the bar a little, and for me personally that will be a good thing.

Nick
Old 10-12-2004   #3
Kali
Ohh Bondage .......

Join Date: Oct 2004
Location: Here
Posts: 11
Not sure; this is purely speculative.

A few answers:

1. It won't change the way I design pages.

2. CSS creates blocks of code which would be easy to identify and evaluate.

3. Yes - but I'm not going to say which ones I think will work.

4. Block-level analysis shouldn't impact the building of themed sites at all - it might have a big impact on non-themed directories, though.
Old 10-12-2004   #4
rustybrick

Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
I'll take a shot at #5.

I bet SEOs will figure out a way to obtain the link popularity and weight they are looking for through text ad links. It might take some time, but MSN first needs to deploy it before it can be broken.
Old 10-12-2004   #5
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)

Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
Last year, in one of my articles about various Google things, they acknowledged that they could do block-level style analysis as well. They didn't say they WERE doing it -- just the usual "that's one of the things that could always be possible" statements.

I actually hate the term block level analysis. For whatever reason, it doesn't suggest to my ear the idea that a page is going to be analyzed in parts. Of course, I don't have any great ideas on another term.

Quote:
1. How would this impact your web design habits?
That implies that the only way this is done is to try to see a page visually, as a human might. I suspect it is already happening and is not visually tied.

For example, take all the little links that run in the navigation here in the forums. They don't read as natural copy, in the way the "content" of a forums discussion might. I suspect part of link analysis might be to discount links that appear to be navigational in nature, on the basis of not being near natural language content. I also suspect you might reinforce this if you see common cues across a series of pages -- the links always in the same place, same font size, etc.
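Purely as a sketch of that cross-page-cue idea (the tuple fields, threshold, and function here are my own invention, not anything MSN or Google has described):

```python
from collections import Counter

def navigational_links(pages, min_share=0.8):
    """Flag a link as navigational when the same (url, position) pair
    recurs on most pages of a site -- the "common cues across a series
    of pages" heuristic, in toy form.
    `pages` is a list of pages, each a list of (url, position) tuples."""
    counts = Counter(link for page in pages for link in set(page))
    cutoff = min_share * len(pages)
    return {link for link, n in counts.items() if n >= cutoff}


site = [
    [("/forum", "top-nav"), ("/thread/42", "content")],
    [("/forum", "top-nav"), ("/thread/99", "content")],
    [("/forum", "top-nav")],
]
print(navigational_links(site))  # only the repeated top-nav link is flagged
```

A real system would obviously look at rendered position and font cues rather than a hand-made tuple, but the repetition signal is the same.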

Quote:
2. Do you think CSS will impact MSN's algo?
I.e., CSS could make things seem visually appealing to a user while the underlying non-CSS HTML code might try to paint another picture for the search engine. My first thought is that I don't think it will be only visual cues that are used. My second thought is that search engines will eventually grow up to understand CSS better.

Quote:
3. Do you think the algo is susceptible to gaming strategies? If so, which ones?
Sure, as Nick W notes, he and others would look at ways to game it. Any system that essentially invites the webmaster into the process and gives feedback in the form of how a page ranks is going to be susceptible. But it will make it a bit harder, let link analysis be nursed out a bit longer -- plus it still won't be the only thing used.

Quote:
4. How would block-level analysis impact the building of themed sites?
Since no major search engine I know of is currently trying to give a particular page a ranking boost because other pages within a "site" are on the same theme, it shouldn't have any impact.

Quote:
5. What's your take on advertisement links and links from central content?
Back to my earlier reply -- if these can be seen as "unnatural" or not part of the core content -- using both visual and language cues -- I think they'll be discounted. Not banned, not ignored -- just not weighted as highly.

Interestingly, I got the impression that Google's "named entities" might be a similar thing, but not tied to links. This is something they talked about recently: Google Demos Word Clustering. It sounded to me like they were, by analyzing language rather than visual cues, trying to understand what the core content of a page is.
Old 10-12-2004   #6
orion

Join Date: Jun 2004
Posts: 1,044

Excellent observations.

I don't like their "block" term either, but now most are using it. I prefer to stick to the notion of passages in the old readability sense when thinking of blocks (with some modifications).

Back in 2003 I was looking for grad textbooks and came across Andy King's masterpiece Speed Up Your Site - Web Site Optimization (1st Edition, 2003, New Riders). In chapter 8, page 188 under "Raising Relevance", he discusses a simple CSS technique for reverse-positioning coded content in order to raise relevance.

One approach/component of Microsoft's algo consists of trying to take human readability and visual positioning into consideration; i.e., the way users see page content. From their papers, it is hard to tell whether or not they have considered CSS positioning or usability issues. I find this important for their block algo.

It will be interesting to see how this model performs in the presence of commercial noise or external interests.


Orion

Last edited by orion : 10-12-2004 at 12:03 PM. Reason: typo
Old 10-12-2004   #7
orion

Join Date: Jun 2004
Posts: 1,044

Hi, Mikkel. I got your PM, thanks. I'll probably be attending one of the SES next year, not sure yet.

Hi, guys. I found the following information very interesting.

This early work about Microsoft's VIPS (VIsion-based Page Segmentation)
"ImageSeer: Clustering and Searching WWW Images
Using Link and Page Layout Analysis"
ftp://ftp.research.microsoft.com/pub/tr/TR-2004-38.pdf

provides more clues about the block-level model. Note the difference between link structure and page layout. The graphs of this paper are very revealing.

In the paper, they used the HTML structure to identify horizontal and vertical segments, but not to build a DOM-based link model. Although in this work they applied the model to image retrieval, the basics of their model are clearer than in recent papers.

The main purpose of the original work was not to score the importance value of pages but to construct better subgraphs of the Web that are faster to crawl and mine than those of PageRank and similar models. To do this they considered three different relationships:

a. block-to-page (link structure); i.e., a block linking to a document
b. page-to-block (page layout); i.e., a document linking to a block
c. block-to-image (inclusion); i.e., a block linking to an image

Based on these "jumps", we end up walking three different graphs:

a. page-to-page graph
b. block-to-block graph
c. image-to-image graph

See Figure 3 of the VIPS paper.
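For concreteness, here is a toy sketch of how those jumps compose into the derived graphs. The matrix names and all the values are made up for illustration; the paper develops this formally:

```python
import numpy as np

# Toy setup: 2 pages, 3 blocks.
# Z (block-to-page, link structure): Z[b, p] > 0 when block b links to page p.
Z = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# X (page-to-block, layout): X[p, b] is the importance of block b inside
# page p (rows normalized to sum to 1).
X = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.0, 1.0]])

# Composing the two relations walks the "jumps" described above:
W_pp = X @ Z  # page-to-page: page -> its blocks -> pages those blocks link to
W_bb = Z @ X  # block-to-block: block -> page it links to -> that page's blocks

print(W_pp)
print(W_bb)
```

The point of the composition is that the page graph is no longer built from raw anchor tags: a link only carries as much weight as the block it sits in.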

Block-to-Page (Link Structure)

In the particular case of image retrieval (block-to-page), links outside the image blocks are considered noisy links. Consider this: images co-occurring in a given block are likely to be topically related, or on-topic. Therefore, links inside image blocks are relevant, while links outside them are more likely to be irrelevant. Thus, it is possible to discriminate between links within an image block and noisy links outside the block; e.g., links inside navigation menus, advertisements, etc.
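A crude way to picture that discrimination (a sketch only; the dictionary fields are invented, and the real system works on a VIPS segmentation of the DOM, not hand-set flags like these):

```python
def split_links(blocks):
    """Separate relevant links from noisy ones for image retrieval.
    `blocks` is a list of dicts like {"links": [...], "has_image": bool};
    links in image blocks are kept, everything else is treated as noise."""
    relevant, noisy = [], []
    for block in blocks:
        (relevant if block["has_image"] else noisy).extend(block["links"])
    return relevant, noisy


blocks = [
    {"links": ["gallery2.html"], "has_image": True},           # image block
    {"links": ["ads.html", "home.html"], "has_image": False},  # nav/ads block
]
relevant, noisy = split_links(blocks)
print(relevant)  # links worth following when crawling for images
print(noisy)
```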

Page-to-Block (Layout)

Page-to-block could be viewed as an attempt to quantify intuition and perception. For average users, big, centered blocks are perceived as more important than small blocks placed at the margins or corners.

Let

S = size of block b in page p
D = distance from the center of b to the center of screen
a = normalization factor

Then the importance of every block b in page p is given by the working expression

fp(b) = a*(S/D)

Thus, fp(b) can be taken as the probability that a user is focused on block b when looking at page p.

(Note. They use different symbols in the VIPS paper.)
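A few lines of Python make the working expression concrete (the sample numbers and the normalization step are my own illustration, not the paper's):

```python
def block_importance(size, dist, alpha=1.0):
    """fp(b) = a * (S / D). Guard against a block sitting exactly at the
    screen centre, where D would be zero."""
    return alpha * size / max(dist, 1e-6)


def focus_probabilities(blocks, alpha=1.0):
    """Normalize the importances over one page so they read as the
    probability that a user's focus lands on each block."""
    raw = [block_importance(s, d, alpha) for s, d in blocks]
    total = sum(raw)
    return [r / total for r in raw]


# (size, distance-to-centre) for a big central block vs. a small corner one
probs = focus_probabilities([(400, 10), (100, 50)])
print(probs)  # the big, central block dominates
```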

I will stop here for now.

Challenging questions

1. Do you think this model would produce graphs that are faster to crawl? If so, why?
2. What's your take on noisy links from the marketing standpoint?
3. What do you think of their attempt to quantify block importance, etc.?

Orion

PS. Some references

D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “Extracting content structure for web pages based on visual representation”, Proc. 5th Asia Pacific Web Conference, Xi’an, China, 2003.

D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “VIPS: a vision-based page segmentation algorithm”, Microsoft Technical Report, MSR-TR-2003-79, 2003.

Last edited by orion : 10-12-2004 at 07:26 PM. Reason: typo
Old 10-13-2004   #8
Nacho

Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
I’m not sure if it’s too late to take a shot at the first set of questions, but here goes . . .

Quote:
1. How would this impact your web design habits?
Yes, I believe SEOs will always pay attention to everything it takes (ethically, of course) to make pages #1 on the SERPs while still keeping them likeable for users. There is nothing wrong with finding new ways, if this is what it takes. In a sense, it’s as if the search engines are educating us on how to build better pages, which will then help organize the www’s documents more efficiently.

Quote:
2. Do you think CSS will impact MSN's algo?
I agree with Danny that “search engines will eventually grow up to understand CSS better”. Web developers/designers will tend to stop forcing tags for SEO purposes and focus on a more balanced design that makes sense for the user. My personal opinion is that I hate to see websites that look like this:
The other day I took my blue dog down the street, where my friends with other blue dogs like to meet. It’s always nice to see Joe’s blue dog because it has spots all over him. Funny to say, John’s HUGE BLUE DOG had a cold and sneezed all over us.
You get the picture . . . haven’t we all seen sites like this? Web developers/designers also need to grow up.

Quote:
3. Do you think the algo is susceptible to gaming strategies? If so, which ones?
YES, but it will become harder in the future, to the point that you’ll only be #1 if you deserve it. I don’t think link popularity algorithms in terms of quantity will be the key, though, even if they will always be important. I think the weight is going to shift more toward on-page factors and content.

Quote:
4. How would block-level analysis impact the building of themed sites?
In the future, this has to be an important element of the link algorithm, to make it more challenging for websites to gain true relevancy. Otherwise, we are stuck with spammers taking advantage ridiculously easily.

Quote:
5. What's your take on advertisement links and links from central content?
As users tend to ignore advertisements of this type, they will need to be devalued in a link algorithm to increase relevancy for links in central content, which might be more on target for the anchor text used. Search engines will eventually be able to analyze whether the entire paragraph is on topic, and not just a few words to the left and a few to the right.
Old 10-13-2004   #9
Nacho

Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
And I'll take a shot at the last three as well

Quote:
1. Do you think this model would produce graphs that are faster to crawl? If so, why?
Yes, because if the crawlers are replicating users’ movements, then there is no need to crawl the entire page (e.g., advertisement links in the bottom-left navigation), and they can go from one link to the next quicker.

Quote:
2. What's your take on noisy links from the marketing standpoint?
Users learn to avoid them; therefore crawlers will need to do the same, and algorithms will devalue their existence. Sometimes even on targeted content pages. For example, take the “Search Engine Watch Marketplace” (sorry, Danny) and ask yourselves how many times you have clicked on ANY of those links -- yet we spend almost every day here, right?

Quote:
3. What do you think of their attempt to quantify block importance, etc.?
This is the hardest question of all, because not enough real-world testing has been done on block analysis. It sounds great and I’m fascinated by it. However, it’s not the same to perform a perfectly executed experiment as to have 200 million search queries depending on it. That’s where luck can be on your side, or slap you in the face if sh*t hits the fan.
Old 10-13-2004   #10
orion

Join Date: Jun 2004
Posts: 1,044

Hi, Nacho. Happy to read your comments.

I forgot to include another two questions (4 and 5) in the last post. Here they are:

4. What's your take on Microsoft's efforts to build some form of AI (artificial intelligence) into a search engine; e.g., their attempt to emulate human readability and the perceived importance of portions (blocks) of pages? (Sorry for the long sentence.)

5. From the client side, how would usability be affected, or play into the picture?

Orion

Last edited by orion : 10-13-2004 at 10:48 AM. Reason: typos
Old 10-13-2004   #11
Nacho

Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Thank you Orion, these are all very interesting questions.

Quote:
4. What's your take on Microsoft's efforts to build some form of AI (artificial intelligence) into a search engine; e.g., their attempt to emulate human readability and the perceived importance of portions (blocks) of pages?
As a concept I think it’s brilliant, but maybe a little overboard. (E.g., adding a Porsche 911 engine to a VW Beetle is a little too much; I like the VW Bug just the way it is.) The problem I see is that all humans may have a different perception of a page, making it very subjective to each human’s opinion. Microsoft would be taking a few elements that are (I guess) necessary to try to make it objective, for example extremes: clean vs. dirty, loud vs. quiet, tall vs. short, etc.

It must be a very careful concept to implement. What if two pages are identical, but one is 100% embedded in images and the other is an 80% text / 20% image combination? To the human eye the pages are identical, so the engine must play an objective role and score them the same, right?

Quote:
5. From the client side, how would usability be affected, or play into the picture?
If web developers/designers are building pages that are a little more organized, less noisy and more balanced, then users will benefit from more efficient usability. IMO, good designers will always have the ability to lead the user to the right actions (e.g., a sign-up page, information to read, a buy button, etc.), and users will be pleased to find exactly what met or exceeded their expectations. Otherwise, they are one click away from the great Back Button.
Old 10-13-2004   #12
rustybrick

Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Image search is fascinating. I was speaking with someone over at IBM a few weeks ago, and he was telling me about a technology they have developed (or are developing) to read images like a human would and associate them with keywords.

I did not get into the details. I was hoping Orion knew the name of the technology (he told me, but it slipped my mind) and whether he had any papers on it for reference.

Thanks.
Old 10-13-2004   #13
orion

Join Date: Jun 2004
Posts: 1,044

I don't remember either. Not sure if this jogs the memory, but just in case, check these links.

Intelligent Miner Visualization

WebFountain

WebFountain is a large-scale project. Coincidentally, one WebFountain expert is Laurent Chavet, the same Chavet who went to work for Microsoft and was charged a few months ago in the AV case.

About the Intelligent Miner Visualization

The Intelligent Miner presents the results of data-mining functions and statistical functions. Customized visualizers are available for depicting clustering, tree classification, or association analyses. Each visualizer deploys various types of diagrams and color-coding techniques to facilitate the comprehension of complex data and relationships.

About WebFountain

WebFountain processes and analyzes billions of documents and hundreds of terabytes of information by using an efficient and scalable software and hardware architecture.

Orion

Last edited by orion : 10-13-2004 at 11:29 PM.
Old 10-13-2004   #14
orion

Join Date: Jun 2004
Posts: 1,044

In this recent work, Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information, the authors expand on image visualization and discuss applications to image search.

"By using a vision-based page segmentation algorithm, a web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image. By using block-level link analysis techniques, an image graph can be constructed. We then apply spectral techniques to find a Euclidean embedding of the images which respects the graph structure. Thus for each image, we have three kinds of representations, i.e. visual feature based representation, textual feature based representation and graph based representation. Using spectral clustering techniques, we can cluster the search results into different semantic clusters. An image search example illustrates the potential of these techniques."
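To illustrate the "Euclidean embedding of the images which respects the graph structure" step, here is a generic Laplacian-eigenmap sketch on a toy image graph (my own toy example, not the authors' exact algorithm):

```python
import numpy as np

# Toy image-to-image graph: images 0-1 and 2-3 form two tight groups
# joined by one weak edge.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])

D = np.diag(W.sum(axis=1))
L = D - W                       # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
embedding = vecs[:, 1:3]        # drop the constant eigenvector

# The coordinates place tightly linked images near each other, ready for
# k-means or another clustering step.
print(embedding[:, 0])
```

From there, "spectral clustering" is essentially k-means run on these embedded coordinates.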

Orion

Last edited by orion : 10-13-2004 at 11:31 PM.
Old 10-13-2004   #15
Nacho

Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Sounds exciting! Any predictions as to when we might see this (6 months, a year, more)?
Old 10-14-2004   #16
rustybrick

Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Thanks Orion, I will try to touch base with my IBM contact this week and get the research papers. Very interesting area in search, very...
Old 10-14-2004   #17
rustybrick

Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Got the name: it is called "Masala". It is not limited to image search; according to this C|Net article, Masala will "help people retrieve foreign-language documents, 3D and 2D drawings, old e-mails and other hard-to-find material from the nether regions of their hard drives."

Based on my discussion with the IBM individual (who is not a technical guy, by the way), he said Masala searches images by associating their shapes, colors, etc. with known objects in its image collection.
Old 10-14-2004   #18
rustybrick

Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
This recent article by Andrew Goodman at SEW has links at the bottom to "Search Headlines"; one link was to an article named "IBM Masala heats up search sector".
Old 10-14-2004   #19
orion

Join Date: Jun 2004
Posts: 1,044

Excellent findings, Rusty.

From what I have, Masala (an Indian word meaning a "mixture of spices") provides a single view of information assets, independent of data type and location. It enables users to grab data from products of vendors such as Oracle, Microsoft, Documentum and others.

According to http://news.oreillynet.com/pub/n/Masala, Masala is a "new version of its DB2 Information Integrator software that will let corporate employees retrieve information from databases, applications and the Web at the same time. Subsequent improvements will include a data-mining component code-named Criollo."

While Masala is a retrieval software solution, the block-level technology developed by Microsoft is aimed mostly at generating Web graphs that are faster to crawl. (At least this is where their technology is now. Tomorrow, who knows.)

Block-level analysis, when applied to image searches, is used to construct image graphs from the Web. Thus, image search is just one application of the block-level technology.

Orion

Last edited by orion : 10-14-2004 at 10:12 PM. Reason: typo
Old 10-15-2004   #20
hiero
If winning isn't everything, why do they keep score? --Vince Lombardi

Join Date: Aug 2004
Location: Los Angeles, California
Posts: 119

What I like most about the "block level technology" concept, if I understand it correctly, is that you could have a web page with multiple topics, and each topic within the page could do well in the SERPs based on its own content. Am I seeing that right?