#1
Thanks, Nacho, for the phone conversation. I'm stealing the "101 part" from one of your threads, buddy.

In these threads, we already discussed Microsoft's new block-level technology for extracting semantics from web pages: block-level link analysis Themed Sites Level of Importance

However, I found these threads quite technical. I was looking for a simple way to describe their technology, so we can all understand it and discuss it. I came across this news link http://www.technologyreview.com/arti...rnb_100604.asp which explains the block-level algo in simple terms. Rather than rehash these threads, let's discuss the practical aspects of the block-level algorithm. Wearing the mod hat, I'll present some challenging questions and let others comment.

First, the article. The news says in part:

"Researchers from the University of Chicago and Microsoft Research Asia have devised a system that analyzes Web content at the level of blocks of information on a page rather than coarser page-level. This allows for a model of the relationships between Web pages that shows the intrinsic semantic structure of the Web. The method could lead to more accurate search engines, according to the researchers. The researchers use their previously-developed Vision-Based Page Segmentation algorithm to delineate different parts of a Web page based on how a human views a page. The algorithm segments pages by horizontal and vertical lines, and blocks of content are weighted by page position. Advertisement links, for example, count for less than links from central content blocks."

Questions

Assuming someone wants to target MSN:

1. How would this impact your web design habits?
2. Do you think CSS will impact MSN's algo?
3. Do you think the algo is susceptible to gaming strategies? If so, which ones?
4. How would block level impact the building of theme sites?
5. What's your take on advertisement links and links from central content?

Orion

Last edited by orion : 10-11-2004 at 08:57 PM. Reason: typo

#2
Nice post Orion, I'll take no.2:

Yes. If they are going to be analysing links based on where they appear in the HTML code, then I for one can place them wherever I wish on the page but have them exactly where they need to be in the code. Absolute child's play.

I'll make a comment on no.3 too: Sure it is, it would just raise the bar a little, and for me personally that will be a good thing.

Nick

#3
Not sure this is purely speculative.
A few answers
1. It won't change the way I design pages.
2. CSS creates blocks of code which would be easy to identify and evaluate.
3. Yes, but I'm not going to tell as to which ones I think will work.
4. Block-level analysis shouldn't impact the building of themed sites at all. It might have a big impact on non-themed directories, though.

#4
I'll take a shot at #5.

I bet SEOs will figure out a way to obtain the link popularity and weight they are looking for through text ad links. It might take some time, but first MSN needs to deploy it for it to be broken.

#5
Last year, in one of my articles about various Google things, they acknowledged the idea that they could do block-level style analysis as well. Didn't say they WERE doing it -- just the usual "that's one of the things that could always be possible" statements.
I actually hate the term block level analysis. For whatever reason, it doesn't suggest to my ear the idea that a page is going to be analyzed in parts. Of course, I don't have any great ideas on another term. Quote:
For example, take all the little links that run in the navigation here in the forums. They don't read as natural copy, in the way the "content" of a forums discussion might. I suspect part of link analysis might be to discount links that appear to be navigational in nature, on the basis of not being near natural language content. I also suspect you might reinforce this if you see common cues across a series of pages -- the links always in the same place, same font size, etc. Quote:
Quote:
Quote:
Quote:
Interestingly, I got the impression that Google's "named entities" might be a similar thing, but not tied to links. This is something they talked about recently: Google Demos Word Clustering. It sounded to me like they were, by analyzing language rather than visual cues, trying to understand what the core content of a page is.
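
To make the navigational-link idea from this post concrete, here is a minimal Python sketch. It is only an illustration of the speculation above, not anything a search engine is known to do: the function name `link_weights`, the position labels, the sample URLs, and both thresholds (`nav_share`, `nav_weight`) are made up for the example.

```python
from collections import Counter


def link_weights(pages, nav_share=0.8, nav_weight=0.1):
    """Down-weight links that recur in the same spot across a site.

    pages maps a page URL to a list of (block position, href) pairs.
    A pair that shows up on at least nav_share of the pages is treated
    as navigation and gets nav_weight; every other link keeps 1.0.
    Both thresholds are hypothetical illustration values.
    """
    seen = Counter()
    for links in pages.values():
        seen.update(set(links))  # count each pair once per page
    cutoff = nav_share * len(pages)
    return {
        url: [(pos, href, nav_weight if seen[(pos, href)] >= cutoff else 1.0)
              for pos, href in links]
        for url, links in pages.items()
    }


# Three hypothetical forum pages sharing the same top navigation.
site = {
    "/thread/1": [("top", "/forums"), ("top", "/search"),
                  ("body", "http://example.com/a")],
    "/thread/2": [("top", "/forums"), ("top", "/search"),
                  ("body", "http://example.com/b")],
    "/thread/3": [("top", "/forums"), ("top", "/search")],
}
weights = link_weights(site)
```

In this toy data the two "top" links appear on every page and are discounted, while the links inside the post bodies keep full weight. A real system would presumably also look at font size, surrounding text, and so on, as the post suggests.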

#6
Excellent observations.

I don't like their block term either, but now most are using it. I prefer to stick to the notion of passages in the old readability sense when thinking of blocks (with some modifications).

Back in 2003 I was looking for grad textbooks and came across Andy King's masterpiece "Speed Up Your Site - Web Site Optimization" (1st Edition, 2003, New Riders). In chapter 8, page 188, under "Raising Relevance", he discusses a simple CSS technique for reverse-positioning coded content in order to raise relevance.

One approach/component of Microsoft's algo consists of trying to take into consideration human readability and visual positioning, i.e. the way users see page content. From their papers, it is hard to tell whether they have considered CSS positioning or usability issues. I find this important for their block algo. It will be interesting to see how this model performs in the presence of commercial noise or external interests.

Orion

Last edited by orion : 10-12-2004 at 12:03 PM. Reason: typo

#7
Hi, Mikkel. I got your PM, thanks. I'll probably be attending one of the SES next year, not sure yet.

Hi, guys. I found the following information very interesting. This early work about Microsoft's VIPS (VIsion-based Page Segmentation), "ImageSeer: Clustering and Searching WWW Images Using Link and Page Layout Analysis" ftp://ftp.research.microsoft.com/pub/tr/TR-2004-38.pdf, provides more clues about the block-level model. Note the difference between link structure and page layout. The graphs of this paper are very revealing.

In the paper, they used the HTML structure to identify horizontal and vertical segments but not to build a DOM-based link model. Although in this work they applied the model to image retrieval, the basics of their model are clearer than in recent papers.

The main purpose of the original work was not to score the importance value of pages but to construct better subgraphs of the Web that are faster to crawl and to mine than PageRank and similar models. To do this they considered three different relationships:

a. block-to-page (link structure); i.e., a block linking to a document
b. page-to-block (page layout); i.e., a document linking to a block
c. block-to-image (inclusion); i.e., a block linking to an image

Based on these "jumps" we end up walking three different graphs:

a. page-to-page graph
b. block-to-block graph
c. image-to-image graph

See Figure 3 of the VIPS paper.

Block-to-Page (Link Structure)

In the particular case of image retrieval (block-to-page), links outside the image blocks are considered noisy links. Consider this: images co-occurring in a given block are likely to be topically related or on-topic. Therefore, links inside image blocks are relevant while links outside are more likely to be irrelevant. Thus, it is possible to discriminate between links within an image block and noisy links outside the block; e.g., links inside navigation menus, advertisements, etc.

Page-to-Block (Layout)

Page-to-block could be viewed as an attempt to quantify intuition and perception. For average users, big, centered blocks are perceived as more important than small blocks placed at the margins or corners. Let

S = size of block b in page p
D = distance from the center of b to the center of the screen
a = normalization factor

Then the importance of every block b in page p is given by the working expression

fp(b) = a*(S/D)

Thus, fp(b) can be taken as the probability that a user is focused on the block b when looking at page p. (Note: they use different symbols in the VIPS paper.)

I will stop here for now.

Challenging questions

1. Do you think this model would produce faster graphs to crawl? If so, why?
2. What's your take on noisy links from the marketing standpoint?
3. What do you think of their attempt at quantifying block importance, etc.?

Orion

PS. Some references

D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "Extracting content structure for web pages based on visual representation", Proc. 5th Asia Pacific Web Conference, Xi'an, China, 2003.
D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "VIPS: a vision-based page segmentation algorithm", Microsoft Technical Report, MSR-TR-2003-79, 2003.

Last edited by orion : 10-12-2004 at 07:26 PM. Reason: typo
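
For anyone who wants to play with the working expression fp(b) = a*(S/D) from the post above, here is a minimal Python sketch. Several things in it are assumptions and not taken from the paper: S is computed as the block's pixel area, D as the straight-line distance between centers, a is picked so the scores on one page add up to 1 (which fits the "probability" reading), and `min_distance` is a made-up floor so a perfectly centered block does not divide by zero. The `Block` class and the sample layout are hypothetical too.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Block:
    """A rectangular page block in screen coordinates (pixels)."""
    name: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def block_importance(blocks, screen_w, screen_h, min_distance=1.0):
    """Return {block name: fp(b)} with fp(b) = a * S / D.

    S is the block area and D the distance from the block center to
    the screen center. The factor a is chosen (assumption) so that the
    values sum to 1 over the page. min_distance is a hypothetical
    floor on D to avoid division by zero.
    """
    cx, cy = screen_w / 2, screen_h / 2
    raw = {}
    for b in blocks:
        size = b.width * b.height
        bx, by = b.x + b.width / 2, b.y + b.height / 2
        dist = max(hypot(bx - cx, by - cy), min_distance)
        raw[b.name] = size / dist
    total = sum(raw.values())
    a = 1.0 / total if total else 0.0
    return {name: a * value for name, value in raw.items()}


# A made-up three-block layout on a 1000 x 700 screen.
page = [
    Block("content", x=220, y=120, width=560, height=500),
    Block("nav", x=0, y=100, width=180, height=500),
    Block("ads", x=820, y=100, width=180, height=200),
]
scores = block_importance(page, screen_w=1000, screen_h=700)
```

With this layout the central content block ends up with by far the largest share, the side navigation comes second, and the small ad block in the corner comes last. Note how strongly S/D rewards blocks close to the center, which is worth keeping in mind for the CSS positioning questions raised earlier in the thread.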

#8
I’m not sure if it’s too late to take a shot at the first set of questions, but here goes . . .
Quote:
Quote:
The other day I took my blue dog down the street, where my friends with other blue dogs like to meet. It’s always nice to see Joe’s blue dog because it has spots all over him. Funny to say, John’s HUGE BLUE DOG had a cold and sneezed all over us. You get the picture . . . haven’t we all seen sites like this? Web developers/designers also need to grow up. Quote:
Quote:
Quote:

#9
And I'll take a shot at the last three as well
Quote:
Quote:
Quote:

#10
Hi, Nacho. Happy to read your comments.

I forgot to include two more questions (4, 5) in the last post. Here they are:

4. What's your take on Microsoft's efforts to build some form of AI (artificial intelligence) into a search engine; e.g., their attempt at emulating human readability and the perceived importance of portions (blocks) of pages? (Sorry for the long sentence.)
5. From the client side, how would usability be affected or come into the picture?

Orion

Last edited by orion : 10-13-2004 at 10:48 AM. Reason: typos

#11
Thank you Orion, these are all very interesting questions.
Quote:
It must be a very delicate concept to implement. What if two pages look identical, but one is made up 100% of images and the other is an 80% text / 20% image combination? To the human eye the pages are identical, so the engine must play an objective role and score them the same, right? Quote:

#12
Image search is fascinating. I was speaking with someone over at IBM a few weeks ago, and he was telling me about a technology they have developed (or are developing) to read images like a human would and associate them with keywords.

I did not get into the details. I was hoping Orion knew the name of the technology (my contact told me, but it slipped my mind) and had any papers on it for reference. Thanks.

#13
I don't remember either. Not sure if this jogs your memory, but just in case, check the links.

Intelligent Miner Visualization
WebFountain

The WebFountain is a large-scale project. Coincidentally, one WebFountain expert is Laurent Chavet, the same Chavet that went to work for Microsoft and was charged a few months ago in the AV case.

About the Intelligent Miner Visualization

The Intelligent Miner presents the results of data-mining functions and statistical functions. Customized visualizers are available for depicting clustering, tree classification, or association analyses. Each visualizer deploys various types of diagrams and color-coding techniques to facilitate the comprehension of complex data and relationships.

About WebFountain

WebFountain processes and analyzes billions of documents and hundreds of terabytes of information by using an efficient and scalable software and hardware architecture.

Orion

Last edited by orion : 10-13-2004 at 11:29 PM.

#14
In this recent work, "Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information", the authors expand on image visualization. They discuss applications to image searches.

"By using a vision-based page segmentation algorithm, a web page is partitioned into blocks, and the textual and link information of an image can be accurately extracted from the block containing that image. By using block-level link analysis techniques, an image graph can be constructed. We then apply spectral techniques to find a Euclidean embedding of the images which respects the graph structure. Thus for each image, we have three kinds of representations, i.e. visual feature based representation, textual feature based representation and graph based representation. Using spectral clustering techniques, we can cluster the search results into different semantic clusters. An image search example illustrates the potential of these techniques."

Orion

Last edited by orion : 10-13-2004 at 11:31 PM.
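
To see how an image graph could fall out of block-level link analysis, here is a small Python sketch built on the three relations from the earlier ImageSeer post (page-to-block, block-to-page, block-to-image). Treat it as one plausible reading, not the authors' implementation: chaining the relations by multiplying weights along each "jump" is an assumption here, and the helper names (`compose`, `transpose`), page and image names, and all weights are invented for illustration.

```python
from collections import defaultdict


def compose(left, right):
    """Chain two weighted relations stored as {source: {target: weight}}.

    The weight from a source to a final target is the sum, over all
    intermediate nodes, of the product of the two step weights
    (a sparse matrix product).
    """
    out = defaultdict(lambda: defaultdict(float))
    for row, mids in left.items():
        for mid, w1 in mids.items():
            for col, w2 in right.get(mid, {}).items():
                out[row][col] += w1 * w2
    return {row: dict(cols) for row, cols in out.items()}


def transpose(rel):
    """Reverse the direction of every edge in a relation."""
    out = defaultdict(dict)
    for row, cols in rel.items():
        for col, w in cols.items():
            out[col][row] = w
    return dict(out)


# page -> block (layout): weight = importance of the block in its page
page_to_block = {
    "p1": {"p1.main": 0.8, "p1.nav": 0.2},
    "p2": {"p2.main": 1.0},
}
# block -> page (link structure): weight split evenly over the links
block_to_page = {
    "p1.main": {"p2": 1.0},
    "p1.nav": {"p1": 0.5, "p2": 0.5},
    "p2.main": {"p1": 1.0},
}
# block -> image (inclusion)
block_to_image = {
    "p1.main": {"cat.jpg": 1.0},
    "p2.main": {"kitten.jpg": 1.0},
}

page_graph = compose(page_to_block, block_to_page)    # page -> page
block_graph = compose(block_to_page, page_to_block)   # block -> block
image_graph = compose(                                 # image -> image
    compose(transpose(block_to_image), block_graph), block_to_image
)
```

In this toy example the link from the main block of p1 counts for much more than the link from its navigation block, and the two images end up connected only through the content blocks that contain them. The spectral embedding and clustering steps mentioned in the abstract are not covered by this sketch.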

#15
Sounds exciting! Any predictions as to when we might see this (6 months - year, more)?

#16
Thanks Orion, I will try to touch base with my IBM contact this week and get the research papers. Very interesting area in search, very...

#17
Got the name, it is called "Masala". It is not limited to image search, but according to this C|Net Article Masala will "help people retrieve foreign-language documents, 3D and 2D drawings, old e-mails and other hard-to-find material from the nether regions of their hard drives."
Based on my discussion with the IBM individual (who is not a technical guy, by the way), Masala searches images by associating the shapes, colors, etc. with known objects in its image collection.

#18
This recent article by Andrew Goodman at SEW has "Search Headlines" links at the bottom. One of them points to an article named "IBM Masala heats up search sector".

#19
Excellent findings, Rusty.

From what I have, Masala (an Indian word meaning a "mixture of spices") provides a single view of information assets, independent of data type and location. It enables users to grab data from products of such vendors as Oracle, Microsoft, Documentum and others.

According to http://news.oreillynet.com/pub/n/Masala Masala is a "new version of its DB2 Information Integrator software that will let corporate employees retrieve information from databases, applications and the Web at the same time. Subsequent improvements will include a data-mining component code-named Criollo."

While Masala is a retrieval software solution, the block-level technology developed by Microsoft is aimed mostly at generating Web graphs that are faster to crawl. (At least this is where their technology is now. Tomorrow, who knows.) Block level, when applied to image searches, is used to construct image graphs from the Web. Thus, image search is just one application of the block-level technology.

Orion

Last edited by orion : 10-14-2004 at 10:12 PM. Reason: typo

#20
What I like the most about the "block level technology" concept, if I understand it correctly, is that you could have a web page with multiple topics and each topic within the page could do well in the SERPs based on its content. Am I seeing that right?