Search Engine Watch
SEO News

Go Back   Search Engine Watch Forums > General Search Issues > Search Technology & Relevancy
Old 04-22-2005   #1
xan
Member
 
Join Date: Feb 2005
Posts: 238
Methods and technologies discussed at SEM (Boston)

The search engine meeting I went to was great. It's good to get away and all that, but it's also good to check out what everyone else is doing in search provision and where the focus is.

The things that were brought up time and time again were keyword burstiness for classification, user interaction for better results, and understanding the purpose of the user and their queries. The current failures of search are: empirical validation, shallow user modeling, substance, relevance, relations, rationale and the lack of unification. This was agreed on by all.

The ideas around these issues were not agreed on by all speakers and attendees. Some believe a certain amount of redundancy is a good thing: as the answer to a question may reside in more than one place, we can't be so linear in our search for relevant documents. People were divided on the issue of user interaction. I think some thought it should be passive, as in gathering information from the user via toolbars and so on, while others thought the user should be more involved. Personally, I think direct user involvement is really too laborious. On the other hand, people are scared of things that gather information. Keyword burstiness is used a lot for automated news (MSN and Google, for example), and there is talk about how it can be used for search results.
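For what it's worth, here's a toy illustration of the burstiness idea (my own sketch, not any engine's actual method - the thresholds are made up): flag terms whose frequency in a recent window of documents jumps well above their baseline rate.

```python
from collections import Counter

def bursty_terms(recent_docs, baseline_docs, min_ratio=3.0, min_count=5):
    """Flag terms whose rate in a recent window jumps well above their
    baseline rate -- a crude stand-in for burstiness detection."""
    recent = Counter(w for d in recent_docs for w in d.lower().split())
    base = Counter(w for d in baseline_docs for w in d.lower().split())
    r_total = sum(recent.values()) or 1
    b_total = sum(base.values()) or 1
    out = []
    for term, count in recent.items():
        if count < min_count:
            continue  # ignore terms too rare to call a burst
        r_rate = count / r_total
        b_rate = (base.get(term, 0) + 1) / b_total  # add-one smoothing
        if r_rate / b_rate >= min_ratio:
            out.append(term)
    return out
```

A real system would of course use time-stamped windows and a proper burst model rather than a single ratio, but the intuition is the same.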

The nice thing was that not everything was centered around internet search results, but also around QA systems and enterprise search, for example. Methods across IR and IE overlap, and it was nice to see that being discussed.

Interestingly, MSN refused to answer anything on clustering methods but claimed to be working on them. Previously, in the beta, it didn't work. It's not that easy to cluster things, as there are many dimensions, and it very much depends on the context of the query, which is really hard to identify.

A few things I found really interesting:

Jan Pedersen (Yahoo)
The efficiency of the keyword search method drops dramatically after a certain cut-off point - the results are an approximation due to the architecture of the indices. He stated that a larger index does not necessarily mean better results.

Claude Vogel
The pitfalls of a semantic engine are the complexity, the semantic links, and disambiguation. A better search has authoritative results, semantic indexing, personalization, and speed.

Susan Dumais
For search providers, search should not be the end goal. We need to understand the purpose of the users, obtain richer queries, and enable integration so we don't have to open other windows to do search within a document or another web page. Microsoft is looking into implicit queries (IQ), which are based on similarity metrics. An example is that one click can show all the relationships of a document to the rest of an email inbox, including emails, query words, author,... The basic equation for a similarity score is score = tf_doc / log(tf_corpus + 1).
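Read literally, that score can be sketched like this (my own toy reading of the formula as reported, not Microsoft's implementation): a term scores high when it's frequent in the document but damped when it's frequent corpus-wide.

```python
import math

def similarity_score(term, doc_tf, corpus_tf):
    """Score a term per the tf_doc / log(tf_corpus + 1) form quoted
    above: doc_tf and corpus_tf are term -> frequency dicts."""
    tf_d = doc_tf.get(term, 0)
    tf_c = corpus_tf.get(term, 0)
    if tf_c == 0:
        # unseen in the corpus: log(1) = 0 would divide by zero,
        # so just return the raw document frequency
        return float(tf_d)
    return tf_d / math.log(tf_c + 1)
```

So a distinctive word like "retrieval" outscores a stop word like "the" even at equal document frequency, which is the usual tf-idf-style behaviour.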

Robert Carlson (IBM)
"All those who attempt to predict the future were fools, prophets or wannabees."

Other really interesting chats between speeches, over coffee, involved NLP techniques, and it seems none of us has been able to do this in any coherent manner as yet. Linguistics alone is not sufficient in a system architecture. We sometimes forget that a computer is a machine, not a human, and our brains work in different ways. Summarization is necessary as a basis for all our systems, and tests show that conventional methods are still outdoing anything new (no interest in fractals Orion - sorry). Newer and more interesting results involving patterns, coherence and cohesion, following the methods of Halliday and Hasan, are being implemented and tested. Blah blah, there were many other discussions too, but perhaps not all relevant to you guys!

Something you will all smile at: during the MSN demo, IE crashed, and Firefox was used instead, with Google, to find the right link.
It was all taken in good humour!

I wrote a more complete update, for all those interested in reading more about the methods and systems discussed, at
Search science
Old 05-01-2005   #2
claus
It is not necessary to change. Survival is not mandatory.
 
Join Date: Dec 2004
Location: Copenhagen, Denmark
Posts: 62
Microsoft and clustering

You will find more information on MS and clustering here, and a framework called Microsoft Bayesian Network here. Also, there's the MSRA Search Result Clustering web site with a search field that doesn't work (at all) for me, as well as a cluster search toolbar which I haven't tried, as it's for IE only.

There's a PDF Paper too:

Hua-Jun Zeng, Qi-Cai He, Zheng Chen, and Wei-Ying Ma.
Learning To Cluster Web Search Results. In Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR '04), pp. 210-217, Sheffield, United Kingdom, July 2004.
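Just to illustrate the general idea (a toy grouping in the spirit of salient-phrase clustering, definitely not the algorithm from the paper): assign each result snippet to a cluster named by its most salient term, scored tf times idf over the result set.

```python
from collections import Counter, defaultdict
import math

def label_clusters(snippets):
    """Toy search-result grouping: each snippet joins the cluster named
    by its highest tf*idf term, computed over the snippet set itself."""
    docs = [s.lower().split() for s in snippets]
    df = Counter(t for d in docs for t in set(d))  # document frequency
    n = len(docs)
    clusters = defaultdict(list)
    for snippet, doc in zip(snippets, docs):
        tf = Counter(doc)
        # terms occurring in every snippet get idf 0 (useless as labels)
        label = max(tf, key=lambda t: tf[t] * math.log(n / df[t])
                    if df[t] < n else 0)
        clusters[label].append(snippet)
    return dict(clusters)
```

On a handful of snippets this is naive (rare terms win too easily), but it shows why the paper treats cluster *naming* as the central problem rather than an afterthought.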

Old 05-01-2005   #3
xan
Thanks Claus, it's interesting to see the results of the SRC toolbar as well. Not good at all right now, but it is a beta, and it is being improved. Clusty does seem to do all right as well. I'm not entirely sure that visualizing clusters in the way we are seeing them right now is necessarily the best way. What do you think?
Old 05-01-2005   #4
claus
A few thoughts on interfaces to clusters

Please bear with me if this turns out to be more of a novel than a post. I will try to keep it brief.

Clusty does the same thing that Northern Light used to do, listing the clusters along the left-hand side of the page. I think the left-hand "navigation metaphor" is probably the easiest to understand for most users, as they're used to this from other web sites. Doing it this way means you don't really have to explain what a cluster is (and you can avoid the word as well). If you only have a few clusters (around five at most), then tabs would probably be good too.

Gigablast (example) arranges its clusters at the top of the page. This, combined with the term "Giga Bits", makes it a fair bit harder to understand for the average user, imho.

Then, there is Flash and Java. Kartoo uses Flash to visualize clusters. They provide a graphical map in the center of the page and add the "left navigation" as well. It's a very user-friendly way to display a difficult concept.

Taken to the extreme, there's also the Newsmap. This is a Flash page that visualizes clusters of news stories as they appear in Google News. Actually, it's more of a graphical front-end to Google News, as all the clustering is being done by Google, so this page only provides an illustration. Afaik, the Newsmap was inspired by the Market map which, again, builds on the generic Treemap.

Things like these latter ones are too complicated for the average user. First, a "treemap" is a construct that needs explanation on its own before you can start to use it. Second, you simply have to provide an "aggregate view" first, and then the ability to "drill down" (Market Map does this better than Newsmap, I have to add). Otherwise it's simply too confusing (see, e.g., the Netscan treeview for a very confusing one).
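For anyone curious, the core of a treemap layout really is tiny. Here's a minimal "slice-and-dice" sketch (my own illustration; real treemaps like the squarified variant work harder to keep the rectangles from getting long and thin):

```python
def slice_and_dice(weights, x, y, w, h, depth=0):
    """Minimal slice-and-dice treemap layout: divide the (x, y, w, h)
    rectangle along one axis in proportion to each weight.  Recursing
    into sub-lists with depth + 1 would alternate the split axis.
    Returns (weight, (x, y, w, h)) tuples."""
    total = float(sum(weights))
    rects, offset = [], 0.0
    for wt in weights:
        frac = wt / total
        if depth % 2 == 0:  # even depth: split horizontally
            rects.append((wt, (x + offset * w, y, frac * w, h)))
        else:               # odd depth: split vertically
            rects.append((wt, (x, y + offset * h, w, frac * h)))
        offset += frac
    return rects
```

This is exactly the "aggregate view first" point: the top-level rectangles are the aggregate, and drilling down just means laying out a child list inside one of them.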

Otherwise, The Brain provides a very good and user-friendly interface to clusters. I don't remember if it's Java or Flash. With The Brain, it's not "clusters" as such, more like a generic relational database front-end, but I can imagine that this kind of interface would be very good for navigation among clusters in a data set.

The TouchGraph Google Browser uses the "spider's web", or "haystack", kind of image, in which all nodes are connected with simple lines. It's similar in concept to The Brain, but the execution is very different. Compared to The Brain it seems a bit "clumsy" to me (for lack of a better word).

But i could go on... There are more examples on the web - it seems people have experimented a good deal here.

IMHO:

For visual interactive representations, those by Kartoo and The Brain are among the best for the average user.

For non-interactive listings, you should probably try to use other metaphors, e.g. the "File manager" or "Explorer" folder view from Windows (or Mac/Linux-style folders).

For non-graphical display, using other well-known concepts, such as "directory", "left navigation" or "tabs" will probably be best.
Old 05-01-2005   #5
xan
Graphical interfaces are not scalable; that's the main problem. The others don't help significantly.

I find that many of the clustering tools aren't able to handle compounds either, which is a problem.

I did a visualization study once on all the different academic and commercial solutions to data representation, and, unbelievably, things like the URL list and Kartoo didn't win; people seemed to prefer the coloured-block approach (although this is not an answer - the point wasn't to find one).

SIS looks like quite a conservative approach to me. Simple is best when there's not a better way, mind you!
Old 05-02-2005   #6
claus
I agree 100% that simple is the way. That was also the reason for my comment on the more complicated interfaces: you have to get an aggregate view first and then be able to drill down. I believe the term for this is progressive disclosure - some also talk about information scent: as long as information keeps getting more relevant, users will click through to the next level.

It's not easy though. You will find big clusters and small clusters, but finding the relevant clusters is a task in itself, as then we're back to ambiguity and understanding the query. Take "apple" as an example - there are probably orders of magnitude more web sites on the computer variety than the fruit variety, so if your focus area is web sites and you simply order by cluster size, you might miss relevance altogether (the same flaw that link popularity and other "authority measures" have, btw).

For most of the solutions I've seen, there's simply too much detail - that is great for a type like me once I know the tool inside out, but for the average user it's a huge obstacle instead. I wonder... is your "compounds" the same as my "aggregate"? I think they might be. In that case I've just rephrased what you said above, and of course I agree.

In my view, clustering should really be a help when faced with ambiguous queries. So, that's where the difficulty arises: you have to list a number of clusters because you don't know in advance which one is the right one. That, in turn, leads to cluttered lists with too much information. Applying a distance metric might solve that problem: display the clusters that are furthest apart first, and then let the user refine the query by choosing the most related set.
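A rough sketch of what I mean (toy code with made-up data; distance here is 1 minus cosine similarity between cluster centroids, and the ordering is a simple greedy "farthest first"):

```python
import math
from collections import Counter

def centroid(docs):
    """Mean term-frequency vector of a cluster's documents."""
    c = Counter()
    for d in docs:
        c.update(d.lower().split())
    n = len(docs)
    return {t: v / n for t, v in c.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def order_by_separation(clusters):
    """Show the most mutually distant clusters first, so an ambiguous
    query surfaces its distinct senses early."""
    cents = {name: centroid(docs) for name, docs in clusters.items()}
    names = list(cents)
    # seed the ordering with the most distant pair
    first, second = max(((a, b) for a in names for b in names if a < b),
                        key=lambda p: 1 - cosine(cents[p[0]], cents[p[1]]))
    order = [first, second]
    remaining = [n for n in names if n not in order]
    while remaining:
        # next cluster: the one farthest from everything already shown
        nxt = max(remaining,
                  key=lambda n: min(1 - cosine(cents[n], cents[o])
                                    for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

With an "apple" query, a computer-sense cluster and a fruit-sense cluster would land at the top of the list even if one of them is tiny.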

For display purposes you could either do it graphically (e.g. coloured boxes) or textually (e.g. a directory, like a mini-dmoz). However, these are "just tools", as there is always more than one way to do it. Good information architecture and interface design is almost a science in itself, and, given the tool metaphor, an unskilled worker would create more destruction with a power tool than with a simple one (the Netscan treeview mentioned above illustrates this point, I think). On the opposite side, an expert can create small wonders with very simple tools.

Graphics and visualizations:

These can be "power tools", but you really, really need very competent people in order to construct them properly. They should not be made by IR researchers, and they should not be made by graphical designers either (although you will ultimately need a graphical designer for the design layer). This is the sub-field of Information Architecture called Interaction Design - of course, some graphical designers can do this, but in my experience extremely few of them can (and fewer still can do it well), simply because a good deal of the work involved is outside their field of specialization.

I should add that "Interaction Design" does not imply "interactive graphics" - it can easily be 100% text based. It's all about processes, scenarios, and "getting from A to B".

I'm not sure about what you mean by the word "scalable" in this context as it can be several things:

- number of individual nodes?
- number of clusters?
- number of users?
- number of requests per day/hour/second?
- bandwidth consumed by images / Java applets / Flash files?

Regarding Java applets, these are cumbersome on the Windows operating system, which most people use. If a web browser is what should be used for display, you can get much better interfaces by using techniques like Flash or Ajax.

Lately I've had the opportunity to work with people that understand Flash. I've worked with a lot of people that use Flash, like Flash, prefer Flash, and so on, but that had not led me to see any significant benefit in Flash so far. On the contrary, it only led to frustration with Flash, and to disabling Flash altogether in my preferred browser.

The thing I now understand is that 99% of so-called "Flash developers" use Flash in all the wrong ways, and that the Flash technology itself has developed significantly. You can actually base quite complicated Flash applications on very small Flash files that download instantly and read data from XML. That way you get really fast downloads, instant rendering and very fast interaction. In fact, it's very much like HTML pages in terms of benefits, but it's dynamic as well.

Anyway, I digress. I could probably just have written "it's not which tool you choose, it's how you use it", but apparently I'm having a verbose day today.
Old 05-02-2005   #7
xan
Well clusters have been with us for a very long time, but we didn't need to make them available to the public before.

By scalable I mean that as a larger dataset needs to be viewed, and as it grows, it becomes unreadable in solutions like Kartoo or the TouchGraph, for example.

Your distance metric works, but only in very specific and closed datasets. It's an assumption that there are more PC-related sites online to do with the word "Apple". The sets get a lot more complex than that the more you drill down as well.

Data visualization is a field in its own right. It's necessary to identify the characteristics of the data and its dimensions, and then decide on the graphical entities and attributes. I mean, if you work in anything you have to represent data visually at some point, especially in IR. Sure, graphical user interaction is very important indeed, and we get headaches about that all the time, but data visualization isn't the same thing at all.

Statisticians have done a huge amount of work in the area and are still a major component of such a project group. A lot of work has been done in the area of cluster visualization, such as the SOM algorithm, and there are many systems available for viewing clusters which are clearly superior to anything available online. The reason for this is that the dataset online is far more difficult to deal with, due to its size and nature.
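For the curious, the core of a SOM is surprisingly small. Here's a minimal 1-D sketch (my own toy version with made-up parameters; real SOMs use 2-D grids and a decaying neighbourhood radius, not a fixed one-neighbour pull):

```python
import random

def train_som(data, n_units=4, epochs=50, lr=0.5, seed=0):
    """Minimal 1-D self-organizing map: each unit holds a weight vector;
    the best-matching unit and its immediate neighbours are pulled
    toward each sample, with a learning rate that decays over epochs."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # decaying learning rate
        for x in data:
            # best-matching unit by squared Euclidean distance
            bmu = min(range(n_units),
                      key=lambda u: sum((units[u][i] - x[i]) ** 2
                                        for i in range(dim)))
            for u in (bmu - 1, bmu, bmu + 1):  # BMU plus neighbours
                if 0 <= u < n_units:
                    influence = 1.0 if u == bmu else 0.5
                    for i in range(dim):
                        units[u][i] += rate * influence * (x[i] - units[u][i])
    return units
```

After training on two well-separated blobs, different units settle near each blob, which is exactly why the maps end up looking like a smoothed picture of the data's cluster structure.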

Dataminers are the guys who know a lot about this stuff.

How the user manipulates the structure is in fact where user interaction comes in, and this would be made easier if the visualization of the data were sorted out!

I'm not a great lover of Flash to be honest, but it has its place.
A good reference: Jacques Bertin's Semiology of Graphics (1983).
Old 05-02-2005   #8
claus
Sorry, it appears that I have totally misunderstood your post(s). I thought you were thinking about how to display clusters to average web users, but now it appears to me that you were really thinking about tools to use as part of the research process.

In that case it's not really that critical. Scatterplots and dendrograms are probably the simplest tools, with 3D plots a close second (I had one app some time back that would let you pivot 3D scatterplots in a cube and switch variables on the axes. That was a great tool, but unfortunately I've lost it). I've never actually used SOM maps. They look like impressionist art to me, only without the motif *lol* Still, I don't doubt that they're useful.
Old 05-03-2005   #9
xan
No Claus, you were spot on! It's easy to digress, eh?
Old 05-05-2005   #10
claus
>> It's easy to digress, eh?

Way too easy - I find it's easier if you're either very awake or very tired *lol*