Join Date: Jun 2004
IPAM Document Space
Thanks to a generous NSF grant from IPAM (Institute of Pure and Applied Mathematics) and to an angel, I'll be attending IPAM's Document Space workshop & conference from 01/23 to 01/27 at the University of California, Los Angeles (UCLA). The event will be held at the IPAM facilities on campus.
They are no longer accepting online registrations. You may still register at the door ($100, I think) on the first day of the workshop. And the Doubletree Hotel in LA is already sold out. A hotel manager there was kind enough to get me a nice room just in time.
The IPAM Document Space Workshop and Conference is attracting the attention of IR colleagues, many of whom are true IR icons. The list of confirmed speakers so far includes:
Michael Berry (University of Tennessee)
David Blei (Princeton University)
Eugene Charniak (Brown University)
Ronald Coifman (Yale University)
John Conroy (IDA Center for Computing Sciences)
Nello Cristianini (UC Davis)
Jason Eisner (Johns Hopkins University)
Djoerd Hiemstra (Universiteit Twente)
David Horn (Tel Aviv University)
Piotr Indyk (Massachusetts Institute of Technology)
Frederick Jelinek (Johns Hopkins University)
Peter Jones (Yale University)
Damianos Karakos (Johns Hopkins University)
Sanjeev Khudanpur (Johns Hopkins University)
John Lafferty (Carnegie Mellon University)
Stephane Lafon (Yale University)
Mauro Maggioni (Yale University)
Michael Mahoney (Yahoo! Research)
David Marchette (Johns Hopkins University)
Carey Priebe (Johns Hopkins University)
Andrew Tomkins (Yahoo! Research)
Michael Trosset (College of William & Mary)
After the conference I plan to use this thread to report on the scientific sessions and presentations. All are relevant to ranking algorithms and heuristics, and all are interdisciplinary in nature. That's why I opened this new thread in this section.
Meanwhile, here is some scientific background about IPAM's Document Space event (emphasis added):
"Processing and management of the ever-increasing amount of spoken and written information appears to be a huge challenge for statisticians, computer scientists, engineers and linguists; the situation is further aggravated by the explosive growth of the web, the largest known electronic document collection.
There is a pressing need for high-accuracy Information Retrieval (IR) systems, Speech Recognition systems, and "smart" Natural Language Processing (NLP) systems. For tackling many problems in these fields, most approaches rely on:
1. Well-established statistical techniques, sometimes borrowed from the analysis of numerical data,
2. Ad-hoc, fast techniques that appear to work "well", but which lack a solid understanding of how the language is structured, and
3. High-complexity algorithms from Computational Linguistics that exploit the syntactic structure of language but which do not scale well with the amount of information that needs to be processed in emerging applications.
This workshop on Document Space has the goal of bringing together researchers in Mathematics, Statistics, Electrical Engineering, Computer Science and Linguistics; the hope is that a unified theory describing "document space" will emerge that will become the vehicle for the development of algorithms for tackling efficiently (both in accuracy and computational complexity) the challenges mentioned above.
Text documents are sequences of words, usually with high syntactic structure, where the number of distinct words per document ranges from a few hundreds to a few thousands. Much effort has been devoted to finding (e.g., through statistical means) useful low-dimensional representations of these inherently high-dimensional documents, that would facilitate NLP tasks such as document categorization, question answering, machine translation, unstructured information management, etc. Moreover, many of these tasks can be formulated as problems of clustering, outlier detection, and statistical modeling. Many important questions arise:
What is the best way to perform dimensionality reduction? The fact that documents can have diverse features in terms of vocabulary, genre, style, etc., makes the mapping into a common space very challenging. Is there a single best metric for measuring similarity between documents? Documents can be similar in many ways (in terms of content, style, etc); how do different vector representations facilitate different similarity judgments?
How can the semantics of each word be incorporated into the analysis and representation? For example, there are many cases where related documents share very few common words (e.g., due to synonymy). On the other hand, documents with high vocabulary overlap are not necessarily on the same topic.
It has been argued that sub-corpus dependent feature extraction (that is, document feature computation that depends on collective features of a subset of the corpus) yields far better retrieval results than when the features depend only on each document independently. Hence, efficient representation of documents into a common space becomes a "hard" problem: in principle, one would have to consider all possible subsets of a corpus in order to find the one that yields the best feature selection.
There is a natural duality between the symbolic and stochastic approaches in NLP, which have been exploited in order to organize document corpora. Symbolic information can be used to define coordinates and/or similarities between documents, and conversely the stochastic approach can lead to the definition of symbolic information. As above, this correspondence is relative to different subsets, of both documents and symbols, and organizing and fully exploiting it, with efficient algorithms, is challenging. "
"We expect that this workshop will lead the way toward well-justified answers (in terms of theory and experimental results) to the questions above, and, hopefully, contribute to a better understanding of the rich medium of language."
This is going to be the event of the year within the IR community. A great way to start 2006.
For additional information, visit the IPAM site.
Last edited by orion : 01-19-2006 at 07:37 PM.
This is not yet a report, just random typing before the conference starts.
I got into the shuttle from the hotel to IPAM very early. Next to me was Michael Berry from the Univ. of Tennessee. He is one of the pioneers of LSI, along with Susan Dumais and T. K. Landauer. He told me they put the wrong picture on the IPAM site (of an old physics professor, not him). He is a younger guy.
He mentioned his new book coming this spring, "Lecture Notes on Data Mining" (World Scientific), by, who else, Michael Berry and Murray Browne.
We chatted during the ride about latent semantic indexing. He mentioned that one problem with LSI is which cluster to pick for dimensionality reduction. Processing power is becoming less of an issue in their research.
I mentioned two term vector models I'm working on that incorporate semantics into the term space.
He says it might work if I concentrate on noun-adjectives.
We arrived at IPAM early and Michael ran to a computer, tried some passwords, and got it working right away (we laugh). I use the password and start writing these random words.
A tech guy shows up with the password for using the system. Too late. We laugh.
A lovely man arrives. It is Mark Green, director of IPAM. He asks us how we got in. I say that it was open. More laughs. I take a picture with Dr. Green. He leaves.
No one has arrived yet.
Last edited by orion : 01-24-2006 at 02:27 PM.
These are not detailed reports. I plan to do that at another web property.
Mark Green opens with a lovely welcome and mentions the mission of IPAM. Carey Priebe from Johns Hopkins Univ. follows. Tall guy with a strong voice. No need for a microphone. He is one of the co-organizers of Document Space and explains the goals of the workshop.
The unique thing about Document Space is that for the first time the applied math community has the opportunity to address what a document space, a document, and a query are, and what we try to accomplish by measuring similarity and dissimilarity. It is clear that Euclidean spaces and crude LSI have many drawbacks when we discuss documents in higher dimensions, and thus better models are needed.
Carey introduces the illustrious Michael Trosset from the College of William and Mary. His talk, "Trading Spaces: Measuring Document Proximity and Methods for Embedding Them", makes clear that similarity and dissimilarity in terms of Euclidean space and Euclidean distances are not enough to represent documents and queries. The general agreement appears to be that queries are linear combinations of terms. However, to define docs one needs to define the document space first.
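To make the Euclidean objection concrete, here is a minimal sketch of the classic vector space model, with hypothetical toy term-count vectors: Euclidean distance and cosine similarity can disagree about which document is "closer" to a query, because Euclidean distance is sensitive to document length while cosine only measures direction.

```python
import math

def cosine(u, v):
    # Cosine similarity: the angle between vectors, ignoring length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    # Euclidean distance: sensitive to raw term counts (document length).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term-count vectors over a 3-term vocabulary.
query     = [1, 1, 0]
short_doc = [2, 2, 0]   # same topic as the query, just wordier
long_doc  = [1, 1, 1]   # adds an off-topic term

# Euclidean distance ranks long_doc closer; cosine ranks short_doc closer.
print(euclidean(query, short_doc), euclidean(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```

The disagreement above is one small illustration of why "document proximity" needs a more careful definition than raw Euclidean distance.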
An interesting debate starts as to why we do clustering and what we are trying to accomplish by doing it. More discussion follows. We stop for lunch.
Michael Berry's (Univ. of Tennessee) talk is enlightening: "Text Mining Approaches for Email Surveillance". Mike explains how they apply LSI, NMF, and other techniques to the Enron corpus, and that the NSA (National Security Agency) might use them, among other techniques. But he doesn't know exactly how.
He then shows working examples they conducted on the Enron corpus. Amazing how they tracked down and discovered emails relevant to the Enron case. I wish I had time to discuss his fascinating presentation in full here. Michael mentions the upcoming Sixth SIAM International Conference on Data Mining.
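The core step of LSI, which Berry's talks build on, is a truncated SVD of the term-document matrix. A minimal sketch with numpy and a hypothetical toy matrix (the real Enron work is of course far larger and more involved):

```python
import numpy as np

# Hypothetical term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2, 1, 0, 0],   # "stock"
    [1, 2, 0, 0],   # "market"
    [0, 0, 2, 1],   # "protein"
    [0, 0, 1, 2],   # "gene"
], dtype=float)

# LSI: truncate the SVD to rank k, projecting documents into a
# k-dimensional latent semantic space.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document

# Documents 0 and 1 (the "finance" pair) collapse together in latent
# space, far away from documents 2 and 3 (the "biology" pair).
print(doc_coords.round(2))
```

Note how truncation deliberately throws away the within-topic differences between documents 0 and 1; that collapsing of near-synonymous usage is precisely the point of LSI.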
We take a break.
Sanjeev Khudanpur (Johns Hopkins Univ) is next with "Document Representations for Topic-Adaptation in Statistical Language Modeling". He explains how they use word co-occurrence in speech recognition.
Next is the legendary Ronald Coifman (Yale University) with "Diffusion Geometries of Digital Document Spaces, Ontologies and Knowledge Building". He mentions that in higher dimensions the usual notions of distance (Minkowski, Euclidean, etc.) become meaningless. Diffusion maps are an alternative.
He explains how they use self-similarity, multidimensional scaling, and a simple radius-1, radius-2 covering technique in the analysis. Applications to Collaborative Filtering are discussed. Amazing talk. Best presentation of the day.
Say "adios" to mere Euclidean Spaces and LSI.
Say "hello" to Diffusion Geometries & Diffusion Spaces.
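For readers who have not seen diffusion maps, here is a minimal sketch of the basic construction (Gaussian affinities, row-normalization into a Markov matrix, eigendecomposition) on hypothetical toy points standing in for documents. This is my own simplification of the standard recipe, not Coifman's actual pipeline:

```python
import numpy as np

def diffusion_coords(X, eps=1.0, t=1, k=2):
    # Pairwise Gaussian affinities between points ("documents").
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps)
    # Degrees; symmetrized operator with the same spectrum as the
    # row-normalized Markov matrix P = D^-1 W (numerically stabler).
    d = W.sum(axis=1)
    S = np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Right eigenvectors of P; skip the trivial eigenvalue-1 vector.
    psi = np.diag(d ** -0.5) @ vecs
    return (vals[1:k + 1] ** t) * psi[:, 1:k + 1]

# Two hypothetical clusters of points standing in for two topics.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(3, 0.1, (5, 3))])
Y = diffusion_coords(X)
# Diffusion distance within a cluster is much smaller than across.
print(np.linalg.norm(Y[0] - Y[1]), np.linalg.norm(Y[0] - Y[5]))
```

The resulting "diffusion distance" reflects connectivity of the data rather than straight-line geometry, which is exactly the property that makes it attractive over plain Euclidean embeddings.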
We end and go to a wine and cheese session.
Last edited by orion : 01-31-2006 at 03:44 AM.
John Lafferty's (Carnegie Mellon Univ) and David Blei's (Princeton University) back-to-back presentations, "Topic Models 1: Probabilistic Models of Documents" and "Topic Models 2: Structured and Dynamic Models", start. They cover how low-dimensional representations of document collections can be applied to Collaborative Filtering, IR, annotating unlabeled images, and topic evolution over time. They then get into Latent Dirichlet Allocation (LDA), where each document is considered a random mixture of topics from which documents can be generated. Awesome discussion on LDA and its relation to matrix factorization and the simplex.
I go to lunch with Michael Berry, one of his grad students and a former student now professor at UCSD (University of California, San Diego). Pictures, burgers, jokes, and a great chat follows during lunch.
Next is Eugene Charniak (Brown Univ), a venerable parsing expert. His presentation is on context-free grammars and parsing. Very enlightening presentation. He gets into precision and recall issues from the parsing standpoint, then defines what a constituent is and constituent accuracy. I wish I had time to cover everything here.
Frederick Jelinek (Johns Hopkins), a brilliant man, is next with "Experiments with Random Forests". He explains why a language model is a distribution over words, and gets into a model based on word occurrence and acoustics. Very interesting topic. He then explains how and why they smooth the data using Kneser-Ney smoothing, and gives examples of random trees and some drawbacks of trees.
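For readers unfamiliar with Kneser-Ney, here is a minimal sketch of the interpolated bigram variant in pure Python, on a hypothetical toy corpus. Its distinctive idea is the continuation probability: back off not to how often a word occurs, but to how many distinct contexts it follows.

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney bigram model from a token list (a sketch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    big_counts = Counter(bigrams)
    hist_counts = Counter(tokens[:-1])              # counts as bigram history
    # Continuation counts: in how many distinct contexts does w appear?
    continuations = Counter(w for (_, w) in set(bigrams))
    total_bigram_types = len(set(bigrams))
    vocab = set(tokens)

    def prob(w, prev):
        # Discounted bigram estimate...
        disc = max(big_counts[(prev, w)] - d, 0) / hist_counts[prev]
        # ...interpolated with the continuation distribution, weighted by
        # the probability mass the discount set aside for this history.
        lam = d * sum(1 for (h, _) in big_counts if h == prev) / hist_counts[prev]
        p_cont = continuations[w] / total_bigram_types
        return disc + lam * p_cont

    return prob, vocab

tokens = "the cat sat on the mat the cat ran".split()
prob, vocab = kneser_ney_bigram(tokens)
# Probabilities over the vocabulary for a given context sum to 1.
print(sum(prob(w, "the") for w in vocab))
```

The discounting (subtracting `d` from every observed bigram count) is what frees up probability mass for unseen word pairs, which is the whole reason smoothing is needed in the first place.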
Jason Eisner (Johns Hopkins Univ) delivers perhaps the most electrifying presentation: "Bootstrapping without the Boot". In my view, Jason is what a speaker should be. It is not all about what you present at a conference; how you present counts.
He discusses how to tackle the double-spiral problem when we do clustering, then presents an unsupervised learning technique that requires little math. It is a clever technique to address the problem of WSD (word sense disambiguation). In my opinion, this was the best presentation.
Last edited by orion : 01-28-2006 at 03:29 AM.
Damianos Karakos (Johns Hopkins Univ) presents "Language Model with the Maximum Likelihood Set: Complexity Issues and the Back-Off Formula". He explains that a language model assigns a probability to any word sequence, and then mentions applications in the areas of document categorization, information retrieval, and speech recognition. He explains how they smooth the collected data.
Peter Jones (Yale Univ) is next with "Eigenfunction Local Coordinates and the Local Riemann Mapping Theorem". He explains what a geodesic distance is and its applications. Very interesting. He shows a picture of a fractal.
The afternoon session starts with David Marchette from the Naval Surface Warfare Center. His presentation, "How Document Space is like an Elephant?", seems to be joint work between NAVSEA and Johns Hopkins University. David explains the many views of scientists (applied mathematicians, statisticians, IR researchers, etc.) when it comes to defining what a document space is, hence his presentation title. He then discusses the so-called dimensionality reduction curse when one does SVD.
The last two presentations are from Yahoo! and in my view steal the "show".
First is Michael W. Mahoney from Yahoo! with "Data-driven Dictionary Definition for Diverse Document Domains". Essentially, he reviews LSI's SVD and introduces new advances in SVD and matrix selection. Applications to DNA and ethnicity identification are mentioned.
What was perhaps the best presentation of the day, and so far of the entire IPAM workshop, was Andrew Tomkins's (Yahoo!) "Representation of Web Document Spaces". Simply a mesmerizing presentation. Prior to his talk I was able to chat a bit with him about research issues.
Andrew presents two representations of the Web. The first is a graph consisting of concentric circles from which tendril-like segments hang. The other is his famous bow-tie graph of the Web, developed in collaboration with Andrei Broder (back then at AltaVista) and other IR researchers.
He reviews this one quickly. I ask a question about the current state of this graph and he provides me with some enlightening clues. In my opinion, folks in the room don't seem to grasp the implications of the current state of this graph from the marketing standpoint.
He then discusses how and why it is important to identify specific link structures, and gives examples using a K2,3 Web community and some others.
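A K2,3 community is a complete bipartite link pattern: two "fan" pages that each link to the same three "center" pages. A brute-force sketch of spotting such a pattern in a hypothetical toy link graph (real Web-scale detection uses much cleverer pruning, which I won't attempt here):

```python
from itertools import combinations

def find_k23(links):
    """Find (fans, centers) pairs forming a complete bipartite K2,3:
    2 pages ("fans") that each link to the same 3 pages ("centers")."""
    hits = []
    for fans in combinations(sorted(links), 2):
        # Pages linked to by BOTH fan candidates.
        common = set(links[fans[0]]) & set(links[fans[1]])
        for centers in combinations(sorted(common), 3):
            hits.append((fans, centers))
    return hits

# Hypothetical toy web graph: page -> set of pages it links to.
links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z", "w"},
    "c": {"w"},
}
# Pages a and b both link to x, y, and z, so one K2,3 is found.
print(find_k23(links))
```

Dense bipartite cores like this are interesting because they tend to mark emerging communities; they are also exactly the structure a link farm imitates.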
He also discusses a circuital approach to link graphs they are researching at Yahoo!. Very enlightening. I ponder how link spammers could attempt to deceive the circuit and its "flow". Fortunately, the folks in the room are scientists only.
The good thing about being both an IR and marketing researcher is that I can see both sides of the fence, test what works and what doesn't in a real-case scenario contaminated with noise (or spam), and then pass that information along in non-technical terms to marketers, many of whom can simplify this material even further.
Sharing this knowledge is a good thing, as it forces both sides (IR researchers and marketers) to come up with better algorithms and techniques to test. My opinion.
Last edited by orion : 01-28-2006 at 01:19 AM.
The day before I caught a cold.
The first speaker is Stephane Lafon from Google.
His topic is "Geometric clustering in kernel embedding spaces for document corpora organization". This is collaborative work between Google, Ronald Coifman's Diffusion Geometries Group at Yale, Princeton, Carnegie Mellon, and Weizmann. In my opinion, diffusion geometry is where the future of document space is. Since data sets exist in higher dimensions, dimensionality reduction techniques are necessary. This is a challenge. Another challenge is that we need to presort results (prerank) or do clustering in the diffusion space. A third challenge of this work is how to extend the model to non-symmetric spaces.
Collections of documents are described in terms of a diffusion graph with diffusion coordinates. The structure of this graph is analyzed using k-means clustering, covering balls (radius-1, radius-2 balls), or other geometric algorithms. I'm of the opinion that this is where covering and scaling techniques (fractals) help address the complexity of the diffusion graph. It would be interesting to know if they have realized the relevance of the diffusion-limited aggregation (DLA) growth model from the eighties.
I wish I had the time to discuss in detail the significance of the number of words in a document and in the collection in their treatment, or how they computed the eigenfunctions and eigenvectors. A PDF is available at the IPAM site. Unfortunately, if you were not there, it may not help you fully grasp the significance of Google's research in the area of diffusion geometries.
Lafon then describes a circuital and molecular model Google is using. It would be interesting to know if they have thought about developing rotational selection rules for their "atoms" and "molecules".
This was the best presentation of the day.
The next speaker is Mauro Maggioni from Yale Univ. This is work from Ronald Coifman's Diffusion Group. His topic, "Multiscale Analysis of Graphs and Document Corpora", is very interesting. His abstract expresses this very well:
"This coherent multiscale organization allows to analyse the graph at different levels of resolution, to reveal (soft) clusters and communities, and to construct multiscale learning algorithms. When the graph is associated with a body of documents, this construction leads to two dual, tightly related, multiscale structures, one on documents and one on words and concepts, which allow to extract information at different levels of specifity."
As I expected, organized structures at different length scales of observation are everywhere!
The next speaker is David Horn (Tel Aviv University). His presentation, "Unsupervised learning of Natural Languages", is relevant to linguistics and bioinformatics. Essentially, he tries to address the following: a document is a collection of symbols from which the underlying rules that govern its production can be inferred. For this purpose he uses ADIOS (Automatic Distillation of Structure), their pattern extraction algorithm. The presentation abstract claims, and I quote:
"This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics."
The next speaker in line is Nello Cristianini from UC Davis. He presents "Kernel Methods for Text Analysis". An expert in kernel methods, Nello shows how these can be used with many algorithms, and provides a list of them. He then reviews how kernel methods can be used to embed text from documents in a semantically meaningful way. Applications to PCA, CCA, and string matching algorithms are provided.
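One of the simplest text kernels in this line of work is the p-spectrum (substring) kernel: similarity between two strings is the number of length-p substrings they share, weighted by counts. A minimal sketch (my own illustration, not from Nello's slides):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p=3):
    # Count every length-p substring in each string, then take the
    # dot product of the two count vectors.
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[g] * ct[g] for g in cs)

# Texts sharing character 3-grams score higher than unrelated ones.
print(p_spectrum_kernel("statistics", "statistical"))
print(p_spectrum_kernel("statistics", "geometry"))
```

Because it is a valid kernel, the same similarity function can be plugged unchanged into kernel PCA, CCA, SVMs, and the other algorithms on Nello's list; that plug-and-play property is the appeal of the kernel framework.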
The last speaker is Piotr Indyk from MIT (Massachusetts Institute of Technology). The cold kicks in and I miss most of the presentation. He presents "Algorithmic Applications of Low-Distortion Embeddings", covering embedding maps.
Last edited by orion : 01-30-2006 at 03:19 AM.
The last day of the IPAM workshop and conference. Only two speakers are scheduled.
John Conroy's (IDA Center for Computing Sciences) presentation, "Multi-Document Summary Space: What do People Agree is Important?", states that a multi-document summary gives the gist of what is contained in a collection of related documents, i.e., what a collection is about. To address the question of how to define a "gist", he estimates the probability that a word selected by a human will be included in written summaries of document sets.
But how can we define a gist? His abstract states "We explore this question by analyzing human written summaries for clusters of document sets. In particular, we estimate the probability that word will be chosen by a human to be included in a summary. We demonstrate that if this probability model were given by an oracle, then a simple automatic method of summarization can produce extract summaries which are statistically indistinguishable from the human summaries."
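The oracle idea can be sketched very simply: given per-word probabilities of human selection, score each sentence by its average word probability and extract the top scorers. The probabilities and sentences below are hypothetical toy data, not Conroy's actual model:

```python
def summarize(sentences, word_prob, n=1):
    """Extractive summary: pick the n sentences whose words have the
    highest average (oracle) probability of appearing in a human summary."""
    def score(sentence):
        words = sentence.lower().split()
        return sum(word_prob.get(w, 0.0) for w in words) / len(words)
    return sorted(sentences, key=score, reverse=True)[:n]

# Hypothetical oracle: probability a human includes each word.
word_prob = {"enron": 0.9, "fraud": 0.8, "email": 0.5, "weather": 0.1}
sentences = [
    "enron fraud email",
    "the weather was nice",
]
print(summarize(sentences, word_prob))
```

The striking claim in the talk is that with a good oracle for `word_prob`, even a selection rule this naive produces extracts statistically indistinguishable from human summaries.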
Djoerd Hiemstra (Universiteit Twente)'s presentation "Expressing language modeling approaches as region algebra queries" proposes a unified theory of document space by combining two different approaches:
1. region models = developed for structured document retrieval
2. language models = useful for ranking search results and for developing new ranking approaches
In their abstract they claim the following:
"We show a remarkable one-to-one relationship between region queries and the language models they represent for a wide variety of applications: simple ad-hoc search, cross-language retrieval, video retrieval, and web search."
Document Space ends.
Some Random Notes
Document Space accomplished several things.
1. As expected from these conferences, scientists were able to do some networking and identify possible areas of collaboration and prospective co-researchers.
2. It is clear from this conference that the gap between information and computer sciences (of which IR is a mere subdivision) and the natural sciences is getting smaller and smaller.
3. The math community received an opportunity to present their case on what is or is not a document space, a document and a query.
4. The limitations of models that use the Euclidean Space as the document embedding space were dissected and alternatives were provided.
5. In my opinion, the Diffusion Space solidified its position as the most obvious choice for embedding documents in higher dimensions. The research conducted at Yahoo! and Google in the area of diffusion geometries and at some of the top research centers (e.g., Coifman's Diffusion Group) seems to confirm this perception. Within the next few years, it will be fun to watch whether a solid molecular and circuital framework emerges.
I am grateful to NSF, IPAM's director Dr. Mark Green, and his supporting staff for giving me the opportunity to attend the Document Space workshop.
Until next time, adios.
Last edited by orion : 01-28-2006 at 03:12 PM.
Join Date: Sep 2004
Location: Seattle, WA
Dr. Garcia - first off, thanks for the great coverage. This is something we'd never be exposed to without you. Second - can you elaborate on Andrew Tomkins' work:
Actually, their bow-tie graph is quite old (W3C, May 2000) and its relevance to link analysis has been dissected many times in discussion forums and by marketers, but if you want links, here is the original paper that started it all: Graph structure in the web. Go from there.