STM: Topic Analysis Model


orion
12-03-2004, 11:10 PM
If you are familiar with the On-Topic Analysis (http://forums.searchenginewatch.com/showthread.php?t=2031) thread, this new thread might interest you.

Here I would like to discuss the paper Topic Analysis Using a Finite Mixture Model (http://acl.ldc.upenn.edu/W/W00/W00-1305.pdf)

Since this work is not recent, I was initially planning to include the paper in the On-Topic Analysis thread, but on second thought I think it deserves its own thread.

I find the STM work very interesting and feel there may be others out there interested in discussing it. At the least, it will give new readers the opportunity to compare this work with the On-Topic Analysis paper.

The authors of the paper describe their Stochastic Topic Model (STM) as follows, and I quote:

“We address the issue of 'topic analysis,' by
which is determined a text's topic structure,
which indicates what topics are included in a
text, and how topics change within the text.
We propose a novel approach to this issue, one
based on statistical modeling and learning.
We represent topics by means of word clusters,
and employ a finite mixture model to represent
a word distribution within a text. Our
experimental results indicate that our method
significantly outperforms a method that combines
existing techniques.”


THE MODEL

Their model is based on two main characteristics

a. representing a topic by means of a cluster of words.

b. employing a stochastic model, called a finite mixture model, to represent a word distribution within a text.

The model has a hierarchical structure of probability distributions.

1. The first level is a probability distribution of topics (topic distribution).

2. The second level consists of probability distributions of words included within topics (word distributions). These word distributions are linearly combined to represent a word distribution within a text.
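To make the two levels concrete, here is a minimal sketch of my own (not code from the paper). It assumes a text with two hypothetical topics and a three-word vocabulary; all topic names, words, and probabilities are made-up toy values. The word distribution of the text is the topic-weighted linear combination of the per-topic word distributions.

```python
# Illustrative finite mixture: P(w) = sum over topics k of P(k) * P(w | k).
# All topics, words, and probabilities below are made-up toy values.

# Level 1: topic distribution for one text.
topic_probs = {"trade": 0.7, "weather": 0.3}

# Level 2: word distribution within each topic (word cluster).
word_probs = {
    "trade": {"tariff": 0.5, "export": 0.4, "rain": 0.1},
    "weather": {"tariff": 0.0, "export": 0.1, "rain": 0.9},
}


def mixture_word_prob(word, topic_probs, word_probs):
    """Linearly combine the per-topic word distributions."""
    return sum(
        p_topic * word_probs[topic].get(word, 0.0)
        for topic, p_topic in topic_probs.items()
    )


vocabulary = ["tariff", "export", "rain"]
text_distribution = {
    w: mixture_word_prob(w, topic_probs, word_probs) for w in vocabulary
}

for w, p in text_distribution.items():
    print(f"P({w}) = {p:.2f}")
```

Because each level is itself a probability distribution, the combined word probabilities for the text also sum to 1.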

Here are some highlights of the paper

1. A topic is defined as “a cluster of words that are closely related to the topic”

2. They use a term co-occurrence model based on stochastic complexity (SC)


Their analysis consists of three processes:

a. a pre-process called 'topic spotting'

b. text segmentation

c. topic identification.


I plan to discuss these areas in future posts. In the meantime, feel free to grab a drink, relax and digest the paper. Some components of the model are enlightening. Interestingly, they embedded text segmentation and a term co-occurrence component into the model. In my view, for those SEOs/SEMs and content writers who still have doubts about passage segmentation and term co-occurrence theory, this should sound like a “wake-up call”. Time to start incorporating theming and topic analysis in your marketing mix? In any event, Content is King!

APPLICATIONS

For those interested in practical applications, here is an interesting quote

“…given a collection of texts
(e.g., home pages), we can automatically construct
an index of the texts on the basis of the
extracted topics. We can indicate which topic
is from which text or even which block of a
text. Furthermore, we can indicate which topics
are main topics of texts and which topics
are subtopics (e.g., by displaying main topics
in boldface, etc). In this way, users can get a
fair sense of the contents of the texts simply
by looking through the index. For a specific
text, users can get a rough sense of the content
by looking at the topic structure as, for
example, it is shown in Figure 3.”


“Our method can also be useful for text mining,
text summarization, information extraction,
and other text processing, which require
one to first analyze the structure of a text.”

Interesting claims/findings.

Orion

orion
12-23-2004, 01:40 PM
ON THEMES, TOPICS and BLUE BANANAS


When one reads the STM and On-Topic Analysis papers and then looks around to see how some SEOs implement the notion of themes and topics in their optimization mix, one can easily spot several misconceptions.

One of these is the notion that pages from theme sites score higher than pages from non-thematic sites. Besides the fact that one can see many instances of pages not belonging to theme sites ranking high, there are also two good reasons that make me believe this notion is incorrect:

1. Stochastic Probability Distribution of Themes
2. Similarity-Based Measures

For clarity, let's consider a simplistic scenario consisting of a query Q and a database of three documents (D=3) D1, D2, and D3. Each document is part of a site. The sites can be visualized as the document "containers".

1. Stochastic Probability Distribution of Themes

The STM paper shows that probability distributions can be extracted from individual documents. These probability measures are layered: a topic distribution at the first level and word distributions at the second.

One can isolate the contribution of a given word or group of words within a piece of text by means of passage segmentation and terms co-occurrence.

In the STM paper the authors write

“For a fixed seed word s, we take a word w as
a frequently co-occurring word if the presence
of s is a statistically significant indicator of
the presence of w.

Let a data sequence: (s_1,w_1), (s_2,w_2), ...,
(s_m,w_m) be given where (s_i, w_i) denotes the
state of co-occurrence of words s and w in
the i-th text in the corpus data.”

That is, STM measures are computed for individual documents (D1, D2, and D3). Whether the “container” is a theme site or not is irrelevant.
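The data sequence in the quote can be sketched as follows. This is my own simplified illustration, not the authors' procedure: it builds one (s_i, w_i) presence pair per text for a seed word s and a candidate word w, then compares how often w appears with and without s. The paper bases the significance decision on stochastic complexity, which this sketch does not attempt; the toy corpus and the function names are hypothetical.

```python
def cooccurrence_states(texts, seed, word):
    """Return one (s_i, w_i) pair per text: 1 if the word is present, else 0."""
    states = []
    for text in texts:
        tokens = set(text.lower().split())
        states.append((int(seed in tokens), int(word in tokens)))
    return states


def presence_rates(states):
    """Estimate P(w present | s present) and P(w present | s absent)."""
    with_seed = [w for s, w in states if s == 1]
    without_seed = [w for s, w in states if s == 0]
    rate_with = sum(with_seed) / len(with_seed) if with_seed else 0.0
    rate_without = sum(without_seed) / len(without_seed) if without_seed else 0.0
    return rate_with, rate_without


# Toy corpus (made up for illustration); each string is one text.
corpus = [
    "trade tariff talks resume",
    "tariff cuts boost trade",
    "rain expected this weekend",
    "trade deficit narrows",
    "heavy rain floods roads",
]

states = cooccurrence_states(corpus, seed="trade", word="tariff")
rate_with, rate_without = presence_rates(states)
print(states)
print(f"with seed: {rate_with:.2f}, without seed: {rate_without:.2f}")
```

Note that the states are collected text by text. Nothing in the computation refers to the site that hosts each text.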


2. Similarity-Based Measures

A search engine that uses similarity functions to rank web documents computes similarity scores by comparing a query to individual documents. Once computed, these scores are sorted and the documents are ranked.

For example, Term Vector (http://forums.searchenginewatch.com/showthread.php?t=489)-based systems compute cosine similarity scores between documents and queries. Scoring and sorting are done without regard for the environment in which the documents reside. That is, the theme of the “containers” does not matter.
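Here is a minimal sketch of that scoring and sorting step, under simplifying assumptions of my own: raw term counts serve as term weights (a real system would use a weighting scheme), tokenization is a plain whitespace split, and the three documents are made-up examples.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term vector from raw term counts (no weighting scheme applied)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents; the site hosting each one never enters the computation.
documents = {
    "D1": "topic analysis with mixture models",
    "D2": "cooking pasta with tomato sauce",
    "D3": "search queries by topic",
}

query = term_vector("topic analysis")
scores = {
    name: cosine_similarity(query, term_vector(text))
    for name, text in documents.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

The only inputs to each score are the query and the document itself.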

To sum up, each document stands by its own merits.


Blue Bananas

This does not mean that the container of a document will not matter in other scoring schemes.

Assume now that D1, D2 and D3 all carry the same term weights and vectors (they are “co-similar”) with respect to Q.

If, in addition to Term Vector scores, the IR system uses other scoring frameworks such as authority-based, link-based, or “blue bananas”-based ones, and only D1 conforms to these, then D1 should score higher than D2 or D3, not because of similarity arguments but because of the additional factors included in the mix.

Let D1, D2 and D3 be co-similars with respect to Q (all with identical similarity scores)

and let “blue banana” weights = weights involving the theme of a container

then D1 should score higher than D2 and D3.
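In code form, the thought experiment looks roughly like this. It is only a sketch: the additive combination, the 0.5 factor, and all the numbers are assumptions of mine for illustration, not how any search engine is known to score.

```python
def combined_score(similarity, extra_weight, extra_factor=0.5):
    """Add a hypothetical non-similarity signal on top of the similarity score."""
    return similarity + extra_factor * extra_weight


# Co-similar documents: identical similarity scores with respect to Q.
similarity = {"D1": 0.8, "D2": 0.8, "D3": 0.8}

# Hypothetical "blue banana" weights: only D1 conforms.
blue_banana = {"D1": 1.0, "D2": 0.0, "D3": 0.0}

final = {d: combined_score(similarity[d], blue_banana[d]) for d in similarity}
ranking = sorted(final, key=final.get, reverse=True)

for d in ranking:
    print(d, final[d])
```

D1 comes out on top only because of the extra term; the similarity part of the score is the same for all three.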

But again, Reality Check: How many commercial search engines compute scores based on “blue banana” metrics? From the computational standpoint, such an implementation (scoring a document based on its container) is a formidable task. As of today, I'm not aware that the big guys are doing this (tomorrow, who knows).

My view is that there is nothing wrong with having

1. a site with multiple themes.
2. a document with a main theme and sub-themes.

So far, the experiments carried out suggest that documents with well-defined term data structures tend to rank higher than those with randomized or intermingled structures. In documents with disordered data structures, overused terms carry less weight and are less important than in documents with well-defined data structures. This has a lot to do with semantics, information entropy and the way IR systems parse text and attempt to associate concepts (not mere words embedded in documents).

LCA

A common limitation of term vector scores is that cosine measures are based on words, not on concepts. Identical similarity scores may convey different concepts. Indeed, concept-based measures are traditionally missing in TVT. Perhaps we should revisit local context analysis (LCA) (http://forums.searchenginewatch.com/showthread.php?t=2030) studies.

LSI

Latent semantic indexing is an attempt to tackle terms, documents and concepts in multiple dimensions, but again a large-scale architecture based on LSI is computationally expensive, even for most commercially available systems…but there is hope ahead…

As SEs become smarter and “semantic machines” go mainstream, more research is required in the area of on-topic analysis.


Orion