Search Engine Watch
Old 10-07-2004   #1
orion
 
Join Date: Jun 2004
Posts: 1,044
Local Context Analysis

J. Xu and W. B. Croft; Improving the Effectiveness of Information Retrieval with Local Context Analysis http://citeseer.ist.psu.edu/cache/pa...0improving.pdf

Let's discuss Local Context Analysis and Expansion Concepts

Orion
Old 10-07-2004   #2
orion
 

Ok, let's start some discussion about LCA. The following is taken from the reference given in http://www.miislita.com/exp1/on-topic-analysis.html

See also the On-Topic Analysis thread (http://forums.searchenginewatch.com/...6826#post16826)

Local Context Analysis (LCA) is based on the use of expansion concepts. Some IR scientists use the expression document concepts. Expansion concepts or document concepts are noun phrases; that is, noun groups consisting of one, two or three adjacent nouns. For instance, the phrases food recipe and California car insurance are expansion concepts but cheap hotel and exciting hot vacations are not.

LCA discovers expansion concepts as follows. Instead of examining entire documents, concepts are extracted from passages of fixed length L; typically from a fixed-length window of 300 words. Concepts are ranked by their co-occurrence with the query terms in the top N ranked documents. The highest ranked concepts are then used for query expansion.
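
The procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not the scoring formula from the Xu and Croft paper: the function names, the `delta` smoothing value and the log-damped product score are assumptions made for the example, and the candidate concepts are assumed to have been extracted already.

```python
import math
import re


def passages(text, length=300):
    """Split text into consecutive fixed-length windows of `length` words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [words[i:i + length] for i in range(0, len(words), length)]


def count_phrase(window, phrase):
    """Count occurrences of a (possibly multi-word) phrase in a list of words."""
    n = len(phrase)
    return sum(1 for i in range(len(window) - n + 1) if window[i:i + n] == phrase)


def rank_concepts(query_terms, candidates, top_docs, length=300, delta=0.1):
    """Rank candidate concepts by co-occurrence with the query terms.

    Simplified score (an assumption, not the paper's formula): for each query
    term, sum count(term) * count(concept) over all passages, dampen the sum
    with a log, and multiply the per-term factors together so that concepts
    co-occurring with every query term are favoured.
    """
    windows = [w for doc in top_docs for w in passages(doc, length)]
    scores = {}
    for concept in candidates:
        phrase = concept.lower().split()
        score = 1.0
        for term in query_terms:
            co = sum(w.count(term.lower()) * count_phrase(w, phrase) for w in windows)
            score *= delta + math.log(1 + co)
        scores[concept] = score
    return sorted(scores, key=scores.get, reverse=True)


top_docs = [
    "the hotel offers an airport shuttle and the hotel has a pool",
    "a food recipe blog",
]
print(rank_concepts(["hotel"], ["food recipe", "airport shuttle"], top_docs))
# ['airport shuttle', 'food recipe']
```

In a real system `top_docs` would be the top N documents returned for the original query, and the highest ranked concepts would be appended to that query.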

The model effectively applies a novel tf*idf treatment and co-occurrence theory to a subset that is part of a global set. This combination of local and global techniques produces more efficient query expansion. Nouns are used because research suggests that these are more informative and provide more possibilities for expanding queries than other types of terms.

Now, for those SEOs/SEMs who still have doubts about term co-occurrence, here is what Professor Bruce Croft (a pioneer in this area) found: The larger the number of co-occurrences, the less likely that an expansion concept co-occurs with the query by chance.
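
One way to make this statement concrete is a small worked example. It rests on an assumption that is not in the paper or in this thread: the concept and the query term are treated as landing in passages independently, so the number of passages containing both follows a binomial distribution. The function name and the numbers are illustrative only.

```python
from math import comb


def chance_probability(n, p_concept, p_term, k):
    """Probability that at least k of n passages contain both items,
    if concept and query term occurred independently of each other."""
    p = p_concept * p_term  # chance that one passage contains both
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# 100 passages; concept and term each appear in 10% of them
for k in (1, 3, 5, 10):
    print(k, chance_probability(n=100, p_concept=0.1, p_term=0.1, k=k))
```

The printed probability shrinks quickly as k grows, which is the intuition behind the quoted finding: many co-occurrences are hard to explain by chance alone.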

Any comments? Feedback?

Orion

Last edited by orion : 10-07-2004 at 11:52 PM.
Old 10-22-2004   #3
Kali
Ohh Bondage .......
 
Join Date: Oct 2004
Location: Here
Posts: 11
Well I'm interested in this thread - so I'll throw in my thoughts

At the moment the science is beyond me - I'd need to do about 6 months serious study and background reading to be able to truly get my head around it but..

Here goes my thoughts - this would be a very neat way of getting rid of a lot of 'spam' web pages created to target specific keyphrases used by searchers as the search keyphrases have a different language concept behind them.

Search keyphrases are built up as a set of single words that are important to the searcher in finding what they want, they will ignore superfluous glue.

Thus we get search phrase flight town cheap whereas the natural language would be find a cheap flight to town.

Thus where search engine optimisers attempt to find and target what the searcher is looking for by targeting the exact phrase used by the searcher, they often create lots of very unnatural language in the construction of their pages.

What we would see in the SERPs following the implementation of such a concept would be a significant eradication of this unnatural language thus significantly promoting natural language pages.

Quote:
The larger the number of co-occurrences, the less likely that an expansion concept co-occurs with the query by chance.
As most writing/communication is designed to keep the interest of the person on the receiving end, we naturally vary the language used across the course of a segment of that communication. (I'd guess that 300 words would be a fairly nice size chunk across which natural communication uses fairly few repetitions, as there are enough words to keep subtly changing without running out of changes.) Extreme repetition would create a very dull read, which would lose the attention of the reader very quickly thus causing back button clicks or clicks off page.

Effectively this is saying the more repetitions the more unnatural the language, in scientific speak, and thus the more likely the segment is to be SEO generated junk.

It would be interesting to me to understand how this compares with the use of stemming techniques where different suffixes of words are taken into account as the two may well produce similar SERPs to someone who doesn't understand the theories.
Old 10-22-2004   #4
Nick W
Member
 
Join Date: Jun 2004
Posts: 593
Anyone that can make a dummy like me understand one of orion's posts deserves a bit of rep :-)

thanks Kali, much appreciated..

Nick
Old 10-22-2004   #5
Jeremy_Goodrich
Member
 
Join Date: Jun 2004
Posts: 55
Nice explanation, Kali

Not too good at the science of all this myself - the points being raised remind me of something (brotherhood_of_lan?) posted @ webmasterworld a while ago, about using natural language parsing techniques on dialogue in a forum versus NLP used against directory pages, articles, etc.

The thinking being that pages with different types of content parse differently, which is important, imho, to "future SERP enhancements coming to a search engine near you". That or this type of thing could already be in use...

Interestingly, it seems that there are a few different ways of doing up the whole personalization thing - this being one of the techniques that lead to more tailored SERP results than older techniques that treat all content the same. Diversity in search results is good for the user experience, imho, so that they get at the various potential meanings of their ambiguous queries when using search.
Old 10-22-2004   #6
orion
 
Good post!

Hi, Kali. Thank you for stopping by. Nice post and great presentation of thoughts.

1. "Here goes my thoughts - this would be a very neat way of getting rid of a lot of 'spam' web pages created to target specific keyphrases used by searchers as the search keyphrases have a different language concept behind them."

Very well put, Kali! However, LCA is a query expansion technique for building queries, not an optimization technique or general purpose search technique.

However, I do agree with you that understanding how LCA works could be used to construct expansion concepts (noun phrases) that are keyword-rich. For example, if we look at abstracts, ignore stop words and delimiters, and pay attention to ontology, expansion concepts are well discernible in the form of

noun-noun phrases (e.g. in pairs, triplets, etc)
verb-noun phrases
adjective-noun phrases

Note. LCA uses noun-noun phrases.

So, how could SEOs use expansion concepts? Feel free to take the following with a grain of salt, as it is just my two cents.

One approach: Identify expansion concepts from a web document. Once identified, use a thesaurus to craft derivative expansion concepts that are semantically connected. Re-write the content, embedding the expansion concepts. You should end up with shorter, richer content. A similar strategy could be tried with meta tags. Note that semantics can be improved without the need for keyword spamming. To avoid loosely connected combinations of terms, test their c-indices. When possible, use broader on-topic terms. Use narrower on-topic terms only when necessary.

3. "Extreme repetition would create a very dull read, which would lose the attention of the reader very quickly thus causing back button clicks or clicks off page.

Effectively this is saying the more repetitions the more unnatural the language, in scientific speak, and thus the more likely the segment is to be SEO generated junk."

I agree. Here we need to distinguish between co-occurrence and excessive term repetition (keyword spamming). Co-occurrence is about measuring co-word incidents rather than frequency of individual terms. In the LCA paper, Prof. Bruce Croft (a pioneer in this area and Director of the Center for Intelligent Information Retrieval at the University of Massachusetts) presents a tf*idf model based on co-occurrence, not on term counts (repetitions) as is usually done in conventional term vector schemes.

4. "Search keyphrases are built up as a set of single words that are important to the searcher in finding what they want, they will ignore superfluous glue. Thus we get search phrase flight town cheap whereas the natural language would be find a cheap flight to town."


Well put. This is described in the Precise paper. This paper discusses an example similar to the one you present, but in the context of building a parser for semantic tractability. The Precise paper discusses a statistical parser that can overcome parser errors and correctly map semantics and queries. It is used with a natural language interface (NLI) to query a database.


Orion
Old 10-23-2004   #7
Kali
More thoughts

Quote:
However LCA is a query expansion technique for building queries, not an optimization technique or general purpose search technique.
Surely the work on query expansion is designed to provide more relevant results to the searcher.

If it succeeds in doing that it will be used by the search engines - either in a raw form or more probably in a modified form. In which case we as SEOs will need to design techniques to ensure that our documents are still ranked well.
Old 10-23-2004   #8
orion
 

True, Kali. Examples of LCA can be found in contextual advertising and in clustered keywords (query suggestion tools).

Orion
Old 10-23-2004   #9
Kali
Questions

Would you think that synonyms are more or less likely to appear in the expansion concepts for a query?

From what I read it seems that the techniques espoused for generating the expansion concepts would make synonyms highly likely to be included in the expansion concepts.

The second question really relates to queries that have two completely different but equally relevant meanings. For example, the query 'Sydney Carpenter' could be someone looking for a carpenter in Sydney or someone looking for a person called Sydney Carpenter. It seems to me that results could flip from one position to another and back again with fairly small changes in the overall document set, thus producing some very unstable results.
Old 10-23-2004   #10
orion
 

Question 1. Synonyms in expansion concepts.

Good question. If you have a noun-noun expansion concept like "car insurance", you could use "auto insurance" in the document. You could also expand the previous one as "car insurance - auto...". If the IR system is programmed to ignore hyphens, this will be interpreted as a noun-noun-noun expansion concept; i.e., a triplet. The same should apply to the query.
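
As a small illustration of the hyphen point, here is a sketch of a tokenizer with a switch for ignoring hyphens. The function name and the `ignore_hyphens` flag are made up for this example; real IR systems differ in how they treat punctuation.

```python
import re


def tokenize(text, ignore_hyphens=True):
    """Lowercase and split text; optionally drop hyphens before splitting."""
    if ignore_hyphens:
        text = text.replace("-", " ")
    return re.findall(r"[a-z0-9-]+", text.lower())


print(tokenize("car insurance - auto"))
# ['car', 'insurance', 'auto']
print(tokenize("car insurance - auto", ignore_hyphens=False))
# ['car', 'insurance', '-', 'auto']
```

With hyphens ignored, the three nouns end up adjacent and can be read as a single triplet; with hyphens kept, the stray token splits them into a pair and a single noun.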

Question 2 relates to disambiguation and to the following, as stated by the authors of the LCA paper:

"An ambiguous query typically retrieves several clusters of documents which match the query equally well. We hope to utilize this property to determine whether a query is ambiguous. For an ambiguous query, we can choose to not expand it or ask the user to refine it."

However, there are two possible solutions to this.

One is using disambiguation techniques in combination with LCA. See the paper Disambiguation for Text Mining on the Web

The other is using clustering techniques. Often queries with two or more different meanings produce results with overlapping topics. The top N ranked documents will consist of different clusters, each of which is a retrieval subset. Any of the standard clustering techniques could be used to resolve the clusters.
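
The clustering idea can be sketched with a very simple single-pass clustering over bag-of-words vectors. This is one possible choice among the standard techniques, not something taken from the LCA paper; the documents, the 0.3 threshold and the function names are invented for the example, and the query words themselves are ignored because every retrieved document contains them.

```python
import math
import re
from collections import Counter


def vector(text, ignore=()):
    """Bag-of-words term counts, skipping the words in `ignore`."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in ignore)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, ignore=(), threshold=0.3):
    """Single-pass clustering: a document joins the first cluster whose first
    member (its seed) is similar enough, otherwise it starts a new cluster."""
    vecs = [vector(d, ignore) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vecs):
        for members in clusters:
            if cosine(v, vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "sydney carpenter for custom timber furniture and kitchen joinery",
    "hire a carpenter in sydney for timber decking and joinery",
    "sydney carpenter biography actor film career",
    "actor sydney carpenter film awards and biography",
]
groups = cluster(docs, ignore={"sydney", "carpenter"})
print(groups)                         # [[0, 1], [2, 3]]
print("ambiguous:", len(groups) > 1)  # ambiguous: True
```

More than one sizeable cluster among the top N documents is a hint that the query is ambiguous, in which case the system could skip expansion or ask the user to refine the query, as the quoted passage suggests.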

In my view, in question 2 plain LCA is not enough.

I hope this helps.

Orion

Last edited by orion : 10-23-2004 at 09:27 PM.
Old 10-25-2004   #11
orion
 
LCA vs RF

Local Context Analysis vs. Relevance Feedback: Which one requires fewer terms and less user interaction?

I was asked this question and thought some of you may have the same question in mind. I found that the paper Relevance Feedback versus Local Context Analysis as Term Suggestion Devices answers this question. According to the authors of this study,


"This study investigated the use of two different techniques for supporting query reformulation in interactive information retrieval: relevance feedback and Local Context Analysis, both implemented as term suggestion devices."


The most important findings of their study:

1. Relevance feedback offers user control and understanding of term suggestion
2. Local Context Analysis (LCA) requires relatively little user effort.
3. There are no significant differences between the two systems implementing these techniques in terms of user preference and performance.
4. Local Context Analysis requires significantly fewer user defined query terms

Conclusions:

1. Term suggestion without user guidance/control is the better of the two methods tested.
2. It is not necessary to rely on external evaluators for measurement of performance of interactive retrieval.


Orion

Last edited by orion : 10-26-2004 at 01:11 PM. Reason: removing white space
Old 10-19-2005   #12
ahmadq
Newbie
 
Join Date: Oct 2005
Posts: 2
Local context analysis implementation

I am a new member of these forums. Please, I need the Local Context Analysis implementation (the code) in any programming language.
Thank you
ktaysh
Old 10-21-2005   #13
orion
 

Quote:
Originally Posted by orion
Question 2 relates to disambiguation and to the following, as stated by the authors of the LCA paper:

"An ambiguous query typically retrieves several clusters of documents which match the query equally well. We hope to utilize this property to determine whether a query is ambiguous. For an ambiguous query, we can choose to not expand it or ask the user to refine it."

However, there are two possible solutions to this.

One is using disambiguation techniques in combination with LCA. See the paper Disambiguation for Text Mining on the Web

The other is using clustering techniques.
Orion
A lot of research has been conducted since this post. Indeed, there is a third solution:

Co-occurrence (c-indices) and on-topic analysis, conducted locally (within a doc) or globally (across a corpus). For term disambiguation during a season, an experimental technique I use is temporal semantic analysis, of which temporal co-occurrence is a special case.

Ahmadq,

Sorry, but that's not possible.

Orion
Old 11-20-2005   #14
ahmadq
Questions about local context analysis

Dear Orion,

I have some questions about Local Context Analysis, and I hope you can answer them:

Q1. Is the retrieval of documents using the initial query based on a standard similarity function, on term co-occurrences, or on something else?
Q2. How do I extract noun groups from a text?
Q3. What is the simplest way to construct a phrase?
Q4. How can I obtain the TREC collections? Is there a free site on the internet where I can download the test collections, or is there another free test collection?


Ahmad Alktaysh
Old 11-21-2005   #15
orion
 

Hi, there.

I hope this helps.

I'm afraid you may need to do a bit of your own homework.
Quote:
Q1.

Is the retrieval of documents using the initial query based on a standard similarity function, on term co-occurrences, or on something else?
This is explained in Sections 4.2 and 4.3 of their paper. The authors explain how co-occurrence theory is used in the model. They do mention the use of an IDF function that is different from the IDF measures used to compute conventional cosine similarity with term vector models.

Quote:
Q2. How do I extract noun groups from a text?
There are three ways of doing this: (1) with a pre-coded list of nouns, (2) with a library of regular expressions, or (3) through ontology analysis.

Quote:
Q3. What is the simplest way to construct a phrase?
You got me with this one. In which language?

Quote:
Q4. How can I obtain the TREC collections? Is there a free site on the internet where I can download the test collections, or is there another free test collection?
NIST.gov has tons of information on this. As with many gov sites, some sections of the site may be old.

TREC collections are not software. TREC stands for the Text REtrieval Conference, which is run by NIST. Authors often refer to "TREC collections" followed by a number, which refers to the conference. They also refer to TREC work followed by a topic number.

Some authors publish papers with links pointing to the collections used for their work. TREC collections are free downloads. Some colleagues even have test projects for students. Here is one example from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/...pring/hw3.html

Check also Donna Harman's TREC Editor page at http://www.nist.gov/nta-bin/query2.cgi , though this section hasn't been updated in a while.

Regarding additional information on TREC collections, search in Google for

1. trec collection download, which is a 3-noun phrase (NNN) query
2. download trec collection, which is a 1-verb + 2-noun phrase (V-NN) query

Ironically, note that these two 3-term queries behave as two different expansion concepts (document concepts) as described in the LCA paper, since we have NNN and V-NN.

Let me know if this is of any help.

Orion

Last edited by orion : 11-21-2005 at 12:13 AM.