Special thanks to:
#1
J. Xu and W. B. Croft, "Improving the Effectiveness of Information Retrieval with Local Context Analysis": http://citeseer.ist.psu.edu/cache/pa...0improving.pdf
Let's discuss Local Context Analysis and expansion concepts.
Orion
#2
OK, let's start some discussion about LCA. The following is taken from the reference given in http://www.miislita.com/exp1/on-topic-analysis.html
See also the On-Topic Analysis thread (http://forums.searchenginewatch.com/...6826#post16826).
Local Context Analysis (LCA) is based on the use of expansion concepts. Some IR scientists use the expression document concepts. Expansion concepts or document concepts are noun phrases; that is, noun groups consisting of one, two or three adjacent nouns. For instance, the phrases food recipe and California car insurance are expansion concepts, but cheap hotel and exciting hot vacations are not, because they contain adjectives.
LCA discovers expansion concepts as follows. Instead of examining entire documents, concepts are extracted from passages of fixed length L, typically a window of 300 words. Concepts are ranked by their co-occurrence with the query terms in the top N ranked documents. The highest-ranked concepts are then used for query expansion.
The model effectively applies a novel tf*idf treatment and co-occurrence theory to a subset that is part of a global set. This combination of local and global techniques produces more efficient query expansion. Nouns are used because research suggests that they are more informative and provide more possibilities for expanding queries than other types of terms.
Now, for those SEOs/SEMs who still have doubts about term co-occurrence, here is what Professor Bruce Croft (a pioneer in this area) found: the larger the number of co-occurrences, the less likely it is that an expansion concept co-occurs with the query by chance.
Any comments? Feedback?
Orion
Last edited by orion : 10-07-2004 at 11:52 PM.
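The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the formula from the Xu and Croft paper: the scoring function, the smoothing constant 0.1, and the names split_passages and rank_concepts are assumptions made for this example, and concepts are treated as single tokens to keep it short.

```python
import math


def split_passages(tokens, size=300):
    """Cut a token list into fixed-length, non-overlapping passages."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def rank_concepts(passages, query_terms, candidates):
    """Rank candidate concepts by co-occurrence with the query terms.

    passages:    list of token lists taken from the top N ranked documents
    query_terms: list of single-word query terms
    candidates:  list of single-token concepts (a simplification)
    """
    scores = {}
    for concept in candidates:
        score = 1.0
        for term in query_terms:
            # Co-occurrence: sum over passages of (term count * concept count).
            co = sum(p.count(term) * p.count(concept) for p in passages)
            # Multiplying rewards concepts that co-occur with every query term;
            # the assumed constant 0.1 keeps one missing term from zeroing the score.
            score *= 0.1 + math.log(1 + co)
        scores[concept] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    passages = [
        ["cheap", "flight", "airline", "ticket"],
        ["flight", "airline", "hotel"],
        ["cheap", "hotel", "recipe"],
    ]
    for concept, score in rank_concepts(passages, ["cheap", "flight"],
                                        ["recipe", "hotel", "airline"]):
        print(concept, round(score, 3))
```

In this toy data, airline shares passages with both query terms more often than hotel or recipe, so it ranks first. A concept that co-occurs with only one query term is pushed down by the product, which is the intuition behind ranking on co-occurrence with the whole query.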
#3
Well, I'm interested in this thread, so I'll throw in my thoughts.
At the moment the science is beyond me. I'd need about six months of serious study and background reading to truly get my head around it, but...
Here goes my thoughts - this would be a very neat way of getting rid of a lot of 'spam' web pages created to target specific keyphrases used by searchers as the search keyphrases have a different language concept behind them. Search keyphrases are built up as a set of single words that are important to the searcher in finding what they want, they will ignore superfluous glue. Thus we get search phrase flight town cheap whereas the natural language would be find a cheap flight to town.
So where search engine optimisers try to target what the searcher is looking for by targeting the exact phrase the searcher used, they often create a lot of very unnatural language in the construction of their pages. What we would see in the SERPs following the implementation of such a concept would be a significant eradication of this unnatural language, and thus a significant promotion of natural-language pages.
Quote:
Effectively this is saying the more repetitions the more unnatural the language, in scientific speak, and thus the more likely the segment is to be SEO generated junk.
It would be interesting to me to understand how this compares with the use of stemming techniques, where different suffixes of words are taken into account, as the two may well produce similar SERPs to someone who doesn't understand the theories.
#4
Anyone who can make a dummy like me understand one of orion's posts deserves a bit of rep :-)
Thanks Kali, much appreciated.
Nick
#5
Nice explanation, Kali.
I'm not too good at the science of all this myself. The points being raised remind me of something (brotherhood_of_lan?) posted at webmasterworld a while ago, about using natural language parsing techniques on dialogue in a forum versus NLP used against directory pages, articles, etc.
The thinking was that pages with different types of content parse differently, and that this is important, imho, to "future SERP enhancements coming to a search engine near you". That, or this type of thing could already be in use... Interestingly, it seems there are a few different ways of doing the whole personalization thing, this being one of the techniques that lead to more tailored SERPs than older techniques that treat all content the same. Diversity in search results is good for the user experience, imho, because users get at the various potential meanings of their ambiguous queries.
#6
Hi, Kali. Thank you for stopping by. Nice post and great presentation of thoughts.
1. "Here goes my thoughts - this would be a very neat way of getting rid of a lot of 'spam' web pages created to target specific keyphrases used by searchers as the search keyphrases have a different language concept behind them."
Very well put, Kali! However, LCA is a query expansion technique for building queries, not an optimization technique or a general-purpose search technique. Still, I do agree with you that understanding how LCA works could be used to construct expansion concepts (noun phrases) that are keyword-rich. For example, if we look at abstracts, ignore stop words and delimiters, and pay attention to ontology, expansion concepts are well discernible in the form of noun-noun phrases (e.g. pairs, triplets, etc.), verb-noun phrases, and adjective-noun phrases. Note that LCA uses noun-noun phrases.
So, how could SEOs use expansion concepts? Feel free to take the following with a grain of salt, as it is just my two cents. One approach: identify expansion concepts in a web document. Once they are identified, use a thesaurus to craft derivative expansion concepts that are semantically connected. Rewrite the content, embedding the expansion concepts. You should end up with shorter, richer content. A similar strategy could be tried with meta tags. Note that semantics can be improved without the need for keyword spamming. To avoid loosely connected combinations of terms, test their c-indices. When possible, use broader on-topic terms. Use narrower on-topic terms only when necessary.
3. "Extreme repetition would create a very dull read, which would lose the attention of the reader very quickly thus causing back button clicks or clicks off page. Effectively this is saying the more repetitions the more unnatural the language, in scientific speak, and thus the more likely the segment is to be SEO generated junk."
I agree. Here we need to distinguish between co-occurrence and excessive term repetition (keyword spamming). Co-occurrence is about measuring co-word incidents rather than the frequency of individual terms. In the LCA paper, pioneer Prof. Bruce Croft (Director of the Center for Intelligent Information Retrieval at the University of Massachusetts) presents a tf*idf model based on co-occurrence, not on term counts (repetitions) as is usually done in conventional term vector schemes.
4. "Search keyphrases are built up as a set of single words that are important to the searcher in finding what they want, they will ignore superfluous glue. Thus we get search phrase flight town cheap whereas the natural language would be find a cheap flight to town."
Well put. This is described in the Precise paper, which discusses an example similar to the one you present, but in the context of building a parser for semantic tractability. The Precise paper discusses a statistical parser that can overcome parser errors and correctly map semantics and queries. It is used with a natural language interface (NLI) to query a database.
Orion
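To make the distinction between term repetition and co-occurrence concrete, here is a small Python sketch. The function names, the fixed non-overlapping window, and the two sample texts are assumptions for illustration only; the LCA paper uses its own passage-based weighting.

```python
def term_frequency(tokens, term):
    """Plain repetition count: how many times one term appears."""
    return tokens.count(term)


def cooccurrence_count(tokens, term_a, term_b, window=6):
    """Count non-overlapping windows in which both terms appear at least once."""
    hits = 0
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        if term_a in chunk and term_b in chunk:
            hits += 1
    return hits


if __name__ == "__main__":
    # A keyword-stuffed text: many repetitions, no co-occurrence with "car".
    spam = ["insurance"] * 10
    # A natural text: few repetitions, but the two terms appear together.
    natural = ("compare car insurance quotes online before "
               "you renew your car insurance policy").split()

    print(term_frequency(spam, "insurance"),
          cooccurrence_count(spam, "car", "insurance"))
    print(term_frequency(natural, "insurance"),
          cooccurrence_count(natural, "car", "insurance"))
```

The stuffed text scores high on repetition and zero on co-occurrence, while the natural text does the opposite. A scheme weighted by co-occurrence therefore does not reward repeating a single keyword.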
#7
More thoughts
Quote:
If it succeeds in doing that, it will be used by the search engines, either in a raw form or, more probably, in a modified form. In that case we as SEOs will need to design techniques to ensure that our documents are still ranked well.
#8
True, Kali. Examples of LCA can be found in contextual advertising and in clustered keywords (query suggestion tools).
Orion
#9
Questions
Do you think synonyms are more or less likely to appear in the expansion concepts for a query?
From what I read, it seems that the techniques espoused for generating the expansion concepts would make synonyms highly likely to be included.
The second question relates to queries that have two completely different but equally relevant meanings. For example, the query 'Sydney Carpenter' could come from someone looking for a carpenter in Sydney or from someone looking for a person called Sydney Carpenter. It seems to me that results could flip from one interpretation to the other and back again with fairly small changes in the overall document set, thus producing some very unstable results.
#10
Question 1. Synonyms in expansion concepts.
Good question. If you have a noun-noun expansion concept like "car insurance", you could use "auto insurance" in the document. You could also expand the former as "car insurance - auto...". If the IR system is programmed to ignore hyphens, this will be interpreted as a noun-noun-noun expansion concept, i.e., a triplet. The same should apply to the query.
Question 2 relates to disambiguation and to the following, as stated by the authors of the LCA paper: "An ambiguous query typically retrieves several clusters of documents which match the query equally well. We hope to utilize this property to determine whether a query is ambiguous. For an ambiguous query, we can choose to not expand it or ask the user to refine it."
However, there are two possible solutions to this. One is using disambiguation techniques in combination with LCA. See the paper Disambiguation for Text Mining on the Web. The other is using clustering techniques. Often, queries with two or more different meanings produce results with overlapping topics. The top N ranked documents will consist of different clusters, each of which is a retrieval subset. Any of the standard clustering techniques could be used to resolve the clusters. In my view, plain LCA is not enough for question 2.
I hope this helps.
Orion
Last edited by orion : 10-23-2004 at 09:27 PM.
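As a rough illustration of the clustering idea, the Python sketch below groups the top-ranked documents by word overlap and flags the query as ambiguous when more than one sizeable cluster appears. Greedy clustering on Jaccard overlap is only a stand-in for "any of the standard clustering techniques"; the function names, both thresholds, and the toy documents are assumptions, not something taken from the LCA paper.

```python
def jaccard(a, b):
    """Word-set overlap between two token lists, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_documents(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, doc in enumerate(docs):
        for cluster in clusters:
            if jaccard(doc, docs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([i])
    return clusters


def looks_ambiguous(docs, threshold=0.3, min_share=0.3):
    """True when at least two clusters each hold a sizeable share of the top documents."""
    clusters = cluster_documents(docs, threshold)
    big = [c for c in clusters if len(c) / len(docs) >= min_share]
    return len(big) >= 2


if __name__ == "__main__":
    top_docs = [
        "sydney carpenter joinery timber decking quotes".split(),
        "sydney carpenter timber decking renovation quotes".split(),
        "sydney carpenter author biography novels born".split(),
        "sydney carpenter author novels biography interview".split(),
    ]
    print(cluster_documents(top_docs))
    print(looks_ambiguous(top_docs))
```

With these toy documents for the 'Sydney Carpenter' query, the tradesman pages and the person pages fall into two separate clusters, so the query is flagged. A system could then skip expansion or ask the user to refine the query, as the quoted passage suggests.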
#11
Local Context Analysis vs. Relevance Feedback: Which one requires fewer terms and less user interaction?
I was asked this question and thought some of you might have the same question in mind. I found that the paper Relevance Feedback versus Local Context Analysis as Term Suggestion Devices answers it. According to the authors, "This study investigated the use of two different techniques for supporting query reformulation in interactive information retrieval: relevance feedback and Local Context Analysis, both implemented as term suggestion devices."
The most important findings of their study:
1. Relevance feedback offers user control and understanding of term suggestion.
2. Local Context Analysis (LCA) requires relatively little user effort.
3. There are no significant differences between the two systems implementing these techniques in terms of user preference and performance.
4. Local Context Analysis requires significantly fewer user-defined query terms.
Conclusions:
1. Term suggestion without user guidance/control is the better of the two methods tested.
2. It is not necessary to rely on external evaluators for measurement of performance of interactive retrieval.
Orion
Last edited by orion : 10-26-2004 at 01:11 PM. Reason: removing white space
#12
Local context analysis implementation
I am a new member of this forum. Please, I need a Local Context Analysis implementation (the code) in any programming language.
Thank you,
ktaysh
#13
Quote:
Co-occurrence (c-indices) and on-topic analysis, conducted locally (on a document) or globally (on a corpus). For term disambiguation during a season, an experimental technique I use is temporal semantic analysis, of which temporal co-occurrence is a special case.
Ahmadq, sorry, but that's not possible.
Orion
#14
Questions about local context analysis
Dear Orion,
I have some questions about Local Context Analysis and hope you can answer them:
Q1. Is the retrieval of documents with the initial query based on a standard similarity function, on term co-occurrences, or on something else?
Q2. How do I extract noun groups from a text?
Q3. What is the simplest way to construct a phrase?
Q4. How can I obtain the TREC collections? Is there a free site on the internet where I can download the test collections, or is there another free test collection?
Ahmad Alktaysh
#15
Hi, there.
I hope this helps. I'm afraid you may need to do a bit of your own homework.
Quote:
Quote:
Quote:
Quote:
TREC collections are not software. TREC stands for NIST's Text Retrieval Conference. Authors often refer to "TREC collections" followed by a number, which refers to the conference. They also refer to TREC work followed by a topic number. Some authors publish papers with links pointing to the collections they used for their work. TREC collections are free downloads. Some colleagues even have test projects for students. Here is one example from Jimmy Lin: http://www.umiacs.umd.edu/~jimmylin/...pring/hw3.html
Check also Donna Harman's TREC Editor page at http://www.nist.gov/nta-bin/query2.cgi , though this section hasn't been updated in a while.
For additional information on TREC collections, search Google for
1. trec collection download, which is a 3-noun phrase (NNN) query
2. download trec collection, which is a 1-verb + 2-noun phrase (V-NN) query
Ironically, note that these two 3-term queries behave as two different expansion concepts (document concepts) as described in the LCA paper, since we have NNN and V-NN.
Let me know if this is of any help.
Orion
Last edited by orion : 11-21-2005 at 12:13 AM.
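On the question of extracting noun groups, the NNN versus V-NN difference can be shown with a short Python sketch. It assumes the text has already been part-of-speech tagged (the tags below are written by hand; a real system would use a tagger), and the function name noun_groups, the tag labels, and the choice to cut longer noun runs into chunks of three are all assumptions for this example.

```python
def noun_groups(tagged, max_len=3):
    """Return runs of adjacent nouns, each at most max_len words long.

    tagged: list of (word, tag) pairs, where the tag "N" marks a noun.
    """
    groups, run = [], []
    for word, tag in tagged:
        if tag == "N":
            run.append(word)
            if len(run) == max_len:
                # Assumed rule: a run longer than max_len is cut into chunks.
                groups.append(" ".join(run))
                run = []
        else:
            # Any non-noun ends the current run.
            if run:
                groups.append(" ".join(run))
            run = []
    if run:
        groups.append(" ".join(run))
    return groups


if __name__ == "__main__":
    nnn = [("trec", "N"), ("collection", "N"), ("download", "N")]
    v_nn = [("download", "V"), ("trec", "N"), ("collection", "N")]
    print(noun_groups(nnn))
    print(noun_groups(v_nn))
```

The first query yields one three-noun group, while the second yields only the two-noun group that follows the verb. The same three words therefore give different noun groups depending on word order and tagging.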