View Full Version : On-Topic Analysis
orion
10-07-2004, 12:10 AM
Hello, everyone.
Back to business since the last hurricane!
I have published the results of experiment #1. Thanks to all participants for their help and input. The entire documentation can be read at http://www.miislita.com/exp1/on-topic-analysis.html
I changed the title of the experiment to reflect more accurately the theory and purposes behind the research. The new title is
ON-TOPIC ANALYSIS - Online Discovery of On-Topic Terms
The paper discusses a procedure for the online discovery of on-topic terms. Discovery is based on occurrence and co-occurrence information. It is demonstrated that on-topic analysis is a valuable tool for enabling users to enhance the semantics of theme sites and concept-focused documents. Specific applications to search engine marketing strategies and information retrieval systems are presented.
This is a long work involving competitive queries submitted by professional search engine marketing specialists. It introduces a methodology and procedure called on-topic analysis, which allows users to discover top, broader, narrower, and optimum terms. The notion of term distances is also presented.
EXPANSION CONCEPTS (LOCAL CONTEXT ANALYSIS)
The paper briefly discusses other techniques, in particular Local Context Analysis (LCA). LCA includes standard theories such as term co-occurrence theory, a novel tf*idf approach, and the notion of expansion concepts (also known as document concepts). For those interested in discussing the notion of document concepts, I am opening another thread about LCA at SEW: http://forums.searchenginewatch.com/showthread.php?t=2030
Let's discuss On-Topic Analysis!
Orion
Nacho
10-07-2004, 02:00 AM
The experiment, analysis and report are fascinating. Congratulations on a great job! I always learn something new from you every time you make a new communication like this, thank you.
Now, let me get my head together and post something soon. It's kind of late here to do it today.
Saludos!
fathom
10-07-2004, 12:35 PM
Excellent work Orion!
One thought to expand on... and using your example:
mexican food > mexican recipes > tortillas, burritos…
The thrust of the paper is improving page/link "On Topic" relationships.
I would tend to believe that the top of the hierarchy would be Mexican, and food (as in Mexican Food) would be another relationship tree (possibly a historical archive of the food style).
In this manner two new opportunities of expansion present themselves:
1. Interlinking of "on topic" pages,
2. Expansion into other Mexican on-topic trees
e.g. Culture, Language, War/History
Back to your original example: I would think the support that mexican food provides to mexican recipes would be somewhat diluted, whereas the support from mexican to mexican recipes would be complementary.
Your thoughts?
orion
10-07-2004, 01:00 PM
Hi, Fathom
Thanks for the input. It is true that any diagram can lead to different possibilities.
First, the selection of mexican food was based on the query Nacho submitted as a participant in the study.
Second, our software indicates that recipes is an optimum term for the query mexican food; i.e. it appears near the top of both search- and ranking-based Pi values.
Regards.
orion
10-07-2004, 01:27 PM
Oops. Sorry. I forgot to include the Pi values from the Google Keyword Tool for the mexican food query. Here are just the top results.
UNIQUE TERMS:157 TOTAL TERMS:600
N=GOOGLE QUERY:MEXICAN FOOD
Ri Pi (%) TERMi
1 32.50 MEXICAN
2 32.50 FOOD
3 1.50 RECIPES
4 0.83 HISTORY
Compare these with Table 1. The term recipes is found near the top of the search-based Pi lists. Since recipes is also found near the top of results extracted from the top N titles, it is considered a candidate optimum term. Interestingly, and to Fathom's well-deserved credit, the term history also shows up.
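A minimal sketch of the "near the top of both lists" check described above. The search-based list is the Google Keyword Tool ranking quoted in this post; the ranking-based list is a made-up placeholder, not data from Table 1. The function name, the top_k cutoff, and the choice to skip the query's own terms are all assumptions for illustration, not the software mentioned in the thread.

```python
def candidate_optimum_terms(search_ranked, ranking_ranked, query_terms, top_k=5):
    """Return terms near the top of BOTH Pi-sorted lists.

    Assumptions: "near the top" means within the first top_k entries,
    and the query's own terms are skipped.
    """
    query = {t.upper() for t in query_terms}
    near_top_ranking = {t.upper() for t in ranking_ranked[:top_k]}
    return [
        t for t in search_ranked[:top_k]
        if t.upper() in near_top_ranking and t.upper() not in query
    ]


# Search-based order as posted above (Google Keyword Tool, query "mexican food").
search_based = ["MEXICAN", "FOOD", "RECIPES", "HISTORY"]

# Hypothetical placeholder for a ranking-based (top N titles) order.
ranking_based_placeholder = ["MEXICAN", "FOOD", "RESTAURANT", "RECIPES", "AUTHENTIC"]

candidates = candidate_optimum_terms(
    search_based, ranking_based_placeholder, query_terms=["mexican", "food"]
)
print(candidates)
```

With these inputs only RECIPES survives, because HISTORY is absent from the placeholder ranking list. Real lists would change the outcome.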
Orion
fathom
10-07-2004, 01:46 PM
Understood and not discrediting the work - it's exceptional.
From a "search" topical vantage point, and considering conscious or even subconscious searching habits, there would be a market segregation that plays a part in topic relevancy.
Ranking ability via on-topic algorithmic analysis still plays second chair to a searcher's appreciation of being on topic.
Tabular results, as in output from Wordtracker or the Google KW Tool (as examples), often bias initial observations because they rest on incomplete profiling information.
(Illustration only) A query for mexican recipes is quite different from a query for mexican food. The former assumes "making it", possibly a hierarchical query for ingredients; the latter would normally assume "consuming it", possibly a hierarchical query for prepared foods, restaurants, or general informative dialogue.
From a commercial vantage point one might assume a product sales vs. services segregation. Taking this a step further, a niche portal developed in this manner could possibly lead to negative sales trends or an imbalance in sales, e.g. why buy prepared if I can make it cheaper?
I guess my only point is about taking pure math & science conclusions and figuring out how to visualize real-world applications, particularly in a commercial environment.
Again - excellent work!
orion
10-07-2004, 02:18 PM
Hi, Fathom.
Your observations are extremely important and well put.
I find these two statements to be the reason why a complete approach must be considered.
Ranking ability via on-topic algorithmic analysis still plays second chair to a searcher's appreciation of being on topic.
Tabular results, as in output from Wordtracker or the Google KW Tool (as examples), often bias initial observations because they rest on incomplete profiling information.
Ranking only or search stats only (Google, KW, WordTracker) are incomplete approaches to the process of on-topic keyword discovery. The first one is indeed second chair and the other can indeed be biased.
With a complete approach such as the one we presented, proper identification of optimum terms can be achieved.
I find it fascinating that resolved data structures can be extracted from unstructured commercial documents full of noisy information.
Orion
fathom
10-07-2004, 03:26 PM
Amazingly, we complement each other here. While the math & science of things makes it work, I prefer conceptual model overviews and how they can be adapted to existing structures.
In plain language - I dabble in Math & Science! ;)
orion
10-08-2004, 12:10 PM
In our experiment, we asked for theme queries for two reasons:
(a) to start with theme terms that are strongly connected (c-index pre-qualified).
(b) starting with a single term would have produced many possible thematic trees.
This is why we started with "mexican food" and not with "mexican" or why we started with "used cars" and not with "used", "forex trading" and not "forex", "email marketing" and not "email", etc. These were theme queries submitted by participant SEOs.
A search for "mexican" and subsequent analysis reveals that "Mexico" qualifies as a candidate optimum term. However, this is quite obvious. Constructing a query or topic for something like "Mexican Mexico" sounds redundant.
Now here is something I would like to share.
Someone emailed me asking about the difference between top and optimum terms. The sender allowed me to reproduce my own response in this thread. Here is the response.
"First, thank you for reading the paper and taking the time to contact me."
"According to the paper's thesis, a top term is a term found at the top of occurrence probability lists that have been extracted from the top N ranked titles."
"An optimum term is a bit more complex to define. Let me first define an optimum: the best possible solution to a problem. Thus, what works for you may not work for others. It all depends on what you're trying to accomplish. According to the paper's thesis, an optimum term should be
1. relevant to searchers (as measured from searches-based Pi lists)
2. relevant to ranking algorithms (as measured from ranking-based Pi lists)
3. strongly connected (as measured with c-indices)
4. good (ROI) performer
This should not be taken for "The Definition". A term can be an optimum term in Google but not in MSN or Overture. Certainly we can add copy style, contextuality, geolocation targeting, and other things to the mix. Thus the above should be taken as a mere guide for term selection."
"The main finding of the On-Topic Analysis paper is that it is possible to extract well-resolved data structures of the form top > broader > narrower ... from unstructured collections full of commercial noise.
Once we know which data structures are triggered by which queries in which databases, then we can target these structures according to a site development or marketing strategy. Note that discovery is accomplished client-side and without having to use classic IR methodologies that may work well under lab conditions but often fail on the commercial Web.
Moreover, we can quantify points 1, 2, and 3 with our software and do so on demand for others (but this is a topic I am pondering with others). Point 4, ROI, is something the marketer has to take care of. This is one of the reasons why the terms are called candidate optimum terms. I hope this helps."
Orion
Phoenix
10-08-2004, 01:20 PM
Wow, excellent article Orion, been looking for some good scientifically based articles lately on these topics. Reading it now. Would be interesting to see how this applies to other themed terms. :)
orion
10-08-2004, 05:12 PM
Hi, Phoenix.
Thanks for such kind words.
Would be interesting to see how this applies to other themed terms. :)
Any suggestion?
Orion
Nacho
10-08-2004, 08:35 PM
Once we know which data structures are triggered by which queries in which databases, then we can target these structures according to a site development or marketing strategy. Note that discovery is accomplished client-side and without having to use classic IR methodologies that may work well under lab conditions but often fail on the commercial Web.
I believe this is the key to making it all work and minimizing margin of error or increasing on-topic keyword generation.
When I look at Table 1. Data for Mexican Food (http://www.miislita.com/exp1/table1-mexican-food.html) and compare it with my own search databases at MexGrocer.com, which show what users try to locate within the store when they skip the navigation, I see a number of high-volume search queries that do not come up in your study, or at least not in the top results. For example, here is our most recent list:
Searches / Query
930 mole
873 enchiladas
610 tamales
565 corn husks
514 tacos
501 mango
497 cheese
442 recipes
442 vanilla
440 chipotle
418 rice
406 carne asada
397 menudo
393 achiote
390 chocolate
390 guacamole
382 salsa
369 horchata
367 chorizo
351 pozole
Notice how "tortillas" or "burritos" do not come up in my top 20 internal search queries, but other words like mole, tamales, corn, guacamole, or salsa do come up. On the other hand, words like enchiladas, corn husks, tacos, mangos, cheese and others do very well in on-topic search demand. Suppose we take some of these words and run a c-index ratio comparison.
First, the words used in "5.4.1 Visualization of Term Distances" and also in Table 1, but not in MexGrocer.com's top 20 internal search queries.
http://www.ihispanic.com/sewf/mexican-food1.gif
Now the words "mole, tamales, corn, guacamole, or salsa" that appear on Table 1 and also on MexGrocer.com's top 20 internal search queries.
http://www.ihispanic.com/sewf/mexican-food2.gif
Now the words "enchiladas, corn husks, tacos, mangos, cheese" that DO NOT appear on Table 1, but do appear on MexGrocer.com's top 20 internal search queries.
http://www.ihispanic.com/sewf/mexican-food3.gif
Therefore, if we know a little more about "which data structures are triggered by which queries in which databases" then the on-topic analysis becomes even more valuable information for minimizing margin of error or increasing on-topic keyword generation.
orion
10-08-2004, 09:37 PM
Hi, Nacho
I got your email. We'll talk over the weekend. About your post: you are comparing two different things.
The tables you present are co-occurrence data, not search data or ranking data.
The example given in the tables and the other tables in the paper are based on Pi values from terms extracted from the top N ranked titles returned by a specific query, "mexican food". The stats you present in the post are from search logs. So we cannot compare the two lists. It is like comparing Google Keyword Tool Pi values with Pi values from the top N titles.
To associate your search logs with a data structure you would need to identify optimum terms. Then, starting from the optimum terms, construct the data structure for top, broader, and narrower terms. Our software does precisely this. This treatment, constructing data structures straight from optimum terms, was not discussed in the paper, which was aimed at presenting the basics of on-topic analysis.
The theme diagram in Figure 1 and the term distances diagram shown in the paper are two different things. The theme diagram was meant to illustrate the architecture of a theme site. The tree diagram for term distances illustrates how term co-occurrence does indeed generate a cluster of associations (similar to PDQ_Med). There is no correlation between the two, except for the fact that I use identical terms in the examples.
Orion
Nacho
10-08-2004, 09:48 PM
Orion, thank you very much for this clarification. Now I understand the differences a lot better. I'm learning a lot from your analysis.
I see that your software will be revolutionary and I know that most of us in the SEO industry will want to get a hold of it as soon as it is available. May I ask if there are any expected dates in mind for release?
orion
10-08-2004, 09:51 PM
One more thing. Those tables you have posted, Nacho, are not mine, so that's not my copyrighted material. However, let me comment on the figures shown in those not-mine tables.
You are mixing terms that are too broad with terms that are too narrow, which explains why you are getting such small c-indices when querying in FINDALL mode. Thus, correlations cannot be made.
Orion
Nacho
10-08-2004, 10:23 PM
So if I want to learn how to swim, I need to jump in the water, right? . . . . . . somebody quick, throw in a lifesaver :p
orion
10-08-2004, 11:11 PM
Hi, Nacho
I wish days had more than 24 hours, as it is getting late here in the Caribbean. Let me mention a few things. Here is one "salvavidas", amigo.
c-indices for three terms cannot be calculated with the c12-index equation. You would need to use the c123-index expression. See Appendix A.
So, your previous c12-index calculations need to be corrected.
However, to avoid the equation you can transform the 3-term case into a 2-term case by enclosing the mexican food phrase in quotes.
So for mexican food enchiladas, we get these results
k1="mexican food" n1=660000
k2=enchiladas n2=203000
k12="mexican food" enchiladas n12=23200
c12=27.63 ppt
See the paper for the consequences of simplifying the 3-term case into a 2-term case.
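As a worked check of the figures above, here is a minimal sketch of a two-term c-index. The equation itself is not given in this thread (it is in the paper), so the form used here, n12 / (n1 + n2 - n12) expressed in parts per thousand, is an assumption; it is chosen because it reproduces the 27.63 ppt value from the counts posted. The function name is hypothetical.

```python
def c12_index(n1: int, n2: int, n12: int) -> float:
    """Two-term co-occurrence index in parts per thousand (ppt).

    Assumed form: results containing both terms divided by results
    containing either term, times 1000.
    """
    either = n1 + n2 - n12
    if either <= 0:
        raise ValueError("result counts must leave a positive union")
    return 1000.0 * n12 / either


# Result counts quoted in the post for "mexican food" and enchiladas.
c12 = c12_index(n1=660_000, n2=203_000, n12=23_200)
print(f"c12 = {c12:.2f} ppt")  # prints "c12 = 27.63 ppt"
```

The three-term case is not covered here; as noted above, it needs the c123-index expression from Appendix A of the paper.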
Take care, buddy.
Orion
Phoenix
10-08-2004, 11:43 PM
Orion,
I had an interesting thought that maybe you could clarify, regarding a comparison of two subject-specific terms that seem related but are not. You mentioned in your study:
The simplest way to qualify on-topic narrower terms consists in combining them with other on-topic narrower terms and checking their c-index values. This approach helps with the qualification of narrower terms and leads to the discovery of new on-topic narrower terms.
The terms I am referring to are "cheese nachos" and "bean and cheese nachos". Both terms look specific to their subject. But for the sake of the question: cheese nachos are a food item that is not considered "mexican food", while "bean and cheese nachos" are. An uncle of mine is credited as the man whose company put the food item "cheese nachos" in stadiums and sports games many years ago. I know from this experience that those are classified as "concession foods". However, if you go to a mexican restaurant you can get "bean and cheese nachos", which is something different. It appears that the Google Keyword Suggestion Tool groups these two as the same thing, in a category such as "appetizers". Under "additional words" to consider, it provides terms such as "hors d oeuvres, bean dip, appetizers, nachos, old el paso".
Now my question is: how can qualifying these narrower terms with the c-index result in categorizing them under the broad terms "mexican food" and "concession food" rather than under other broad terms? It would appear that those broad terms would be better than "appetizers". Or would they? How is this done?
orion
10-09-2004, 09:28 PM
I'm not sure I clearly understand the question. Could you refine it?
Orion
orion
10-12-2004, 11:41 PM
Phoenix and Nacho, I don't know if this will help a bit.
I'm currently researching the nature of query-triggered data structures. One of the methods for discovering relationships between narrower and broader terms consists in looking at two types of data structures; i.e. data structures extracted from
a. the immediate results (top 30 ranked titles)
b. the "bulk" (top 100, 200, 300, 400, 500...etc titles)
Professor Filippo Menczer, from Indiana University, emailed me a good reference to his work today: http://informatics.indiana.edu/fil/Papers/cikm-04-326.pdf We have been going back and forth with research notes since we work in similar research areas.
His approach to topic term discovery is different from mine but, when we look at the levels from which the data structures are triggered, we come up with similar results, at least for the example he provided in the paper using the terms Mars, MOC, Exploration, and similar terms (see his paper).
My partial results follow (compare with the example in his paper)
First 8 results from the top 30 titles
UNIQUE TERMS:60 TOTAL TERMS:102
N=30 QUERY:MOC
UNIQUE TERMS/N=2.00
TOTAL TERMS/N=3.40
Ri Pi (%) TERM i
1 13.73 MOC
2 5.88 MARS
3 4.90 IMAGE
4 4.90 MINISTRY
5 2.94 GLOBAL
6 2.94 SURVEYOR
7 1.96 ORBITER
8 1.96 CAMERA
First 8 results from the top 300 titles
UNIQUE TERMS:835 TOTAL TERMS:1427
N=300 QUERY:MOC
UNIQUE TERMS/N=2.78
TOTAL TERMS/N=4.76
Ri Pi (%) TERM i
1 12.12 MOC
2 1.12 MARS
3 0.98 NEWS
4 0.77 SHOES
5 0.77 MERRELL
6 0.70 GLOBAL
7 0.63 SERVICES
8 0.63 FREE
Lists of results for the top 200-500 titles show a similar data structure in the bulk. Compare with the data structure in the top 30 results.
Again, this is only one method of discovering data structures relevant to a narrower term. Once we know which data structures go with which narrower term, as triggered by which query, we can proceed and exploit the structure for any particular purpose (e.g. monetary gain, etc.)
Hope this helps
Orion
Phoenix
10-13-2004, 12:04 AM
Orion,
Very interesting, actually this does help explain a bit more what I was wondering, and puts the concepts a little bit more into perspective. Especially when I looked at the comparison of top 30 titles and top 300 titles.
Regarding "discovering relationships between narrower and broader terms": in my original question, I guess I was trying to ask you how relationships are made between a broad term, "mexican food", and a narrower term, "cheese nachos" (or a similar term), and how they are assigned to a particular data structure as you mentioned. I believe that would require discovering the data structure first, right?
On a side note: I wish they taught this stuff at the university I attend, because I would probably take the class.
Thanks,
Ben
orion
10-13-2004, 10:08 PM
I believe that would require discovering the data structure first, right?
Yes.
If you check the On-Topic paper, almost the same data structures were obtained in the immediate "surrounding" (top 30 results) and in the bulk (N=100, 200) with top and broader terms. The exceptions were initial seeds consisting of narrower terms or loosely connected terms (e.g., aloha indiana and aloha montana).
In general, narrower terms return different data structures in the surrounding and in the bulk. Top and Broader terms tend to return the same type of data structures. This phenomenon can be used to identify narrower terms. The MOC (Mars Orbiter Camera) case is an example.
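One rough way to put a number on "different data structures in the surrounding and in the bulk" is to measure how many top terms the two lists share. The sketch below uses the first 8 terms from the MOC tables posted earlier in this thread; the overlap measure (shared terms divided by all distinct terms) and the function name are assumptions for illustration, not the method used in the paper.

```python
def top_term_overlap(immediate, bulk):
    """Share of distinct terms that appear in both top-term lists (0.0 to 1.0)."""
    a, b = set(immediate), set(bulk)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# First 8 terms for the query MOC, as posted earlier in the thread.
top30 = ["MOC", "MARS", "IMAGE", "MINISTRY", "GLOBAL", "SURVEYOR", "ORBITER", "CAMERA"]
top300 = ["MOC", "MARS", "NEWS", "SHOES", "MERRELL", "GLOBAL", "SERVICES", "FREE"]

shared = sorted(set(top30) & set(top300))
overlap = top_term_overlap(top30, top300)
print(shared, f"{overlap:.2f}")
```

Here only MOC, MARS, and GLOBAL are shared (3 of 13 distinct terms), which fits the observation that a narrower seed like MOC drifts toward unrelated terms in the bulk. No threshold for "different" is given in the thread, so none is assumed here.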
With cheese nachos, all I can tell you is that in Google "mexican food" does not provide a good tree of terms for cheese nachos. Moreover, cheese nachos as a seed query returns a data structure that has little to do with "mexican food".
Information is power.
Hope this helps
Orion
orion
10-29-2004, 02:52 PM
For those that like conspiracy theories (wiretapping, spyware, gov-monitored forums/chat rooms, ufos, Elvis was the first SEO, etc) the wiretapped.net site, under "Covert Collection of High Capacity Signals", section b "ILC Processing Techniques" claims
"Covert agencies employ a vast array of multi-protocol data interception systems and devices. Such devices are capable of intercepting selectable, or randomly chosen communications channels implementing a new concept called "topic analysis"...
"However, such systems DO exist, and all operate on topic analysis techniques. For example: Such systems are based on dictionary computers with built in (pre-programmed) key words. These systems are designed to be placed in the paths of communications channels, such as standard voice traffic, or modem links." http://www.mirrors.wiretapped.net/security/info/papers/telecomms/hybrid-files/comint.txt
End of the quote. Certainly, without any proof, all this reduces to mere speculation.
I was seriously pondering how specific patterns of related words and term co-occurrence could be used for intelligence purposes. Without trying to dilute the thread with speculation or conspiracy theories, what do you think about on-topic techniques as intelligence and homeland security tools? Please provide value-added information.
Orion
traian
06-29-2005, 04:11 PM
I have only read the thread so far, but it seems to have some important ideas. It sounds interesting. When I have more time I'll read the entire documentation, even though I do not have any related knowledge of topics and semantics. It seems that this will be very important in the future for all SEOs.
Great job Orion.
traian
07-04-2005, 07:03 PM
I read the article over the weekend. It has some interesting points of view and it's quite a piece of work, no doubt, but I have something to ask you, Orion.
How relevant is it? I mean, you focus heavily on the text in the title. As far as we know, results do not depend only on the title, but more on the page content and site theme. Shouldn't both the title and the body text be extracted? Some interesting results may then occur. But what about cloaked pages in this case?
orion
07-04-2005, 11:45 PM
Hi, there.
Thanks.
The paper is from last year and was limited to investigating titles as passages only, just as a reference point. The top N passages (titles in this case), rather than the entire content of the body, were used in order to follow previous relevance feedback/query expansion methodology as described in the scientific literature.
I mentioned in the limitations and future work section of the paper that I was planning to include other types of passages: SERP entries as well as document content. So far this is what I have:
Study with bodies
In some cases and depending on the query, I found that when defining passages as bodies, many spam documents that rank high in the SERPs introduce noisy terms to the analysis.
Study with urls
Defining the working passage as the top N urls, I found it is useful for assessing topically connected web communities, but the results are still inconclusive.
This second phase is not yet ready.
No, I haven't considered cloaking yet, so I cannot comment on this. I'm currently working on a different short-term project for the summer.
Orion
traian
07-05-2005, 02:06 PM
Yes, interesting, Orion. If you continue doing research and need more participants, you can count on me. I'll PM you my email.
orion
03-26-2006, 11:02 PM
I was planning to discuss this material in the Search Security Strategies (http://forums.searchenginewatch.com/showthread.php?p=77222#post77222) thread, but changed my mind. Instead, I'm presenting it here. Hope you like it.
Mapping a document to relevant term sequences can be accomplished by inspecting its content. The reverse process, mapping term sequences to documents without a priori knowledge of the document, is possible and is described below; but first a few words about On-Topic Analysis.
Unlike other keyword discovery techniques, On-Topic Analysis is a method for discovering keywords in which terms are
1. hierarchically related, as via the data structure broader terms > narrower terms > specific (related) terms
2. contextually related, as in term sequences of the form k1 + k2 + k3....kn
In 1, one must know the initial seed in order to obtain an output. This output is then used as input to grow a tree of terms. However, starting with the wrong seed may grow incorrect trees.
In 2 one must take into consideration the nature of the k's (N groups, NV, VN, etc).
There exist several workarounds to 1 and 2: wildcards, clustering, disambiguation, keywords co-occurrence, etc. For now, I want to concentrate on 2 and the use of wildcards.
On-Topic Iterations
Here I present a nice trick I call On-Topic Iterations, which could be viewed as an in-context query expansion technique. The goal is to map specific sequences to specific documents without having a priori knowledge of the documents.
On-Topic Iterations are possible thanks to Google's implementation of the * wildcard. Discovered terms can then be used to refine new answer sets.
To find or expand term sequences about a given topic (k) one combines k with * and quotes the query, which is the same as submitting the query in EXACT mode; i.e.
"* k"
"k *"
"* k *"
where k can be one or more than one term. If k has only one meaning the queries should lead to a common tree.
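The three query forms above can be generated mechanically. This is a minimal sketch; the function names are hypothetical, and the URL pattern (num and q parameters on http://www.google.com/search) is simply the one used in the links in this post, percent-encoded by the standard library.

```python
from urllib.parse import urlencode


def wildcard_queries(k):
    """Return the EXACT-mode wildcard queries "* k", "k *" and "* k *"."""
    k = k.strip()
    return [f'"* {k}"', f'"{k} *"', f'"* {k} *"']


def search_url(query, num=30):
    """Build a search URL for the top `num` results of a quoted query."""
    return "http://www.google.com/search?" + urlencode({"num": num, "q": query})


# k may be one term ("car") or several ("rental car").
for q in wildcard_queries("rental car"):
    print(q, "->", search_url(q))
```

The seed k is passed through unchanged apart from trimming whitespace, so multi-term seeds work the same way as single terms.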
To illustrate, suppose that we want to construct a topic tree about cars and we inspect the top N=30 results in Google (first 3 pages most users will care to view, anyway):
1. "* car" (http://www.google.com/search?num=30&q="* car") = converges to rental car
2. "car *" (http://www.google.com/search?num=30&q="car * ") = converges to rental car
3. "* car *" (http://www.google.com/search?num=30&q="*car *") = converges to rental car
Surprisingly, the top 30 results discover a unique narrower term (rental) and converge to a unique topic in just the first iteration: rental car.
This is not always the case, especially with ambiguous terms or terms with more than one meaning; the nature and scope of the initial seed does matter.
The output is then used as the new seed, which can now help us to grow three possible trees. Continuing with the iteration,
1. "* rental car" (http://www.google.com/search?num=30&q="* rental car") = converges to specific firms about discount rental car
2. "rental car *" (http://www.google.com/search?num=30&q="rental car *") = discovers specific terms
3. "* rental car *" (http://www.google.com/search?num=30&q="* rental car *") = discovers specific terms
The trick works great since it guides the searcher to quickly discover, and hence map, relevant term sequences to specific documents. These sequences can be used for multivariate testing or for discovering new sequences, as in
"discount * car" (http://www.google.com/search?num=30&q="discount * car") = which converges to location-specific sites.
When starting an iterative process one does not need to define k as a single term. In fact, for many esoteric terms like avatar, one can do better with two or more terms. To illustrate, try with
"* avatar loans" (http://www.google.com/search?num=30&q="* avatar loans")
"* avatar * loans" (http://www.google.com/search?num=30&q="* avatar * loans")
"avatar loans *" (http://www.google.com/search?num=30&q="avatar loans *")
Then pay attention to the newly discovered sequences associated with a given document. Querying such a sequence should rank the corresponding document high and, if not, slightly tweaking the document for that sequence should improve its ranking. This opens up possibilities for the discovery of many new on-topic sequences one might not be aware of.
I'm researching the intricacies of this iterative procedure to address why and when it works. So far, most of the cases I have tested work just fine.
Orion
traian
03-27-2006, 05:19 AM
Hi Orion,
In fact the search query "* car" (http://www.google.com/search?num=30&q=%22*%20car%22) returns for me, more bolded words, like:
Agencies Rental car
Alamo Rent A Car
Geek your car ... and so on.
How can I tell which ones are the narrower terms and which are not? Are you taking into account the titles only, or the descriptions also?
Can you explain better " '* car' = converges to rental car"? What do you mean by convergence?
I might have understood but I am sure that there are a lot of us who didn't.
Thanks,
traian
orion
03-27-2006, 01:29 PM
Great questions. I hope this helps.
About the * operator. Google's implementation of * acts as both a wildcard and a placeholder (position keeper). So
"* k"
"* * k"
"* * * k"
"k1 * k2"
"k1 * * k2"
"k1 * * * k2"
etc. tend to discover terms in those positions.
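To make the position-keeper idea concrete, here is a small sketch, assuming Python, that writes out the patterns listed above for a given number of placeholder slots. The function names and the `max_gap` parameter are illustrative only; nothing here queries a search engine.

```python
def leading_placeholders(k, max_gap=3):
    """Patterns like "* k", "* * k", "* * * k": one to max_gap slots before k."""
    return ['"' + "* " * gap + k + '"' for gap in range(1, max_gap + 1)]


def inner_placeholders(k1, k2, max_gap=3):
    """Patterns like "k1 * k2", "k1 * * k2": one to max_gap slots between k1 and k2."""
    return ['"' + k1 + " " + "* " * gap + k2 + '"' for gap in range(1, max_gap + 1)]


if __name__ == "__main__":
    print(leading_placeholders("car"))
    print(inner_placeholders("discount", "car"))
```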
About the Bold feature. Highlighting terms in bold is more of a usability feature than anything else, and it is used by search engines in general, not just Google. The idea is to make it easier for users to find terms.
About Convergence. Terms from the top N results are counted and sorted by Pi values, so we end up with a ranked list of unique terms (Pi = frequency of term i / total occurrences). In this case, rental was at the top of the Pi table. I thought this procedure was discussed in the On-Topic paper.
About the goals. The goal of the above was to show how we can discover new sequences associated with specific documents without a priori knowledge of the latter. The conventional way of doing this consists of finding term sequences by inspecting the documents. This is most definitely not a replacement for On-Topic Analysis, clustering or brainstorming. It is another tool in the toolbox.
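A minimal sketch of the Pi table, assuming Python and assuming the snippets from the top N results have already been collected as plain strings. Two choices here are my assumptions, not something stated above: "total occurrences" is taken to mean the total number of counted term occurrences in the sample, and the seed terms are left out of the count. The sample snippets are invented placeholders.

```python
import re
from collections import Counter


def pi_table(snippets, seed_terms=()):
    """Return (term, Pi) pairs sorted by descending Pi, where
    Pi = occurrences of term i / total counted occurrences."""
    skip = {t.lower() for t in seed_terms}
    counts = Counter(
        tok
        for text in snippets
        for tok in re.findall(r"[a-z0-9]+", text.lower())
        if tok not in skip
    )
    total = sum(counts.values())
    return [(term, n / total) for term, n in counts.most_common()]


if __name__ == "__main__":
    # Invented snippets standing in for the top N results of "* car"
    sample = ["Rental car deals", "Alamo Rent A Car", "Cheap rental car rates"]
    for term, p in pi_table(sample, seed_terms=("car",)):
        print(term, round(p, 3))
```

With these invented snippets, rental is counted twice out of eight occurrences, so it sits at the top of the table with Pi = 0.25.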
Orion
orion
03-27-2006, 04:49 PM
Perhaps the best way to grasp the idea of mapping specific sequences to specific documents without a priori knowledge of the latter is by using a mere brute-force approach, without paying attention to Pi values.
Let's say I query
"* estate" which discovers "real". Now let's do this
q(0) = "* estate"
q(1) = "* real estate"
Select a discovered term from the serps. I will select "era"
q(2) = "* era real estate" which discovers "Guilford"
q(3) = "* guilford era real estate"
At my end I see that I have mapped the iteration to only 3 results relevant to the "guilford era real estate" exact sequence.
What I have done here is arbitrarily select terms to do the iteration. This is what we call supervised learning, since human intervention is needed to pick the arbitrary term(s).
Now, if I make the selection based purely on number-crunching Pi values or by using a random term picker, there is no human intervention and the process becomes unsupervised learning.
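The difference between the two modes can be sketched as a single loop with a pluggable picker, assuming Python. Everything here is illustrative: `grow_sequence`, `discover`, `human_pick` and `top_pick` are hypothetical names, and the `CANNED` table is a stand-in for live results, with candidate terms borrowed from this thread and their ordering assumed.

```python
def grow_sequence(seed, discover, pick, steps=3):
    """Prepend one picked term per step, querying "* <sequence>" each time."""
    sequence, history = seed, []
    for _ in range(steps):
        query = f'"* {sequence}"'
        candidates = discover(query)
        if not candidates:
            break
        term = pick(candidates)
        history.append((query, term))
        sequence = f"{term} {sequence}"
    return sequence, history


# Canned candidates standing in for terms read off the SERPs (assumed ordering)
CANNED = {
    '"* estate"': ["real"],
    '"* real estate"': ["commercial", "era"],
    '"* era real estate"': ["guilford"],
    '"* commercial real estate"': ["investor"],
}


def discover(query):
    return CANNED.get(query, [])


def human_pick(candidates):
    """Models a person who deliberately chooses "era" whenever it is offered."""
    return "era" if "era" in candidates else candidates[0]


def top_pick(candidates):
    """Models picking the first candidate, e.g. the top of a Pi table."""
    return candidates[0]


if __name__ == "__main__":
    print(grow_sequence("estate", discover, human_pick)[0])
    print(grow_sequence("estate", discover, top_pick)[0])
```

Swapping `pick` for something like `random.choice` would give the random term picker; the loop itself does not change.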
Which of the two, supervised or unsupervised, is better for mapping sequences to specific documents? Well, in this case the outcome depends on many factors, one being the sample size. Here I limited Pi extraction to the first N ranked documents. I haven't yet tested the effect of N on either approach (supervised/unsupervised).
Another thing to consider is that Google might serve results based on end users' locations. This is not necessarily a drawback since the experiment can be made more selective by querying a specific collection like google.com.pr, google.fr, etc.
Orion
orion
04-06-2006, 04:04 PM
In the previous examples I suggested two simplistic approaches for finding on-topic sequences. Sequences were discovered by either using occurrence probabilities P(i) or by using human knowledge.
Another approach consists of pre-establishing some arbitrary selection rules and allowing the iterative process to grow a tree of sequences without additional external knowledge. This makes the growth unsupervised learning.
Selection Rules
Before proceeding any further, some observations are in order:
1. The iterative approach described here is not a replacement for brainstorming. In most cases the proposed approach can be used to enhance brainstorming sessions and keyword research activities. No more, no less. Likewise, the procedure is not a replacement for clustering analysis or a comprehensive on-topic analysis. I am just interested in providing search engine marketers with a simple mechanism for the discovery of on-topic sequences without resorting to any math arguments or semantic justifications.
2. Once discovered, embedding sequences in documents is a different story. Just because a sequence was discovered does not mean it is ready for immediate use. Sequences might need some tweaking to make them copy friendly. Furthermore, a sequence must be relevant to the central topic of a copy.
3. Considering that search engines tend to ignore delimiters and filler words (stopwords), these can be used to tweak candidate sequences. As a cardinal rule, if a sequence cannot be accommodated in a copy, it must be avoided altogether.
Having said that, let's discuss now some selection rules.
When using selection rules these must be justified and make sense.
Considering that:
(a) users are inclined to visit sites based on the snippet records (title and descriptive text) displayed by search engines.
(b) users are inclined to visit the top documents listed by search engines.
(c) unlike with IR systems, commercial search engines usually limit the length of queries to no more than ten terms per query (n=10).
(d) users tend to submit queries consisting of few terms, usually between two and four terms.
(e) search engines are designed to ignore delimiters and stopwords.
(f) users prefer editorially correct copies.
These facts can be used to propose the following selection rules in an iterative approach for growing sequences of terms:
1. select terms from the top N ranked documents, for instance from docs ranked in position one (r=1), two (r=2), three (r=3) and so forth.
2. ignore stopwords and delimiters between iterations
3. stop iterations when n = 4
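The three rules can be sketched as follows, assuming Python. This is only one possible reading of them: the term is taken from the snippet at rank r, stopwords between the new term and the current sequence are skipped, and growth stops once the sequence holds n = 4 terms. The `STOPWORDS` set, the function names and the canned snippet titles are all invented for illustration.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}  # illustrative only


def term_before(snippet, sequence):
    """Nearest non-stopword term preceding `sequence` in `snippet`, else None."""
    tokens = re.findall(r"[a-z0-9]+", snippet.lower())
    target = sequence.lower().split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            for tok in reversed(tokens[:i]):
                if tok not in STOPWORDS:  # rule 2: ignore stopwords
                    return tok
    return None


def grow_by_rules(seed, serp, r=1, n_max=4):
    """Grow a sequence using the snippet at rank r until it has n_max terms."""
    sequence, found = seed, []
    while len(sequence.split()) < n_max:  # rule 3: stop when n = 4
        snippets = serp(f'"* {sequence}"')
        if len(snippets) < r:
            break
        term = term_before(snippets[r - 1], sequence)  # rule 1: rank r
        if term is None:
            break
        sequence = f"{term} {sequence}"
        found.append(sequence)
    return found


# Invented snippet titles standing in for ranked results
FAKE_SERP = {
    '"* estate"': ["Real Estate Listings"],
    '"* real estate"': ["Commercial Real Estate News"],
    '"* commercial real estate"': ["Investor for Commercial Real Estate"],
}

if __name__ == "__main__":
    print(grow_by_rules("estate", lambda q: FAKE_SERP.get(q, [])))
```

In practice `serp` would be replaced by whatever returns the ranked snippets for a query; that part is deliberately left out here.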
The following examples illustrate such an iterative approach.
Examples
Iterative Sample I
I'm going to use "--->" to indicate the discovery of an ngram. Hence, for r=1 and the initial seed "estate" I got this, at least at my end:
q(0) = "* estate" ---> real
q(1) = "* real estate" ---> commercial
q(2) = "* commercial real estate" ---> investor commercial real estate
n = 4, then stop.
The discovered sequences are
real estate
commercial real estate
investor commercial real estate
The first two sequences can be used in a copy "as is". The last sequence can be modified, for instance, by using
investor - commercial real estate
investor for commercial real estate
etc or by using derivatives of these.
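Generating such variants is simple enough to script. A small sketch, assuming Python, where `copy_variants`, `split_at` and `fillers` are hypothetical names: it splits the sequence after the first `split_at` terms and joins the two parts with each delimiter or stopword.

```python
def copy_variants(sequence, split_at=1, fillers=("-", "for")):
    """Insert each filler (delimiter or stopword) after the first split_at terms."""
    terms = sequence.split()
    head = " ".join(terms[:split_at])
    tail = " ".join(terms[split_at:])
    return [f"{head} {filler} {tail}" for filler in fillers]


if __name__ == "__main__":
    for variant in copy_variants("investor commercial real estate"):
        print(variant)
```

Whether a given variant reads well in a copy is still an editorial call; the script only enumerates candidates.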
Iterative Sample II
At my end I got:
q(0) = "estate *" ---> agents
q(1) = "estate agents *" ---> wholesalers
q(2) = "estate * " ---> estate agents wholesalers restaurants
which clearly leads to a different subtopic. The last sequence needs some copy work.
Drawbacks
With the proposed procedure we limited the selection of new terms to the visible snippet displayed by Google. This might lead to several drawbacks. The most obvious occurs when discovered terms are found in the copy but not shown in the SERPs. In this case the document must be inspected.
Another drawback occurs when a given snippet displays more than one candidate sequence; for example, one in the title and a different one in the descriptive text. Considering that search engines assign high importance to titles, sequences in titles are preferred. However, one might elect to follow sequences growing from both, from titles and descriptive texts.
At any given iteration step, one can make a query expansion decision, effectively taking the current sequence as a new initial seed. This leads to "a spanning tree". Spanning trees are rich and complex in nature.
Spanning the Trees
To illustrate, in Sample I the sequence real estate leads to the following sub branches:
q(0) = "real * estate" ---> k2
q(0) = "real estate *" ---> k2
while the sequence commercial real estate leads to the following sub branches:
q(0) = "commercial * real estate" ---> k2
q(0) = "commercial real * estate" ---> k2
q(0) = "commercial real estate *" ---> k2
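The sub branches listed above follow a simple pattern: a placeholder is inserted after each term of the sequence in turn. A sketch of that enumeration, assuming Python (`sub_branches` is a made-up name):

```python
def sub_branches(sequence):
    """Insert a * after each term in turn, e.g. "real * estate", "real estate *"."""
    terms = sequence.split()
    return [
        '"' + " ".join(terms[:i] + ["*"] + terms[i:]) + '"'
        for i in range(1, len(terms) + 1)
    ]


if __name__ == "__main__":
    for seq in ("real estate", "commercial real estate"):
        print(seq, "->", sub_branches(seq))
```

The leading variant ("* real estate") is left out on purpose, since that is the query that produced the sequence in the first place.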
To illustrate, at my end I got
q(0) = "real * estate" ---> spring real (note in this case that k2 = spring real)
which generates the sequence
real spring real estate
Since n = 4, stop.
Again, to make it copy friendly, I could embed this sequence in the title of a document using delimiters:
Real Spring - Real Estate
Real Spring: Real Estate
I could also embed the sequence in a paragraph by ending one sentence with real and starting the next one with spring:
...real. Spring real estate...
...for real. Spring and Real Estate...
...for real. Spring's Real Estate in...
etc or by using derivatives of these.
In all these examples I have grown trees using r=1, only. A summary is given here (http://www.miislita.com/on-topic/on-topic-sequences.gif). A nice exercise consists in growing trees for different r values (e.g., r=2, r=3 and so forth) and comparing results.
Orion
orion
04-07-2006, 03:22 PM
Application to Taxes (For procrastinators out there)
At my end I got the following results (using n=4 and r=1)
q(0) "tax *" ---> relief
q(1) "tax relief *" ---> reconciliation
q(2) "tax relief reconciliation *" ---> act :: Sequence ---> tax relief reconciliation act
q(0) "* tax" ---> nationwide
q(1) "* nationwide tax" ---> irs
q(2) "* irs nationwide tax" ---> 3326 :: Sequence ---> 3326 irs nationwide tax
q(0) "income *" ---> tax
q(1) "income tax *" ---> withholding
q(2) "income tax withholding *" ---> tax :: Sequence ---> income tax withholding tax
q(0) "income * tax" ---> housing
q(1) "income * housing tax" ---> rental :: Sequence ---> income rental housing tax
q(0) "income * tax" ---> housing
q(1) "income housing * tax" ---> credits :: Sequence ---> income housing credits tax
Other generic examples
At my end I got the following results (using n=4 and r=1)
q(0) "rentals *" ---> condos
q(1) "rentals condos *" ---> cabins
q(2) "rentals condos cabins *" ---> villas :: Sequence ---> rentals condos cabins villas
q(0) "tiger *" ---> woods
q(1) "tiger woods *" ---> pga
q(2) "tiger woods pga *" ---> tour :: Sequence ---> tiger woods pga tour
q(0) "vacations *" ---> travel
q(1) "vacations travel *" ---> packages
q(2) "vacations travel packages *" ---> travel :: Sequence ---> vacations travel packages travel
q(0) "vieques *" ---> culebra
q(1) "vieques culebra *" ---> mayaguez
q(2) "vieques culebra mayaguez *" ---> vicinity :: Sequence ---> vieques culebra mayaguez vicinity
q(0) "reggaeton *" ---> music
q(1) "reggaeton music *" ---> tones
q(2) "reggaeton music tones *" ---> urban :: Sequence ---> reggaeton music tones urban
Historical Notes
1. Mayaguez is a city in Puerto Rico -well known to NASA. Vieques and Culebra -well known to the NAVY- are small islands in the vicinity and part of Puerto Rico.
2. Reggaeton originated in Puerto Rico in the mid 80's. Often mistaken for Spanish rap, Spanish reggae and urban Latin music, it is a hot rhythm with sensual steps that incorporates bomba, plena and salsa elements (native dances) with elements from reggae and hip hop to become a unique genre. Couples like to dance reggaeton in perreo (doggie) style. Since 2000 many countries in Latin America and Europe have been dancing to perreo music. Reggaeton is now all over the world. Megastars of the genre: Daddy Yankee, Ivy Queen, Don Omar and Tego. Almost 20 years of history tell us it is not just another lambada, macarena or electric slide.
Orion