Search Engine Watch
Old 12-29-2004   #1
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Discovery of "Related" Terms

In my last post, I muddled the question I was hoping to get answered. My goal is to find a method for discovering terms that a search engine considers conceptually "related" to a specific phrase.

For example, how would I find a list of the top 20 or 50 terms or phrases that Google considers related to the phrase "weather forecast"?

As a human, I might guess that terms like meteorology, rain, storm and sunny skies are all related, but what method would let me use Google or other systems to find what the search engine itself considers the top 10, 20 or 50 "related" terms?
Old 12-29-2004   #2
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Wordtracker counts the number of times a term appears in the meta tags of the top 200 web page results at a search engine, and uses that count to calculate how 'related' the term is to the initial word or phrase.

Do you think this method is worthwhile? Accurate?
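
For concreteness, the counting step described above could look roughly like the sketch below. It assumes the HTML of the top results has already been downloaded; the sample pages, the class `MetaKeywordParser` and the function `count_meta_terms` are made up for illustration, and this is not Wordtracker's actual code.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated terms from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.terms += [t.strip().lower() for t in content.split(",") if t.strip()]


def count_meta_terms(pages):
    """Count in how many pages each meta keyword term appears."""
    counts = Counter()
    for html in pages:
        parser = MetaKeywordParser()
        parser.feed(html)
        counts.update(set(parser.terms))  # count each term once per page
    return counts


# Hypothetical stand-ins for the HTML of the top results
pages = [
    '<html><head><meta name="keywords" content="weather forecast, rain, storm"></head></html>',
    '<html><head><meta name="keywords" content="Weather Forecast, meteorology, rain"></head></html>',
    '<html><head><meta name="description" content="no keywords here"></head></html>',
]
counts = count_meta_terms(pages)
print(counts.most_common(3))
```

Counting each term once per page is a choice made here so that a single page stuffed with one keyword cannot dominate the totals; counting raw occurrences instead would be a one-line change.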
Old 01-03-2005   #3
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Quote:
Originally Posted by randfish
For example, how would I find a list of the top 20 or 50 terms or phrases that Google considers related to the phrase "weather forecast"?
I'd use several methods. First, I recommend the Google AdWords Keyword Tool, at...
https://adwords.google.com/select/KeywordSandbox

Note that they divide these into two main categories, "More Specific Keywords" on the left, and "Similar Keywords" on the right. The right is then further divided into "Expanded Broad Matches" and "additional keywords to consider."

There aren't any expanded broad matches for "weather forecast," but there are a bunch of more specific and additional keywords.

I've done a fair amount of cross-checking with the Overture tool, and though there are no search numbers assigned to the Google terms, I'm pretty sure that each of the three lists is sorted in descending order of searches. I haven't really figured out an easy way to collate the three lists into one.

Usually, I'll compare the Google suggestions with both Overture and Wordtracker and come up with a rough ordering. In doing so, I keep in mind that the most frequently searched terms are probably skewed higher in Overture by automated searches, and that the least frequently searched terms are undependable in Wordtracker because of its small database size.

This is probably the best that you can do. The numbers simply aren't dependable. The Google phrases, I'm fairly certain, come from actual Google searches, and are as granular as you can get for free.

Quote:
Originally Posted by randfish
Wordtracker uses the method of counting the number of times a term appears in the meta tags of (the top) 200 web page results at a search engine, and uses that to calculate how 'related' the term is to the initial phrase/word.

Do you think this method is worthwhile? Accurate?
This kind of lateral searching of meta tags has been around for a long time, and there are a lot of sites that have free tools to give the same information.

The problem is that this isn't actual search data... it's conjectural targeting data. I've seen some pretty off the wall meta tags in my time, so I doubt you'll dependably get any real linguistic similarities... but if you keep in mind that you are being guided by your competition's guesses, not by actual search data, you might be able to come up with some suggestions that you can then analyze on Google and Overture. But it's in no way an accurate measure of relationship.

I suppose someone might argue that there's likely to be a similar distribution of phrases targeted and phrase usage in search, but I'd be skeptical about it. If that were the case, the number of pages targeting a phrase in Google should always be proportional to the number of searches containing a phrase, and that's often not true. In fact, one of Wordtracker's other tools purports to help you find good targets by finding frequently searched phrases for which there are not many competing pages. So, the methodologies are inconsistent.

Another trick you can try is to play with the tilde (~) operator in Google to look for synonyms.
Old 01-03-2005   #4
orion
 
 
Join Date: Jun 2004
Posts: 1,044
Some References

There are many tools that provide results based on searchers' behaviors, somehow associated with ads or based on search logs (e.g., the Google Keyword Tool, Overture's Term Suggestion Tool, Wordtracker, etc.). However, so far these tools do not provide users with occurrence or co-occurrence data. For instance, the Google Keyword Tool does not tell users:

1. which individual terms are used the most with popular queries.
2. which terms targeted by searchers are also targeted by top N ranked results.

Despite its popularity, the Keyword Tool has several limitations. The first obvious one is that term discovery is conditioned by search behaviors and ads in Google. Second, some combinations of terms produce partial results or no results at all. This is understandable, since the Keyword Tool was designed to improve the relevance of terms associated with ads, not to work as a general-purpose term discovery tool.

Keep in mind that engine-based tools produce terms pre-qualified by their own users and stats, so the results do not always go hand in hand with semantics.

In addition to On-Topic Analysis, there are already many IR methods and techniques available for conducting this type of research:


1. Relevance Feedback
2. Terms Clustering
3. Local Feedback
4. Local Context Analysis


REFERENCES

1. G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288-297, 1990.

2. K. Sparck Jones and D. M. Jackson. The use of automatically-obtained keyword classifications for information retrieval. Information Processing and Management, 5:175-201, 1970.

3. R. Attar and A. S. Fraenkel. Local feedback in full-text retrieval systems. Journal of the ACM, 24(3):397-417, July 1977.

4. J. Xu and W. B. Croft. Improving the Effectiveness of Information Retrieval with Local Context Analysis. http://citeseer.ist.psu.edu/cache/pa...0improving.pdf



Orion
Old 01-03-2005   #5
mcanerin
 
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
On a related note, does anyone know what criteria G uses for its "related:" command?

At one point I thought, you know, that it might show sites that are ummm, related or something.

When I checked, I was proud (and confused) to find my SEO site in the illustrious industry of white-water rafting real estate sleep apnea agents....

A recent check actually shows that this command now seems to work pretty darn well. Enough that I can now add it to my link building strategy.

It currently gives only 31(?) responses, so its value is limited, but it can be used to branch out pretty easily. I would not replace traditional link building methods with it, but it can be a good start or addition. I don't know when this was upgraded.

Whatever criteria they are now using, it seems pretty accurate.

Ian
__________________
International SEO
Old 01-03-2005   #6
orion
 
 
Join Date: Jun 2004
Posts: 1,044
~ searches

Before it is taken out of context, the statement

"This is understandable since the Keyword Tool was designed to improve the relevance of terms associated with ads..."

was given in the context of selecting terms to target ads in Google. I thought its meaning might not be clear, especially to new readers.

About the related command, Google says this at http://www.google.com/help/refinesearch.html:

"
" ~" Searches

You may want to search not only for a particular keyword, but also for its synonyms. Indicate a search for both by placing the tilde sign ("~") immediately in front of the keyword.

For example, to search for food facts as well as nutrition and cooking information, use:

~food ~facts

"

end of the quote.


I hope this helps.


Orion
Old 01-04-2005   #7
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Quote:
Originally Posted by mcanerin
On a related note, does anyone know what criteria G uses for its "related:" command?
I believe it shows sites that are linked to by a site that also links to you.

PS: The tilde operator is not related to the related: operator...
(but it may be related to the "~related: ~operator". )

Last edited by Robert_Charlton : 01-04-2005 at 02:00 AM. Reason: Adding PS
Old 01-04-2005   #8
orion
 
 
Join Date: Jun 2004
Posts: 1,044
Exp results

I agree that command is not the right descriptor for the tilde.

The ~ search finds documents with terms having some sort of synonymity association, which was one aspect of randfish's initial post.

About ~ searches, Google has stated:

Quote:
Originally Posted by Google
You may want to search not only for a particular keyword, but also for its synonyms....
Since the search is done using the default mode, the queried and related terms can be anywhere in the document, i.e., links, body, titles, etc.

ANALYSIS

We found it intriguing that Google specifically mentions the terms “nutrition”, “cooking” and “information”. Why are these three terms mentioned as having a synonymity association in this example?

An on-topic analysis for the query Q = ~food~facts, using terms extracted from the top N=100 titles in Google, provides the answer:


UNIQUE TERMS: 230; TOTAL TERMS: 445; N=100; QUERY: ~FOOD~FACTS
UNIQUE TERMS/N = 2.30; TOTAL TERMS/N = 4.45

Pi (%) TERM (i)
10.11 FOOD
7.19 NUTRITION
4.94 INFORMATION
3.15 RECIPES
2.92 HEALTH
2.47 COOKING
2.02 FACTS
1.35 TIPS
1.12 TRIVIA
1.12 SITE

Pi is the probability of term i, i.e., its share of the 445 total terms, shown as a percentage. For Google’s sample query Q = ~food~facts, the terms Google mentions rank as follows:

#2 nutrition
#3 information
#6 cooking
#7 facts

Analysis using only the top N=10, 30 or 50 titles indicates that these terms “persist” across the distribution. For example, for N=10 we obtain:

#1 food
#2 information
#4 facts
#5 nutrition

Now “cooking” does not appear. Evidently, this term is first found after the top N=10 titles (in what we consider the bulk of the search results).
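
For readers who want to try this kind of count themselves, here is a minimal sketch of how a Pi table can be computed from a list of titles. It is only an illustration: the titles are made up rather than taken from Google, the stopword list is a tiny placeholder, and the function name `term_distribution` is hypothetical, not the tool used for the numbers above.

```python
import re
from collections import Counter

# Tiny placeholder stopword list; a real analysis would use a fuller one
STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


def term_distribution(titles):
    """Return (term, Pi%) pairs sorted by descending share of all terms."""
    terms = []
    for title in titles:
        words = re.findall(r"[a-z0-9]+", title.lower())
        terms += [w for w in words if w not in STOPWORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    # sort by count (descending), then alphabetically so ties are stable
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, round(100.0 * n / total, 2)) for term, n in ranked]


# Made-up titles, not the actual Google results
titles = [
    "Food Facts and Nutrition Information",
    "Nutrition Facts: Food Information",
    "Cooking Tips and Food Trivia",
    "Food Recipes",
]

for term, pi in term_distribution(titles)[:5]:
    print(f"{pi:6.2f}  {term.upper()}")
```

Calling `term_distribution(titles[:2])` restricts the count to the first two titles, which is the same idea as repeating the analysis for a smaller N: in this toy data "cooking" drops out because it only occurs further down the list.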

EXPANDED ANALYSIS

Redefining passages as the text strings found in the anchor text plus URLs of the corresponding top N titles provides even more interesting results. Standard cluster analysis applied to the final data identifies the correct tree-like data structures. The point is that on-topic analysis, as well as many other IR techniques, correctly identifies this type of term relationship.


Orion

Last edited by orion : 01-04-2005 at 10:46 AM. Reason: refining lines, typos
Old 01-04-2005   #9
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Quote:
Originally Posted by Robert_Charlton
The problem is that this isn't actual search data... it's conjectural targeting data. I've seen some pretty off the wall meta tags in my time, so I doubt you'll dependably get any real linguistic similarities... but if you keep in mind that you are being guided by your competition's guesses, not by actual search data, you might be able to come up with some suggestions that you can then analyze on Google and Overture. But it's in no way an accurate measure of relationship.
Do you think it would be worthwhile to do an analysis of the on-page text in this manner to find "related" terms, rather than just the meta tags?

If I were to analyze the top 25 occurring 1-, 2- and 3-word phrases (excluding stopwords) from the top 100 or 200 search results for a phrase, could that data be more valuable?
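
To make the idea concrete, here is a rough sketch of the phrase count I have in mind. It assumes the visible text of each result page has already been extracted; the sample documents, the short stopword list and the function name `top_phrases` are placeholders. "Excluding stopwords" is interpreted here as dropping any phrase that starts or ends with a stopword, which is only one possible reading.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}


def top_phrases(documents, max_words=3, top_k=25):
    """Return the top_k most frequent phrases for each length 1..max_words."""
    counts = {n: Counter() for n in range(1, max_words + 1)}
    for text in documents:
        words = re.findall(r"[a-z0-9']+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # skip phrases that start or end with a stopword
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[n][" ".join(gram)] += 1
    return {n: c.most_common(top_k) for n, c in counts.items()}


# Hypothetical page texts standing in for the top search results
docs = [
    "The weather forecast for Seattle is rain.",
    "Local weather forecast: chance of rain and storm warnings.",
    "Weather forecast and storm tracking.",
]
result = top_phrases(docs)
print(result[2][:3])
```

Keeping stopwords inside a phrase lets something like "chance of rain" survive as a 3-word phrase, while bare stopwords and fragments such as "of rain" are dropped.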
Old 01-06-2005   #10
seobook
I'm blogging this
 
Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
Quote:
Originally Posted by randfish
Do you think it would be worthwhile to do an analysis of the on-page text in this manner to find "related" terms, rather than just the meta tags?

If I were to analyze the top 25 occurring 1-, 2- and 3-word phrases (excluding stopwords) from the top 100 or 200 search results for a phrase, could that data be more valuable?
not fully my cup of tea here, but some of the words within that group may have a negative discrimination value and may not help you figure out the focus of a particular page or the relations between the pages.

term frequency in and of itself is not necessarily a 1:1 correlation with relevancy. some of the most important words may be skipped by looking just at term frequency. in fact, words which occur too frequently are often combined into phrases to help make them better discriminators.

not sure if this is good or bad logic on my part, but you may also want to collect the top x terms which are not part of the top x phrases.

ideally, well-organized search results should help pull up the documents which are somewhat rich in good discriminators, but then again a large part of relevancy is due to off-the-page factors. some pages accidentally rank well.
__________________
The SEO Book

Last edited by seobook : 01-06-2005 at 12:44 AM.
Old 01-06-2005   #11
orion
 
 
Join Date: Jun 2004
Posts: 1,044

Quote:
term frequency in and of itself is not necessarily a 1:1 correlation with relevancy. some of the most important words may be skipped by looking just at term frequency. in fact, words which occur too frequently are often combined into phrases to help make them better discriminators.

not sure if this is good or bad logic on my part

Hi, Aaron.

Actually, this part of your logic is good and right on target.

Term frequency (tf, also known as crude term occurrence) is not necessarily as good as IR folks initially thought. It may work well with TREC training sets, but not necessarily in a real-world scenario such as the commercial Web.

With commercial search engines, this was quickly demonstrated back in the mid-'90s, when primitive term vectors were used and abused. The tf term in term vector models is easy to exploit in those schemes (this is known as term overrepetition or keyword spamming).

In the mid-'70s, models that accounted for term co-occurrence were proposed. Term co-occurrence (the frequency of co-word citation in documents) performs better for concept association.

By the late '90s, document concept frequency (the frequency of noun phrases) was proposed. The work of Dr. Bruce Croft in this area, called LCA or local context analysis, is a good reference.

In my view, normalized co-occurrence (i.e., c-indices) performs even better than plain co-occurrence, especially when comparisons are made between dissimilar documents and dissimilar databases. As someone conducting research in this area, I probably sound a bit biased.
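
As a small illustration of the difference, the sketch below contrasts raw co-occurrence counts with a normalized version. The normalization used is the Jaccard-style ratio n12 / (n1 + n2 - n12), where n1 and n2 are the numbers of documents containing each term and n12 is the number containing both. Whether that is exactly the c-index mentioned above is an assumption made for this example, and the toy corpus and function names are made up.

```python
def doc_sets(documents):
    """Turn each document into the set of lowercase words it contains."""
    return [set(doc.lower().split()) for doc in documents]


def cooccurrence(term1, term2, documents):
    """Return (n1, n2, n12): document counts for each term and for both together."""
    sets = doc_sets(documents)
    n1 = sum(term1 in s for s in sets)
    n2 = sum(term2 in s for s in sets)
    n12 = sum(term1 in s and term2 in s for s in sets)
    return n1, n2, n12


def normalized_cooccurrence(term1, term2, documents):
    """Jaccard-style ratio n12 / (n1 + n2 - n12); 0.0 if neither term occurs."""
    n1, n2, n12 = cooccurrence(term1, term2, documents)
    union = n1 + n2 - n12
    return n12 / union if union else 0.0


# Toy corpus, one short "document" per string (assumption for illustration)
docs = [
    "food nutrition facts",
    "food recipes",
    "food cooking tips",
    "nutrition facts label",
    "food trivia",
]

for pair in [("food", "nutrition"), ("nutrition", "facts")]:
    print(pair, cooccurrence(*pair, docs), normalized_cooccurrence(*pair, docs))
```

Because the ratio divides by the number of documents containing either term, a very common word such as "food" is not rewarded simply for appearing everywhere, which is the kind of effect that matters when comparing dissimilar collections.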

Still, it is perfectly OK if others want to test other approaches or even try to reinvent the wheel. After all, the scientific method is often part of that. Learning about all these topics, even about methods already proved or disproved, doesn't necessarily hurt SEOs/SEMs, in my opinion.

Orion

Last edited by orion : 01-06-2005 at 11:18 AM. Reason: typos