#1
In my last post, I muddled the question I was hoping to get answered. My goal is to find a method for identifying the terms that a search engine considers conceptually "related" to a specific phrase.
For example, how would I find a list of the top 20 or 50 terms or phrases that Google considers related to the phrase "weather forecast"? As a human, I might guess that terms like meteorology, rain, storm and sunny skies are all related, but what method would let me use Google or other systems to find what the search engine considers the top 10, 20 or 50 "related" terms?
|
#2
Wordtracker counts the number of times a term appears in the meta tags of the top 200 web page results at a search engine, and uses that count to calculate how 'related' the term is to the initial phrase or word.
Do you think this method is worthwhile? Accurate?
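For anyone who wants to try the idea, here is a minimal sketch of that kind of meta tag counting. It is not Wordtracker's actual implementation: the names `MetaKeywordParser` and `meta_keyword_counts` are hypothetical, the pages are assumed to be already downloaded as HTML strings, and counting each term once per page is my own assumption.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated terms of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. bare attributes), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            for term in attr.get("content", "").split(","):
                term = term.strip().lower()
                if term:
                    self.keywords.append(term)


def meta_keyword_counts(pages):
    """Count in how many pages each meta keyword appears.

    `pages` is a list of HTML strings (assumed: the top results for a query).
    Each term is counted once per page, which is an assumption of this sketch.
    """
    counts = Counter()
    for html in pages:
        parser = MetaKeywordParser()
        parser.feed(html)
        counts.update(set(parser.keywords))
    return counts
```

Calling `meta_keyword_counts(pages).most_common(20)` would then give a rough "top 20" list, with the caveat that it reflects what page authors chose to put in their tags.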
|
#3
Quote:
https://adwords.google.com/select/KeywordSandbox
Note that they divide these into two main categories, "More Specific Keywords" on the left, and "Similar Keywords" on the right. The right is then further divided into "Expanded Broad Matches" and "additional keywords to consider." There aren't any expanded broad matches for "weather forecast," but there are a bunch of more specific and additional keywords.
I've occasionally done a fair amount of cross-checking with the Overture Tool, and though there are no search numbers assigned to the Google terms, I'm pretty sure that each of the three lists is listed in descending order of searches. I haven't really figured an easy way to collate the three lists into one.
Usually, I'll compare the Google suggestions with both Overture and Wordtracker, keeping in mind that the most frequently searched terms are probably skewed higher in Overture by automated searches, and that the least frequently searched terms are undependable in Wordtracker because of small database size, and I'll come up with a rough ordering. This is probably the best that you can do. The numbers simply aren't dependable. The Google phrases, I'm fairly certain, come from actual Google searches, and are as granular as you can get for free.
Quote:
The problem is that this isn't actual search data... it's conjectural targeting data. I've seen some pretty off the wall meta tags in my time, so I doubt you'll dependably get any real linguistic similarities... but if you keep in mind that you are being guided by your competition's guesses, not by actual search data, you might be able to come up with some suggestions that you can then analyze on Google and Overture. But it's in no way an accurate measure of relationship.
I suppose someone might argue that there's likely to be a similar distribution of phrases targeted and phrase usage in search, but I'd be skeptical about it. If that were the case, the number of pages targeting a phrase in Google should always be proportional to the number of searches containing a phrase, and that's often not true. In fact, one of Wordtracker's other tools purports to help you find good targets by finding frequently searched phrases for which there are not many competing pages. So, the methodologies are inconsistent.
Another trick you can try is to play with the tilde (~) operator in Google to look for synonyms.
|
#4
There are many tools that provide results based on searchers' behaviors, somehow associated with ads or based on search logs (e.g., the Google Keyword Tool, Overture's Term Suggestion Tool, Wordtracker, etc). However, so far these tools do not provide users with occurrence or co-occurrence data. For instance, the Google Keyword Tool does not tell users
1. which individual terms are used the most with popular queries.
2. which terms targeted by searchers are also targeted by top N ranked results.
Despite its popularity, the Keyword Tool has several limitations. The first obvious one is that term discovery is somehow conditioned by search behaviors and ads in Google. Second, some combinations of terms produce partial results or no results at all. This is understandable since the Keyword Tool was designed to improve the relevance of terms associated with ads, not to work as a general-purpose term discovery tool.
Keep in mind that engine-based tools produce terms pre-qualified by their own users and stats, so this does not always go hand in hand with semantics.
In addition to On-Topic Analysis, there are already many IR methods and techniques available for conducting this type of research:
1. Relevance Feedback
2. Terms Clustering
3. Local Feedback
4. Local Context Analysis
REFERENCES
1. G. Salton and C. Buckley; Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288-297, 1990.
2. K. Sparck Jones and D. M. Jackson; The use of automatically-obtained keyword classifications for information retrieval. Information Processing and Management, 5:175-201, 1970.
3. R. Attar and A. S. Fraenkel; Local feedback in full-text retrieval systems. Journal of the ACM, 24(3):397-417, July 1977.
4. J. Xu and W. B. Croft; Improving the Effectiveness of Information Retrieval with Local Context Analysis. http://citeseer.ist.psu.edu/cache/pa...0improving.pdf
Orion
|
#5
On a related note, does anyone know what criteria G uses for its "related:" command?
At one point I thought, you know, that it might show sites that are ummm, related or something. When I checked, I was proud (and confused) to find my SEO site in the illustrious industry of white-water rafting real estate sleep apnea agents....
A recent check actually shows that this command now seems to work pretty darn well. Enough that I can now add it to my link building strategy. It currently gives only 31(?) responses, so its value is limited, but it can be used to branch out pretty easily. I would not replace traditional link building methods with it, but it can be a good start or addition.
I don't know when this was upgraded. Whatever criteria they are now using, it seems pretty accurate.
Ian
__________________
International SEO
|
#6
Before it is taken out of context, the statement
"This is understandable since the Keyword Tool was designed to improve the relevance of terms associated with ads..."
was given in the context of selecting terms to target ads in Google. I thought its meaning might not be that clear, especially to new readers.
About the related command, Google says this at http://www.google.com/help/refinesearch.html:
""~" Searches
You may want to search not only for a particular keyword, but also for its synonyms. Indicate a search for both by placing the tilde sign ("~") immediately in front of the keyword. For example, to search for food facts as well as nutrition and cooking information, use: ~food ~facts"
End of the quote.
I hope this helps.
Orion
|
#7
Quote:
PS: The tilde operator is not related to the related: operator... (but it may be related to the "~related: ~operator").
Last edited by Robert_Charlton : 01-04-2005 at 03:00 AM. Reason: Adding PS
|
#8
I agree that command is not the right descriptor for the tilde.
The ~ search finds documents with terms having some sort of synonymity association, which was one aspect of randfish's initial post. About ~ searches Google has stated
Quote:
ANALYSIS
We found it intriguing that Google specifically mentions the terms “nutrition”, “cooking” and “information”. Why are these three terms mentioned as having a synonymity association in this example? An on-topic analysis of the query Q = ~food~facts, for terms extracted from the top N=100 titles in Google, provides the answer:
QUERY: ~FOOD~FACTS
N = 100
UNIQUE TERMS: 230
TOTAL TERMS: 445
UNIQUE TERMS/N = 2.30
TOTAL TERMS/N = 4.45
Pi (%)   TERM (i)
10.11    FOOD
 7.19    NUTRITION
 4.94    INFORMATION
 3.15    RECIPES
 2.92    HEALTH
 2.47    COOKING
 2.02    FACTS
 1.35    TIPS
 1.12    TRIVIA
 1.12    SITE
Pi is the probability distribution (each term's share of the 445 total terms, in percent).
For Google’s sample query Q = ~food~facts, out of 445 terms, the terms from Google's example rank as follows: #2 nutrition, #3 information, #6 cooking, #7 facts.
Analysis using only the top N=10, 30 or 50 titles indicates that these terms “persist” across the distribution. For example, for N=10 we obtain #1 food, #2 information, #4 facts, #5 nutrition. Now “cooking” does not show. Evidently, this term is found after the top N=10 titles (in what we consider the bulk of the search results).
EXPANDED ANALYSIS
Redefining passages as the text strings found in the anchor text plus urls of the corresponding top N titles provides even more interesting results. Standard cluster analysis applied to the final data identifies the correct tree-like data structures.
The point is that on-topic analysis, as well as many other IR techniques, correctly identifies this type of term relationship.
Orion
Last edited by orion : 01-04-2005 at 11:46 AM. Reason: refining lines, typos
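For readers who want to reproduce this kind of table, a minimal sketch of the counting step follows. It only computes each term's percentage share (Pi) over a list of titles; the function name `term_distribution`, the simple tokenizer and the lack of stopword filtering are assumptions of this sketch, not details confirmed in the post above.

```python
import re
from collections import Counter


def term_distribution(titles):
    """Rank terms from result titles by their share of all terms (Pi, in %).

    `titles` is assumed to be the list of titles of the top N results.
    Tokenization is a plain lowercase split on letters and digits, and no
    stopwords are removed; both choices are assumptions of this sketch.
    """
    tokens = []
    for title in titles:
        tokens.extend(re.findall(r"[a-z0-9]+", title.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    ranked = [(term, 100.0 * n / total) for term, n in counts.most_common()]
    return {
        "n_titles": len(titles),
        "unique_terms": len(counts),
        "total_terms": total,
        "ranked": ranked,  # list of (term, Pi in percent), highest first
    }
```

Running it on the top 10, 30, 50 and 100 titles separately is one way to check whether a term "persists" across the distribution as described above.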
|
#9
Quote:
If I were to analyze the top 25 occurring 1-, 2- and 3-word phrases (excluding stopwords) from the top 100 or 200 search results for a phrase, could that data be more valuable?
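The phrase count described in this question could be sketched roughly as below. The function name `top_phrases` and the tiny `STOPWORDS` set are illustrative assumptions, and the texts are assumed to be already extracted from the result pages. Note that stopwords are dropped before phrases are formed, so words on either side of a stopword end up adjacent; that is a design choice of this sketch, not something stated in the thread.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "is"}


def top_phrases(texts, max_len=3, top_k=25, stopwords=STOPWORDS):
    """Return the top_k most frequent 1- to max_len-word phrases.

    `texts` is assumed to hold the text of the top search results.
    Stopwords are removed first, then word n-grams are counted.
    """
    counts = Counter()
    for text in texts:
        words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
                 if w not in stopwords]
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top_k)
```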
|
#10
Quote:
term frequency in and of itself does not necessarily have a 1:1 correlation with relevancy. some of the most important words may be skipped by looking just at term frequency. in fact, words which occur too frequently are often combined into phrases to help make them better discriminators.
not sure if this is good or bad logic on my part, but you may also want to collect the top x terms which are not part of the top x phrases.
ideally, well organized search results should help pull up the documents which are somewhat rich in good discriminators, but then again a large part of relevancy is due to off the page factors. some pages accidentally rank well.
__________________
The SEO Book
Last edited by seobook : 01-06-2005 at 01:44 AM.
|
#11
Quote:
Hi, Aaron. Actually, this part of your logic is good and right on target.
Term frequency (tf, also known as crude term occurrence) is not necessarily as good as IR folks initially thought. It may work well with TREC training sets but not necessarily in a real scenario such as the commercial Web. With commercial search engines, this was quickly demonstrated back in the mid-90s, when primitive term vectors were used and abused. The tf term in term vector models is easy to exploit in those schemes (this is known as term overrepetition or keyword spamming).
In the mid-70s, models that accounted for term co-occurrence were proposed. Term co-occurrence (frequency of co-word citation in documents) performs better for concept association. By the late 90s, document concept frequency (frequency of noun phrases) was proposed. The work of Dr. Bruce Croft in this area, called LCA or local context analysis, is a good reference.
In my view, normalized co-occurrence (i.e., c-indices) performs even better than plain co-occurrence, especially when comparisons are made between dissimilar documents and dissimilar dbs.
As someone conducting research in this area I probably sound a bit biased. Still, it is perfectly ok if others want to test other approaches or even try to reinvent the wheel. After all, the scientific method is often part of that. Learning about all these topics, even about methods already proved/disproved or found right or wrong, doesn't necessarily hurt seos/sems, in my opinion.
Orion
Last edited by orion : 01-06-2005 at 12:18 PM. Reason: typos
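To make the difference between plain and normalized co-occurrence concrete, here is a small sketch. The post does not give a formula for the c-index, so the normalization used here, n12 / (n1 + n2 - n12) over document counts (a Jaccard-style ratio), is an assumption, and the function name `c_index` is hypothetical.

```python
import re


def c_index(docs, term_a, term_b):
    """Normalized co-occurrence of two single-word, lowercase terms.

    Assumed formula (not given in the thread): n_ab / (n_a + n_b - n_ab),
    where n_a and n_b are the numbers of documents containing each term
    and n_ab is the number of documents containing both.
    """
    n_a = n_b = n_ab = 0
    for doc in docs:
        tokens = set(re.findall(r"[a-z0-9]+", doc.lower()))
        has_a = term_a in tokens
        has_b = term_b in tokens
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    union = n_a + n_b - n_ab
    # If neither term occurs anywhere, report 0.0 instead of dividing by zero.
    return n_ab / union if union else 0.0
```

Because the raw co-occurrence count is divided by the number of documents that contain either term, values stay between 0 and 1 and can be compared across collections of different sizes, which plain counts cannot.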