Search Engine Watch Forums > Member's Lounge > Beta Test

  #1  
03-17-2005
randfish
Member
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Feedback for Basic C-Index Calculation Tool

http://www.socengine.com/seo/tools/c-index-tool.php

This tool pulls the number of matching results from Google for each query and runs those numbers through the C-Index formula to return the C-Index in parts per thousand (PPT).

The formula that is being used is:

C=(Z/(X+Y-Z))*1000

Where:
X = The number of pages containing keyword 1 (your target term/phrase)
Y = The number of pages containing keyword 2 (the term/phrase you're comparing it against)
Z = The number of pages containing BOTH keyword 1 & keyword 2
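
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `c_index` and the sample counts are hypothetical and are not taken from the tool itself; the zero-division guard is also an assumption about how an empty result set should be handled.

```python
def c_index(x, y, z):
    """Return the C-Index in parts per thousand: C = (Z / (X + Y - Z)) * 1000.

    x: pages containing keyword 1
    y: pages containing keyword 2
    z: pages containing both keywords
    """
    if min(x, y, z) < 0:
        raise ValueError("result counts must be non-negative")
    # X + Y - Z counts pages containing keyword 1 OR keyword 2.
    # Reported counts are estimates, so z > min(x, y) is not treated as an error.
    union = x + y - z
    if union <= 0:
        return 0.0  # assumption: no usable pages means no measurable co-occurrence
    return 1000 * z / union


# Hypothetical counts, for illustration only.
print(round(c_index(x=12_000, y=8_000, z=500), 2))  # prints 25.64
```

Note that the result can only be as trustworthy as the three counts fed into it.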

I'd appreciate feedback on functionality, speed, accuracy & of course, the usefulness and quality of the connectivity scale.

Known issues:
If you submit the same query twice, you may get vastly different results. This is because Google's different datacenters return different numbers of results. Unfortunately, I don't have a good solution for getting around this, and must hope that Google will stabilize its results. The problem is much smaller for granular searches than for searches with millions of results.

#2
03-17-2005
orion
Join Date: Jun 2004
Posts: 1,044

These quotes have been removed from the c-index thread and are reposted here for those interested in reviewing this tool.


Quote:
Originally Posted by Orion
Actually, your scale is


Interpreting the C-Index Score
0 - 10 Little to No Semantic Connectivity
11 - 25 Indeterminate Levels of Connectivity
25 - 50 Low Semantic Connectivity
51 - 75 Some Semantic Connectivity
76 - 100 Moderate Connectivity
101 - 150 High Connectivity
151 - 200 Exceptionally High Connectivity
200+ Practically

In my opinion, this is by far the most misleading semantic connectivity scale I have ever read.

Orion

PS. I'm sure you will rethink the whole thing. The last thing we want is to start misleading people or diluting the concepts. I'm sure that is not your intention either. The way it is now, I cannot recommend your tool.


Quote:
Originally Posted by Randfish
Orion,

As long as it's calculating properly, the scale can easily be tweaked. Do you have a recommendation on what you would use?

"Practically": that's just a mistake. As I said, it's not ready for prime time. I do appreciate your assistance. I will re-create the post in beta test when it's a little more ready.

#3
03-17-2005
randfish

The new scale is:

0-5 | Little to No Semantic Connectivity
6-10 | Indeterminate Levels of Connectivity
11-20 | Possible Connectivity
21-35 | Probable Connectivity
36-60 | Definite Connectivity
61-100 | High Connectivity - Closely Related
101-150 | Very High Connectivity
151-200 | Exceptionally High Connectivity - Very Closely Related
201-400 | Extremely Related
400+ | Terms are Almost Universally Found Together
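
Purely to show how a table like this maps a score to a label, here is a small Python sketch. The name `scale_label` is hypothetical, and two choices are assumptions the table does not settle: fractional scores fall into the next band up (5.5 lands in 6-10), and a score of exactly 400 is placed in the 201-400 band.

```python
from bisect import bisect_left

# Inclusive upper bounds of each band in the table; the final label has no upper bound.
UPPER_BOUNDS = [5, 10, 20, 35, 60, 100, 150, 200, 400]
LABELS = [
    "Little to No Semantic Connectivity",
    "Indeterminate Levels of Connectivity",
    "Possible Connectivity",
    "Probable Connectivity",
    "Definite Connectivity",
    "High Connectivity - Closely Related",
    "Very High Connectivity",
    "Exceptionally High Connectivity - Very Closely Related",
    "Extremely Related",
    "Terms are Almost Universally Found Together",
]


def scale_label(score):
    """Return the table label for a C-Index score given in parts per thousand."""
    if score < 0:
        raise ValueError("a C-Index score cannot be negative")
    # bisect_left finds the first band whose upper bound is >= score.
    return LABELS[bisect_left(UPPER_BOUNDS, score)]


print(scale_label(25.64))  # prints Probable Connectivity
```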

#4
03-17-2005
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382

Quote:
Originally Posted by randfish
The new scale is:

0-5 | Little to No Semantic Connectivity
6-10 | Indeterminate Levels of Connectivity
11-20 | Possible Connectivity
21-35 | Probable Connectivity
36-60 | Definite Connectivity
61-100 | High Connectivity - Closely Related
101-150 | Very High Connectivity
151-200 | Exceptionally High Connectivity - Very Closely Related
201-400 | Extremely Related
400+ | Terms are Almost Universally Found Together
Is this the right approach to define semantic connectivity or semantic associations? Don't you think it's a bit too subjective?

#5
03-17-2005
randfish

Yes it is, and I'm hoping that through feedback I can improve it. I'm basically relying on my own tests of around 100-200 different relationships, which is unreliable at best.

If you have a good idea for how to make this more relevant or accurate, I would really appreciate it. Thanks for your feedback!

#6
03-17-2005
orion

Rand, I believe you are a very dedicated member of this industry. Still, I think you do not yet understand the theory behind semantic connectivity and c-indices.

1. Your scale is analogous to a linearized-type scale. You cannot do that, especially with semantics, topics, concepts and the notion of relatedness.

2. In the On-Topic Analysis paper, I explained the risks of blindly computing c-indices and then drawing instant conclusions. For instance, in the case of c12 indices, when one combines two terms, one too broad and the other too specific, the c12 value is often very small. According to your scale these will be unconnected, not related, etc. This is why one needs to conduct the corresponding on-topic analysis. In the paper, I suggested three workarounds for the above problem. After that, one needs to do the corresponding on-topic and clustering analyses.

3. Because of points 1 and 2, try computing with Google a c12 for Yuban coffee, Maxwell coffee or even Sanyo speakers. You will see what I mean. Compare these with the c12 values as classified in your table or with real examples. Even better, compare these with a c12-index for k1="a" and k2="b". Last summer I warned about blind calculations of c-indices (see the Keywords Co-Occurrence posts about this).
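
One way to see the arithmetic behind point 2 is the short Python sketch below. It is not taken from the On-Topic Analysis paper; the function names and counts are hypothetical. Because Z can never exceed the smaller of X and Y, the formula C=(Z/(X+Y-Z))*1000 is capped at 1000 * min(X, Y) / max(X, Y), so a broad term paired with a specific one scores low even when nearly every page about the specific term mentions the broad one.

```python
def c_index(x, y, z):
    """C-Index in parts per thousand: C = (Z / (X + Y - Z)) * 1000."""
    return 1000 * z / (x + y - z)


def c_index_ceiling(x, y):
    """Largest C-Index the formula allows for counts x and y.

    The ceiling is reached when z == min(x, y), i.e. every page containing
    the rarer term also contains the more common one.
    """
    return 1000 * min(x, y) / max(x, y)


# Hypothetical counts: a broad term, a specific term, and their overlap.
broad, specific = 5_000_000, 20_000
both = 18_000  # 90% of the specific term's pages also contain the broad term

print(round(c_index(broad, specific, both), 2))  # prints 3.6
print(c_index_ceiling(broad, specific))          # prints 4.0
```

Under these assumed counts, two terms that co-occur on 90% of the specific term's pages still cannot score above 4 parts per thousand.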

At the office, Nacho and I often discuss that computations of c-indices or EF-ratios are trivial. The beef is in interpreting c-indices and understanding the theory behind co-occurrence, which is far more complex.

I may need to write more documentation on how to interpret these metric computations properly before more people get misled. This is why I plan to put on a series of workshop seminars and conferences on these and similar IR subjects, so SEOs can be trained properly.

One more point. Your results rely on Google. If you try the very same queries in other search engines, your scale may be rendered useless. I suggest removing all references to your scales.

Overall, I cannot recommend any c-index tool based on the Google API since results are unreliable. Other c-index tools report the same problems, as in http://forums.searchenginewatch.com/...ead.php?t=4267.

In the particular case of your tool, when someone uses it, he or she will be tempted to draw conclusions from reading your table (scale). This will mislead many. Sorry, I honestly cannot recommend that anyone use your tool.


Orion

Last edited by orion : 03-18-2005 at 12:27 PM.

#7
03-18-2005
randfish

Orion,

I understand your criticism and I appreciate it. Do you have any suggestions for how I might improve the tool? If I understand the on-topic analysis process, it requires manual human labor and could not be achieved through a tool. Is this correct and if so, is there any way to create a valuable experience for the user? The tool is something I built for myself, as I wanted a faster way to get the numbers than using Google itself and typing the queries 3 times, but I'd like to make it valuable to the community too.

Please let me know if you have some suggestions. Thanks!

#8
03-18-2005
orion

Hi, rand

One recommendation is to remove the table (scale), so your visitors will not be tempted to draw conclusions based on it when c-indices are computed. I'm not sure why you want to label/compartmentalize c-indices. Those labels are misleading. You can do better by not providing any label.

Another recommendation is to avoid the use of the Google API altogether.

With regard to on-topic analysis, last year I designed software that does the analysis automatically. Thus, c-indices, ef-ratios, and broader, narrower and very specific terms are discovered quickly. The part that still requires human intervention, as of today, is the clustering. This may change in the future.


Orion

#9
03-19-2005
randfish

I will look into attempting automatic on-topic analysis. I could possibly take the clustering results from an engine like clusty.com. The tool does not use Google's API - none of my tools do - it's too unreliable.

As for the table - I could remove it, but I am concerned that without context, users will have no idea of how to interpret their results... Some sort of scale seems necessary, but I understand the current one is flawed. I'll try to think up something else.

Thanks Orion! I owe you a drink at the next SES

#10
03-22-2005
orion

Quote:
Originally Posted by randfish
I will look into attempting automatic on-topic analysis. I could possibly take the clustering results from an engine like clusty.com.
In my view, analyses should be conducted using data extracted from the same database to avoid combining metrics from dissimilar search engine databases. I normally do this by querying a given system, calculating the c-indices, then generating the corresponding on-topic analysis for the initial seeds. From this raw material we conduct the clustering analysis. It makes no sense to do the clustering from Clusty if we start the experiment with Google.

Quote:
As for the table - I could remove it, but I am concerned that without context, users will have no idea of how to interpret their results... Some sort of scale seems necessary, ...
Sorry, rand, but I cannot subscribe to this. The table leads your site's users, especially new visitors, to try to make a connection between c-indices and what is or is not a semantically connected phrase.

Here is the thing:

(1) c-indices are time-dependent, thus a query labeled with Label 1, Label 2, ... from your table may need to be relabeled next year, month or week. Say adios then to the guideline. There is a workaround for this, based on the dynamics of co-occurrence and derived semantics. I'm working on this subject; it is not ready for prime time, but it is getting closer by the week.

(2) a given phrase has different c-index values in different search engines (they are database-specific), so any guide derived from your table can be rendered useless.

I feel that with such tables we are doing more harm than good and confusing users who want to (a) understand co-occurrence-driven tools and (b) understand how c-indices can and cannot be used. Sorry, but I cannot permit this to happen in this industry. (c) Last but not least, the labels are far too subjective and open to all sorts of interpretations.

I have seen this many times. Once one allows the dilution of new ideas and metrics, many get confused. And there are always the risks posed by those that are tempted to deliver cheap shots to the theory (intentional misinformation).

Quote:
Thanks Orion! I owe you a drink at the next SES
I cheer for that. I also owe you a drink, my good friend.


Orion

Last edited by orion : 03-22-2005 at 03:01 PM. Reason: fixing typos,adding new lines.

#11
03-22-2005
randfish

Thank you again. Your points are well made and valid.

I think that the best system for now, since the tool is only a simple model for C-Index retrieval, will be to remove the table, write a paragraph on interpreting the results and note the fuzziness of the data.

I will also try to make it query the 4 major engines - Teoma, MSN, Yahoo! & Google so that comparison numbers are available. This should help to make the connections more valid.
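
A side-by-side comparison of that kind might be sketched as follows. This is only an illustration: the function names are hypothetical, the counts are made up, and no engine is actually queried. Each engine's C-Index is computed strictly from that engine's own three counts, so numbers from different databases are never mixed inside one calculation.

```python
def c_index(x, y, z):
    """C-Index in parts per thousand: C = (Z / (X + Y - Z)) * 1000."""
    union = x + y - z
    return 1000 * z / union if union > 0 else 0.0


def compare_engines(counts_by_engine):
    """Map each engine name to the C-Index computed from its own (x, y, z) counts."""
    return {
        engine: round(c_index(x, y, z), 1)
        for engine, (x, y, z) in counts_by_engine.items()
    }


# Made-up result counts (keyword 1, keyword 2, both), for illustration only.
sample_counts = {
    "Google": (120_000, 45_000, 3_000),
    "Yahoo!": (98_000, 51_000, 2_100),
    "MSN": (40_000, 12_000, 900),
    "Teoma": (15_000, 6_000, 250),
}

for engine, score in compare_engines(sample_counts).items():
    print(f"{engine:8s} {score:6.1f}")
```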

The on-topic analysis is somewhat beyond me at the current time, but I will be looking into it. I'm very busy now on the creation of a related terms discovery tool - not an easy task either...

#12
03-23-2005
randfish

Orion & Nacho (and others) -

I have noticed in conducting many searches through the tool that in general, more specific terms tend to have closer relationships with each other when the C-Index goes up, while broader terms often have C-Index scores, despite being relatively un-connected. Could a possible solution to the scale involve pulling the number of results for each term and weighting on the basis of more specific vs. broad terms?

I am loath to offer the tool with no scale whatsoever because it provides very little reference or usefulness to those who are not already familiar with C-Indices...

#13
03-23-2005
orion

Quote:
Originally Posted by randfish
Orion & Nacho (and others) -

I have noticed in conducting many searches through the tool that in general, more specific terms tend to have closer relationships with each other when the C-Index goes up, while broader terms often have C-Index scores, despite being relatively un-connected. Could a possible solution to the scale involve pulling the number of results for each term and weighting on the basis of more specific vs. broad terms?

In the On-Topic Analysis paper I explain why:

Broader + specific terms tend to produce low c-indices
Specific + specific terms tend to produce high c-indices


Quote:
I am loath to offer the tool with no scale whatsoever because it provides very little reference or usefulness to those who are not already familiar with C-Indices...
The solution to this is education. That's why I'm providing training workshops on c-indices, co-occurrence and cutting-edge semantics topics in general. I figure that rushing to provide tools, or merely posting about c-indices in discussion forums, will not reach the intended goal.

What is left now is: how many are interested in attending these seminars? Just let me know.

Orion

Last edited by orion : 03-23-2005 at 08:42 PM.