#1
Feedback for Basic C-Index Calculation Tool
http://www.socengine.com/seo/tools/c-index-tool.php
This tool is designed to pull results from Google for the number of matches for queries and run those numbers through the C-Index formula to return the C-Index PPT (parts per thousand). The formula that is being used is:
C = (Z / (X + Y - Z)) * 1000
Where:
X = The number of pages containing keyword 1 (your target term/phrase)
Y = The number of pages containing keyword 2 (the term/phrase you're comparing it against)
Z = The number of pages containing BOTH keyword 1 & keyword 2
I'd appreciate feedback on functionality, speed, accuracy & of course, the usefulness and quality of the connectivity scale.
Known issues: If you re-submit the same information twice, you may get vastly different results. This is due to Google's different datacenters returning different numbers of results. I don't, unfortunately, have a good solution for getting around this issue, and must hope that Google will stabilize results. This is a much smaller issue for more granular searches than for those searches with millions of results.
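For readers who want to check the arithmetic by hand, here is a minimal sketch of the formula above in Python. The function name `c_index` and the sample counts are illustrative assumptions, not the tool's actual code, and the guard for an empty or inconsistent result set is my own addition.

```python
def c_index(x: int, y: int, z: int) -> float:
    """Return the C-Index in parts per thousand: C = (Z / (X + Y - Z)) * 1000."""
    if min(x, y, z) < 0:
        raise ValueError("page counts must be non-negative")
    union = x + y - z  # pages containing keyword 1 OR keyword 2
    if union <= 0:
        # Assumption: treat empty or inconsistent counts as "no connectivity".
        return 0.0
    return z / union * 1000


# Hypothetical counts, for illustration only (not real search results).
print(round(c_index(x=120_000, y=45_000, z=9_000), 1))  # -> 57.7
```

Because the three counts come from three separate queries, they can disagree with each other (the datacenter issue noted above), which is why the sketch guards against a non-positive denominator.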
#2
These quotes have been removed from the c-index thread and are reposted here for those interested in reviewing this tool.
Quote:
Quote:
#3
The new scale is:
0-5 | Little to No Semantic Connectivity
6-10 | Indeterminate Levels of Connectivity
11-20 | Possible Connectivity
21-35 | Probable Connectivity
36-60 | Definite Connectivity
61-100 | High Connectivity - Closely Related
101-150 | Very High Connectivity
151-200 | Exceptionally High Connectivity - Very Closely Related
201-400 | Extremely Related
400+ | Terms are Almost Universally Found Together
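As a rough sketch, the scale above could be applied to a computed C-Index like this. The names `SCALE`, `TOP_LABEL` and `scale_label` are hypothetical. The scale only lists whole numbers, so treating each upper bound as inclusive (5.5 falls in the 6-10 band, and exactly 400 counts as "Extremely Related") is an assumption on my part.

```python
# (upper bound in ppt, label), copied from the scale in this post.
SCALE = [
    (5, "Little to No Semantic Connectivity"),
    (10, "Indeterminate Levels of Connectivity"),
    (20, "Possible Connectivity"),
    (35, "Probable Connectivity"),
    (60, "Definite Connectivity"),
    (100, "High Connectivity - Closely Related"),
    (150, "Very High Connectivity"),
    (200, "Exceptionally High Connectivity - Very Closely Related"),
    (400, "Extremely Related"),
]
TOP_LABEL = "Terms are Almost Universally Found Together"


def scale_label(c_index_ppt: float) -> str:
    """Return the label of the first band whose upper bound covers the value."""
    for upper_bound, label in SCALE:
        if c_index_ppt <= upper_bound:
            return label
    return TOP_LABEL  # anything above 400


print(scale_label(57.7))
```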
#4
Quote:
#5
Yes it is, and I'm hoping that through feedback I can improve it. I'm basically relying on my own tests of around 100-200 different relationships, which is unreliable at best.
If you have a good idea for how to make this more relevant or accurate, I would really appreciate it. Thanks for your feedback!
#6
Rand, I believe you are a very dedicated member of this industry. Still, I think you do not understand the theory behind semantic connectivity and c-indices.
1. Your scale is analogous to a linearized scale. You cannot do that, especially with semantics, topics, concepts and the notion of relatedness.
2. In the On-Topic Analysis paper, I explained the risks of blindly computing c-indices and then drawing instant conclusions. For instance, in the case of c12 indices, when one combines two terms, one too broad and the other too specific, the c12 value is often very small. According to your scale these terms would be unconnected, not related, etc. This is why one needs to conduct the corresponding on-topic analysis. In the paper, I suggested 3 workarounds for the above problem. After that, one needs to do the corresponding on-topic and clustering analyses.
3. Given points 1 and 2, try computing with Google a c12 for Yuban coffee, Maxwell coffee or even Sanyo speakers. You will see what I mean. Compare these with the c12 values as classified in your table or with real examples. Even better, compare these with a c12-index for k1="a" and k2="b".
Last summer I warned about the blind calculation of c-indices (see the Keywords Co-Occurrence posts about this). At the office, Nacho and I often discuss that computations of c-indices or EF-ratios are trivial. The beef consists in interpreting c-indices and understanding the theory behind co-occurrence, which is far more complex. I may need to write more documentation on how to properly interpret these metric computations before more people are misled. This is why I plan to put on a series of workshop seminars and conferences on these and similar IR subjects, so SEOs can be trained properly.
One more point. Your results rely on Google. If you try the very same queries in other search engines your scale may be rendered useless. I suggest removing all references to your scales. Overall, I cannot recommend any c-index tool based on the Google API since results are unreliable. Other c-index tools report the same problems, as in http://forums.searchenginewatch.com/...ead.php?t=4267.
In the particular case of your tool, anyone who uses it will be tempted to draw conclusions from reading your table (scale). This will mislead many. Sorry, I honestly cannot recommend that anyone use your tool.
Orion
Last edited by orion : 03-18-2005 at 12:27 PM.
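The small c12 values for a broad term paired with a specific one (point 2 above) can be seen from the formula C = (Z / (X + Y - Z)) * 1000 alone. Z can never exceed the smaller of X and Y, so C can never exceed 1000 * min(X, Y) / max(X, Y), however closely the terms are related. The sketch below illustrates this bound. It is not taken from the On-Topic Analysis paper; the function names and page counts are hypothetical, and it assumes the three counts are mutually consistent.

```python
def c_index(x: int, y: int, z: int) -> float:
    """C = (Z / (X + Y - Z)) * 1000, in parts per thousand."""
    union = x + y - z
    return z / union * 1000 if union > 0 else 0.0


def c_index_ceiling(x: int, y: int) -> float:
    """Largest C-Index reachable for given X and Y (reached when Z = min(X, Y))."""
    larger = max(x, y)
    if larger <= 0:
        return 0.0
    return min(x, y) / larger * 1000


# Hypothetical counts: a very broad term and a very specific one.
broad_pages = 50_000_000
specific_pages = 20_000

ceiling = c_index_ceiling(broad_pages, specific_pages)
# Best case: every page with the specific term also contains the broad term.
best_case = c_index(broad_pages, specific_pages, z=specific_pages)
print(f"ceiling: {ceiling:.2f} ppt, best case: {best_case:.2f} ppt")
```

With these made-up counts the ceiling is well under 1 ppt, so a fixed scale would file the pair under its lowest band even when the specific term never appears without the broad one.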
#7
Orion,
I understand your criticism and I appreciate it. Do you have any suggestions for how I might improve the tool? If I understand the on-topic analysis process, it requires manual human labor and could not be achieved through a tool. Is this correct and if so, is there any way to create a valuable experience for the user?
The tool is something I built for myself, as I wanted a faster way to get the numbers than using Google itself and typing the queries 3 times, but I'd like to make it valuable to the community too. Please let me know if you have some suggestions. Thanks!
#8
Hi, rand
One recommendation is to remove the table (scale), so your visitors will not be tempted to draw conclusions from it when c-indices are computed. I'm not sure why you want to label/compartmentalize c-indices. Those labels are misleading. You can do better by not providing any label. Another recommendation is to avoid the use of the Google API altogether.
With regard to on-topic analysis, last year I designed software that does the analysis automatically. Thus, c-indices, ef-ratios, and broader, narrower and very specific terms are discovered quickly. The part that, as of today, still requires human intervention is the clustering. This may change in the future.
Orion
#9
I will look into attempting automatic on-topic analysis. I could possibly take the clustering results from an engine like clusty.com. The tool does not use Google's API - none of my tools do - it's too unreliable.
As for the table - I could remove it, but I am concerned that without context, users will have no idea of how to interpret their results... Some sort of scale seems necessary, but I understand the current one is flawed. I'll try to think up something else. Thanks Orion! I owe you a drink at the next SES
#10
Quote:
Quote:
Here is the thing:
(1) c-indices are time-dependent, so a query labeled Label 1, Label 2, ... from your table may need to be relabeled next year, month or week. Say adios to the guideline then. There is a workaround for this, based on the dynamics of co-occurrence and derived semantics. I'm working on this subject; it is not ready for prime time, but it is getting closer by the week.
(2) A given phrase has different c-index values in different search engines (they are database-specific), so any guide derived from your table can be rendered useless.
I feel that with such tables we are doing more harm than good and confusing users who want to (a) understand co-occurrence-driven tools and (b) understand how c-indices can and cannot be used. Sorry, but I cannot permit this to happen in this industry.
(3) Last but not least, the labels are far too subjective and open to all sorts of interpretation. I have seen this many times. Once one allows the dilution of new ideas and metrics, many get confused. And there is always the risk posed by those who are tempted to take cheap shots at the theory (intentional misinformation).
Quote:
Orion
Last edited by orion : 03-22-2005 at 03:01 PM. Reason: fixing typos, adding new lines.
#11
Thank you again. Your points are well made and valid.
I think that the best system for now, since the tool is only a simple model for C-Index retrieval, will be to remove the table, write a paragraph on interpreting the results and note the fuzziness of the data. I will also try to make it query the 4 major engines - Teoma, MSN, Yahoo! & Google - so that comparison numbers are available. This should help to make the connections more valid.
The on-topic analysis is somewhat beyond me at the current time, but I will be looking into it. I'm very busy now on the creation of a related terms discovery tool - not an easy task either...
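A side-by-side comparison across the four engines might look something like the sketch below. It only covers the arithmetic: fetching the counts is left out, the function names are hypothetical, and every number in `placeholder_counts` is made up purely to show the output shape, not taken from any search engine.

```python
from typing import Dict, Tuple


def c_index(x: int, y: int, z: int) -> float:
    """C = (Z / (X + Y - Z)) * 1000, in parts per thousand."""
    union = x + y - z
    return z / union * 1000 if union > 0 else 0.0


def compare_engines(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """Map each engine name to the C-Index computed from its own (X, Y, Z) counts."""
    return {engine: c_index(x, y, z) for engine, (x, y, z) in counts.items()}


# PLACEHOLDER (X, Y, Z) counts per engine: invented values, not real results.
placeholder_counts = {
    "Teoma": (200_000, 100_000, 15_000),
    "MSN": (500_000, 200_000, 20_000),
    "Yahoo!": (800_000, 300_000, 60_000),
    "Google": (1_000_000, 400_000, 50_000),
}

results = compare_engines(placeholder_counts)
for engine, value in sorted(results.items(), key=lambda item: item[1]):
    print(f"{engine:<8}{value:7.1f} ppt")

# The spread shows how far apart the engines are for the same pair of terms.
spread = max(results.values()) - min(results.values())
print(f"spread across engines: {spread:.1f} ppt")
```

Showing the spread next to the individual values keeps the database-specific nature of the numbers in front of the user instead of hiding it behind a single figure.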
#12
Orion & Nacho (and others) -
I have noticed in conducting many searches through the tool that, in general, more specific terms tend to have closer relationships with each other when the C-Index goes up, while broader terms often have high C-Index scores despite being relatively unconnected. Could a possible solution to the scale involve pulling the number of results for each term and weighting on the basis of more specific vs. broad terms?
I am loath to offer the tool with no scale whatsoever because it provides very little reference or usefulness to those who are not already familiar with C-Indices...
#13
Quote:
In the On-Topic Analysis paper I explain why:
Broader + specific terms tend to produce low c-indices
Specific + specific terms tend to produce high c-indices
Quote:
What is left now is: how many are interested in attending these seminars? Just let me know.
Orion
Last edited by orion : 03-23-2005 at 08:42 PM.