03-16-2005  #1
Member
Join Date: Feb 2005
Posts: 238

c-index
Hi again,
Orion, I would definitely recommend "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. I didn't understand the c-index you were talking about at all, so I checked my trusty book to find what you referred to, and on pages 36 and 37 I found some very familiar calculations and diagrams. Directly copied from these pages, in the section "Fuzzy Information Retrieval":

$C_{i,l} = \frac{N_{i,l}}{N_i + N_l - N_{i,l}}$

The elusive "c" comes from the "normalised correlation factor". K_i and K_l refer to keywords, N_i and N_l are the numbers of documents which contain the terms K_i and K_l respectively, and N_{i,l} is the number of documents which contain both terms.

"such a correlation is quite common and has been used in clustering algorithms"

"Fuzzy set models for information retrieval have been discussed mainly in the literature dedicated to fuzzy theory and are not popular among the information retrieval community. Further, the vast majority of the experiments with fuzzy set models has considered only small collections which make comparisons difficult to make at this time."

I was quite sure that the method wasn't used in IR. You failed to understand the "algebraic sum" and the correlation matrix. Without these, the method is incomplete. My thoughts are that you read this and tried to adapt it to SEO work, but it is used as a document similarity measure. The term-term correlation matrix is used to construct a thesaurus "whose rows and columns are associated to the index terms in the document collection". Without using this, the "c-index" is incomplete.

Your diagram seems oversimplified to me, as you don't include the conjunctive components, which are binary weighted terms found by using a boolean-type query. The measure will be between 0 and 1 of course, because it is normalized. The method is in fact simply called fuzzy information retrieval.

I am not convinced that using the fuzzy information retrieval method described in Baeza-Yates will produce meaningful results for you; however, I am open to discussion.

Modern information retrieval contents
Last edited by xan : 03-16-2005 at 01:26 PM.
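For readers following along, here is a minimal Python sketch of the two pieces mentioned in the post above: the normalised correlation factor between two keywords, and the "algebraic sum" that turns those correlations into a fuzzy degree of membership of a document for a term (one minus the product of one minus each correlation, taken over the terms in the document). The tiny collection `docs` and the function names `correlation` and `fuzzy_membership` are made up for illustration; they are not taken from the book or from anyone's tool in this thread.

```python
# Hypothetical toy collection: each document is the set of keywords it contains.
docs = [
    {"car", "engine"},
    {"car", "road"},
    {"tree", "leaf"},
    {"car", "engine", "road"},
]

def correlation(collection, term_i, term_l):
    """Normalised correlation factor: N_il / (N_i + N_l - N_il)."""
    n_i = sum(1 for d in collection if term_i in d)
    n_l = sum(1 for d in collection if term_l in d)
    n_il = sum(1 for d in collection if term_i in d and term_l in d)
    denominator = n_i + n_l - n_il
    # If neither term occurs anywhere, treat the correlation as 0.
    return n_il / denominator if denominator else 0.0

def fuzzy_membership(collection, term_i, document):
    """Algebraic sum over the document's terms: 1 - prod(1 - c_il)."""
    product = 1.0
    for term_l in document:
        product *= 1.0 - correlation(collection, term_i, term_l)
    return 1.0 - product

# "engine" does not occur in {"car", "road"}, yet the document still gets
# a non-zero membership because its terms are correlated with "engine".
print(correlation(docs, "car", "engine"))
print(fuzzy_membership(docs, "engine", {"car", "road"}))
```

Evaluating `correlation` for every pair of index terms gives the term-term correlation matrix the post refers to; a document that contains the term itself always gets membership 1, because the correlation of a term with itself is 1.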
03-16-2005  #2
Oversees: Search Technology & Relevancy
Join Date: Jun 2004
Posts: 1,044

Xan, you are taking reference from Ricardo's book (I'm inviting him to the conference we are putting together) and misquoting my implementation of co-occurrence. How can you criticize research I'm conducting if you are not aware of what we are doing in our research environment?

The c-index is an intersection/union ratio. The equation you quote from Ricardo's book is a special case for two and only two terms, k1 and k2. In the particular case of
1. two and only two terms
2. c12 = c21
the c-index metric reduces to the Jaccard Coefficient and the equation you misquote. In the case of N > 3 one can no longer talk about pairwise similarity, yet the c-index metric we developed can still be applied.

Before you keep criticizing our research work, you had better ask about or carefully review the subject, sir.

Orion
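As an illustration of the two-term special case only, the sketch below computes the intersection/union ratio directly on sets of document IDs and checks that it agrees with the count-based form of the Jaccard coefficient. The ID sets and the function name `intersection_union_ratio` are hypothetical, and nothing here attempts the multi-term generalization described in the post, which is not spelled out in this thread.

```python
def intersection_union_ratio(docs_k1, docs_k2):
    """Jaccard coefficient of two sets of document IDs."""
    union = docs_k1 | docs_k2
    if not union:
        return 0.0  # neither term occurs anywhere
    return len(docs_k1 & docs_k2) / len(union)

# Hypothetical document IDs containing k1 and k2.
docs_k1 = {1, 2, 3, 4}
docs_k2 = {3, 4, 5}

c12 = intersection_union_ratio(docs_k1, docs_k2)
c21 = intersection_union_ratio(docs_k2, docs_k1)

# Count-based form: |A and B| / (|A| + |B| - |A and B|).
both = len(docs_k1 & docs_k2)
from_counts = both / (len(docs_k1) + len(docs_k2) - both)

print(c12, c21, from_counts)
```

For plain sets the ratio is symmetric, so c12 and c21 coincide; the union size equals the two individual counts minus the overlap, which is why the set form and the count form give the same number.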
03-16-2005  #3
Member
Join Date: Feb 2005
Posts: 238

Orion,
I'm sorry you don't like what I'm saying, but as you may know, criticism is very usual in these circumstances. Ricardo is an IR person, and everyone in this area of research frequents the same events and meets up. So I am very pleased you are inviting him; he will have lots of interesting things to talk about. Let me know about that, I'm interested.

I don't think I misquote you; the method is the same.

"In the case of N > 3 one cannot longer talk about pairwise similarity, still the cindex metric we developed can be applied."

This is just recursive. I am still sorry that you don't accept criticism of your work; it's most often the best way to share it.
03-16-2005  #4
Oversees: Search Technology & Relevancy
Join Date: Jun 2004
Posts: 1,044

Quote:
If Baeza and the folks at AIRWeb finally accept the invitation, you are more than welcome to join us.

Orion

03-16-2005  #5
Member
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436

I'm a little confused about where the criticism is being directed. Xan, are you suggesting that SEOs should not use the C-Index formula:

C = Z / (X + Y - Z)

Where:
X = The number of pages containing keyword 1 (your target term/phrase)
Y = The number of pages containing keyword 2 (the term/phrase you're comparing it against)
Z = The number of pages containing BOTH keyword 1 & keyword 2

If so, can you tell us why? What are the flaws, and what is this actually measuring (if not semantic connectivity)? It appears very logical to me - simple, but that's what's great (at least from a practical SEO perspective). I have certainly found outliers and results I didn't like (like car and tree, which seem far too connected), but overall it appears to be a great relative measurement of how "related" a search engine might consider two words or phrases. If you can suggest a better methodology for this type of calculation, I would be very open to it. Just seeking some clarification - thanks!
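To make the arithmetic concrete, here is a small Python sketch of the formula as stated above, applied to page counts. The counts are invented for illustration and are not real search engine results, and the function name `c_index` and the sanity checks are assumptions of this sketch rather than part of anyone's tool.

```python
def c_index(x, y, z):
    """C = Z / (X + Y - Z) for page counts X, Y and overlap Z."""
    if min(x, y, z) < 0:
        raise ValueError("page counts must be non-negative")
    if z > min(x, y):
        # The overlap cannot exceed either individual count.
        raise ValueError("Z cannot be larger than X or Y")
    denominator = x + y - z
    return z / denominator if denominator else 0.0

# Invented counts: (pages with keyword 2, pages with both keywords),
# all compared against a target keyword found on 1000 pages.
target_pages = 1000
candidates = {
    "phrase a": (500, 100),
    "phrase b": (800, 600),
    "phrase c": (200, 10),
}

scores = {
    phrase: c_index(target_pages, y, z)
    for phrase, (y, z) in candidates.items()
}

# Rank the candidates from most to least related to the target keyword.
ranked = sorted(scores, key=scores.get, reverse=True)
for phrase in ranked:
    print(phrase, round(scores[phrase], 4))
```

With these numbers, "phrase a" scores 100 / (1000 + 500 - 100) = 100 / 1400, roughly 0.07. The score is 0 when the two keywords never share a page and 1 when every page containing one also contains the other, which is what makes it usable as a relative measure.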
03-16-2005  #6
Member
Join Date: Feb 2005
Posts: 238

Quote:
Hi Rand! No, I'm not suggesting you don't use it. You are right that it is very simple, but as you point out there are inaccuracies. I simply feel that the method is incomplete: it's a fuzzy method for information retrieval, so there are some other calculations and main methods to follow in order to make the technique more precise. There are other methodologies, but this one is probably the simplest, so it may well be what you need, seeing as you are looking for which keywords are related, but I wouldn't take it as gospel. The equation is quite commonly found, but in the older literature, because tests were not good for IR. Perhaps now part of it has another use.

03-16-2005  #7
Member
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436

Xan,
Even if it is a rough measurement, we (as SEOs) don't have another system by which to measure connectivity, so it will probably continue to be used. I'm glad some discussion on this subject started so we could identify the strengths and weaknesses of the formula. For a long time, many of us took PageRank as gospel, and only through greater understanding and study of the SERPs were we able to see it for what it was - as Mike says, "little green fairy dust". It's clear that the C-Index has a much greater application and value, but knowing the limitations is good too. Do you have any suggestions about how the formula could be improved to give a more accurate number? If so, that would be a great contribution.
03-16-2005  #8
Oversees: Search Technology & Relevancy
Join Date: Jun 2004
Posts: 1,044

Quote:
The c-index metric for two terms, when c12 = c21, reduces to the Jaccard Coefficient for pairwise similarity. The Jaccard coefficient itself is a very old equation, and it is the one described in the literature. The c-index is a generalized equation. So far I have introduced the metric only for the simplest case of just two terms, k1 and k2. For three terms, one must compute 4 different c-indices, three of which experience an additive property. For multiple-term queries (4, 5, 10 terms, etc.) the situation is far more complex. We must not confuse c-indices with mere Jaccard indices as computed from co-occurrence matrices.

With regard to the metric, back in the summer of 2004, in the Keyword Co-occurrence thread, I warned many not to take the metric as gospel or as a silver bullet either. The metric must be viewed as a guideline. The c-index alone is not enough, as I have pointed out many times to you and many others. You must also do an on-topic analysis for the extraction of the corresponding data structures, plus the corresponding clustering analysis.

I'm planning on opening a new thread soon on c-indices and on localized co-occurrence and fractal co-occurrence metrics, covering how and when to use, or not to use, computed c-indices. As I mentioned at SES, NY, there are three types of sources of co-occurrence:
Global (databases, collections)
Local (answer sets, individual documents)
Fractal (word distributions, passage segmentation)

Orion
Last edited by orion : 03-16-2005 at 08:35 PM.

03-16-2005  #9
Member
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436

Orion,
Thank you. I appreciate the details and the recap from previous threads. I have so far only applied c-indices to two terms/phrases at a time, but it would be interesting to apply them further - perhaps even to check an entire document's relationships. Would that be local co-occurrence, or am I confused? I have a very basic c-index tool that is almost ready for launch - I hope to have it ready for critiquing by Monday at the latest.
03-16-2005  #10
Oversees: Search Technology & Relevancy
Join Date: Jun 2004
Posts: 1,044

Thanks, Rand.
Actually, there are already several tools for computing c-indices, but they use the unreliable results from the Google API. The simplest way to compute c-indices is with just an Excel spreadsheet template. After that, there is no need to design anything, in my opinion. But, hey, you are welcome to experiment.

Orion

