Search Engine Watch
Old 03-16-2005   #1
xan
Member
 
Join Date: Feb 2005
Posts: 238
c-index

Hi again,

Orion, I would definitely recommend "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.

I didn't understand at all the c-index you were talking about, so I checked out my trusty book to find what you referred to, and I found on pages 36, 37 some very familiar calculations and diagrams.

Directly copied from these pages in the section "Fuzzy Information Retrieval":

Ci,l = Ni,l / (Ni + Nl - Ni,l)

The elusive "c" comes from the "normalised correlation factor". Ki and Kl refer to keywords, Ni is the number of documents which contain the term Ki, Nl is the number of documents which contain the term Kl, and Ni,l is the number of documents which contain both terms.

"such a correlation is quite common and has been used in clustering algorithms"

"Fuzzy set models for information retrieval have been discussed mainly in the literature dedicated to fuzzy theory and are not popular among the information retrieval community. Further, the vast majority of the experiments with fuzzy set models has considered only small collections which make comparisons difficult to make at this time."

I was quite sure that the method wasn't used in IR. You failed to understand the "algebraic sum" and the correlation matrix. Without these, the method is incomplete.

My thoughts are that you read this and tried to adapt it to SEO work, but it is used as a document similarity measure. The term-term correlation matrix is used to construct a thesaurus "whose rows and columns are associated to the index terms in the document collection". Without using this, the "c-index" is incomplete. Your diagram seems over-simplified to me, as you don't include the conjunctive components, which are binary-weighted terms found by using a boolean-type query. The measure will be between 0 and 1, of course, because it is normalized. The method is in fact simply called fuzzy information retrieval.
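
For readers following along, here is a minimal Python sketch of the two pieces mentioned above: the term-term correlation matrix and the "algebraic sum". It assumes each document is simply a set of index terms, and the names `correlation_matrix` and `membership` are illustrative only; they come neither from the book nor from Orion's work.

```python
from math import prod


def correlation_matrix(docs):
    """Normalized term-term correlation: c[i, l] = n_il / (n_i + n_l - n_il).

    `docs` is a list of sets of terms (an assumed, simplified representation).
    """
    terms = sorted(set().union(*docs))
    # postings[t] = indices of the documents that contain term t
    postings = {t: {j for j, d in enumerate(docs) if t in d} for t in terms}
    c = {}
    for ti in terms:
        for tl in terms:
            both = len(postings[ti] & postings[tl])    # n_il
            either = len(postings[ti] | postings[tl])  # n_i + n_l - n_il
            c[ti, tl] = both / either if either else 0.0
    return c


def membership(term, doc, c):
    """Fuzzy membership of `doc` in the set for `term`.

    Algebraic sum over the document's terms: 1 - prod(1 - c[term, t]).
    Every term involved must appear in the collection used to build `c`.
    """
    return 1.0 - prod(1.0 - c[term, t] for t in doc)


docs = [{"car", "engine"}, {"car", "tree"}, {"tree"}]
c = correlation_matrix(docs)
print(c["car", "engine"])                   # correlation of the two terms
print(membership("engine", docs[1], c))     # doc 1 never mentions "engine"
```

The second call shows why the correlation matrix matters: a document can get a non-zero membership for a term it does not contain, as long as it contains a correlated term.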

I am not convinced that using the fuzzy information retrieval method described in Baeza-Yates will produce meaningful results for you; however, I am open to discussion.

Modern information retrieval contents

Last edited by xan : 03-16-2005 at 01:26 PM.
Old 03-16-2005   #2
orion
 
Join Date: Jun 2004
Posts: 1,044

Xan, you are quoting from the book by Ricardo (whom I'm inviting to the conference we are putting together) and misquoting my implementation of co-occurrence. How can you criticize research I'm conducting if you are not aware of what we are doing in our research environment?

The c-index is an intersection/union ratio. The equation you quote from Ricardo's book is a special case for two and only two terms, k1 and k2. In the particular case of

1. two and only two terms
2. c12 = c21

the c-index metric reduces to the Jaccard Coefficient and the equation you misquote. In the case of N > 3 one can no longer talk about pairwise similarity; still, the c-index metric we developed can be applied.
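
To make the two-term reduction explicit, here is a short derivation. It is only a sketch: the symbols D_1 and D_2 are introduced here (they are not in the thread) for the sets of documents containing k1 and k2, with n_1 = |D_1|, n_2 = |D_2| and n_12 = |D_1 ∩ D_2|.

```latex
\begin{align*}
J(D_1, D_2) &= \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|} \\
            &= \frac{|D_1 \cap D_2|}{|D_1| + |D_2| - |D_1 \cap D_2|}
               && \text{(inclusion-exclusion)} \\
            &= \frac{n_{12}}{n_1 + n_2 - n_{12}} = c_{12}
\end{align*}
```

In this set form the ratio is symmetric in the two terms, which is consistent with condition 2 (c12 = c21).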

Before you keep criticizing our research work, you had better ask, or carefully review the subject, sir.

Orion
Old 03-16-2005   #3
xan
Member
 
Join Date: Feb 2005
Posts: 238
Orion,

I'm sorry you don't like what I'm saying, but as you may know, criticism is quite usual in these circumstances. Ricardo is an IR person, and everyone in this area of research frequents the same events and meets up. So I am very pleased you are inviting him; he will have lots of interesting things to talk about. Let me know about that, I'm interested.

I don't think I misquote you, the method is the same.

"In the case of N > 3 one can no longer talk about pairwise similarity; still, the c-index metric we developed can be applied."
This is just recursive.

I am still sorry that you don't accept criticism of your work; it's most often the best way to share it.
Old 03-16-2005   #4
orion
 
Join Date: Jun 2004
Posts: 1,044

Quote:
Originally Posted by xan
I am still sorry that you don't accept criticism of your work; it's most often the best way to share it.
Not at all, if done properly.

If Baeza and the folks at AIRWeb finally accept the invitation, you are more than welcome to join us.

Orion
Old 03-16-2005   #5
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
I'm a little confused about how the criticism is being directed. Xan, are you suggesting that SEOs should not use the C-Index Formula:

C=Z/(X+Y-Z)

Where:
X = The number of pages containing keyword 1 (your target term/phrase)
Y = The number of pages containing keyword 2 (the term/phrase you're comparing it against)
Z = The number of pages containing BOTH keyword 1 & keyword 2

If so, can you tell us why? What are the flaws and what is this actually measuring (if not semantic connectivity)? It appears very logical to me - simple, but that's what's great (at least from an SEO practical perspective).

I have certainly found outliers and results I didn't like (like car and tree which seem far too connected), but overall it appears to be a great relative measurement of how "related" a search engine might consider two words or phrases. If you can suggest a better methodology for this type of calculation, I would be very open to it.
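
As a rough illustration of the formula above, here is a minimal Python sketch that computes C = Z/(X+Y-Z) from page counts and ranks candidate terms against a target. The counts are made up for the example and are not real search results; the function name `c_index` and the variable names are illustrative only.

```python
def c_index(x, y, z):
    """C = Z / (X + Y - Z), computed from page counts.

    x: pages containing keyword 1, y: pages containing keyword 2,
    z: pages containing both. Inconsistent counts are rejected.
    """
    if min(x, y, z) < 0 or z > min(x, y):
        raise ValueError("counts must be non-negative and z <= min(x, y)")
    union = x + y - z
    return z / union if union else 0.0


# Hypothetical counts, for illustration only.
target_count = 1000                   # X for the target term "car"
candidates = {                        # term -> (Y, Z)
    "engine": (400, 200),
    "tree": (800, 50),
}

ranked = sorted(
    ((c_index(target_count, y, z), term) for term, (y, z) in candidates.items()),
    reverse=True,
)
for score, term in ranked:
    print(f"{term}: {score:.4f}")
```

The consistency check is a design choice: estimated hit counts reported by a search engine may not always satisfy Z <= min(X, Y), and in that case it seems safer to fail loudly than to return a ratio above 1.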

Just seeking some clarification - thanks!
Old 03-16-2005   #6
xan
Member
 
Join Date: Feb 2005
Posts: 238
Hi Rand!

No, I'm not suggesting you don't use it. You are right that it is very simple, but as you point out there are inaccuracies. I simply feel that the method is incomplete: it's a fuzzy method for information retrieval, so there are other calculations and main methods to follow in order to make the technique more precise. There are other methodologies, but this one is probably the most simplistic, so it may well be what you need, seeing that you are looking for which keywords are related, but I wouldn't take it as gospel. The equation is quite commonly found, but in the older literature, because tests were not good for IR. Perhaps now part of it has another use.
Old 03-16-2005   #7
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Xan,

Even if it is a rough measurement, we (as SEOs) don't have another system by which to measure connectivity, so it will probably continue to be used. I'm glad some discussion on this subject started so we could identify the strengths and weaknesses of the formula.

For a long time, many of us took PageRank as gospel, and only through greater understanding and study of the SERPs were we able to see it for what it was: as Mike says, "little green fairy dust". It's clear that C-Index has a much greater application and value, but knowing the limitations is good too.

Do you have any suggestions about how the formula could be improved to give a more accurate number? If so, that would be a great contribution.
Old 03-16-2005   #8
orion
 
Join Date: Jun 2004
Posts: 1,044

Randfish,

The c-index metric for two terms and when c12 = c21 reduces to the Jaccard Coefficient for pairwise similarity. The Jaccard coefficient itself is a very old equation, which is the one described in the literature. The c-index is a generalized equation. So far I have introduced the metric for the very simplistic case of just two terms, k1 and k2. For three terms, one must compute 4 different c-indices, three of which exhibit an additive property. For multiple-term queries (4, 5, 10 terms, etc.) the situation is far more complex. We must not confuse c-indices with mere Jaccard indices as computed from co-occurrence matrices.
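
The generalized equation is not spelled out in this thread, so the following Python sketch rests on an assumption: that the multi-term c-index keeps the intersection/union form described earlier, applied to every subset of two or more terms. Under that reading, three terms give three pairwise indices plus one three-way index, i.e. four in total. The function names and the page sets are hypothetical.

```python
from itertools import combinations


def c_index(*page_sets):
    """Intersection/union ratio over any number of page sets (assumed form)."""
    union = set().union(*page_sets)
    if not union:
        return 0.0
    common = set.intersection(*page_sets)
    return len(common) / len(union)


def all_c_indices(postings):
    """Compute the ratio for every subset of two or more terms.

    `postings` maps each term to the set of page ids that contain it.
    """
    terms = sorted(postings)
    result = {}
    for size in range(2, len(terms) + 1):
        for combo in combinations(terms, size):
            result[combo] = c_index(*(postings[t] for t in combo))
    return result


# Hypothetical page ids, for illustration only.
postings = {"k1": {1, 2, 3}, "k2": {2, 3, 4}, "k3": {3, 4, 5}}
indices = all_c_indices(postings)
for combo, value in indices.items():
    print(combo, round(value, 3))
```

This sketch does not attempt to model the additive property mentioned above; it only shows how the number of indices grows with the number of terms.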

With regard to the metric, back in the summer of 2004 in the Keyword Co-occurrence thread, I warned many not to take the metric as gospel or as a silver bullet either. The metric must be viewed as a guideline.

The c-index alone is not enough, as I have many times pointed out to you and many others. You must also do an on-topic analysis for the extraction of the corresponding data structures and the corresponding clustering analysis.

I'm planning to open a new thread soon on the c-indices, localized co-occurrence and fractal co-occurrence metrics, and on how and when to use, or not to use, computed c-indices.

As I mentioned at SES, NY, there are three types of sources of co-occurrence:

Global (databases, collections)
Local (answer sets, individual documents)
Fractal (word distributions, passage segmentation)


Orion

Last edited by orion : 03-16-2005 at 08:35 PM.
Old 03-16-2005   #9
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
Orion,

Thank you. I appreciate the details and re-cap from previous threads. I have so far only applied c-indices to two terms/phrases at a time, but it would be interesting to apply it further - perhaps even be able to check an entire document's relationships - would that be local co-occurrence or am I confused?

I have a very basic c-index tool that is almost ready for launch - I hope to have it ready for critiquing by Monday at the latest.
Old 03-16-2005   #10
orion
 
Join Date: Jun 2004
Posts: 1,044

Thanks, Rand.


Actually, there are already several tools for computing c-indices, but they use the unreliable results from the Google API.

The simplest way to compute c-indices is with just an EXCEL spreadsheet template. After that, there is no need for designing anything, in my opinion. But, hey, you are welcome to experiment.

Orion