Search Engine Watch Forums > General Search Issues > Search Technology & Relevancy
Old 06-11-2004   #81
Anthony Parsons
Rubbing the shine of the knobs who think they're better than everyone else...
 
Join Date: Jun 2004
Location: Melbourne Australia
Posts: 478
Quote:
Originally Posted by rankforsales
" Unless everybody decide to keep the results for himself "

-Would anybody in this great SEO community ever do that?

I guess some might.... I know I won't. If we start implementing this on a test site (read: a dummy URL), I will make our findings public, both on this board and in my articles and in our newsletter..... deal?
Somehow, Serge, I don't think we have the right to freely publish what comes of this board in our newsletters and such, as it belongs to SEW once posted. A quote or snippet, yer sure, with a link to the source most definitely.
Old 06-11-2004   #82
orion

Join Date: Jun 2004
Posts: 1,044

Sorry, but I could not resist. I forgot to mention in my last post that the example, calculations and conclusions presented in the WebProWorld thread at

http://www.webproworld.com/viewtopic.php?t=21161

are simply incorrect. The thread's originator, without defining the query mode conditions (horror!), presents the case of

k1=dog
k2=canine
k12=dog canine

and compares it with

k1=dog
k2=pooch
k12=dog pooch

When comparing, he uses THE SAME n12 results for both calculations (i.e., n12 = 999,000; another horror!) and then ends up with the wrong c12-indices. He then draws wrong conclusions. This started a meaningless discussion in that thread, with all sorts of wrong formula proposals and speculations.

We did a search in Google in FIND ALL mode today, and this is what we found, in parts per thousand (ppt):

k1=dog=53,300,000
k2=canine=1,890,000
k12=dog canine=1,020,000
c12-index=18.83 ppt

k1=dog=53,300,000
k2=pooch=268,000
k12=dog pooch=129,000
c12-index=2.41 ppt

Clearly, in Google, dog canine shows a greater degree of connectivity than dog pooch, right?

Since 18.83 >>> 2.41, his conclusions cannot be sustained.
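For anyone who wants to replicate the arithmetic, here is a minimal Python sketch of the c12-index as used above (an intersection/union ratio scaled to parts per thousand). The counts are the Google counts quoted above and will of course drift over time:

```python
def c_index_ppt(n1, n2, n12):
    """c12-index: fraction of results in which both terms co-occur,
    computed as intersection/union and scaled to parts per thousand."""
    return 1000.0 * n12 / (n1 + n2 - n12)

# dog / canine counts quoted above
print(round(c_index_ppt(53_300_000, 1_890_000, 1_020_000), 2))  # 18.83 ppt

# dog / pooch counts quoted above
print(round(c_index_ppt(53_300_000, 268_000, 129_000), 2))      # 2.41 ppt
```

The same function works for any two-term query pair, as long as the three counts come from the same query mode.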

Certainly, c-indices will change over time, but the overall trend, as obtained from a time series analysis we have for the above terms, suggests a clear separation of the semantic connectivity trends in each case.

I don't want to sound harsh. That's simply not my style, but

pleeeeaaassssee

Before using a theory or debating about a theory, learn the basics first.


Wait for my summary, please.


Orion
Old 06-11-2004   #83
detlev
Member

Join Date: Jun 2004
Posts: 48
c12-index is a measuring device

Hello everyone,

I would like to put all this in perspective as I understand it.

According to the figures pointed out by Orion above, there is much stronger relatedness between dog canine than between dog pooch in Google. That sounds correct, but it does not mean SEOs should run out and use dog canine in their copy if dog pooch fits the copywriting style of the website. It may vaguely imply, but does not prove, that more people search for dog canine than for dog pooch, and it does not mean you will rank better in Google by targeting dog canine over dog pooch as a keyword phrase, when the opposite might well be true.

It means we can measure the relatedness of dog canine versus dog pooch in the Google index. What this means for SEOs is still unclear, and I am waiting for Orion to present some thoughts about what it means for copywriting style and SEO. I hope this helps those who are looking for an SEO magic bullet here. If one exists, it has not yet been presented.

*cheers*
-detlev
Old 06-11-2004   #84
orion

Join Date: Jun 2004
Posts: 1,044

To detlev:

1. "I would like to put all this in perspective as I understand it. According to the figures pointed out by Orion above, there is much stronger relatedness between dog canine versus dog pooch in Google." Precisely.
2. "That sounds very correct but it does not mean SEOs should run out and use dog canine in their copy if dog pooch is the copywriting style of the Website." Precisely, it doesn't mean that. I haven't yet discussed copy style.
3. "I hope this helps some who are looking for a SEO magic bullet here." Precisely.
4. "If one exists it has not yet been presented." Precisely and well put.

Let's keep everything in perspective, as detlev has stated. As I've mentioned before, don't look for instant gratification here. Anyone looking for magic bullets for rankings is reading the wrong thread.

Orion
Old 06-12-2004   #85
orion

Join Date: Jun 2004
Posts: 1,044

THREAD SUMMARY

DEFINITION OF A C-INDEX

Let n1 be the # of search results containing a term k1 and n2 be the # of search results containing a different term k2. Let's assume a query consisting of k1 and k2 produces no results containing both k1 and k2; i.e., n12 = 0. Thus n1 and n2 can be taken as two mutually exclusive events. In terms of Venn diagrams (fuzzy set theory), this can be represented by two non-overlapping circles of total area

Atotal = n1 + n2

I use the notion of "circle areas" for visualization purposes only. Now, if there is overlap (an intersection of events; the two circles share a common region), we are describing a case where a query for k12 (k1 followed by k2) yields n12 documents in which k1 and k2 co-occur. Thus n12 > 0, and the total area occupied by the circles is

Atotal = n1 + n2 - n12

The overlapping fraction is simply a ratio I elect to call "The c-index". Thus for the present scenario

c12 = n12/(n1 + n2 - n12)

A c-index can then be defined as the fraction of documents in which the queried terms co-occur. More technically, this fraction is an intersection/union ratio. This scenario can be expanded to include more terms and sets of results. For instance, in FIND ALL, for three terms co-occurring in a certain number of results (for now, we are taking 3 at a time from a pool of 3),

c123 = n123/Atotal

where this Atotal is, obviously, a different quantity from the previous Atotal. (See the "Combinations and Permutations" post for other c-indices. I will expand later on clusters of co-occurrences, a phenomenon not found in the k1,k2 case.)
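Readers who prefer set notation may note that this intersection/union ratio is simply the overlap of result sets. A toy Python sketch with made-up document IDs (nothing here comes from a real index):

```python
def c_index(result_sets):
    """Intersection/union ratio over any number of result sets --
    the c-index generalized from the two-term case."""
    inter = set.intersection(*result_sets)
    union = set.union(*result_sets)
    return len(inter) / len(union)

docs_k1 = {1, 2, 3, 4, 5}   # hypothetical documents containing k1
docs_k2 = {4, 5, 6, 7}      # hypothetical documents containing k2
# n1=5, n2=4, n12=2, so c12 = 2 / (5 + 4 - 2) = 2/7
print(c_index([docs_k1, docs_k2]))
```

With search engines we never see the raw document sets, only the counts, which is why the working formulas in this thread are written in terms of n1, n2, n12, etc.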

In general,

1. For semantically connected or topically related terms and concepts, co-occurrence can be taken as a measure of semantic association within a given IR system or search engine database. For thesaurus-based terms, c-indices are a measure of synonymity.
2. c-indices can be used for building libraries of synonyms, for find-similar features, and for query expansion in clustering algorithms.
3. Co-occurrence trends and patterns can be discovered by measuring c-indices as a function of time (time series analysis). Term co-occurrence, trends and patterns may differ across queried IR systems and search engines. Such studies provide non-trivial data-mining knowledge we can tap into.


TERMS CO-OCCURRENCE AND QUERY CONDITIONS

1. Co-occurrence experiments should be done in both FIND ALL and EXACT modes. Transpose indices should also be computed (reasons were given earlier).
2. Term co-occurrence experiments in FIND ANY should be avoided.
3. Bogus terms (stopwords, delimiters, etc.) should be avoided in query experiments. Ideally, we use synonyms or topically connected terms extracted from a dictionary or thesaurus (a keyword tracker utility showing topically connected combinations of queried terms comes in handy; more on this is coming).
4. FIND ALL queries produce results with term co-occurrence incidents without regard for sequence.
5. EXACT queries produce results with term co-occurrence with regard for sequence.
6. EXACT queries can be interpreted differently by different queried systems. What a human considers a phrase is not what an IR system considers a "phrase". For example, a system ignoring "a", periods, hyphens, etc. may interpret

...rap. A music...
...rap. Music...
Rap - Music
...rap music...

etc. as co-occurrence instances for the query "rap music". (Switch to country music, rock or salsa, if you wish).

How an IR system interprets an exact sequence affects c-index calculations, especially for queries conducted in EXACT mode.

In general, term co-occurrence may be defined differently across IR systems and search engines, since

1. each system parses information differently.
2. IR and commercial search engines are constantly updating their document databases.
3. IR and commercial search engines may be changing their parsing algorithms.
4. we may be dealing with non validated systems (faulty systems).


LIMITATIONS OF C-INDICES

1. When computing a c-index, one merely computes a fraction extracted from the results retrieved from a queried system, without considering the actual structure of the information contained in those results.
2. Crude c-indices tell nothing about WHERE precisely the terms co-occur in a document, how far apart the terms are from each other, or how many instances of co-occurrence (frequency of co-occurrence) take place in a given document. Other types of correlation indices are necessary.
3. c-indices are not silver-bullet solutions, magic pills or one-size-fits-all answers (this is more a limitation of the tester than of the theory).

Finally, we have the dreaded PRECISION and RECALL issues (IR folks know what I'm talking about). In addition, I haven't discussed issues related to the number of results shown versus the number of results present in a collection but not retrieved or not shown. Thus the c-indices we calculate are estimated values. Time series analysis helps assess many other issues not addressed by mere number-crunching of n1, n2, n12, etc. Let's stick to the basics first.


C-INDEX MISINTERPRETATIONS

c-index experiments can be conducted with synonyms or topically connected terms, or for query expansion experiments. But the technique should not be applied indiscriminately. Why?

Simply put, because term co-occurrence and connectivity are among the most misunderstood areas of semantics. And certainly these are not equivalent concepts. As in criminal courts, evidence of association is not proof that a crime was committed. However, repeated incidence of associations, patterns and trends of co-occurrences measured over time (time series co-occurrence) raises a red flag for any law-and-order or homeland security investigator. Right? [Incidentally, semantic co-occurrence research is a hot topic in government. See the references in this thread.]

A c-index is just a tool, and like any tool it can be used incorrectly and can lead to wrong conclusions. For example, if I query in FIND ALL a system that accepts single-letter words (i.e., do a test for k1 = a, k2 = u and k12 = a u), it is true that I may end up with a c-index value and with strong or weak term co-occurrence, but what good is that test, let's say, for improving semantic connectivity in a web document?

I can run similar tests for letters, numbers, delimiters, stopwords, etc., all sorts of nonsense k's, and still measure co-occurrence. So what? I can even intermix terms from dissimilar languages and extract c-indices. So what? Certainly these results may interest linguistics folks, but they probably do no good for average web documents. Right? Then the tool is no longer a tool, or even a toy, but an artifact.

There are now many "c-index" tools being tested in the background by what I call "keyworkers" and "keymarketers", or flying around online. One such tool is found at

http://graphnical.com/cindex/

Records reveal all sorts of good and bad selections of keywords, from stopwords to carefully crafted combinations of terms. The page claims the tool returns results in the FIND ALL and EXACT query modes but shows no way for users to specify the mode. At least I couldn't see a way to do this. Unless a selection feature for the modes is added, those results must be called into question. Still, I welcome this and any other tools; we need more of them. Me? I'll stick to the original, my "C-Index Calculator", unless a better one is constructed by those fine developers out there.


WHAT'S AHEAD


1. The k1,k2,k3 case revisited: CLUSTERS OF CO-OCCURRENCES
2. Prefixes, stems and copy style
3. Term Co-occurrence at the document level (GRANULARITY OF CO-OCCURRENCES)

Point 2 may interest writers, and point 3 may interest those conducting "keyword density" experiments.

Feel free to comment on the above before we start with new material. Please forgive any irritating typos or rational horrors I may have committed. Have a great weekend, all.


Orion

Last edited by orion : 06-12-2004 at 11:30 PM.
Old 06-12-2004   #86
orion

Join Date: Jun 2004
Posts: 1,044

Errata (Pardon, please)

In the previous post I changed the c123 expression to read

c123 = n123/Atotal

which is the correct one (and is more complete). As I mentioned in the post, the k1,k2,k3 case results in a co-occurrence clustering effect not found in the k1,k2 case. How Atotal is defined depends on the particular clustering case. This phenomenon requires a complete set of c-indices, as we will see. Again, I ask for your indulgence.

Orion

Last edited by orion : 06-12-2004 at 11:39 PM.
Old 06-13-2004   #87
DanThies
Keyword Research Super Freak

Join Date: Jun 2004
Location: Texas, y'all
Posts: 142
I expect that you'll find changes to the tool that was posted shortly, as we're all enjoying this quite a bit. We're working on something a little different, which will hopefully be posted soon as well.
Old 06-14-2004   #88
orion

Join Date: Jun 2004
Posts: 1,044

To Dan:

1. "I expect that you'll find changes to the tool that was posted shortly, as we're all enjoying this quite a bit. We're working on something a little different, which will hopefully be posted soon as well." Hi Dan. Hope you had a great weekend. I honestly wish more tools would hit the market soon. I endorse the idea and welcome all your excellent projects. The more analytical tools out there, the better. I cannot wait to read your series of articles.


This post may be a little abstract. In order to explain things to a wider audience, I am trying to conduct this thread in non-technical terms, thus avoiding standard IR nomenclature. Let's then retake the discussion. Please feel free to comment. The post is organized as follows:

1. k1,k2,k3 case revisited
2. clusters of co-occurrences


THE K1,K2,K3 CASE REVISITED


In a general sense, if we use Venn diagrams and the notion of area A (for visualization purposes only), then for non mutually exclusive events we have, for the k1,k2,k12 case, a working expression of the form

Atotal = n1 + n2 - n12

If I want to express this in terms of probabilities, then I can recite Theorem 9.5a from the "Handbook of Applied Mathematics for Engineers and Scientists" (Max Kurtz, McGraw-Hill, 1991):

"Let E1 and E2 denote two overlapping events. If an event E results from the occurrence of E1 or E2 or both, the probability of E is"

P(E) = P(E1) + P(E2) - P(E1 and E2)

which is of the same form as our simplistic working expression. The c12-index is then an intersection/union fraction representing the degree of term co-occurrence between k1 and k2:

c12 = n12/(n1 + n2 - n12)

A similar treatment applies to the transpose case (c21). In both cases there is only one co-occurrence region. Draw two overlapping circles and convince yourself.

The k1,k2,k3 case is not that simple. As we will see, this case involves different term co-occurrence scenarios I like to call "clusters of co-occurrence". Reciting Theorem 9.5b from the "Handbook of Applied Mathematics for Engineers and Scientists" (Max Kurtz, McGraw-Hill, 1991):

"Let E1, E2 and E3 denote three overlapping events. If an event E results from the occurrence of E1, E2, or E3 or any combination of them, the probability of E is "

P(E) = P(E1) + P(E2) + P(E3) - P(E1 and E2) - P(E1 and E3) - P(E2 and E3) + P(E1, E2 and E3)

In terms of our working expression we can write

Atotal = n1 + n2 + n3 - n12 - n13 - n23 + n123

therefore we end with...


CLUSTERS OF CO-OCCURRENCES

If we define c-indices as intersection/union ratios, then we need to write the following indices

c123 = n123/Atotal
c12 = n12/Atotal
c13 = n13/Atotal
c23 = n23/Atotal

Thus, if we talk about co-occurrence, we need to be very careful, since we need to know all the terms in Atotal (and we haven't yet considered combinations, permutations and transpositions for this scenario!).

From a practical standpoint, this presents a problem. If I instruct an IR system to FIND ALL n123 documents (i.e., those containing co-occurrence instances of k1, k2 and k3), I also need to know all the terms in the Atotal expression in order to calculate the c-indices.
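The bookkeeping for the three-term case, using the inclusion-exclusion Atotal given above, can be sketched in a few lines of Python. The counts fed in here are purely hypothetical, for illustration only:

```python
def cluster_c_indices(n1, n2, n3, n12, n13, n23, n123):
    """Three-term case: Atotal by inclusion-exclusion, then each c-index
    as an intersection/union fraction (multiply by 1000 for ppt)."""
    a_total = n1 + n2 + n3 - n12 - n13 - n23 + n123
    return {name: n / a_total
            for name, n in [("c12", n12), ("c13", n13),
                            ("c23", n23), ("c123", n123)]}

# purely hypothetical counts: Atotal = 1000+800+600-120-90-60+30 = 2160
indices = cluster_c_indices(1000, 800, 600, 120, 90, 60, 30)
print(indices)
```

Note that all seven counts are needed before any single c-index of the cluster can be computed, which is exactly the practical problem described above.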

As discussed in previous posts of this thread, I have found that by redefining a new pair of k1 and k2 as

new k1 = k1 and k2
new k2 = k3

I can "reduce" the scenario to a query expansion case. Still, there are cases in which this may not be done, since defining new k1 = k1 + k2 imposes a predefined sequence on the candidates new k2 = k3.

For this reason, I try to redefine the new k1 using previously tested k1 and k2 terms, since I know a priori their degree of connectivity (from my thesaurus). I also use a correlation matrix of co-occurrences when performing such tests. This approach allows me to conduct co-occurrence reinforcement tests on previously tested pairs.


CHAINED CO-OCCURRENCES

Suppose that

1. we have 3 terms, k1, k2 and k3;
2. there is no co-occurrence between k1 and k3 at all (or, if there is, it is negligible);
3. there is co-occurrence between k1 and k2, and between k2 and k3. So let's call k2 a "semantic bridge" term.

Visualize this as a circle overlapping two other circles, one at each side of the circle in the middle (the "bridge"), so there are only two intersection regions, at opposite extremes.

If we know

1. n1, n2 and n3 by querying separately k1, k2 and k3
2. n12 = # results containing k1 and k2
3. n23 = # results containing k2 and k3

Atotal = n1 + n2 + n3 - n12 - n23

Defining a c-index as an intersection/union ratio we can write

c12 = n12/Atotal
c23 = n23/Atotal

[This simple scenario leads to several combinations and permutations of the c-indices. Can you formulate these?]

The idea of terms acting as "semantic bridges" allows me to:

1. measure the degree of semantic commonality and difference between terms in IR systems
2. conduct exploration experiments with large chains of term co-occurrence incidents in databases and in individual documents.
3. use "bridge" terms to conduct semantic connectivity enhancement studies for lossy terms in documents or in collections of documents. [Let's call these contextual information reinforcement studies.]

Please feel free to comment.


Orion

Last edited by orion : 06-14-2004 at 07:32 PM.
Old 06-15-2004   #89
orion

Join Date: Jun 2004
Posts: 1,044

I want to thank Garrett French, Editor of iEntry's eBusiness channel, for recognizing that he made some errors in his calculations and interpretations of the c12-indices discussed in post #83 of this thread,

http://www.webproworld.com/viewtopic...bdd3ede43ec516

My gratitude and respect go to Garrett and WebProNews. We are all here trying to understand how emerging semantic search engine technologies are hitting the market and how these "smart" IR systems will use association concepts when indexing, organizing, retrieving and assigning semantic relevance to pieces of information.

Taking small steps now will make the learning curve less steep later. I am very happy that the SearchEngineWatch Forum is not just another marketing forum. We are talking about the new breed of semantic search engines that sooner or later SEOs/SEMs will have to deal with... Here is a revealing technical report on the SCORE architecture.

http://lsdis.cs.uga.edu/lib/download...2-SCORE-IC.pdf

The "Managing Semantic Content on the Web" paper describes with diagrams the IR architecture of SCORE and how it should work. Quote from paper:

"The benefits of semantic associations are best realized in applications that integrate data, metadata, and knowledge queries."

So, such semantic associations can be explored through the use of term co-occurrences. Furthermore, the use of terms acting as commands (not mere Boolean operators) reminds me of the concept of semantic "bridges" I introduced in my previous post. Using semantic bridges to connect terms with weak co-occurrence not only reinforces semantics in a document but also simplifies the "flow" and grouping of similar or alike concepts and ideas. I'm currently looking for someone interested in teaming up with me or in funding this type of research (emarketers, IR folks, university centers, gov folks, all are welcome).

For those interested in HAL and Microsoft's MINDNET Project, the following links

http://userweb.piasanet.com/tyale/prospect.htm
http://userweb.piasanet.com/tyale/mindnet.htm

are a 'must read'. HAL attempts to use associations "to have a default body of knowledge to better deal with any knowledge the average user"... The idea of using associations is to "provide HAL with common sense, to better recognize the implications of a user's statement to its entire accumulated body of knowledge." But HAL does more than this.

For Microsoft fans, the above reference states that there is a "semantic connectivity project at Microsoft Research, called MindNet, with currently more than seven million word associations." Clearly, semantic connectivity machines are here to stay.

On other matters:

A friend suggested I provide an example of "bridge" terms. Good point. In theory, any term can be a "bridge". As with any physical bridge, the best ones are those well constructed and frequently used. Thus, I have found that the best and most versatile "bridges" are

1. those semantically connected to the terms to be associated (which should themselves be loosely connected when co-queried; i.e., n13 = 0 or negligible).
2. those with a high frequency across the target IR system or intended search engine.

Here are some examples I tested yesterday in Google in FIND ALL mode. Results may have changed since then. (Read the previous post before proceeding with the test cases.) In all cases k2 is the "semantic bridge". Although I'm using some primitive examples, the cases may be relevant to copywriting style, I think (but I could be wrong).

Case 1

I'm trying to semantically connect, or improve the semantic association between, k1=pharmaco and k3=narcotraffic. I selected the term "drug", a high-frequency term somewhat associated with both pharmaco and narcotraffic.

k1=pharmaco = 117,000
k2=drug = 39,700,000
k3=narcotraffic = 679

k12=pharmaco drug = 38,900
k23=drug narcotraffic = 401
k13=pharmaco narcotraffic = 0


Case 2

I'm trying to semantically connect, or improve the semantic association between, k1=effervescing and k3=narcotraffic. I again selected "drug", a high-frequency term somewhat associated with both effervescing and narcotraffic. (BTW, effervescing comes from effervescence, meaning the formation of bubbles in a solution. Note: effervescence narcotraffic is also disconnected in Google.)

k1=effervescing = 6,460
k2=drug = 39,700,000
k3=narcotraffic = 679

k12=effervescing drug = 551
k23=drug narcotraffic = 403
k13=effervescing narcotraffic = 0

While not the best examples, they illustrate the concept of "bridges".
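Plugging Case 2's counts into the chained-co-occurrence expression from the previous post (Atotal = n1 + n2 + n3 - n12 - n23, with n13 taken as zero), a quick Python sketch:

```python
def chained_c_indices_ppt(n1, n2, n3, n12, n23):
    """Bridge scenario: k2 links k1 and k3 and n13 is assumed negligible,
    so Atotal = n1 + n2 + n3 - n12 - n23.  Returns (c12, c23) in ppt."""
    a_total = n1 + n2 + n3 - n12 - n23
    return 1000.0 * n12 / a_total, 1000.0 * n23 / a_total

# Case 2: effervescing / drug / narcotraffic counts quoted above
c12, c23 = chained_c_indices_ppt(6_460, 39_700_000, 679, 551, 403)
print(round(c12, 4), round(c23, 4))  # both tiny: the bridge links weakly connected terms
```

The huge n2 of the bridge term dominates Atotal, which is why both indices come out so small here; that is a property of high-frequency bridges, not an error.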

Challenge: Find a k3 in which k1=nigritude and the bridge is k2=ultramarine.

Talking about challenges, it would be no surprise if someone came up with a contest such as "Find two non-bogus terms with the strongest degree of co-occurrence (c12-index) in Google".

Finally, about the famous NU contest: we tested the semantic connectivity of k1=nigritude, k2=ultramarine over time and for a while. The time series for the c12-index was enlightening.

Last values were (in Google, FIND ALL mode)

k1=nigritude = 1,160,000
k2=ultramarine = 1,170,000
k12=nigritude ultramarine = 525,000
c12-index = 290.86 ppt

Thus the activity behind the NU phenomenon was measurable.

What's ahead: Prefixes, suffixes, stems

Feel free to comment.


Orion

Last edited by orion : 06-15-2004 at 10:03 PM.
Old 06-15-2004   #90
nuclei
Your Link Broker

Join Date: Jun 2004
Posts: 66
Quote:
Originally Posted by orion
I'm currently looking for someone interested in teaming with me or in funding this type of research (emarketers, IR folks, university centers, gov folks, all are welcome).
interesting
Old 06-16-2004   #91
orion

Join Date: Jun 2004
Posts: 1,044

PAUSING...

Before discussing the co-occurrence of stems and prefixes, I would like to make some refinements on the use of c-indices as a measuring device for the so-called Google bomb phenomenon.

I also want to make clear that I DON'T PROMOTE the idea of "bombing" Google or any search engine, nor do I even like the idea. Still, I feel I need to present the following information, since it affects upcoming subjects on c-index extraction.

As mentioned in this thread, I defined a c-index as an intersection/union ratio. Such ratios appear in many forms and can be measured for non mutually exclusive events. I avoided the probabilistic definition and derivation to simplify the discussion. For now, knowing that c-indices are a measure of the degree of co-occurrence is enough.

For term co-occurrence, measuring such ratios from data extracted from commercial IR systems and search engines is tricky, since one must preselect the proper query mode. If we are interested in conducting specialized searches, we also need to define WHERE to search. This is a topic I delayed in order to present the basics first. (I know SEOs "cannot wait" for this topic.)

If I want to conduct precise c-index calculations in conjunction with specialized searches, I can order the system to search, for example, in titles, URLs or links. [I will expand on these types of searches soon, since they may interest keyword researchers.] For now, let's stick to searches in links.

For the Nigritude Ultramarine "bomb", in Google I can conduct a "search only in links", in either FIND ALL or EXACT mode. In Google, using FIND ALL, in-links only, in ppt and to two decimal places, this is what I found today:

k1=nigritude || n1 = 53,000
k2=ultramarine || n2 = 55,200
k12=nigritude ultramarine || n12 = 11,600

c12=120.08 ppt

Compare with the c-index calculated in previous post.

The difference in c-indices is due to the fact that we are measuring term co-occurrence in links only, which is in harmony with the original concept of link bombing. Strictly from the standpoint of bombing Google via links (not with the "global" information present in documents), this is a more accurate ratio and a better way of tracking Google bombs.

Still, if I were interested in tracking and identifying the onset of a "Google bomb", I would use both c-index calculations (find-in-links and find-anywhere). Here is a list of c-indices for some link bombs (as of today's conditions: in Google, FIND ALL, in-links only, case insensitive, 4 decimal places):

Nigritude Ultramarine
k1=nigritude=53,000
k2=ultramarine=55,200
k12=nigritude ultramarine=11,600
c12=120.0828 ppt

Miserable Failure
k1=miserable=6,420
k2=failure=21,500,000
k12=miserable failure=555
c12=0.0258 ppt

Talentless Hack (BTW, this is one of the first bombs described by Adam Mathes)
k1=talentless=271
k2=hack=15,600,000
k12=talentless hack=21
c12=0.0013 ppt
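The three bomb indices above all come from the same two-term formula; a small Python script reproduces them (counts as quoted, in-links only, and of course subject to change):

```python
def c12_ppt(n1, n2, n12):
    """Two-term c-index (intersection/union ratio) in parts per thousand."""
    return 1000.0 * n12 / (n1 + n2 - n12)

# in-links-only counts quoted above for each bomb
bombs = {
    "nigritude ultramarine": (53_000, 55_200, 11_600),
    "miserable failure":     (6_420, 21_500_000, 555),
    "talentless hack":       (271, 15_600_000, 21),
}
for phrase, (n1, n2, n12) in bombs.items():
    print(f"{phrase}: {c12_ppt(n1, n2, n12):.4f} ppt")
```

Re-running such a script on fresh counts at regular intervals gives exactly the time series of c-indices discussed below.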

Feel free to replicate these tests with "President Waffles" or your favorite bomb, keeping in mind that c-indices can differ when calculated in-entire-document, in-title only, in-links, in-URLs, etc. In fact, different types of information and analysis can be extracted from such "localized" indices. Such information could interest SEOs/SEMs. More on this is coming.

Observations

1. Notice how a c12-index assesses the contributions of individual k's to a two-term bomb (for a one-term bomb we are out of luck).
2. Notice the degree of success of the bombs. While the dramatic differences could be the result of purging actions by Google since the bombings, notice the dramatic presence the nigritude ultramarine bomb still has in the retrieved link collection.

Finally, if I use a time series analysis of c-indices,

1. I'm providing Google's researchers with a simple method for monitoring the onset of potential Google-bombing activities. [That's a freebie I'm giving to Google researchers.] However, there is a drawback, since...

2. someone can use the method described here to measure the success of a Google bomb, or of purging and remedy actions by Google during or after a bomb.

3. someone may be tempted to use the above procedure to measure the success of competing Google bomb contests starting at a given time, t.

Feel free to comment. Anyone interested in commenting privately may do so through the private email feature of the SEW forum.

I'm working on the stems and prefixes material.


Orion

Last edited by orion : 06-16-2004 at 01:16 PM.
Old 06-16-2004   #92
orion

Join Date: Jun 2004
Posts: 1,044

To runarb's thread, post#1:

Welcome and feel at home. Excellent reference material.

To Red5's thread, post#1:

"I am avidly following Orion's excellent discussion regarding Keywords Co-occurrence and Semantic Connectivity, and I'm aware that he's just about to start talking about word stemming and prefixes, so I hope I don't preempt him too much here. Sorry, Orion, if I do!" Hi, Red5. Happy to see someone interested in this area of IR. The more IR concepts we introduce to the mainstream the better. Please feel free to elaborate on the topic of stems. BTW, excellent references.

To Incubator, Garrett's thread, post #2:

Welcome and feel at home. Excellent reference link, http://javelina.cet.middlebury.edu/c...ork_Graphs.pdf, on LSI, which touches many areas of SC (Semantic Connectivity).

To Garrett's thread, post #1:

1. "Does a search engine algorithm interact with semantic connectivity or is it merely a measure of something that exists in a database?" Hi, Garrett. Welcome and feel at home. Term co-occurrence measured with c-indices is intended to measure the degree of semantic connectivity. We can talk about this at two different levels, i.e. (a) at the database level and (b) at the level of individual documents. So far I am introducing the concept at the database level. Discussions at the document level are coming. In theory, SC can be measured at both levels, without regard for the particular IR system or database under consideration.

The semantic connectivity concept is not new and is somewhat implicit in most IR and ranking algorithms based on Salton's models, in which cosine similarities are used. This includes most current IR systems and search engines that use a Salton component for retrieval and ranking. My model is somewhat different in the sense that it expands on the idea of intersection/union ratios (of non mutually exclusive events) as a simplified component for semantic measurements.

2. "What is the unit of measurement of semantic connectivity? Is it the ppt? What are the other measures." Good question. There is no unit for semantic connectivity, at least not in my c-index model, since it is merely a dimensionless ratio. Since this ratio runs from 0 to 1 and usually takes small values, I elected to express it in parts per thousand (ppt) by simply multiplying it by 1000. I could have multiplied the ratio by 100 instead and expressed it as a percentage.

3. "What are stop words" Streams of characters to be ignored by the IR system during parsing, indexing and retrieval. Which strings are treated as stop words depends on how the IR system was programmed; they often include very common terms, but are not limited to those. For example, an IR system (i.e., a vertical portal) about "jobs" will probably consider "jobs" a stop word, since querying that term in its own IR system is probably redundant.
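A minimal sketch of the stop-word filtering described above (the stop list here is purely illustrative; a real IR system defines its own, and may include domain terms like "jobs"):

```python
# Illustrative stop-word filtering at parse time. The stop list is
# hypothetical; as described above, a vertical portal about jobs
# might add "jobs" itself to the list.
STOP_WORDS = {"the", "a", "an", "of", "and", "jobs"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(tokenize("Engineering jobs in the city of Melbourne"))
# ['engineering', 'in', 'city', 'melbourne']
```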

4. "What are query mode conditions?" Selection of query modes. Most search engines use FIND ALL, FIND ANY, EXACT, etc.

5. "If everyone let semantic connectivity drive their copy writing how would this affect a database? Would that database lose its value to the searchers?" This question describes a speculative scenario. I leave that question open for others to speculate or to comment on. Any comment is welcome. Hope this helps, Garrett.

To Dodger:

Thanks for the posts at other forums.


I'm trying to keep discussions relevant to this thread in this thread. Still I welcome other threads on IR topics. The more IR topics we have in The SEW Forum, the better.


Orion

Last edited by orion : 06-17-2004 at 10:40 AM.
orion is offline  
Old 06-16-2004   #93
Incubator
Member
 
Join Date: Jun 2004
Location: toronto
Posts: 260
Incubator has a spectacular aura about
Quote:
Originally Posted by orion
To Incubator, Garrett's thread, post #2:

Welcome and feel at home. Excellent referenced link (http://javelina.cet.middlebury.edu/...work_Graphs.pdf) on LSI which touches many areas of SC (Semantic Connectivity).

Orion
Thanks for the acknowledgement. I also found the references (links) at the end of that .pdf very informative.

Cheers

Wayne
Incubator is offline  
Old 06-17-2004   #94
yleewolf
Newbie
 
Join Date: Jun 2004
Posts: 3
yleewolf is on a distinguished road
interpretation in layman's terms??

Hi everyone. Great thread so far even if it is a little heavy on the numbers in places! I was really looking for a simple interpretation of the following c-indices results.

k1=baccarat
k2=gambling
k12=baccarat gambling
c-index=46

k1=baccarat
k2=baccarat gambling
k12=baccarat baccarat gambling
c-index=435

I really don't know what this is telling me. If I search for baccarat, does it mean that there is a much higher occurrence of the phrase "baccarat gambling" in the results than there is of the word "gambling"?

Can k1 be used as the keyword that you want a page to be optimised for and then use k2 to find phrases that appear more often in the results for k1?

Anybody else in the same boat as me? I appreciate that this is not a simple black and white tool with a pot of gold on the end of it but it would be nice to have some basic theoretical uses for it.

Thanks for your help in advance.

Wolf
yleewolf is offline  
Old 06-17-2004   #95
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044
orion is a splendid one to behold
Exclamation

Welcome, Wolf, to this thread. It's an honor to have you here.

I'll try to answer your question to the best of my knowledge. First, I'll need more information:

1. What query mode conditions did you use?
2. From which IR system/search engine did you extract the c-indices?

Without this information it is hard for us to assess the question.

Regarding: "If I search for baccarat, does it mean that there is a much higher occurrence of the phrase "baccarat gambling" in the results than there is of the word "gambling"?" c-indices do not measure occurrences. A c-index measures the degree of co-occurrence of two or more k's in a given database. Occurrence and co-occurrence are two different things.

Both the database and query mode must be defined. Let's assume you do a search in FIND ALL mode. A search for "baccarat" in a database should return documents containing "baccarat" as well as "baccarat gambling". A search for "baccarat gambling" should return documents containing "baccarat gambling" regardless of sequence. Regardless of the occurrence numbers, we still need to measure their co-occurrence and have something to compare it with.

On the other hand, if I do the query experiment in EXACT mode, a search for "baccarat gambling" should return documents with regard for sequence and containing the phrase "baccarat gambling" as well as documents interpreted as containing this as a "phrase" (for a "phrase" as perceived by an IR system -not a human-, see previous posts). These should be a subset of the results obtained in FIND ALL mode.

Regarding: "Can k1 be used as the keyword that you want a page to be optimised for and then use k2 to find phrases that appear more often in the results for k1?" Excellent question. I haven't yet discussed semantic connectivity at the document level. We're still discussing it at the database level. Two different things. That's coming.

To answer the question without getting into details, in theory the answer is a conditional "yes". Why conditional? One must consider the whole picture; i.e., proper copywriting style for the target market space, topic and demographic, what exactly a client wants to target, etc. c-indices are not silver bullets. I'm still unveiling the basics at the database level. When we discuss term co-occurrences and semantics at the document level, we will discuss these and other topics as well. The whole thesis of this thread is to introduce SEOs/SEMs to analytical tools and perhaps, in the process, remove some trial-and-error or second-guessing approaches from the scene.


Note. A word on repeating k's for c-index extractions. c-indices are intended to measure co-occurrence. Repetition of terms can impose a false bias in the c-index values which otherwise would not be there. Certainly, using something like k1 = k2 = T, where T is a term, produces results with no semantic significance (at least not from the co-occurrence standpoint).

I hope this helps.


Orion

Last edited by orion : 06-17-2004 at 10:38 AM.
orion is offline  
Old 06-17-2004   #96
yleewolf
Newbie
 
Join Date: Jun 2004
Posts: 3
yleewolf is on a distinguished road
Hi Orion

Thank you for your speedy reply. In answer to your first questions, I put the terms into the k1 and k2 boxes here: http://graphnical.com/cindex/ . What I am trying to do is to use the same k1 and measure the indices against different k2 phrases. So, referring back to my earlier post, if a Google user queries for baccarat, do my c-index results tell me that I am probably better off using the phrase "baccarat gambling" in my content rather than just "gambling", due to its much higher correlation factor, or are the results distorted by the repetition of the word "baccarat" in the k12?

Sorry if I'm pre-empting a future discussion.

Regards

Wolf
yleewolf is offline  
Old 06-17-2004   #97
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044
orion is a splendid one to behold
Exclamation

To Wolf:

"Thank you for your speedy reply. In answer to your first questions, I put the terms into the k1 and k2 boxes here: http://graphnical.com/cindex/ . What I am trying to do is to use the same k1 and measure the indices against different k2 phrases. So, referring back to my earlier post, if a Google user queries for baccarat, do my c-index results tell me that I am probably better off using the phrase "baccarat gambling" in my content rather than just "gambling", due to its much higher correlation factor, or are the results distorted by the repetition of the word "baccarat" in the k12?" Sorry I couldn't answer as speedily as before. Hope this helps.

You may have answered your own question ("are the results distorted by the repetition of the word "baccarat" in the k12?"). Certainly.


First, let's define the query conditions for measuring co-occurrence in the queried database: the target is Google, and the experiment is conducted in FIND ALL and find-anywhere-in-document mode. (For FIND ALL in-title, url-only, etc., see previous posts.)

Your question involves two different issues

(a) how users formulate queries to the database (user behaviors)
(b) what is interpreted as being present in the queried database (IR behaviors)

If we want to consider co-occurrence not just from what is in the database, but from what users actually search for (user behaviors), then we should compute c-indices using the query mode conditions used by average users. Thus, the discussion that follows applies to average users (a full discussion on users' query behaviors is ahead of us).

Average users tend to use the default query mode of the queried database or IR system when searching for information.

Fortunately, the default mode used by Google is FIND ALL anywhere-in-document (and without regard for sequence), not FIND ANY (also known as the OR mode). In Google, FIND ANY or OR is the "at least one of the words" option of the Advanced Search tool.

Consequently, in FIND ALL mode, documents containing all terms of the query are more likely to be found and returned. Thus:

1. A search for "baccarat gambling" in Google should return documents containing "baccarat" and "gambling" without regard for sequence, location, or how and how many times the terms occur in the documents.
2. Similarly, a search for "baccarat baccarat gambling" should return documents containing "baccarat" and "gambling" under the same conditions. Any difference between 1 and 2 should be relatively small. See results below.

The relative difference between 1 and 2 is 2,000/945,000 = 0.002 or about 0.2%.

Target: Google
Mode: FIND ALL
Date/Run: 06-17-04/11:19 AM

k1=baccarat ; n1=1,860,000
k2=baccarat gambling ; n2=947,000
k12=baccarat baccarat gambling ; n12=945,000
c12=507.52 ppt

k1=baccarat ; n1=1,860,000
k2=gambling ; n2=17,300,000
k12=baccarat gambling ; n12=947,000
c12=52.00 ppt

Observations: 507.52 >> 52.00 since

gambling=17,300,000 >> baccarat gambling=947,000

Reusing "baccarat" in "baccarat gambling" adds bias to the denominator of the calculated c value.
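The two calculations above can be reproduced directly from the posted counts (a Python sketch; the counts are the 06-17-04 Google figures quoted above):

```python
# c-index as an intersection/union ratio of result counts, expressed
# in parts per thousand (ppt), per the definition used in this thread.
def c_index_ppt(n1, n2, n12):
    return 1000.0 * n12 / (n1 + n2 - n12)

# k1=baccarat, k2=baccarat gambling, k12=baccarat baccarat gambling
print(round(c_index_ppt(1860000, 947000, 945000), 2))    # 507.52 ppt

# k1=baccarat, k2=gambling, k12=baccarat gambling
print(round(c_index_ppt(1860000, 17300000, 947000), 2))  # 52.0 ppt
```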


Finally,

A search for "baccarat" in Google should return documents containing "baccarat", "baccarat gambling" and also "baccarat baccarat gambling".

Certainly, these results should differ if we conduct the experiment under other query conditions. If we conduct the experiment in EXACT mode (with regard for sequence; see posts #52 and #86), then a search for "baccarat baccarat gambling" should return only documents containing that sequence. This could include documents containing the "phrase" and delimiters; something like

.....baccarat baccarat gambling...
.....baccarat. baccarat gambling...
.....baccarat - baccarat gambling...
.....baccarat | baccarat gambling...

...etc..

See posts in this thread on what is/is not an EXACT search and a "phrase". It all depends on how the IR system was programmed to parse the information. If I instruct a system not to ignore hyphens or pipes and use EXACT mode, documents containing

.....baccarat - baccarat gambling...
.....baccarat | baccarat gambling...

will probably be ignored when I search for "baccarat baccarat gambling".


A c-index is a tool, and like any tool it can be misused or its results misinterpreted or artificially inflated (see post #86 of this thread). If we want to make correlations, draw conclusions from what the c-index values represent, and include users' query behaviors, then we need to consider just that: how many average users search for "baccarat baccarat gambling" instead of "baccarat gambling", etc.

As mentioned in this thread, a combination of keyword trackers (what we search), c-index values (what/where we search) and actual user behaviors (how/where we search) is necessary. This is where we are heading.

To conclude, c-indices are intersection/union ratios of non mutually exclusive events, or if you wish, probability ratios. The higher that ratio the higher the probability of co-occurrence of randomly selected terms.

For a c12-index, the extreme case c = 1 with

k1 = k2 = k12

is an illusion and does not exist, since it would require that

c = n12/(n1 + n2 - n12) = n12/n1, in which case we cannot talk in terms of non mutually exclusive overlapping events.

Consequently, it is not surprising that, say, a c12-index value approaches 1 as more term repetitions are included in the associated k's.
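The inflation described here can be seen numerically (the counts below are illustrative only): as the two result sets approach identity, the intersection/union ratio tends to 1.

```python
# As n2 and n12 approach n1 (the two result sets approach identity),
# the intersection/union ratio tends toward the degenerate value 1.
def c_index(n1, n2, n12):
    return n12 / (n1 + n2 - n12)

for n12 in (100, 500, 900, 1000):
    print(n12, round(c_index(1000, 1000, n12), 3))
# 100 0.053, 500 0.333, 900 0.818, 1000 1.0
```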

This is not necessarily a drawback of the theory. If I ran a search engine, I would suspect artificially inflated c values to be

1. the result of an artificial bias imposed in the query, which I of course cannot control or avoid, or...
2. evidence of true co-occurrence spamming/bombing activities taking place in the database collection, which I can detect, control and monitor over time (see posts #90, #92 and the "Google Bombs" material).


For the online tool you and others may be using, check post #86 of this thread. While a useful tool, it can only be used with Google. They should have placed some instructions on how to use it or how to interpret results. Still I welcome the development of this and additional tools.

Hope this helps.


Orion

Last edited by orion : 06-17-2004 at 05:12 PM.
orion is offline  
Old 06-17-2004   #98
orion
 
orion's Avatar
 
Join Date: Jun 2004
Posts: 1,044
orion is a splendid one to behold
Exclamation

Errata (Oops, Pardon, Perdon)

I am correcting the previous post. At some point I originally wrote the following, and I quote:

"For a c12-index, the extreme case c = 1 is an illusion and does not exist, since it would require that

k1 = k2 = k12
c = k12/(k1 + k2 - k12) = k12/k1, in which case we cannot talk in terms of non mutually exclusive overlapping events."

It should read as follows (already corrected):

"For a c12-index, the extreme case c = 1 with

k1 = k2 = k12

is an illusion and does not exist, since it would require that

c = n12/(n1 + n2 - n12) = n12/n1, in which case we cannot talk in terms of non mutually exclusive overlapping events."

Note the use of n's in the c-expression and the line sequencing. I edited the post to reflect my intentions. At this time of the day, boy, I'm tired. Again I ask for your indulgence for this gross rational horror.


Orion
orion is offline  
Old 06-17-2004   #99
yleewolf
Newbie
 
Join Date: Jun 2004
Posts: 3
yleewolf is on a distinguished road
Man! Thank you for your detailed response! You kind of lose me halfway through but I appreciate you trying to explain it to me! There obviously is no simple answer to my question but just so we're absolutely clear:

I am not talking about the SE user searching for anything other than "baccarat". This keeps it simple as it is EXACT and FIND ALL. What I thought the c-indices that I posted earlier were telling me was that my content on a page optimised for the word "baccarat" would benefit more from the inclusion of the words "baccarat gambling" rather than the single word "gambling".

I now know that this is not what the indices are saying, as the repetition of "baccarat" creates a distortion in the results. Can we conclude then that this tool cannot help us when trying to pick relevant words for content optimisation? Can I stick to my knowledge of the English language, common sense and all the other things we content builders have to rely on?

Orion - I really appreciate the time and effort you have gone to in trying to answer my question. I shall continue to read with interest!

Regards

Wolf
yleewolf is offline  
Old 06-17-2004   #100
DanThies
Keyword Research Super Freak
 
Join Date: Jun 2004
Location: Texas, y'all
Posts: 142
DanThies is a name known to all
Orion, all I can say is just keep going.

What we're working on is a potential aid to keyword research. Well, it's really just an excuse for me to play around with this stuff, but it might become a tool some day. Here's what we're working on; I'd like to hear from anyone here who has ideas to improve it:

1) Starting with a single search term, we take the top 10 results from each of 3 search engines (Google, Teoma, and Yahoo), which gives us a list of up to 30 URLs.

2) For each URL in the list, we use our spider to index the page, and extract a list of 1, 2, and 3 word search terms from the page, ignoring stop words. This list represents "candidate" search terms that may be related to our original search term.

3) For each search term in our list, we perform a c-index calculation using Google results, with exact phrase matching, sort and present the results.

I assume that our results will improve as we expand the number of URLs to crawl. We haven't done so yet, but we're also considering a deep crawl based on, say, an entire Open Directory category.

So far, this hasn't proven extraordinarily effective, but it has helped us discover some related search terms when our standard tools failed us. We may need to bolster the automated (crawler-driven) discovery with a lexical database like WordNet (http://www.cogsci.princeton.edu/~wn/), not sure when we'll get around to that.
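Dan's three steps might be sketched roughly as follows. This is a hypothetical outline only: `top_urls`, `extract_terms` and `result_count` are stand-ins for the crawler and search-engine count lookups the real tool would perform, and the c-index scoring follows the intersection/union definition used in this thread.

```python
# Hypothetical sketch of the three-step candidate-term pipeline.
# top_urls, extract_terms and result_count are stand-ins for the
# crawler and search-engine lookups an actual tool would implement.
def c_index_ppt(n1, n2, n12):
    # intersection/union ratio of result counts, in parts per thousand
    return 1000.0 * n12 / (n1 + n2 - n12)

def rank_candidates(seed, top_urls, extract_terms, result_count):
    # 1) gather the top result URLs for the seed term
    urls = top_urls(seed)
    # 2) spider each page for candidate 1-3 word terms (stop words removed)
    candidates = set()
    for url in urls:
        candidates.update(extract_terms(url))
    candidates.discard(seed)
    # 3) score each candidate against the seed and sort by c-index
    n1 = result_count(seed)
    scored = []
    for term in candidates:
        n2 = result_count(term)
        n12 = result_count(f"{seed} {term}")
        scored.append((term, c_index_ppt(n1, n2, n12)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A caller would plug in real fetch-and-count functions; with stubbed counts the ranking reduces to the c-index arithmetic discussed earlier in the thread.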
DanThies is offline  