Keywords Co-occurrence and Semantic Connectivity


orion
06-02-2004, 11:14 AM
Hello. I'll introduce myself as Orion. I'm a formal scientist with a special interest in AI applied to IR technology. Let's start this thread with a brief description of keyword semantic connectivity and what it can do to improve success across search engines. My goal is that SEO/SEM R&D departments once and for all start using scientific tools rather than mere rumors and 2nd-guessing thoughts disguised as "seo expert tips". Apologies in advance for my 2 cents. I'd like to present the key concepts; then anyone interested can start commenting.

According to fuzzy set theory ("Modern Information Retrieval"; Baeza-Yates and Ribeiro-Neto; Addison-Wesley, 1999), the degree of term co-occurrence in a database is a measure of semantic connectivity (SM) and can be used to build a thesaurus for that database. Some engines use term co-occurrence in their query expansion algorithms. Understanding how one can measure term co-occurrence could be used to carefully select keywords that are semantically connected in a given search engine database. As an added benefit, SM makes the excessive repetition of keywords (keyword spamming) unnecessary.

Let us start with the simple case of two keywords (k1 and k2). Later on we can expand to other cases (more than 2 keywords, keyword transposition, entropy relevance, etc).

Let n1 and n2 be the number of search results containing k1 and k2, respectively, and let n12 be the number of search results containing both terms. (In practice one searches for k1, then for k2, and finally runs a composite query consisting of k1 and k2.) Using geometric arguments and fuzzy sets, it can be demonstrated that there exists an index, termed the correlation index, c, such that

c = n12/(n1 + n2 - n12)

Thus c ranges between 0 and 1. Term correlation increases as c approaches 1. In a given search engine or IR database, this allows us to

a. test paired keywords from a pool of candidates to find the combination with the highest semantic connectivity (for that database).
b. build a thesaurus of synonyms targeting that database.
c. build a query expansion or find-similar library.
d. carefully craft titles and descriptions of web pages.
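
As a minimal sketch, the correlation index above can be written as a small Python function. The name `c_index` and the input checks are illustrative assumptions, not something taken from the reference text; the formula is the one given above, which has the same form as the Jaccard coefficient of the two result sets.

```python
def c_index(n1: int, n2: int, n12: int) -> float:
    """Correlation index c = n12 / (n1 + n2 - n12).

    n1, n2: number of results containing k1 and k2, respectively.
    n12:    number of results containing both terms.
    """
    if min(n1, n2, n12) < 0:
        raise ValueError("counts must be non-negative")
    if n12 > min(n1, n2):
        # Results containing both terms cannot outnumber either single-term set.
        raise ValueError("n12 cannot exceed n1 or n2")
    union = n1 + n2 - n12
    if union == 0:
        return 0.0  # neither term occurs anywhere; treating c as 0 is an assumption
    return n12 / union


# Limiting cases: no co-occurrence gives 0, identical result sets give 1.
print(c_index(100, 50, 0))
print(c_index(80, 80, 80))
```

Result counts reported by a live engine are often estimates, so the strict consistency check may need to be relaxed in practice.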

Enough for now. Anyone interested in commenting? Please excuse any typos in advance.

Regards

Orion

Anthony Parsons
06-02-2004, 11:32 AM
>Enough for now. Anyone interested in commenting?

Damn Straight.

Now if I understand this correctly, you're taking what we already know and use, which I relate to "applied semantics" & "latent semantic indexing", and (in a NASA kind of way) making it more complicated?

If I read this correctly, by placing synonyms and structured thesaurus terms within our page copy, we should rank higher? If this is correct, then any good SEO copywriter should be doing this already.

I know that latent semantic indexing was more a myth than anything, though testing the placement of thesaurus terms within the text, along with writing well-structured pages, showed clear enough positives in the rankings to make it believable beyond just a myth.

Am I on the right track here Orion?

orion
06-02-2004, 12:03 PM
>Damn Straight.

>Now if I understand this correctly, you're taking what we already know and use, which I relate to "applied semantics" & "latent semantic indexing", and (in a NASA kind of way) making it more complicated?

>If I read this correctly, by placing synonyms and structured thesaurus terms within our page copy, we should rank higher? If this is correct, then any good SEO copywriter should be doing this already.

>I know that latent semantic indexing was more a myth than anything, though testing the placement of thesaurus terms within the text, along with writing well-structured pages, showed clear enough positives in the rankings to make it believable beyond just a myth.

>Am I on the right track here Orion?

Thank you for responding to the post.

I agree with most of your points. Any proper use of thesaurus-driven terms must be weighed against proper copywriting style and against what works well for the targeted product or service.

Fuzzy set theory is not an IR theory, and the c-index is merely a tool used by IR researchers to build thesauri and query expansion libraries. The central point of the post was how SEOs can use the c-index to identify semantically connected terms from a pool of candidate terms for a given search engine database. (c-values can be computed with a simple calculator.) The expression "for a given search engine" is very important. The c-index for two keywords is not necessarily the same in Google, Yahoo, MSN, etc. It varies from engine to engine.

Certainly SEOs should stick to anything that works well for their clients. I'm simply trying to propose analytical tools well known to IR scientists, most of which predate the first wave of search engines and go back to the 1960s and early 1970s. Understanding how IR analytical tools work does not hurt.

I'll try to keep my posts very simple and basic.

Orion

Anthony Parsons
06-02-2004, 12:08 PM
This is very interesting Orion. I love to learn new things, and this is it today. I look forward to reading your further posts.

Anthony Parsons
06-02-2004, 12:18 PM
Ok, I am heaps confused, I think, but very interested to learn this one. It sounds similar to what Word Tracker produces, but not quite.

Ok, let's use examples, I love examples.

Search Engine: Google
k1: web design
k2: webdesign

n1: 8,910,000
n2: 10,400,000
n12: 1,300,000

c = n12/(n1 + n2 - n12)

c = 1300000/(8910000 + 10400000 - 1300000)

c = 1300000/18010000

c = 0.0722

Ok, now can you explain those results to me please Orion.

orion
06-02-2004, 12:47 PM
>Ok, let's use examples, I love examples.

>Search Engine: Google
>k1: web design
>k2: webdesign

>Ok, now can you explain those results to me please Orion.

I'd love to, Tony.

The beauty of c-indices is that they are easy to compute with a calculator or a simple JavaScript (we have a lot of them for mathematical semantic analyses).

First, the c-index for term co-occurrence should be used for synonyms (equivalent terms found in a dictionary). Second, the k1 and k2 terms should be single terms, as found in a dictionary. However, the concept can be extended to cases like the example above, provided one instructs the engine to recognize phrases as a single term (e.g., using " ").

General Procedure

1. A pool of, let's say, five candidate terms (candidate k2's) is selected. We want to determine which of the five terms co-occurs the most with a preselected term k1 in a given database (e.g., Google). Co-occurrence is a kind of evidence of semantic connectivity.

2. Test k1 and each of the individual candidate k2's separately in Google.

3. Compute c-index for each case, always using the same k1 and different k2's.

4. The optimum combination of k1 and k2 is the one with the highest c-index.

5. Emphasize k1 and k2 in the document as required (e.g., in titles, descriptions, etc).

Repeat the recipe for other search engines. c-indices may change, which indicates that semantic connectivity is different across databases. (Very important for SEOs!)
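
The recipe above can be sketched in a few lines of Python. Everything named here (`rank_candidates`, the terms, and especially the counts) is hypothetical and made up for illustration; real counts would come from querying the chosen engine for k1, for each k2, and for each composite k1 k2 query.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


def rank_candidates(n1, candidates):
    """Rank candidate k2 terms by their c-index against a fixed k1.

    n1:         result count for k1.
    candidates: dict mapping each k2 to (n2, n12), its own result count
                and the count for the composite query "k1 k2".
    Returns a list of (k2, c) pairs, highest c first.
    """
    scored = [(k2, c_index(n1, n2, n12)) for k2, (n2, n12) in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up counts for illustration only (not real search engine data).
n1 = 1000  # results for the preselected k1
candidates = {
    "term_a": (400, 100),
    "term_b": (2000, 150),
    "term_c": (300, 120),
}

for k2, c in rank_candidates(n1, candidates):
    print(f"{k2}: c = {c:.4f}")
```

Running the same function with counts collected from another engine may produce a different ordering, which is the reason for repeating the recipe per engine.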

Note.

We haven't discussed term transpositions yet.
We haven't discussed cases with more than 2 keywords yet.
We haven't discussed linguistic characteristics (e.g., c-indices for Spanish or French keywords) yet.

Orion

Anthony Parsons
06-02-2004, 01:10 PM
I actually understand that.

orion
06-02-2004, 01:19 PM
Excellent, Tony.

I'd love to talk more about the concept, including the connection with rankings and the time evolution of c-indices (c's change from time to time). For now, I need to get off the forum to attend to routine work and life issues. Till next time.

Orion

orion
06-03-2004, 12:19 PM
About c-Index Calculations

c-Indices are excellent tools for building thesauri, find-similar libraries, query expansions and clustering algorithms. Before presenting some examples of c-index calculations I would like to point out that

a. c-indices run from 0 to 1; we are comparing small relative quantities. Thus, c-indices can be expressed as % or ppt (parts per thousand).

b. In a strict sense, and due to recall and precision arguments, the n's in the c-index expression are the numbers of documents in an IR database containing the corresponding queried term. This is not necessarily the same as the number of results retrieved and shown by the IR system. However, assuming good retrieval performance and strict adherence to pattern matching of regular expressions, the n's can be taken as the number of results produced by a search engine.

c. When consisting of more than one term, k1 and k2 should be enclosed within quotes (""). In this way the IR system will interpret k1 and k2, or any k, as a single keyword (a phrase keyword).

d. We need to distinguish between c-indices for synonyms and c-indices for query expansion or query refinement. Example: the thesaurus utility of my MS Word shows auto and automobile as synonyms for the term car. But for the term calculator, it produces the following "synonyms": data processor, mainframe, mini computer, PC, multitasking computer, computer, CPU, analytical engine, artificial intelligence, which clearly correspond to query refinement and clustering considerations rather than to synonyms.

e. Interpretation of c-indices is not "black-and-white". One must consider semantic, language and usage characteristics. A single term may occur in other languages, or may have different meanings in different languages, countries or demographics. Carefully crafted c-indices can work as semantic discriminants. Wrongly crafted c-indices can produce messy results. Welcome to the art of analytical semantics.


Having said that, let's do some simple calculations.

Case 1: Single terms (synonyms, similar terms)

By querying Google for car, auto and automobile we obtain

k1 = car: n1 = 224,000,000
k2 = auto: n2 = 124,000,000
k1 and k2 (car auto): n12 = 13,000,000
c = 0.0388 or 38.8 ppt

k1 = car: n1 = 224,000,000
k2 = automobile: n2 = 50,400,000
k1 and k2 (car automobile): n12 = 10,500,000
c = 0.0398 or 39.8 ppt

Results: Thus in Google, k1=car and k2=automobile seem to have a greater synonymity association (semantic connectivity) than k1=car and k2=auto. Note: the large number of results for k2=auto is not surprising; (a) auto is a word in other languages (e.g., Spanish) and (b) auto is a root for automobile, automatic and derivative terms.
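
The Case 1 figures can be rechecked with a short Python sketch. The counts are the ones quoted above; the helper names `c_index` and `as_ppt` are assumptions introduced for illustration.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


def as_ppt(c):
    """Express a c-index in parts per thousand."""
    return c * 1000


n_car = 224_000_000
# k2 term -> (n2, n12), as reported in the post above
pairs = {
    "auto": (124_000_000, 13_000_000),
    "automobile": (50_400_000, 10_500_000),
}

for k2, (n2, n12) in pairs.items():
    c = c_index(n_car, n2, n12)
    print(f"car / {k2}: c = {c:.4f} ({as_ppt(c):.1f} ppt)")
```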


Case 2: Single terms (query refinement with similar concepts)

k1 = car: n1 = 224,000,000
k2 = insurance: n2 = 111,000,000
k1 and k2 (car insurance): n12 = 9,000,000
c = 0.0276 or 27.6 ppt

k1 = auto: n1 = 124,000,000
k2 = insurance: n2 = 111,000,000
k1 and k2 (auto insurance): n12 = 8,660,000
c = 0.0383 or 38.3 ppt

Results: In Google, auto insurance has a greater c-index than car insurance, thus having a greater semantic association (semantic connectivity).

A Final Note.

If we double quote the k12's, the c-indices will change, since quoted k12 results are a subset of unquoted k12 results. For example, in the above cases we obtain

“car insurance” = 4,810,000 with c = 0.0146 or 14.6 ppt
“auto insurance” = 4,460,000 with c = 0.0193 or 19.3 ppt

Yet in Google the results still indicate that "auto insurance" appears to be more connected than "car insurance".
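
To see the effect of quoting, a small Python sketch can compare the unquoted and quoted composite counts from the figures above. Only n12 changes between the two runs; n1 and n2 stay as reported for the single terms. The function and variable names are assumptions for illustration.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


n_insurance = 111_000_000
# k1 -> (n1, unquoted n12, quoted n12), taken from the figures above
cases = {
    "car": (224_000_000, 9_000_000, 4_810_000),
    "auto": (124_000_000, 8_660_000, 4_460_000),
}

for k1, (n1, n12_unquoted, n12_quoted) in cases.items():
    c_unquoted = c_index(n1, n_insurance, n12_unquoted)
    c_quoted = c_index(n1, n_insurance, n12_quoted)
    # A smaller n12 shrinks the numerator and enlarges the denominator,
    # so the quoted c-index comes out lower than the unquoted one.
    print(f"{k1} insurance: unquoted c = {c_unquoted:.4f}, quoted c = {c_quoted:.4f}")
```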


About Language, Geolocation and Demographic Characteristics

In Mexico and Puerto Rico, car translates as auto, which is also a stem of other terms and derivatives. The popular term for car, however, is not auto but coche in Mexico and carro in Puerto Rico. Thus interpretations based on geolocation and demographic data are better confirmed with c-indices extracted from regional directories.

For a review of c-indices, read Baeza-Yates and Ribeiro-Neto's "Modern Information Retrieval" (Addison-Wesley, 1999, Chapters 2 and 5). c-index analyzers are excellent analytical tools for doing semantic connectivity analysis and for targeting keywords. They are also easy to build. I have written several applications.


Orion

Anthony Parsons
06-03-2004, 08:12 PM
No wonder it takes me so long to do my keyword research. I do some of this now in great detail, but not with this exact science. I think this would potentially have the greatest benefit in my first example, the web design field, which is the most competitive.

Perfecting this area on something so highly competitive could make an extraordinary difference in the way your rankings are achieved. Interesting.

What's even funnier, Orion, is that I understood everything you just said...still. I think that the first few clarifications helped me a lot actually. My understanding is good to continue. I liked those examples you have used. That helped heaps.

It really can get to quite an exact science when you get down to the nitty gritty of it. Interesting to know this and much appreciated.

orion
06-03-2004, 09:35 PM
>Perfecting this area on something so highly competitive could make an extraordinary difference in the way your rankings are achieved. Interesting.

>It really can get to quite an exact science when you get down to the nitty gritty of it. Interesting to know this and much appreciated.

First, I would like to thank SearchEngineWatch.com for giving me the opportunity of introducing AI applied to IR to the dedicated members of the SEO/SEM community like you, Tony.

Second, please excuse me in advance if I make too many typos.

Tony, you are right. SEOs may need to start tapping into c-indices right away. I'm all against rumors and 2nd-guessing arguments disguised as "seo expert advice". There is no need for 2nd-guessing search engines or using trial-and-error approaches when there are many analytical tools out there, kept in a kind of black box by IR scientists. Many of these tools predate the Internet and search engines and are well known to IR grad students and search engine engineers. SEOs can benefit from them.

Now about c-indices for semantic connectivity research.

The technique is extremely simple and elegant. However, in my years of using c-indices and similar tools for systematically optimizing document relevancy I have learned that the main risk (or drawback?) lies in interpreting them. But once one understands the nitty gritty, it is just a matter of

(a) computing values for candidate keywords using sound critical thinking
(b) clearly identifying what the client(s) want to target and emphasize (i.e. sell or offer as products or services) and then doing some number crunching.

Over the weekend I will discuss the effect of keyword transpositions on semantic connectivity for a given database (search engine) and why this is important when crafting textual information and doing keyword targeting (e.g., titles, descriptions, meta tags). An introduction to combinatorial theory will be given. Once we understand the basics we can elaborate on complex cases (e.g., c-indices for more than 2 keywords).


Orion

Anthony Parsons
06-04-2004, 12:57 AM
Mate, I am looking forward to it. Thanks for the learning curve thus far. I love learning new things.....

cariboo
06-04-2004, 03:54 AM
Thanks a lot for this thread Orion...

I'm trying to build a "smart" search engine for a complex database for one of my sites, and I'm working with semantics to improve relevancy. I'm very interested in what you said because I use a technique similar to the one you described above.

I will have many questions and problems to submit here, because I'm just a beginner in this science.

I agree with you when you say SEO's can really improve their results by using a "scientific" approach like this. Many SEO's work like "craftsmen" and use only their "know how". But optimization has nothing to do with magic, and a scientific approach will not do any harm to their clients... ;)

strategicrankings
06-04-2004, 09:36 AM
Nice to read threads like these. Thanks Orion, I will make it a must to follow this thread.

Riley

rankforsales
06-04-2004, 01:39 PM
This is all very interesting, but personally, before I jump into any of this, I need more proof and more evidence that this may improve rankings in any way.

Also, what's Google's view on this? Are there any Google members in this great new forum that may wish to talk a bit more about this, without giving away the 'Google secret recipe'? ;-) (evil grin)

Serge Thibodeau,
Professional SEO,
Rank for $ales

Opie1Canopie
06-04-2004, 02:29 PM
Orion - this is all very fascinating, although my head spins thinking about how I would actually find enough time to do this level of research for my keyword lists.

You said you are working on a search engine - how about building us a program that does this analysis? :) I know I'd be willing to fork over some money if this made my keyword picks more solid.

Opie1Canopie

NFFC
06-04-2004, 02:48 PM
Orion, nice post but as an informal scientist [very informal some would say] I would like to point out some assumptions you have made that IMHO are not entirely accurate.

>My goal is that SEO/SEM R&D departments once and for all start using scientific tools rather than mere rumors and 2nd-guessing

Trust me, this is a huge area of activity for SEO's. Many IR/AI "experts" already earn a considerable % of their income from servicing the needs and goals of SEO's. Just because you don't see it posted on a forum don't assume it doesn't exist.

>Understanding how one can measure term co-occurrence could be used to carefully select keywords semantically connected in a given search engine database

As rankforsales points out, it only matters if any of the major engines use it. Any evidence that any do apart from crude stemming?

I think the major false assumption you have made is that SEO is a branch of science, I consider it to be an art. For sure certain tools can be used to help focus the artist on his craft but ultimately it comes down to an individual making decisions, very often based on just a "feel". I see SEO's as closely equated to record producers. You can have all the technology in the world but the difference between a hit record and a stinker comes down to an individual sitting in front of the mixing desk and moving those sliders until it is "just right". You can't do that with an oscilloscope, you need an "ear".

>I'll try to keep my posts very simple and basic.

hehe

orion
06-04-2004, 09:48 PM
What we have presented so far.

1. Term co-occurrence is a measure of semantic connectivity and easy to compute.
2. Its geometrical nature can be established through Fuzzy Set Theory.
3. A mathematical treatment is provided in "Modern Information Retrieval" (see book).

What we have explained or pointed out so far.

1. Term co-occurrence is a tool for building libraries of synonyms, find-similar, query expansion and clustering algorithms.
2. For dictionary-based terms, c-indices are a measure of synonymity.
3. For routine queries, co-occurrence is a measure of how topically connected terms are in a given database.

What we have explained or mentioned so far.

1. How to calculate c-indices for the particular case of c12; i.e., two terms, only.
2. c-indices can be calculated for IR systems as well as for commercial search engines.
3. c-indices often are different from engine to engine.
4. c-indices can be time-dependent.

What we haven't explained, yet.

1. Time-dependent semantic connectivity (query relevance dynamics).
2. Almost everything else.
3. Special cases in which c-indices are not enough, requiring other tools for analysis.


Assumptions made.

1. The n's are treated as number of retrieved results.
2. The database responds to pattern matching of regular expressions.
3. For now, that c-indices are time-invariant.

Assumptions, statements, or claims I haven't made in this forum.

1. That SEO is a science.
2. That I work for a search engine company (Not even close).
3. That I am building a search engine (However I have constructed and use an IR system reserved for research and intelligence and for remote searching databases).
4. That I am the "owner" of the truth or cannot make rational mistakes while presenting my thoughts (indeed I make too many typos).

Clarifications I have made and further refinements

1. An IR system and a commercial search engine are different settings.
2. Many analytical tools used in IR can be used to optimize information contained in commercial online documents (e.g., Web pages)
3. n's are number of results containing queried terms.

Let me expand on point 3.

In a strict IR sense, the n's (above) are number of results containing the queried term(s).

Let

n(i,db) = # retrieved results containing the queried term i in a given database db (Google, Yahoo, MSN, etc)
n(0,db) = # retrieved results not containing the queried term, i.
n(it,db) = # of total results retrieved by querying i in a given db.

n(it, db) = n(i, db) + n(0, db)

Assuming good IR performance in recall and precision (see reference textbook) and strict adherence to pattern matching of regular expressions, n(0, db) should be negligible; thus the main assumption is that

n(it, db) = n(i, db).

As previously mentioned, over the weekend I will elaborate a bit on other cases. I didn't plan to post today, since I had set the day aside for ongoing research and meetings; however, recent posts convinced me to do so.

Talking about recent posters.

Welcome to this thread, Cariboo, Riley, Serge, Opie, CounterPoint. It is an honor to have you all here, interested in the discussion and eager to share thoughts and ideas.


Cariboo - I agree 100% with your posts, especially the part that says "optimization has nothing to do with magic". Well put.

Riley - "Nice to read threads like these." Thanks and again, welcome Riley. We need more threads dedicated exclusively to scientific issues. SearchEngineWatch.com editors must be praised for such a great decision. More, more, more threads are needed.

Serge - "Are there any Google members in this great new forum ...?" I don't know.

To Opie:

1. "You said you are working on a search engine". I never said that. But now that you mention it, I already have an IR engine (see previous lines, above).
2. "how about building us a program that does this analysis?" I already have done that.
3. "I'd be willing to fork over some money if this made my keyword picks more solid." Now we are talking business. I'm listening and eager to team with anyone interested in taking optimization to the next level. You're in. Any suggestions?

To NFFC

1. "Many IR/AI "experts" already earn a considerable % of their income from servicing the needs and goals of SEO's." Very true and well put, NFFC.
2. "Just because you don't see it posted on a forum don't assume it doesn't exist." I haven't assumed that anywhere in the forum; not even close. Still, I agree with you that some IR/AI "experts" are in the business.
3. "As rankforsales points out, it only matters if any of the major engines use it." I agree with you both. The main thesis of this thread is the presentation of analytical tools well known to IR researchers and how these tools can be used to eliminate or at least minimize trial-and-error and 2nd-guessing approaches. As mentioned to Tony (see previous posts), SEOs should stick to what works well for their clients, and that includes good copywriting styles and any proven SEO technique.
4. "Any evidence that any do apart from crude stemming?" Stemming issues will soon be addressed. Let's take some baby steps and concentrate on the basics first. Not everyone is at the same pace in this forum.
5. "I think the major false assumption you have made is that SEO is a branch of science,..." I never said anywhere in this thread that SEO is a branch of science (not even close).
6. "...I consider it to be an art." I agree with you 100%, NFFC. SEO is an art. In fact, optimization is the art of finding a happy medium or, as many technical dictionaries say, "optimization consists in finding the best possible solution to a problem within a feasible region".


My final thoughts.

This section of the forum, I think, has been reserved for heuristics and search science, as determined by the editorial staff of SearchEngineWatch.com and the Editor of this forum.

Accordingly, I feel posts should reflect the spirit and intention of the editors until they decide to change the "rules of engagement" for this thread and the forum in general.

I'm not a writer but a scientist (LOL to myself, as if anybody cares!). I will keep elaborating on the main thesis of this thread and try to be as polite as possible and explain things to the best of my knowledge. I know that in order to write as articulately as many of you or the editors I'll need editing help. A lot.


Orion

Dodger
06-05-2004, 06:14 AM
Orion, thank you for this thread it is quite interesting and I for one am following it with interest. Any knowledge concerning the inner workings of a search engine is worth listening to, no matter how minor it may play in the grander scheme of things -- but it goes a long way in a better understanding of the beast as a whole with every little bit that gets stored away in the back of your mind.

Search engines are large databases filled with records, and those records are all accessed quicker than the blink of an eye. My understanding of quick access is the use of index servers, whose sole purpose is the storage of a key value and an index pointing to one record in the database.

Your c-index deals directly with this aspect of database indexing. It is a basic building block. I have noticed some here who are building their own search engine, and this type of information will be invaluable to them -- that I have no doubt.

The name of this forum is Search Engine Watch. That is a broad term, and it is not exclusive to just "achieving great rankings" and I did not take your post to openly state that it will. It is an interesting topic, and I am looking forward to more of the same.

DanThies
06-05-2004, 12:43 PM
Well, I thought that I could safely ignore these new forums for a while... not so. Thanks Orion for the interesting posts.

We've been working our way down a similar road with respect to relevance. It's sort of assumed that folks like Applied Semantics etc. are already using techniques like this for a number of purposes.

Our main effort to apply this particular branch of IR theory has been attempting to develop a click-through model for organic search listings. The relative frequency with which search terms appear on the web gives us a sense of whether the specific search term is broad or narrow.

One thing that occurs to me, in attempting to use Google (or any other search engine), is that we don't know how much co-occurrence is influenced by SEO, and how much is really due to natural patterns.

For example, someone might decide that "car insurance" is a really important search term and construct 10,000 doorway pages in an attempt to influence the search results. Whether or not they succeed in influencing rankings at the top of the search results, they can certainly influence the perceived semantic relationships.
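
As a rough, hedged illustration of this concern, the Python sketch below adds a batch of hypothetical pages that contain both terms and recomputes the c-index. It assumes every added page is indexed and contains both k1 and k2, so n1, n2 and n12 all grow by the same amount. The baseline counts are the unquoted car and insurance figures reported earlier in the thread, and the function names are made up for this example.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


def c_after_added_pages(n1, n2, n12, added):
    """c-index after `added` new indexed pages that contain both terms.

    Assumption: each added page counts toward n1, n2 and n12 alike.
    """
    return c_index(n1 + added, n2 + added, n12 + added)


# Baseline: car / insurance counts quoted earlier in the thread.
n1, n2, n12 = 224_000_000, 111_000_000, 9_000_000

baseline = c_index(n1, n2, n12)
for added in (10_000, 1_000_000):
    shifted = c_after_added_pages(n1, n2, n12, added)
    print(f"+{added:>9,} pages: c goes from {baseline:.5f} to {shifted:.5f}")
```

Under these assumptions the size of the shift depends on how large the added batch is relative to the existing counts, so the same number of pages would weigh far more on a rare term pair than on a common one.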

It doesn't make much difference in terms of searching within the database, but you also have a certain amount of skew in language, just based on what sorts of information are being put onto the web vs. what might occur in other forms of communication. Certain topics may lend themselves to larger numbers of documents, or may be more likely to have information published online.

orion
06-05-2004, 09:51 PM
Welcome to this thread, Dodger and DanThies. It is an honor to have you here, eager to share ideas. From your well-articulated posts I can tell we are on the same wavelength and very interested in delving into the subject. Excellent.

To Dodger:

1. "My understanding of quick access is the use of index servers, whose sole purpose is the storage of a key value and an index pointing to one record in the database." Well put. You're right on the money.

2. "The name of this forum is Search Engine Watch. That is a broad term, and it is not exclusive to just "achieving great rankings" and I did not take your post to openly state that it will." Excellent point, and I agree. We need more threads exclusively for IR/AI. I'm happy that SearchEngineWatch.com has set aside a thread for discussing heuristic issues and search technology topics rather than mere routine marketing issues (still, overlaps cannot be avoided).


To DanThies:

1. "We've been working our way down a similar road with respect to relevance....The relative frequency with which search terms appear on the web gives us a sense of whether the specific search term is broad or narrow." Excellent! However, we haven't added frequency to the mix yet. Frequency and frequency of co-occurrence are better addressed with Association Clusters. On the other hand, localized co-occurrence (not whether co-occurrence takes place but where co-occurrence incidents actually take place in a document) is better addressed with Metric Clusters. Finally, synonymity associations between local stems (or terms) in a document can be explained by considering the neighborhood (contextual information) and so-called Scalar Clusters (Baeza-Yates & Ribeiro-Neto, Chapter 5, pages 125-127). Let's take baby steps; we will get there. And I still need to explain what c-indices are and aren't tools for.

2. "One thing that occurs to me, in attempting to use Google (or any other search engine), is that we don't know how much co-occurrence is influenced by SEO, and how much is really due to natural patterns. For example, someone might decide that "car insurance" is a really important search term and construct 10,000 doorway pages in an attempt to influence the search results. Whether or not they succeed in influencing rankings at the top of the search results, they can certainly influence the perceived semantic relationships." Very true, provided that the 10,000 doorway pages are indexed by Google or the intended engine. The problems you raise can be addressed with other clustering algorithms, as mentioned briefly in 1, above.

3. "Certain topics may lend themselves to larger numbers of documents, or may be more likely to have information published online." I agree. Still c-indices can be used to construct correlation matrices as exploratory tools for measuring the degree of "membership" of documents with respect to a keyword, i; see Baeza-Yates & Ribeiro-Neto, Chapter 2, page 36. I'll elaborate on cases where c-indices are not enough.
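
A keyword-to-keyword correlation matrix of this kind can be sketched in Python as follows. The single-term and pair counts are the unquoted Google figures reported earlier in this thread for car, auto and insurance; the function name `correlation_matrix` and the data layout are assumptions for illustration, not the construction given in the reference text.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


def correlation_matrix(single, pair):
    """Build a symmetric keyword-keyword matrix of c-indices.

    single: dict term -> result count for that term alone.
    pair:   dict frozenset({a, b}) -> result count for both terms.
    Returns (terms, matrix) where matrix[i][j] is the c-index of
    terms[i] and terms[j]; the diagonal is 1.0 by definition.
    """
    terms = sorted(single)
    matrix = []
    for a in terms:
        row = []
        for b in terms:
            if a == b:
                row.append(1.0)
            else:
                row.append(c_index(single[a], single[b], pair[frozenset((a, b))]))
        matrix.append(row)
    return terms, matrix


# Counts reported earlier in the thread (unquoted Google queries).
single = {"car": 224_000_000, "auto": 124_000_000, "insurance": 111_000_000}
pair = {
    frozenset(("car", "auto")): 13_000_000,
    frozenset(("car", "insurance")): 9_000_000,
    frozenset(("auto", "insurance")): 8_660_000,
}

terms, matrix = correlation_matrix(single, pair)
print(" " * 10 + "".join(f"{t:>11}" for t in terms))
for term, row in zip(terms, matrix):
    print(f"{term:>10}" + "".join(f"{value:>11.4f}" for value in row))
```

Keying the pair counts by `frozenset` keeps the lookup independent of term order, which matches the symmetry of the index for untransposed queries.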

NOTE TO THE EDITORS. So far this thread is in the General category. My perception from these and previous posts is that this thread has struck a chord, since

1. It was picked by SEW as a featured thread.
2. Some of the thread's posters are inclined toward search engine engineering or are already involved in IR/search engine projects.
3. There are dedicated SEOs/SEMs out there hungry for a formal category in which IR/AI topics can be discussed.
4. Some posters are more advanced than others, the more advanced being in a "hurry up with the discussion" mood, which actually is a very positive attitude.

Thus, what are the chances, if any, of having a playground category for this hungry audience?

Orion

Dodger
06-06-2004, 10:36 AM
Welcome to SEW Dan.

Here is an article that appeared at ResourceShelf where they interviewed Gary Flake at Yahoo Labs R & D that might interest everyone if they have not seen it yet. In it, Flake gets into a little bit about how they use unstructured databases (if you can believe that) and "implied data" on a web page.

Interview Part I (http://www.resourceshelf.com/archives/2004_06_01_resourceshelf_archive.html#108619373518 642360) and Part II (http://www.resourceshelf.com/archives/2004_05_30_resourceshelfextra_archive.html/#108619460453157321)

Dodger
06-06-2004, 10:45 AM
>1. It was picked by SEW as a featured thread.

I don't think it was picked manually ... it was rated by the thread followers who deemed it of value and automatically included in the Featured area of the board.

Of course, I could be wrong on that.

DanThies
06-06-2004, 01:02 PM
Thanks Dodger. For the welcome, and the pointer to an interview that I probably would have missed. This past week has been crazy - we launched a new version of our site, and things have been hectic on the home front.

I hope this thread is an indication of how the SEW forums will be different. The world really doesn't need another forum for META tag talk, etc. Search Engine Watch has always been about more than search engine optimization/marketing.

Orion:
This thread has been quite popular, largely because of the number of outside sources (webProNews, etc.) linking to it. Keep going, we're "all ears." :)

chris
06-06-2004, 03:04 PM
It's interesting and I can see a few applications of it for those writing search engines/data mining tools (in fact it's prompted me about how to do something). However, I'm left with an "um...so what" feeling when it comes to SEO.

1. It seems that we are walking round the block to get to the house next door. Whilst the explanation is impressive, the title seems to say as much. i.e. are we not just basically saying that words that tend to appear together more often on pages are more likely to be semantically connected than words that don't?

2. Assume that I'm Mr Average. I've researched a topic by reading other pages about it (as most SEOs do) and I pick a variety of related terms for my page. Being Mr Average, is it not probable that I would pick those that are most semantically connected? Presumably if co-occurrence is a factor that is in proportion to semantic connectivity then that must hold true?

3. If I have a natural tendency to know which keywords to pick then why calculate them unless this is a very large factor in search engine rankings (in which case we would already know about it).

4. How would you propose this aligns with keyword search frequency?

Finally, it seems a bit like a tale that's forgotten its ending. As others have said, it's all well and good, but you talk about SEO and it increasing ranking without proving or demonstrating the ranking benefits. Sorry to be harsh but at the moment it ranks as a "mere rumor" and a "2nd-guessing thought" :) Could you fill in the gaps (perhaps chasms is a better choice of word)? A start like that without an ending is a bit like a post like this that suddenly

orion
06-06-2004, 05:14 PM
Welcome Chris to this thread.

To Dodger:

Yes. You are probably right.

To Dan:

Thank you very much. I'm happy you are following the thread.

To Chris:

You're damn right in your exposition, Chris! I hope this post helps a bit. I'll discuss frequency soon. The "rumors" part is not so. References and IR literature are available elsewhere. It just happens that most of it has been written in "rocket scientist" style, from term vector theory to the "tools of the trade" I'm trying to present to SEOs.

Now, talking about this post.

This post is organized as follows. I will elaborate on

1. c-indices and how we can/cannot use them.
2. Term Transpositions.

I will then present a real case and experiment, which others can test.

ASSUMPTIONS AND CONVENTIONS

1. For now, we stick to the special case of paired terms (and the k's are not stopwords).
2. We are assuming we are dealing with an IR system responding to regular expressions.
3. We are not yet discussing users' query behavior (how users query a db).

In point 3, we are simply using specialized query conditions to extract correlation values from dbs, which can then be used to improve the semantics of documents to be indexed by the queried db. Thus, understanding how IR systems and commercial dbs "rationalize" semantics and use correlation indices won't hurt SEOs.


WHAT IS/IS NOT A C-INDEX?

As mentioned before, term co-occurrence measures can be used to construct a thesaurus or a library of find-similars, to do query expansion, and with clustering algorithms.

First, what is...

1. c is a correlation index of term co-occurrence in the queried db.
2. c is a measure of query correlation between k1 and k2.
3. c is a measure of semantic connectivity when the k's are synonyms, similarly phrased concepts or involve query expansion (query refinement).

Second, what is not...

1. in a mathematical sense, "evidence of" or "correlation" is not "proof of"; thus, while co-occurrence may suggest "evidence of" or "the presence of" semantic connectivity, it is not a proof of semantic connectivity.
2. crude paired c-indices tell nothing about WHERE the terms co-occur in a document, how far apart the terms are from each other, and how many instances of co-occurrence (frequency of co-occurrence) are taking place in a given document.

Point 2 is obvious, since one merely counts search results without considering the actual structure of the information contained in the results. To measure all that, we need to

(a) consider the actual content of documents
(b) compute a new kind of correlation index, measure inverse correlation distances and discuss Metric Clusters. Baby steps first. We will get there, if SEW allows it.


FURTHER REFINEMENTS OF N12: USING THE RIGHT IR CONDITIONS TO EXTRACT THE RIGHT C-INDICES.

Let n1 and n2 be the number of search results containing the query k1 and k2, respectively and n12 be the number of search results containing both terms and obtained by querying k12. (see original post, #1).

For now, let us assume that k1 and k2 are each single terms (not phrases). If we want to use phrases for either k1 or k2, then the k's must be quoted. How about k12?

In order to present the basic ideas of c-index calculations, I used very primitive (or noisy) conditions in the introductory posts. Now it is time to make some refinements.

1. When querying an IR system for k12, we need to use a "FIND ALL" terms mode (not a "FIND ANY" mode) to ensure that both terms are found in the n12 results. What if the system still returns some results with only one k, which then causes false c-indices? If we are dealing with a validated IR system, chances are this will not occur. But what if it does occur?

2. To ensure that the IR system returns documents containing co-occurrence of terms, one should query in an "EXACT" mode by using a boolean operator or using quotes ("") with the k12 query. This, however, forces the IR system to return results with regard for sequence. Results with regard for sequence may include

(a) strictly, a phrase consisting of k1 and k2.
(b) k1 and k2 separated by delimiters that are ignored by the queried system (often the case with commercial search engines and IR systems pre-programmed in that way).

Thus, the n12 results will consist of documents containing k1 followed by k2; the distance d between k1 and k2 is considered to be zero. Consequently, n12 should be a subset of the results in which term co-occurrence takes place without regard for sequence (which also include k1 and k2 co-occurring in a document but in different sections of it; i.e., distance d is not zero).
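
A short note on what the subset statement implies for the indices. This is only a sketch, assuming an idealized system that reports exact counts (commercial engines typically report estimates, so it may not hold in practice). The labels n_exact and n_all are introduced here just for this note: they stand for the joint counts from the quoted (EXACT) query and from the unquoted (FIND ALL) query.

```latex
n_{\text{exact}} \le n_{\text{all}} \le \min(n_1, n_2)
\quad\Longrightarrow\quad
c_{\text{exact}} = \frac{n_{\text{exact}}}{n_1 + n_2 - n_{\text{exact}}}
\;\le\;
\frac{n_{\text{all}}}{n_1 + n_2 - n_{\text{all}}} = c_{\text{all}} \;\le\; 1
```

The middle inequality holds because, for fixed n1 and n2, the ratio n/(n1 + n2 - n) grows as n grows. So under these assumptions a quoted c-index should not exceed its unquoted counterpart.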


CO-OCCURRENCE AND TRANSPOSITIONS

For a subset in which d = 0 (and with k12 double quoted) we can define a c-index

c12 = n12/(n1 + n2 - n12)

Similarly, for the very same k1 and k2 co-occurring in the same IR system but transposed (k21, i.e. k2 followed by k1), we obtain

c21 = n21/(n1 + n2 - n21)

Consequently

1. semantic connectivity can be improved if documents respond to more than one c-index.
2. c-indices can provide important information with regard to the optimum sequence of k's in a db.

Points 1 and 2 help us to avoid keyword spamming. Also, one can test many candidate keywords and sequences until obtaining the optimum ones by carefully inspecting the associated c-indices.

A similar treatment can be applied to the "unquoted" scenario in which web users behave. Still, cases arise in which sequencing may not help at all (e.g. wrong selection of keywords, keyword popularity, proper style, limited dbs, etc.).
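
For readers who prefer to see the arithmetic, here is a minimal Python sketch of the c12 and c21 formulas above. The function name c_index and all result counts are hypothetical, made up for illustration only; they are not measurements from any search engine.

```python
def c_index(n1: int, n2: int, n_joint: int) -> float:
    """Correlation index c = n_joint / (n1 + n2 - n_joint)."""
    if min(n1, n2, n_joint) < 0:
        raise ValueError("result counts cannot be negative")
    if n_joint > min(n1, n2):
        raise ValueError("joint count cannot exceed a single-term count")
    union = n1 + n2 - n_joint
    # No results at all: treat the correlation as zero.
    return n_joint / union if union > 0 else 0.0


# Hypothetical result counts (assumed values, not real data).
n1, n2 = 1000, 800    # results for k1 alone and for k2 alone
n12, n21 = 200, 50    # results for the quoted "k1 k2" and "k2 k1"

c12 = c_index(n1, n2, n12)
c21 = c_index(n1, n2, n21)
print(f"c12 = {c12:.4f}, c21 = {c21:.4f}")
```

With these invented counts the "k1 k2" sequence has the larger index, which is the kind of comparison described above. Real engines report estimated counts, so the consistency check inside c_index may reject some real-world inputs.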

A REAL CASE AND EXPERIMENT

Try this simple test.

Using double quotes for k12 and k21, do a search in Google (or your favorite IR or engine) for "car insurance" and "insurance car". Then do a search for "auto insurance" and "insurance auto". For the documents ranking high, check

1. how ranking results change.
2. how the IR or engine ignores delimiters to respond to term sequencing.
3. the visible information displayed by the engine or IR in the entries reserved for titles and descriptions --compare this information for the top and poorly ranked results.
4. don't forget to calculate c-indices, accordingly and compare them.
5. Repeat 1-4 without double quoting k12 or k21.

Can you

1. tell commonalities of documents ranking high?
2. see how sequencing is interpreted or delimiters are ignored?
3. check d values?
4. check for evidence of why your current page (or client's page) cannot rank high and how you can improve its semantic connectivity?


EXPERIMENT

Now try the following General Procedure

1. Compute c-indices for a pool of candidate terms for your page or client's page. Use the same k1 and different k2's.
2. Identify the best paired terms that produce the highest c-index, c12.
3. For the best paired terms, test its transpose index, c21.
4. From the c-indices, identify which sequence is the optimum.
5. Repeat experiment without double quoting k12 or k21.

Check against whom you are going to compete for high rankings for the intended keywords. Emphasize keywords and sequence, accordingly.

Note. In controlled experiments, I usually test candidate terms from a pool of terms extracted from a thesaurus. Once steps 1-5 are completed, I proceed to place k12 and k21 in a separate library. The selected keywords can then be used for query expansion in other cases and/or for clustering tests.
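
Steps 1-4 of the General Procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the names c_index and rank_candidates are hypothetical, the candidate terms are placeholders, and every count is invented rather than taken from a real engine. The counts would have to be collected by hand (or by whatever means the engine permits) before running it.

```python
def c_index(n1, n2, n_joint):
    """Correlation index c = n_joint / (n1 + n2 - n_joint)."""
    union = n1 + n2 - n_joint
    return n_joint / union if union > 0 else 0.0


def rank_candidates(n1, candidates):
    """Rank candidate k2 terms against a fixed k1.

    candidates maps each k2 to a tuple (n2, n12, n21), where n12 and n21
    are the counts for the quoted "k1 k2" and "k2 k1" queries.
    Returns rows (k2, c12, c21, better_sequence) sorted by c12, highest first.
    """
    rows = []
    for k2, (n2, n12, n21) in candidates.items():
        c12 = c_index(n1, n2, n12)   # step 1: index for each pair
        c21 = c_index(n1, n2, n21)   # step 3: transpose index
        better = "k1 k2" if c12 >= c21 else "k2 k1"   # step 4
        rows.append((k2, c12, c21, better))
    rows.sort(key=lambda row: row[1], reverse=True)   # step 2: best c12 first
    return rows


# Hypothetical pool: same k1, different k2's (all counts are made up).
n1 = 1000
candidates = {
    "alpha": (800, 200, 50),
    "beta": (500, 20, 60),
    "gamma": (2000, 300, 10),
}

ranked = rank_candidates(n1, candidates)
for k2, c12, c21, better in ranked:
    print(f"{k2:6s} c12={c12:.4f} c21={c21:.4f} better sequence: {better}")
```

Step 5 (the unquoted case) would reuse the same function with counts from unquoted queries. The sketch computes the transpose index for every candidate rather than only the best pair, which is a convenience choice, not part of the procedure as stated.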


SOME NOTES WORTH POINTING OUT ABOUT QUERY TRANSPOSITION RESULTS.

If one queries an IR system in "FIND ANY" mode (no double quotes) and if the system follows rigorous pattern matching rules, then any difference in search results (and c-indices) should not be the result of transposition, since "finding this k OR that k" implies without regard for sequence. Any difference may be due to

1. changes taking place in the db at the time of testing (purging/new results indexed)
2. changes taking place in the algos at the time of testing.
3. faulty IR responses
4. a combination of the above?

Under controlled conditions, if 3 is the sole reason, then the IR system may need some validation tests for the "FIND ANY" mode. Thus, this simple test can be used for quality control purposes.
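
The quality control idea above could be sketched like this in Python. The function names and the 1% tolerance are assumptions chosen for illustration; the counts are invented. A nonzero tolerance is used only because reported counts may fluctuate between two queries for the reasons listed in points 1 and 2.

```python
def transposition_drift(n_forward: int, n_reverse: int) -> float:
    """Relative difference between the unquoted counts for 'k1 k2' and 'k2 k1'."""
    largest = max(n_forward, n_reverse)
    return abs(n_forward - n_reverse) / largest if largest else 0.0


def is_order_insensitive(n_forward: int, n_reverse: int,
                         tolerance: float = 0.01) -> bool:
    """True if the two counts agree within the (assumed) tolerance."""
    return transposition_drift(n_forward, n_reverse) <= tolerance


# Hypothetical unquoted result counts for a pair of terms and its transpose.
print(is_order_insensitive(1000, 1000))   # identical counts
print(is_order_insensitive(1000, 700))    # large gap, worth investigating
```

A failed check does not say which of the four causes is responsible; repeating the two queries back to back under controlled conditions is what narrows it down.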


This is it for now. Comments?

WHAT'S AHEAD

1. The k1, k2, k3 case.
2. correlation indices involving stems
3. frequency and co-occurrence topics.

Final Thoughts

Co-occurrence is a concept that involves the notion of association. Yet co-occurrence is not a mathematical proof of association or connectivity. It so happens that for topically connected terms, co-occurrence is evidence of semantic connectivity.

Although not intended for the following purpose, the idea of measuring co-occurrence using correlation indices can be extended to include Link-Link Co-occurrence and Email-Email co-occurrence in commercial dbs, i.e.

k1 = mydomain1.com and k2 = mydomain2.com
k1 = blabla_1@mydomain1.com and k2 = blabla_2@mydomain2.com

Yet, care must be taken when interpreting results. I am doing ongoing research in this area, which overlaps with Business Intelligence and marketing. Anyone interested in researching this area, sharing research, teaming or interested in helping me is very much welcome.


Orion

Dodger
06-06-2004, 07:59 PM
Orion:
This thread has been quite popular, largely because of the number of outside sources (webProNews, etc.) linking to it. Keep going, we're "all ears." :)

Yes indeed! It is a pretty good article too.

Link to WPN Article (http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20040604FormulaForKeywordConnectivity.html)

chris
06-06-2004, 08:46 PM
The "rumors" part is not so. References and IR literature are available elsewhere. It just happens that most of it has been written in "rocket scientist" style, from term vector theory to the "tools of the trade" I'm trying to present to SEOs.

Orion, that references in literature are available wasn't in question. The application in terms of a tool of SEO is. Let me put it this way, give me any random seven year old, any set of terms and I reckon I could fairly consistently get them to pick the best ones according to your formulas. That being because semantic understanding is inherent within us (by definition).

In terms of interest, this is great. In terms of information for application coders, this is great. In fact, if you put it all together in a paper sometime then I'd love to see a copy. But in terms of an SEO tool fundamental questions still remain: how is this of benefit? What advantage do these numbers have over our inherent understanding of semantics in terms of SEO? Or to put it another way - when, within the next few days, somebody produces a few tools on the basis of what you've said (call me psychic but I just know it's going to happen) then what is it that should have me rushing across to their site to use it?

DanThies
06-06-2004, 10:07 PM
Chris:

This isn't really about SEO so much. It's probably not going to immediately help you pick keywords, write copy, or build links. But it's interesting on its own, and some of us do have SEO-related purposes for wanting to understand this stuff better.

For example, one of our ongoing projects at SEO Research Labs is building a click-through model for organic listings on the different search engines. We have no shortage of "data" (hundreds of sites provide log files), and we've produced "information" on the average click-through rates for the 1-10 positions on Google, MSN, and Yahoo. We now use these averages to build traffic projections into our keyword research reports.

Averages are nice, to help project the overall outcome of a successful SEO campaign, but not so helpful when it comes to a specific search term. We know that there will be large variances in click-through rate between different search terms. By looking at the variances, we see one thing with our "human intelligence," which is that very specific searches have a different click-through profile than very broad searches. We know this pattern exists, but a useful model would have to express this mathematically.

An applied understanding of the methods that Orion is discussing here could lead us in a couple directions. First, the possibility of developing a mathematical model for how broad or specific a search term is. Second, the possibility of relating a given search term to another search term for which we have actual click-through data.

Whether either of these, or something we haven't yet thought of, will lead us to a better click-through model, I don't know. But the free lecture on IR is much appreciated in any case. We may well learn something that applies to other aspects of SEO/SEM, such as conversion rate projection, mapping search terms to buying modes, etc.

As competition for paid and organic listings increases, the decision to target a search term represents a larger and larger investment. Keyword selection drives everything else, from content development to linking strategy. Investing in any promising area of research to improve our metrics is time well spent, or at least it is for me.

Anthony Parsons
06-06-2004, 10:43 PM
I am very interested in learning the exact connectivity of terms and how they correlate through a mathematical equation. As an ex Industrial Electronics Specialist and Microprocessor control technician, I am interested to see how mathematically the semantics of two unique words can differ. Will this make an impact? I believe it will, especially when you're going after the high end of the competitive market.

The examples Orion has used "auto" "insurance" "auto insurance" vs. "insurance auto" could make that slight impact, hence make an SEO's life a little easier when writing and structuring a page to target the highly competitive market. Every little bit can count IMO and when you're talking that sort of competition and money, then why not.

I am extremely interested in this, Orion, and look forward to picking your and Dan's minds for keyword semantics. Thank you for the information and knowledge you are sharing.

orion
06-06-2004, 11:20 PM
To Chris:

1. "Orion, that references in literature are available wasn't in question." My mistake. I misunderstood the question.

2. "Let me put it this way, give me any random seven year old, any set of terms and I reckon I could fairly consistently get them to pick the best ones according to your formulas." Very true; Blindfold calculations of c-indices computed from randomly selected paired terms are meaningless.

You may want to revisit the following lines from my previous posts:

From post #26: "crude paired c-indices tell nothing about WHERE the terms co-occur in a document, how far apart the terms are from each other, and how many instances of co-occurrence (frequency of co-occurrence) are taking place in a given document.....To measure all that, we need to

(a) consider the actual content of documents
(b) compute a new kind of correlation indices, measure inverse correlation distances and discuss Metric Clusters."

From post #9: "Carefully crafted c-indices can work as semantic discriminants. Wrongly crafted c-indices can produce messy results."

3. "But in terms of an SEO tool fundamental questions still remain: how is this of benefit? What advantage do these numbers have over our inherent understanding of semantics in terms of SEO?" Chris, these are all very good questions, which I will try to address to the best of my knowledge. For one, one can use co-occurrence indices for data mining, building find-similars libraries and clustering algorithms, and building IR systems that respond to synonymity measures. Semantic connectivity of terms is just one of many things that influence relevance, yet semantic connectivity is not a definition of relevance, since one can blindly compute correlation indices and get no relevancy.

4. "Or to put it another way - when, within the next few days, somebody produces a few tools on the basis of what you've said (call me psychic but I just know it's going to happen) then what is it that should have me rushing across to their site to use it?" Sincerely, I hope this happens, and those tools are welcome. I have my own c-Index Calculator and other IR tools which are used for research.

To Dan:

Chris's concerns are valid issues I faced when I was first trying to rationalize how the phenomenon of co-occurrence may influence semantics. Chris's questions reminded me of what I wrote in the dedication of my doctoral thesis: "If I have a theory, but no experimental results, I may have nothing. And if I have a theory without practical applications, I may have an artifact."

About your project at SEO Research Labs. Your project is very appealing to me as it has legs. You are on the right track. And I can think of a couple of things in which some theories can be applied and tested. Tell me more.

To Tony:

1. "The examples Orion has used "auto" "insurance" "auto insurance" vs. "insurance auto" could make that slight impact, hence make an SEO's life a little easier when writing and structuring a page to target the highly competitive market. Every little bit can count IMO and when you're talking that sort of competition and money, then why not." Well put, Tony. BTW, have you tried the CASE presented in post #26? Have you noticed how "auto insurance" and its transpose "insurance auto" are almost subliminally present in many top results?

By subliminally I mean ending one sentence with one k and starting the next one with the other. Check how co-occurrence, as expressed by two c-indices (one and its transpose), is present in the top results. Test the transpose indices with unquoted queries as well. Finally, also check the paid results on the right-hand side of the page while performing the experiment. This may interest not just SEOs but SEMs as well, as Chris aptly stated.


On other matters. In the Link-Link and Email-Email co-occurrence lines (almost at the end of post #26) I forgot to mention that the queries are without double quotes. Try it with Google and other web properties, or with your clients' competition. This may have other applications in business intelligence; for instance, in cases where the standard "who links to whom" is not enough, since one may need to know "who is associated with whom", by either links or emails published out there.


Orion

orion
06-07-2004, 12:15 AM
For those of you interested in practical applications, here is an applied example of semantic connectivity in a Homeland Security project.

http://lsdis.cs.uga.edu/proj/SAI/SAI-NSF-Report-2003.htm

And here is its summary, which I am quoting:

"Project Summary

Role of information technology (IT) is recognized to be a critical component in the effort of improving national security, including homeland defense. Applications of importance to national security, such as aviation security, pose significant challenges to current information technology and provide excellent source for further research in developing next generation IT solutions. This project looks at the opportunity to discover complex relationships from of large amounts of heterogeneous data. To achieve this, it relies on the new capabilities in automatically extracting the semantic metadata (i.e., large scale semantic annotations) represented in RDF graphs, and defines meaningful complex relationships called semantic associations. It also looks at methods of computing semantic associations with the relevant issues of applying context and ranking the results."


I'll give new readers time to digest previous posts (and my typos). Then I will start with the scheduled subjects as stated in post #26.

Orion

DanThies
06-07-2004, 12:15 AM
Orion:

The click-through model is part of a larger project we call "Keyword Research 2.0," about which I will be publishing a series of articles over the next couple months.

The long-range goal is to put SEO/SEM on an equal footing with other means of marketing, by creating new methods and metrics that would, among other things, reduce the amount of uncertainty inherent in SEO. All of which ultimately revolves around keyword selection.

Among the questions that need better answers:

1. How to determine the popularity of different search terms, including how that varies over time, seasonally, etc. Because current data sources are limited, there is the potential to apply IR theory to assess the level of uncertainty/error in the available data.
2. How to project the likely click-through traffic from search engine listings, both paid and organic. As stated above, there are several possibilities for applying IR theory.
3. How to estimate the level of effort (cost) required to compete for organic listings, both in terms of attaining rankings, and maintaining them over time. Again, semantics can come into play, because there are possible synergies between search terms when it comes to link building.
4. How are different search terms related to different buying modes by the searcher? Another way of looking at this is, what kind of information/resource does the searcher want? Naturally, a little assist from the IR team could help us make headway here.

We've made good headway on a lot of these questions, but obviously this is a long term project. Since we can't do anything more about question #1 at this point, our major focus has been on questions 2 and 3.

With click-through, we have these data sources:
1. Log file feeds and "paid inclusion" traffic reports from a large number of independent web sites, which of course includes search engine referrals.
2. A process for matching up referrals with known rankings - this is part of the paid inclusion reports, and available through other methods when paid inclusion is not in use.
3. Search term popularity, from both Wordtracker and Google's Adwords. Adwords provides a more accurate value, obviously, but at a cost, so it's not available for all search terms.

So we have averages, but we need a more complete model. The information you've presented is interesting because it has potential applications in a lot of different areas.

AussieWebmaster
06-07-2004, 11:10 AM
I asked a senior engineer over at Google to have a look at this thread and see what he thought or what comments he could give. He came back with an interesting reply.

My apologies for not being more helpful
here; I'd like to respectfully decline to comment. Any comment that might
eventually find its way to a message board would only get me into trouble.
;-) This is GoogleGuy's area, so I leave that fully to him.

orion
06-07-2004, 05:50 PM
Welcome to this thread, Aussie. It is an honor to have you here.

The phenomenon of term co-occurrence and how it affects semantic connectivity (in the case of terms, stems or phrases with a degree of synonymity, association or "membership" to a topic) is very appealing to me. So, anyone interested in commenting on the subject is welcome in this thread--Google or GoogleGuy or anyone from any search engine.

The subject is not new at all, predates Google, and is relevant to all search engines and "smart" IR systems. It is a hot topic in IR circles and I can understand why someone from Google doesn't feel the need to comment on the subject. That's smart.

That the subject is not a trivial one in IR research circles is evident. A query in Google, MSN, Yahoo! or other search engines for "semantic connectivity" (with or without quotes) shows research work already done in the field. Here are a few reference links

1. http://web.mit.edu/jdevans/topic.html It discusses physical, schema and semantic connectivity.

2. http://www.uark.edu/misc/lampinen/mcevoy99.html Uses semantic connectivity to address memory effects and retrieval

3. http://www.technologyreview.com/articles/rnb_052104.asp For those interested in current technology, this is fairly recent news. It reports how researchers from Palo Alto Research Center developed ScentIndex and "have given ebooks a more comprehensive index and table of contents tool that combines keyword searching and concept searching". This has been done using semantic connectivity and term co-occurrence concepts.

In terms of work published in the field, there is no surprise here. What surprises me is that only a handful of SEOs are familiar with a subject that has been around for so long, while others appear to see it as a silver-bullet solution or instant gratification. The purpose of this thread is to make a contribution and try to change all that. It occurs to me that formal workshops on these and similar subjects could be incorporated in professional SEO/SEM conferences.

On a personal note, I feel the more SEOs know about IR topics, the better --and it won't hurt them. If there are analytical tools out there, why not use them? I am very happy that SearchEngineWatch Forum is not just another seo/marketing forum. We need more threads dedicated to AI/IR topics, I think.


Orion

AussieWebmaster
06-07-2004, 06:57 PM
The better educated we are, the more of a scientific approach we can take to all this. The days of flying by the seat of your pants, coming up with something a step or two in front, are long gone... people want solid sites with all the proper work done to help steadily improve their positions wherever possible.
I am reading the thread and realize I still have some reading ahead of me.

Dodger
06-07-2004, 07:46 PM
You got that right AW!

I was reading John Battelle's blog today and he was commenting (http://battellemedia.com/archives/000693.php) on a couple of quotes from Bill Gates, who was a speaker at the D Conference (http://d.wsj.com/) dinner last night.

In it he said that Gates blamed search's shortcomings on its keyword-based approach, and argued that natural language and contextual semantic approaches will be the next leap forward.

There is no better time than the present to learn about this stuff, in my opinion. A little knowledge today will go a long way tomorrow, because it is inevitable that the day will come. If Big Blue can beat a World Chess Master, then anything is possible.

AussieWebmaster
06-08-2004, 12:06 AM
But who becomes Big Blue and who is the Chess Master?
Is it Big Blue=Google algo and SEOs Chess Master or vice versa?

Dodger
06-08-2004, 12:33 AM
Wrong game Mate. While in some respects, Google is bigger than Big Blue ... I would not classify SEO's as Chessmasters. The game is more like checkers if you ask me. They jump Google, then Google jumps them.

The moves are simplistic in nature with a basic set of rules and no abstract thought involved.

Anthony Parsons
06-08-2004, 01:06 AM
I won't give specifics, though I am putting this info to the test now on a client's site in California. With over 45000 pages indexed, I want over 40000 #1 spots for them. The technique is proving itself well at present for the competitive terms. Obviously, get the competitive terms right, and they will pass through in some respects, helping the non-competitive phrases.

What I can say is that I have already made some slight changes in the way a few terms are fed to the engines, ie. specific order. The connectivity of the terms worked out to favour a reverse in some terms and when checked further, was right for 4 out of the 5 terms I changed vs. word tracker and SE results.

Keep teaching me, Orion, this is coming in very useful. Thank you very much for imparting this knowledge to us.

orion
06-08-2004, 10:01 AM
WHAT'S AHEAD

1. Brief review (very brief) of computing co-occurrence indices (c-indices) without regard for sequence (FIND ALL) and with regard for sequence (EXACT).
2. What is/is not an EXACT search. It all depends on which IR system (or search engine) you query.
3. Proper identification of terms with best connectivity. (No blindfold selection is allowed. c-indices are not silver-bullet or instant gratification tools.)
4. Combinatory Theory: Combinations, Permutations and co-occurrence.
5. The k1, k2, k3 case.

If space allows it, semantic connectivity of prefixes and stems. This part, I think, may be relevant to copy style.

I will try to post this at the end of the day today or early tomorrow. For now I need to get off the forum to attend to work, research meetings and life issues. Feel free to comment on previous posts or add to the mix.

Orion

nuclei
06-08-2004, 02:32 PM
While I agree that applied semantics is working toward this rather quickly, it has not shown itself to be of much use in the here and now. A few good links with the proper anchors will still weigh far more heavily than this will. Granted to cover all bases, it is probably a good idea to work it into your pages in some small way. But to go thru that much trouble as you have outlined above for negligible results, to me, is out of the question.

thememaster
06-08-2004, 05:34 PM
Hello I'm new to this forum and this thread.

This is extremely fascinating stuff! One thing I'd like to say is that, although I have not used exactly the type of research Orion had been discussing, I have used my own AI research and find that information implemented from the results of such "advanced" research can and do indeed translate directly into improved Search engine rankings - and yes that includes top ten rankings in Google on very competitive terms.

I have long thought that this type of knowledge needs to be introduced to the average SEO professional. This is the direction search engines have to be moving in after all. Thanks Orion for contributing your expertise to the SEO community.

AussieWebmaster
06-08-2004, 05:57 PM
While I agree that applied semantics is working toward this rather quickly, it has not shown itself to be of much use in the here and now. A few good links with the proper anchors will still weigh far more heavily than this will. Granted to cover all bases, it is probably a good idea to work it into your pages in some small way. But to go thru that much trouble as you have outlined above for negligible results, to me, is out of the question.
While I agree that there are more variables in action, this is definitely worth exploring and eventually adding to the regular process.

nuclei
06-08-2004, 06:07 PM
While I agree that there are more variables in action, this is definitely worth exploring and eventually adding to the regular process.

Yeah, a tool could be made easily enough using several of the online thesauruses to do it somewhat automatically. Then at least it wouldn't take nearly as much time per site. I am bored anyway, maybe I will take a stab at it later tonight.

Mikkel deMib Svendsen
06-08-2004, 07:33 PM
This is a very interesting thread, although I must admit some of it is too complicated for my simple mind to grasp. But it's always good to learn.

Although this may not have direct impact on how the average SEO-joe does SEO today, I do think an exciting technical search discussion like this belongs in these forums. Where else ? :)

thememaster
06-08-2004, 07:38 PM
Regarding tools: a few thoughts:

Every tool is based on an underlying theory:
Be careful that any tool you use/develop is based on sound theory. Keep in mind also that multiple theories can give very good explanations for the same phenomenon. None of us (who don't work for a search engine) knows for certain what's under the hood in the code/algorithm of a search engine. But if a particular theory tracks well with the results, has great predictive power, and yields results that improve your ranking, you've got a good foundation. Whether you care if it reveals *exactly* what is going on under the hood of the search engine depends on whether you're a scientific realist or instrumentalist with respect to such theory and research. (cf. http://www.routledge-ny.com/rep/q094sam.html).

You must also keep in mind something else given what Orion has said:

Repeat recipe for other search engines. c-indices may change, which indicates that semantic connectivity is different across databases. (Very important for SEOs!)


Your tool and theory must take into account the differences among the various search engines. My tool does this - and was the hardest part of its development. If you guys start with sound theories and track results well, I'm sure you can develop something quite useful for yourself and for your clients.

Just in case you weren't sure, there are SEO professionals who are already using this kind of research in their everyday activity. I'm one of them :)

AussieWebmaster
06-08-2004, 08:09 PM
Regarding tools: a few thoughts:

Every tool is based on an underlying theory:
Be careful that any tool you use/develop is based on sound theory. Keep in mind also that multiple theories can give very good explanations for the same phenomenon. None of us (who don't work for a search engine) knows for certain what's under the hood in the code/algorithm of a search engine. But if a particular theory tracks well with the results, has great predictive power, and yields results that improve your ranking, you've got a good foundation. Whether you care if it reveals *exactly* what is going on under the hood of the search engine depends on whether you're a scientific realist or instrumentalist with respect to such theory and research. (cf. http://www.routledge-ny.com/rep/q094sam.html).

You must also keep in mind something else given what Orion has said:


Your tool and theory must take into account the differences among the various search engines. My tool does this - and was the hardest part of its development. If you guys start with sound theories and track results well, I'm sure you can develop something quite useful for yourself and for your clients.

Just in case you weren't sure, there are SEO professionals who are already using this kind of research in their everyday activity. I'm one of them :)
Welcome to the board Mr Marshall

detlev
06-08-2004, 11:16 PM
Hello everyone,

My thanks go to Orion for bringing this up!

As I understand it, the SEO application for co-occurrence means the documents that contain both terms can be calculated using advanced search syntax in today's engines, and better choices can be made when choosing keywords for SEO Web copywriting. The terms can be said to be semantically related (by virtue of the Web), but are they then better keywords to target?

Since this comes from IR research, it really seems to have applications for better Web search and not so much for the SEO. The reason I think the SEO cannot really gain much from this information is that they would be targeting terms that are related insofar as they are intertwixt by the Web's documents at a high rate (thereby being more competitive)!

An engine that uses co-occurrence technology blended into its Web search application can be targeted by the SEO and possibly with good results. Since there is at least one person who posted here describing using this calculation with good results, it stands to reason that the engine that was targeted is one that blends some form of co-occurrence now. Would that not be true, Orion? How do SEOs benefit from this information in a practical sense unless the engines are blending semantic connectivity somehow?

*cheers*
-detlev

Dodger
06-08-2004, 11:31 PM
I think he is going to get into that detlev.

http://forums.searchenginewatch.com/forum/showpost.php?p=1219&postcount=41

thememaster
06-09-2004, 12:09 AM
Since this comes from IR research, it really seems to have applications for better Web search and not so much for the SEO. The reason I think the SEO cannot really gain much from this information is that they would be targeting terms that are related insofar as they are intertwixt by the Web's documents at a high rate (thereby being more competitive)!


The results of the research I conduct are not used for "targeting keywords" in my SEO campaign, but rather for supporting the keywords I have already chosen to target. The results are used to select the best terms and phrases that will build a more semantically relevant *context* around my keywords which will support them and increase ranking.

Since there is at least one person who posted here describing using this calculation with good results, it stands to reason that the engine that was targeted is one that blends some form of co-occurrence now. Would that not be true, Orion? How do SEOs benefit from this information in a practical sense unless the engines are blending semantic connectivity somehow?


I obviously don't know what is going on under the hood at Google, for example, but I do know that my results track well with what is going on in Google and they do help improve ranking for myself and for my clients. You are right to say that unless the SE is doing something related to the technologies behind my theories (which are related to but not the same as Orion's) my efforts would be pretty useless and I wouldn't be seeing consistent results.

I'm really looking forward to seeing more of Orion's ideas as he unfolds them. I have seen with my own eyes what benefit can be gained in SEO from insights in the fields of Artificial Intelligence, Information Retrieval, and Information Theory . . . even Chaos Theory.

orion
06-09-2004, 01:18 AM
Welcome to the forum, Nuclei, TheMeMaster, Mikkel, and Detlev. It is an honor to have you all here sharing your honest thoughts. Most of the points you all raise are valid. I will try to address them to the best of my knowledge. This is quite a long post and I didn't even finish as expected. My apologies in advance. Perhaps going a bit slow won't hurt new readers.

First things first,

To Nuclei:

1. "But to go thru that much trouble..." (post #43) and "Yeah, a tool could be made easily enough using several of the online thesauruses to do it somewhat automatically. Then at least it wouldn't take nearly as much time per site." (post #45). A c-Index Calculator we developed does the work in a snap, complete with multiple c-index calculations. It is a simple matrix of document results and search engines. Up to 6 indices can be calculated. It includes the k1,k2 and k1,k2,k3 cases, since most users use 2 or 3 terms when querying a database. As for the simple use of a thesaurus, the material covered in this post may help to address the issue.

To TheMeMaster:

1. "Keep in mind also that multiple theories can give very good explanations for the same phenomenon" Damn, very true.
2. "Just in case you weren't sure, there are SEO professionals who are already using this kind of research in their everyday activity. I'm one of them" Excellent. I'm happy you are one of them.

To Mikkel:

1. "Although this may not have direct impact on how the average SEO-joe does SEO today..." Let's hope education changes that.
2. "I do think an exciting technical search discussion like this belongs in these forums. Where else ?" Yeah, where else if not in the most important search engine marketing strategies site on the web?

To Detlev:

1. "As I understand it, the SEO application for co-occurrence means the documents that contain both terms can be calculated using advanced search syntax in today's engines, and better choices can be made when choosing keywords for SEO Web copywriting. The terms can be said to be semantically related (by virtue of the Web), but are they then better keywords to target?" Copywriting and writing style will be addressed in further posts. About the "semantically related (by virtue of the Web)" part: semantics and co-occurrence exploitation can only be done for the intended IR or DBa. I hope the material covered in this post helps in some way. Your post is elegantly presented and beautifully laid out. I love it.

2. "The reason I think the SEO cannot really gain much from this information is that they would be targeting terms that are related insofar as they are intertwixt by the Web's documents at a high rate" A very valid point. I hope this and the next post will address this.


BEFORE PROCEEDING

This thread doesn't deal with or try to propose a model for users' query behaviors, yet. We are simply using one of the many tools out there for examining term co-occurrence and semantic connectivity in IRs and search engines. Correlation indices are just one such tool. As mentioned in posts #26 and #31, other tools are available out there that address things not addressed by c-indices.


LET'S START

Let's start this post by revisiting the following assumption

# returned results = # documents containing the query,

which is a fair assumption for systems responding to pattern matching of regular expressions and queried using the proper query mode conditions. The case of systems returning documents topically connected but without containing the queried terms will eventually be addressed. Baby steps first.

For now, let's stick to the case of k1 and k2 (single terms) and k12 a query consisting of k1 and k2. I am assuming readers know how to calculate c12-indices and have read all previous posts. (What we are going to say about k12 and c12-indices also applies to the transpose case).

To calculate correlation indices in the form of c-indices, one must consider the proper IR conditions. Let's examine two query conditions that reinforce the above assumption: FIND ALL and EXACT searches. Here we go.


TERMS CO-OCCURRENCE UNDER "FIND ALL" CONDITIONS

For the k12 query the IR system or DBa should return documents containing k1 AND k2 without regard for order. Ideally, this includes records in which k1 is followed by k2 regardless of how many terms or delimiters are between them. Unless a system has been pre-programmed to return documents in which k1 and k2 are separated by a preset distance of terms (or delimiters) or has been pre-programmed -in the case of a natural language system- to respond to a "Please find k1 and THEN please find k2" request, the system should return documents containing

1. k1 and k2 in either order, regardless of the distance between them.
2. k1 and k2 or k2 and k1 as phrases.
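
As a rough illustration of FIND ALL behavior, here is a minimal Python sketch built on a toy inverted index. The four documents, the tokenizer and the names build_index and find_all are all made up for this example; no real search engine is claimed to work this way. It only shows that an order-blind AND query returns the same count for k1 k2 and for k2 k1.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


def find_all(index, *terms):
    """FIND ALL: documents containing every term, ignoring order and distance."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Toy collection, invented for illustration only.
docs = [
    "Earth science covers geography and geology.",
    "Geography of the Earth",
    "A guide to land surveying",
    "Earth: our home planet",
]
index = build_index(docs)

n12 = len(find_all(index, "earth", "geography"))  # query k1 k2
n21 = len(find_all(index, "geography", "earth"))  # transposed query k2 k1
print(n12, n21)
```

Because the query is resolved as a set intersection, term order cannot change the count in this toy setting. On a live engine the two counts may differ for the reasons discussed in the next section.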


WHY SEMS SHOULD BE INTERESTED IN TERM CO-OCCURRENCE

Under controlled conditions at the time of testing (i.e., if the system has no flaws, no purging, no new entries and no algorithmic changes), n12 and n21 (the result counts for the k12 query and for its transpose, k2 followed by k1) should be the same at the given testing time, t.

These conditions can hardly be present in a commercial search engine, which explains why n12 and c-indices can be time-dependent. Yet, an overall trend emerges for many strongly connected terms.

Still, the above dynamic conditions pose a non-trivial problem, notably for those paying for topically connected terms with or without a predefined sequence; e.g., SEMs bidding on terms or phrases in a given timeframe. Thus, understanding why and when co-occurrence changes is of interest to SEMs. (Can you see here an added $$$ service to charge for? If SEMs want to do things the old-fashioned way, they have the right to do so.)

For unpaid results ("organic") the problem's severity depends on how often the DB collections are updated.


TERMS CO-OCCURRENCE UNDER "EXACT" CONDITIONS

Contrary to popular opinion, EXACT doesn't mean ABSOLUTE or "please return documents containing this phrase" only. Searching for k12 in EXACT mode can produce

1. strictly, a phrase (by a phrase we mean k1 followed by k2, the two separated by a space).
2. k1 and k2 separated by delimiters (a period, semicolon, colon, hyphens, a pipe, etc)
3. k1 and k2 separated by ignored terms (stopwords).

All depends on which IR or search engine you are querying and how it was programmed. To convince yourself, try an exact search for k12 in different search engines and compare results.

Some systems are programmed to ignore periods, commas, semicolons, pipes, etc., between terms. Others are programmed to ignore certain classes of stopwords. It all depends on the library of regular expressions and stopwords used internally.

Therefore, what an IR system sees as a "phrase" (in, let's say, a title, description or text link) is not necessarily what a user sees as a phrase.
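
To make the three EXACT behaviors concrete, here is a small Python sketch with three hypothetical matchers: one strict, one that ignores delimiters, and one that also ignores stopwords. The stopword list and the function names are assumptions made for illustration; real engines use their own internal libraries, which we cannot see.

```python
import re

# Hypothetical stopword list; every engine keeps its own.
STOPWORDS = {"a", "and", "of", "the"}


def tokens(text, drop_stopwords=False):
    """Lowercase word tokens; punctuation and other delimiters are discarded."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words


def strict_match(text, phrase):
    """The phrase must appear literally, terms separated by a single space."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE) is not None


def loose_match(text, phrase, drop_stopwords=False):
    """Terms must be adjacent after delimiters (and optionally stopwords) are removed."""
    doc = tokens(text, drop_stopwords)
    target = tokens(phrase, drop_stopwords)
    n = len(target)
    return n > 0 and any(doc[i:i + n] == target for i in range(len(doc) - n + 1))


phrase = "earth geography"
samples = [
    "Earth geography basics",
    "Earth | Geography",
    "Earth and the geography of land",
]
for text in samples:
    print(text, strict_match(text, phrase),
          loose_match(text, phrase),
          loose_match(text, phrase, drop_stopwords=True))
```

The first sample matches under all three matchers, the second only once delimiters are ignored, and the third only once stopwords are dropped as well. A system using either of the looser matchers would count all of these as "phrase" hits even though a reader would not.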


A SIMPLE TEST

1. Compute the c-index for k1 = earth, k2 = geography and k12 = earth geography in EXACT mode (from the advanced search feature) in Google or your favorite IR system.
2. Check top results and see how k12 co-occurs. Surprise! It is not always what you consider a "phrase".

Consequently, blind comparisons of term co-occurrence and semantic connectivity across search engines and IR systems, or the wrong selection of query conditions, may lead to wrong conclusions.

To sum up: IR systems can only give you what they have or can interpret or "see" as a term, terms or "phrases". Understanding their retrieval behavior provides an edge to SEOs.


USING C-INDICES AND IDENTIFYING COMBINATIONS OF SYNONYMS.

Blind usage of terms extracted from a thesaurus is damn well not enough! We need to determine the relative degree to which terms from the pool of candidate synonyms co-occur in the DBa or IR system we want to target.


CASE STUDY

A client wants a web page to be designed focusing on k1 = earth and terms associated with k1. A thesaurus produces the following synonyms (or topically related terms) for k1 = earth. The pool of terms consists of:

geography, geology, terrestrial, territory, land.

Problem: Which are the ones with the strongest co-occurrence in the IR system or search engine the client wants to target?

Test Procedure:

1. Calculate c-indices in FIND ALL mode for

k1=earth k2=geography
k1=earth k2=geology
k1=earth k2=terrestrial
k1=earth k2=territory
k1=earth k2=land

2. Order results and select results with the highest co-occurrence. Emphasize strongly connected combinations in client's page, accordingly.

3. This part, only for QC testers familiar with stats, t-tests and Q-tests, only: Use a Q-test to decide which terms could be removed from the pool at the 95% confidence level (this topic, out of scope of this thread)

If interested in targeting "phrases", repeat recipe in EXACT mode.
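
The test procedure above can be sketched in a few lines of Python, using the formula from the first post, c = n12/(n1 + n2 - n12). All result counts below are invented placeholders, not numbers taken from any search engine; replace them with the counts you actually get in FIND ALL (or EXACT) mode at testing time t.

```python
def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12), between 0 and 1."""
    union = n1 + n2 - n12
    if union <= 0:
        return 0.0
    return n12 / union


# Hypothetical result counts, for illustration only.
n_earth = 1000                # results for k1 = earth
candidates = {                # k2: (results for k2, results for "earth k2")
    "geography": (400, 200),
    "geology": (300, 180),
    "terrestrial": (150, 60),
    "territory": (500, 50),
    "land": (2000, 300),
}

# Step 1: one c-index per (k1, k2) pair.
scores = {k2: c_index(n_earth, n2, n12) for k2, (n2, n12) in candidates.items()}

# Step 2: order the pairs, strongest co-occurrence first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for k2, c in ranked:
    print(f"earth + {k2}: c = {c:.4f}")
```

With these made-up counts the order comes out as geography, geology, land, terrestrial, territory. Note that "land" has the largest raw n12 but does not rank first, because the index normalizes by how common each term is on its own.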

On a personal note.

I regret to say that due to space and time constraints, I couldn't get to points 4 and 5 as scheduled in post #41 (Combinatory Theory and the k1, k2, k3 case). I will try to cover that tomorrow or the day after. It's a promise.


Orion

DataPacket.NET Brian
06-09-2004, 02:11 AM
Hello All,

Good article. Got it via email.

Doesn’t this formula just mean to use more relevant keywords, which we should be doing anyway?

If you target the keyword car, then you should also target auto, vehicle etc.. Anyway?

Brian

Dodger
06-09-2004, 02:29 AM
The results of the research I conduct are not used for "targeting keywords" in my SEO campaign, but rather for supporting the keywords I have already chosen to target. The results are used to select the best terms and phrases that will build a more semantically relevant *context* around my keywords which will support them and increase ranking.


I took the demo tour of your tool, and I was pretty impressed with the contextual words -- and especially the two word combos in one of the charts. One site that I am doing for my sis-in-law is weight loss related and that happened to be the keyword combo that was used in your demo.

The word pairings looked oddly familiar to me for some reason. A lot of them I would use normally, some of them from the stock ad copy of the individual products ... but then there were the combos and other single words that a competitor of mine uses. I got to checking, and sure enough ... I think this dude is using your damn chart. :eek:

He is also one of our wholesalers too -- it is a fun game we play back and forth on some of this stuff and when I saw the chart, I about died. Well it is time to hit the pages again, I want to tweak a few of them that are a couple of spots back of him and see if I can't cut him down a couple of notches.

thememaster
06-09-2004, 03:33 AM
Doesn’t this formula just mean to use more relevant keywords, which we should be doing anyway?

That's right Brian, but the issue is which words are more relevant. Akin to how WordTracker reveals which words surfers are *actually* using in the searches so that you don't have to guess based on what words and phrases you *think* they should or would use, this kind of research helps reveal what the search engines deem as more or less relevant to a keyword/keyphrase and not just what you think should be so.

I took the demo tour of your tool, and I was pretty impressed with the contextual words . . .

Thanks Dodger :)

Pretty interesting story regarding your wholesaler/competitor too. Perhaps you'll win the next round in your game. :D

aff_dan
06-09-2004, 04:11 AM
nice comment. keep the work!

atom
06-09-2004, 08:55 AM
Firstly, thanks to Orion for his/her attempt in cross-pollinating the SEM industry with years of academic information/data theory and practice. I am sure this is a trend we can expect to see more of as the mainstream coverage of the SEM industry continues.

My hat is off to you sir/madam.

---

For those interested in actually performing Orion's c-indices exercises, feel free to use this: http://graphnical.com/cindex/

---

Lastly, for those consumed by Orion's discussion check this out:
http://www.dcs.gla.ac.uk/~iain/keith/index.htm

In addition, search these related studies:
Support Vector Machines Information Retrieval
Neural Nets Information Retrieval
Machine Learning Information Retrieval
Intelligent Web Wrappers
Semantic Web
Automated Text Summarization
Natural Language Processing
Web Agents
Knowledge Discovery and Data Mining

...the list goes on I assure you :)

For those of you who would like to discuss the above more (especially you Orion) you can find my email info (with public PGP key) here: http://www.graphnical.com (hint: mouseOver the 'g').


Enjoy!

>||:)

chris
06-09-2004, 10:00 AM
I've been thinking (it happens occasionally) about this from another perspective. Orion, I can see this being useful for things like Adsense and for things like maybe Google news where a small number of stories are grouped (perhaps semantically). However, when it comes to actual search results we're presumably talking about creating a massive (very very massive) cache or it's going to be far too slow to use. So I guess we could put the as yet unanswered question about the benefits in terms of SEO a different way: is anything like this actually in use in a large scale search engine (e.g. Google)? A sub-question of which has to be can something like this actually scale to billions of documents over hundreds of languages in terms of being used as a ranking factor?

orion
06-09-2004, 10:18 AM
Welcome to this thread, Brian, AFF and Atom. Great to have you here.


To Brian:

1. "Doesn’t this formula just mean to use more relevant keywords, which we should be doing anyway?" Correlation indices for term co-occurrence (like other tools out there for measuring semantic connectivity) are exploratory datamining tools used to systematically find optimum combinations as perceived by the target, intended, particular, specific IR system (with emphasis on "target, intended, particular, specific" IR system). You may want to revisit TheMeMaster's post (post #55) "...this kind of research helps reveal what the search engines deem as more or less relevant to a keyword/keyphrase...".

2. "If you target the keyword car, then you should also target auto, vehicle etc.. Anyway?" Revisit the "earth" example and the five candidate combinations. Calculate c-indices. This may help to address the issue. All in all, a "by default" approach of emphasizing all possible synonyms for a term, regardless of which IR system we are targeting, may take you nowhere. The more we understand what the intended IR "perceives" as relevant, why and under which circumstances, the better.


To AFF:

Your post (post #57 of this thread) is identical to that of Serge (post #15 in this thread) which I already addressed, I think.


To Chris:

1. "So I guess we could put the as yet unanswered question about the benefits in terms of SEO a different way: is anything like this actually in use in a large scale search engine (e.g. Google)? A sub-question of which has to be can something like this actually scale to billions of documents over hundreds of languages in terms of being used as a ranking factor?" Excellent questions, which I may need to delay for now. Semantic connectivity and relevance ranking will be addressed eventually. I am still trying to explain just the basics. For a billion-size cache, no doubt other issues may arise (e.g., signal-to-noise ratios, length scales, power laws, etc.). As for whether Google is doing or developing tools along your lines of thought, only Google knows. As for whether others have tried, you may want to check post #40 (Tony's post).


To Atom:

1. "Firstly, thanks to Orion for his/her attempt in cross-pollinating the SEM industry with years of academic information/data theory and practice." Certainly that is not my intention, sir or madam. Point aside, I didn't know someone can read my mind, my attempts or know my "intentions". Still, sir or madam, I can understand your concerns about the cross-pollination problem and I want to assure you that I hate it, too.

2. "I am sure this is a trend we can expect to see more of as the mainstream coverage of the SEM industry continues." You may want to revisit my first posts. The main thesis of this thread is how SEOs/SEMs can learn and become more educated about IR/AI. The more they know about IR and how to utilize analytical tools that have been around for so long and used by IR researchers (or practically owned by IR people) the better. It is hard for me to understand why they shouldn't know about all this or why this information should not reach the mainstream, honestly.

Still, thanks for the invitation, sir or madam.


On a personal note,

In post #18 of this thread I wrote "Assumptions, statements, or claims I haven't made in this forum....That I am the "owner" of the truth or cannot make rational mistakes while presenting my thoughts "

Yesterday was a long and exhausting day for me. I almost didn't have time to post until past midnight (according to my local time --Eastern Time). While typing, I think I made one of the expected rational mistakes. Boy the day was long!

In that post I wrote under CASE STUDY >> Test Procedure the following line

"3. This part, only for QC testers familiar with stats, t-tests and Q-tests, only: Use a Q-test to decide which terms could be removed from the pool at the 95% confidence level (this topic, out of scope of this thread)"

I shouldn't have added any reference to QC stats but concentrated on the current topic. I don't want to further confuse someone who is already confused or looking for "one-size-fits-all" solutions. So please disregard that line. My apologies. (Still, QC/QA refinement tools can be added to the mix, especially when we do time series analysis of term co-occurrence). Damn me, I did it again.


Orion

orion
06-09-2004, 11:37 AM
Damn me, another rational horror. My apologies, Atom. I misunderstood your post. It is just that cross-pollination is a bad word in ethical SEO circles.

I did a quick search on the suggested topic. Beautiful reference material you have suggested. I love it, especially the Machine Learning Information Retrieval.

My hat is off to you too, Atom.

I think I need some coffee and breakfast. I'll need to start the other part of my brain and do some work and take care of life issues. Be back tonight or tomorrow with the scheduled topics.


Orion

AussieWebmaster
06-09-2004, 03:35 PM
This is all very interesting, but personally, before I jump into any of this, I need more proof and more evidence that this may improve rankings in any way.

Also, what's Google's view on this? Are there any Google members in this great new forum that may wish to talk a bit more about this, without giving away the 'Google secret recipe'? ;-) (evil grin)
This is a discussion on semantics and word weighting more than a purely keyword selection talk...
Looking for the Google response will be a long time coming I think... but it has been passed along so you never know.

Mikkel deMib Svendsen
06-09-2004, 05:07 PM
I may not have introduced myself here yet :)
- I am the moderator of this forum ... Hi to you all!

As I said, this discussion is very interesting and most certainly belongs here. However, in order to keep it focused I'd like us to stick with the original subject as posted by orion. Whether or not these theories are useful for SEO now is very interesting too, but I think it will be better to focus on that in another thread. This thread is already getting very long and harder to read for new visitors day by day. I just want to make sure it stays useful for most :)

nuclei
06-09-2004, 05:55 PM
Hi Mikkel,

What seems most useful to me is whether the information has a valid use in SEO/SEM, since that's what most of us here do, and what newbies want to be able to do. So it seems to me that all the topics covered here are really the same thing.

detlev
06-09-2004, 06:34 PM
Hello everyone,

I think to keep everything on subject, I will suppose this thread is about relevancy as originally posted by Orion.

The thing about the c-index is that it takes 2 queries to calculate the measurement of whether or not the terms relate. Then it is a measurement of how related the terms are in a given DB by examining their co-occurrence.

That's how I understand it in a nutshell.

It becomes interesting to query a public search engine and calculate the c-index, (a relatively simple thing to do), and ThemeMaster has used the results to verify that his keyword "theme" ideas are in fact on target when working on behalf of a client in his capacity as SEO. At least I think it was ThemeMaster that said so...

But more to the point, Orion is suggesting that there is merit to the idea for calculating relevancy because the measurement of the terms is in fact a measurement of semantic relatedness. And relatedness can have meaning for information retrieval. Orion now intends to expand on the theory by adding more calculations that determine further relevancy than that implied by co-occurrence alone.

I suppose you can apply all sorts of ordinary relevancy ideas once you have the original c-index measurement. On the query side, you still need: k1, k2 begets k12 for the calculation. Therefore an input rule may apply: garbage in, garbage out. There is a lot of assumption going on on the query side of things, but you can measure relatedness as a DB is concerned.

The DB needs to be considered as a whole either to be complete or as some partial set of documents which could also skew the accuracy of the c-index. Or it could have meaning as a calculation for an industry niche of some sort; something someone would want to calculate meaning and relatedness as applied to a set of docs that were in fact specialised.

Since k1 + k2 begets k12, wouldn't it be interesting for a search engine to split a phrase into k1 and k2 and then calculate the relatedness of the single words in a phrase query? Maybe the way to think about using the c-index would be on a keyword database. That way a search engine can expand a search into related terms by using co-occurrence of keywords - even with a single word query this could be applied.
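
A toy Python sketch of that idea, under heavy assumptions: the four "documents", the 0.5 threshold and the names postings and expand are invented for illustration, and nothing here claims to describe how any real engine expands queries. It computes the c-index between a query term and every other term in a tiny collection and keeps the strongly connected ones as expansion candidates.

```python
import re
from collections import defaultdict


def postings(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(re.findall(r"[a-z]+", text.lower())):
            index[term].add(doc_id)
    return index


def expand(index, term, threshold=0.5):
    """Return (term, c) pairs whose c-index with `term` is at least `threshold`."""
    base = index.get(term, set())
    related = []
    for other, other_docs in index.items():
        if other == term:
            continue
        union = len(base | other_docs)           # n1 + n2 - n12
        c = len(base & other_docs) / union if union else 0.0
        if c >= threshold:
            related.append((other, round(c, 2)))
    # Strongest connection first, alphabetical order on ties.
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))


# Tiny made-up keyword collection.
docs = [
    "car accident insurance",
    "car insurance quote",
    "car accident lawyer",
    "earth geography",
]
index = postings(docs)
print(expand(index, "car"))
print(expand(index, "earth"))
```

In this collection "car" expands to "accident" and "insurance" while "quote" and "lawyer" fall below the threshold. The threshold is arbitrary; picking it sensibly for a real database is exactly the kind of question the c-index only helps to explore.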

Just thinking out loud! Am I missing anything, or am I wrong anywhere? I like this thread... thanks again O.

*cheers*
-detlev

Mikkel deMib Svendsen
06-09-2004, 06:56 PM
Maybe this is a stupid question, I must admit that some of these theories are a bit over my head, but how does all this work for other languages - languages that work in very different ways from English? For example Danish, which is my native language.

As an example, in Danish two keywords would very often not be written like "k1 k2" but rather combined into a new word "k1k2"

Example:
The two words "Car Accident" can be used independently or together but are kept as separate words in English - with a space between them. In Danish that would become: "Caraccident". In Danish there is an infinite number of combined words - it's often correct to make up combinations.

I guess what I am asking is if you can use the same theories discussed here in other languages, like Danish, or if this is English specific?

nuclei
06-09-2004, 08:22 PM
I guess what I am asking is if you can use the same theories discussed here in other languages, like Danish, or if this is English specific?

I would think that it would be the same for any language. As you have caraccident, you could easily have insuranceclaim, etc as k2. Or maybe I am missing your question, which is very possible at this hour and 5 coronas to the wind :)

Dodger
06-09-2004, 10:03 PM
Example:
The two words "Car Accident" can be used independently or together but are kept as separate words in English - with a space between them. In Danish that would become: "Caraccident". In Danish there is an infinite number of combined words - it's often correct to make up combinations.

I guess what I am asking is if you can use the same theories discussed here in other languages, like Danish, or if this is English specific?

That would throw a monkeywrench into the works, eh? I see what you are saying, the search engine would have to take into account double-word patterns written as one in order to make relationships to single words.

I think that is where AllTheWeb may have a leg-up on this aspect, maybe ... since they were a Danish company originally weren't they? Now Yahoo has them.

orion
06-10-2004, 01:31 AM
Mikkel, nuclei, detlev, dodger, aussie. All observations are well taken. Detlev, again your post is excellent.

To Mikkel:

1. "However, in order to keep it focused I'd like us to stick with the original subject as posted by orion. Whether or not these theories are useful for SEO now is very interesting too, but I think it will be better to focus on that in another thread." Agreed. We need to keep our eyes on the ball. I'll do my part and my best.

2. "This thread is already getting very long and harder to read for new visitors day by day." I have been informed that the thread is being followed very closely and commented on by SEOs/SEMs from France, where it is very popular.

3. "I guess what I am asking is if you can use the same theories discussed here in other languages, like Danish, or if this is English specific?" Yes. Once we build a thesaurus library the theories can be applied to the intended language.


Orion

orion
06-10-2004, 01:51 AM
This post is organized as follows:

1. Review on proper query conditions.
2. A revisit to a case example.
3. An Introduction to Combination and Permutation Theory
4. The k1,k2,k3 case.

Since items 1 and 2 are old material, they are presented with minimal explanation. Let's proceed.


SELECTING THE RIGHT QUERY CONDITIONS

[Note: We have recently seen several web properties trying to replicate c-indices and c-index calculators using a wrong selection of terms or wrong query conditions, and without proper knowledge of the theory behind the basics. While I welcome those tools, we need to understand the underlying theory and not do mere number crunching. I hope this post helps to sort things out.]

This is what we commonly do:

1. We avoid delimiters, stopwords, and terms that are too common or too vague when defining k1, k2, and k12. Experiments suggest that with such terms one may end up with meaningless c-indices. Furthermore, such indices are hard to compare with c-indices extracted from valid terms found in a thesaurus or from terms with a branding presence on the web. To top it off, query expansion explorations are questionable with such terms.

2. All experiments are performed in FIND ALL mode. This includes the queries for k1, k2 and k12. If we are interested in targeting "phrases" (see the previous post for what an IR system or search engine perceives as a phrase), then all queries are also repeated in EXACT mode. Normally we try to run the FIND ALL and EXACT queries in parallel, at a given time t.

3. We avoid the use of FIND ANY when determining co-occurrence, since in this mode our central assumption

# results retrieved = # results containing the query

is no longer valid. Furthermore, FIND ANY is inherently contrary to the concept of co-occurrence. Let me expand on this...

For the k12 query, it is true that this mode may return documents containing both k1 and k2, but it may also return documents containing only k1 or only k2; that is, documents with no co-occurrence at all. Co-occurrence indices (c-indices) extracted from FIND ANY results are therefore questionable. In fact, before querying an IR system or search engine one should not rely on the default query mode. If the default query mode is FIND ANY, we need to change it before conducting the experiment(s).
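
To make the difference between the two modes concrete, here is a minimal Python sketch. It assumes a tiny invented collection of four documents and a naive whitespace tokenizer; the names docs, matches and search are hypothetical and do not describe how any real engine works.

```python
# Toy collection: document id -> text. All documents are invented for illustration.
docs = {
    1: "earth and land surveys",
    2: "earth science",
    3: "land for sale",
    4: "ocean currents",
}

def matches(text, terms, mode):
    """Return True if the text satisfies the query in the given mode."""
    words = set(text.split())
    hits = [term in words for term in terms]
    # FIND ALL requires every term; FIND ANY is satisfied by a single term.
    return all(hits) if mode == "ALL" else any(hits)

def search(terms, mode):
    """Return the set of document ids retrieved for the query."""
    return {doc_id for doc_id, text in docs.items() if matches(text, terms, mode)}

find_all = search(["earth", "land"], "ALL")
find_any = search(["earth", "land"], "ANY")

print(sorted(find_all))  # only document 1 contains both terms
print(sorted(find_any))  # documents 2 and 3 are retrieved with no co-occurrence
```

In this toy case FIND ANY retrieves three documents but only one of them contains both terms, so the number of results retrieved is no longer the number of results containing the query.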

Note that in this thread we are not concerned with modeling user behavior. But certainly the probability of being "picked" by an IR system or engine and found by a user searching in FIND ANY mode increases for documents with incidents of term co-occurrence. This will be revisited in another section of this post.


4. Since c-indices run in theory from 0 to 1 and in practice are very small numbers, we express them in parts per thousand (ppt), usually to 2 to 4 decimal places. The number of decimal places used is somewhat arbitrary and does not seem to introduce significant error with strongly connected terms. Yet we have found cases in which decimal precision introduces nontrivial error, especially with weakly connected terms in EXACT mode, where c-indices tend to be very small.


DEGREE OF CO-OCCURRENCE FOR TERMS EXTRACTED FROM A THESAURUS.

Reasoning: Given a k1 term, topically associated candidate k2 terms are
extracted from a thesaurus. Yet,

1. the fact that the k2's are topically associated to k1 is not enough.
2. We need to establish the degree of association between k1 and the candidate k2's as perceived by the target IR system or search engine.

Furthermore, we also need to know the degree of semantic granularity of the target IR system. By "degree of semantic granularity" we mean how the target IR system or engine interprets specific delimiters, prefixes, stems, etc. Again, this is done in FIND ALL and FIND EXACT modes. Perhaps an example may be enough.


CASE EXAMPLE PREVIOUSLY PRESENTED

The following are the results for the case introduced in an earlier post. Raw data is given below; no additional comments are required. For additional information see post #51.

---------------------------------------

IR/Search Engine: Google
Date/Time: 06-09-04 Around 11:25 AM
Query Mode: FIND ALL
Case: k1, k2 and k12


Results Sorted According to Degree of Co-occurrence (c12-index)

#1 k1=earth k2=land || c12 = 45.54 ppt
#2 k1=earth k2=geography || c12 = 45.31 ppt
#3 k1=earth k2=geology || c12 = 35.35 ppt
#4 k1=earth k2=territory || c12 = 16.37 ppt
#5 k1=earth k2=terrestrial || c12 = 13.73 ppt


Notes: Degree of co-occurrence (c-indices) expressed in parts per thousand and up to 2 decimal places. Results may be different in other IR systems or search engines.

---------------------------------------

Actions or recommendations to implement after collecting results

To build a customized thesaurus or to perform query expansion using the previous experiments:

1. k1 and candidate k2's are obtained from a thesaurus, and their corresponding c-indices are computed and sorted.

2. Those combinations with high c-indices are placed in a library for further query expansion tests. We will see why in the last part of this post.

3. Those combinations with low c-indices are placed in a different library for further query expansion tests. We will see why in the last part of this post.

4. If we want to emphasize "phrases" we repeat the tests with all combinations using EXACT mode.

5. Repeat tests using transpose c-indices.
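
As a rough illustration of steps 1 to 3, the c12 values from the Google example above can be sorted and split into two libraries with a few lines of Python. The 30 ppt cut-off is purely an assumption made for this sketch; the thread does not state where the line between "high" and "low" c-indices should be drawn.

```python
# c12-indices (in ppt) for k1 = earth, copied from the case example above.
c12_ppt = {
    "geology": 35.35,
    "land": 45.54,
    "terrestrial": 13.73,
    "geography": 45.31,
    "territory": 16.37,
}

# Hypothetical cut-off between "high" and "low" c-indices (an assumption).
THRESHOLD_PPT = 30.0

# Step 1: sort candidate k2's by decreasing degree of co-occurrence.
ranked = sorted(c12_ppt.items(), key=lambda item: item[1], reverse=True)

# Steps 2 and 3: keep two separate libraries for later query expansion tests.
strong_library = [k2 for k2, c in ranked if c >= THRESHOLD_PPT]
weak_library = [k2 for k2, c in ranked if c < THRESHOLD_PPT]

for k2, c in ranked:
    print(f"k1=earth k2={k2} || c12 = {c:.2f} ppt")
```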


So far this is the routine procedure for k1,k2. Why are we concerned about transpose c-indices? What about the k1,k2,k3 case? Finally, let's start with new material.



ON PERMUTATIONS, COMBINATIONS AND TRANSPOSE C-INDICES

In the above sections, the FIND ALL mode often preceded the EXACT mode in the discussion. In this section, I need to explain the EXACT cases first. Before proceeding, some basic definitions and math.

! is the factorial operator: n! = n*(n-1)*...*2*1, and by definition 0! = 1.

Example: For n=3, n!=3*2*1=6.


A PERMUTATION is an arrangement of a group of n items taken r at a time with regard for sequence (order is of significance). The number of possible permutations is given by

P(n, r) = n!/(n - r)! = n!/(delta)!, where delta = n - r.

If r = n, then P(n, r) = n!

In EXACT mode, for 2 keywords taken 2 at a time, P(2, 2) = 2! = 2, so two c-index values can be obtained by querying an IR system or search engine: c12 and its transpose, c21.


A COMBINATION is an arrangement of a group of n items taken r at a time without regard for sequence (order is of no significance). The number of possible combinations is given by

C(n, r) = n!/(r!(n - r)!) = n!/(r!(delta)!)

In the special case where r = n, delta = 0 and we have C(n, r) = n!/(n! 0!) = n!/n! = 1
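
The two counting formulas can be checked quickly in Python. This is only a sketch of the arithmetic; the function names permutation_count and combination_count are made up for this example.

```python
import math

def permutation_count(n, r):
    """P(n, r) = n!/(n - r)!  (order matters, as in EXACT mode)."""
    return math.factorial(n) // math.factorial(n - r)

def combination_count(n, r):
    """C(n, r) = n!/(r!(n - r)!)  (order does not matter, as in FIND ALL mode)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(permutation_count(2, 2))  # number of ordered arrangements of 2 keywords
print(permutation_count(3, 3))  # number of ordered arrangements of 3 keywords
print(combination_count(2, 2))  # unordered selections of 2 keywords out of 2
print(combination_count(3, 3))  # unordered selections of 3 keywords out of 3
```

For 2 keywords taken 2 at a time this gives 2 permutations (c12 and c21) but only 1 combination, which is why in theory a single c-index would be enough in FIND ALL mode.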

A query in FIND ALL mode is a query regardless of sequence. Thus in theory in FIND ALL mode, for 2 keywords taken 2 at a time,

c12 = c21

Why then calculate transpose c-indices in this mode? Simply put, because reality bites.

As explained in previous posts of this thread, commercial search engines are constantly updating their DB collections by means of purging, deleting, checking and adding documents. Thus a transpose query k21 may reveal documents not found in the k12 result set. It is always a good idea to check transpose indices in FIND ALL mode. That information may be of value when we do exploratory query expansion tests or when we are interested in other cases such as the k1,k2,k3 case.

Talking about....


THE K1,K2,K3 CASE

This is the case where things get a little complicated. However, we can simplify it using query expansion strategies, as we will see. First let me present the not-so-friendly part of the case.

In EXACT mode, for 3 keywords (k1,k2,k3) taken 3 at a time, we can compute up to P(3, 3) = 3! = 6 c-indices, associated with the six permutations (with regard for sequence)

c123, c132, c213, c321, c231, c312
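
The six orderings can be generated mechanically rather than by hand. The sketch below uses Python's itertools.permutations; the three keywords are picked only for illustration, and how a given engine expects an EXACT query to be entered is not assumed here, so only the word order is produced.

```python
from itertools import permutations

# Example triplet, chosen only for illustration.
keywords = {1: "earth", 2: "land", 3: "geology"}

# Map each c-index label (c123, c132, ...) to the word order it stands for.
exact_queries = {}
for order in permutations(keywords):  # iterates over the key orderings (1, 2, 3), (1, 3, 2), ...
    label = "c" + "".join(str(i) for i in order)
    exact_queries[label] = " ".join(keywords[i] for i in order)

for label, query in exact_queries.items():
    print(label, "->", query)
```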


A query in FIND ALL mode is a query regardless of sequence. Thus one should
expect that

c123 = c132 = c213 = c321 = c231 = c312

which for commercial search engines is simply impossible, for the reasons given before.

Since we need to compute, in both EXACT and FIND ALL modes, all possible combinations of all candidate k's and the corresponding c-indices, the k1,k2,k3 case can be time consuming.

One approach we have found useful consists of building upon previous connectivity knowledge, that is, building upon the k12 cases, as follows:

1. One uses a library of presorted strongly connected k12 and redefines a new k1 as k1 = k12.

2. The newly redefined k1 is tested with several candidate k2's. New c-indices are calculated, sorted and analyzed.

This approach reduces the problem to a query expansion case in which k12 and its transpose k21 can be tested with additional candidate terms. This is why the two c-index libraries previously described now come in handy.

For instance, for the CASE EXAMPLE above we select k1 = earth land and several candidate k2's. We do the same with k1 = earth geography and with k1 = earth geology. The final goal is to extract, sort, and use the triplets with the strongest term co-occurrence in documents. Such documents improve their probability of being "picked" by the IR system and found by a user searching in FIND ANY mode.
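
A minimal sketch of this "redefine k1 as k12" strategy is given below. Everything in it is an assumption made for illustration: the six toy documents are invented, hit_count stands in for the result count a real engine would report in FIND ALL mode, and the candidate terms are arbitrary.

```python
# Invented toy collection; a real experiment would use result counts from the target engine.
TOY_DOCS = [
    "earth land geology survey",
    "earth land map",
    "earth geology",
    "land geology",
    "earth land geology field notes",
    "ocean tides",
]

def hit_count(terms):
    """Number of toy documents containing every term (FIND ALL behaviour)."""
    return sum(all(t in doc.split() for t in terms) for doc in TOY_DOCS)

def c_index(k1_terms, k2_terms):
    """c = n12/(n1 + n2 - n12), where k1 may itself be a multi-term query."""
    n1 = hit_count(k1_terms)
    n2 = hit_count(k2_terms)
    n12 = hit_count(k1_terms + k2_terms)
    union = n1 + n2 - n12
    return n12 / union if union else 0.0

# Step 1: take a strongly connected pair and redefine it as the new k1.
new_k1 = ["earth", "land"]

# Step 2: test the new k1 against candidate k2's, then sort by c-index.
candidates = ["geology", "map", "tides"]
ranked = sorted(((c_index(new_k1, [k2]), k2) for k2 in candidates), reverse=True)

for c, k2 in ranked:
    print(f"k1={' '.join(new_k1)} k2={k2} || c = {1000 * c:.2f} ppt")
```

The same loop can be repeated with the transpose (k1 = land earth) in EXACT mode, where word order would matter.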


WHAT'S AHEAD

Semantic connectivity, roots, prefixes and writing style.
Semantic connectivity and relevance ranking!!!

Before moving ahead with the new topics, I will give readers plenty of time, so we can slow down a bit and let new readers digest all previous posts. Posts and comments on this current post are welcome.


Orion

findme
06-10-2004, 03:36 AM
Hi all,

I am not an SEO person; I am interested in finding search and advertising technology companies that want large foreign clients (feel free to pitch me).

Orion, first off I am just amazed at how you keep track of so many conversations at once. And I am really impressed by your comments and insights.

a. From what I understand, Applied Semantics (Google acquired) developed semantic technologies but they also had thesauri as part of their IP. I say this because I have concerns about the processing and storage required for the processes you are mentioning. It sounds a lot like latent semantic indexing to me (I skimmed this discussion thread briefly), which requires a load of processing and storage with huge tables. How does the approach you mention provide a scalable solution to improving search indices?

If this is latent semantic indexing, what are your thoughts on drift?

b. How does Polysemy (words with multiple meanings such as 'lead', 'drive', etc ) affect the approach you mention?

c. You mentioned stemming, but have not gotten a chance to say more, please do. How about word morph?

..FindMe

PS: I apologise for not reading the thread in detail, I was just too anxious to post after reading the first page.

Dodger
06-10-2004, 04:19 AM
Before moving ahead with the new topics, I will give readers plenty of time, so we can slow down a bit and let new readers digest all previous posts. Posts and comments on this current post are welcome.

Might I suggest a new thread be started on each chapter or section Orion? Perhaps start it off by pointing to this thread as pre-requisite reading. I think this might keep the organization a little tidier and people can advance through each at their own pace. It would probably keep the questions more targeted and on that specific topic -- rather than having them intermixed throughout the entire thread.

orion
06-10-2004, 01:18 PM
To Mikkel:

In post #69 I answered your question, which I quote:

"I guess what I am asking is if you can use the same theories discussed here in other languages, like Danish, or if this is English specific?" Yes. Once we build a thesaurus library the theories can be applied to the intended language.

I forgot to mention that in addition to having a language-specific thesaurus, we also need a language-specific IR system or engine to target. c-indices that address linguistic peculiarities perform better with regional directories, IRs and engines.

On other matters, Mikkel, I have a question. Is it possible to post a print-screen GIF to clarify some ideas on Fuzzy Sets and c-indices? If so, how could it be done, with proper permission from you and the SEW editors? Not a biggie. Just a thought.

To Dodger:

1. "Might I suggest a new thread be started on each chapter or section Orion? Perhaps start it off by pointing to this thread as pre-requisite reading. I think this might keep the organization a little tidier and people can advance through each at their own pace. It would probably keep the questions more targeted and on that specific topic -- rather than having them intermixed throughout the entire thread." Excellent idea. I love it! I leave that call to Mikkel and the SEW editors.

To Findme:

1. "From what I understand, Applied Semantics (Google acquired) developed semantic technologies but they also had thesauri as part of their IP. I say this because I have concerns about the processing and storage required for the processes you are mentioning."..."How does the approach you mention provide a scalable solution to improving search indices?" Excellent questions. Let's see. Those concerns deal more with information management project (IMP) strategies than anything else. From an IR standpoint, processing and storage requirements can be simplified with index servers and index libraries. Scalability strategies can be applied to large cache collections.

Large SEO/SEM firms with several departments (sales, R&D, coders, web design, media buyers) will need a sound IMP and assembly line. Such environments may need to train R&D staff or, even better, hire an IR specialist or mentor. That's why I suggested IR educational activities in future SEO/SEM conferences. Perhaps this is where we are heading. In that way the credibility and education of SEOs/SEMs could be enhanced.

For an average SEO/SEM Joe running a one-man-gang business, processing and storage tasks could be an undue burden. Fortunately, analytical tools have been or are currently being developed by others. It's a matter of time before such tools hit the SEO market. I have several IR tools dedicated to research and Business Intelligence, not yet made public, which simplify most of the tasks described or mentioned in this thread. That includes my "THE C-INDEX CALCULATOR" and other AI/IR tools. I haven't made them public simply because

1. I don't need to, nor am I desperate.
2. I am still looking for the right firm(s) or IR group(s), with a valid reputation or commitment to IR technology, to team up with. It doesn't matter to me if they are for-profit or not-for-profit, as long as they are heading toward the next generation of smart search engines.

I will stop this line of thought since I don't want to give anyone the impression that I am "offering" myself. Still, with a valid project at hand, pitch me via this forum's private email feature. I'm not interested in mere marketing stuff. IR research and development of new tools and intellectual property is what I love to do. As a scientist with a formal degree and patent experience, that comes first but -damn life- we also need to live, close contracts, and address life issues. Right? Those of you out there know we IR scientists are torn between the usual two damn forces of life: theory and reality.

2. "How does Polysemy (words with multiple meanings such as 'lead', 'drive', etc ) affect the approach you mention?"..."How about word morph?" All this, including brute force algorithms (for prefixes, stems, suffixes), I would love to address in other threads, with proper permission from the SEW editors and moderators. The more SEOs/SEMs know about all this, the better.


Orion

Rob
06-10-2004, 03:32 PM
Hi Orion

I just wanted to say Hi, I'm trying to read through this incredible thread and say thanks.

I will have to re-read the entire thing a couple times to ensure I understand but what I see here is very useful. I'd like to see more of this type structured technical talk that isn't over my head.

thanks again!

Red5
06-10-2004, 03:35 PM
I'd just like to make my first post here by saying that this was a great thread to read, and I'm looking forward to many more posts and topics along these lines! I'm particularly interested in semantic analysis at the moment, and not just for SEO purposes, so thank you everyone. ;-)

orion
06-10-2004, 06:12 PM
Welcome to this thread Rob and Red5. It is a privilege to have you here.

To Rob:

1. "I will have to re-read the entire thing a couple times to ensure I understand but what I see here is very useful. I'd like to see more of this type structured technical talk that isn't over my head." Excellent. I just wish others would follow your example. Not everyone posting --i.e., on certain sites across the web-- has a clear understanding of the underlying theory and analytical tools I am trying to introduce. But that's ok with me.

2. "I'm particularly interested in semantic analysis at the moment, and not just for SEO purposes" Great! I feel the same way.

What days we are living in! What better time to learn about IR/AI? Here is a sample:

According to this news, from InformationWeek, http://www.informationweek.com/story/showArticle.jhtml?articleID=21401674,

Forty percent of USA federal agencies are using data mining for a variety of activities, some of which raise significant privacy issues, a report finds,

"The top six purposes for data mining are improving service or performance (65), detecting fraud or abuse (24), analyzing scientific and research information (23), managing human resources (17), detecting criminal activities or patterns (15), and analyzing intelligence and detecting terrorist activities (14)."

One line of the report says: "What the systems are good for is being able to identify what appear to be either clusters of information or an uncommonly high incidence of co-occurrence," says Andrew Feit, senior VP of marketing at intellectual-capital-management company Verity Inc. "The company's K2 Enterprise software is used by the Defense Intelligence Agency, among others, to identify terrorists."

As we can see semantic connectivity is here to stay. Those of you interested in Security Intelligence and Business Intelligence may want to know the basics of what we are discussing in this forum. Putting things in perspective...

This also means that the next generation of smart search engines and IR appliances --not the current, simplistic keyword-driven search engines, but semantically-driven machines-- are coming, if they are not here already. Thus SEOs/SEMs interested in catering their services to this new breed of intelligent IR systems will be in demand.

To position a document in such intelligent systems (almost right at the SE corner) we may need to understand how the system "sees", indexes and extracts information in the first place. Right? Whether we like it or not, natural language-, co-occurrence- and semantic-driven searches are here to stay. So, let's take baby steps now before it is too late. Are you up to the challenge? Me? I don't believe in 2nd-guessing or trial-and-error approaches. You? Hey, you always have the right to do things the old-fashioned way. That's up to you.

To give new readers time to digest my previous posts (and my typos and rational horrors), tomorrow I will wrap up a little summary of what we have learned. Then I will proceed with new material. Meanwhile feel free to comment on all previous posts.


Orion

Mikkel deMib Svendsen
06-10-2004, 08:25 PM
orion, you should be able to insert an image in your posts. I haven't done it yet as it is all still so new (remember, we are still only 10 days old). Try it, and if it doesn't work drop me a private message and I'll see what I can do.

rankforsales
06-10-2004, 08:39 PM
Orion:- I got to hand it to you: You impress me with all these numbers. I'm also impressed with the length of this post!

But, as I said in my previous post, I still need more 'scientific' proof or evidence if you will before we even think of making any changes to our clients' sites here.

Not that I don't believe you- I'm just too conservative, that's all.

I'm sure some other fine SEO's here think like me on this theory.

Serge Thibodeau
please no manual signatures

cariboo
06-11-2004, 10:45 AM
Well, if this is only a theory, it won't be difficult to test it.

Some of you want scientific proof! That's a very good reaction, but the method suggested by Orion isn't very difficult to implement, so it won't take long to get some feedback. Unless everybody decides to keep the results to himself :)

rankforsales
06-11-2004, 11:31 AM
" Unless everybody decide to keep the results for himself "

-Would anybody in this great SEO community ever do that?

I guess some might.... I know I won't. If we start implementing this on a test site (read: a dummy URL), I will make our findings public, both on this board and in my articles and in our newsletter..... deal?

orion
06-11-2004, 12:03 PM
To Serge:

1. "I got to hand it to you: You impress me with all these numbers. I'm also impressed with the length of this post!" With this one, you have asked three times, and three times I have intentionally delayed the discussion on term co-occurrence and mere absolute ranking results. Sounds like a 'you got served' now, which I very much welcome.

Serge, I want to focus on the basics first. I can assure the fine group at RankForSales that a proper discussion of that is coming. When we discuss co-occurrence at the granularity level of small vertical collections and specific documents, things --I hope-- will look a lot more evident.

2. "Not that I don't believe you- I'm just too conservative, that's all. I'm sure some other fine SEO's here think like me on this theory." These are all valid concerns. I too am a conservative person. I can understand why some SEOs tend to adopt a "Huh... prove it! I don't buy it, yet" approach. As a scientist, I have assumed and still assume the same conservative position on many IR issues discussed across the web. There is nothing wrong with that attitude, and it is very much welcome, Serge. I love it!

To some SEOs:

This thread is about heuristics and search technology, not about marketing issues. Still we cannot avoid overlaps, which I try to keep to a minimum. I understand why some with years of marketing background may find some of my previous posts hard to digest. That's perfectly fine with me.

What I cannot understand is how certain SEOs in some discussion forums seem to dismiss the whole issue of term co-occurrence/semantic connectivity as a myth, while others are now proposing formulas for keyword co-occurrence without apparently any sound knowledge of Fuzzy Sets, Venn Diagrams and Semantics. Check some posts at the following forum:

http://www.webproworld.com/viewtopic.php?t=21161

While some posts of that forum raise very valid issues, others simply propose faulty co-occurrence formulas. In the upcoming summary I will explain why it is so. So stick to this thread.

Bragging times, now.

4,000+ VIEWS AND COUNTING!!! This thread started on 06/02/04 and 10 days later has more than 4,000 views. Welcome all.

So many fine viewers are here for a reason. SEOs or not, feel at home at the SearchEngineWatch.com Forum; don't blink or go to other places. With your indulgence, if you need to do "1" or "2", do so now. I will post the summary sometime today or tomorrow, as scheduled. Then things will get more interesting.

Orion

Anthony Parsons
06-11-2004, 12:20 PM
" Unless everybody decide to keep the results for himself "

-Would anybody in this great SEO community ever do that?

I guess some might.... I know I won't. If we start implementing this on a test site (read: a dummy URL), I will make our findings public, both on this board and in my articles and in our newsletter..... deal?

Somehow, Serge, I don't think we have the rights to freely publish what comes of this board in our newsletters and such, as it belongs to SEW once listed. A quote or snippet, yeah sure, with a link to the source most definitely.

orion
06-11-2004, 02:10 PM
Sorry but I could not resist. I forgot to mention in my last post that the example, calculations and conclusions presented in the WebProWorld thread at

http://www.webproworld.com/viewtopic.php?t=21161

are simply incorrect. The thread's originator, without defining the query mode conditions (horror!) presents the case of

k1=dog
k2=canine
k12=dog canine

and compares it with

k1=dog
k2=pooch
k12=dog pooch

When comparing, he uses THE SAME n12 results for both calculations (i.e., n12=999,000; another horror!) and then ends up with the wrong c12-indices. He then draws wrong conclusions. This starts a meaningless discussion in the thread and all sorts of wrong formula proposals and speculations.

We did a search in Google in FIND ALL mode today and this is what we found, with c-indices in parts per thousand (ppt).

k1=dog=53,300,000
k2=canine=1,890,000
k12=dog canine=1,020,000
c12-index=18.83 ppt

k1=dog=53,300,000
k2=pooch=268,000
k12=dog pooch=129,000
c12-index=2.41 ppt

Clearly, in Google, dog canine has more degree of connectivity than dog pooch, right?

Since 18.83 >>> 2.41, his conclusions cannot be sustained.
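
For readers who want to check the arithmetic, the two c12-indices above can be reproduced from the reported result counts with a few lines of Python. The function name c_index_ppt is made up for this sketch; the formula is the c = n12/(n1 + n2 - n12) used throughout this thread, scaled to parts per thousand.

```python
def c_index_ppt(n1, n2, n12):
    """c-index in parts per thousand: 1000 * n12 / (n1 + n2 - n12)."""
    return 1000.0 * n12 / (n1 + n2 - n12)

# Result counts as reported above (Google, FIND ALL mode).
dog_canine = c_index_ppt(n1=53_300_000, n2=1_890_000, n12=1_020_000)
dog_pooch = c_index_ppt(n1=53_300_000, n2=268_000, n12=129_000)

print(f"dog canine: {dog_canine:.2f} ppt")
print(f"dog pooch:  {dog_pooch:.2f} ppt")
```

Note that each pair uses its own n12; reusing the same n12 for both pairs is exactly the mistake described above.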

Certainly, c-indices will change over time, but the overall trend, as obtained from a time series analysis we have for the above terms, suggests a clear separation of the semantic connectivity trends in each case.

I don't want to sound harsh. That's simply not my style, but

pleeeeaaassssee

Before using a theory or debating about a theory, learn the basics first.


Wait for my summary, please.


Orion

detlev
06-11-2004, 03:39 PM
Hello everyone,

I would like to put all this in perspective as I understand it.

According to the figures pointed out by Orion above, there is much stronger relatedness between dog canine versus dog pooch in Google. That sounds very correct but it does not mean SEOs should run out and use dog canine in their copy if dog pooch is the copywriting style of the Website. It might vaguely imply but does not prove that more people search for dog canine versus dog pooch, and it does not mean you will rank better in Google if you try for dog canine versus dog pooch as a keyword phrase when the opposite might be truer.

It means we can measure the relatedness of dog canine versus dog pooch in the Google index. What this means for SEOs is still unclear and I am waiting for Orion to present some thought about what this means for copywriting style and SEO. I hope this helps some who are looking for a SEO magic bullet here. If one exists it has not yet been presented.

*cheers*
-detlev

orion
06-11-2004, 04:35 PM
To detlev:

1. "I would like to put all this in perspective as I understand it. According to the figures pointed out by Orion above, there is much stronger relatedness between dog canine versus dog pooch in Google." Precisely.
2. "That sounds very correct but it does not mean SEOs should run out and use dog canine in their copy if dog pooch is the copywriting style of the Website. " Precisely, it doesn't mean that. I haven't discussed copy style yet.
3. "I hope this helps some who are looking for a SEO magic bullet here." Precisely.
4. "If one exists it has not yet been presented." Precisely and well put.

Let's keep everything in perspective as detlev has stated. As I've mentioned before, don't try to look for instant grats here. Anyone looking for magic bullets for rankings is reading the wrong thread.

Orion

orion
06-12-2004, 11:58 AM
THREAD SUMMARY

DEFINITION OF A C-INDEX

Let n1 be the # of search results containing a term k1 and n2 be the # of search results containing a different term k2. Let's assume a query consisting of k1 and k2 does not produce any results containing both k1 and k2; i.e., n12 = 0. Thus n1 and n2 can be taken as two mutually exclusive events. In terms of Venn Diagrams (Fuzzy Set Theory) this can be represented by two nonoverlapping circles of total area

Atotal = n1 + n2

I use the notion of "circle areas" for visualization purposes, only. Now if there is overlap (intersection of events; the two circles have a common region), we are describing a case where a query for k12 (k1 followed by k2) yields n12 documents in which k1 and k2 co-occur. Thus, n12 > 0 and the total area occupied by the circles is

Atotal = n1 + n2 - n12

The overlapping fraction is simply a ratio I elect to call "The c-index". Thus for the present scenario

c12 = n12/(n1 + n2 - n12)

A c-index can then be defined as the fraction of documents in which queried terms co-occur. More technically, this fraction is an intersection/union ratio. This scenario can be expanded to include more terms and set of results. For instance, in FIND ALL, for three terms co-occurring in a certain number of results (for now, we are taking 3 at a time from a pool of 3),

c123 = n123/Atotal

where Atotal is a different Atotal, not the previous Atotal, obviously. (See "Combinations and Permutations" post for other c-indices. I will expand on clusters of co-occurrences later, a phenomenon not found in the k1,k2 case.)
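
To illustrate the three-term case, here is a small Python sketch built on invented sets of document ids. It assumes that Atotal is again the union of the result sets, in line with the intersection/union definition above; under that assumption the standard inclusion-exclusion identity gives Atotal = n1 + n2 + n3 - n12 - n13 - n23 + n123. The set contents and the function name c_index_sets are hypothetical.

```python
def c_index_sets(*result_sets):
    """Intersection/union ratio for any number of result sets."""
    intersection = set.intersection(*result_sets)
    union = set.union(*result_sets)
    return len(intersection) / len(union) if union else 0.0

# Invented document ids returned for k1, k2 and k3 respectively.
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}
s3 = {4, 5, 6}

# Atotal computed from counts alone, via inclusion-exclusion.
n1, n2, n3 = len(s1), len(s2), len(s3)
n12, n13, n23 = len(s1 & s2), len(s1 & s3), len(s2 & s3)
n123 = len(s1 & s2 & s3)
a_total = n1 + n2 + n3 - n12 - n13 - n23 + n123

c123 = c_index_sets(s1, s2, s3)
print(a_total, n123, c123)
```

With real search engines only the counts are available, not the sets, so the pairwise counts n12, n13 and n23 would each need their own FIND ALL query.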

In general,

1. For semantically connected or topically-related terms and concepts, co-occurrence can be taken for a measure of semantic association within a given IR system or search engine database. For thesaurus-based terms, c-indices are a measure of synonymity.
2. c-indices can be used for building libraries of synonyms, for find-similar features, and for query expansion in clustering algorithms.
3. Co-occurrence trends and patterns can be discovered by measuring c-indices as a function of time (time series analysis). Term co-occurrence, trends, patterns may be different across queried IR systems and search engines. Such studies provide non trivial datamining knowledge we can tap into.


TERMS CO-OCCURRENCE AND QUERY CONDITIONS

1. Co-occurrence experiments should be done in both FIND ALL and EXACT modes. Transpose indices should also be computed --reasons were given, accordingly.
2. Term co-occurrence experiments in FIND ANY should be avoided.
3. Bogus terms (stopwords, delimiters, etc) should be avoided in query experiments. Ideally, we use synonyms or topically connected terms extracted from a dictionary or thesaurus. A keyword tracker utility showing topically-connected (and sounded) combinations of queried terms comes in handy -- more on this is coming.
4. FIND ALL queries produce results with term co-occurrence incidents without regard for sequence.
5. EXACT queries produce results with term co-occurrence with regard for sequence.
6. EXACT queries can be interpreted differently by different queried systems. What a human considers a phrase is not what an IR system considers a "phrase". For example, a system ignoring "a", periods, hyphens, etc may interpret

...rap. A music...
...rap. Music...
Rap - Music
...rap music...

etc. as co-occurrence instances for the query "rap music". (Switch to country music, rock or salsa, if you wish).
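
The following Python sketch shows one way a parser could end up treating all four snippets as matches for "rap music". It is not a description of any real engine: the ignored-word list, the lowercase-alphanumeric tokenizer and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical list of words the system drops before matching.
IGNORED_WORDS = {"a"}

def tokens(text):
    """Lowercase the text, drop punctuation and ignored words, keep word order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in IGNORED_WORDS]

def contains_phrase(text, phrase):
    """True if the phrase tokens appear consecutively in the text tokens."""
    t, p = tokens(text), tokens(phrase)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

snippets = ["...rap. A music...", "...rap. Music...", "Rap - Music", "...rap music..."]

for snippet in snippets:
    print(repr(snippet), contains_phrase(snippet, "rap music"))
```

A system that kept periods or the word "a" as boundaries would count fewer of these snippets, and so would report a smaller n12 in EXACT mode.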

How an IR system interprets an exact sequence affects c-index calculations, especially for queries conducted in EXACT mode.

In general, term co-occurrence could be defined differently across IR systems and search engines since

1. each system parses information differently.
2. IR and commercial search engines are constantly updating their document databases.
3. IR and commercial search engines may be changing their parsing algorithms.
4. we may be dealing with non validated systems (faulty systems).


LIMITATIONS OF C-INDICES

1. When computing a c-index, one merely computes a fraction extracted from retrieved results from a queried system without considering the actual structure of the information contained in those results.
2. Crude c-indices tell nothing about WHERE precisely the terms co-occur in a document, how far apart the terms are from each other, and how many instances of co-occurrence (frequency of co-occurrence) are taking place in a given document. Other types of correlation indices are necessary.
3. C-indices are not silver-bullet solutions, magic pills or one-size-fits-all (more than a limitation of the theory, this is a limitation of the tester).

Finally we have the dreaded PRECISION and RECALL issues (IR folks know what I'm talking about). In addition, I haven't discussed issues related to the number of results shown versus the number of results present in a collection but not retrieved or not shown. Thus the c-indices we calculate are estimated values. Time series analysis helps to assess many other issues not addressed by merely crunching n1, n2, n12, etc. values. Let's stick to the basics first.


C-INDEX MISINTERPRETATIONS

c-index experiments can be conducted with synonyms or topically connected terms, or as part of query expansion experiments. But the c-index should not be applied indiscriminately. Why?

Simply put, because term co-occurrence and semantic connectivity are among the most misunderstood areas of semantics, and they are certainly not equivalent concepts. As in criminal courts, evidence of association is not proof that a crime was committed. However, repeated incidents of association, and patterns and trends of co-occurrence measured over time (time series co-occurrence), raise a red flag for any law and order or homeland security investigator. Right? [Incidentally, semantic co-occurrence research is a hot topic in the gov. See references in this thread.]

A c-index is just a tool and, like any tool, it can be used incorrectly and can lead to wrong conclusions. For example, suppose I query in FIND ALL a system that accepts single-letter words; i.e., I run a test for k1 = a, k2 = u and k12 = a u. True, I may end up with a c-index value showing strong or weak term co-occurrence, but what good is that test for, let's say, improving semantic connectivity in a web document?

I can run similar tests for letters, numbers, delimiters, stopwords, etc., all sorts of nonsense k's, and still measure co-occurrence. So what? I can even intermix terms from dissimilar languages and extract c-indices. So what? Certainly these results may interest linguistics folks, but they probably do no good for average web documents. Right? Then the tool is no longer a tool or even a toy but an artifact.

There are now many "c-index" tools flying around online or being tested in the background by what I call "keyworkers" and "keymarketers". One such tool is found at

http://graphnical.com/cindex/

Records reveal all sorts of good and bad selections of keywords, from stopwords to carefully crafted combinations of terms. The page claims the tool returns results in the FIND ALL and EXACT query modes but shows no way for users to specify the mode. At least I couldn't see a way to do this. Unless a selection feature for the modes is added, those results must be put into question. Still, I welcome this and any other tools. We need more of these. Me? I stick to the original, my "The C-Index Calculator", unless a better one is constructed by those fine developers out there.


WHAT'S AHEAD


1. The k1,k2,k3 case revisited: CLUSTERS OF CO-OCCURRENCES
2. Prefixes, stems and copy style
3. Term Co-occurrence at the document level (GRANULARITY OF CO-OCCURRENCES)

Point 2 may interest writers and 3 may interest those conducting "keyword density" experiments.

Feel free to comment on the above before we start with new material. Please forgive me any irritating typo or rational horror I may have committed. Have a great weekend all.


Orion

orion
06-13-2004, 12:26 AM
Errata (Pardon, please)

In previous post I changed the c123 expression to read

c123 = n123/Atotal

which is the correct one (and is more complete). As I mentioned in the post the k1,k2,k3 case results in a co-occurrence clustering effect, not found in the k1,k2 case. How Atotal is defined depends on the particular clustering case. This phenomenon requires a complete set of c-indices, as we will see. Again, I ask for your indulgence.

Orion

DanThies
06-13-2004, 11:41 PM
I expect that you'll find changes to the tool that was posted shortly, as we're all enjoying this quite a bit. We're working on something a little different, which will hopefully be posted soon as well.

orion
06-14-2004, 05:40 PM
To Dan:

1. "I expect that you'll find changes to the tool that was posted shortly, as we're all enjoying this quite a bit. We're working on something a little different, which will hopefully be posted soon as well." Hi Dan. Hope you had a great weekend. I honestly wish more tools would hit the market soon. I endorse the idea and welcome all your excellent projects. The more analytical tools out there, the better. I cannot wait to read your series of articles.


This post may be a little abstract. In order to explain things to a wider audience, I am trying to conduct this thread in non-technical terms and am therefore avoiding standard IR nomenclature. Let's resume the discussion. Please feel free to comment. The post is organized as follows:

1. k1,k2,k3 case revisited
2. clusters of co-occurrences


THE K1,K2,K3 CASE REVISITED


In a general sense, if we use Venn Diagrams and the notion of area A (for visualization purposes, only), for non mutually exclusive events we have for the k1,k2, k12 case a working expression of the form

Atotal = n1 + n2 - n12

If I want to express this in terms of probabilities, then I can cite Theorem 9.5a from "Handbook of Applied Mathematics for Engineers and Scientists"; Max Kurtz, McGrawHill, 1991:

"Let E1 and E2 denote two overlapping events. If an event E results from the occurrence of E1 and E2 or both, the probability of E is "

P(E) = P(E1) + P(E2) - P(E1 and E2)

which is of the same form as our simplistic working expression. The c12-index is then an intersection/union fraction representing the degree of term co-occurrence between k1 and k2

c12 = n12/(n1 + n2 - n12)

A similar treatment applies to the transpose case (c21). In both cases there is only one co-occurrence region. Draw two overlapping circles and convince yourself.
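
To see the intersection/union reading concretely, here is a minimal sketch on a toy collection. The five documents and the helper names hits and c_index are invented for illustration; only the formula c12 = n12/(n1 + n2 - n12) comes from the discussion above.

```python
# Hypothetical toy collection: document id -> set of terms it contains.
docs = {
    1: {"rap", "music"},
    2: {"rap"},
    3: {"music", "salsa"},
    4: {"rap", "music", "salsa"},
    5: {"country"},
}

def hits(*terms):
    """Ids of documents containing all given terms (FIND ALL, order ignored)."""
    return {d for d, words in docs.items() if set(terms) <= words}

def c_index(n1, n2, n12):
    """Intersection/union ratio computed from result counts."""
    union = n1 + n2 - n12
    return n12 / union if union else 0.0

r1, r2 = hits("rap"), hits("music")
n1, n2, n12 = len(r1), len(r2), len(hits("rap", "music"))

c12 = c_index(n1, n2, n12)
c21 = c_index(n2, n1, len(hits("music", "rap")))

# The count-based formula agrees with the set-based intersection/union ratio.
print(c12, len(r1 & r2) / len(r1 | r2), c21)
```

In this FIND ALL toy the transpose c21 equals c12 by construction. On a real engine, or in EXACT mode where sequence matters, the two can differ, which is why the transpose is worth computing.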

The k1,k2,k3 case is not that simple. As we will see, this case involves different term co-occurrence scenarios I like to call "clusters of co-occurrence". Citing Theorem 9.5b from "Handbook of Applied Mathematics for Engineers and Scientists"; Max Kurtz, McGrawHill, 1991:

"Let E1, E2 and E3 denote three overlapping events. If an event E results from the occurrence of E1, E2, or E3 or any combination of them, the probability of E is "

P(E) = P(E1) + P(E2) + P(E3) - P(E1 and E2) - P(E1 and E3) - P(E2 and E3) + P(E1, E2 and E3)

In terms of our working expression we can write

Atotal = n1 + n2 + n3 - n12 - n13 - n23 + n123

therefore we end with...


CLUSTERS OF CO-OCCURRENCES

If we define c-indices as intersection/union ratios, then we need to write the following indices

c123 = n123/Atotal
c12 = n12/Atotal
c13 = n13/Atotal
c23 = n23/Atotal

Thus if we talk about co-occurrence we need to be very careful since we need to know all terms in Atotal (and we haven't yet considered combinations, permutations and transpositions for this scenario!).
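
The three-term bookkeeping can be sketched as follows. The result sets r1, r2 and r3 are made-up document ids, used only to check that the Atotal expression equals the size of the union; the function names are likewise illustrative.

```python
def a_total(n1, n2, n3, n12, n13, n23, n123):
    """Union size of three overlapping result sets (inclusion-exclusion)."""
    return n1 + n2 + n3 - n12 - n13 - n23 + n123

def cluster_indices(n1, n2, n3, n12, n13, n23, n123):
    """All four c-indices of the k1,k2,k3 case, each over the same Atotal."""
    total = a_total(n1, n2, n3, n12, n13, n23, n123)
    return {
        "c12": n12 / total,
        "c13": n13 / total,
        "c23": n23 / total,
        "c123": n123 / total,
    }

# Hypothetical result sets (document ids) for k1, k2 and k3.
r1 = {1, 2, 3, 4}
r2 = {3, 4, 5, 6}
r3 = {4, 6, 7}

counts = dict(
    n1=len(r1), n2=len(r2), n3=len(r3),
    n12=len(r1 & r2), n13=len(r1 & r3), n23=len(r2 & r3),
    n123=len(r1 & r2 & r3),
)

indices = cluster_indices(**counts)
print(a_total(**counts), len(r1 | r2 | r3), indices)
```

Note that seven separate counts are needed before a single index can be computed, which is the practical problem with this case.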

From the practical standpoint, this represents a problem. If I instruct an IR system to only FIND ALL n123 documents (i.e., those containing co-occurrence instances of k1, k2 and k3), I also need to know all the terms in the Atotal expression in order to calculate the c-indices.

As discussed in previous posts of this thread, I have found that by redefining a new pair of k1 and k2 as

new k1 = k1 and k2
new k2 = k3

I can "reduce" the scenario to a query expansion case. Still, there are cases in which this may not be done, since defining new k1 = k1 + k2 imposes a predefined sequence on the candidates for new k2 = k3.

For this reason, I try to redefine the new k1 using previously tested k1 and k2 terms since I know a priori their degree of connectivity (from my thesaurus). I also use a correlation matrix of co-occurrences when performing such tests. This approach allows me to conduct co-occurrence reinforcement tests of previously tested pairs.


CHAINED CO-OCCURRENCES

Suppose that

1. we have 3 terms k1,k2 and k3.
2. There is no co-occurrence between k1 and k3 at all (or if there is, it is negligible)
3. There is co-occurrence between k1 and k2 and between k2 and k3. So let's call k2 a "semantic bridge" term.

Visualize this as a circle overlapping two circles, each one at opposite sides of the circle in the middle (the "bridge"), so there are only two intersection regions at opposite extremes.

If we know

1. n1, n2 and n3 by querying separately k1, k2 and k3
2. n12 = # results containing k1 and k2
3. n23 = # results containing k2 and k3

Atotal = n1 + n2 + n3 - n12 - n23

Defining a c-index as an intersection/union ratio we can write

c12 = n12/Atotal
c23 = n23/Atotal

[This simple scenario leads to several combinations and permutations of the c-indices. Can you formulate these?]

The idea of terms acting as "semantic bridges" allows me to:

1. measure the degree of semantic commonalities and differences of terms in IR systems
2. conduct exploration experiments with large chains of term co-occurrence incidents in databases and in individual documents.
3. use "bridge" terms to conduct semantic connectivity enhancement studies for lossy terms in documents or in collection of documents. [Let's call this contextual information reinforcement studies.]

Please feel free to comment.


Orion

orion
06-15-2004, 01:05 PM
I want to thank Garret French Editor of iEntry's eBusiness channel for recognizing he made some errors in his calculations and interpretations of the c12-indices discussed in post #83 of this thread,

http://www.webproworld.com/viewtopic.php?t=21161&postdays=0&postorder=asc&start=25&sid=b14ef6f86308c781debdd3ede43ec516

My gratitude and respect go to Garret and Webpronews. We are all here trying to understand how emerging semantic search engine technologies are hitting the market and how these "smart" IR systems will use association concepts when indexing, organizing, retrieving and assigning semantic relevance to pieces of information.

Taking small steps now will make the learning curve less steep later. I am very happy that the SearchEngineWatch Forum is not just another marketing forum. Talking about the new breed of semantic search engines that sooner or later SEOs/SEMs will have to deal with... Here is a revealing technical report on the SCORE architecture.

http://lsdis.cs.uga.edu/lib/download/S+2002-SCORE-IC.pdf

The "Managing Semantic Content on the Web" paper describes with diagrams the IR architecture of SCORE and how it should work. Quote from paper:

"The benefits of semantic associations are best realized in applications that integrate data, metadata, and knowledge queries."

So, such semantic associations can be explored with the use of term co-occurrences. Furthermore, the use of terms acting as commands (not mere boolean operators) reminds me of the concept of semantic "bridges" I introduced in my previous post. Using semantic bridges to connect terms with weak co-occurrence not only reinforces semantics in a document but also simplifies the "flow" and grouping of similar concepts and ideas. I'm currently looking for someone interested in teaming with me or in funding this type of research (emarketers, IR folks, university centers, gov folks, all are welcome).

For those interested in HAL and Microsoft's MINDNET Project, the following links

http://userweb.piasanet.com/tyale/prospect.htm
http://userweb.piasanet.com/tyale/mindnet.htm

are a 'must read'. HAL attempts to use associations "to have a default body of knowledge to better deal with any knowledge the average user"... The idea of using associations is to "provide HAL with common sense, to better recognize the implications of a user's statement to its entire accumulated body of knowledge." But HAL does more than this.

For Microsoft fans, the above reference states that there is a "semantic connectivity project at Microsoft Research, called MindNet, with currently more than seven million word associations." Clearly, semantic connectivity machines are here to stay.

On other matters:

A friend suggested that I provide an example of "bridge" terms. Good point. In theory any term can be a "bridge". As with any physical bridge, the best ones are those that are well constructed and frequently used. Thus I found that the best and most versatile "bridges" are

1. those semantically connected to the terms to be associated (which should themselves be loosely connected when co-queried; i.e., n13 = 0 or negligible).
2. those with high frequency across the target IR system or intended search engine.

Here are some examples I tested yesterday in Google in FIND ALL mode. Results may have changed since then. (Read the previous post before proceeding with the test cases.) In all cases k2 is the "semantic bridge". Although I'm using some primitive examples, the cases may be relevant to copywriting style, I think (but I could be wrong).

Case 1

I'm trying to semantically connect or improve the semantic association between k1=pharmaco and k3=narcotraffic. I selected the term "drug", a term with high frequency and somewhat associated with pharmaco and narcotraffic

k1=pharmaco = 117,000
k2=drug = 39,700,000
k3=narcotraffic = 679

k12=pharmaco drug = 38,900
k23=drug narcotraffic = 401
k13=pharmaco narcotraffic = 0


Case 2

I'm trying to semantically connect or improve the semantic association between k1=effervescing and k3=narcotraffic. I selected the term "drug", a term with high frequency and somewhat associated with effervescing and narcotraffic (BTW effervescing comes from effervescence meaning the formation of bubbles in a solution. Note: effervescence narcotraffic is also disconnected in Google.)

k1=effervescing = 6,460
k2=drug = 39,700,000
k3=narcotraffic = 679

k12=effervescing drug = 551
k23=drug narcotraffic = 403
k13=effervescing narcotraffic = 0

While not the best examples, they illustrate the concept of "bridges".
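
For readers who want the chained c-indices for the two cases, here is a minimal sketch. It assumes the chained definition Atotal = n1 + n2 + n3 - n12 - n23 applies, which requires n13 = 0 as reported in both cases. The counts are the Google figures quoted above; the function name chained_indices is made up for illustration.

```python
def chained_indices(n1, n2, n3, n12, n23):
    """c12 and c23 for a chain k1 - k2 - k3, assuming n13 = 0 (so n123 = 0)."""
    total = n1 + n2 + n3 - n12 - n23
    return n12 / total, n23 / total

# (n1, n2, n3, n12, n23) as reported in Case 1 and Case 2; k2 = drug is the bridge.
cases = {
    "pharmaco / drug / narcotraffic": (117_000, 39_700_000, 679, 38_900, 401),
    "effervescing / drug / narcotraffic": (6_460, 39_700_000, 679, 551, 403),
}

for name, counts in cases.items():
    c12, c23 = chained_indices(*counts)
    # Expressed in parts per thousand (ppt), as elsewhere in the thread.
    print(f"{name}: c12 = {1000 * c12:.4f} ppt, c23 = {1000 * c23:.4f} ppt")
```

Because "drug" is so frequent, Atotal is dominated by n2 and both indices come out small. That is the price of picking a high-frequency bridge.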

Challenge: Find a k3 in which k1=nigritude and the bridge is k2=ultramarine.

Talking about challenges, it would be no surprise if someone came up with a contest such as "Find two non-bogus terms with the strongest degree of co-occurrence (c12-index) in Google".

Finally, about the famous NU contest. We tested the semantic connectivity of k1=nigritude and k2=ultramarine over a period of time. The time series for the c12-index was enlightening.

Last values were (in Google, FIND ALL mode)

k1=nigritude = 1,160,000
k2=ultramarine = 1,170,000
k12=nigritude ultramarine = 525,000
c12-index = 290.86 ppt

Thus the activity behind the NU phenomenon was measurable.

What's ahead: Prefixes, suffixes, stems

Feel free to comment.


Orion

nuclei
06-15-2004, 01:45 PM
I'm currently looking for someone interested in teaming with me or in funding this type of research (emarketers, IR folks, university centers, gov folks, all are welcome).


interesting

orion
06-16-2004, 12:54 PM
PAUSING...

Before discussing co-occurrence of stems and prefixes I would like to make some refinements on the use of c-indices as a measuring device for the so-called Google Bombs phenomenon.

I also want to make clear I DON'T PROMOTE the idea of "bombing" Google or any search engine or even like such idea. Still I feel I need to present the following information since it affects upcoming subjects on c-index extractions.

As mentioned in this thread, I defined a c-index as an intersection/union ratio. Such ratios are present in many forms and can be measured for non mutually exclusive events. I avoided its probabilistic definition and derivation to simplify the discussion. For now, knowing that c-indices are a measure of the degree of co-occurrence is enough.

For terms co-occurrence, measuring such ratios from data extracted from commercial IR systems and search engines is tricky since one must preselect the proper query mode. If we are interested in conducting specialized searches, we also need to define WHERE to search. This is a topic I delayed in order to present the basics first. (I know SEOs "cannot wait" for the topic).

If I want to conduct precise c-index calculations in conjunction with specialized searches, I can order the system to search, for example, in titles, URLs, or links. [I will expand on these types of searches soon since they may interest keyword researchers.] For now let's stick to searches in links.

For the Nigritude Ultramarine "bomb", in Google I can conduct a "search only in links" in either FIND ALL or EXACT mode. In Google --using FIND ALL, in-links only, in ppt and to two decimal places-- this is what I found today

k1=nigritude || n1 = 53,000
k2=ultramarine ||n2 = 55,200
k12=nigritude ultramarine || n12 = 11,600

c12=120.08 ppt

Compare with the c-index calculated in previous post.

The difference in c-indices is due to the fact that we are measuring term co-occurrence in links only, which is in harmony with the original concept of link bombing. Strictly from the concept of bombing Google via links (not with the "global" information present in documents), this is a more accurate ratio and a better way of tracking Google bombs.

Still, if I am interested in tracking and identifying the onset of a "Google bomb", I would use both c-index calculations (find-in-links and find-anywhere-in). Here is a list of some link bomb c-indices (as of today's conditions: in Google, FIND ALL, in-links-only, case insensitive, 4 decimal places)

Nigritude Ultramarine
k1=nigritude=53000 k2=ultramarine=55200 k12=nigritude ultramarine=11600
c12 = 120.0828 ppt

Miserable Failure
k1=miserable=6420 k2=failure=21500000 k12=miserable failure=555
c12=0.0258 ppt

Talentless Hack (BTW. This is one of the first bombs described by Adam Mathes)
k1=talentless=271 k2=hack=15600000 k12=talentless hack=21
c12= 0.0013 ppt

Feel free to replicate tests with "President Waffles" or your favorite bomb, keeping in mind that c-indices can be different when calculated using in-entire documents, in-title only, in-link, in-urls, etc. In fact, different type of information and analysis can be extracted from such "localized" indices. Such information could interest SEOs/SEMs. More on this is coming.
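
For anyone replicating these figures, a minimal sketch of the calculation follows. The counts are the ones reported in this thread (Google, FIND ALL, June 2004); the function name c_index_ppt and the dictionary layout are just illustrative choices.

```python
def c_index_ppt(n1, n2, n12):
    """c12 = n12 / (n1 + n2 - n12), expressed in parts per thousand."""
    return 1000.0 * n12 / (n1 + n2 - n12)

# In-links-only counts (n1, n2, n12) as reported above.
in_links = {
    "nigritude ultramarine": (53_000, 55_200, 11_600),
    "miserable failure": (6_420, 21_500_000, 555),
    "talentless hack": (271, 15_600_000, 21),
}

# Anywhere-in-document counts for the NU pair, from the earlier post.
anywhere = {"nigritude ultramarine": (1_160_000, 1_170_000, 525_000)}

for bomb, counts in in_links.items():
    print(f"{bomb:24s} in-links c12 = {c_index_ppt(*counts):.4f} ppt")

nu_any = c_index_ppt(*anywhere["nigritude ultramarine"])
print(f"nigritude ultramarine    anywhere c12 = {nu_any:.2f} ppt")
```

Keeping one dictionary per search location (links, titles, URLs, whole documents) makes it easy to put the "localized" indices side by side.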

Observations

1. Notice how a c12-index assesses the contributions of individual k's to a two-term bomb (for a one-term bomb we are out of luck).
2. Notice the degree of success of the bombs. While the dramatic differences could be the result of purging actions by Google since the bombings, notice the strong presence of the nigritude ultramarine bomb still in the retrieved link collection.

Finally, if I use a time series analysis of c-indices,

1. I'm providing Google's researchers with a simple method for monitoring the onset of potential Google bombing activities. [That's a freebie I'm giving to Google researchers.] However there is a drawback since...

2. someone can use the method herein described to measure the success of a Google bomb or purging and remedy actions from Google after or during a bomb.

3. someone may be tempted to use the above procedure to measure the success of competing Google bomb contests starting at a given time, t.
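
One very simple way to watch a c-index time series is sketched below. The thread does not specify a detection rule, so the "at least double the previous reading" threshold is purely an assumption, and the readings are invented placeholders rather than measured data.

```python
def flag_jumps(series, factor=2.0):
    """Return the dates whose c-index is at least `factor` times the previous reading."""
    flagged = []
    for (_, prev), (date, cur) in zip(series, series[1:]):
        if prev > 0 and cur >= factor * prev:
            flagged.append(date)
    return flagged

# Hypothetical (date, c12 in ppt) readings for one pair of terms; not real data.
readings = [("day 1", 0.5), ("day 2", 0.6), ("day 3", 4.0), ("day 4", 4.2)]

print(flag_jumps(readings))
```

A real monitor would also have to cope with noisy result counts and index updates, so a smoothed series or a longer comparison window would probably be needed.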

Feel free to comment. Anyone interested in commenting through private email, feel free to do so through the private email feature of the SEW forum.

I'm working on the stems and prefixes material.


Orion

orion
06-16-2004, 09:47 PM
To runarb's thread, post#1:

Welcome and feel at home. Excellent reference material.

To Red5's thread, post#1:

"I am avidly following Orion's excellent discussion regarding Keywords Co-occurrence and Semantic Connectivity, and I'm aware that he's just about to start talking about word stemming and prefixes, so I hope I don't preempt him too much here. Sorry, Orion, if I do!" Hi, Red5. Happy to see someone interested in this area of IR. The more IR concepts we introduce to the mainstream the better. Please feel free to elaborate on the topic of stems. BTW, excellent references.

To Incubator, Garrett's thread, post #2:

Welcome and feel at home. Excellent referenced link, http://javelina.cet.middlebury.edu/cns/Contextual_Network_Graphs.pdf, on LSI which touches many areas of SC (Semantic Connectivity).

To Garrett's thread, post #1:

1. "Does a search engine algorithm interact with semantic connectivity or is it merely a measure of something that exists in a database?" Hi, Garrett. Welcome and feel at home. Term co-occurrence measured with c-indices is intended to measure the degree of semantic connectivity. We can talk about this at two different levels, i.e. (a) at the database level and (b) at the level of individual documents. So far I am introducing the concept at the database level. Discussions at document level are coming. In theory, SC can be measured without regard for the IR system or database under consideration and at both levels.

The semantic connectivity concept is not new and is somehow implicit in most IR and ranking algorithms based on Salton's models, in which cosine similarities are used. This includes most current IR systems and search engines that use a Salton component for retrieval and ranking. My model is somewhat different in the sense that it expands on the idea of intersection/union ratios (of non mutually exclusive events) as a simplified component for semantic measurements.

2. "What is the unit of measurement of semantic connectivity? Is it the ppt? What are the other measures." Good question. There is no unit for semantic connectivity, at least not in my c-index model, since it is merely a dimensionless ratio. Since this ratio runs from 0 to 1 and usually leads to small values, I elected to express it in parts per thousand (ppt) by merely multiplying it by 1000. I could have multiplied the ratio by 100, then expressing it as a %.

3. "What are stop words" Streams of characters to be ignored by the IR system during parsing, indexing and retrieval. They are defined depending on how the IR system was programmed. They may include terms that are too common, but are not limited to that. For example, an IR system (e.g., a vertical portal) about "jobs" will probably consider "jobs" a stop word, since querying the term in its IR system is probably redundant.

4. "What are query mode conditions?" Selection of query modes. Most search engines use FIND ALL, FIND ANY, EXACT, etc.

5. "If everyone let semantic connectivity drive their copy writing how would this affect a database? Would that database lose its value to the searchers?" This question describes a speculative scenario. I leave that question open for others to speculate or to comment on. Any comment is welcome. Hope this helps, Garrett.

To Dodger:

Thanks for the posts at other forums.


I'm trying to keep discussions relevant to this thread in this thread. Still I welcome other threads on IR topics. The more IR topics we have in The SEW Forum, the better.


Orion

Incubator
06-16-2004, 10:04 PM
To Incubator, Garrett's thread, post #2:

Welcome and feel at home. Excellent referenced link (http://javelina.cet.middlebury.edu/...work_Graphs.pdf) on LSI which touches many areas of SC (Semantic Connectivity).

Orion

Thanks for the acknowledgement. I also found the references (links) at the end of that .pdf very informative.

Cheers

Wayne

yleewolf
06-17-2004, 08:19 AM
Hi everyone. Great thread so far even if it is a little heavy on the numbers in places! I was really looking for a simple interpretation of the following c-indices results.

k1=baccarat
k2=gambling
k12=baccarat gambling
c-index=46

k1=baccarat
k2=baccarat gambling
k12=baccarat baccarat gambling
c-index=435

I really don't know what this is telling me. If I search for baccarat,does it mean that there is a much higher occurence of the phrase "baccarat gambling" in the results than there is the word "gambling"?

Can k1 be used as the keyword that you want a page to be optimised for and then use k2 to find phrases that appear more often in the results for k1?

Anybody else in the same boat as me? I appreciate that this is not a simple black and white tool with a pot of gold on the end of it but it would be nice to have some basic theoretical uses for it.

Thanks for your help in advance.

Wolf

orion
06-17-2004, 11:08 AM
Welcome Wolf to this thread. It's an honor to having you here.

I'll try to answer your question to the best of my knowledge. First, I'll need more information

1. What query mode conditions did you use?
2. From which IR system/search engine did you extract the c-indices?

Without this information it is hard for us to assess the question.

Regarding: "If I search for baccarat,does it mean that there is a much higher occurence of the phrase "baccarat gambling" in the results than there is the word "gambling"?" c-indices do not measure occurrences. A c-index measures the degree of co-occurrences of two or more k's in a given database. Occurrence and co-occurrence are two different things.

Both the database and query mode must be defined. Let's assume you do a search in FIND ALL mode. A search for "baccarat" in a database should return documents containing "baccarat" as well as "baccarat gambling". A search for "baccarat gambling" should return documents containing "baccarat gambling" regardless of sequence. Regardless of the occurrence numbers, we still need to measure their co-occurrence and have something to compare with.

On the other hand, if I do the query experiment in EXACT mode, a search for "baccarat gambling" should return documents with regard for sequence and containing the phrase "baccarat gambling" as well as documents interpreted as containing this as a "phrase" (for a "phrase" as perceived by an IR system -not a human-, see previous posts). These should be a subset of the results obtained in FIND ALL mode.

Regarding: "Can k1 be used as the keyword that you want a page to be optimised for and then use k2 to find phrases that appear more often in the results for k1?" Excellent question. I haven't yet discussed semantic connectivity at the document level. We're still discussing it at the database level. Two different things. That's coming.

To answer the question without getting into details, in theory the answer is a conditional "yes". Why conditional? One must consider the whole picture; i.e., proper copywriting style for the target market space, topic and demographic, what exactly a client wants to target, etc. c-indices are not silver bullets. I'm still unveiling the basics at the database level. When we discuss term co-occurrences and semantics at the document level, we will discuss these and other topics as well. The whole thesis of this thread is to introduce SEOs/SEMs to analytical tools and perhaps in the process try to remove some trial-and-error or second-guessing approaches from the scene.


Note. A word on repeating k's for c-index extractions. c-indices are intended to measure co-occurrence. Repetition of terms can impose a false bias on the c-index values which otherwise would not be there. Certainly using something like k1 = k2 = T, where T is a term, produces results with no semantic significance (at least not from the co-occurrence standpoint).

I hope this helps.


Orion

yleewolf
06-17-2004, 11:42 AM
Hi Orion

Thank you for your speedy reply. In answer to your first questions, I put the terms into the k1 and k2 boxes here: http://graphnical.com/cindex/ . What I am trying to do is to use the same k1 and measure the indices against different k2 phrases. So, referring back to my earlier post, if a google user queries for baccarat, do my c-index results tell me that I am probably better to use the phrase "baccarat gambling" in my content rather than just "gambling" due to its much higher correlation factor or are the results distorted by the repetition of the word "baccarat" in the k12?

Sorry if I'm pre-empting a future discussion.

Regards

Wolf

orion
06-17-2004, 04:09 PM
To Wolf:

"Thank you for your speedy reply. In answer to your first questions, I put the terms into the k1 and k2 boxes here: http://graphnical.com/cindex/ . What I am trying to do is to use the same k1 and measure the indices against different k2 phrases. So, referring back to my earlier post, if a google user queries for baccarat, do my c-index results tell me that I am probably better to use the phrase "baccarat gambling" in my content rather than just "gambling" due to its much higher correlation factor or are the results distorted by the repetition of the word "baccarat" in the k12?" Sorry I couldn't answer as speedily as before. Hope this helps.

You may have answered your own question ("are the results distorted by the repetition of the word "baccarat" in the k12?"). Certainly.


First let's define the query conditions for measuring co-occurrence in the queried database: Target is Google, experiment is conducted in FIND ALL and find-anywhere-in-document. (For FIND ALL in-title, url-only, etc... see previous posts).

Your question involves two different issues

(a) how users formulate queries to the database (users' behaviors)
(b) what is interpreted as being present in the queried database (IR behaviors)

If we want to consider co-occurrence not just from what is in the database, but from what users actually search for (users' behaviors), then we should compute c-indices using the query mode conditions used by average users. Thus, the discussion that follows applies to average users (a full discussion on users' query behaviors is ahead of us).

Average users tend to use the default query mode of the queried database or IR system when searching for information.

Fortunately, the default mode used by Google is FIND ALL anywhere-in-document (and without regard for sequence), not FIND ANY (also known as the OR mode). In Google, FIND ANY or OR is the "at least one of the words" option of the Advanced Search tool.

Consequently in FIND ALL mode we should expect documents containing all terms of the query more likely to be found and returned. Thus

1. A search for "baccarat gambling" in Google should return documents containing "baccarat" and "gambling" without regards for sequence, location, how many times the terms occur in the documents and how.
2. Similarly a search for "baccarat baccarat gambling" should return documents containing "baccarat" and "gambling" without regards for sequence, location, how many times the terms occur in the documents and how. Any difference between 1 and 2 should be relatively small. See results below.

The relative difference between 1 and 2 is 2,000/945,000 = 0.002 or about 0.2%.

Target: Google
Mode: FIND ALL
Date/Run: 06-17-04/11:19 AM

k1=baccarat ; n1=1,860,000
k2=baccarat gambling ; n2=947,000
k12=baccarat baccarat gambling ; n12=945,000
c12=507.52 ppt

k1=baccarat ; n1=1,860,000
k2=gambling ; n2=17,300,000
k12=baccarat gambling ; n12=947,000
c12=52.00 ppt

Observations: 507.52 >> 52.00 since

gambling=17,300,000 >> baccarat gambling=947,000

Reusing "baccarat" in "baccarat gambling" adds bias to the denominator of the calculated c value: n2 shrinks from 17,300,000 to 947,000 while n12 barely changes.
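
The two runs above can be reproduced with a short sketch, using the reported Google counts. The last step is an added observation rather than something stated in the thread: when k2 already contains k1, nearly every k2 document is also a k12 document, so n12 is close to n2 and c collapses to roughly n2/n1.

```python
def c_index_ppt(n1, n2, n12):
    """c12 = n12 / (n1 + n2 - n12), expressed in parts per thousand."""
    return 1000.0 * n12 / (n1 + n2 - n12)

n1 = 1_860_000  # baccarat

# k2 = "baccarat gambling" (repeats k1), k12 = "baccarat baccarat gambling"
repeated = c_index_ppt(n1, 947_000, 945_000)

# k2 = "gambling", k12 = "baccarat gambling"
plain = c_index_ppt(n1, 17_300_000, 947_000)

# If n12 were exactly n2, the formula would reduce to n2 / n1.
limit = 1000.0 * 947_000 / n1

print(f"repeated term: {repeated:.2f} ppt")
print(f"plain pair:    {plain:.2f} ppt")
print(f"n2/n1 limit:   {limit:.2f} ppt")
```

So the high value for the repeated query mostly reflects how many "baccarat" documents also mention "gambling". It is not a stronger connection between two independent terms.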


Finally,

A search for "baccarat" in Google should return documents containing "baccarat", "baccarat gambling" and also "baccarat baccarat gambling".

Certainly, these results should differ if we conduct the experiment under other query conditions. If we conduct the experiment in EXACT mode (with regard for sequence, see posts #52 and #86), then a search for "baccarat baccarat gambling" should return only documents containing that sequence. This could include documents containing the "phrase" and delimiters; something like

.....baccarat baccarat gambling...
.....baccarat. baccarat gambling...
.....baccarat - baccarat gambling...
.....baccarat | baccarat gambling...

...etc..

See posts in this thread on what is/is not an EXACT search and a "phrase". It all depends how the IR system was programmed to parse the information. If I instruct a system not to ignore hyphens or pipes and use EXACT mode, documents containing

.....baccarat - baccarat gambling...
.....baccarat | baccarat gambling...

will probably be ignored when I search for "baccarat baccarat gambling".


A c-index is a tool, and like any tool it can be misused, or its results misinterpreted or artificially inflated (see post #86 of this thread). If we want to make correlations, draw conclusions from what the c-index values represent, and include users' query behaviors, then we need to consider just that: how many average users search for "baccarat baccarat gambling" instead of "baccarat gambling", and so on.

As mentioned in this thread, a combination of users' keyword trackers (what we search), c-index values (what/where we search) and actual user behavior (how/where we search) is necessary. This is where we are heading.

To conclude, c-indices are intersection/union ratios of non mutually exclusive events, or if you wish, probability ratios. The higher that ratio the higher the probability of co-occurrence of randomly selected terms.

For a c12-index, the extreme case c = 1 with

k1 = k2 = k12

is an illusion and does not exist, since it would require that

c = n12/(n1 + n2 - n12) = n12/n1, in which case we cannot talk in terms of non mutually exclusive overlapping events.

Consequently, it is not surprising that, say, a c12-index value approaches 1 as more term repetitions are included in the associated k's.

This is not necessarily a drawback of the theory. If I ran a search engine, I would suspect artificially inflated c values to be

1. the result of an artificial bias imposed in the query, which I of course cannot control or avoid, or
2. the result of true co-occurrence spamming/bombing activities taking place in the database collection, which I can detect, control and monitor over time (see posts #90, #92 and "Google Bombs" material).


For the online tool you and others may be using, check post #86 of this thread. While a useful tool, it can only be used with Google. They should have placed some instructions on how to use it or how to interpret results. Still I welcome the development of this and additional tools.

Hope this helps.


Orion

orion
06-17-2004, 06:19 PM
Errata (Oops, Pardon, Perdon)

I am correcting my previous post. At some point I originally wrote the following, and I quote:

"For a c12-index, the extreme case c = 1 is an illusion and does not exist, since it would require that

k1 = k2 = k12
c = k12/(k1 + k2 - k12) = k12/k1, in which case we cannot talk in terms of non mutually exclusive overlapping events."

It should read as follows (already corrected):

"For a c12-index, the extreme case c = 1 with

k1 = k2 = k12

is an illusion and does not exist, since it would require that

c = n12/(n1 + n2 - n12) = n12/n1, in which case we cannot talk in terms of non mutually exclusive overlapping events."

Note the use of n's in the c-expression and the sequence of the lines. I edited the post to reflect my intentions. At this time of the day, boy, I'm tired. Again I ask for your indulgence for this gross rational horror.


Orion

yleewolf
06-17-2004, 07:19 PM
Man! Thank you for your detailed response! You kind of lost me halfway through but I appreciate you trying to explain it to me! There obviously is no simple answer to my question but just so we're absolutely clear:

I am not talking about the SE user searching for anything other than "baccarat". This keeps it simple as it is EXACT and FIND ALL. What I thought the c-indices that I posted earlier were telling me was that my content on a page optimised for the word "baccarat" would benefit more from the inclusion of the words "baccarat gambling" rather than the single word "gambling".

I now know that this is not what the indices are saying as the repetition of "baccarat" creates a distortion in the results. Can we conclude then that this tool cannot help us when trying to pick relevant words for content optimisation? Can I stick to my knowledge of the English language, common sense and all the other things we content builders have to rely on?

Orion - I really appreciate the time and effort you have gone to in trying to answer my question. I shall continue to read with interest!

Regards

Wolf

DanThies
06-17-2004, 08:08 PM
Orion, all I can say is just keep going. :)

What we're working on is a potential aid to keyword research. Well, it's really just an excuse for me to play around with this stuff but it might become a tool some day. Here's what we're working on, I'd like to hear from anyone here, if you have ideas to improve this:

1) Starting with a single search term, we take the top 10 results from each of 3 search engines (Google, Teoma, and Yahoo), which gives us a list of up to 30 URLs.

2) For each URL in the list, we use our spider to index the page, and extract a list of 1, 2, and 3 word search terms from the page, ignoring stop words. This list represents "candidate" search terms that may be related to our original search term.

3) For each search term in our list, we perform a c-index calculation using Google results, with exact phrase matching, sort and present the results.
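The three steps above can be sketched roughly as follows. This is not Dan's actual code: the stop-word list is a small placeholder, n-grams are built after stop words are removed (one possible reading of "ignoring stop words"), and `hit_count` is a hypothetical caller-supplied function standing in for whatever returns a result count for a query. The counts in `FAKE_COUNTS` are made up purely for illustration.

```python
import re
from typing import Callable

# Placeholder stop-word list (assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "on", "with"}


def candidate_terms(text: str, max_words: int = 3) -> set[str]:
    """Extract 1- to max_words-word candidate terms, skipping stop words."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


def rank_by_c_index(seed: str, candidates, hit_count: Callable[[str], int]):
    """Score each candidate against the seed term with c = n12 / (n1 + n2 - n12)."""
    n1 = hit_count(seed)
    scored = []
    for term in candidates:
        n2 = hit_count(term)
        n12 = hit_count(f"{seed} {term}")
        union = n1 + n2 - n12
        if union > 0:
            scored.append((term, n12 / union))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up counts, for illustration only.
FAKE_COUNTS = {
    "web hosting": 1000,
    "domain": 800,
    "web hosting domain": 200,
    "recipes": 5000,
    "web hosting recipes": 10,
}
ranked = rank_by_c_index("web hosting", ["domain", "recipes"], lambda q: FAKE_COUNTS.get(q, 0))
print(ranked[0][0])  # domain
```

Crawling more pages only enlarges the candidate set fed to `rank_by_c_index`; the scoring step itself does not change.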

I assume that our results will improve as we expand the number of URLs to crawl. We haven't done so yet, but we're also considering a deep crawl based on, say, an entire Open Directory category.

So far, this hasn't proven extraordinarily effective, but it has helped us discover some related search terms when our standard tools failed us. We may need to bolster the automated (crawler-driven) discovery with a lexical database like WordNet (http://www.cogsci.princeton.edu/~wn/), not sure when we'll get around to that.

nuclei
06-17-2004, 08:19 PM
I think the WordNet database would be an excellent source for what you are doing. A very solid way of gathering related keywords, Dan, though I would think you should still test everything against a thesaurus somehow.

orion
06-17-2004, 09:52 PM
To Dan:

The procedure you describe sounds interesting.

Let me add that c-indices extracted from a database are a kind of "global" or "macro" measure. They may tell nothing about what's going on at the document level or miss important information. Other indices I will discuss deal with issues at the document level ("micro"). Semantic connectivity at the document level is treated a bit differently.

I don't understand the "I assume that our results will improve" part. That part lost me. Could I assume, Dan, that you are talking about success at the document level and therefore rankings? Am I on the right track here?


To Nuclei:

1. "I think the wordnet database would be an excellent source for what you are doing." Nuclei, you are right on the money. In theory we can use many of the word repositories collected over the years by IR and TREC folks. WordNet is one such effort.

Such repositories serve a valid purpose at the research level and for testing controlled collections. Still, for targeting noisy environments (the commercial Web), current, new or branded products in the market, current clients or Fortune 500 accounts (e.g. "Google Stocks", "Buy Viagra Pills" or "Caltrate Pills Direct", etc.), such repositories are quite limited. Even so, they provide some good guidelines. As with any tool, they serve a purpose, and when misused they may lead to messy results.


To Wolf:

1. "What I thought the c-indices that I posted earlier were telling me was that my content on a page optimised for the word "baccarat" would benefit more from the inclusion of the words "baccarat gambling" rather than the single word "gambling"." Hi, Wolf.

(a) Users searching for "baccarat" should find documents optimized for "baccarat" and "baccarat gambling".
(b) Users searching for "baccarat gambling" should find documents optimized for "baccarat gambling" but should miss documents optimized for "baccarat" unless they search in OR mode (FIND ANY).

2. "Can I stick to my knowledge of the english language, common sense and all the other things we content builders have to rely on?" Please feel free to do so. SEOs/SEMs should stick to any proven method that benefits their clients.

3. "Can we conclude then that this tool cannot help us when trying to pick relevant words for content optimisation?" c-indices extracted from a database collection measure the degree of co-occurrence in that particular database. It is a "macro" measure. For topically related terms, that degree is a measure of the semantic connectivity of the terms in that database at the "macro" level. We haven't yet discussed co-occurrence and semantic connectivity at the level of individual documents and what it can do for SEOs.

Not sure how to say this, and I ask for your indulgence in advance...

Somehow, SEOs give me the impression that benefits derived from c-indices (or from any tool) can only be visualized, measured or rationalized at the document level.

There are two types of co-occurrence, one taking place at the database level and another taking place at the document level. When we discuss semantic connectivity and co-occurrence at the document level, things may look a lot clearer, I hope.

Depending on which IR research group you ask, the following and other concepts can be defined or interpreted differently at the document level (micro) and at the database level (macro):

1. Frequency
2. Similarity
3. Relevance Entropy
4. Co-occurrence

Orion

secretniche
06-18-2004, 12:33 AM
Hi!

Orion being a scientist probably 50% right or 50% wrong depending on reader's attitude. What I'm trying to say is his analysis is based on search queries dictionary which in itself is a subset of a natural English dictionary. Just like someone here has already noticed - this subset may be and is heavily biased by various SEO copywriting schemes.

Visitors use natural language patterns and co-occurrences, and their search strings do not coincide with the desires of copywriters. Sometimes they do, though. That's why I feel that Orion is at least 50% correct.

If we take a topic (let's say, recreation), then this topic may be represented by a superset of related/relevant word patterns. Treating this superset as a class (a conceptual idea), we can sort it into two distinct groups: class properties (where, how, how expensive, when) and class functions (what for, what it does, why we need it, its main raison d'etre, so to speak). Class functions are constant for the whole class, while properties are variables. For example, class SEAT always has the same function (to sit on, to rest one's bottom), and each defined/specified set of property values gives a different object/instance of the SEAT class: chair, stool, sofa, step, cheap leather chair, expensive antique sofa, etc. ad nauseam.

What this means is that users build their search queries based on their wishes, problems and desires; that is, they are (mostly) searching for a result, for a function. Web masters compile a bunch of key words based on their analysis of the Google dictionary and their perception of an object's quality and availability in a space/time frame (class properties). In other words, users are searching for what web masters do not provide.

We probably would better off when apply an analysis of a real language structures and patterns. I have an extensive experience in field (live) sales and I know that people are actually looking for results, not the properties. They are not searching a specific brand of shoes (normally), but for something good to wear and look, and be appreciated by their friends and co-workers.


Regards, secretniche

orion
06-18-2004, 04:46 PM
To secretniche:

Welcome secretniche. It is an honor to have you in this thread. I feel I need to provide a full response to your post. I'll try to do this to the best of my knowledge. Let's see.


1. "Orion being a scientist probably 50% right or 50% wrong depending on reader's attitude." I agree. I make mistakes and certainly I don't own the "truth". See post #18. I agree that some may think I'm 50% right/wrong, but not because I'm a scientist.

2. "What I'm trying to say is his analysis is based on search queries dictionary..." I disagree. Although there are some marginal references to dictionary queries in this thread, the topic of the thread is term co-occurrence and semantic connectivity. I'm simply introducing analytical tools to SEOs/SEMs. One of these, the c-index concept, is based on probability theorems which I'm presenting in non-technical terms.

3. "We probably would better off when apply an analysis of a real language structures and patterns." I disagree. The more we know about how IR systems and search engines interpret information, rationalize semantics, assign weights to terms and rank results, the better.

4. "I have an extensive experience in field (live) sales and I know that people are actually looking for results, not the properties." I agree that clients may only care about results. Analysts care about how to get there.

5. "They are not searching a specific brand of shoes (normally), but for something good to wear and look, and be appreciated by their friends and co-workers." When we talk about branding-driven or non branding-driven searches we need to be very careful. Here you are 50% right/wrong, depending on web surfer's needs or reader's attitude.

It is true that users looking for video games or shoes may conduct queries for "video games" or "shoes" regardless of brand. But I don't think we can extrapolate this scenario to other scenarios and then draw the above conclusions.

We need to consider many aspects, not just science or marketing for that matter. We need to consider, for example, the audience and demographic we want to target. Certainly users interested in buying, let's say, Viagra or Cialis pills online may include "Viagra", "Cialis", etc. (a brand) in their queries instead of searching for "erectile pills". Check how many branded keywords are present in organic and paid results in Google, Overture, etc.

Last but not least:

From a practical standpoint, a c-index can be used to assess the degree of co-occurrence (and patterns) of branding-driven searches. Here is an example:

IR/Search Engine: Google
Mode: FIND ALL
Time/Date: 06-18-2004 at 3:35 PM

k1=viagra n1=12,300,000
k2=pills n2=11,900,000
k12=viagra pills n12=2,310,000
c12=105.53 ppt

k1=cialis n1=3,090,000
k2=pills n2=11,900,000
k12=cialis pills n12=956,000
c12=68.12 ppt

k1=erectile n1=1,580,000
k2=pills n2=11,900,000
k12=erectile pills n12=373,000
c12=28.46 ppt


Observations:

1. viagra pills and cialis pills produce more relevant results than erectile pills
2. overall results are viagra > cialis > erectile.
3. Degree of connectivity is viagra 105.53 >> cialis 68.12 >> erectile 28.46
4. Degree of strong brand association with pills is viagra >> cialis


A similar assessment and treatment can be applied to paid-results regardless of products, services or brands.

c-indices (term co-occurrence measures) are then non-trivial marketing metrics. Note that we have elucidated which of the three patterns "____ + pills" is most connected in Google.

Hope this helps in some way to address your post.


Orion

DanThies
06-18-2004, 10:00 PM
To Dan: The procedure you describe sounds interesting.

Let me add that c-indices extracted from a database are a kind of "global" or "macro" measure. They may tell nothing about what's going on at the document level or miss important information. Other indices I will discuss deal with issues at the document level ("micro"). Semantic connectivity at the document level is treated a bit different.

I don't understand the "I assume that our results will improve" part. That part lost me. Could I assume Dan that you are talking about success at the document level and therefore rankings? Am I on the right track, here?

We aren't trying to optimize web pages, we're looking for relevant search terms to use in developing content. The reason I expect to get better results when we crawl more pages, is that it will increase the number of candidate search terms available to us.

So if, for example, we're trying to find terms related to "web hosting," we crawl the 30 top-ranked pages for "web hosting," then extract all possible search terms that exist on those 30 pages. We will find many related search terms within those pages, but there would certainly be other related search terms that don't exist within that small population of documents.

Using Google to calculate the c-indexes is expedient because it requires no additional development. Ideally, what I would like to do is spider a group of web sites from the appropriate DMOZ category to create an index, and use the co-occurrence within that subset of the web, to see if it works better. One of these days, not soon.

AussieWebmaster
06-19-2004, 12:07 AM
I was wondering when someone would look for a way to apply this. I like that one.
Good idea there mate

nuclei
06-19-2004, 12:20 AM
Why not soon Dan? We already have the code and the ability to pull dmoz data easily and quickly. Possibly a co-owned solution would be ideal for both of our groups.

DanThies
06-19-2004, 03:11 AM
Why not soon Dan? We already have the code and the ability to pull dmoz data easily and quickly. Possibly a co-owned solution would be ideal for both of our groups.
Email me, mate. :) Right now I'm putting time and resources into building other applications, to solve other business problems. But if you wanna smash some code together, let's talk.

orion
06-19-2004, 11:16 AM
To Dan and nuclei:

Dan and nuclei, your ideas and thinking make a lot of sense and are practical. I would like to see the day this and other IR material is presented at SEO/SEM conferences.

In the meantime, I'm happy to see that some in this thread are thinking about valid applications of the c-index concept. I welcome those efforts. What I have disclosed so far are the basics for the better things to come.

If your groups have the resources (e.g., machine/man-power), I would be very interested in participating in a three-way project. Aside from c-indices, I have some specific metric instruments in the area of datamining that I would like to see exploited by others. Feel free to contact me through the private email feature of this thread. The best things are ahead of us.


Orion

orion
06-22-2004, 12:46 PM
I am posting this information on two threads of SEW Forums:

1. Fox News & Danger Of Citing Search Counts, started by Danny Sullivan
http://forums.searchenginewatch.com/forum/showthread.php?t=299&highlight=news

2. Keywords Co-Occurrence and Semantic Connectivity, started by Orion (me)
http://forums.searchenginewatch.com/forum/showthread.php?t=48

I hope this post clarifies in some way the misconception news writers and reporters have about interpreting search results. Let's see.

I agree with Danny and most readers. I would go a little further than that. The use or misuse of query-driven absolute results, as absolute ranking results for that matter, often leads to misleading statistics. As Rich Ord, a writer from WebProNews, put it, search results are not Gallup polls.

To assess association of terms and concepts we need to conduct semantic association studies at both the database and document levels.

For the terms and phrases discussed in Danny's thread, we conducted a semantic connectivity analysis at the database level and these are the results (as of today). Results may change over time. See my thread for the theory behind the results. I use term co-occurrence at the database level (term co-occurrence and semantic connectivity at the document level will soon be discussed in the thread).


This is what we got.


IR/SEARCH ENGINE: GOOGLE
QUERY MODE: FIND ALL
DATE/TIME: 06-22-04 AT 10:30 AM
CASE: INSENSITIVE


k1=Fox n1=20,500,000
k2=anti-american n2=546,000
k12=Fox anti-american n12=50,800
c12=2.42 ppt

k1=bbc n1=25,400,000
k2=anti-american n2=546,000
k12=BBC anti-american n12=51,400
c12=1.98 ppt

k1=white house n1=12,400,000
k2=anti-american n2=546,000
k12=white house anti-american n12=143,000
c12=11.17 ppt

k1=bush n1=31,400,000
k2=anti-american n2=546,000
k12=bush anti-american n12=338,000
c12=10.69 ppt

k1=danny sullivan n1=1,290,000
k2=anti-american n2=546,000
k12=danny sullivan anti-american n12=8,990
c12=4.92 ppt

RESULTS

1. c-index values, in ppt: 11.17 (WhiteHouse) > 10.69 (Bush) > 4.92 (Danny) > 2.42 (FOX) > 1.98 (BBC)

2. The c-indices (fractions of co-occurrence) indicate that there is only a loose term co-occurrence in all cases (around 1% or less or, if you wish, around 11 ppt or less). If we take the c-index results blindly, then danny sullivan "would be" more anti-american than Fox or BBC. Similarly, we could blindly and incorrectly assume that Fox is slightly more anti-american than BBC.

3. Do the homework: results in EXACT mode (find "phrases") demonstrate no significant correlation at all. Cluster similarity results don't help, either.
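A small sketch contrasting the two ways of reading the figures posted above: ranking by the raw co-occurrence count n12 versus ranking by the c-index. The counts are the ones from this post; the variable names are only illustrative.

```python
N2 = 546_000  # results for "anti-american"

# term -> (n1, n12), taken from the run posted above
COUNTS = {
    "fox": (20_500_000, 50_800),
    "bbc": (25_400_000, 51_400),
    "white house": (12_400_000, 143_000),
    "bush": (31_400_000, 338_000),
    "danny sullivan": (1_290_000, 8_990),
}


def c_index(n1: int, n2: int, n12: int) -> float:
    """Intersection/union ratio c = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


# Ranking by absolute counts (what a reporter citing "hits" would do).
by_count = sorted(COUNTS, key=lambda k: COUNTS[k][1], reverse=True)

# Ranking by c-index, which normalizes by how common each term is.
by_c = sorted(COUNTS, key=lambda k: c_index(COUNTS[k][0], N2, COUNTS[k][1]), reverse=True)

print(by_count)
print(by_c)
```

The two orderings disagree, and neither one says anything about who "is" anti-american; both only describe how often the strings co-occur in one database on one day.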


CONCLUSION

1. If we wrongly cite absolute results then we must wrongly assume that the white house and bush "are" more anti-american than Fox or BBC.

2. It is clear that absolute results cited by "reporters" to reinforce a story are meaningless. Without a clear understanding of term and concept association, semantic connectivity and the underlying theory, those results qualify as fabricated facts, at least in my book.

3. The above results are valid in the queried database only, i.e., Google.

4. For a true confidence level analysis, we need to track c-index values as they change over time (a time series co-occurrence study). Only then can we talk in terms of trends, patterns and "spikes" of significance. Certainly conducting such studies during Spring or Summer is not the same as conducting them during Fall or 30 days before or after Election Day 2004.


I invite news writers as well as marketers interested in conducting similar semantic connectivity studies in connection with phrases or concepts (to be targeted or marketed; i.e., slogans, brands, catchy political gimmicks, etc.) to revisit my thread on term (or concept) co-occurrence and semantics.

They may find new applications for my c-indices that I may have overlooked. I can provide any interested party with sound scientific marketing research data or help them assess their current data. Just ask.

Sorry to step in. I cannot help myself when others try to misguide the so-called "public opinion".


Some time today or tomorrow I will start with term co-occurrence at the document level, an area that may interest SEOs.

Orion

Errata/Clarification: Rich Ord, mentioned above, is a staff writer at WebProNews and is actually the CEO of iEntry, Inc. His article can be found at http://www.webpronews.com/news/ebusinessnews/wpn-45-20040622CitingSearchResultCountsIsNotNews.html

orion
06-23-2004, 12:14 AM
This post is organized as follows:

1. c-indices revisited
2. Terms Co-occurrence Levels


C-INDICES REVISITED

IR systems and search engines are designed to assess information at two different levels: at the database level ("macro") and at the level of individual documents ("micro"). So far in this thread I have presented a simple method for measuring term co-occurrence at the database level. This has been accomplished by means of a simple ratio I call the c-index.

Technically speaking, a c-index is just an intersection/union ratio, or a ratio of probabilities. Such ratios are also probability values and can be measured whenever we deal with ANY NUMBER or TYPE of non mutually exclusive events. Consider three non mutually exclusive events, E1, E2, and E3. If an event E results from the occurrence of E1, E2, or E3 or any combination of them, the probability of E is

P(E) = P(E1) + P(E2) + P(E3) - P(E1 and E2) - P(E1 and E3) - P(E2 and E3) + P(E1, E2 and E3)

(See "Handbook of Applied Mathematics for Engineers and Scientists"; Max Kurtz, McGraw-Hill, 1991)

Thus, c123 = P(E1,E2 and E3)/P(E)

Note that for 3 events, this is just ONE of the intersection/union ratios, since there are a total of 4 overlapping regions. The others are given by the corresponding c12, c13 and c23, in which case we have a cluster of co-occurrences.
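To make the three-event expression concrete, here is a toy sketch in which each event is the set of document IDs returned for one term. The sets are invented for illustration, and the function names are hypothetical.

```python
def union_size(e1: set, e2: set, e3: set) -> int:
    """|E| by inclusion-exclusion, mirroring the P(E) expression term by term."""
    return (
        len(e1) + len(e2) + len(e3)
        - len(e1 & e2) - len(e1 & e3) - len(e2 & e3)
        + len(e1 & e2 & e3)
    )


def c123(e1: set, e2: set, e3: set) -> float:
    """Three-way intersection/union ratio."""
    return len(e1 & e2 & e3) / union_size(e1, e2, e3)


# Toy document-ID sets (illustrative only).
E1 = {1, 2, 3, 4}
E2 = {3, 4, 5, 6}
E3 = {4, 6, 7}

print(union_size(E1, E2, E3))  # 7
print(c123(E1, E2, E3))
```

With real search counts the set sizes are simply replaced by n1, n2, n3, n12, n13, n23 and n123, which means seven queries per c123 value.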

This scenario is not present when we examine terms co-occurrence between two non mutually exclusive events. For the particular case of two non mutually exclusive events the above expression reduces to

c12 = P(E1 and E2)/P(E)

where now we redefine P(E) as

P(E) = P(E1) + P(E2) - P(E1 and E2)

or using our now well known working expression

c12 = n12/(n1 + n2 - n12)

which is of the same form as the so-called Jaccard's Coefficient [http://www.dcs.gla.ac.uk/Keith/Chapter.3/Ch.3.html] and the correlation index expression described in chapters 2 and 5 of Modern Information Retrieval (Baeza-Yates, Ribeiro-Neto). This is where the similarities end.

The c-index I describe is simply an intersection/union ratio of non mutually exclusive events, regardless of the number or type of events we are dealing with; no more, no less.

For example, for the particular case of three events, in which events E1 and E3 are mutually exclusive and E2 does overlap (is non mutually exclusive) with E1 and E3 we have a scenario in which event E2 acts as a "bridge" between two mutually exclusive events; formally

P(E) = P(E1) + P(E2) + P(E3) - P(E1 and E2) - P(E2 and E3)
c12 = P(E1 and E2)/P(E)
c23 = P(E2 and E3)/P(E)

These "bridges" can exist in many directions and their sole purpose is to associate apparently semantically disconnected terms (or stems or concepts). In addition to being part of queries, their associative functions are somewhat similar to command terms used in natural language IR systems (which do not rely on boolean operators), a subject outside the scope of this thread.
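A toy sketch of the "bridge" case, again with invented document-ID sets: E1 and E3 share no documents, while E2 overlaps both. Both ratios use the three-event union as the denominator, following the expressions above.

```python
# Toy document-ID sets (illustrative only).
E1 = {1, 2, 3}
E2 = {3, 4, 5}   # the "bridge"
E3 = {5, 6}

# E1 and E3 are mutually exclusive, so the E1-and-E3 terms vanish.
assert not (E1 & E3)

union_size = len(E1) + len(E2) + len(E3) - len(E1 & E2) - len(E2 & E3)

c12 = len(E1 & E2) / union_size
c23 = len(E2 & E3) / union_size
c13 = len(E1 & E3) / union_size  # zero: no direct co-occurrence

print(union_size, c12, c23, c13)
```

E1 and E3 never co-occur directly (c13 = 0), yet both have a non-zero ratio with E2, which is what lets E2 associate two otherwise disconnected terms.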


TERMS CO-OCCURRENCE LEVELS

When measured at the database level, incidents of terms co-occurrence (as measures of semantic connectivity) are defined in terms of number of results (documents) retrieved. At the document level, term co-occurrence must be defined differently. The analogy is somewhat straightforward.

At the macro level we deal with

Target: A Database
Events: # Documents in the database associated to a query

At the micro level we deal with

Target: A Document
Events: # Passages in the document associated to a query

How "passages" are defined affects the notion of term co-occurrence.
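A toy sketch of that last point. The passage definition used here (fixed-size, non-overlapping word windows) is only an assumption chosen for illustration and is not the definition developed later in the thread; `passages` and `passage_c_index` are hypothetical names.

```python
import re


def passages(text: str, size: int) -> list[list[str]]:
    """Cut a document into consecutive windows of `size` words (assumed definition)."""
    words = re.findall(r"\w+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def passage_c_index(text: str, k1: str, k2: str, size: int) -> float:
    """c = n12 / (n1 + n2 - n12), where the events are passages, not documents."""
    n1 = n2 = n12 = 0
    for passage in passages(text, size):
        has1, has2 = k1 in passage, k2 in passage
        n1 += has1
        n2 += has2
        n12 += has1 and has2
    union = n1 + n2 - n12
    return n12 / union if union else 0.0


TEXT = "Baccarat is a card game. Casino gambling includes baccarat and roulette."

print(passage_c_index(TEXT, "baccarat", "gambling", 4))  # 0.0
print(passage_c_index(TEXT, "baccarat", "gambling", 6))  # 0.5
```

Same document, same two terms, and the value changes only because the passage length changed.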

Here we could invoke many frameworks, from association clusters, metric clusters, and scalar clusters to OKAPI-based schemes. Since I'm trying to present analytical tools to a wider audience (SEOs), I will introduce the simplest things first, hoping some may find the information useful.

Why should SEOs know about things going on at the macro and micro levels? Simply put, because IR systems and search engines interpret, evaluate and characterize information at the two levels. To illustrate, consider the concept of term weights.

Those familiar with Salton's Term Weight schemes know that the concept of frequency upon a query is defined differently at the database and document levels. In one we deal with document frequencies and in the other we deal with term frequencies. The combination of the two is what provides the basis for term weights and similarity studies.

[Thus, the generalized idea that keyword densities equate to term weights is indeed misleading and will not necessarily work with IR systems based on a particular variant of Salton's Vector Space Model.]
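A minimal sketch of how the two frequencies combine, using the textbook tf x idf form (raw term frequency times the log of the inverse document frequency). This is one classic variant among many and is not claimed to be the weighting any particular search engine uses; the three-document collection is made up.

```python
import math
from collections import Counter


def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """Textbook weight: tf (document level) * log(N / df) (database level)."""
    tf = Counter(doc_tokens)[term]                 # term frequency in this document
    df = sum(1 for doc in corpus if term in doc)   # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Made-up three-document collection.
CORPUS = [
    ["baccarat", "gambling", "baccarat"],
    ["gambling", "casino"],
    ["gambling", "poker"],
]

print(tf_idf("baccarat", CORPUS[0], CORPUS))  # 2 * ln(3)
print(tf_idf("gambling", CORPUS[0], CORPUS))  # 0.0
```

"gambling" appears in every document of the toy collection, so under this scheme its weight is zero no matter how densely it is repeated on the page; density alone says nothing about the database-level half of the weight.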

To conclude, we need to understand the whole picture and things going on at both the database and document level.

What's Ahead

1. Terms co-occurrence at the document level.
2. Document passages and readability concepts.
3. Cluster measures.


Orion

orion
06-24-2004, 04:02 PM
So far we have examined term co-occurrence at the macroscopic level of databases. We have used c-index values for this purpose. When discussing c-values and co-occurrence events we have considered the behavior of average users; i.e., that average users tend to search

1. using queries consisting of 2 to 3 terms.
2. using the default query mode of the target search engine.

In Google and other search engines, the default mode is FIND ALL, anywhere in the document (title, headlines, links, body, urls, etc). We have explained that this mode returns results containing all terms specified in the query, without regard for sequence and proximity.

Before proceeding with the discussion, I would like to introduce the notion of "length scales of observation". For the present discussion, I define a scale of observation as the level at which incidents of co-occurrence are evaluated (i.e., where we search). The "length" part is somewhat relative and evident.

LENGTH SCALES OF OBSERVATIONS AT THE DATABASE LEVEL

When we extract c-index values from a large database, we obtain the fraction of documents in which terms co-occur. In the process, we are explicitly relying on the effectiveness of the algorithms employed by the IR system or search engine. While we cannot avoid this dependency, we can improve the quality of our datamining experiments by resizing the scale of observation (where we evaluate). We can do this by extracting c-index values from specific portions of the database or from specific databases.

For example, some search engines have advanced features which allow users to search in FIND ALL mode within a given page identifier or descriptor (e.g., search only in titles, links, urls, etc). Thus, a savvy SEO or SEM specialist should be able to conduct c-index datamining experiments in, for instance, titles only. He/she can then identify candidate combinations of terms and compare them with those of his/her client's competitors.
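A toy sketch of resizing the scale of observation. The four "documents" below are invented, and `field_c_index` is a hypothetical helper; against a real engine one would instead issue field-restricted queries and plug the returned counts into the same ratio.

```python
# Invented mini-collection; each document has a title and a body.
DOCS = [
    {"title": "Baccarat rules", "body": "learn baccarat gambling strategy"},
    {"title": "Online gambling guide", "body": "baccarat and poker gambling tips"},
    {"title": "Baccarat gambling tips", "body": "gambling on baccarat tables"},
    {"title": "Crystal shop", "body": "baccarat crystal glassware"},
]


def field_c_index(docs: list[dict], k1: str, k2: str, field: str) -> float:
    """c = n12 / (n1 + n2 - n12), counting only matches inside one field."""
    n1 = n2 = n12 = 0
    for doc in docs:
        words = set(doc[field].lower().split())
        has1, has2 = k1 in words, k2 in words
        n1 += has1
        n2 += has2
        n12 += has1 and has2
    union = n1 + n2 - n12
    return n12 / union if union else 0.0


print(field_c_index(DOCS, "baccarat", "gambling", "title"))
print(field_c_index(DOCS, "baccarat", "gambling", "body"))  # 0.75
```

The events are still "number of documents upon a query"; only the portion of each document that is allowed to match has changed.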

Similarly, a specialist interested in targeting a specific market space, region or demographic may want to extract c-index values from focused databases. For example, instead of using Google, Yahoo or MSN, he/she may want to use, let's say, a regional version of the engine (France, UK, Spain, etc.). Similarly, specialists interested in providing Spanish SEO services may want to extract c-indices, for example, from Google Puerto Rico, Yahoo Mexico, etc.

Finally, specialists interested in targeting specific market spaces may want to extract c-index values from specific categories and subcategories from directories. Here a suitable choice could be ODP -dmoz.org-, a topic-specific vertical portal or any industry-focused directory.

The general idea is to come up with terms, not only semantically connected but that make sense to the target audience. When conducting such experiments, however, the specialist may want to examine more than one regional database or directory, even if these target the same market space, region or demographic.

In the above cases, we are reducing the scale of observation (where we search) without changing the definition of the events under consideration (the events are # documents upon a query).

At the document level the story is slightly different. Term co-occurrence and semantic association of terms (semantic connectivity) become a local phenomenon. Here, one way or another, we may need to deal with

(a) the length scale used
(b) how we define the events
(c) readability issues

Thus, trying to blindly apply a macro tool to a micro system will simply not work.
Fortunately, we can resort to a vast body of knowledge that predates the Internet and search engines: Readability Studies and Classic IR research.

(a), (b) and (c) can be treated using the notion of "passages". In readability studies, a passage is defined as a segment of a written work of defined length. For our discussion, we will visualize "passages" as portions of information in web documents. These passages can be dissected and quantitatively analyzed. In the next posts of this thread, I will explain how to do this and how SEOs can add several analytical tools to their "SEO Toolbox".

Orion

AussieWebmaster
06-24-2004, 07:48 PM
how does copyright work? If I cut and paste all this and put it in a book.....

steve sardell
06-24-2004, 11:10 PM
This is a thread requiring thought. It is much appreciated. FWIW it is the first time I have ever used the *show printable version* feature on a forum.

nuclei
06-24-2004, 11:42 PM
how does copyright work? If I cut and paste all this and put it in a book.....

you could be sued :p

The first provable publishing date of any work is legally a copyright. Since this is posted on a public forum, I would think that constitutes provable first publishing.

orion
06-25-2004, 12:47 AM
Welcome to this thread, steve sardell. Please, feel at home. It is an honor to have you here.


To Nuclei:

1."you could be sued" You're right on the money, nuclei.


To AussieWebmaster:

I'll address your post without hard feelings.

1. "how does copyright work?"

You are describing a lovely speculative scenario that has already been thought through, and a path already walked. I'll let lawyers and copyright speculators do their job. I keep dated hardcopy and originals of all material I've posted.

2. "If I cut and paste all this and put it in a book....."

You're formulating this statement in 1st person, so I'll address it accordingly by using "you" or "your".

If you or, let's say, some greedy SEO or publisher (no reference to you, aussie) tries to "cut and paste all this and put it in a book", perhaps four possible scenarios may take place:

a. you (or that person) may need to answer some questions from JupiterMedia.
b. you (or that person) may need to answer some questions from me.
c. you (or that person) may make a couple of bucks...
d. you (or that person) may lose credibility in the industry, since SEOs, SEMs and clients monitoring and reading the SearchEngineWatch Forum know the ideas presented herein are not yours (or that person's) but mine.

I'm not a writer but a scientist (certainly not a naive one) who has been on both sides of SEs, SEOs and NASDAQ. If someone wants to write a to-the-nitty-gritty book with me on the information herein presented, hey, let's talk. I welcome the project. Perhaps we could publish more advanced things and make the SEO industry less speculative and more aware of analytical tools.


Orion

DanThies
06-25-2004, 01:00 AM
I think AW was just joking there, comrades... this is a very information rich thread, if we can get back to the subject.

nuclei
06-25-2004, 01:37 AM
I know Aussie was joking. I was just answering his question :p

AussieWebmaster
06-25-2004, 05:27 PM
Oh well guess I will have to win Mega Millions tonight then...

orion
07-03-2004, 12:50 AM
Sorry I've been away from the thread. I was attending to routine research work and life issues.

I'm back.

I feel the nature, extension and format of the material I was planning to post at this thread could better be addressed at the http://www.miislita.com site.

Why did I start this thread? Simply put, because I feel anything that contributes to the professional growth and specialization of SEOs/SEMs won't hurt them. On the contrary, the more they know about IR/AI topics the better. Perhaps the best things are ahead of us.

I strongly believe that semantic analytics, rather than keyword-driven searches and node-to-node analyses, is where we (SEOs/SEMs/SEs) are heading.

For those interested in an in-depth discussion on the topics I'll be posting on, feel free to visit my series of articles on keywords co-occurrence and semantic connectivity available at http://www.miislita.com/semantics/c-index-1.html

A recent article (http://www.miislita.com/semantics/c-index-4.html) addresses the question of co-occurrence (c-indices) vs sequencing (EF ratios), raised as the "C2 COKE issue" in the thread at http://forums.searchenginewatch.com/forum/showthread.php?threadid=439

A possible explanation of this problem is given in another article (http://www.miislita.com/semantics/c-index-6.html). For anyone interested in advanced issues, these can be addressed through regular email.

I'll address general research questions on co-occurrence and semantics at this thread; of course, if SEW editors allow it.


Dr. E. Garcia,
Mi Islita.com

Orion

hrih
07-05-2004, 10:46 PM
When semantic markup goes bad by Matthew Thomas (http://mpt.net.nz/archive/2004/05/02/b-and-i)

orion
07-06-2004, 12:03 AM
Welcome to this thread hrih. Please feel at home.

Excellent reference on semantic tagging in structured html. Those issues are being addressed with XML semantics. This thread discusses word semantics.

Orion

orion
07-07-2004, 11:05 PM
Incubator asked in the Term Vector and Keyword Weight thread http://forums.searchenginewatch.com/forum/showthread.php?p=4942#post4942

the following question:

"Landing page, 500 words, keyphrases used 3 max per page, keyphrases being 2-3 words deep across the content of the landing page. How do you recommend the implementation of those words? I am not speaking about layout but how semantic terms will fall accordingly across the remaining content
Cheers and great to see your posts again !!!"

I am answering this question here since I feel it is a generic question on implementing keyword phrases in a page, rather than on how to use TVT. For a generic implementation of combinations of words, I would use the following strategy. (No point intended in making this a "standard" procedure.)

(a) The first thing I would check is how connected the individual candidate terms are in a given combination. For this I would do a co-occurrence analysis with c-indices and sort the combinations according to degree of connectivity. Ideally an m x m matrix will give me all possible combinations.

(b) For the top combinations, I would proceed to do a sequencing analysis with EF ratios. An EF ratio gives the probability that a document retrieved for a query in FIND ALL mode (the default in most engines) contains the query terms as an exact sequence. See http://www.miislita.com/semantics/c-index-4.html

(c) For positioning combinations with more than 2 terms, a fair strategy is: after the previous two steps, try to position the page at the top for combinations of two keywords. Once ranked, try to append a third keyword where the combination appears in the document. Check the new ranking. Often this requires a proper analysis of the passages in which the combinations appear.

I will discuss this part soon, since it involves term co-occurrence at the document level, a subject this thread will be moving to.

Again, no point intended in making the above steps a "standard" procedure.
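
Step (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (c_index, rank_pairs) and all result counts below are hypothetical, and each pair is treated as unordered, so a combination and its transpose are not distinguished. The formula is the one given at the start of this thread, c = n12/(n1 + n2 - n12).

```python
from itertools import combinations


def c_index(n1, n2, n12):
    """Correlation index c = n12 / (n1 + n2 - n12), between 0 and 1."""
    denom = n1 + n2 - n12
    return n12 / denom if denom else 0.0


def rank_pairs(single_counts, pair_counts):
    """Return (k1, k2, c) rows for every unordered pair, highest c first."""
    rows = []
    for k1, k2 in combinations(sorted(single_counts), 2):
        n12 = pair_counts.get((k1, k2), 0)
        rows.append((k1, k2, c_index(single_counts[k1], single_counts[k2], n12)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical result counts, for illustration only (not measured data).
single_counts = {"hotel": 1000, "discount": 800, "airfare": 500}
pair_counts = {
    ("discount", "hotel"): 400,
    ("airfare", "hotel"): 100,
    ("airfare", "discount"): 50,
}

for k1, k2, c in rank_pairs(single_counts, pair_counts):
    print(f"{k1} + {k2}: c = {c:.3f}")
```

The top rows of the ranking would then be the candidates passed on to the EF ratio check in step (b).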

This thread is now moving to its second phase; i.e., term co-occurrence at the document level. I hope future posts will clear any reservations certain SEOs may have about the benefits of having co-occurrence, semantic connectivity and term sequencing strategies in their "tool box".

Hope this helps in some way.

Orion

AussieWebmaster
07-08-2004, 05:18 AM
Bring it on MacDuff

AussieWebmaster
07-08-2004, 05:20 AM
Bring it on MacDuff
Or should that have been Okay let the second line hold the line.

orion
07-16-2004, 12:45 AM
Search engines use local and global information to assess term weights (keyword weights), semantic connectivity and other observables. For example, in http://www.miislita.com/term-vector/term-vector-2.html I explained how term vector theory is used to evaluate term weights (using tf*IDF schemes) and why keyword density values should not be taken for term weight values. Adding to this, frankly, SEOs/SEMs that spend their time adjusting keyword density values in documents, going after keyword weight tricks or buying the latest keyword density toy are wasting their time.
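
To see why keyword density is not a term weight, here is a minimal sketch. It assumes one textbook tf*IDF variant (raw term frequency times log(N/df)); real engines use their own schemes, and the tiny corpus and function names are made up for illustration.

```python
import math


def keyword_density(term, doc):
    """Local measure: share of the document's words that are `term`."""
    return doc.count(term) / len(doc)


def tf_idf(term, doc, corpus):
    """Textbook variant: raw term frequency times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(term in d for d in corpus)
    return tf * math.log(len(corpus) / df) if df else 0.0


# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["hotel", "hotel", "discount", "rates"],
    ["hotel", "booking", "online"],
    ["hotel", "reservations", "rates"],
]
doc = corpus[0]

for term in ("hotel", "discount"):
    print(term, keyword_density(term, doc), round(tf_idf(term, doc, corpus), 3))
```

In this toy corpus "hotel" has the highest density in the first document, yet its weight is zero because it appears in every document. Density uses local information only; the weight also depends on global information.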

If an SEO wants to compute term weights, he/she should replicate the weighting scheme of the target system. This, however, is not an easy task. For details see http://www.miislita.com/semantics/c-index-7.html

Before discussing term co-occurrence at the document level, I would like to discuss a measure of keyword sequencing I call EF ratios. Unlike c-indices, which measure the degree of co-occurrence, EF ratios are a measure of term sequencing.


EF RATIOS

Results obtained by querying a system in EXACT mode are a subset of the results obtained in FIND ALL mode. The size of this subset introduces a degree of ordering in the set. This subset can be evaluated using an EXACT/FIND ALL ratio I call an "EF ratio" (E for exact and F for find all). I like to express these ratios in % or in ppt (parts per thousand).

Eq 1: EF ratio = # results in EXACT mode/# results in FIND ALL mode

Proposed Definition: Given a query Q = k1 k2 k3...kn consisting of n single terms, the probability that a document returned by a search for Q in FIND ALL mode contains the EXACT sequence k1 k2 k3...kn is its EF ratio. For a detailed explanation, see http://www.miislita.com/semantics/c-index-4.html and http://www.miislita.com/semantics/c-index-5.html

EF ratios can be used to answer two questions relevant to specialists interested in targeting specific keyword phrases: ie., given a pool of candidate phrases

1. which one is more likely to occur in a given search engine database?
2. which sequence of terms belonging to a query Q occurs more often in a set of documents relevant to Q?

To illustrate consider the candidate phrases

budget hotel
discount hotel
affordable hotel

A quick check of the EF ratios in Google reveals the following
(results are only valid in Google, are a few weeks old and may change over time)

DATABASE: GOOGLE
CASE: INSENSITIVE
DATE/TIME: 06-30-04 AT 2:00 PM ET
EF ratio = (EXACT results/FIND ALL results)*100

EF ratio for budget hotel = (3,400,000/8,410,000)*100 = 40.43 %
EF ratio for discount hotel = (8,720,000/13,200,000)*100 = 66.06 %
EF ratio for affordable hotel = (135,000/2,820,000)*100 = 4.79 %

The EF ratio analysis reveals that

1. discount hotel is the best sequence, found in about 66% of the documents relevant to these terms.
2. budget hotel is found in about 40% of the documents relevant to these terms.
3. affordable hotel is the worst sequence, found in less than 5% of the results relevant to these terms.

To conclude, with EF ratios we can identify the "best" and "worst" sequences from a pool of candidate phrases.
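
The arithmetic above is easy to reproduce. The sketch below uses the Google counts quoted in this post (06-30-04); the function name ef_ratio and the dictionary layout are just illustrative choices.

```python
def ef_ratio(exact, find_all):
    """EF ratio in percent: EXACT results / FIND ALL results * 100."""
    return 100.0 * exact / find_all if find_all else 0.0


# (EXACT results, FIND ALL results) as quoted above for Google, 06-30-04.
counts = {
    "budget hotel": (3_400_000, 8_410_000),
    "discount hotel": (8_720_000, 13_200_000),
    "affordable hotel": (135_000, 2_820_000),
}

ratios = {phrase: ef_ratio(e, f) for phrase, (e, f) in counts.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)

for phrase in ranked:
    print(f"{phrase}: {ratios[phrase]:.2f} %")
```

Sorting by the ratio puts "discount hotel" first and "affordable hotel" last, matching the analysis above.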


UNDERLYING (LATENT) SEMANTIC ORDERING WITHIN CHAOS

In addition to serving as a phrase discovery tool, an EF ratio provides a measure of the degree of semantic ordering present in a set of retrieved results.

That is, 66% of the results relevant to discount hotel contain the terms as an exact sequence. This adds a degree of ordering to the set of results relevant to "discount" and "hotel". An SEO cannot ignore the fact that this introduces a semantic ordering effect (66%, to be exact) in the set of results. Let's see why.

Average users, searching with the default FIND ALL mode for the query Q = discount hotel, should retrieve a set of documents relevant to "discount" and "hotel" regardless of sequence. If 66% of this set contains the "discount hotel" exact sequence, the sheer volume of ordering and of retrieved results favors documents that contain this exact sequence, whether already positioned or yet to be positioned, and not its reverse.

Comparing the EF ratios of the candidate phrases, it just makes sense to try to optimize a new document for the "discount hotel" exact sequence and not for its reverse sequence or for the other two above. For similar reasons (sheer volume of underlying semantic order), "discount hotel" could be considered more appealing than "affordable hotel" (where less than 5% accounts for a degree of semantic ordering).

Still an SEO cannot view EF ratios as magic bullets. He/she would need to consider the whole "seo picture" (proper copy style, client needs, etc, etc, etc.)

I hope SEOs/SEMs find c-index (co-occurrence) and EF ratios (sequencing) concepts useful enough as to be included in their keyword analytic toolbox.

Any comment?


Orion

DanThies
07-17-2004, 03:19 PM
Thanks, that's a useful addition to the picture.

It would be interesting to see a side by side comparison between c-indexes and EF ratios, for the same sets of search terms.

orion
07-17-2004, 04:00 PM
Thanks, Dan. I really value your comments.

I also added some graphics to the above links at my site, so SEOs/SEMs can visualize the degree of ordering introduced by sequencing. Some university colleagues are providing me with good feedback on latent semantics and TVT. This is where the research model is heading.

Orion

orion
07-23-2004, 05:04 PM
Just in case someone may want to know, I placed a post at the http://forums.searchenginewatch.com/showthread.php?p=6545#post6545 thread in which I discuss co-occurrence and EF ratios in the context of hyphenated queries.

Orion

orion
07-29-2004, 10:12 PM
CO-OCCURRENCE (C-INDEX) AND SEQUENCING (EF-RATIOS) APPLICATIONS

It appears that SEOs/SEMs are finally starting to find useful the concepts of co-occurrence and sequencing I have introduced to them. For example, DanThies writes the following in this SEW thread:

"Tools to analyze keyword search frequency"
http://forums.searchenginewatch.com/showthread.php?p=7028#post7028

In Post #21, he writes

"What you end up with, then, is a really huge list of words and phrases. Only some of them are really search terms, others are just words that happened to appear in sequence. This is where the information Orion provided becomes extremely useful. The first tool we have in our bag, courtesy of Orion, is the "E/F ratio." This compares the number of times the words appear in an exact phrase match vs. in a "find all" mode. So we take our list of 3-word phrases, for example, and query Google twice for each one, once with quotes around it and once without, to get an E/F ratio. The higher the E/F ratio, the more likely this is an actual search term, naturally occuring in the language.

We can now add color coding to our list of candidate search terms, to highlight the most likely search terms, based on the E/F ratio and the number of times each phrase occured in the pages we fetched from Google. The user then selects from the candidate search terms with checkboxes, those search terms that actually seem that they may be useful. Now we have a list of good candidate search terms, but we aren't done yet. We then take this list, and calculate c-indexes with the initial search term ("auto insurance"), to provide some insight into which are more relevant. For this calculation, we use exact phrase matching only. The final listing is sorted by c-index scores. We will release the PHP source code for this application on August 5, in conjunction with the SES Advanced Search Term Analysis session."

In Post #24 of same thread he writes

"Orion,

I haven't really found any applications yet to the many basic practices of SEO (such as optimizing page design and copywriting), but these concepts, that you've taken so much time to explain, have reaped tremendous benefits within our process of keyword discovery. Not only does it allow us to be more thorough, it offers the possibility of insight into the true relevance of search terms. The end result of this application development process will be better tools for keyword researchers, and a greater emphasis on relevance and targeting from the SEO firms we work with. Even if folks don't use the tools the same way we do, there will be an open source PHP class library supporting further work on c-indexes, E/F ratios, etc. A better understanding of the theory underlying information retrieval and search benefits us all in the long run."

Thanks Dan. This makes my campaign a lot easier. I welcome these and other applications.

KEYWORD SEQUENCING AND COPYRIGHT

In previous posts and at the beginning of this thread I was planning to cover Keywords Co-Occurrence, Sequencing and Copyright (copy style), but due to lack of time I forgot to revisit the subject. For those of you still interested, I cover this briefly today in the following SEW thread (Post #50), though applied to hyphens rather than stems, which was my original goal. Still, the new information on hyphenation and its effect on queries is revealing and essential to understanding subsets of results and the concept of selectivity.

"Change To Link Bomb Sign Of New Link Analysis Shift?"
http://forums.searchenginewatch.com/showthread.php?p=7152#post7152

You may also want to check http://www.miislita.com/semantics/c-index-8.html

Orion

orion
08-22-2004, 10:37 PM
This post is more about a practical application of the concept of terms co-occurrence in keyword research. I hope it may help SEO/SEM specialists interested in conducting this type of research.

Identification of secondary keywords with co-occurrence analysis.

I have developed a beta application I call the Latent Semantic Analyzer (LSA). Features of LSA include a word-by-word and character-by-character parser. The tool does not require any special API from search engines and includes

A Zipf frequency/rank counter
A stopword remover
A sorter (by Zipf rank and alphabetical order)

The Zipf rank allows us to conduct frequency analysis of keywords as well as cryptographic studies in which a frequency table of characters must be checked.

The parser allows us to identify secondary keywords topically-connected with seed words or phrases.
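
For readers who want to experiment, the three components listed above can be approximated as follows. This is not the LSA tool itself, only a rough sketch: the stopword list is a tiny illustrative one, the titles are made up, and ties in frequency are broken alphabetically, which is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on"}


def tokenize(text):
    """Word-by-word parser: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def zipf_table(titles, stopwords=STOPWORDS):
    """Return (rank, frequency, WORD) rows, most frequent word first."""
    counts = Counter(
        word for title in titles for word in tokenize(title) if word not in stopwords
    )
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, freq, word.upper()) for rank, (word, freq) in enumerate(ordered, 1)]


# Hypothetical titles, for illustration only.
titles = [
    "Cheap Car Insurance Quotes",
    "Car Insurance UK",
    "Compare the Car Insurance",
]

for rank, freq, word in zipf_table(titles):
    print(rank, freq, word)
```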

In a recent co-occurrence study we conducted the following experiments.

Procedure

1. We searched Google with the default FINDALL mode for the queries
"car insurance" and "auto insurance" (without quotes). These queries were selected after concluding they were acceptable, based on their co-occurrence (c-indices) and sequencing values (EF ratios).

2. Searches were limited to titles using the corresponding title command.

3. We also conducted searches using the modified queries

http://www.google.com/ie?num=100&q=car insurance
http://www.google.com/ie?num=100&q=auto insurance

An explanation of these queries is given in the Google Uncle Sam thread
http://forums.searchenginewatch.com/showthread.php?t=1014

This allows us to analyze the top 100 visible titles from Google's search results.

4. After removal of all stopwords, we conducted the Zipf analysis to determine the most frequently used keywords. As expected, the seed terms are the ones with the highest frequency (low Zipf rank, usually #1 or #2). We then concentrated on the secondary keywords.

Results of secondary keywords

We were able to identify from the top 100 results, the secondary terms more frequently used. Since we use only the portions of the titles displayed by Google, the analysis has a source of error. Still we were able to extract some useful observations. In particular, we were able to find interesting trends and behaviors within the top 100 search results. A sample of the top 15 terms based on their Zipf rank is given below. Terms are displayed in uppercase just to grab the attention of readers. The first column is the Zipf Rank and the second is the term frequency.

CASE I: car insurance

1 152 INSURANCE
2 107 CAR
3 45 UK
4 36 QUOTES
5 25 AUTO
6 23 ONLINE
7 20 CHEAP
8 14 HOME
9 13 QUOTE
10 12 MOTOR
11 9 TRAVEL
12 6 COMPARE
13 6 FREE
14 6 LIFE
15 6 COVER

CASE II: auto insurance

1 132 INSURANCE
2 96 AUTO
3 30 QUOTES
4 27 CAR
5 10 ONLINE
6 8 FREE
7 7 HOME
8 7 LIFE
9 7 CENTER
10 7 QUOTE
11 6 RATES
12 5 BEST
13 5 COMPARE
14 5 HEALTH
15 5 INFORMATION

With these frequency values, one may be able to compute co-occurrence values in the titles of the top 100 ranked results. The identified secondary terms can be used to reinforce the main topic of a web page and can be associated with the seed query. The results can also be used in query expansion studies, if the SEO or SEM specialist is up to the task.
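
One way to get co-occurrence values from the titles is to treat each title as an event and apply the c-index formula from the start of this thread, c = n12/(n1 + n2 - n12). The sketch below assumes exactly that; the function name and the sample titles are hypothetical.

```python
import re


def title_cooccurrence(titles, k1, k2):
    """Count titles containing k1, k2 and both; return (n1, n2, n12, c)."""
    word_sets = [set(re.findall(r"[a-z0-9]+", t.lower())) for t in titles]
    n1 = sum(k1 in words for words in word_sets)
    n2 = sum(k2 in words for words in word_sets)
    n12 = sum(k1 in words and k2 in words for words in word_sets)
    denom = n1 + n2 - n12
    return n1, n2, n12, (n12 / denom if denom else 0.0)


# Hypothetical titles, for illustration only.
titles = [
    "Car Insurance UK Quotes",
    "Cheap Car Insurance",
    "Auto Insurance Quotes Online",
    "Car Hire UK",
]

n1, n2, n12, c = title_cooccurrence(titles, "car", "uk")
print(n1, n2, n12, round(c, 3))
```

Note that this counts titles, not word occurrences, so a term repeated within one title is counted once.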

One interesting trend we found is that in CASE I the "UK" term has a high frequency in the top 100 results. We also found many instances of titles containing the same sequence of words. Examples are titles starting with

"Car Insurance UK...."

and derivatives of this title. In contrast, the "UK" term is not that prominent in CASE II (auto insurance).

The top 5 terms listed in CASE I and II are consistently found co-occurring across the titles of the top 10 and 20 search results. Examples are the terms car, insurance, auto, quotes, online etc.

Orion

rustybrick
08-22-2004, 11:08 PM
Orion, each and every one of your posts adds a new level of understanding for me.

Thanks so much!

Question:

(1) Are you using your own list of stop words?

(2) If you are...Its probably easy to spot these words when running the zipf's law a few times. Correct?

orion
08-22-2004, 11:35 PM
Hi, Rusty

1. "Are you using your own list of stop words?"

I use two filters. One is a library of regular expressions to identify and remove stopwords. The other is a library of stopwords. I use both filters in tandem. Users can enable/disable the filters for comparison tests.
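
A rough sketch of two filters working in tandem, each with its own on/off switch, is shown below. The patterns and the word list are hypothetical stand-ins chosen for illustration, not the contents of the actual libraries.

```python
import re

# Hypothetical stand-ins for the two libraries described above.
STOPWORD_LIST = {"the", "and", "of", "for", "in", "to", "a", "an"}
STOPWORD_PATTERNS = [
    re.compile(r"\d+"),    # tokens made only of digits
    re.compile(r"[a-z]"),  # single letters
]


def filter_tokens(tokens, use_regex=True, use_list=True):
    """Drop tokens caught by either enabled filter; keep the rest in order."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if use_regex and any(p.fullmatch(lowered) for p in STOPWORD_PATTERNS):
            continue
        if use_list and lowered in STOPWORD_LIST:
            continue
        kept.append(token)
    return kept


tokens = ["the", "top", "10", "hotels", "in", "a", "city"]
print(filter_tokens(tokens))                   # both filters on
print(filter_tokens(tokens, use_regex=False))  # word list only
print(filter_tokens(tokens, use_list=False))   # regular expressions only
```

Running each filter alone makes it easy to compare what one catches that the other misses.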

2. Its probably easy to spot these words when running the zipf's law a few times. Correct?

This is not a black or white thing.

In some cases, you are right, you can. In other cases it is not that simple. With foreign languages and regionalisms, it is quite risky to rely on Zipf's law only.

I use a compromise treatment in which, at the document level, the title is treated as a one-sentence passage and, at the database level, the top 100 results are treated as a global passage.

Either way, the secondary keywords are inherently relevant to the retrieved subset (top 100 results) and their co-occurrence and frequency is kind of "slaved" to the seed query.

Still, this dependency is not necessarily a drawback, since now we have a systematic way of identifying topic-focused terms, straight from the top 100 ranked results. We can conduct similar analyses in other search engines.


Orion

orion
08-23-2004, 12:59 PM
In previous post, I should have explained a bit the importance of the use of short passages (titles, in this case) in connection with the use of Zipf ranks and stopwords. Let me expand on this.

In general, for a large corpus one can use Zipf's Law to identify stopwords. In the particular case we are dealing with; ie., titles perceived as one-sentence passages, Zipf's Law is used to obtain the ranks rather than to actually identify stopwords. Since we are dealing with the top 100 search results, more likely the corresponding passages (titles) are already preoptimized. Thus, chances are that optimization minimizes the presence of stopwords in the titles.

That said, we use Zipf's law to determine stopwords from large generic corpora written in English. The identified stopwords are then used to upgrade our two filters (regular expressions and stopword library). This approach is also used with documents written in foreign languages, but with some limitations due to regionalisms.
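
As a hedged illustration of the idea, the sketch below flags as stopword candidates the highest-frequency words that also appear in most documents. The two thresholds (top_n, min_doc_fraction) and the sample sentences are assumptions for the example, not values used in our filters, and any candidate list still needs a manual review.

```python
import re
from collections import Counter


def stopword_candidates(documents, top_n=10, min_doc_fraction=0.8):
    """Return the top_n most frequent words that occur in most documents."""
    term_freq = Counter()
    doc_freq = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        term_freq.update(tokens)      # total occurrences across the corpus
        doc_freq.update(set(tokens))  # number of documents containing the word
    n_docs = len(documents)
    return [
        word
        for word, _ in term_freq.most_common(top_n)
        if doc_freq[word] / n_docs >= min_doc_fraction
    ]


# Hypothetical mini-corpus, for illustration only.
documents = [
    "the hotel is near the beach",
    "the car is in the garage",
    "the insurance is for the car",
]

print(stopword_candidates(documents, top_n=3))
```

Here "car" is among the three most frequent words but is not flagged, because it is missing from one of the three documents.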

This does not mean we cannot see stopwords in the titles of the top 100 results. To ensure we have removed stopwords, I use a double stopword filter before the LSA tool determines the Zipf ranks. Consequently, co-occurrence values calculated from c-indices are very different from co-occurrence frequencies calculated using Zipf ranks. BTW, the tool is named LSA since it incorporates a c-index/EF ratio calculator as well as a TVT similarity calculator. Although in beta phase, the tool is being used to optimize keywords for clients.

For SEOs/SEMs interested in Zipf's Law and how it can be used to identify stopwords, this link http://mingo.info-science.uiowa.edu:16080/courses/230/LecturesSpring2004/LexicalAnalysis.pdf has an excellent explanation. It includes a good explanation of Zipf's, Mandelbrot's and Heaps' Formulas. As a fan and admirer of Benoit Mandelbrot, I would love to discuss his work in relation to linguistic frequencies, but I'm afraid that may divert readers' attention to the math behind the semantics.

Orion

rustybrick
08-23-2004, 04:09 PM
Thanks Orion.

I figured most of the top 100 results for your tests did not include the stop words. But when I did some research on zipf's law, it seemed like it was used to count the most frequent and least frequent words in a document. So generally, the most frequent words in a document (magazine, book, Web site) would be a stop word. When it comes to titles that are properly optimized, you don't find that many stop words and if you do, your own stop word filter would solve it.

Thanks for the clarification Orion.

orion
08-23-2004, 09:41 PM
Hi, Rusty. You're absolutely correct.

Here is a different application of the procedure we've developed. Suppose we want to optimize three different pages for their corresponding topic. Each topic is related in some way.

Procedure

1. Using c-index and EF ratios, determine several candidate seed phrases.
2. Using the LSA tool, extract secondary keywords from the top 100 results as before.
3. Compare each set of secondary keywords based on their Zipf ranks.
4. Select the top secondary keywords, based on their Zipf ranks.

This procedure should allow us to identify commonalities and differences within sets. An example is given below for a case study treated in previous posts and at my research site. The 3 candidate seeds are "discount hotel", "cheap hotel" and "online hotel". I'm showing only the top 15 terms. As usual, the seed terms occupy the first positions. We then concentrate on the next top secondary terms.

LSA RESULTS - Columns 1 and 2 show the Zipf rank and raw frequency.

CASE I: discount hotel

1 119 HOTELS
2 92 DISCOUNT
3 68 HOTEL
4 26 RESERVATIONS
5 25 CHEAP
6 13 USA
7 10 BOOK
8 9 RATES
9 8 TRAVEL
10 7 WORLDWIDE
11 7 RESERVATION
12 7 GUIDE
13 7 LONDON
14 6 ONLINE
15 6 DISCOUNTS

CASE II: cheap hotel

1 133 HOTELS
2 100 CHEAP
3 62 HOTEL
4 32 DISCOUNT
5 14 RESERVATIONS
6 13 UK
7 13 LONDON
8 13 ACCOMMODATION
9 12 PARIS
10 10 RATES
11 7 BUDGET
12 7 YORK
13 7 EDINBURGH
14 7 LUXURY
15 7 NEW

CASE III: online hotel

1 80 HOTEL
2 69 HOTELS
3 56 ONLINE
4 27 RESERVATIONS
5 15 BOOKING
6 13 RESERVATION
7 9 BOOKINGS
8 9 LAST
9 9 MINUTE
10 8 DISCOUNT
11 8 ACCOMMODATION
12 7 BUDAPEST
13 7 PRICES
14 6 BOOK
15 6 GUIDE
16 6 ROME
17 6 TRAVEL

From the top Zipf ranks, one can easily identify the secondary terms associated to each case. These terms, when used in the document, should reinforce the theme of the page.

OBSERVATIONS

1. Although the terms "discount" and "cheap" have a different meaning, in the context of marketing the phrase "discount hotel" is more semantically related to "cheap hotel" than to "online hotel" (CASE III). From the semantic standpoint, both "discount hotel" and "cheap hotel" should pull similar secondary terms. Accordingly, it is not surprising to observe in the top 5 Zipf ranks almost the same terms in CASE I and II.

CASE I: discount hotel

1 119 HOTELS
2 92 DISCOUNT
3 68 HOTEL
4 26 RESERVATIONS
5 25 CHEAP

CASE II: cheap hotel

1 133 HOTELS
2 100 CHEAP
3 62 HOTEL
4 32 DISCOUNT
5 14 RESERVATIONS


2. The term "booking" does not appear in CASE I and II but has a prominent Zipf rank in CASE III. This suggests that the term "booking" has more affinity with the term "online" and the phrase "online hotel" than with the "discount hotel" and "cheap hotel" phrases.

CASE III: online hotel

1 80 HOTEL
2 69 HOTELS
3 56 ONLINE
4 27 RESERVATIONS
5 15 BOOKING

3. The term "reservations" has a persistent Zipf co-occurrence in all three cases. It is not a coincidence that this term is consistently present in the top 100 search results of CASE I, II and III. Thus the term "reservations" has a lot of semantic "juice" and should be included in the three different documents to be designed. This is in accord with the idea that terms which co-occur frequently have a semantic association. Consequently, we now have a sound way of identifying terms with a lot of "juice".
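
Observations 2 and 3 can be checked mechanically from the three tables above. The sketch below copies the listed terms and uses plain set operations. Treating HOTEL, HOTELS, DISCOUNT, CHEAP and ONLINE as seed terms to exclude is an assumption made for this example.

```python
# Terms copied from the three LSA tables above, in Zipf rank order.
case_1 = ["HOTELS", "DISCOUNT", "HOTEL", "RESERVATIONS", "CHEAP", "USA", "BOOK",
          "RATES", "TRAVEL", "WORLDWIDE", "RESERVATION", "GUIDE", "LONDON",
          "ONLINE", "DISCOUNTS"]
case_2 = ["HOTELS", "CHEAP", "HOTEL", "DISCOUNT", "RESERVATIONS", "UK", "LONDON",
          "ACCOMMODATION", "PARIS", "RATES", "BUDGET", "YORK", "EDINBURGH",
          "LUXURY", "NEW"]
case_3 = ["HOTEL", "HOTELS", "ONLINE", "RESERVATIONS", "BOOKING", "RESERVATION",
          "BOOKINGS", "LAST", "MINUTE", "DISCOUNT", "ACCOMMODATION", "BUDAPEST",
          "PRICES", "BOOK", "GUIDE", "ROME", "TRAVEL"]

# Assumption: words from any of the three seed phrases are not secondary terms.
SEEDS = {"HOTEL", "HOTELS", "DISCOUNT", "CHEAP", "ONLINE"}


def secondary(terms):
    """Return the set of listed terms that are not seed terms."""
    return {t for t in terms if t not in SEEDS}


sec_1, sec_2, sec_3 = secondary(case_1), secondary(case_2), secondary(case_3)

common_to_all = sec_1 & sec_2 & sec_3
only_in_case_3 = sec_3 - sec_1 - sec_2

print("Common to all three cases:", sorted(common_to_all))
print("Only in CASE III:", sorted(only_in_case_3))
```

With the lists as posted, the only secondary term shared by all three cases is RESERVATIONS, while BOOKING shows up only in CASE III.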

Orion

rustybrick
08-23-2004, 09:58 PM
I love how you just connected the c-index to the zipf's law.

orion
09-02-2004, 02:25 PM
We have finished a new experiment using our LSA tool and Google AdWords, as follows. We query AdWords with an initial key phrase seed and collect results from its "More Specific Keywords" and "Similar Keywords" lists. These results are then imported to our LSA tool for further semantic analyses.

A sample is given below for each set of results. The seed keyword is "discount hotel" and stopwords have been removed. We only show the top 15 Zipf rank results in each case. Finally, we compare results with the procedure described in previous posts (using title results). Column sequence is Zipf Rank, Frequency, and Word.


SET 1: USING MORE SPECIFIC KEYWORDS

1 148 DISCOUNT
2 148 HOTEL
3 16 VEGAS
4 14 NEW
5 13 ROOMS
6 12 YORK
7 12 CITY
8 11 LAS
9 10 RATES
10 6 CHICAGO
11 6 RESERVATIONS
12 6 SAN
13 4 BOSTON
14 3 COUPON
15 3 MOTEL

SET 2: USING SIMILAR KEYWORDS

1 18 HOTEL
2 16 HOTELS
3 10 CHEAP
4 5 DISCOUNTS
5 5 COM
6 5 DISCOUNT
7 4 FLIGHTS
8 4 AIRLINE
9 3 TRAVEL
10 3 RESERVATIONS
11 3 AIR
12 3 DISNEY
13 3 CARLTON
14 2 CHICAGO
15 2 CHEAPHOTEL


COMPARATIVE RESULTS WITH RESULTS EXTRACTED ONLY FROM TITLES

We present below results using keywords extracted from the titles of top 100 search results using the keyword "discount hotel". The procedure is described in previous posts.

1 96 HOTELS
2 83 DISCOUNT
3 82 HOTEL
4 26 RESERVATIONS
5 20 CHEAP
6 13 RATES
7 12 TRAVEL
8 11 LONDON
9 8 AIRFARE
10 8 WORLDWIDE
11 8 ONLINE
12 7 UK
13 7 PARIS
14 7 DISCOUNTS
15 7 RESERVATION

Feel free to comment on the similarities and differences observed between sets. In Set 2 and the third set (titles), the first 5 results are quite similar.
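
For anyone who wants a number to go with the comparison, one simple option (an assumption here, not part of the experiment) is the Jaccard coefficient of the top 5 terms of each pair of sets: shared terms divided by total distinct terms. It has the same form as the c-index, n12/(n1 + n2 - n12). The top 5 lists below are copied from the three tables above.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared items divided by total distinct items, between 0 and 1."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Top 5 terms of each list above, in Zipf rank order.
top5 = {
    "set1": ["DISCOUNT", "HOTEL", "VEGAS", "NEW", "ROOMS"],
    "set2": ["HOTEL", "HOTELS", "CHEAP", "DISCOUNTS", "COM"],
    "titles": ["HOTELS", "DISCOUNT", "HOTEL", "RESERVATIONS", "CHEAP"],
}

scores = {(x, y): jaccard(top5[x], top5[y]) for x, y in combinations(top5, 2)}

for (x, y), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{x} vs {y}: {score:.2f}")
```

By this measure Set 2 and the title-based set are the closest pair, sharing HOTEL, HOTELS and CHEAP in their top 5.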


Orion

PS. A CALL FOR SEOs. I'm about to conduct a complimentary keyword research experiment and am asking for volunteers (no strings attached). Volunteers can keep the keyword results and use them at will for their sites or clients. Anyone interested, please feel free to express interest by dropping me a private email. The experiment will be limited to English-only keywords.

orion
09-03-2004, 11:11 PM
CALL FOR FIVE SEOs interested in participating in a keyword semantics experiment. See http://forums.searchenginewatch.com/showthread.php?p=12329#post12329

Orion

orion
09-06-2004, 10:38 PM
1. Someone asked me about the 12 subscript in c-index values. This is used to distinguish a k1+k2 phrase/c-index from its transpose k2+k1 phrase/c-index. (See old posts at this thread)

2. The same person asked me why I used the c-index mark of 25 ppt (parts per thousand) in the experiment conducted in the thread at http://forums.searchenginewatch.com/showthread.php?p=12585#post12585

This mark is the result of testing many two-word phrases in the Google database, hence it is valid for experiments conducted in Google. The mark serves as a reference point, no more, no less, to ensure some degree of semantic connectivity between candidate terms. Why? Terms co-occurring frequently are often semantically connected in some way. For example, many think of Hawaii when they hear the term "Aloha". This is a semantic association.

Now, let's consider the word k1=aloha and three candidate k2 terms; i.e., california, hawaii and florida. A quick FINDALL search in Google reveals that

k1=aloha n1=2050000; k2=california n2=120000000; k12=aloha california n12=307000 || c12=2.52ppt
k1=aloha n1=2050000; k2=hawaii n2=28800000; k12=aloha hawaii n12=845000 || c12=28.16ppt
k1=aloha n1=2050000; k2=florida n2=63900000; k12=aloha florida n12=204000 || c12=3.10ppt

Clearly, the c-index results reveal that the term aloha is more semantically connected to hawaii than to california or florida, an association that common usage confirms. The above mark is used in this experiment as a generic reference point from which one can derive secondary terms that may reinforce a theme. The idea is to start from valid ground from which secondary terms are identified.
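
The three c-index values can be reproduced with the formula from the start of this thread, c = n12/(n1 + n2 - n12), scaled to parts per thousand. The counts are the ones quoted above; the function and variable names are illustrative.

```python
def c_index_ppt(n1, n2, n12):
    """c-index in parts per thousand: 1000 * n12 / (n1 + n2 - n12)."""
    denom = n1 + n2 - n12
    return 1000.0 * n12 / denom if denom else 0.0


N_ALOHA = 2_050_000  # n1 for k1 = aloha, as quoted above
MARK_PPT = 25        # reference mark used in the experiment

# k2 -> (n2, n12), as quoted above for Google FINDALL searches.
candidates = {
    "california": (120_000_000, 307_000),
    "hawaii": (28_800_000, 845_000),
    "florida": (63_900_000, 204_000),
}

results = {k2: c_index_ppt(N_ALOHA, n2, n12) for k2, (n2, n12) in candidates.items()}
above_mark = [k2 for k2, c in results.items() if c >= MARK_PPT]

for k2, c in results.items():
    print(f"aloha {k2}: c12 = {c:.2f} ppt")
print("At or above the 25 ppt mark:", above_mark)
```

Only the aloha/hawaii pair clears the 25 ppt mark with these counts.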

On the other hand, these results reveal c-indices can be used to identify terms not semantically associated between themselves or with a theme.

Orion

danielray
09-21-2004, 08:23 AM
As I understand it, the SEO application for co-occurrence means the documents that contain both terms can be calculated using advanced search syntax in today's engines, and better choices can be made when choosing keywords for SEO Web copywriting. The terms can be said to be semantically related (by virtue of the Web), but are they then better keywords to target?

Since this comes from IR research, it really seems to have applications for better Web search and not so much for the SEO. The reason I think the SEO cannot really gain much from this information is that they would be targeting terms that are related insofar as they are intertwined by the Web's documents at a high rate (thereby being more competitive)!

An engine that uses co-occurrence technology blended in their Web search application can be targeted by the SEO and possibly with good results. Since there is at least one person who posted here describing using this calculation with good results, it stands to reason that the engine that was targeted is one that blends some form of co-occurrence now. An article about SCIENTIFIC REALISM AND ANTIREALISM at ( http://www.work-at-home-profits.com/scientic.htm ) . Would that not be true Orion? How do SEOs benefit from this information in a practical sense unless the engines are blending semantic connectivity somehow?

danny

orion
09-21-2004, 02:10 PM
Daniel

I strongly disagree with your post for many reasons. I still don't understand why some SEOs/SEMs cannot see things running in the background of search results.

Search engine technology is based on old and new IR algorithms. To different degrees, term occurrence and term co-occurrence theory are embedded in most of these algorithms. Clustering algorithms, modern query expansion techniques, latent semantic indexing and more recently Bruce Croft's local context analysis for query expansion (LCA) use occurrence and co-occurrence theory. (See the work of Sparck Jones, Salton and Buckley, Attar, Fraenkel, and more recently the work of Bruce Croft and others). My advice: pay strong attention to LCA. Not only is LCA a huge advance in the area of query expansion, it also helps with keyword discovery and with the contextual design of highly relevant passages. Without co-occurrence, the LCA model could not succeed.

To sum up, huge advances have been made in incorporating co-occurrence into term vector models, scoring functions and theme-based website building (local co-occurrence). Co-occurrence, combined with these and new techniques, is one of the things driving the industry toward building smarter search engines.

I want to take this opportunity to invite readers to visit the http://www.organic-rankings.com site or to drop an email to Derek Chew. I'm currently conducting two experiments dealing with term occurrence and co-occurrence theory. Exp 1 results are coming soon and deal with term co-occurrence and theme building. Exp 2 deals with co-occurrence theory and brands association and we are asking for companies interested in participating. Feel free to visit this site and submit your company.

Orion

PS. I have decided to postpone exp 2 for now.

orion
12-03-2004, 01:10 PM
Search engine marketers are not the only community that can benefit from term co-occurrence, On-Topic Analysis (http://forums.searchenginewatch.com/showthread.php?t=2031), and semantic associations. It is not a secret that the intelligence community is using such advanced techniques.

In Security officials to spy on chat rooms (http://news.com.com/Security+officials+to+spy+on+chat+rooms/2100-7348_3-5466140.html?tag=nefd.lede), CNET reports that back in April 2003, the CIA agreed to fund a series of research projects to create "new capabilities to combat terrorism through advanced technology." One of those projects, at the Rensselaer Polytechnic Institute in Troy, N.Y., is devoted to automated monitoring and profiling of the behavior of chat-room users. The NSF/CIA-sponsored proposal of researchers Bulent Yener and Mukkai Krishnamoorthy says in part (emphasis added)

"We propose a system to be deployed in the background of any chat room as a silent listener for eavesdropping...The proposed system could aid the intelligence community to discover hidden communities and communication patterns in chat rooms without human intervention."

Their proposal says research will begin Jan. 1, 2005 but does not identify which IRC servers will be targeted, and it seems to be based on the NSF-funded paper A Tool for Internet Chatroom Surveillance (http://www.cs.rpi.edu/~yener/PAPERS/35.pdf). In this paper, the researchers predicted their work

"could aid (the) intelligence community to eavesdrop in chat rooms, profile chatters and identify hidden groups of chatters in a cost-effective way" and that their future research will focus on identifying "topic-based information."

Their methodology is based on clustering analysis and the singular value decomposition (SVD) theorem. For those interested, SVD is the theorem used in latent semantic analysis (LSA). Essentially they are applying standard LSA, on-topic, and co-occurrence theory to find, analyze and interpret word patterns from online community activities (community = chat rooms, discussion forums, guestbooks, etc). It is not clear if they plan to combine this with steganographic analysis and spyware. I don’t see any technology barrier for such implementation.


Orion

AussieWebmaster
12-05-2004, 06:15 PM
Interesting as usual Orion.. I have been reading about clustering recently though more from the standpoint of getting a deeper knowledge of the many facets of search...

eZeB
01-07-2005, 10:55 PM
Orion -- many thanks for the time to explain all this which I have followed with interest.

I have my own (seat of the pants) method of analyzing web content and often find co-occurrence of terms which have no apparent semantic relationship but nevertheless do co-occur with enough frequency to suggest a relationship.

Without any calculations, clearly "teacher" is related to "classroom" and "school" and an analysis of the top 20 ranking pages for the keyword "teacher" will confirm this. However, many websites with, for example, the word "children" also have the word "free" which has no intuitive connection, but does co-occur for reasons that have nothing to do with semantics - i.e. commercial sites retailing teaching materials for children, which dominate the top 10 or 20 on google generally have a free trial or give away.

Are these types of behavior captured in your calculations? Likely this type of co-occurrence would not persist beyond the top 20 or 30 as the sites would be more 'community' oriented and use less commercial language and the whole approach and language would be entirely different. I can think of all sorts of reasons for language and co-occurrence relationships changing when the sample is taken from different positions in the results.

Isn't something falling through the cracks here as the calculations will no doubt show no relationship between "children" and "free" ? Another example would be "teacher" and "resources"

Is this because my sample is too small ? Wouldn't a larger sample fail to capture these relationships? Is this an important relationship? From my admittedly narrow perspective, if the majority of top 20 sites for my targeted keywords have a particular word on them, I want that word on my site.

Your comments are greatly appreciated.

orion
01-08-2005, 01:50 PM
Hi.

Sure, my pleasure. First, thank you for taking the time to stop by.

When I introduced normalized co-occurrence theory to SEWF readers back in Summer 2004 (this thread) it was intended for SEOs/SEMs. I would love to discuss these and related subjects, such as co-word citation, relevancy and similarity measures, in depth (and in simple terms) at an SES conference.

Normalized co-occurrence as estimated by c-indices and EF-ratios is not enough, as I have pointed out many times. You would need additional treatments, one being On-Topic Analysis.

You may want to read the On-Topic Analysis (http://forums.searchenginewatch.com/showthread.php?t=2031) thread and the experiment paper I wrote on this subject. Co-occurring terms should be on-topic for the intended theme, and so should the document data structure.

Cheers

Orion

claus
01-09-2005, 09:02 AM
Hi, i'm new to this forum as well as this thread and haven't read all of it, so thanks for an interesting read so far :)

I've got a few comments, of course, sorry if i'm just restating what others have said already:

n(0,db) = # retrieved results not containing the queried term, i.
n(it,db) = # of total results retrieved by querying i in a given db.

[...]

Assuming good IR performance of recall and precision (see reference textbook) and strict adherence to pattern matching of regular expressions, n(0, db) should be negligible, thus the main assumption is that

n(it, db) = n(i, db).

That main assumption is somewhat out of step with reality at the Search Engines, as if you do a search on any word or phrase you cannot be certain that the words do in fact appear on the page.

Eg. with Google, words often appear in links to the page instead of on the page, and as links are much more concentrated (shorter) than the body text of pages, you will at best skew your derived metrics.

I noted that you mentioned using the string search ("keyword" or "Search pages containing the term") - i feel that this needs to be emphasized, at least.

Other thoughts

1) Causality vs. co-occurrence:
By comparing number of results for keyword searches you are of course assuming a co-occurrence. It need not be a causality. There are 500,000 pages returned in Google for the search tower diving (http://www.google.com/search?q=tower+diving) but those two words are really not all that related (put in quotes it's only around 500 pages, which illustrates my point above).

2) Different types of clustering analysis techniques:
If you really want to dig deep into this you should examine the word clusters among the set of pages returned in the SE results (ie. not the sheer number of results). Of course, your approach is simple, which is great, but you do limit yourself to your own personal a priori assumptions about which keywords are related (see (1)) by typing those into the search box in the first place.

There are a number of techniques for this ranging from discriminant analysis (dependency) to multidimensional scaling (interdependency) and machine learning (AI, Fuzzy, Neural, etc.) - and a lot of less "established" but still useful variants as well.

3) Stop words:
Example: Someone mentioned the word "free" on school pages. Other words likely to be found most anywhere are "the", "it", "and", and so on and so forth. Sometimes these words are legitimate, and sometimes they will only add noise.

So, you need to compile a list of stopwords for the subject you are studying, and remove these stopwords from your sample before doing the analysis. That list will be different from topic to topic, but some words will be generic (eg. "free" will be okay for some samples, but "the" will be a stopword for most).

4) Stemming:
We all know that "holiday", and "holidays" are different representations of the same word, so you should of course make sure that the words you examine are reduced to smallest common stem where appropriate, before doing analysis. There's a number of techniques for this.

However, for some research topics it will be necessary to distinguish between "president" and "presidency" while for others it will not, so your stem list should be modified for each topic/run of your analysis just like your word list.

5) Semantic classes:
There's all kinds of fancy long words describing certain pieces of the puzzle or properties of subsets of those. One classification is language (eg. in Swedish the word "glass" means "ice cream", and in Danish the word "is" (an English stopword) means "ice")

That's just one factor though. Take a search for "cows" as an example - in academic writings you will more likely see "bovine" while in a children's book it will be "cows". It's not only the keyword that differs between these two classes (audience) but most words used.

6) Your calculator and the number of keys to punch:
While it's simple to do the research suggested in the original post, it is not simple to do an in-depth analysis of raw pages. You cannot use your average calculator for that. The more pages you want to analyse, the more power you need, and due to topical customizations, the more unrelated topics you want to analyze you will also need manpower with hands-on knowledge from different fields.

7) Scale:
Because of the above (3-6), it's not easy to scale such methods and still get valid results. You will observe false positives (or, more probable: false negatives) if you scale this kind of research over a large number of very different topics and a large set of data, say "search queries performed on X billion pages".

8) Success stories:
Yes of course there are success stories. Techniques mentioned in (2) can be very powerful; one such success story is Bayesian email spam filtering. The key to understanding why the success stories are success stories, imho, is that they're applied to a narrowly defined scope, with a predefined purpose.

That, in turn, is a luxury that Search Engines don't have. They don't have knowledge about the specific intent of the searcher beforehand, and they have a dataset that amounts to "you-name-it-we-got-it". From a probability viewpoint, they can derive more clusters than there are searches, so it's not probable that they will "hit" - more likely they will "miss".

Still, my experience with the Gigasearch "gigasets", the Northern Light clusters, etc., shows that if you just suggest a large number of possible clusters there will always be a few hits among them.
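
Points (3) and (4) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is hypothetical, and naive_stem is a deliberately crude plural stripper standing in for a real stemming technique, not any particular published algorithm.

```python
import re

# Hypothetical generic stop-word list; as noted above, it should be tuned per topic.
GENERIC_STOPWORDS = {"the", "it", "and", "a", "of", "for"}


def naive_stem(word: str) -> str:
    """Strip a trailing plural 's' (a crude stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text: str, topic_stopwords: frozenset = frozenset()) -> list:
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stopwords = GENERIC_STOPWORDS | topic_stopwords
    return [naive_stem(t) for t in tokens if t not in stopwords]


SAMPLE = "Free holidays for the children and a free holiday for the teachers"
print(preprocess(SAMPLE))
# Treat "free" as noise for this particular (hypothetical) topic.
print(preprocess(SAMPLE, topic_stopwords=frozenset({"free"})))
```

Passing the topic-specific words as a separate argument keeps the generic list untouched, so the same sample can be run once with "free" kept and once with it filtered out.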

orion
01-09-2005, 10:28 AM
Hi, Claus. Welcome to this thread and thank you for your input. Please feel at home. Hope you read the rest of the thread and related threads at SEWF. It’s my pleasure to revisit these points so others and new readers will have a better understanding of co-occurrence theory.

1. Matching, queried, retrieved, available results. I have stated many times that this is precisely an approximation, as pages containing synonyms and concepts associated with the queried terms can indeed show up in the SERPs even though the query terms are not contained in them. For very large record sets this effect is negligible. Still, you always have an error in the analysis.

2. FINDALL. When you search in Google’s default mode this is a FINDALL search (AND mode), so queried terms can appear anywhere in the document, links, URLs, etc., regardless of order and proximity. Just because you don’t see the terms in the visible text of a page or in the SERPs does not mean they are not there and counted in the results. All this has been explained many times, too.

3. Causality vs. co-occurrence. Co-occurrence suggests association but is not proof of it. This was already explained, too, in previous posts in this thread. With regard to tower diving, a search in quotes in Google is an EXACT search (with regard for order and proximity), hence searches in this mode produce fewer results than FINDALL, since they are a subset of the results retrieved in FINDALL (explained many times). If you calculate the EF-ratio, this gives you an estimate of the probability that documents retrieved in FINDALL contain the queried terms in an exact sequence.

4. Cluster Analysis. We use co-occurrence in preliminary experiments, and the results are taken as input for cluster analysis (LSI, dendrograms, etc.), to be specific. We do carry out cluster analysis.

5. Other techniques available and that could be used. I agree with you on this.

6. Stop Words. We use two kinds of filters: (a) a stop word list and (b) a library of regular expressions. You may want to check my On-Topic Analysis thread and experiment paper.

7. Stemming. At the beginning of this thread one of my intentions was to discuss stemming using Associative Clusters. Then someone started a thread on stemming and I decided to let them discuss it. If you search the SEWF forums for stemming you should find the thread.

8. Semantic Classes. You would be surprised how many times I have explained that co-occurrence is demographically and culturally driven, as is the use of hyphens in different languages (or the same language in different countries).

9. My Calculator. Processing overhead is always there and is a problem. I can handle several thousand records and would like to push for millions. In the On-Topic paper I presented a crude sample using a few hundred. I do agree with you on this. I need more power. A university machine-power center is what I am currently shopping around for. Good things are coming for 2005! But I cannot say more.

10. False positives/negatives are things I discuss, too. I have devised three different methods for working around this, briefly described in the On-Topic paper, and they could be improved.

11. Scalability. I think point 9 answers this.

12. Success stories and SE limitations. I agree.
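
To make point 3 concrete, here is a minimal Python sketch of an EF-ratio calculation. It assumes the EF-ratio is simply the EXACT result count divided by the FINDALL result count, which is how the description in point 3 reads; the function name ef_ratio is made up for the example, and the counts are the approximate ones Claus quoted for tower diving.

```python
def ef_ratio(n_exact: int, n_findall: int) -> float:
    """Assumed definition: EXACT-mode count divided by FINDALL-mode count."""
    if n_findall <= 0:
        raise ValueError("FINDALL count must be positive")
    if n_exact > n_findall:
        # EXACT results should be a subset of FINDALL results.
        raise ValueError("EXACT count cannot exceed FINDALL count")
    return n_exact / n_findall


# Approximate counts quoted earlier in the thread for the query: tower diving
TOWER_DIVING_EF = ef_ratio(n_exact=500, n_findall=500_000)
print(f"EF-ratio = {TOWER_DIVING_EF:.3f} ({1000 * TOWER_DIVING_EF:.0f} ppt)")
```

Read this way, roughly one in a thousand FINDALL documents for tower diving contains the two words as an exact phrase. Since result counts reported by engines are estimates, the subset check may need relaxing in practice.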

I have mentioned so many times the above points all over the place at the SEW forums, but you may need to do a search in the forums and find them, if you have the time. Always my pleasure answering/helping.

Cheers


Orion

orion
01-11-2005, 02:50 PM
I have opened a thread in a Google section on Fractal Spam (http://forums.searchenginewatch.com/showthread.php?t=3707). Since this deals with relevancy I thought I should start it here. However, since it also deals with spam issues in connection with Google and other SEs, I decided to open it in the above section.

Since this thread is getting too long I may start a new thread on new topics in connection with co-occurrence, so the new thread could point to this one as a required reference. Since many of you have carefully followed this thread I'm very interested in your opinion. What do you think? Any input?

Cheers

Orion

PS. I'm kind of limited on time now. So for the time being I prefer to open the new thread in about two weeks.

eZeB
01-11-2005, 03:18 PM
Very informative. You have raised the bar in terms of posts and SEO!

I had evolved my own co-occurrence technique, and your paper and rigorous treatment of the subject have been a great help in tightening up the rigor of my own techniques, research and experiments.

The c-index is fascinating and I have tacked my own variant onto the end of co-occurrence analysis with excellent results.

From some very limited experimentation, I am finding there are one or two terms associated with a main keyword that have an exceptionally high c-index and are 'giant killers.'

I have some ideas about where to find them but haven't experimented enough. Formulating a general rule on their characteristics and where to find them (quickly and easily) would be great. Is it possible they occur in certain locations most of the time?

I would be very interested in your thoughts on this.

In your co-occurrence paper you mention the SERP text returned as a possible future direction for research. That is one of the most interesting things I have heard in a long time. Any examination of the snippets returned is bound to be very interesting and fruitful.

claus
01-12-2005, 09:24 AM
>> I have mentioned so many times the above points all over the place at the SEW forums

Well, again, i'm sorry if i was just restating what was already said. It's a very long thread and i'm new to this forum, so i haven't read the other threads yet ;)

Anyway, thanks for the response Orion - i think we could get some very interesting discussions, provided we both have the time for it (which, sadly, isn't always the case for me). Also, in my post above i forgot to say that it's brave of you to submit ideas like this to people of such diverse background as the people here. It's definitely a good thing, though. :) You'll have a whole lot of explaining to do, which probably explains why you already have mentioned the points i brought up.

Anyway, i feel one of my points has been overlooked somewhat. It wasn't very explicitly stated though, so i can only blame myself for that.

Your computer vs. My computer vs. the SE computers vs. "the human factor"

It's interesting that you want to take this research into the millions-of-records level, and that you will need something like university grade equipment in order to do so. This research (original post) is quite simple in a computational sense. You can literally do it with a normal calculator for a limited data set. Still, as the number of data points increases the requirements also increase.

(Don't misinterpret my use of the word "simple" btw. - i almost never use the word simple in a negative sense. On the contrary, for most issues it's very positive.)

Now, consider the case of search engines, which have an order of magnitude more data. Of course the search engines do have university grade equipment (and then some) but still - and here is the point, finally: A lot of people simply overestimate what it is possible to do for the search engines. To do successfully, that is.

AFAIK, all of the big ones and some of the smaller ones as well are working with clustering techniques, but there's a long way from "working with" to "base search on". The technical requirements of doing this research (simple co-occurrence of words) on a million records are much smaller than those of, say, doing LSI on eight billion pages. And, as i mentioned above, to do this successfully you can't rely 100% on computation. To "get things right" you will need a human with experience from the specific topic (per topic).

This, in turn, can be bought. Either as paid assistance or as collections of pre-prepared ontologies, or whatever (even a dmoz dump might offer some help). But it still needs constant monitoring and updating, as well as filter-tweaking per topic. With eight billion pages, that's a lot of different topics and a lot of work, even if we assume (wrongly) that each page has only one topic.

And then there is the human factor, from the searchers side:

Even if you have all the technology in place as well as very explicit and large ontologies (and hand-crafted filters) you will still fall short of the human factor: Say i type in "Eagles" in the search box. The search engine can of course have a prepared set of clustered pages ready for me with all kinds of information relating to those birds, but it will not know beforehand if i really intended to search for

- an orchestra named "Eagles"
- a sports team named "Eagles"
- the golf score "Eagle" in plural
- the bird species "Eagles"
- the "Eagles" gold coins
- ...

Of course this is a standard issue which the search engines always face. Ordinary PageRank, say, does not solve this problem either, as there are probably more music and sports fans than there are ornithologists (or coin collectors).

So, now that i got that off my chest, i'll just shut up and get the rest of this thread read before i post anything more to it :)

orion
01-12-2005, 01:16 PM
A few notes before I can no longer find time to reply over the next few days.

1. eZeB

Great. I wish others would follow in your steps and be eager to do more experimentation. You may also want to test EF-ratios, a notion I introduced to SEMs/SEOs in this thread. It allows you to discover the co-occurrence probability of naturally occurring phrases.

2. Claus

Great and true, the human factor. Much needed.

Orion

claus
01-15-2005, 06:49 AM
An interview with Peter Norvig posted January 12 on AlwaysOn, printer friendly version:


Semantic Web Ontologies: What Works and What Doesn't (http://www.alwayson-network.com/printpage.php?id=7480_0_3_0_C)

This is about the shortcomings of semantic markup which is not the same as we're discussing here, but it gives a few insights to the SE view on the general topic of this thread (ie. semantics) and is also an interesting read, although it's strictly non-tech.

claus
01-30-2005, 06:55 AM
Here's what /. (http://science.slashdot.org/article.pl?sid=05/01/29/1815242&tid=217&tid=14) made me aware of today:

A paper by Rudi Cilibrasi and Paul M. B. Vitanyi (CWI, University of Amsterdam) published December 2004, titled "Automatic Meaning Discovery Using Google". Here's the Xarchived paper (http://www.arxiv.org/abs/%20cs.CL/0412098).

I'm not finished reading it yet, but it strikes me as a near-duplicate of Orion's thoughts regarding C-indices and EF-ratios. Of course there are differences, but the line of thought is very similar.

Orion, you should give those guys a polite call and/or email, especially as i see (via /.) that they've even got mention on newscientist.com (http://www.newscientist.com/article.ns?id=dn6924)

---
Added:
(btw. the /. discussion has a few interesting posts as well)

Having read a little more it seems to me that what they're adding is some strictly computer science considerations that shouldn't really affect the main points (such as how to binary-decode strings, and the Kolmogorov complexity of such strings).

They do normalize results and still they derive a "Google probability distribution" which can take values > 1, values which they then have to explain somehow. I think they're more CS-minded than information/statistics minded after all. So far, Orion's approach makes more sense to me, as it leaves out the pointless additional exercises these authors take and focuses on the core matter (Still, they're from a CS faculty, so they probably can't help it - i wonder if they've heard about Occam's Razor, though)

On second thought, these guys are obviously trying to design a computer system, and in that light all the irrelevant details of course have significant value. While they're relevant to the way their system is supposed to work, they're largely irrelevant to the core matter of extracting information using the SERP results count.

Oh, on second attempt, they do derive a normalized information distance, that has values between 0 and 1. Only, as they say, "Unfortunately, the NID is uncomputable" so they have to rely on approximation. Also, they do make this very unrealistic assumption: "Assume that a priori all web pages are equi-probable, with the probability of being returned by Google being 1/M" ...

Nevermind, read and judge for yourself. There's a little algebra involved, though.

orion
01-31-2005, 04:23 PM
Thanks, Claus

Thank you for the heads-up. As for the term distances, I long ago presented a simple way of calculating these in the On-Topic Analysis (http://www.miislita.com/exp1/on-topic-analysis.html) paper, as follows:

5.4.1 Visualization of Term Distances

Normalized co-occurrences may have some implications in the area of data structure visualization. For example, c-indices can be used to produce a visual representation similar to the tree diagrams obtained with InPharmix's PDQ_MED (19). PDQ_MED provides a graphical representation in which the inverse of the frequency of co-occurrences is represented as the linkages between terms. As a modification of this concept, the following transformation could be employed for a pair of terms

d = log (1/c-index)

where d is the distance between terms and the c-index is not expressed in ppt but as a fraction. Thus,

d = log ((n1 + n2 - n12)/n12)


In my view, this distance metric should be enough for SEMs/SEOs. Furthermore, it can be used to resize dendrogram motifs when cluster analyses are used.
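
As a rough illustration of d = log((n1 + n2 - n12)/n12), here is a short Python sketch applied to the aloha counts quoted earlier in this thread. The base of the logarithm is not stated above; base 10 is assumed here, and the function name term_distance is made up for the example.

```python
import math


def term_distance(n1: int, n2: int, n12: int) -> float:
    """d = log((n1 + n2 - n12) / n12), with log taken to base 10 (an assumption)."""
    if n12 <= 0:
        return math.inf  # no co-occurrence: c-index is 0, distance is unbounded
    return math.log10((n1 + n2 - n12) / n12)


# Counts from the aloha example earlier in this thread (n1 = 2,050,000).
N1_ALOHA = 2_050_000
PAIRS = {
    "california": (120_000_000, 307_000),  # (n2, n12)
    "hawaii": (28_800_000, 845_000),
    "florida": (63_900_000, 204_000),
}
DISTANCES = {k2: term_distance(N1_ALOHA, n2, n12) for k2, (n2, n12) in PAIRS.items()}

# Nearest term first.
for k2, d in sorted(DISTANCES.items(), key=lambda kv: kv[1]):
    print(f"aloha - {k2}: d = {d:.2f}")
```

With these counts hawaii comes out nearest to aloha, followed by florida and then california. A different log base would rescale every distance by the same constant factor and leave that ordering unchanged.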


Orion

orion
02-01-2005, 12:16 AM
>> Oh, on second attempt, they do derive a normalized information distance, that has values between 0 and 1. Only, as they say, "Unfortunately, the NID is uncomputable" so they have to rely on approximation.

I found this observation interesting, Claus.

Like internuclear distances in potential theory, distances between terms normally run from zero to infinity. This is precisely what the d value, as calculated from a c-index, tells us [d = log(1/c)].

As a c-index approaches 1, the terms are highly associated within the target search engine or IR system database; thus, from the semantic standpoint, the distance between the terms should approach zero.

On the other hand, if a c-index is effectively zero, d = log(1/c) is infinite, meaning the two terms have no semantic association at all. It is as if the terms existed in two different collections (however, they do not).

Given these facts, the d metric now can be used to assign relative values between terms. This adds a new dimensionality to cluster analysis based on dendrograms. Why?

A dendrogram-based cluster analysis gives the user an idea of the relative grouping of terms. Still, these diagrams do not tell the IR researcher how close or how far the terms are within a given dendrogram element. We now have a method of adding these relative lengths to the dendrograms. This makes the resultant diagrams more complete.

The d metric can also be used to construct fractal-like data structures and get a glimpse of the fractal nature of term semantics within a given database, within a document (if using scaling co-occurrence), or across dissimilar ones.


Orion

PS The log transformation in d is used just to scale down the actual distance which effectively is given by 1/c, thus infinite for c=0. Certainly log of an infinite quantity is an awful quantity.

claus
02-01-2005, 05:53 PM
>> Certainly log of an infinite quantity is an awful quantity.

*LOL*

I'm a bit biased towards the zero-to-one space, i'll admit that. It's a nice simple metric that loses no information. Of course distances can go to infinity but that's no problem as you can just reverse the scale and make an infinite distance become "no closeness at all" ie. zero - then it's just a matter of (number of) decimal points and your zero-to-one string can be as long as you wish (eg. infinite)
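
The "reverse the scale" idea can be sketched in Python as below. It assumes d = log(1/c) uses base 10 (the thread only says "log"), and the function names are made up for the example. Under that assumption the reversed scale is just the c-index again, so the zero-to-one closeness and the zero-to-infinity distance carry the same information.

```python
import math


def distance_from_c(c: float) -> float:
    """d = log10(1/c); base 10 is an assumption, the thread only says 'log'."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c-index must lie in [0, 1]")
    return math.inf if c == 0.0 else math.log10(1.0 / c)


def closeness_from_distance(d: float) -> float:
    """Reverse the scale: d = 0 maps to 1, d = infinity maps to 0."""
    return 10.0 ** (-d)


# Full association, a value near the aloha/hawaii c-index, and no association.
for c in (1.0, 0.028, 0.0):
    d = distance_from_c(c)
    print(f"c = {c}: d = {d:.3f}, closeness = {closeness_from_distance(d):.3f}")
```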

Regarding the paper, the authors are suggesting a "Google probability distribution" so they should really be very careful when observing values outside that space. There is no such thing as a p>1 in a probability distribution as there's no such thing as a probability above certainty. So, such a thing (if observed) should not be interpreted as "somewhat, perhaps, a measure of negative correlation" - the only valid interpretation is an error in the calculus.

Anyway, they are from a CS faculty and not from statistics, and they seem to realize that "it's not really like it should be", so that should not really be held against them (a lot). The paper is a very interesting read and they do develop some very good predictions about word relationships - but, alas, their whole work is based on SERPS, ie. results already indexed, sorted, and selected by Google, which begs the question: Is it their methods that provide the good results or is it really Google? If they deliberately chose a very bad engine, would their methods still work at all? We may speculate, but i guess we'll never know ...

orion
02-01-2005, 08:19 PM
Actually, infinite distances between terms or stems are well documented in IR. Modern Information Retrieval (Chapter 5; Baeza-Yates and Ribeiro-Neto) covers these when discussing cluster associations and stemming. The idea is that terms or stems with an infinite interdistance effectively behave as belonging to two different documents. The idea can be extended to entire collections.

It strikes me they (the paper's authors) normalize a metric only to later on talk about values outside the inherent bounds. I found some portions written in an ad hoc manner.

Orion

usladha
02-11-2005, 06:13 AM
Hi,
This is my first post. I am working on a search engine as my final semester project and i have gained good insight into how search engines work. Can someone help me out with automatic thesaurus generation?? :confused:

orion
02-11-2005, 02:30 PM
Hi, there


Plenty of free resources here: thesaurus generation (http://www.google.com/search?q=thesaurus%20generation), but you would need to do the homework.

Cheers

Orion

xan
02-12-2005, 03:45 PM
Couldn't help joining in, and wanted to add my thoughts as well!

Fuzzy set theory means that search engines, for example, can use a massive database of content to identify the relationships between words, rather than using a thesaurus or dictionary. "Fuzzy set theory" means that a thing can belong to a group to some degree. Simply put, the idea is that it creates an index of word relationships by measuring how often they are used and in what context. The index is a fuzzy ontology of term associations which is used as one of the sources of the search engine's knowledge. Pompous sci's refer to it as "FuzzONT".

The work I have seen here is generally based on the idea for an information system to expand a query, so when you put in "banana" you would also get results with terms related to it. Then user refinement comes into play. Google or MSN don't use this to my knowledge.

Almost all IR systems make use of WordNet:
"WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications."

wordnet online (http://www.cogsci.princeton.edu/cgi-bin/webwn)

This project is heavily supported as well, and may be useful to look at:

CYC project (http://www.opencyc.org/)

Fuzzy sets have been heavily criticised, especially by Haack who argues that "True and False are discrete terms. For example, "The sky is blue" is either true or false; any fuzziness to the statement arises from an imprecise definition of terms, not out of the nature of Truth. "

But:

Fox retaliated by saying that many of Haack's objections "stem from a lack of semantic clarity, and ultimately fuzzy statements may be translatable into phrases which classical logicians would find palatable."

Using fuzzy systems in a dynamic control environment raises the likelihood of encountering difficult stability problems.

For those who want to know more:
People are also working on semantic relationships by using Natural gradient descent (NGD), with neural networks or SVMs, as they have learning capabilities.

Quick reference to the man who introduced it: Professor Lotfi Zadeh, "Fuzzy sets", 1965.

My job mostly involves getting machines to understand meaning and to interact and react appropriately to user input. Semantics are quite important, but so are coherence and cohesion.

orion
02-16-2005, 11:05 PM
Term co-occurrence has less to do with Fuzzy Set theory and more to do with semantics. More about this at SES, NY. Venn Diagrams are often used as visualization aids, no more, no less.

Orion

xan
02-17-2005, 10:14 AM
I asked a senior engineer over at Google to have a look at this thread and see what he thought or what comments he could give. He came back with an interesting reply.

My apologies for not being more helpful
here; I'd like to respectfully decline to comment. Any comment that might
eventually find its way to a message board would only get me into trouble.
;-) This is GoogleGuy's area, so I leave that fully to him.



Agreed matey. And stochastic systems BTW are always used in IR, plus semantics seem to be considered the holy grail round here, and they are important, but there's a lot more to linguistics and computational linguistics than just that, even with term co-occurrence. I mentioned Fuzzy set theory because Orion did and I wanted to clear that up using my own opinion, that's all.


8.Semantic Classes. You would be surprised how many times I have explained that co-occurrence is demographic/culturally-driven, as is the use of hyphens in different languages (or same language different countries).

Punctuation is always ignored. The only time it is a problem is for cohesion, like trying to find out where the end of a sentence is for machine translation purposes.

6.Stop Words. We use two kinds of filters: (a) a stop word list and (b) a library of regular expressions. You may want to check my On-Topic Analysis thread and experiment paper.

Stopword lists vary greatly with each particular use. As for regex, why would you use it to eliminate stopwords as well? When the stops are gone, they're gone; this is ultra easy to do, and a first-year programmer can work it out. I might not understand what you mean though; I most likely got your intention wrong.

4.Cluster Analysis. We use co-occurrence in preliminary experiments and results are taken as input for cluster analysis (LSI, dendrograms, etc) to be specific. We do carry out cluster analysis.

This is a slow method. Pure pattern matching using SVMs or similar methods will be efficient and very fast. It involves less linguistic plough-through and more mathematical pattern matching, where terms are assigned weights which are then normalized.

7.Stemming. At the beginning of this thread one of my intentions was to discuss stemming using Associative Clusters. Then someone started a thread on stemming and I decided to let them discuss it. If you search the SEWF forums for stemming you should find the thread.

Best not to always stem, and different stemmers for different jobs (and languages of course).

9.My Calculator. Overhead processing is always there and a problem. I can handle several thousand records and would like to push for a million. In the On-Topic paper I presented a crude sample using a few hundred. I do agree with you on this. I need more power. A university machine-power center is what I am currently shopping around for. Good things are coming for 2005! But I cannot say more.

Orion, can't you actually use university resources? It's much easier. You should have affiliations; I don't know how it works in your country. We have them here, but power is not a major issue for us at this point, maybe because we have extra resources.

orion
02-17-2005, 01:46 PM
Agreed matey. And stochastic systems BTW are always used in IR, plus semantics seem to be considered the holy grail round here, and they are important, but there's a lot more to linguistics and computational linguistics than just that, even with term co-occurrence. I mentioned Fuzzy set theory because Orion did and I wanted to clear that up using my own opinion, that's all.

Aussie and Xan,

Semantics and co-occurrence, I love that.


Punctuation and demographics. Actually, punctuation does matter in some cases. One is the research we have carried out in the area of hyphenation. In Google, for example, hyphenation in queries in FINDALL (AND) mode introduces a degree of ordering. This degree of ordering acts as a localized EXACT mode within the FINDALL mode. Hyphenation rules may differ even within the same language (e.g., US English vs UK English). In the case of Spanish, we have many regionalisms. So punctuation, as part of the context, gives a term many different meanings. This is where machine translation fails miserably.

Stop Words. Actually, our experiments show many terms fail to escape the regexp filters. No matter how much granularity we put into the library, we keep finding cases in which words fail to be recognized. So we opted for general classes and subclasses and stopped there. Then we put into a bag of words those that escape the "literal" library of filters. This approach helps a lot with both English and Spanish terms (especially very unique terms with high IDF) and can be applied to almost any other language. Words that escape both filters are easily pinpointed with our on-topic analyzer software.
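
As a rough illustration of a two-stage filter of this kind, here is a minimal Python sketch. Everything in it is an assumption made for the example: the stop list, the three regex classes and the function name are invented and are not the filters or the on-topic analyzer software mentioned above.

```python
import re
from collections import Counter

# Hypothetical filters, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}
REGEX_CLASSES = {
    "number": re.compile(r"^\d+([.,]\d+)*$"),
    "single_char": re.compile(r"^\w$"),
    "url": re.compile(r"^(https?://|www\.)\S+$"),
}

def filter_terms(text):
    """Split text on whitespace and return (kept, rejected).

    kept     -- Counter (bag of words) of tokens that escaped both filters
    rejected -- list of (token, reason) pairs for tokens caught by a filter
    """
    kept, rejected = Counter(), []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"'()")
        if not token:
            continue
        # Stage (a): literal stop word list.
        if token in STOP_WORDS:
            rejected.append((token, "stop_list"))
            continue
        # Stage (b): library of regular expressions, grouped by general class.
        for name, pattern in REGEX_CLASSES.items():
            if pattern.match(token):
                rejected.append((token, name))
                break
        else:
            kept[token] += 1        # escaped both filters
    return kept, rejected

kept, rejected = filter_terms(
    "The index of 2005 is at www.example.com and the c index is 0.25"
)
print(kept)
print(rejected)
```

In this sample the word "at" slips through because it is missing from the toy stop list, which is the kind of escapee that inspecting the bag of words is meant to surface.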


Cluster Analysis. This line was in response to Claus. Cluster Analysis is in fact a slow method. We don't use cluster analysis for the implementation you have suggested. We do use it to identify data structures, and it is not that time consuming for this.

Stemming. Agreed. Again this was in response to a direct question from Claus. For what we do, in many instances we don't need stemming.

My Calculator. The resources we need for other parts of our research require access via univ grants. I'm currently at La Jolla, San Diego, CA. At this point for our research power is a major issue for us.

Cheers


Orion

xan
02-17-2005, 02:18 PM
You seem to be doing mostly SEO work Orion. I was talking about it from a pure IR point of view. Of course I use stops, but they change all the time depending on what you're doing.
Machine translation fails on a lot more serious things than punctuation. I work cross-language as well (I also speak French and German), and I can tell you that the rules change in every language. Oriental ones aren't affected by the same things. We don't really use punctuation to assess contextuality either at my end. What your stop method does is pretty standard, and that's a decent route to follow.

As for power...hehehehe, you need a big private grant! ;)

orion
02-17-2005, 02:24 PM
You seem to be doing mostly SEO work Orion.
No.

As for power...hehehehe, you need a big private grant! ;)
Yes. He,He.

Orion

xan
03-01-2005, 10:31 AM
Hi guys,

I know you are busy with the SEW conference in NY and reading the info around it gives good insight into what is going on there. I'm going to a pure research based IR conference soon, so maybe I can give feedback from that end of the world and we can compare.

I have a question, and I am not suggesting anything is wrong or right; I genuinely would like to know:

It appears to me that the EF-ratio and c-index do not yield much information. How am I wrong?

orion
03-04-2005, 04:05 PM
Hi, xan.


The merits, virtues and applications of the c-indices and EF-ratios were well covered at SES NY. At the conference, I presented (in "virtual space") many applications and chart analytics of c-indices and EF-ratios, sharing the podium with world experts like Mike Grehan and Rahul Lahiri (Ask Jeeves).

Many firms I spoke to are using both metrics for their in-house research. Dan Thies, a pioneer and icon in keyword research, spent a good few minutes discussing the merits of the metrics from the marketing and research standpoint.

I wish you were there. Maybe in another SES. I'll probably be at SES Canada and the W3C Japan conference, unless plans change.

Cheers

Orion

xan
03-04-2005, 05:18 PM
Thank you Orion.

It does seem like there was a lot of fun to be had anyway!
Perhaps I shall venture to one of the other events. :)

orion
04-11-2005, 02:52 PM
Chris Sherman has mentioned the link technology of Become.com. http://searchenginewatch.com/searchday/article.php/3496571

Chris writes,

"And unlike PageRank, which essentially pays no attention to the content of a web page, relying entirely on link analysis to compute a ranking, AIR analyzes the content of a page and only indexes it if it can determine that the page contains shopping-related content. It also only crawls on-topic links, discarding non-shopping or spam links."


The portion that interests me the most is the last line of the quote (boldfaced), since it reminds me of one of my old posts in this thread from last summer: c-indices applied to links and email addresses. It is now time to revisit these.


C-Indices as estimates of link co-occurrences

Link-Keyword Co-Occurrence

Let LK1 be the number of links containing term k1
Let LK2 be the number of links containing term k2
Let LK12 be the number of links containing both terms

Then the Link-Keyword Co-Occurrence is given by

clk-12 = LK12/(LK1 + LK2 - LK12)*1000

and is computed like a standard c-index. To indicate that this relates to links and keywords, I added the l and k subscripts to the notation, so we preserve the overall nomenclature.
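
For anyone who wants to try this with their own counts, here is a minimal Python sketch of the standard c-index calculation, c = n12/(n1 + n2 - n12), with the factor of 1000 used above. The function name and the sample counts are made up for illustration; the same function would serve for keywords, links, URLs or email addresses, since only the counts change.

```python
def c_index(n1, n2, n12, scale=1000):
    """Return n12 / (n1 + n2 - n12) * scale.

    n1, n2 -- counts for each item on its own
    n12    -- count for both items together
    scale  -- 1000 mirrors the factor used in the formulas in this thread
    """
    if n12 < 0 or n12 > min(n1, n2):
        raise ValueError("n12 must lie between 0 and min(n1, n2)")
    union = n1 + n2 - n12
    if union == 0:
        return 0.0              # nothing found at all, so no co-occurrence
    return n12 / union * scale

# Invented counts: 1200 links contain k1, 800 contain k2, 150 contain both.
print(c_index(1200, 800, 150))
```

The guard on n12 is there because a joint count larger than either single count usually signals inconsistent result estimates rather than a real co-occurrence.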


Link-URL Co-Occurrence

Let LU1 be the number of links pointing to url 1
Let LU2 be the number of links pointing to url 2
Let LU12 be the number of links pointing to url 1 and 2

Then the Link-URL Co-Occurrence is given by

clu-12 = LU12/(LU1 + LU2 - LU12)*1000


Documents-Email Co-Occurrence


Let DE1 be the number of documents containing email 1
Let DE2 be the number of documents containing email 2
Let DE12 be the number of documents containing email 1 and 2

Then the Doc-Email Co-Occurrence is given by

cde-12 = DE12/(DE1 + DE2 - DE12)*1000


These estimators should help business intelligence analysts and those interested in link-building strategies. The approach can be extended to 3 or more non-mutually exclusive events.
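
The extension to three or more events is not spelled out here, so the following Python sketch rests on an assumption: that the index generalizes to the size of the common intersection divided by the size of the union. It works on sets of identifiers (for example, link ids) rather than on counts; the function name and the sample data are invented for illustration.

```python
def c_index_sets(*groups, scale=1000):
    """Return |intersection of all groups| / |union of all groups| * scale.

    Assumed generalization of the two-event c-index to any number of
    non-mutually exclusive events; with two groups it reduces to
    n12 / (n1 + n2 - n12) * scale.
    """
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    sets = [set(g) for g in groups]
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set.intersection(*sets)
    return len(common) / len(union) * scale

# Invented link ids for three terms k1, k2 and k3.
links_k1 = {"a", "b", "c", "d"}
links_k2 = {"b", "c", "d", "e"}
links_k3 = {"c", "d", "f"}

print(c_index_sets(links_k1, links_k2))
print(c_index_sets(links_k1, links_k2, links_k3))
```

Working from sets avoids the inclusion-exclusion bookkeeping that a counts-only version would need for three or more events.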

One immediate application of these c-indices is in the area of spotting link spammers and email research.

Though in the wrong hands such c-index studies could serve the opposite purpose or even more obscure purposes.



Orion

orion
04-21-2005, 01:54 PM
For those that still don't get this thing about co-occurrence theory, Bill at this other forum http://www.cre8asiteforums.com/viewtopic.php?t=23819 discusses the new Yahoo patent in which they use co-occurrence theory and membership theory. They even use a "hawaii" example similar to the one I used at SES NY.

I see this patent as a rehash of standard co-occurrence theory I have been using for years.

Orion

xan
04-22-2005, 07:02 PM
I see this patent as a rehash of standard co-occurrence theory I have been using for years.

Orion

Agreed Orion!

orion
06-20-2005, 02:26 PM
I want to thank you all for your participation at this thread,
which now spans too many posts.

I was planning on closing it several months ago, but due to lack
of time I was not able to do this. Once closed, I hope it can be
used as a crude reference on c-indices. It is now time to move
to advanced issues on co-occurrence.

I'm upgrading and expanding the series on C-indices at my site.
What motivates me to do this upgrade is the fact that I'm finding
some well-intentioned marketers publishing about c-indices,
co-occurrence and even on-topic analysis without a clear
understanding of the underlying theory.

So far I have updated article 1, only. I plan to upgrade subsequent
articles. Each update will include new information and advances
others have made in the area of co-occurrence theory. I hope
you like the effort and use these concepts in your marketing mix.

Again, thank you all for participating with so much value-added
feedback and recommendations. Hope to see you around,
at a conference or event.


Keep up the hard work. Cheers,


Orion