View Full Version : Term Vector Theory and Keyword Weights
orion
07-06-2004, 12:11 AM
OK, let's start this thread.
This thread is about term vector models and how term weights are computed by search engines and IR systems.
All variants of Salton's Term Vector Model -a keystone in information retrieval studies- demonstrate that the weight of a term in a document is determined with a combination of global (database level) and local (document level) measures.
Most commercial search engines, one way or the other use term vector models. Some take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Some combine this score with other metrics (link metrics, in the case of Google and others).
Thus the notion that keyword weights and keyword density values are equivalent concepts seems misleading. http://www.miislita.com/semantics/c-index-7.html
Let's talk about term vectors and term weights!
Orion
orion
07-06-2004, 09:49 PM
According to Salton's Vector Model, term weights are assessed with local and global information. In the classic model the weight w(i) of a term i is defined as
w(i) = tf(i)*IDF = tf(i)*log[D/df(i)]
where
tf(i) = term frequency, number of times a term i occurs in a document
IDF = Inverse document frequency = log[D/df(i)]
D = database size or number of documents available
df(i) = number of documents containing term i
Known as the Vector Model (or Salton's Term Weight Model), the equation shows that w(i) increases with tf(i) but decreases as df(i) increases. Thus, terms which appear in too many documents (e.g., stopwords, very frequent terms) receive a low weight, while uncommon terms appear in few documents and receive a high weight. This makes sense since too common terms (e.g., "a", "the", "of", etc) are not very useful for distinguishing a relevant document from a non-relevant one. The two extremes are not recommended in rutinary retrieval work. Terms of acceptable weight are those that are not too common or too uncommon.
This is how term weights are computed. Over the years, several modifications to the Vector Model have been proposed. Can you think of one you may want to discuss? Any suggestion?
Orion
orion
07-07-2004, 12:57 PM
First, things to avoid and some reference material:
Term vector theory (TVT) and its discussion has been around for soo long and still is a keystone concept in IR and graduate schools. An old thread at http://www.webmasterworld.com/forum34/241.htm tried to discuss TVT but ended discussing all but TVT in action.
Here we will try to understand how the theory works. The goal is to introduce TVT to a wider audience. (Later on we can proceed with advanced TVT variants.)
Any taker? Schools? SEOs R&D Depts?
BASIC REFERENCES
1. Salton, Gerard. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
2. Baeza-Yates, R., Ribeiro-Neto, B; Modern Information Retrieval, Addison Wesley, 1999.
3. Vector Model Information Retrieval
http://www.hray.com/5264/math.htm
4. Vector Model of Text Retrieval
http://mingo.info-science.uiowa.edu:16080/courses/230/Vector1.html
5. Term Weighting and Ranking Algorithms
http://www.sims.berkeley.edu/courses/is202/f98/Lecture17/sld001.htm
6. Automatic Hypertext Link Generation based on Similarity Measures between Documents
http://www.fundp.ac.be/~lgoffine/Hypertext/semantic_links.html
RELATED MATERIAL from WWW9 conference (somewhat still experimental).
1. The Term Vector Database: fast access to indexing terms for Web pages
http://www9.org/w9cdrom/159/159.html
2. Graph structure in the web
http://www9.org/w9cdrom/160/160.html
3. WTMS: A System for Collecting and Analyzing Topic-Specific Web Information
http://www9.org/w9cdrom/293/293.html
4. On Near-Uniform URL Sampling
http://www9.org/w9cdrom/88/88.htmln Near-Uniform URL Sampling
5. What is this Page Known for? Computing Web Page Reputation
http://www9.org/w9cdrom/368/368.htmltions
Orion
Incubator
07-07-2004, 01:38 PM
Hi Orion, great post again, was wondering if you could break something down a little more in lamens terms for some of us :)
Landing page, 500 words, keyphrases used 3 max per page, keyphrases being 2-3 words deep across the content of the landing page. How do you recommend the implementation of those words? I am not speaking about layout but how semantic terms will fall accordingly across the remaining content
Cheers and great to see your posts again !!! :)
detlev
07-07-2004, 04:29 PM
Hello everyone,
For the SEO, Term Vector translates to a competitive term getting less weight on a per-page basis. It was Term Vector that hinted to us to go after terms that are not searched upon very highly (according to WordTracker anyway). The "competition" for those terms is less.
The competitiveness of a term makes it less weighted by Term Vector theory in practice becuase of the increased number of documents on the Web which contain those terms. Therefore, you would have to stuff a given page with a competitive term in order to rank well. Cloaking excels at that, copywriting style suffers.
When you don't cloak, the idea goes: You could more naturally write a page with less competitive terms in mind and attract all sorts of naturally valuable traffic as an alternative approach and build more specialized documents. The Term Vector balance would favor a page with less competitive terms and you could more easily go after those rankings.
An additional benefit is that a naturally written page does not display on the SERP as pure gibberish like many listings of stuffed pages. Hence, it is not necessarily out of sheer laziness or ineptitude but sometimes good sense to tell clients that highly competitive terms are a waste of effort when the best performing pages on a given query are cloaked and stuffed.
Once Google came into the picture and links were more important, competitive terms became somewhat more reachable again. Term Vector, as I recall, was most obviously applied at AltaVista before Google got famous.
My .02
*cheers*
-detlev
cjtripnewton
07-07-2004, 05:25 PM
Not to argue detlev, but Orian has it correct when he states "Thus, terms which appear in too many documents (e.g., stopwords, very frequent terms) receive a low weight, while uncommon terms appear in few documents and receive a high weight."
Now that has very little to do with the Wordtracker score for a term, which samples the number of times a term is querried, not the number of times the term appears in documents.
In the context of TVT, to determine how "common" a term is, you should simply go and do a quick Google search for the term. If you're comparing two terms, the one which appears in the smallest number of documents is the least common or in Orian's terminology, the most "uncommon." The one which appears in the largest number of documents is the most common, and receives a low weight.
You're right to bring up cloaking versus copywriting. Trying to win for a term which meets some threshold of commonness requires much more effort, and many just resort to cloaking such terms. It's obviously even worse the the term also has a high Wordtracker score, because then we all focus more on winning it, making it even more common. It's a cycle I've seen repeated many times.
detlev
07-07-2004, 07:02 PM
Hello everyone,
Well put Newt. I concede that WordTracker has nothing to do with TVT directly and should not have been brought into the equation exactly. Thank you for clarifying my post without destroying what I was trying to say.
Yes, a simple query at the SE in question for the term is how to determine the number of docs which contain the term. I failed to make this point and instead relied on the assumption that one would do that.
Thanks!
-d
orion
07-07-2004, 11:02 PM
Welcome to this thread Incubator, detlev and cjtripnewton. Please feel at home.
To Incubator:
I am answering your question at the Keyword Co-Occurrence and Semantic Connectivity thread at http://forums.searchenginewatch.com/forum/showthread.php?p=4731#post4731
I feel other readers there may benefit from the discussion. That thread is moving to the second phase; i.e. terms co-occurrence at the document level. I hope future posts there will clear any reserves certain SEOs may have on the benefits of having co-occurrence, semantic connectivity and terms sequencing strategies in their "tool box".
To cjtripnewton:
"Now that has very little to do with the Wordtracker score for a term, which samples the number of times a term is querried, not the number of times the term appears in documents."--Well put.
To detlev:
1. "Once Google came into the picture and links were more important, competitive terms became somewhat more reachable again. Term Vector, as I recall, was most obviously applied at AltaVista before Google got famous.."
Thanks to Andrei Broder, seos learned abou their Term Vector Database project, just one variant of the many outthere on Salton's model. Actually, The Term Vector Model has been around and applied before the inception of search engines in the scene. Before AltaVista, Brian Pinkerton implemented TVT. Check
http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html,
In this old paper he states
"The WebCrawler's database is comprised of two separate pieces: a full-text index and a representation of the Web as a graph. The database is stored on disk, and is updated as documents are added. To protect the database from system crashes, updates are made under the scope of transactions that are committed every few hundred documents.
The full-text index is currently based on NEXTSTEP's IndexingKit [NeXT]. The index is inverted to make queries fast: looking up a word produces a list of pointers to documents that contain that word. More complex queries are handled by combining the document lists for several words with conventional set operations. The index uses a vector-space model for handling queries [Salton]."
The original and classic model represented by the crude w=tf*log(D/df) equation works well with long queries but not with short queries. Because of this and other reasons, IR proposed several variants. Most commercial search engines implements variants of TVT.
Other search engines, like Google and others use a combination of link metrics and term vector weights. The trend in the last years has been to integrate link metrics with term vector schemes (with several flavors and variations) Check http://www.e-marketing-news.co.uk/topic_distillation.pdf
On Google and Term Vector Theory.
Actually, since its inception in the scene, Google has been using the term vector theory. Check http://www.webpronews.com/ebusiness/seo/wpn-4-20010905GoogleInterviewbyFredrickMarckini.html
In that webpronews article, Fredrick Marckini interviews Google's Craig Silverstein. Craig states,
"The Term Vector Theory
Google's algorithm incorporates the ideas and understanding behind the term vector theory. While the elements of the term vector theory can be quite complex, Craig offered a rather basic definition of how the theory originated. A premise of the term vector theory "says the documents are good if they contain the words in your query and they contain them a lot," explained Silverstein. As search has matured and grown more complex, Google has adapted their algorithm to complement these changes and to account for those who try to cheat and trick the search engines. While the algorithm has adjusted with the times, in essence it still embraces the beliefs behind the term vector theory.
Scoring = PageRank + Term Vector
The term vector factors of the Google ranking algorithm, which will be covered below, concentrate on how relevant a page is to a user's search. This score, combined with the PageRank score that measures the popularity of the page, is how Google derives an overall score or ranking of a Web page. Thus, the Web pages that receive high scores are, in Google's opinion, the Web pages that best meet the user's individual needs."
2. "Yes, a simple query at the SE in question for the term is how to determine the number of docs which contain the term."--Actually, the df(i) in the TVT equation is number of documents in D (all documents available in the system) which contain the term, not exactly number of retrieved results from the database. From this subset, some may not contain the queried term. The assumption that
df(i), # documents containing term i = # documents retrieved by querying term i
introduces an error in the analysis. This is a drawback of keyword-driven searches. A system may return documents semantically relevant, yet without the queried term present in the document. Depending on the degree of recall/precision performance this error may not be critical. To minimize this error one can conduct a regexp search or use EXACT mode.
I hope this post has helped in some way to clear some confusion.
In the next posts, I will explain the drawbacks of the original Salton's Term Vector Model with commercial searches. Then we can get into several solutions proposed by IR scientists.
Orion
Incubator
07-07-2004, 11:40 PM
Thanks Orion, appreciate the reply. I am interested in seeing where this takes us from a "cloaking" point of view, if we remove the human element on this topic then it will be very evident that these theories can be broken down to templates that we can deliver...and i think thats what alot of ppl here maybe thinking but not admitting. I could be wrong... but welcome this as a open debate
cheers all
WC
orion
07-11-2004, 12:25 AM
Material from this and future posts are taken from my new series on Term Vector Theory and Keyword Weights available at http://www.miislita.com/term-vector/term-vector-1.html
Keyword density values are not good term weight estimators. One must use local and global information to assess term weights.
This can be accomplished with the Vector Space Model. Lee, Chuang and Seamon (http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf) compare six different term weight models Readers interested in term vector source code may want to check the MINERVA implementation (http://www.ifcomputer.co.jp/MINERVA/Manual/Reference/Predicates/termvector/home_en.html).
According to Jamie Callan, from Carnegie Mellon University, (http://hartford.lti.cs.cmu.edu/classes/95-778/Lectures/04-BooleanVectorSpaceB.pdf) historically weights were computed using tf information without IDF values. Since I like history, let's start the discussion by representing weights with tf values only; ie., wi = tfi. Keep in mind that, with some modifications, the following procedure also applies to any term weight scheme.
DEMYSTIFYING TERM VECTORS
The following material is taken from my article available at http://www.miislita.com/term-vector/term-vector-2.html, which includes an in-depth analysis with figures and step-by-step calculations.
Consider an index term consisting of the words "car", "auto" and "insurance". This is my term space. Assume that the database collection consists of 3 documents, only. The term counts (tf) or number of times these terms occur in each document is as follow.
doc 1: auto (3 times), car (1 times), insurance (3 times)
doc 2: auto (1 times), car (2 times), insurance (4 times)
doc 3: auto (2 times), car (3 times), insurance (0 times)
If we query the system for "insurance", then the counts for the query in the term space are 0 for auto, 0 for car and 1 for insurance.
Evidently,
1. In this case the term space consists of three dimensions, auto, car and insurance. The term counts are the coordinates of a point in the term space that correspond to each document. The coordinates of each point are then (3,1,3), (1,2,4) and (2,3,0), respectively.
2. If the origin coordinates are (0, 0, 0), then the displacement of each point from the origin can be represented by a vector.
3. The length or magnitude of this vector can be measured with Pithagoras' Theorem. The coordinates associated to the query are (0, 0, 1).
To calculate the magnitude of each vector, we apply Pythagoras' Theorem. For n dimensions, we can write |Di|=(a2 + b2 + c2...+ n 2) 1/2. In our example, this gives the following magnitudes
|D1|=4.3589
|D2|=4.5826
|D3|=3.6056
|Q| = 1
CALCULATING THE DOT PRODUCTS, COSINES AND RANKING THE RESULTS
To compute each cosine value we need to know the scalar or dot product of the query vector and document vectors. It can be computed from coordinate values. For document 1 we have
Dot Product, Q•D1 = 0*3 + 0*1 + 1*3 = 3
Similar calculations gives for documents 2 and 3, Q•D2 = 4 and Q•D3 = 0.
Now we calculate the corresponding cosine values. Since the dot product is defined as the product of the magnitudes of the vectors times the cosine of the angle between them (e.g., for doc 1 we have Q•D1 = |Q|*|D1|*cosine), solving for the cosine we obtain in each case
cosine doc 1 = 0.6882
cosine doc 2 = 0.8729
cosine doc 3 = 0
Finally we sort and rank the documents in descending order according to the cosine values
Rank 1: Doc 2=0.8729
Rank 2: Doc 1=0.6882
Rank 3: Doc 3=0
As we can see, document 2 is the most relevant to the query "insurance". Document 1 is less relevant. Document 3 is completely irrelevant. The closer a cosine is to 1, the more relevant a document should be. If the cosine is zero, then the documents and query are orthogonal in the term space. In English this means that the documents and the query are not related. This is the case of Document 3;at least with our term counts vector model. True that we could have arrived to this conclusion by just looking at the term counts table above. It so happen that with this term vector scheme we have proved that the cosine between document vectors and query vecotrs are a valid similarity measure.
At this point, one may think. "Wait a second. I can divide tf by the total number of words, calculate a keyword density value and arrive to a similar conclusion". Not so fast. This is a term count model, historically one of the first variants of the vector model. Most commercial search engines do not use this model and with good reasons. The system can be deceived by just repeating over and over a given term (keyword spamming). This was the case of the first poorly written search engines and commercial IR systems. To assess the weight of a term, these days we use local, global and web graph information, not mere term counts or "keyword density" values.
We can do better by multiplying tf values times IDF values. Thus, in the above calculations we just need to replace wi=tfi with wi=tfi*IDF values. The term vector calculations remain essentially intact.
In the next articles, I will explain how this can be done and what we gain from such modifications. Until then, please feel free to comment this post.
Orion
thememaster
07-19-2004, 12:25 AM
Hello Orion,
I just wanted to chime in and say I think you're doing a great job in this thread explaining TVT. I've been working in this area for quite a while now and much of these principles are part of what's behind my own software. You can definitely put me on the list of people interested in the more advanced discussion of TVT variants when/if you decide to get into that later.
- Mike
orion
07-20-2004, 11:12 PM
Welcome to this thread Theme Master. It's an honor to having you here.
I appreciate a lot your kind words. I'll be presenting soon normalized term vector models. Then we can move to the "meat". As you and others already probably know, term vector theory is not that complicated. It just happen that many IR folks like to mask the key concepts with unnecessary nomenclature. While this strategy locks out outsiders, it makes IR concepts look unnecessarily complex. I am of the IR school that believes in simplicity. Perhaps fellow IRs are too protective of "secrets".
I visited your site. Awesome. Feel free to contact me by regular email and we can talk a bit more.
Orion
orion
07-26-2004, 08:31 PM
Unlike the term count model, Salton's Vector Space Model incorporates local and global information
Eq 1: Term Weight = tf*IDF = tf*log(D/df)
where
tf = term frequency (term counts) or number of times a term i occurs in a document j.
df = document frequency or number of documents containing term i.
D = number of documents in a database.
IDF is the inverse document frequency defined as log(D/df).
EXAMPLE
Consider the following example, courtesy of Professor David Grossman and Ophir Frieder, from the Illinois Institute of Technology. Suppose we query an IR system for the query "gold silver truck". The database collection consists of three documents (D = 3), with the following content
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
So, the query is Q = "gold silver truck". For step-by-step calculations, graphics, and detailed explanations see http://www.miislita.com/term-vector/term-vector-3.html
PROCEDURE
This is what we normally do:
1. we construct an index of terms from the documents and determine the term counts tf for the query and each document.
2. we compute the document frequency df for each document and each IDF value. In this case D=3, so IDF = log(3/df).
3. next, we take the tf*IDF products and compute term weights.
4. Now we treat weights as coordinates in the vector space, effectively representing documents and the query as vectors. This step is the main difference between this model and the model previously discussed. That is, we are including global information in the vector coordinates.
Steps 1 - 4 are pretty straightforward.
To find out which document vector is closer to the query vector, we resource to the similarity analysis introduced before (See previous posts)
First for each document and query, we compute all vector lengths;
|D1| = 0.7192
|D2| = 1.0955
|D3| = 0.3522
|Q| = 0.5382
Next, we compute all dot products.
Q•D1 = 0.0310
Q•D2 = 0.4862
Q•D3 = 0.0620
Now we calculate the similarity values
Sim(Q,D1) = Cosine D1 = Q•D1/(|Q|*|D1|) = 0.0801
Sim(Q,D2) = Cosine D2 = Q•D2/(|Q|*|D2|) = 0.8246
Sim(Q,D3) = Cosine D3 = Q•D3/(|Q|*|D3|) = 0.3271
Finally we sort and rank the documents in descending order according to the similarity values
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
OBSERVATIONS
This example illustrates several facts. Very frequent terms such as "a", "in", and "of" tend to receive a low weight -a value of zero in this case. Thus, the model correctly predicts that very common terms, occurring in many documents in a collection are not good discriminators of relevancy. Note that this reasoning is based on global information; ie., the IDF term. Precisely, this is why this model is better than the term count model discussed in Part 2.
LIMITATIONS
As a basic model, this term vector scheme has several limitations. First, it is very calculation intensive. From the computational standpoint it is very slow, requiring a lot of processing time. Second, each time we add a new term into the term space we need to recalculate all vectors. Computing the length of the query vector requires access to every document term, not just the terms specified in the query.
Other inconvenients include
1. Long Documents: Very long documents make similarity measures difficult (vectors with small dot products and high dimensionality)
2. False negative matches: documents with similar content but different vocabularies may result in a poor inner product. This is a limitation of keyword-driven IR systems.
3. False positive matches: Improper wording, prefix/suffix removal or parsing can results in spurious hits (falling, fall + ing; therapist, the + rapist, the + rap + ist; Marching, March + ing; GARCIA, GAR + CIA). This is just a pre-processing limitation, not exactly a limitation of the vector model.
4. Semantic content: Systems for handling semantic content may need to use special tags (containers)
MODEL IMPROVEMENTS
We can improve the model by
1. getting a set of keywords that are representative of each document.
2. eliminating all stopwords and very common terms ("a", "in", "of", etc).
3. stemming terms to their roots.
4. limiting the vector space to nouns and few descriptive adjectives and verbs.
5. using small signature files or not too huge inverted files.
6. using theme mapping techniques.
7. computing subvectors (passage vectors) in long documents
8. not retrieving documents below a defined cosine threshold
The model also can be improved by normalizing term and query frequencies rather than using raw frequencies. Some vector schemes apply other modifications to the IDF term.
This is where we are heading to.
Orion
>It just happen that many IR folks like to mask the key concepts with unnecessary nomenclature.
Well said [I think].
rustybrick
07-26-2004, 11:00 PM
Orion, I would just like to thank you for this thread and the keyword co-occurrence thread. I personally have learnt a lot from your posts.
Thank you.
I will be chiming in later with questions but before I do so, I want to read your posts 3 or 4 times over to make sure I have a clear understanding.
orion
07-26-2004, 11:38 PM
Thanks, RustyBrick for such kind words. I just hope others can see through my typos and grammar horrors.
The end goal and main thesis of all my posts is to make seos/sems less prone to second-guessing and trial-and-error approaches and more aware of the Scientific Method. The necessary analytical tools are outthere. It just a matter of finding them. The more seo/sem specialists know about them, the better, I think.
Orion
orion
07-28-2004, 12:25 PM
I forget to mention that Dr. David Grossman and Dr. Ophir Frieder, professors cited above, and who kindly gave me permission to use their term vector example are the authors of the authority book
"Information Retrieval: Algorithms and Heuristics" - Kluwer International Series in Engineering and Computer Science, 461.
Originally published in 1997, this material is available in a new edition this year. I though dedicated search specialists like Theme Master, Rustybrick and others may be interested in knowing the datum.
This is a must-read literature for graduate students, search engineers and search engine marketers. The book focuses on the real thing behind IR systems and search algorithms. Perhaps is time for the SEO/SEM industry to pay less attention to seo speculations disguised as "facts" and become more familiarized with the scientific facts behind search technologies. Please take this as my kind two cents.
Orion
thememaster
07-28-2004, 12:52 PM
Thanks for the resource Orion. I'll definitely be getting a copy of that.
rustybrick
07-28-2004, 12:53 PM
Orion, do you know where I can get a copy? Amazon seems to have the old edition but they are out of stock. Is there an ISBN number for the new edition? All the sites I pulled up under that title were selling the old edition.
Thanks.
orion
07-29-2004, 10:09 PM
Hello Theme Master and Rustybrick
Dr. Grossman mentioned to me the new book will be shipped to press by late August. He asked me if I could review a draft and I accepted. I have reviewed scientific publications before and is a cutthroat task. So, I may not be able to say more, except that wait until the book comes out after August or so via Amazon. As soon as it comes out I will let you know, most definitely.
Orion
thememaster
08-04-2004, 04:25 PM
Okay. Thanks for the update on that.
orion
10-18-2004, 12:39 AM
The original Term Count and Classic Term Vector Model are easy to game via term repetition (keyword spamming) and by using too long documents and queries.
This can be avoided by
1. using normalized frequencies in both documents and queries
2. incorporating a document/query length factor into the tf*idf scheme.
FOR DOCUMENTS
The normalized frequency of a term i in document j is given by the ratio
tf(i,j)/max tf(i,j)
where
tf(i,j) = frequency of term i in document j
max tf(i,j) = maximum frequency of term i in document j
For example, consider a document A in which the term counts are as follows
major, 1
league, 2
baseball, 4
playoffs, 5
Since playoffs occurs the most in the document, the normalized frequencies are
major, 1/5=0.20
league, 2/5=0.40
baseball, 4/5=0.80
playoffs, 5/5=1
FOR QUERIES
The normalized frequency of a term i in a query q is given by
0.5 + 0.5*(tf(i,q)/max tf(i,q))
tf(i,q) = frequency of term i in query q
max tfq,i = maximum frequency of term i in query q
DOCUMENT/QUERY LENGTHS
Mark Sanderson and Ian Ruthven in the Report on the Glasgow IR group (glair4) submission (http://www.dcs.gla.ac.uk/~igr/Papers/trec5.pdf) suggest an expression that accounts for the length of documents.
In my opinion (3), an equation of the same form as described by Sanderson and Ruthven could be applied to both documents and queries by
1. for documents, using normalized frequencies as tf(i,j)/max tf(i,j)
2. for queries, using normalized frequencies as tf(i,q)/max tf(i,q)
3. defining the length of documents and queries as number of terms, excluding stopwords and punctuation.
The main advantage of this proposed scheme is that too long documents and queries can be penalized and are not overweighted.
In general, one can conduct some simple tests and sense which vector-based search engines do not penalize for document lengths and repetition of terms.
To test the query weight model used by a vector-based search engine try the following test:
1. Check the # retrieved documents and top N ranked results for
Q=k1
Q=k1+k2
Q=k1+k2+k3
where k1=k2=k3
2. Compare results.
3. Note: k's can be key terms or phrases
Try the test in FINDALL mode. Avoid the EXACT mode.
Suggested terms
k1=k2=k3=car
k1=k2=k3=dog
Try this test with Vector-based IR systems/search engines only.
How are the # of retrieved results affected?
How are the top N ranked document affected?
How are paid results affected?
Share your observations and discuss possible explanations. Are you up to the challenge?
REFERENCES
1. Baeza-Yates, R., Ribeiro-Neto, B;Modern Information Retrieval;Addison Wesley, 1999.
2. G. Salton, C. Buckley;Term-weighting approaches in automatic retrieval
Information Processing &Management 24(5):513-523, 1988.
3. Frequency-Normalized Models (http://www.miislita.com/term-vector/term-vector-4.html)
Orion
Mike Grehan
10-18-2004, 11:34 AM
I forget to mention that Dr. David Grossman and Dr. Ophir Frieder, professors cited above, and who kindly gave me permission to use their term vector example are the authors of the authority book
"Information Retrieval: Algorithms and Heuristics" - Kluwer International Series in Engineering and Computer Science, 461.
Originally published in 1997, this material is available in a new edition this year. I though dedicated search specialists like Theme Master, Rustybrick and others may be interested in knowing the datum.
This is a must-read literature for graduate students, search engineers and search engine marketers. The book focuses on the real thing behind IR systems and search algorithms. Perhaps is time for the SEO/SEM industry to pay less attention to seo speculations disguised as "facts" and become more familiarized with the scientific facts behind search technologies. Please take this as my kind two cents.
Orion
Hey Orion,
Good to see you wearing a moderator hat now.
One or two small world connections in this thread which seems to have risen back to the top with your recent post. I seem to have missed it completely when it started.
Dr David Grossman was kind enough to make a contribution to the second edition of my book when it was first published two years ago.
He also sent me a partial draft of his new book for feedback (he was only five chapters into it at that time). I noted your comment about being a reviewer for his book... Sorry I'll be putting you under the same pressure with my third edition!
Also, you mention Brian Pinkerton. Again, Brian was a tremendous help to me in introducing the vector space model in layman's terms. He did implement it in "classic Salton style" as he put it. But of course, even though most search engine optimisers at the time didn't quite know what they were gaming, Brian realised quite quickly that it wouldn't scale too well with the web and that it was open to "manipulation".
I always credit Brian with being the developer of the first full text retrieval search engine. But you'll also find reference to the vector space model in Michael Mauldin's original research work for Lycos at around the same time as WebCrawler.
And with the help of Andrei Broder (certainly one of the leading minds in the field of information retrieval on the web) I was able to develop some basic graphics to give my readers a visual aid to understanding how an inverted index is created and then terms are weighted by tf and idf.
I thought they may be useful in this thread for members who are coming into this from the very beginning. They are purely for demonstration purposes and they have been very useful at the many conferences and seminars I speak at in helping people to grasp the fundamentals.
http://www.search-engine-book.co.uk/inverted-index.html
As for reference, you've outlined some excellent material. Let me just add a couple more which I find to be invaluable.
The first one is a personal favourite and an absolute must (I believe) for those wishing to get the true background, history and science behind information retrieval.
Readings in Information Retrieval.
by Karen Sparck Jones.
In my opinion, Karen Sparck Jones is very much "up there" with Gerry Salton. And she's English to boot!
As you say yourself: "Perhaps is time for the SEO/SEM industry to pay less attention to seo speculations disguised as "facts" and become more familiarized with the scientific facts behind search technologies. Please take this as my kind two cents."
The next must have IMO is:
Mining the Web: Analysis of Hypertext and Semi Structured Data
by Soumen Chakrabarti.
Soumen is delighted that his name is actually and anagram of Anarchism Outbreak! His book is enormously helpful.
Cheers!
orion
10-18-2004, 01:06 PM
Hi, Mike.
Thanks for your kind words.
The late Gerard Salton along with Benoit Mandelbrot are my heroes. There is a short distance between semantic models and self-similarity and scaling concepts.
"I Sorry I'll be putting you under the same pressure with my third edition!"
I will love to review your new book, Mike.
Orion
dgray
11-05-2004, 02:34 PM
systems and search algorithms. Perhaps is time for the SEO/SEM industry to pay less attention to seo speculations disguised as "facts" and become more familiarized with the scientific facts behind search technologies. Please take this as my kind two cents.
Orion
Absolutely, I agree that familiarity with the math behind the machine is important. I accept that the SE's use a well thought-out algorithm for generating a result set from a search query. The big question is what happens next, for I am equally convinced that they then apply one or more filters to the result set to screen out those who have tried to game the system. And let’s face it SEOs and SEMs to one extent or another are gaming the system, rather than letting the ‘natural’ rise to dominance take place. The challenge for us in the industry is to ‘promote’ our clients to the top without incurring the wrath of the keepers of the SE faith.
randfish
12-06-2004, 03:20 PM
Hi all,
I realize this is an old thread, but I'm building a tool (open & free to all) that will measure the term weight of the top 10 search results for a specific keyword phrase against a page that the user can select.
I'm having a little trouble regarding normalized term frequency. When I calculate it;
1. Should I be measuring the max tf(iq) as the highest number of occurences of a single word in each document?
2. Should I exclude stopwords? (http://www.searchengineworld.com/spy/stopwords.htm)
3. Should I only measure keyword phrases that are the same number of words as the query - i.e. if the query is 'used cars', would I count individual uses of the word 'cars' in the document when finding max tf(iq)?
4. When I use normalized term frequency, should I be multiplying the normalizer times the term frequency, so that the equation would look like:
TW = (NTFi * tfiq) * Log10(D / dfi)
Where:
NTFi = normalized term frequency (as per the equation from Orion)
tfiq = term frequency
D = # of documents in database
dfi = # of documents containg the term
Thanks much!
orion
12-06-2004, 06:48 PM
Hi randfish;
I hope this help a bit.
To answer your questions,
Should I be measuring the max tf(iq) as the highest number of occurences of a single word in each document?
There are two types of tfmax, one is for the query. The other tfmax is for each individual document.
Should I exclude stopwords?
Yes.
Should I only measure keyword phrases that are the same number of words as the query……?
Term vectors are normally calculated for individual terms. However I know of work done in which noun phrases (also known as document concepts) are considered within a modified vector scheme. See the Local Context Analysis (LCA) thread and Professor Bruce Croft’s paper.
When I use normalized term frequency, should I be multiplying the normalizer times the term frequency…
It all depends of how you define the normalizer.
Now about the purpose of the tool…
…I'm building a tool (open & free to all) that will measure the term weight of the top 10 search results for a specific keyword phrase against a page that the user can select.
Term vector models are used to compare query and document vectors that are common to the same term space. Users randomly/arbitrarily selecting documents may be selecting documents not belonging to the same term space in question. Thus, some may argue that such users would be making questionable comparisons.
In general, any type of term vector scheme is of the general form
F(w) = I(local)*I(global)
That is, the weight w is function of
I(local) = local information
I(global) = global information
The local information comes from the current document, while the global information comes from the database.
This is the main reason of why keyword density values are not good term weight estimators and should be avoided. Note that the weight of a term in a document is affected by the global information that it carries in a database. The index terms is constructed from the documents in the database.
In the example you describe your calculator will be using the “global” information that comes from just 10 documents rather from the entire database. Your index terms will then need to be constructed from 10 documents, only. Thus, weights will be too localized.
Since the documents are already scored using a ranking scheme, anything that such calculator weighs is as good as how the documents were ranked or mis-ranked. I have seen sites computing “term weights” in this manner. Honestly, I have strong reserves about such pseudo “term weights”.
Now if you have access to an entire database, you will need to construct the index terms and the term space from all the documents in the database. In addition, the IDF term must include this total as well as the number of documents containing the term.
As you can see, randfish term weight calculations are computational expensive. To top off, each time a document is added/removed/modified your index term changes. There are others term vector schemes with workarounds but are too complicated to describe in a single post as this one.
Orion
randfish
12-06-2004, 09:15 PM
Orion,
Thanks for your help and your criticism - it is duly noted!
I understand the limitations and inaccuracy of the tool - I'll attempt to increase the size of database to include at least 50 and possibly 100 documents - as you mentioned, it is somewhat computationally expensive to do so.
The goal is not so much to gain a true, accurate value for the term weight, but rather a reasonable number on which to make a comparison. I would guess that even in a database as small as the 10 documents I proposed, there would be some meaningful data one could derive by simply looking at the relative weights. However, increasing this to 50-100 would probably provide much better results.
I was still a little confused about the use of the normalizer - in your example from above, I had thought that:
NTF = 0.5 + 0.5(tfiq / Max tfiq)
Where:
tfiq = term frequency in the document
Max tfiq = maximum frequency of any term in the document
2 More questions:
1. Should this be the number I use as a normalizer, or are there more steps?
2. Why would the equation above (and the one from my previous post) not be applicable directly to 2, 3,4 or 5 word phrases?
orion
12-07-2004, 08:15 PM
Hi, randfish.
Should this be the number I use as a normalizer, or are there more steps?
Some authors use the following weight scheme:
Normalized frequency for the query, fq = 0.5 + 0.5*tfq,i/max tfq,i
Normalized frequency for the document, fd = tfd,i/max tfd,i
Why would the equation above (and the one from my previous post) not be applicable directly to 2, 3, 4 or 5 word phrases?
Normally term vector schemes are for measuring cosine similarities between documents and queries.
Now about the goal of the tool and experiment…
Before proceeding any further, I most highlight, randfish, that your intended tool has merits, however you need to consider the possible drawbacks of the experiment to avoid wrong conclusions.
I don't know if your goal is to calculate only term weights without term vectors. In any event, let's consider both. Then, you can keep only the part that is concerned with the purpose of the experiment. Let me proceed.
First in question 1, not everybody agree with the described normalized schemes since
(a) it adds a “free” 0.5 weight to terms in the query, regardless of relevancy.
(b) It makes non sparse the query-document matrix, requiring more computational efforts.
(c) There is no definitive evidence is better over other schemes.
(d) It does not account for the length of documents.
(e) With small databases, the IDF term is often zero, thus no weight computation is possible, unless one use a different IDF scheme.
Second in question 2, if you want to include phrases, you can but then you would have to transform the term space. For consistency sake, then you would need to use the same n-at-a-time grouping for all calculations.
To illustrate, let say you want to count only pairwise phrases (counted as single units). Then the term space should not include single terms, triples, tuples, etc. There is nothing wrong with this approach, but you will be limiting the term space to grouped terms. The same for the query. This would limit the analysis but it has other advantages.
Third, the comparison part is intimate linked with how the index term is created.
Let say you
1. Define the database size D = 10 (or 50, 100)
2. Remove all stopwords
3. Tokenize the survival terms (lower case them, remove punctuations, numerics, strange characters, and delimiters).
4. From the survival terms, construct the term space.
5. Compute the IDF measure IDF=log(D/d) where d is the number of documents containing the queried term.
6. Calculate the term weigths using the frequency*IDF scheme.
7. Calculate the term vectors for the documents.
8. For query terms, repeat steps 2 and 3 and compute term weights and term vectors.
9. Calculate the cosine similarities between documents and queries
10. rank results.
That’s about it.
Note that term vector schemes are “closed” systems --to paraphrase thermodynamics-- meaning outside comparisons are not possible.
Any comparison must be made between terms and documents already in the database and terms already in the index of terms. You cannot make comparisons with documents not present in the database or terms not found in the index of terms, unless you add the documents to the database and repeat steps 2 through 6.
This means that the following two are pointless:
1. Comparison of documents or terms in this database with other databases or subsets in other databases.
2. Comparison with randomly selected documents or terms not found in the test database or in the test index of terms.
Additional information is given in the Term vector (http://www.miislita.com/term-vector/term-vector-4.html) articles. I plan to include a collection of other TV schemes with their pro’s and con’s.
I hope this shed some light.
Orion
randfish
12-07-2004, 08:53 PM
Orion,
It takes us laymen a while to digest, but yes, your help has been invaluable and selfless as always. Thank you. I'll return when I have time to process everything.
orion
02-18-2005, 09:11 PM
I have been asked some interesting questions, that more or less can be summarized like this:
Could we use the classic Term Vector Space model to calculate term weights, similarities and distances for just two documents without considering an entire database?
The answer is no. In this model the weight of a term is given by two type of information
(a) local, expressed using term frequencies (occurrences)
(b) global , expressed using inverse doc freq
(b) to compute IDF one need to know the database size.
If you know the database size, then doc distances between any two docs and for a given query with TVT is possible using the Euclidean distance between these.
For similarity and distance between just two docs, others methods can be used. One of such methods is the Dice Coefficient, but there are other methods equally acceptable.
Let doc1 and doc2 be two different documents. To apply Dice, proceed as follows
Define a Dice function such that Sim(doc1, doc2) = 2n12/|n1+n2| where in this case
n12 = a co-count
n1, n2 = individual counts in doc1 and doc2
Now define
count=unique terms
n1=number of unique terms in doc1
n2=number of unique terms in doc2
n12=number of terms common to doc1 and doc2
The answer should be taken for an estimate of the similarity between docs. A similiar estimate can be applied to individual words, phrases, passages, etc. Warning: this sim measure is not precisely a measure for assessing semantics. For that you would need to consider other things.
Some important references
tele.engr.usu.edu/EECS803_Fall_2000/ Project_2/Papers/Pandaram_2.doc
http://crpit.com/confpapers/CRPITV32Chen.pdf
Or do a search for the query
dice coefficient for "two documents"
in Google. (note the quotes in two documents)
Orion
Phew,
I got tired reading all of that. It all got a bit over-complicated there and I almost boiled over! I think other people might have been been blown into oblivion as well, so I'm going to try a "words only" explaination:
In a very simple way, vector Space Search Engines use the term vector technique which is really a matrix comparing documents(x) and word frequency(y). So we throw away all information word order.
Everey unique word in the document collection is pin pointed, then theres a term count. A vector is drawn by using x and y as coordinates. All the terms are plotted, and then a central point is found which describes the document. We can then trace a vector for the document.
We calculate the length of the line between the documents point amongst the words, and their origin (magnitude). The Pythagorean theorem can be used to calculate the magnitude). This is so that we can compare documents by calculating the cosign of the angle between them all. They're given a score representing how similar they are to each other.
To retrieve a similar document to the query, we place it in the vector space and see which vector is the closest to it. We calculate the cosign between the query vector and a document:
dot product = term counts/document x query term counts.
(dot product query vector & document vector / magnitude of query vector) x magnitude of document vector = cosign
then:
dot product / (document + query magnitudes) = cosine
The problem with vectors?
Everytime a new page is intriduced to the collection, the term vector naturally must all change.
The indexing is dependant on how everything is broken up, as this affect the space, and therefore all of our calculations.
Its really really slow because of all the methematical steps and stuff, real time indexing in dynamic collections is impossible, and its a very computationally expensive method requiring lots of RAM and processing power.
This brings you nicely into "latent semantic indexing, the vector model on steroids" but don't dwell too long there, contextual network graphs are better.
Did I do ok? Not a single number, see.
Robert_Charlton
02-19-2005, 04:20 PM
Did I do ok? Not a single number, see.
xan - Thank you. You did great... very helpful for someone who's forgotten lots of math... though when you got to the formulas some conventional math notation would have helped me, as I'm not sure what "dot product," eg, is.
The problem with vectors?
Everytime a new page is intriduced to the collection, the term vector naturally must all change.
I've seen this mentioned several times in several threads now, and I'm wondering whether this is necessarilly true. I'm not grasping the need for absolute freshness. Might it be possible to calculate reference vectors that could be recalculated periodically? Or perhaps I should say maintain reference datasets that would be updated periodically... less often than the general index?
For sets of words with large frequencies, words that are very common, I would think that things wouldn't change much over the reindexing interval... and for new subjects, I'd think that their relative novelty might be factored in to weigh more heavily than the vector data.
I've not in any way considered what kind of indexing mechanism is needed to use this info in calculations, though, so I could be way off base in my suppositions. This is a complete amateur's interpretation of your plain English explanation. You see what happens when you open the gates to the rest of us?
Hi robert,
I did write the meaning of dot product but its easy be very confused indeed:
dot product = term counts/documents x query term counts.
Yes, it is absoluntely necessary for the whole collection to change if new items are added. Imagine a cupboard that you are filling with cans of beans, everytime you push one in, the other readjust to make room, and so the shape of the group changes. There is no way this can't be true. Its cause and effect.
You have to put everything into context when thinking about fresh indexes Robert, remember google has indexed around 16% of the web, and that its the largest search index compiled at the moment. You know how often the updates are, it takes a long time to get round. Index freshness is a major priority. If an index isn't fresh, then what's the point, you'd get outdated information or information not as complete as it could be. Consider this: No index is fresh and the bigger the worse it gets! All the datasets make up the main database, its just an easier way to spead the load, but in the end everything has to budge. Site age has very little to do with it.
Remember vectors are not necessarily the solution for a large scale search engine. There are obvious drawbacks really.
Robert_Charlton
02-20-2005, 03:43 AM
xan - Thanks. I understand what you're saying, but to further expose my ignorance...
- I had assumed that the vectors might be used as just one weighting factor among many, perhaps analogous to PageRank (not of course doing the same thing), where, I'm assuming, PageRank for the entire index is not recomputed every time a new page is added to the index... but the last computed PageRank values of all other pages, and an estimated value of PageRank for new pages, together with other ranking factors, suffices until PageRank is recomputed.
Also, my question about "dot product" wasn't clear. I'm just not getting the language used to describe this particular computed value. In the equation that defines it, "term counts" is clear... maybe even the word "product" is clear in standard math terminology (as in, the product of a x b)... but "dot" throws me. I intuitively follow everything in your post except those three lines with "dot product" related formulae.
Hello again Robert!
Yes you are quite right, it is a bad idea to just used vectors for the task, especially for something as big as the web indexing, many other methods are used in conjunction to that but for the sake of what term vectors are, that's pretty much the idea.
Sorry, its my fault for not explaining dot product properly:
Its the multiplication of two vectors. Its also called a scalar. You might hear it refered to as an inner product as well, thats a generalisation of dot products.
If its not clear ask again...remember the saying, no stupid questions, we all had them at one point!
Robert_Charlton
02-20-2005, 07:08 PM
Yes you are quite right, it is a bad idea to just used vectors for the task, especially for something as big as the web indexing, many other methods are used in conjunction to that but for the sake of what term vectors are, that's pretty much the idea.
xan - To follow up on this thought, then...
- is a periodically computed vector database possible?
- might it in some way be useful to provide an additional relevancy factor for web indexing?
Yes you can use periodically updated vector databases. It is much preferable to update them constantly though to deal with the influx of data.
- might it in some way be useful to provide an additional relevancy factor for web indexing?
I'm not sure i understand, but the less you update the index the less it is fresh and representative of the space it tried to index.
orion
02-23-2005, 10:06 PM
Phew,
I got tired reading all of that. It all got a bit over-complicated there and I almost boiled over! I think other people might have been been blown into oblivion as well..xan, this is just your personal opinion and view, but I give you credit for providing SEWF readers and visitor with alternatives for grasping concepts.
As a former university academic, I have found different teachers have different ways of teaching or communicating information. Any style has its pro's and con's. Some styles even create more new questions rather than providing answers. We are all here trying to share value added information, without trying to confuse or deceive anyone with false representations.
Regards
Orion
Sure Orion,
I personally found it a bit tough and confusing for people who don't intend to go and revise for an exam, or build their own systems.
I still teach postgrads at universities, and giving a good introduction before running into the maths, and then the code, and then the alternatives and drawbacks is a common procedure. I felt there were concepts which you missed out (all due respect) which help things along. I just didn't feel these people needed that to do SEO work, but you most likely know best being in the SEO arena ;)
orion
02-24-2005, 01:28 PM
Indeed.
My goal is to make SEO folks more educated and the more IR friends make the crossover, the better. I have found all kind of good people in both the SEO and IR spectrum. Don't be surprise if you find SEOs with a lot of Math and Science background and others, just newbies trying to make it in both fields as well.
This is certainly a thread for advanced folks and as such I'm trying to address it.
Orion
orion
05-15-2005, 10:30 PM
For those interested in a comprehensive list of term vector schemes, I have a reference I feel is time to let you know about it.
Check the OAK RIDGE report NEW TERM WEIGHTING FORMULAS FOR THE VECTOR SPACE METHOD IN INFORMATION RETRIEVAL (http://csmr.ca.sandia.gov/~tgkolda/pubs/ornl-tm-13756.pdf)
Orion
Hey all,
I think that term vector approaches have shown to be hardly better than LSI in some instances, and that there are better methods to use. Maximal marginal relevance is a technique that has been used as well as probalistic methods (like Logistic Regression - no better than vectors), cluster hypothesis, and so on...
Term vector is more for visualization than term relevance scoring and terms anyway are not orthogonal dimensions. It has been seen to perform just as well as LSI. It can't deal with synonymy or polsemy, it's just a bag of words.
It is very shortsighted to assume that word occurrence statistics give a real level of relevance of a document with respect to a query, because none of the core document relevancy is taken into account, such as the textual properties for example. It assumes that the similarity of the textual contents of a document gives a correct indication of the similarity of the meaning, and concepts. Docuemtns with similar mening don't necessarily have the same content. Plus, there's no concept of normalizing the length of the documents, no consideration for the position of words in the text, and it assumes that a word which is returned in both documents at a high level is most important. It doesn't deal with anaphora either.
Using vector negation is useful for getting rid of unwanted terms (but unfortunately it often gets rid of synonyms), and using tf*idf with it has and can provide efficient results. In addition, it can be run totally unsupervised which is great and used with relevance feedback, the results can be improved (Salton).
see Improving Document Vectors Representation using Semantic Links and attributes (http://www.cse.iitb.ac.in/~pb/papers/doc-vec-sem-link.pdf)
The thing is that it has many limitations, and other methods do exist for this task. Because of this I doubt very much that the cream of the cream at Microsoft, Google et al. are useing this limited method.
You need to use multiple methods for this IMO. Remember the IR rule of "Garbage in, Garbage out?"
Mike you are quite right, Karen Spärck Jones (Cambridge uni) is a lovely lady as well. Most of her work is in NLP video processing based on speech and information processing (also wrote about tf*idf and worked with Salton). She is a pretty stern teacher though and you had better have a pretty interesting and decent question to ask or you won't get an answer!
This was said about her:
“It is remarkable, one might even say moving, that someone who co-authored a paper in one of the great founding collections of our discipline, the Proceedings of the 1958 International Conference on Scientific Information in Washington, DC, should still be an architect of information science more than 40 years later.� (ASIST)
orion
05-16-2005, 12:41 PM
Today I receive email from another student looking for information on term vector. I referenced them to this thread. In addition to my previous post, the follow references may help.
Document Ranking and the Vector-Space Model (http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf)
DIK L. LEE, Hong Kong University of Science and Technology
HUEI CHUANG, Information Dimensions
KENT SEAMONS, Transarc
In this paper the authors compare several term vector schemes, their pros and cons. Good coffee table lecture I referenced before.
Ian Ruthven (Glasgow) has written also a nice paper that surveys term vector schemes and relevance feedbackA survey on the use of relevance feedback for information access systems (http://www.cis.strath.ac.uk/~ir/papers/ker.pdf)
Hope this help.
Orion
orion
05-20-2005, 05:56 AM
For those of you interested, here is the MySQL.com explanation of their term vector implementation. This is the official site for MySQL developers. (Thanks, guys for quoting my work. :))
http://dev.mysql.com/doc/internals/en/full-text-search.html
To understand their modified formulas you may want to check some of the references I mentioned in my previous two posts. In a nutshell, the weight of a term is given by
w = (local information)*(global information)*(document normalization factor)
i.e., you need three type of weightings
See the keyword density (http://www.miislita.com/fractals/keyword-density-optimization.html) paper for a simple explanation.
MySQL rearranges terms and uses the variant (again, see previous posts)
w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)
Where:
dtf = the number of times the term appears in the document
sumdtf = sum of (log(dtf)+1)'s for all terms in the same document
U = number of Unique terms in the document
N = total number of documents
nf = number of documents that contain the term
The base part of the formula is the left of the formula, "(log(dtf)+1)/sumdtf". This provides local information.
The normalization factor is now the middle part of the formula. It accounts for the length of the document. If a document is shorter than average length then weight goes up, if it's average length then weight stays the same, if it's longer than average length then weight goes down. See Pivoted Document Length Normalization (http://ir.iit.edu/~dagr/cs529/files/handouts/singhal96pivoted.pdf) by Amit Singhal (now with Google) and Chris Buckley and Mandar Mitra ACM SIGIR'96, 21-29, 1996 for an explanation of the Pivot technique they use.
The global multiplier which provides global information is
log((N-nf)/nf)
and can be viewed as a probabilistic IDF (inverse document frequency term).
Equations of this type are often used with probabilistic approaches.
Finally, the Rank R reduces to
R = w * qf;
where:
qf = number of times the term appears in the query
So, if I suspect that a system might be MySQL-based, the first thing I would look for is for IR behaviors that conform to this implementation. Then I would conduct several tests to try to understand any diversion from it.
Cheers
Orion
orion
05-24-2005, 01:55 AM
Term Weigths in a MySQL Environments (http://www.miislita.com/term-vector/term-vector-5-mysql.html) has additional information on the MySQL implementation. In particular, it is explained why negative weights can be assigned to some terms in some collections. This may be of interest to search engine marketers.
Negative global weights can be explained in terms the log((N-nf)/nf) term also known as probabilistic IDF or IDFP. In IDFP the number of documents containing a term (nf) are substracted from the total number of documents in the collection (N); i.e., N - nf. Plotting IDFP vs nf reveals that
1 positive global weights are assigned when nf < N/2
2. zero global weights are assigned when nf = N/2
3. negative global weights are assigned when nf > N/2
Thus, when nf > N/2 negative weights are assigned to terms appearing in more than half of the documents in the collection.
Orion
csanfrancibs
06-22-2005, 07:07 PM
snip
Thus, when nf > N/2 negative weights are assigned to terms appearing in more than half of the documents in the collection.
Orion
Hi Orion,
Wondered in here. Great post, love to know how things work.
Would you explain more about a "collection." What is it? Maybe real example. ;)
Some rambling questions:
Is a collection all webpages found in a search engine for the words car insurance for example? So page with just car or just insurance could rank better than a page with car insurance?
Erm or is collection my webpages on my website? :confused:
I don't understand how U relates to document size?
Plug 1, 1000, 10000 into
U/(1+0.0115*U) and the score gets bigger. But that makes the first part of equation get smaller does it not? :cool:
What is global and local? Is local my page again? Is global every page in google?
added: Okay read this > "The local information comes from the current document, while the global information comes from the database."
Database = google? or my site?
The global multiplier equation. It is seems like this value would be the same for every document in say google for car insurance. Why multiply it to differentiate pages with car insurance if it is the same number :confused:
Maybe if i understood these:
N = total number of documents
nf = number of documents that contain the term
Total documents in my website or google?
Number of documents that contain the term in a query? In my website? In google?
Thanks for you patience.
Csanfrancibs
orion
06-22-2005, 08:53 PM
Hi, there.
Oh, no problem. Happy to help.
1. The above equation refers to MySQL implementation, not Google.
2. "collection", "database", "index" for those inside and outside IR means different things. With times it has been used indiscriminately to mean the total number of documents in a queried system, more or less. I know of colleagues that dislike this, but that's fine as is just a working definition.
(Nomenclature Alert: Watchout! some folks like to write IR articles using N for the length of a document or D to refer to the database size).
MySQL internal manual uses N to refer to the queried database size while in the term vector literature the letter D is commonly used. So if I have a search engine or IR system based on the MySQL internal manual described above,
N = total number of documents in a queried system
nf = number of documents in N that contain the term in question
If a search engine indexes my site or my little collection of documents, N no longer is the collection size (from the search engine side), since now my site is part of a larger corpus.
3. In MySQL, U is the length of a given document expressed in number of unique terms, excluding stopwords. An hypothetical scenario: if after removal of stopwords we end with "car" repeated 5 times, "insurance" repeated 4 times, and "quotes" repeated 1 time, U = 3, since these are the only unique terms
4. Global and local are relative expressions. When referred to term weights, global weights refer to the inverse frequency of documents in N (or D in tv theory) containing the term we are considering.
5. U/(1+0.0115*U) essentially is a normalized scale of the general form
y = x/(1+ x)
Such scales are used to restric values from 0 to 1 when x = 0 or x >>> 1 .
The term "local" refers to individual documents.
6. Systems that use IDFP penalize terms mentioned in more than 50% of the total number of documents in a queried system. This is not a penalty or extra filter but a mathematical, natural consequence of the IDFP expression;
log(N - nf)/nf), which change signs from positive to negative. This is described in the link given in post #46.
Some systems arbitrarily assign a value of zero weight to terms with negative weights to avoid unexpected complications during scoring and retrieval.
I hope this help.
Cheers
Orion
Marcia
07-09-2005, 04:51 AM
The "technical" stuff and maths is beyond me, but something is sinking in related to this, and I have an issue that's been plaguing me that I'd love to get some more insight about.
local and global informationI understand local to be referring to the document level - and global to the level of the entire database, or corpus of documents.
How about frequencies and occurrences at an intermediate "global" level, somewhere in between the global and document - like looking at the "site" or "host" in which the document resides as a corpus of documents, either as an entity or inclusive of the pages linked_to and linked_from?
Also - looking at IDF from the viewpoint of the perimeters of the site itself, as opposed to the entire database.
Make any sense what I'm asking for clarification about?
orion
07-09-2005, 03:15 PM
Hi, Marcia
What you asks does make sense.
Before answering, let me describe a scenario I faced few years ago.
But first,
Local weights = measured at the doc level, usually in terms of word occurrences; many variants of it (linear, log10, log2, normalized, etc). I know of some specialized IR applications that use noun group occurrences; others use phrase occurrences, even others look for specific NV, NN, VN, etc).
Global weights = measured using a corpus; many variants of it.
Normalization = many variants.
Thus, I prefer to use the term "subgraph" rather than "intermediate".
Now the scenario.
What you describe is an idea kind of described several years ago in Altavista's Term Vector Database (http://www9.org/w9cdrom/159/159.html): compute tvt's for a topic-specific subgraph. The approach supposedly improved/help with topic distillation.
"A simple mechanism for constructing this query-specific subgraph is to seed the subgraph with top-ranked pages from a standard search engine and then expand this set with other pages in vicinity of the seed set ([Kleinberg 98]). However, this expanded set sometimes includes highly connected pages that are not relevant to the query topic, reducing the precision of the final result. [Bharat et al 98b] identifies this problem as topic drift."
"[Bharat et al 98b] shows that topic drift can be avoided by using topic vectors to filter the expanded subgraph. A topic vector is a term vector computed from the text of pages in the seed set. A page is allowed to remain in the expanded graph only if its term vector is a good match to this topic vector. Specifically, the inner product of the two vectors is compared to a suitable relevance threshold, and pages below the threshold are expunged from the expanded graph."
Few years ago, a client wanted to do what you suggests; compute weights for his enterprise DB, only. This is what I recommended to him.
1. Get an open source architecture that computes tvt's. (currently you can use barebone Lucene or MySQL). Index all your pages. Consider this your litle subgraph, compute L, G, N, and rank accordingly.
2. Expand this subgraph by adding the top N documents from a search engine. Recompute everything. Keep testing and refining topics, document quality, etc.
The problem. Many top ranked documents (for example, Google) used for expanding the subgraph were the result of spam (link-spammers, link-bombs, blog spam, relevance recycling, spam ads, etc.) and the subgraph became contaminated. We then switched to expanding using ODP, but then we got other problem (editors and editorial issues).
He ended using the architecture for testing purposes.
What I'm trying to say from this story is this. We can indeed as you suggest, build a customized corpus for a host and consider IDF values in the vicinity of such architecture. Many enterprise systems can be used for this purpose. Whatever computed would be valid within the domains of that architecture rather than for a large search engine out there.
Hope this help.
Orion
azrina
07-28-2005, 04:35 AM
Hi all..
I am currently searching for free software for text preprocessing (TF/IDF weighting) and can be run on Windows environemnt. Anyone here knows the software..pls let me know.
Thanks in advance.
orion
07-28-2005, 10:15 AM
There are many components part of open source projects you can download. Just search for tv components for these
1. MySQL
2. Lucene
3. Minerva
Microsoft has also good information online as well.
You would need to have a database collection of documents to extract IDF values from.
Orion
orion
08-20-2005, 10:28 PM
Hi Orion,
Wondered in here. Great post, love to know how things work.
Here is a nice tutorial:
Cosine Similarity and Term Weight Tutorial (http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html).
This is a non technical discussion on cosine similarity, dot products and term weights.
Orion
orion
09-04-2005, 06:20 AM
Hello Theme Master and Rustybrick
Dr. Grossman mentioned to me the new book will be shipped to press by late August. He asked me if I could review a draft ..... As soon as it comes out I will let you know, most definitely.
Orion
Well, this was about a year ago (see post #20 at this thread). There were some delays (print date is December 2004). If you, guys are still interested, the 2nd edition is now available. It is shorter and more affordable than previous versions. Check in Amazon. I have written a long review of the book. (http://www.miislita.com/book-reviews/information-retrieval-grossman.html) as well as a shorter version, which is available through third parties.
Orion
orion
10-22-2005, 02:13 AM
For those interested, this SEWF thread discusses the Extended Boolean Model (http://forums.searchenginewatch.com/showthread.php?t=7635) and its connection with Term Vector Models and term weights.
Orion
orion
10-24-2005, 01:28 PM
I've updated the tutorial cited in post #53. The tutorial explains how to calculate dot products, vector magnitudes and cosine similarities. It explains the underlying IR significance of these concepts when applied to ranking and retrieval. Some basic exercises are included.
Essentially, this is what we do:
Compute
1. local and global weights. The product of this is the term weight of each term.
2. doc-query dot product.
3. associated vector magnitudes.
4. cosine similarities.
5. ranks by sorting similarity values in descending order.
Done!
The meaning of these steps is described in the tutorial.
For advanced doc normalization, you may need to use the pivot method.
For multiterm queries, you may need to use the Extended Boolean Model.
Orion
marcusaurelius
10-25-2005, 08:28 PM
I find this thread very interesting but I have a problem with some basic concepts right at the beginning. I'd be very grateful for any help.
>>In the classic model the weight w(i) of a term i is defined as
>>w(i) = tf(i)*IDF = tf(i)*log[D/df(i)]
>>where
>>tf(i) = term frequency, number of times a term i occurs in a document
>>IDF = Inverse document frequency = log[D/df(i)]
>>D = database size or number of documents available
>>df(i) = number of documents containing term i
If "tf(i) = term frequency, number of times a term i occurs in a document".......which document are we talking about?
Does w(i) vary from document to document so that w is a function of i (the term index) and j (the document index, say)?
Is the log to the base 10 or e (or something else)?
It's called the vector space model but since the axes are discrete isn't that what mathematicians call a "module" (rather than a vector space)?
marcus
orion
10-26-2005, 12:58 AM
Hi, there.
If "tf(i) = term frequency, number of times a term i occurs in a document".......which document are we talking about?
Any document j containing a term i that has been indexed by the system.
Does w(i) vary from document to document so that w is a function of i (the term index) and j (the document index, say)?
Yes.
Is the log to the base 10 or e (or something else)?
Most common is log base 10. Some binary models use log base 2.
It's called the vector space model but since the axes are discrete isn't that what mathematicians call a "module" (rather than a vector space)?
Correct. Traditionally "vector space" has been used in a lose sense. Until recently, a formal foundation using Hilbert spaces and hermitian operators was still lacking. I mention "recently" because back in 2004 the book: The Geometry of Information Retrieval finally addressed this.
I've written two reviews on this book. In it, the leading authority of IR, Keith van Rijsbergen presents (and I'm quoting from the book) a new theory for the foundations of IR. In particular a new theory of measurements. He shows how a document can be represented as a vector in Hilbert space, and the document's relevancy by an Hermitian operator. All the usual quantum-mechanical notions such as uncertainty, superposition and observable, have their IR-theoretic analogues.
More information is given in chapter 3 "Vector and Hilbert spaces".
Note: This is not a book for marketers. You need to have a quantum mechanics background and understand linear algebra. Still is a fascinating reading.
Orion