Keyword Density: Trick or Treat
orion
10-31-2005, 12:36 AM
Mike Grehan has written a nice article about online SEO myths, in particular about the nonsense of keyword density.
Online SEO Information: Trick or Treat? (http://www.clickz.com/experts/search/results/article.php/3559701)
Feel free to discuss it here.
Orion
Mikkel deMib Svendsen
10-31-2005, 03:57 AM
The major problem with Mike's writing for ClickZ is that it is written for people with very little understanding of SEO, and therefore he, in my mind, simplifies things too much - to a degree where it is just not correct anymore. Let's just look at a few of those things from this article ...
coding's just not that important, as long as a crawler can get to the pages and parse the text out of them.
Not true! Let me give a few examples:
- Text printed to screen using JavaScript document write
- Extremely complex table structures
- .NET view state codes (I've seen them run to over 20,000 characters at the top of all documents!)
- Syntax and document declaration errors
And that's just a few of the coding issues that very often result in reduced indexing and bad rankings
keyword Density
One thing everyone has to be aware of is that Mike is ONLY talking about English language websites. When it comes to websites in other languages, search engines work A LOT more simply. And even with English websites, simple keyword density analysis often makes a lot of sense and has proven to me over and over again to improve rankings. Just take a look at SERP scraper sites :)
OK, you may have valid scientific arguments why my methods should not work, but they DO in fact work. I DO constantly improve indexing and ranking for many clients and in-house projects. I may be totally wrong, theoretically, but what do I care when it works so well.
I think the conflict here is that you surely need a lot of knowledge to build a good search engine, and the tech that goes into the major ones today is much more complex than ever before. However, that doesn't mean you have to apply the same level of high-tech to understand enough about how that mechanism works to actually do what you want: Get more organic traffic. In fact, sometimes you can apply very simple techniques to gain much better results.
I realize that I am much more of a "brute force" type than Mike. But Mike, and others, have to understand that there are many roads to success. Every week I look at sites, adjust keyword density and gain better results. Keyword density is definitely not a magic bullet and there are a lot of other factors to look into, but to say that it doesn't count is just not in line with the real world as it works today. Especially not when it comes to websites in languages other than English :)
One thing I do agree with Mike about, though, is that there are many myths in SEO. Totally agree! SEO is still a rather complex task, but not nearly as complex as Mike apparently wants it to be :)
http://www.syllabus.biz/keyword-density.php
another article on this.
Hristo
10-31-2005, 12:41 PM
How does keyword density improve relevancy? It does not. Even Google's prototype paper shows this idea was discarded a long time ago. It is the frequency of keywords, not the density, that's counted.
Keyword density would imply that if the very same information is written by two people with different writing styles (one of them verbose, the other less descriptive), then one of them is more relevant. Of course not.
Mikkel deMib Svendsen
10-31-2005, 01:19 PM
I am not talking about relevancy. I am not running a search engine but a marketing firm.
I generally find it very funny when eggheads come around and tell me my methods don't work. Theoretically they might be right, but please look at real life, the results, and not least the bank accounts of all us brute force guys, and then tell me again it doesn't work that way :D
different styles of seo Mikkel deMib Svendsen.
If you have a system that works then stick to it.
Hristo
10-31-2005, 01:51 PM
"I generally find it very funny when eggheads come around and tell me my methods don't work."
Who told you that?
Isn't SEO supposed to be based on how search engines calculate relevancy? Isn't talking about how search engines work supposed to be based on ... well, how they work :)
I generally find it very funny when good SEOs come around and tell me they know how search engines rank pages :) They don't. I don't. Only the guys there know (still there's some stuff that we do know, and one of them is that keyword density is worthless for a search engine).
Mikkel deMib Svendsen
10-31-2005, 02:05 PM
I just stumbled on another funny quote from the article:
It's not possible to reverse-engineer anything, unless you know what all the components are.
Mike, I am pretty sure you know that's not a fact but rather a "wish" from your side. Just ask any competent hacker. If you are really nice I'll even show you a few hacks on systems I have no knowledge about. Reverse engineering is exactly the art of finding out how a system works when you don't know how it works to begin with. If you knew, there would be no reason to reverse engineer it :)
orion
10-31-2005, 03:40 PM
My position on Mike Grehan's article
Mike has a way of electrifying SEO communities. He does it again.
Like many of you, I cannot speak for Mike, but I, like many of you, can read what he actually did or didn't write. Whether he writes in a particular style is up to him, and it is up to each individual reader to assimilate it rather than to claim that he simplifies "to a degree where it is just not correct anymore".
What he didn't write
While I cannot speak for Mike, I can read what he did not write: he never said or even implied that he "is ONLY talking about english language websites." This is simply untrue on your part, sir.
However, if he had mentioned that "this article is for English-only websites", then readers interested in targeting only the English market would not need to worry about claims and stories of success in obscure markets from non-English regions of the world, unless they wanted to target those markets/regions. I would say the same about the Spanish audience in the US (Hispanics), the Spanish audience in Latin America (Latinos) or the Spanish audience in Spain, as each region/sector has its own cultural diversity, semantic issues and marketing needs to begin with.
What he did mention
Mike does mention he started in marketing and "crossed over" to science. I did the opposite, moved from science to marketing. We both reached a point in our paths "where marketing research meets science", without our perceptions being necessarily mirror images. That's for sure.
In that process, I've found all kinds of reactionaries, myths, lies and mean-spirited, or "mezquino" (http://yahooligans.yahoo.com/reference/dict_en_es/entry?lb=e&p=num%3As15044), attitudes on both sides of the fence: idiotic search engine spammers and idiotic IR scientists, both looking for an opportunity to undermine the other side. Who's telling the truth and who is an "agent of misinformation"? It depends on who is lying and why, and on who is listening and why.
Nobody is suggesting that there is only one path to search engine marketing success. Not even Mike suggested this.
Keyword density
Whether an SEO marketer chooses to spam a search engine or not, and why, or wants to learn about IR or not, and why, is everybody's best guessed "trick or treat". Meanwhile, an "appeal to pity" or an appeal to "cause-effect" observations is evidence of nothing.
Just because I change a key term here or there does not mean that any success comes from the change in the document's keyword density, as such changes may alter other factors such as immediate context and semantics, to mention a few.
Keyword density or term weight scores are not what make a term important. For a given term or terms, 100,000 documents can have identical keyword density, term weight and term vector scores, but the relative importance of the term(s) can be different in each case, as this is slaved to ordering, context, semantics and on-topic features of the document, among other things. If I add HTML DOM manipulations, which many SEOs know nothing about, then the actual significance of the term in the doc gets even more complex.
Indexing
Regarding some document features you mentioned, most of those play no role, sir, because during indexing these are removed in order to create an indexable representation of a document in a linearized format. This is an area many SEOs are still not familiar with. Once the linearized version is created, it is then used to compute several scores or for several IR tasks (term weighting, duplicate and near-duplicate detection, topic analysis, etc).
How a given engine actually conducts indexing and scoring is something else to learn about, since the overall descriptions and tasks are out there. Therefore, in my opinion, arguments about how removed code lines could affect a mythical kw density value are a bit pointless.
Orion
Mikkel deMib Svendsen
10-31-2005, 04:07 PM
Orion, I totally agree that there are many ways to get to the same results. In case I did not make that clear, science is definitely a valid way. It's just that often you can find much simpler models that really work, that are so much easier to work with, and not least that scale in a company faced with the pain of growing and adding new staff. Mike simplifies things as well as I do (and probably anyone) - we just don't do it the same way :)
From my days at school I recall the history of how we humans have perceived the solar system over time. Now, I am certainly NO expert in this, but my understanding is that the old Greeks actually knew exactly how to calculate the position of most stars, but the way they did it was just so very, very complicated, having Earth in the middle. Once the model was changed to have the Sun in the middle, it all became so much easier for normal people to understand and to calculate with. Did the solar system change? No, only how we "analyze" it. The results are the same. Better rankings. Sometimes well-educated people like you, Orion, make me feel like I am talking to the old Greeks. Now please don't feel offended by this, you have my deepest respect :)
The reason I mentioned Mike talking about English websites is that it is a very common problem that everyone just naturally defaults to English when talking about search engines and SEO. Most of the stuff you read that is definitely 100% US/English targeted never states that. I know, from a scientific point of view they should, but they don't.
Dealing with SEO in Danish or Swedish is VERY different from English. A lot of the basic linguistics are simply not in place for our languages, and once you move into more advanced stuff the engines seem not even to have begun the work. A lot of things work very differently in Danish. I am not a linguistic expert by any means, but just one very basic example is how and when you divide words and how "-" is interpreted. In Danish we usually always put words together - such as "car insurance", which in Danish is "bilforsikring". In fact, we can make any new combinations as we go (so it's virtually impossible to have a "full" Danish dictionary). Sometimes you find people writing it as "bil-forsikring", which I believe is also OK. You sometimes see people do the same in English, "car-insurance" (I know this is probably not the best example, but it works for now and my English skills are too limited to find anything better). In English it will make sense to switch the "-" with a space " ". In Danish that would be wrong - you would have to remove the "-" and put the two words together.
I agree that not everything Mike talks about in the article has anything to do with language - such as coding issues. But because he, and others, usually never point out when they are talking about English-only issues and when not, I always find it important to make that point.
ADDED:
I forgot to mention another very important difference between doing SEO in English and doing it in small languages like Danish and Swedish: The fact that we can get away with just about anything! Engines simply do not ban many sites outside the major languages. We are left alone to fight ..., and so we do :)
orion
10-31-2005, 04:54 PM
I accept your clarification.
In Danish we usually always put words together - such as "car insurance", which in Danish is "bilforsikring". In fact, we can make any new combinations as we go (so it's virtually impossible to have a "full" Danish dictionary). Sometimes you find people writing it as "bil-forsikring", which I believe is also OK. You sometimes see people do the same in English, "car-insurance" (I know this is probably not the best example, but it works for now and my English skills are too limited to find anything better). In English it will make sense to switch the "-" with a space " ". In Danish that would be wrong - you would have to remove the "-" and put the two words together.
Here is a common ground for us to discuss.
In Google, hyphenated queries work as an EXACT query mode, producing results with a lot less noise than in the default FINDALL mode
So, in general
A: k1-k2 === "k1 + k2"
B: k1-k2 + k3 === "k1 + k2" + k3
invokes an EXACT query mode. In case B the hyphen acts as a localized EXACT mode within a FINDALL mode.
The fact that your native language uses/appends hyphens/hyphenated queries may work in your favor when "targeting" Google or an English-oriented search engine for success. Sure, there are many reasons for your success, not mere hyphens or juxtaposition of terms (hyphen removal and appending). How to define "success" and "targeting" is up to you.
One thing I do in order to find out if my doc ranks high for a target sequence is to check that query in EXACT mode, then optimize accordingly. If I don't rank high in this mode, often I don't rank high in FINDALL for the same sequence, because of the noise involved in FINDALL modes. Sure, exceptions are possible. If I want to know the "contamination" or signal-to-noise ratio associated with that sequence when someone searches in FINDALL, I compute an EF-Ratio.
If I compute this for an IR system that consistently defaults to hyphens, the EF-Ratio may not be a good signal-to-noise discriminator and I use other occurrence methods.
If I want to conduct seasonal trends, I would compute those co-occurrence indices at time intervals.
None of the above are silver bullets. I don't expect all here to follow my way of doing things, nor will I call others "egghead" in public because they don't.
They are what they are.
Orion
neatorama
11-02-2005, 04:11 AM
So, keyword density is dead - long live keyword density?
Is that what I'm getting here, people? But surely keyword density affects relevancy somehow (okay, so if not relevancy then SE rankings).
Why? Because if you have a 0 kw density, then you get 0 relevancy. Having close to 100% kw density, then it's kw stuffing or spam and you'll get slapped. Surely there's a happy medium?
Hristo
11-02-2005, 05:55 AM
It is not keyword density. It is the frequency of keywords, type of matches (phrase match, close match ...) and order of keywords.
In relation to density:
- if you put more keywords you increase both keyword frequency and density (you increase relevancy).
- if you delete some text to increase the density you get no additional relevancy because the frequency of keywords stays the same
Frequency matters, not density.
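The distinction above can be sketched in a few lines of Python. This is only an illustration: the sample texts and helper names are invented, and density is taken here as keyword occurrences divided by total words, which is one common definition and an assumption on my part.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def frequency(text, keyword):
    """Number of times the keyword occurs in the text."""
    return tokens(text).count(keyword.lower())

def density(text, keyword):
    """Keyword occurrences divided by total word count (assumed definition)."""
    words = tokens(text)
    return words.count(keyword.lower()) / len(words) if words else 0.0

# Hypothetical sample documents.
original = "Fresh pizza every day. Our pizza oven is wood fired and open late."
trimmed = "Fresh pizza every day. Our pizza oven."   # filler text deleted
expanded = original + " Order pizza online."          # one more keyword added

for label, doc in [("original", original), ("trimmed", trimmed), ("expanded", expanded)]:
    print(label, frequency(doc, "pizza"), round(density(doc, "pizza"), 3))
```

Deleting filler raises the density while the frequency stays the same; adding a keyword raises both.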
neatorama
11-02-2005, 12:06 PM
So this should be pretty easy to test, right?
Has someone actually done a statistical test on this? Something along the lines of two 100-word articles with 5% kw density, where the only difference is that in one the keywords are bunched up (more frequent) and in the other they are more spread out throughout the article.
orion
11-02-2005, 12:42 PM
Long tested by many.
At this SEOChat thread
More on Term Weight - Method for Calculating the importance of a term in a document (http://forums.seochat.com/showthread.php?t=46505&page=15)
many have similar reasonings.
To put my last post at that SEOChat thread in perspective you would need to read the entire thread, which I always recommend to new readers when they jump into an old topic under discussion.
In that thread, I stopped by to explain why term frequency (tf), keyword density (kw) and term weight (w) scores are not measures of term importance or relevancy, for that matter.
Several examples were given. Here is one. Assume the following passages are "documents", all of identical length (L), where
hot, tf=2
oven, tf=1
pizza, tf=2
After considering tf and global inverse doc frequency weights (IDF), it can be proved that these terms all have identical tf, kw and w in all the documents, but their relative term importance and relevancy is different in each case:
Passage A: This pizza is hot. From the oven........hot pizza.......
Passage B: This is Pizza Hot. From the Pizza Hot oven..............
Passage C: This oven is hot. Hot.....from........pizza........pizza.
Passage D: Hot, hot from the oven... This is pizza....pizza.........
Passage E: This is pizza. Pizza, ..... The hot oven...from...hot.
Passage F: This is the hot, hot oven....pizza....pizza...From...
Passage G: This is pizza. Hot pizza from....the hot ....oven........
This illustrates that the importance of a term and its relevancy is slaved to semantics, contextuality, on-topic, what the user's search for and what the reader is perceiving as useful. Sure other elements matter (document linearization, on-page factors, ordering, etc).
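As a quick check of the counts in the example, here is a minimal Python sketch. It only counts raw term frequencies and the order in which the terms first appear; the runs of dots are treated as padding, and no idf or term weight is computed. The helper names are made up for the illustration.

```python
import re
from collections import Counter

PASSAGES = {
    "A": "This pizza is hot. From the oven........hot pizza.......",
    "B": "This is Pizza Hot. From the Pizza Hot oven..............",
    "C": "This oven is hot. Hot.....from........pizza........pizza.",
    "D": "Hot, hot from the oven... This is pizza....pizza.........",
    "E": "This is pizza. Pizza, ..... The hot oven...from...hot.",
    "F": "This is the hot, hot oven....pizza....pizza...From...",
    "G": "This is pizza. Hot pizza from....the hot ....oven........",
}
TERMS = ("hot", "oven", "pizza")

def words(text):
    """Lowercase alphabetic tokens; dots and other delimiters are dropped."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(text):
    """Raw tf of each target term."""
    counts = Counter(words(text))
    return {term: counts[term] for term in TERMS}

def first_appearance_order(text):
    """Target terms sorted by where they first occur in the passage."""
    sequence = words(text)
    return sorted(TERMS, key=sequence.index)

for name, text in PASSAGES.items():
    print(name, term_frequencies(text), first_appearance_order(text))
```

Every passage yields the same tf values, yet the ordering column differs from passage to passage, which is the kind of information a frequency count throws away.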
If I add HTML DOM manipulation strategies [excellent presentation topic for a 2006 SES and SEOs to learn about] and even the query mode utilized, then the relative importance of a term in a document, as its relevancy, cannot be assessed with mere term frequency, keyword density or term weight scores.
All this is explained in minute details at that SEOChat thread, but we can go over each point again, if some of you want to.
Orion
Hristo
11-02-2005, 01:24 PM
orion,
when I said that it is frequency that matters, I didn't mean term frequency of each keyword or term weights, but the matched keyphrase frequencies (an extract from Google's Original Paper explains best):
For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value "bins" ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can all be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.
...
The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
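The single-word scoring described in that extract can be sketched roughly as follows. This is not Google's code: the hit types, the weight values and the cap are all hypothetical (the paper publishes no numbers), and a hard cap is only a crude stand-in for "increase linearly at first but quickly taper off". The PageRank combination step is left out.

```python
# Hypothetical hit types and weights; the paper gives no actual values.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}
COUNT_CAP = 5  # assumed point beyond which extra hits stop helping

def count_weight(count, cap=COUNT_CAP):
    """Linear in the hit count up to the cap, flat afterwards (assumed shape)."""
    return float(min(count, cap))

def ir_score(hit_counts):
    """Dot product of the count-weight vector with the type-weight vector."""
    return sum(
        count_weight(hit_counts.get(hit_type, 0)) * weight
        for hit_type, weight in TYPE_WEIGHTS.items()
    )

print(ir_score({"title": 1, "plain": 2}))   # one title hit, two body hits
print(ir_score({"plain": 5}))
print(ir_score({"plain": 500}))             # same score as five body hits
```

Under these assumptions, repeating a word in body text past the cap adds nothing, while a single hit of a heavier type still moves the score.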
I also think term weights are not used by Google. Example: it is very hard to rank for a multi-keyword phrase that uses a popular sub-phrase. Like: if you want to rank well for [search engine optimization add_another_keyword_here], you may find that even though very few other docs have the add_another_keyword_here, it is not given extra weight. It is difficult to rank it without a lot of link popularity/old site etc., while on Yahoo and MSN it is much easier to rank such longer keyphrases without too much SEO.
orion
11-02-2005, 02:00 PM
orion,
when I said that it is frequency that matters, I didn't mean term frequency of each keyword or term weights, but the matched keyphrase frequencies (an extract from Google's Original Paper explains best):
I also think term weights are not used by Google.. Example: it is very hard to rank multi keyword phrase that uses a popular sub-phrase. Like: if you want to rank well for [search engine optimization add_another_keyword_here], you may find that even though very few other docs have the add_another_keyword_here, it is not given extra weight. It is difficult to rank it without a lot of link popularity/old site etc. while on Yahoo, MSN it is much easier to rank such longer keyphrases without too much SEO.
I understand what you are trying to say.
The example I provided is search engine-independent. Regarding Google's paper, I'm not sure it is wise to keep quoting Google's old papers to assert current relevancy issues.
Regarding phrase frequency, this is not a measure of relevancy either. Again, relevancy defaults to more complex things.
Regarding term weights, these must be computed by Google and any search engine in order to score dot products and term vectors (even Google says they use term vectors, hence term weights, in that old paper --oops, I just quoted the damn paper) as well as cosine similarity, and then to use that cosine similarity score as a measure of document-query similarity, often called "relevancy". This type of relevancy is different from the user's perception of relevancy and from what makes words achieve term importance.
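For readers who have not seen the classic model, here is a minimal Python sketch of term weights, dot products and cosine similarity. The weighting scheme (raw tf times log(N/df)) is just one textbook choice, and the tiny corpus and function names are invented; no engine is claimed to score this way.

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus):
    """Term weights: local tf times global idf = log(N / df) (one classic choice)."""
    n_docs = len(corpus)
    vector = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        if df:  # terms unseen in the corpus get no weight
            vector[term] = count * math.log(n_docs / df)
    return vector

def cosine(u, v):
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical four-document corpus.
corpus = [
    "hot pizza from the oven".split(),
    "cold pizza for breakfast".split(),
    "the oven is hot".split(),
    "search engine ranking".split(),
]
query_vector = tf_idf_vector("hot pizza".split(), corpus)
scores = [cosine(query_vector, tf_idf_vector(doc, corpus)) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(round(score, 3), " ".join(doc))
```

The score is a document-query similarity only: it says nothing about word order, context or what a reader would judge useful.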
Again, term importance and relevancy cannot be assessed properly with term/phrase frequency, term vector, or keyword density for that matter. You need to know too many semantic relationships and term ontology embedded in a document and across a corpus.
Many years ago (late 90's), Prof. Bruce Croft's group published the famous Local Context Analysis, in which document concept frequencies (noun phrases and their frequency) were combined with term weights to compute term vector and doc-query similarity scores. The original goal was to improve query expansion, but many later realized that it could be used for optimization as a noun phrase optimization technique. The problem: document segmentation, term/phrase disambiguation and topic transitions did make a difference.
In addition, what is perceived as an exact sequence of terms by users is not necessarily what is perceived as an exact sequence of terms by a search engine. To illustrate, a search in EXACT mode for "pizza hot" or "hot pizza" can return documents relevant to that sequence, even when the phrase is not present in the docs.
Doc cases are given in the above example. You just need to read carefully between delimiters and stopwords to realize why this is. So, how one counts literal "phrases" can be risky business when we discuss relevancy.
Orion
Hristo
11-03-2005, 03:51 AM
"Regarding Google's paper, I'm not sure it is wise to keep quoting Google's old papers to assert current relevancy issues."
Most of the stuff you are writing about is way older than that paper and has been developed with no internet in mind.
"Regarding term weights, these must be computed by Google and any search engine in order to score dot products and term vectors (even Google says they use term vectors, hence term weights, in that old paper --oops, I just quoted the damn paper) as well as cosine similarity, and then to use that cosine similarity score as a measure of document-query similarity, often called "relevancy". This type of relevancy is different from the user's perception of relevancy and from what makes words achieve term importance."
Correct me if I am wrong, but that paper does not talk about term weights and uses the term [vectors] in a totally different meaning than what's implied in classic IR theory. Also the paper explicitly says that the standard vector space model does not work well on the web.
orion
11-03-2005, 04:36 AM
Most of the stuff you are writing about is way older than that paper and has been developed with no internet in mind.
Correct me if I am wrong, but that paper does not talk about term weights and uses the term [vectors] in a totally different meaning than what's implied in classic IR theory. Also the paper explicitly says that the standard vector space model does not work well on the web.
I myself even say that the vector model does not work on the web, and why. Check http://www.miislita.com/term-vector/term-vector-3.html Even Google's implementation of term vectors for detecting duplicated content fails to filter dupes. They even mention that in one of their patents, which I reviewed at SES San Jose.
Term vector theory is old. Of course.
Search engines, including Google, use term vector theory. There are many definitions for term weights that are then used for computing term vectors. Once local weights and global weights are defined, then dot products and cosine similarity values are scored. The latter do not change, since dot products and cosine similarity are math definitions.
The "standard" term vector is also known as the "classic" term vector model. There are also the Generalized Vector Space Model, Fox's Extended Boolean Model, the probabilistic model and Robertson's BM25 probabilistic model. There are even many more, recently published. All this is well known, and many of these are discussed at this more than a year old thread:
Term Vector Theory and Keyword Weights (http://forums.searchenginewatch.com/showthread.php?t=489)
That thread mentions how many search engines, including Google, have combined term weights and term vector models from IR into their algorithms. Here is a quote:
"Other search engines, like Google and others use a combination of link metrics and term vector weights. The trend in the last years has been to integrate link metrics with term vector schemes (with several flavors and variations) Check http://www.e-marketing-news.co.uk/topic_distillation.pdf "
On Google and Term Vector Theory.
Actually, since its inception in the scene, Google has been using the term vector theory. Check http://www.webpronews.com/ebusiness/seo/wpn-4-20010905GoogleInterviewbyFredrickMarckini.html
In that WebProNews article, Fredrick Marckini, from IProspect, interviews Google's Craig Silverstein. Craig states,
"The Term Vector Theory
Google's algorithm incorporates the ideas and understanding behind the term vector theory. While the elements of the term vector theory can be quite complex, Craig offered a rather basic definition of how the theory originated. A premise of the term vector theory "says the documents are good if they contain the words in your query and they contain them a lot," explained Silverstein. As search has matured and grown more complex, Google has adapted their algorithm to complement these changes and to account for those who try to cheat and trick the search engines. While the algorithm has adjusted with the times, in essence it still embraces the beliefs behind the term vector theory.
Scoring = PageRank + Term Vector
The term vector factors of the Google ranking algorithm, which will be covered below, concentrate on how relevant a page is to a user's search. This score, combined with the PageRank score that measures the popularity of the page, is how Google derives an overall score or ranking of a Web page. Thus, the Web pages that receive high scores are, in Google's opinion, the Web pages that best meet the user's individual needs."
End of the quote.
Going back to your original thesis of phrase frequency as a measure of relevancy, that seems an oversimplification of the problem of assessing relevancy.
Orion
Hristo
11-03-2005, 08:44 AM
orion, most of the links in your post are badly entered. Pls, correct them, I want to read them.
"Even Google's implementation of term vector for detecting duplicated content fails to filter dupes. They even mention that in one of their patent I reviewed at San Jose, SES."
Can you elaborate? Is it on the patent dealing with duplicate docs or query-specific dup docs?
"Going back to your original thesis of phrase frequency as a measure of relevancy, that seems an oversimplification of the problem of assessing relevancy."
Yes, you are right. I just wanted to make a point that on-page relevancy is based on frequencies, not density (which is a myth). Or in other words, use more keywords without worrying about density ;)
If you have some good relevant links on IR, papers etc. please post them (esp. stuff coming from guys/gals working at search engines).
neatorama
11-03-2005, 12:15 PM
Thanks Orion & Hristo -
I'll have to think about the 2-word frequency vs. juxtaposition idea. A lot of what you said definitely made sense. But what about 1-kw relevancy, where there's no issue of phrasing or juxtaposition, and where frequency = density?
orion
11-03-2005, 12:26 PM
orion, most of the links in your post are badly entered. Pls, correct them, I want to read them.
Can you elaborate? Is it on the patent dealing with duplicate docs or query-specific dup docs?
Yes, you are right. I just wanted to make a point that on-page relevancy is based on frequencies, not density (which is a myth). Or in other words, use more keywords without worrying about density ;)
If you have some good relevant links on IR, papers etc. please post them (esp. stuff coming from guys/gals working at search engines).
Fixed them (line breaks). Sorry about that.
At SES, San Jose, I presented before the industry a review on Google and AltaVista's patents on duplicated content in documents (many SEW members and moderators, SEOs and all major search engines were present: Google, Yahoo, MSN, ASK, etc.)
It was discussed that Google suggests in one of the patents calculating a local term vector for snippets and the associated cosine similarity scores. The idea was to use this as an alternative way to detect dupes. The problem: content with identical terms, on-page frequency, weights, etc. but in a different order could pass for dupes, when in fact they may be about a different topic. This is an inherent problem of term vector models (among many others).
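The order problem is easy to reproduce with a toy example. The sketch below uses plain word counts as the vector and two invented snippets; it is not the method from the patent, only an illustration of why any bag-of-words vector is blind to word order.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Term-count vector; word order is discarded by construction."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical snippets: same terms and frequencies, different meaning.
snippet_1 = "dog bites man"
snippet_2 = "man bites dog"

similarity = cosine(bag_of_words(snippet_1), bag_of_words(snippet_2))
print(round(similarity, 6))
```

The two snippets are different strings, yet their vectors are identical, so a cosine-based duplicate check would treat them as the same content.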
Regarding on-page relevancy, there are many things affecting relevancy, not just on-page word/phrase frequencies. To say that relevancy can be mapped to mere page frequencies is a naive notion. Many things affect relevancy. There is on-topic analysis and theming, the terms data structure, scope & ontology used with broader and narrower terms, HTML DOM tree manipulations (a great topic for SEOs and spammers alike), the query mode itself, etc.
For instance, like Google, many search engines ignore delimiters combined with a space, such as periods, commas, pipes, etc., and stopwords such as "is", "a", "of", etc. This means that if you do a search in EXACT mode, this will return documents not necessarily containing the intended phrase, as in
....k1. k2.....
....k1, k2.....
....k1 | k2....
The phrase for k1 followed by k2 is not there but the engine will count that as a match.
The example with the pizza and hot query illustrates this. The following passages, taken as "docs", will match the query, even though A and G do not contain the intended phrase (emphasis added to illustrate the point).
Passage A: This pizza is hot. From the oven........hot pizza.......
Passage B: This is Pizza Hot. From the Pizza Hot oven..............
Passage G: This is pizza. Hot pizza from....the hot ....oven........
BTW: This also illustrates that EXACT searches are not necessarily searches for phrases. Docs with queried terms delimited by separators can be returned as a match for the "phrase".
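Here is a rough Python sketch of that behavior, using the three passages above and the query "pizza hot". It is a guess at the general mechanism, not any engine's actual tokenizer: the stopword list is just the three words named earlier, and runs of dots are kept as a gap marker because in the example they stand for omitted text.

```python
import re

STOPWORDS = {"is", "a", "of"}  # only the examples named above; real lists are longer

def linearize(text):
    """Lowercase the text, drop delimiters and stopwords, keep dot runs as gaps."""
    stream = []
    for token in re.findall(r"\.{2,}|[a-z0-9]+", text.lower()):
        if token.startswith("."):
            stream.append("<gap>")      # placeholder for omitted text
        elif token not in STOPWORDS:
            stream.append(token)
    return stream

def exact_match(query, text):
    """True if the query terms are adjacent in the linearized token stream."""
    q, doc = linearize(query), linearize(text)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

def literal_match(query, text):
    """True only if the query string appears verbatim in the raw text."""
    return query.lower() in text.lower()

passages = {
    "A": "This pizza is hot. From the oven........hot pizza.......",
    "B": "This is Pizza Hot. From the Pizza Hot oven..............",
    "G": "This is pizza. Hot pizza from....the hot ....oven........",
}
for name, text in passages.items():
    print(name, exact_match("pizza hot", text), literal_match("pizza hot", text))
```

In A the stopword "is" drops out, and in G the period between the sentences drops out, so both match in the linearized stream even though only B contains the literal phrase.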
Regarding papers from search engines, if you do a search at this forum or better at the Relevancy & Technology Section of this forum, there are many excellent threads discussing them. Here are a few:
Improving PageRank: The Papers (http://forums.searchenginewatch.com/showthread.php?t=4140)
The WLR Algorithm (http://forums.searchenginewatch.com/showthread.php?t=2704)
Block Analysis 101 (http://forums.searchenginewatch.com/showthread.php?t=2119)
and many more... You just need to do the homework.
Orion
orion
11-03-2005, 03:47 PM
Thanks Orion & Hristo -
I'll have to think about the 2-word frequency vs. juxtaposition idea. A lot of what you said definitely made sense. But what about 1-kw relevancy, where there's no issue of phrasing or juxtaposition, and where frequency = density?
Hi, neatorama.
That also falls short. If two docs, A and B, are of the same length and repeat a term X the same number of times, they will have the same frequency and the same keyword density. Again, that tells us nothing.
If instances of X occur at the beginning of A and at the end of B, that would impact their relevancy scores.
If instances of X occur inside a link in A but in mere text in B, that would impact their relevancy scores, too.
If instances of X occur inside a title or h1 in A but somewhere else in B, that would impact their relevancy scores as well.
Many other scenarios/reasonings are possible.
The point is that neither frequency and keyword density nor term weights/term vectors are good discriminators of relevancy. The former are local features that ignore global features of the corpus; the latter assume that terms are independent of one another (term independence) and do not score semantics.
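To put numbers on the A/B case, here is a small sketch with two made-up ten-word docs. The function names are hypothetical, and keyword density is taken as occurrences divided by total words, which is an assumption about the definition in use:

```python
def keyword_density(term, words):
    """Occurrences of term divided by total word count."""
    return words.count(term) / len(words)

def positions(term, words):
    """Zero-based positions of term in the word list."""
    return [i for i, w in enumerate(words) if w == term]

# Hypothetical docs of equal length repeating term "x" three times.
doc_a = "x x x filler filler filler filler filler filler filler".split()
doc_b = "filler filler filler filler filler filler filler x x x".split()

for name, doc in (("A", doc_a), ("B", doc_b)):
    print(name, "density:", keyword_density("x", doc), "positions:", positions("x", doc))
```

The density figure is identical for both docs, while the positions, which a scoring function may well care about, are completely different.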
Orion
bsaric
11-03-2005, 04:04 PM
It's really hard to believe that such powerful SEs use keyword density, such a simple formula, to measure such important data from a site.
orion
11-03-2005, 04:20 PM
Another problem with relying on mere on-page frequency (of terms and phrases) is that, in some term vector implementations, the sign of a term's weight can flip from positive to negative no matter how high that frequency is.
This is purely a mathematical phenomenon and not a filtering mechanism, a search engine penalty or anything connected with semantics or topicality (or whether or not the term in question is a traditional stopword).
The phenomenon is observed in term vector probabilistic models that define IDF as
IDF=log((N - n)/n)
where N is the number of docs in the collection and n is the number of docs containing the term to be weighted.
I discussed this phenomenon at
Implementation and Application of Term Weights in a MySQL Environment (http://www.miislita.com/term-vector/term-vector-5-mysql.html)
The MySQL implementation of term vector uses this IDF definition, but you can alter it.
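The phenomenon occurs when n > N/2, that is, when the term appears in more than 50% of the docs in the collection. The ratio (N - n)/n then drops below 1 and its log becomes negative. It is a global effect which switches the weight of a term in a document to a negative value.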
Thus, a term that occurs in more than 50% of a collection gets a negative weight, regardless of its on-page frequency, either as a single term or as part of a phrase. The same applies to entire phrases.
In this case, regardless of how many times this "flipper" occurs in a document, it will get a negative value, which in turn affects the retrieval and ranking of the documents. Because of this some developers arbitrarily ignore negative terms in order to artificially bias the retrieval process. This also creates other unwanted complications, depending on the nature of the database or its "slots".
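A quick numeric check of the sign flip, as a sketch only. It assumes the term weight is tf * IDF with the IDF definition given above; the collection size and counts are invented for illustration, and the natural log is used (the base does not change the sign):

```python
import math

def idf(N, n):
    """Probabilistic IDF as defined above: log((N - n) / n)."""
    return math.log((N - n) / n)

def weight(tf, N, n):
    """Assumed term weight: on-page frequency times IDF."""
    return tf * idf(N, n)

N = 1000  # hypothetical collection size
for n in (100, 400, 500, 600, 900):
    print(f"n={n:4d}  IDF={idf(N, n):+.3f}  weight(tf=50)={weight(50, N, n):+.3f}")
```

Once n passes N/2 the IDF goes negative, so a higher on-page frequency only makes the weight more negative.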
This is only one example to show you, guys, that relevancy scores are much more complex to assess than the on-page frequency and keyword density fallacies and myths would suggest.
Orion
Hristo
11-04-2005, 10:55 AM
orion, thanks for the links.
Do you think Google gives more weight to keywords appearing at the top of a document, compared to keywords at the bottom (everything else being equal like font-size)? I doubt it because
1. Google does not read external css files
2. You can present text in one order in the HTML but it may appear differently on-screen (example: a text that's at the top of the source of an html file, can be rendered at the bottom of the actual page).
orion
11-04-2005, 01:01 PM
orion, thanks for the links.
Do you think Google gives more weight to keywords appearing at the top of a document, compared to keywords at the bottom (everything else being equal like font-size)? I doubt it because
1. Google does not read external css files
2. You can present text in one order in the HTML but it may appear differently on-screen (example: a text that's at the top of the source of an html file, can be rendered at the bottom of the actual page).
I think I did not express that well. At SES San Jose I mentioned this:
Documents need to be prepared for indexing by the IR system or engine. A representation of the document is indexed. This is accomplished by reducing the document to a linearized text stream by removing all markup tags, scripts and CSS instructions; then tokenization, stopword filtration and stemming take place. The stream of surviving terms/stems is indexed.
This means that any repositioning due to CSS is removed and ignored. The relative position of terms as viewed by users and rendered by a browser is not the actual position in the linearized text stream. I have a comprehensive tutorial of this at Document Indexing Tutorial (http://www.miislita.com/information-retrieval-tutorial/indexing.html), part of the Information Retrieval Tutorials Series.
Then the linearized version undergoes other analysis: dupe analysis, term weight scoring, etc.
Terms at the beginning of the linearized version of the doc are then more relevant than those at the end. This is what I was trying to refer to. Sorry I was not that clear in my use of the term "doc".
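The preparation steps described above (markup removal, tokenization, stopword filtration, stemming) can be sketched roughly as follows. This is only an illustration using Python's standard html.parser; the stopword list, the crude suffix-stripping "stemmer" and the sample page are all hypothetical stand-ins, not the software or the stemmer mentioned in this thread:

```python
import re
from html.parser import HTMLParser

STOPWORDS = {"is", "a", "of", "the", "and", "to"}  # hypothetical toy list

class Linearizer(HTMLParser):
    """Collects text in source order, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def linearize(html):
    """Markup removal -> tokenization -> stopword filtration -> stemming."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

page = """<html><head><style>h1 {color: red}</style>
<script>document.write("hidden");</script></head>
<body><h1>Hot Pizzas</h1><p>The oven is baking pizzas.</p></body></html>"""

stream = linearize(page)
print(stream)
```

What gets indexed in this sketch is the surviving stream of stems in source order; the styling, the script and the tag structure are gone.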
orion
11-04-2005, 01:23 PM
One interesting thing about understanding the indexing process (and the associated doc linearization process) is that it exposes as an artifact the dubious SEO practice of visually scanning a doc and then trying to compute localized term/phrase frequencies or keyword density values in, say, titles, headers, or tables.
I have developed software that does linearization of docs on the fly. It incorporates a bilingual stemmer (English and Spanish). It is not for sale. I use it to conduct GAP analyses and optimization of docs.
After markup removal and linearization, the relative position of terms in the text stream can be different from that visually perceived by users. The phenomenon worsens when nested tags are used (nested divs, tables, layers, etc).
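A small illustration of that positional shift, assuming a made-up two-column table and a bare-bones text extractor that simply reads character data in source order (a sketch, not any engine's actual parser):

```python
from html.parser import HTMLParser

class TextStream(HTMLParser):
    """Collects words in the order they appear in the HTML source."""
    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.split())

# Hypothetical layout: a reader scanning down the columns sees
# "hot pizza" on the left and "cheap flights" on the right.
table = (
    "<table>"
    "<tr><td>hot</td><td>cheap</td></tr>"
    "<tr><td>pizza</td><td>flights</td></tr>"
    "</table>"
)

parser = TextStream()
parser.feed(table)
parser.close()
stream = " ".join(parser.words)
print(stream)
```

The source is read row by row, so the terms from the two visual columns end up interleaved and neither "hot pizza" nor "cheap flights" survives as adjacent terms in the stream.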
One more thing. The arbitrary repetition of terms or phrases (on-page frequency) combined with negative weighting can produce the phenomenon of keyword weight fights, where keywords neutralize each other. This effect can worsen when nested tags are used and when terms with negative weights are present in a database (vector cancellation can take place).
After linearization, if the resulting discourse of the text stream is fully incoherent or has no clear topic transitions, it can be rendered useless or of no semantic value.
Orion
bsaric
11-04-2005, 04:04 PM
So that's why pages with clear code (example: div after div) are ranked higher, because they don't lose sense and order after linearization, filtering, tokenization, weighting and all the rest?
orion
11-04-2005, 05:36 PM
There are many reasons why docs can rank high or low. It is not just about HTML coding.
The point I'm trying to get across is that arbitrarily nesting tags, especially with complex tables, can be detrimental.
When div tags need to be nested, this must be done wisely (e.g., with some browsers there are some positional issues which can be resolved with a blank div tag as a container for another div tag).
When tags with content need to be nested for a reason, the coder must ensure that this will not affect the text stream. Such cases call for very careful design and coding.
My experience in the industry, from interacting with leading SEOs and SEMs, tells me that most of them have not been exposed to doc linearization, HTML DOM analysis or indexing processes for that matter.
This is why, at this and other forums and at conferences, I suggest avoiding or minimizing the use of nested tags, ignoring keyword density/frequency/term weight fallacies in relation to relevancy and semantics, and starting to learn a bit about IR processes.
Certainly, some are free to stick to what works well for them and their clients and not to follow this advice.
Orion