Do keywords in URL influence your rank? Research by Web CEO Team.
Web CEO Team
11-03-2004, 05:36 AM
Moderator note:
Moved to Search Technology and Relevancy area from Beta Test area
__________________________________________________ _______________
Hello guys!
My name is Serge Bond and I am a search engine analyst working for www.webceo.com. I decided to find out the real importance of keyword presence in URL. There are so many threads on this topic with lots of opinions and no research behind them. That's why I performed the following research:
I made a list consisting of both very competitive and uncompetitive keyphrases. The level of competition was estimated by number of results (in Google and Yahoo) and by the number of daily searches (taken from WebCEO).
Overall, I took 24 keyphrases covering almost 30,000 URLs. Here’s the list:
http://www.webceo.com/images/files/table.gif
For each keyword I extracted the URLs of the top 700 pages. This was done, again, with the help of Web CEO. Even though our tool lets you extract up to 1000 URLs, Google rarely returns more than 700 (usually fewer). You can see the number of URLs for each keyword in the table above.
After this, I created a spreadsheet recording how many words from the keyphrase appear in each URL. For example, let’s consider the keyphrase “search engine optimization”. In my spreadsheet, the URL search-engine-bla.com gets a “2” (because the URL contains “search” (1) and “engine” (2)). Accordingly, search-engine-optimization.com gets a “3”. For example:
http://www.webceo.com/images/files/table1.gif
You can see all other tables at my blog http://seometry.blogspot.com
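A minimal sketch of the counting step described above, in Python. The post does not say how words were matched, so the case-insensitive substring matching and the function name here are assumptions for illustration only.

```python
import re


def count_keyphrase_words_in_url(url: str, keyphrase: str) -> int:
    """Count how many distinct words of the keyphrase occur in the URL.

    Assumption: a word counts if it appears anywhere in the URL as a
    case-insensitive substring. The original research may have matched
    differently.
    """
    url_lower = url.lower()
    words = set(re.findall(r"[a-z0-9]+", keyphrase.lower()))
    return sum(1 for word in words if word in url_lower)


# The two examples from the post
print(count_keyphrase_words_in_url("search-engine-bla.com", "search engine optimization"))  # 2
print(count_keyphrase_words_in_url("search-engine-optimization.com", "search engine optimization"))  # 3
```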
Then I made charts for each keyword. The horizontal axis shows the site rank. The vertical axis shows the number of words from the keyphrase in the URL at each rank. Here you can see the most typical patterns:
Competitive keyphrases:
http://www.webceo.com/images/files/compforweb.jpg
For competitive keywords we can see the following picture:
In Yahoo, pages with many keywords in the URL are concentrated at the top of the list.
In Google the picture is almost the opposite: we see more pages with keywords in the URL at the bottom of the list.
Uncompetitive keyphrases:
http://www.webceo.com/images/files/uncompforweb.jpg
For uncompetitive keywords the situation is different:
In Yahoo the general picture remains the same.
In Google the distribution becomes similar to Yahoo: pages with many keywords in URL are situated at the top of the list.
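The patterns read off the charts could also be put into numbers. A minimal sketch in Python, assuming the per-rank keyword counts are available as a list: it averages the counts within rank buckets so the top and bottom of a result list can be compared. The bucket size, the function name and the sample values are made up for illustration and are not data from the research.

```python
from statistics import mean


def mean_count_by_rank_bucket(counts_by_rank, bucket_size=100):
    """Average keyword-in-URL counts within consecutive rank buckets.

    counts_by_rank[i] is the keyword count for the page at rank i + 1.
    Returns a list of (first_rank, last_rank, mean_count) tuples.
    """
    buckets = []
    for start in range(0, len(counts_by_rank), bucket_size):
        chunk = counts_by_rank[start:start + bucket_size]
        buckets.append((start + 1, start + len(chunk), mean(chunk)))
    return buckets


# Made-up illustrative counts, not data from the research
example = [3, 2, 2, 1, 0, 0]
for first, last, avg in mean_count_by_rank_bucket(example, bucket_size=3):
    print(f"ranks {first}-{last}: mean keywords in URL = {avg:.2f}")
```

A higher mean in the first buckets than in the last would correspond to the Yahoo pattern described above, and the reverse to the Google pattern for competitive keyphrases. It would still say nothing about cause.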
My conclusions:
For Yahoo presence of keywords in URL is a considerable ranking factor.
In case of competitive keywords, Google takes presence of keywords in URL as overoptimization and penalizes a page.
In case of uncompetitive keywords, Google uses this factor in its ranking process.
creativecraig
11-03-2004, 06:22 AM
Looking through your competitive keyphrases, they don't seem to be that competitive. It's very difficult to say a phrase is competitive just by looking at how many results are returned!
I would say one of the most competitive industries to optimise in would be the financial sector. Because the returns are so high for good placement, everyone wants a good ranking - yet I don't see you mentioning that in your research.
Web CEO Team
11-03-2004, 06:32 AM
Hi Craig,
Thank you for your advice. Please tell me several highly competitive finance keyphrases and we'll perform this research on them. It will take two hours or so, then we'll immediately post it in this thread.
Agree with craig regarding the 'high' competition phrases. The only really competitive phrase I see in there is search engine optimization.
Try a search on G for the second one in your list in quotes. I really wouldn't call that competitive.
As far as the conclusions go, agree with number one, however:
In case of competitive keywords, Google takes presence of keywords in URL as overoptimization and penalizes a page.
hmmm. Try some searches on popular 1 word pharmacy terms. Keyword in URL all over the first page, I can't see any penalties there.
It's good to be posting the results of research for everyone's benefit, but imho both the method and conclusions are a little shaky.
Web CEO Team
11-03-2004, 06:47 AM
hmmm. Try some searches on popular 1 word pharmacy terms. Keyword in URL all over the first page, I can't see any penalties there.
Tell me several and we'll do this. It would be better if you tell me the keywords or keyphrases (I don't want to be biased in any way)
imho both the method and conclusions are a little shaky.
Yes, I agree with you. They are a little shaky. If you have a better method or better conclusions (no sarcasm here, I sincerely would be very glad to see any new idea), please share them with me.
Best,
Serge
creativecraig
11-03-2004, 06:52 AM
Mortgage, remortgage, loans, credit card... the list is endless - I think that most people would agree that these are competitive.
If you have a better method or better conclusions (no sarcasm here, I sincerely would be very glad to see any new idea), please share them with me.
Well, it would be better to be using some more competitive keyphrases for your high competition terms. People will rarely post their pet terms on a public forum though :-)
Other things need to be taken into account to judge how competitive a phrase is:
how many links the top sites have
how many of those are internal
how many are from distinct domains
how sharp their on page optimisation is
how effective their internal linking is
how much 'aggressive' seo is on the first page of results and how well it's done
etc
I'd also like to see a bigger sample size.
Of course all that is difficult to automate, but would produce more meaningful results I think.
I'm not the most diligent bloke around when it comes to research and testing, so you might want some advice from people who spend more time on that than me, but the above would help I think if you could find a way to factor it in.
Anthony Parsons
11-03-2004, 07:55 AM
My conclusions:
For Yahoo presence of keywords in URL is a considerable ranking factor.
In case of competitive keywords, Google takes presence of keywords in URL as overoptimization and penalizes a page.
In case of uncompetitive keywords, Google uses this factor in its ranking process.
I don't believe these results are actually very factual at all. Google actually does not factor bias against a particular keyword, ie. keywords in URL's help non-competitive but not competitive. I think you need to get something more factual before presenting results here.
projectphp
11-03-2004, 08:49 AM
Problems with your methodology:
1. There are no labels on either the X or Y axis. I assume the X axis is the rank, and Y is number of keywords in the URL, but would have been nice to have been told.
2. Small sample size. Is this enough results to utter the dreaded "therefore"? I don't think so.
3. Other factors ill considered. PageRank, anchor text, on page KWD etc are not mentioned at all. These results could all be caused by something else, e.g. Yahoo likes on page text, Google prefers off page factors. keyword-hyphen-domains tend to be well SEO-ed on page, but short on natural links. This would make such sites more likely to rank well in Yahoo than Google.
4. No isolation of factors. This was not a test run on 4 domains and a made up word in which you controlled all variables, including links in and out and all anchor text. Therefore (that dreaded word) can you really make any positive statement that positively excludes other factors?
Not a bad start, but unfortunately the evidence is far from conclusive, the mathematics and assumptions made are not spelled out clearly and concisely, and there is far too much room for doubt. Without control and reporting accuracy, there is very little one can do.
I applaud your attempt at making SEO accountable, but think that perhaps there are better, more conclusive tests, that could have been performed. Next time, perhaps start by posting your methodology, then we can critique that and save you a bunch of time compiling results. If that started out right, you wouldn't have spent time heading in the wrong direction.
Web CEO Team
11-03-2004, 09:48 AM
2CreativeCraig:
Mortgage, remortgage, loans, credit card... the list is endless - I think that most people would agree that these are competitive.
OK, I'll check these keywords and post results here. Tomorrow. But, I should note, that checking keyphrases is more important, because optimizing for just one keyword is not a good idea considering fierce SEO competition.
It's very difficult to say a phrase is competitive just by looking at how many results are returned!
I also took into account the number of daily searches. Next time, I'll show these parameters in the table.
2PPG:
Other things need to be taken into account to judge how competitive a phrase is:
...
We perfectly know which factors influence a page's position in SERPs. In our post we mentioned that this factor is not decisive. We were simply interested in finding out how strongly the presence of keywords in the URL can influence a page's position in the SERPs. The charts show that pages with keywords in URLs are placed higher in Yahoo than in Google. The next step will be studying other factors. This way, we'll be able to determine which factors are decisive and which are minor.
I'd also like to see a bigger sample size.
Do you mean more keywords or more pages by each keyword?
2Anthony Parsons:
Google actually does not factor bias against a particular keyword, ie. keywords in URL's help non-competitive but not competitive. Absolute rubbish.
Anthony, do you have any factual proof that Google does not factor any bias against particular keywords? We suppose that Google's ranking algorithm consists of several levels. It is widely known that PageRank is the main ranking factor. For pages that have close PageRank values, on-the-page factors come into play, including keyword presence in URLs. There is no bias against some special keywords, however non-competitive ones are weaker optimized, they do not have lots of high PR pages linking to them, that's why off-the-page factors can be better seen.
2projectphp:
1. There are no labels on either the X or Y axis. I assume the X axis is the rank, and Y is number of keywords in the URL, but would have been nice to have been told.
Fixed
2. Small sample size. Is this enough results to utter the dreaded "therefore"? I don't think so.
What do you mean? More keywords or more URLs for each keyword?
3. Other factors ill considered. PageRank, anchor text, on page KWD etc are not mentioned at all. These results could all be caused by something else, e.g. Yahoo likes on page text, Google prefers off page factors. keyword-hyphen-domains tend to be well SEO-ed on page, but short on natural links. This would make such sites more likely to rank well in Yahoo than Google.
These results are caused by "something else". However different SEs react to the researched factor differently. This can be seen in the pictures.
4. No isolation of factors. This was not a test run on 4 domains and a made up word in which you controlled all variables, including links in and out and all anchor text. Therefore (that dreaded word) can you really make any positive statement that positively excludes other factors?
I agree I can't make any 100% true conclusions. What about the experiment you are talking about, I'll do this. What methodology of testing would you suggest?
Let's do this right here in this thread. Everything: from methodology to math to assumptions and results. Your help would be much appreciated.
Marcia
11-03-2004, 10:16 AM
Hi WCT, and welcome to SEW forums.
One thing I'd love to see some stats on is the effect of keyword repetitions in URLs. Just to illustrate:
www.food-site.com/healthy-food/healthy-breakfast-food.html
See? How are multiple occurrences of the same word in the URL doing in the SERPs, and is there any type of effect with any of the major engines, positive or negative?
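A minimal sketch in Python of how repetitions like these could be counted before comparing them against rankings. Splitting the URL into tokens on every non-alphanumeric character is an assumption, as is the function name.

```python
import re
from collections import Counter


def repeated_url_words(url: str) -> dict:
    """Return the words that occur more than once in a URL, with their counts.

    Assumption: words are the runs of letters and digits between
    separators such as dots, slashes and hyphens.
    """
    words = re.findall(r"[a-z0-9]+", url.lower())
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > 1}


print(repeated_url_words("www.food-site.com/healthy-food/healthy-breakfast-food.html"))
# {'food': 3, 'healthy': 2}
```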
Anthony Parsons
11-03-2004, 10:21 AM
Anthony, do you have any factual proof that Google does not factor any bias against particular keywords?
I have lots of facts about lots of things, being the founder of SEO Testing and all. I have tested the use of keywords in domain names, folders and filenames, and am currently doing some more testing for archive purposes. Many of the people commenting here have also done their own tests, and have not found what you are describing.
We suppose that Google's ranking algorithm consists of several levels. It is widely known that PageRank is the main ranking factor.
PageRank itself has very little to do with ranking a website actually. PageRank is a unique algorithm that is combined with the main algos to rank pages. PageRank is "one" of hundreds / thousands of techniques scrutinized to rank a page.
For pages that have close PageRank values, on-the-page factors come into play, including keyword presence in URLs. There is no bias against some special keywords, however non-competitive ones are weaker optimized, they do not have lots of high PR pages linking to them, that's why off-the-page factors can be better seen.
Having high PR pages pointing to a keyword has nothing to do with their competitiveness either. Competitiveness is a measure of how many people are attempting to capture that term, nothing more, nothing less. The impact from people linking to pages attempting to capture that term has no real relevance for the test you are trying to perform.
I would take projectphp's advice: post what you are looking at testing first, then let people give you some advice on how to structure your test so you get it right the first time. Keyword presence, hyphenated URLs, etc, have nothing to do with PageRank, so you can keep that out of your test to begin with. As projectphp said, good start, but not quite up to an accurate standard. As soon as you have multiple links pointing to a simple test like this, the results diminish rapidly because of all the factors within the link itself. You need to test in a completely controlled and untainted environment.
We perfectly know which factors influence a page's position in SERPs.
I was talking about factors relating to how competitive a keyphrase is, not what factors make a page rank well. I thought that was pretty clear.
It is widely known that PageRank is the main ranking factor. For pages that have close PageRank values, on-the-page factors come into play, including keyword presence in URLs.
Really? I want to come and optimise in your world :-)
PR is important yes, but no more so than other factors. Number, quality and theme of links will get you further than high PR. If what you say is true then results would be ordered first by PR then within that by on page factors. Is this really what you see in the serps?
I also took into attention the number of daily searches.
How does this relate to how competitive a term is?
There is no bias against some special keywords,
In case of competitive keywords, Google takes presence of keywords in URL as overoptimization and penalizes a page.
To be honest, it looks to me like you've started with the assumption that keywords in the URL do influence ranking and then set out to prove it.
Hello Serge,
I think your test is very good for a start and no similar or better tests are publicly available.
PageRank itself has very little to do with ranking a website actually. PageRank is a unique algorithm that is combined with the main algos to rank pages. PageRank is "one" of hundreds / thousands of techniques scrutinized to rank a page.
Look at:
http://www.prsearch.net/index.php?Query=loans&BSearch=Search
If PageRank makes so little difference, how come all listings have a PageRank higher than 5? And how come the only PR 8 site on that page is the first?
projectphp
11-03-2004, 07:15 PM
What about the experiment you are talking about, I'll do this. What methodology of testing would you suggest?
Ok. To know that this one issue causes the problem, this one factor needs to be the only variable. So, you need several different domains all competing for a common nonsensical word, and then to see which site "wins".
If Google has 100 factors it considers, then a result like you displayed may be caused by any of the other 99 factors. How can we know for sure? It is for this reason that any sort of conclusion is iffy at best.
Some examples of unanswered questions:
1. Did where the keywords appear have an effect? Is keywords in the domain different to in the rest of the URL?
2. Was the biggest influence in the results something specific, i.e. PageRank on Google.
3. Do pages on keyword URLs have commonality, i.e. are they well SEOed on page? What does this indicate? Sometimes we test one thing and learn something else (that is how fingerprints being unique were discovered).
4. Are the results on Google slanted towards Brand name non-keyword domains?
5. Have keywords in the URL been isolated enough to have a conclusion?
IMHO, SEO testing often can't prove much of anything.
Anthony Parsons
11-03-2004, 07:25 PM
IMHO, SEO testing often can't prove much of anything.
And this is very true, from my experience. The only thing that can be found is whether it influences a ranking or not. How much is always unknown, and I know this from running SEO Testing. If you're testing keywords within a domain, folder or filename, the first thing is to ensure those words do not appear in the title or page itself, so that no influence comes from that end. No anchors, no nothing in regard to the terms used in what you are testing. That is the only way in which you can achieve a close result, though still not 100% conclusive.
Another thing I have found, is that as soon as you combine factors, then everything can go out the door as one affects another, positive and negative.
There is nothing wrong with your test idea Serge, just that it's not nearly as conclusive as it could be. I think you know that now from the feedback already.
Marcia
11-04-2004, 01:33 AM
There are those who believe that the value of keywords in URLs, at least in the domain itself, lies only in the fact that there will be links containing the keywords in anchor text.
That created a bit of a stir, when some concluded that there were filters being applied for repetitions of anchor text that exceeded an acceptable threshold, and that it was possibly put into place for things such as Googlebombing.
Another rumor had it that another engine was seriously frowning on repetitions of the keywords in the filepath to given pages.
I'm not overly "scientific" - I pretty much believe what I see consistently enough - so I'm wondering how things like that can be tested, for example, whether it's the keywords in the URL or anchor text actually giving a boost.
Also wondering how it could be isolated that in fact multiple repetitions in the filepath were causing a problem, or whether a drop in rankings could rather be due to other factors.
Chris_D
11-04-2004, 02:45 AM
I decided to find out the real importance of keyword presence in URL
The only way to do that - imho - is to compare two otherwise identical pages - with and without keyword presence in the url.
the sites to test would require identical inbound anchor text links, identical content. Identical. Just isolate the ONE issue you are trying to test - and test it.
Call me simple - but isn't that the definitive test?
Have a look at these SERPS - there's more ways to skin a cat than putting 'skin-a-cat' in the URL:
http://www.google.com/search?hl=en&q=miserable+failure
http://www.google.com/search?q=Weapons+of+mass+destruction
http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=flash+games
:)
Anthony Parsons
11-04-2004, 03:58 AM
Exactly Chris. The results are not conclusive due to the nature of the test. Test keywords in domain. Easy.
Make page, put keywords in domain, put keyword in page or title once.
Make identical page, change domain and record ranking.
Same pages, different domains. Measured at unique times obviously. Which one went higher? As I know very well, this is not conclusive because you have rankings all round the page changing.
You could isolate it fairly well by choosing a good keyword that is not so susceptible to other page effects, i.e. something that is very unique, but has at least one or two other pages competing against it.
If you have the two domains indexed, though, the results will take effect on a near weekly basis in Google, so you could get a good judgement of any effect by changing them around and measuring each time.
Marcia, I am not hyping PageRank or PageRank bashing; it is more that the original test goes on about how PageRank is used in the test to measure keywords. What does PageRank have to do with measuring keywords in a URL? This is from the original post.
dannysullivan
11-04-2004, 08:10 AM
I've made a number of edits and deletions to put this thread back on track. I'll ask everyone to please stay on the topic of this thread -- look at the data presented and offer criticisms and comments on it, as was requested. That's useful. Going off on tangents about personal motives and some of the other stuff I've had to delete isn't useful.
By the way, here are some past threads where we've discussed the issue of keywords in URLs influencing rank:
Hyphenated URLs (http://forums.searchenginewatch.com/forum/showthread.php?t=87)
SEO and file names (http://forums.searchenginewatch.com/forum/showthread.php?t=257)
Feel free to cross-link to other similar discussions you may have seen. I'm keeping a list of these, as it's a popular topic.
Change To Link Bomb Sign Of New Link Analysis Shift? (http://forums.searchenginewatch.com/showthread.php?t=700), July 2004
repeating terms in URL (http://forums.searchenginewatch.com/showthread.php?t=1505), Sept. 2004
URL Spaces & Alt Tag Naming Conventions (http://forums.searchenginewatch.com/showthread.php?t=1734), Sept. 2004
cline
11-04-2004, 12:12 PM
In case of competitive keywords, Google takes presence of keywords in URL as overoptimization and penalizes a page.
An alternative interpretation is that the weighting of the presence of keywords in the URL becomes diluted by other optimization factors.
nickleus
11-04-2004, 05:00 PM
The only way to do that - imho - is to compare two otherwise identical pages - with and without keyword presence in the url.
the sites to test would require identical inbound anchor text links, identical content. Identical. Just isolate the ONE issue you are trying to test - and test it.
Call me simple - but isn't that the definitive test?
Chris you are RIGHT ON! Now we just gotta find that perfect environment :) Any tips?
Chris you are RIGHT ON! Now we just gotta find that perfect environment :) Any tips?
Here's an easy way. Link out to 2 pages from a high PR page (so that Google would pick it up soon). Say x.htm and y.htm.
Don't give any title or other stuff to these pages.
Only put 5 links on them with a link text like "1", "2", and so on.
On one of the 2 pages place the link urls in reverse order (to neutralize possible ranking benefits for a better link prominence).
Each of the 5 linked pages would include ten random and different words (each page should differ though from the other at least slightly) with some additions.
2 of the pages could be:
dfgrwc-1.htm and wouldn't include "dfgrwc" in the <body>.
dfgrwc-2.htm would include "dfgrwc" at the end of the <body> text.
The other 3 filenames shouldn't include dfgrwc.
Page number 3 would include "dfgrwc" at the end of the <body> text.
Page number 4 would include "dfgrwc" both at the beginning and at the end of the <body> text.
Page number 5 would include "dfgrwc" both at the beginning and at the end of the <body> text and somewhere between.
After the pages are picked, you would know whether a SE returns a page for a keyword if the keyword is only present in the filename; or the <body> text should also include the keyword but the filename would boost ranking; or neither.
If the filename is considered at a SE as a ranking factor, you could compare the importance of a keyword in the filename compared with a keyword in the body text.
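A minimal sketch in Python of generating the five test pages described above. The filenames page-3.htm, page-4.htm and page-5.htm are hypothetical (the post only says they shouldn't include "dfgrwc"), and so are the function names, the seed and the length of the filler words. The two linking pages (x.htm and y.htm) are left out.

```python
import random

KEYWORD = "dfgrwc"  # the nonsense test word from the post


def filler_words(rng, n=10, length=7):
    """Return n random nonsense words (7 letters, so never equal to KEYWORD)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return ["".join(rng.choice(letters) for _ in range(length)) for _ in range(n)]


def build_pages(seed=0):
    """Build the five test pages as a dict of filename -> HTML (no <title>)."""
    rng = random.Random(seed)
    # filename -> where the keyword goes in the body text
    plan = {
        "dfgrwc-1.htm": [],                         # keyword in filename only
        "dfgrwc-2.htm": ["end"],                    # filename + end of body
        "page-3.htm": ["end"],                      # hypothetical name; end of body
        "page-4.htm": ["start", "end"],             # hypothetical name
        "page-5.htm": ["start", "middle", "end"],   # hypothetical name
    }
    pages = {}
    for filename, positions in plan.items():
        words = filler_words(rng)  # ten different random words per page
        if "middle" in positions:
            words.insert(len(words) // 2, KEYWORD)
        if "start" in positions:
            words.insert(0, KEYWORD)
        if "end" in positions:
            words.append(KEYWORD)
        pages[filename] = "<html><body>\n" + " ".join(words) + "\n</body></html>"
    return pages


for name, html in build_pages().items():
    print(name, html.split().count(KEYWORD))
```

The loop at the end prints each filename with the number of times the keyword occurs in its body text, which makes it easy to check the pages against the plan before uploading them.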
Anthony Parsons
11-04-2004, 07:40 PM
Each of the 5 linked pages would include ten random and different words (each page should differ though from the other at least slightly) with some additions.
Getting close now IMO. The only problem with placing random words (different page copy) is semantics. As soon as you change the words upon the page, semantics come into play, and unless you know the algorithm, you will taint the results solely through changing page copy. How the engine reads the words and connects each will provide a different ranking straight up. You would have to test each URL one at a time: measure then change, measure then change, re-measure as applicable to confirm. Slow, but the only way to ensure the results are untainted by differences.
Marcia
11-05-2004, 01:44 PM
The only problem with placing random words (different page copy) is semantics. As soon as you change the words upon the page, semantics come into play
Anthony, how are you figuring semantics is entering into it for keywords on the page? Do you mean relationships to other words on the page?
I've come across cloaked pages (that ended up uncloaked and showing at the engine) and pages delivered to the engine using a bait-and-switch with meta refresh and/or JS, and it was keywords and phrases stuck into pages full of sheer gibberish. They were obviously ranking due to off-page factors, keyword density and number of occurrences - plus where on the page the keywords appeared. And you could almost tell they had been "generated" using some sort of formula rather than being hand crafted.
I've also seen legit pages move up or down in rankings at Google from changing nothing more than on-page text with number and placement of occurrences, but it had nothing to do with them being or not being in the URL, which is really what we're looking at here.
Back when Yahoo showed the Directory listings as default and with a ranking order, it was definitely felt that keywords in the URL - in that case the domain name itself - played a significant role. So with or without testing, if something is used or has been used, it's entirely probable that it'll be used again by someone at some time.
IMHO the main thing with keywords in the URL, which from what I've seen can give a boost, is the danger of hitting a filter for excessive repetition, or maybe too many hyphens. I think it's safe to assume that anything that's subject to abuse or over-use by optimizers can eventually come under scrutiny.
From all the whining I've seen over the years, it seems the most useful type of testing is that which tells what runs into filters or penalties. That's where the really aggressive marketers have the edge over others. After some people have enough sites penalized or banned they get some very valuable insights into how the algos are working, that the rest of us don't. It may or may not be scientific, but it's sure enough testing the weapons out in the trenches.
Added:
Serge
Tell me several and we'll do this. It would be better if you tell me the keywords or keyphrases (I don't want to be biased in any way)
One word pharmacy:
phentermine
viagra
cialis
Finance type words and phrases:
debt consolidation
student loan consolidation
mortgage loans
mortgage refinancing
homebased business
home-based business
loans
pdstein
11-05-2004, 04:01 PM
Ok. To know that this one issue causes the problem, this one factor needs to be the only variable. So, you need several different domains all competing for a common nonsensical word, and then to see which site "wins".
If Google has 100 factors it considers, then a result like you displayed may be caused by any of the other 99 factors. How can we know for sure? It is for this reason that any sort of conclusion is iffy at best.
I appreciate the research done, but I have to agree with this statement.
Just because sites with keywords in their URLs tend to be towards the top of the rankings doesn't mean that ranking is caused by the keywords in the URL. My guess is that a webmaster who thought enough to include keywords in the URL is probably more likely to have optimized that site. So is the better ranking because of the optimization or the URL? No way to know with this methodology.
As others have suggested the only way to test this is to have identical pages with differing URLs and see how they get ranked.
Anthony Parsons
11-05-2004, 09:20 PM
What I mean Marcia is that by changing the words on a page, you’re actually allowing the algorithm to assess different criteria apart from just the domain name. For example, if you place the following words upon a page:
"seo copywriting literacy dictionary spelling grammar"
then place the following upon a second page
"seo copywriting chocolate gifts golf sunshine automobile"
you will get different results from the sheer use of semantic connectivity between the words. If the domain name is the same, and you change the page only once ranked and recorded, you will find different results, hence this cannot occur when testing URL variances. Check both pages for SEO copywriting, and I bet the one that semantically connects ranks higher (being the first). This means you cannot use different words upon a page when testing URLs. The URL is the test, not the page copy, so everything must remain the same, identical in every facet, with only the URL changing between tests.
This change is what has thrown out the original test IMO. Everything must be identical except the item you’re testing. The inbound links must be controlled, hence you cannot post the URL for others to look or someone will link to the page and taint the results. This is what I am getting at. This is also why when the word "PageRank" was mentioned in the first test, I commented that PageRank has nothing to do with the test, hence it shouldn't even be mentioned. The inbound links to a test, IMO, need to all come from the same page and use nothing but the URL as the anchor, not 1, 2, 3 or any such thing, just the URL name only.
I believe with these certain shortfalls fixed, the original test may then run a little more accurately. Whilst SEO Testing is never 100% accurate, facets can be measured to assimilate a change in ranking under a controlled environment, as you're well aware. This test can be proven a lot more accurately. The use of non-competitive phrases in the domain is probably the first instance.
From results I have obtained before, it then differs again whether the keyword is used within the URL, as a folder, or as a filename. You will find this when testing yourself. You will also find that you don't need hyphens or underscores between words, as the SEs will still read them from within the URL structure.
It would be very hard to measure a competitive term within a URL for testing purposes due to the number of factors required to get a competitive term ranking. I believe they only need to know whether or not keywords in a URL are assessed and used for ranking purposes, ie. some shift in ranking is recorded through keywords in the URL. If it is assessed, then it is assessed, if not, then no. As per the original conclusion in regard to differentiating between non-competitive and competitive terms in the URL affecting rankings, and my original post that no bias is shown, it is simply a yes or no answer.
With all these factors moved aside and controlled a little better, pages that do not change, one inbound link controlled and only the URL that will change for testing purposes, then the test will show accurate results IMO, from previous testing I have performed.
Marcia
11-05-2004, 09:32 PM
Thanks for clarifying the point, Anthony - great post there!
I haven't said exactly how you would draw conclusions from the test, and some might have missed some details.
Basically you would search for "dfgrwc" at a SE (see my other post).
After the pages are indexed you would have at least 4 of them returned in the SERP in some kind of order. The pages you created would be the only results for a "word" like "dfgrwc" (not hard to guess why).
Now based on the order of the pages you would draw the conclusions I've been talking about.
Even if Google uses semantic algorithms other than the well known word stemming (highly unlikely), I really don't see Google connecting "dfgrwc" with any word you know in English or any other random or meaningless words.
The inbound links to a test, IMO, need to all come from the same page and use nothing but the URL as the anchor, not 1, 2, 3 or any such thing, just the URL name only.
I was talking about link text not anchor. :rolleyes:
Anthony Parsons
11-05-2004, 11:05 PM
I was talking about link text not anchor. :rolleyes:
Link text is anchor text weby, ie. This Is Anchor Text also Known as Link Text (http://forums.searchenginewatch.com)
Anthony Parsons
11-05-2004, 11:09 PM
I could be missing something here weby, so correct me if I'm wrong. But if you're talking about having four pages with the term "dfgrwc" in them, you won't be seeing those four pages rank in the engines due to duplicate content, ie. the content has to be the same for the test to return accurate results, regardless of what content you use. Three of the pages will be filtered and only the one the engines deem most significant will be shown.
I could be missing something here weby, so correct me if I'm wrong. But if you're talking about having four pages with the term "dfgrwc" in them, you won't be seeing those four pages rank in the engines due to duplicate content, ie. the content has to be the same for the test to return accurate results, regardless of what content you use.
My opinion is that the content doesn't have to be exactly the same, since the "filler" words would not influence rankings for say "dfgrwc".
Though I think the test would also work as you say, with exactly the same words on the pages (except the keyword additions).
Indeed Google or others would only show 2 results, but then you could always choose "repeat the search with the omitted results included" at the end of the SERP which would bring up all pages.
HitProf
11-06-2004, 09:30 AM
Hi,
Thanks for sharing your insights. I find the area covered a bit narrow. I'm curious to see results for let's say travel (hotel new york), financials (home loans), legal (personal injury lawyer), software (anti virus software), etc.
I also doubt you can say it's a penalty; there could be other influences as well. For example: anchor text may have a much bigger influence than keyword-in-url, pushing the results in competitive areas toward well-known sites rather than keyword-rich URLs.
I'm looking forward to seeing more results.
orion
11-06-2004, 09:47 PM
This thread has been moved from the Beta Test area and has evolved into two main topics
a. Original topic: WebCeo methodology and findings
b. Derivative topic: keyword sequences, occurrences in document descriptors (urls, in this case) and ranking results.
Topic “b” has been triggered by topic “a”; however, both are interrelated as they refer to factors affecting relevancy.
In post #20 of this thread, Danny stated
I'll ask everyone to please stay on the topic of this thread -- look at the data presented and offer criticisms and comments on it, as was requested. That's useful. Going off on tangents about personal motives and some of the other stuff I've had to delete isn't useful.
So, let’s look at the data presented and offer criticisms and comments on the methodology presented by WebCeo without going off-topic.
WEBCEO METHODOLOGY
WebCeo has stated
I decided to find out the real importance of keyword presence in URL. There are so many empty-minded topics with lots of opinions and no research preceding them. That's why I performed the following research...
Interesting and at the same time unfortunate quote coming from WebCeo, as we will see in this lengthy post.
WebCeo then provides a procedure in which search results obtained by querying search engines are associated with sequences of terms and phrases and with the ranks of individual pages. Note that keywords and their occurrences are extracted from the top N retrieved URLs. He then classifies key phrases as competitive and non-competitive but does not provide a scientific metric to qualify or divide the two.
Let’s dissect his thesis in steps.
How risky is the mapping, association and correlation of search results to variables affecting relevancy? To address this and related topics, first we must have a clear understanding of query theory. Accordingly, when one reads IR research in which search results are mapped to some variables, the first thing one must look at is the query mode employed in the study. As you keep reading this post, things will become more evident.
When one queries a search engine, the engine returns results according to its default query mode (which can be overridden using advanced search features). In Google and other search engines, this default mode is FINDALL, also known as AND. In this mode
1. the retrieved documents must contain ALL terms specified in the query.
2. the terms can be anywhere in the documents; i.e., in titles, descriptions, meta tags, body, url, links, tables, etc.
3. the terms must be present in the documents without regard for order and proximity.
Thus, querying Google in its default FINDALL mode for a query Q consisting of, let's say, three terms (Q = k1 + k2 + k3) retrieves a set of documents in which these terms can be anywhere in the documents (for example, k1 in a title or sentence, k2 in the body and k3 in the url). There is no way to tell, just by looking at the search result pages, where or how the terms co-occur or how far they are from each other in the document. Since FINDALL is a search without regard for order and proximity, the terms can be in any sequence and at any distance from each other in the document.
This also means that for a single query of the form Q = k1 + k2 + k3, the set of retrieved documents will include documents containing any of the following 6 possible orderings, regardless of proximity
Q = k1 + k2 + k3
Q = k1 + k3 + k2
Q = k2 + k1 + k3
Q = k2 + k3 + k1
Q = k3 + k2 + k1
Q = k3 + k1 + k2
Clearly one cannot say that the retrieved results are representative of, or associated with, the queried k1 + k2 + k3 sequence, since documents containing that sequence are merely a subset of the total number of results.
Now if one queries any of the above six sequences in FINDALL not only the total number of search results will be different, but the top N ranking results and urls will be different, too. Therefore, extracting keywords or term sequences from urls or documents and then correlating these to search results and factors affecting relevancy is questionable.
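As a quick check on the count above, here is a minimal Python sketch that enumerates the orderings of a three-term query. The names `terms` and `orderings` are illustrative only and do not come from the thread.

```python
from itertools import permutations

# Placeholder query terms, matching the k1, k2, k3 notation used in the post.
terms = ["k1", "k2", "k3"]

# Every ordering of the three terms; a FINDALL query ignores this order.
orderings = [" + ".join(p) for p in permutations(terms)]

for q in orderings:
    print("Q =", q)

print(len(orderings), "orderings in total")
```

For n distinct terms the count is n factorial, so the number of sequences hidden behind one FINDALL query grows quickly with query length.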
Add to this that each search engine or IR system produces different search results and uses a different library of stop words and delimiters, and you can see why combining or even comparing results from different systems is an incorrect procedure.
Thus, keywords extracted in FINDALL, from a particular IR system or search engine, from the top N results or top N urls, and subsequent mapping of these keywords or urls to a given variable -rankings, term occurrences and co-occurrences, etc- is a risky business.
Consequently, I’m inclined to state that WebCEO's methodology and statistics are incorrect and his thesis is contrary to query theory. However, feel free to disagree with me.
KEYWORD SEQUENCES, OCCURRENCES, RELEVANCY, ETC
How about mapping keyword sequences, phrases, semantics and relevancy with search results obtained in the default FINDALL mode?
This is also highly questionable. If one wants to associate an exact term sequence with some experimental variables one should use the EXACT query mode, not the default FINDALL mode. An EXACT search is a search with regard for order and proximity. Thus, a search for k1 + k2 + k3 should only retrieve documents containing this EXACT sequence of terms. But there is more to the story. In an EXACT mode
1. the queried system often returns fewer results than in FINDALL
2. all terms present in the query must be present in the retrieved documents
3. all terms present in the query must be present in the documents and following the exact sequence as specified in the query
Now the “with regard for proximity” part depends on the library of stop words and delimiters to be ignored by the target search engine or IR system. This means two things
1. EXACT results are a subset of the results obtained in FINDALL
2. contrary to popular opinion, an EXACT search is not a search for phrases, since one can see EXACT search results in which the ignored stop words and delimiters are found between the specified terms. The IR system simply parses but does not count them for sequencing or relevancy purposes.
[Note. An exception to this would be hyphens used in FINDALL queries, since hyphens act as localized EXACT modes in the hyphenated portion of the FINDALL query. This unique phenomenon is observed in Google and a few other search engines, not in all of them.]
While search results in EXACT mode could provide better variable-sequence correlation measures than FINDALL, this is still not enough. Why? Simply put, because two different EXACT search results are subsets of different FINDALL result sets, as stated in point “1” above.
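The difference between the two query modes can be illustrated with a toy matcher. This is only a sketch under stated assumptions: the stop-word list, the sample documents and the function names `matches_findall` and `matches_exact` are all hypothetical, and no real search engine is claimed to work this way.

```python
# Hypothetical stop-word list; real engines use their own libraries.
STOP_WORDS = {"the", "of", "a", "and", "for"}

def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def matches_findall(query, document):
    """True if every query term occurs somewhere, in any order, at any distance."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))

def matches_exact(query, document):
    """True if the query terms occur in order and adjacent once stop words are dropped."""
    q = [t for t in tokenize(query) if t not in STOP_WORDS]
    d = [t for t in tokenize(document) if t not in STOP_WORDS]
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

docs = [
    "search engine optimization tips",
    "optimization of a search engine",
    "search the engine for optimization",
    "engine repair",
]
query = "search engine optimization"

findall_hits = [d for d in docs if matches_findall(query, d)]
exact_hits = [d for d in docs if matches_exact(query, d)]

print("FINDALL:", findall_hits)
print("EXACT:  ", exact_hits)
```

In this toy collection the EXACT hits are a subset of the FINDALL hits, and the third document matches in EXACT mode even though ignored stop words sit between the queried terms.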
So, “what is left?”, you may ask. What is left is probability estimates. Since EXACT results are a subset of the results obtained in FINDALL mode, the ratio of these two search results gives a probability estimate I call EF RATIOS. Thus, for a given query, Q
EF RATIO = EXACT mode query results/FINDALL mode query results
In English, an EF RATIO is the fraction of documents retrieved in FINDALL mode that contain the EXACT sequence of terms as specified in a query Q. Another way to look at this is to say that an EF RATIO is the probability that a document retrieved in FINDALL mode contains the specified EXACT sequence of queried terms. To learn more about query theory and EF ratios, see the Keywords Co-Occurrence and Semantic Connectivity (http://forums.searchenginewatch.com/showthread.php?t=48) thread.
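The ratio defined above is simple enough to sketch in a few lines of Python. The function name `ef_ratio` and the result counts are made up for illustration; real counts would have to come from the search engine's reported totals for each mode.

```python
def ef_ratio(exact_count, findall_count):
    """EF ratio = EXACT mode result count / FINDALL mode result count."""
    if findall_count <= 0:
        raise ValueError("FINDALL count must be positive")
    if not 0 <= exact_count <= findall_count:
        raise ValueError("EXACT count must lie between 0 and the FINDALL count")
    return exact_count / findall_count

# Hypothetical result counts for one query on one engine.
ratio = ef_ratio(exact_count=120_000, findall_count=2_400_000)
print(f"EF ratio: {ratio:.3f}")
```

Because the ratio is normalized by the FINDALL count, it can be compared across queries whose result sets differ greatly in size, as long as both counts come from the same engine.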
EF RATIOS are an important tool for studying natural language sequences and for correlating word patterns and popular term sequences found in retrieved web documents. For a given queried IR system or search engine, EF RATIOS can be used to compare and contrast result sets of different sizes triggered by dissimilar queries. Moreover, an EF ratio can be used to conduct temporal word pattern analysis; that is, to observe how term co-occurrence changes over time. This is important in business intelligence, where one needs to monitor co-word occurrences over time, and in seasonal markets, where one needs to understand the time evolution and usage of word patterns.
Still, EF RATIOS are far from perfect, especially when one considers that online database collections, like the content of individual documents, are constantly updated or contaminated with commercial noise, link building strategies, “SEO tricks” and all sorts of strategic alliances.
Orion
Anthony Parsons
11-06-2004, 10:53 PM
Nice post Orion. That is the very decisive and technical way of saying it's inconclusive and needs to take on board what we have stated in regard to finding a near-accurate result, though never 100% due to the differences measured at any one time within search results.
orion
11-08-2004, 01:40 PM
Well put, Anthony.
However, to be fair to WebCeo I must give him credit for making some research attempts. The problem here is not that he resorted to term extraction from search results, which is a fair procedure. The problem is what he did with the results and how he mapped/interpreted them.
It is clear that this work lacks a clear understanding of query theory and of the factors affecting ranking results and keyword competitiveness. So I can understand why most of the above members feel this was an “empty-minded” work, to use his own words.
Orion
Appreciate this type of research Chris_D did! This is definitely more insightful than most of the feelings and guesses one reads in other posts and SEO forums.
Now, obviously the outcome regarding competitive keywords is counter to what most of us expected (and experienced so far).
The problem is now to identify the reason or the flaw. Couldn't one flaw be, that there is no significant evidence that Google penalizes competitive keywords? If yes, the conclusion would be as expected.
Comparing two sites with different links seems to me a promising idea as an alternative test. But this too has perhaps some considerable drawbacks:
You have to wait months or so
You have to control that everybody in the universe is going to link to both sites
With 2 sites you probably have another problem of a sample which is too small
Thus, keywords extracted in FINDALL, from a particular IR system or search engine, from the top N results or top N urls, and subsequent mapping of these keywords or urls to a given variable -rankings, term occurrences and co-occurrences, etc - is a risky business.
I agree with Orion. He also justifies why and gives us hints of doing better. Now we have two choices:
Either use EXACT mode or consider all results in the test.
Do I understand you right? If yes, I'd prefer the first, if this is possible with M** or G**? (i.e. with parentheses or advanced search)
orion
11-21-2004, 08:56 PM
Hi, sfk
Indeed, if you do the same experiment in EXACT mode, not only do the original statistics collected in FINDALL by WebCEO break down, but the mapping of results in terms of keywords will change in a non-trivial fashion.
One can do a bit better by mapping in EXACT mode and even better if using EF ratios. As mentioned, the probability of finding an exact sequence of queried terms in documents retrieved in FINDALL is given by EF ratios.
Orion
strategicrankings
11-26-2004, 02:42 PM
While doing competitor backlink analysis for one client, I was particularly surprised to note that one competitor's site was ranking high for one of my client's target keywords.
And by coincidence the domain name of the site I was analysing contains the exact keyphrase my client was targeting, with hyphens between the words.
I researched all their backlinks on alltheweb, and few of the backlinks' anchor text had the target keyword in them (or so I thought). Actually most of their backlinks simply read their domain name.
So I was wondering how they could be ranking so high for this keyphrase. I came to understand that lots of their backlinks came from directory listings where they did not really have control over the anchor text. What was happening was that when they were listed simply with their domain name, the domain name was actually the anchor text of the link, which then contained their target keyphrase. Which is what we all want, don't we?
Excuse me for the bad English, I'm not a native English speaker.
orion
11-27-2004, 11:13 AM
Welcome to this thread, strategicrankings. Please feel at home. I can only discuss what experiments and theory tell me and give you my honest opinion. Thus, here is my view.
Trying to construct dimensional maps of the form
z = f(x,y)
where
x = exact or inexact terms sequence
y = urls
z = rankings
or trying to do experiments based on “correlation by isolation” of variables and then trying to draw conclusions based on partial observations, ignoring other factors or environmental variables is often a risky business.
There are too many variables affecting search engine ranking results. Any two Web documents, D1 and D2, belonging to two different domains, U1 and U2, “live” within dissimilar local and global environments.
Even in the ideal scenario in which D1 is a carbon copy of D2 and they differ only in their url naming convention, chances are that they will
1. have a different backlink structure across the Web; i.e., L1, L2.
2. be hosted on different servers; i.e., S1, S2.
3. rank differently across search engines.
To test the above dimensional map, one can do better by buying two different domain names, N1 and N2, one with an exact sequence and the other with a slightly different sequence (for argument's sake, assume they differ by just one character). Then proceed as follows.
Design their content as carbon copies of each other, and ensure that their internal link structure and their external (backlink) structure across the entire Web remain identical, a task almost impossible to do. Place both domain tree structures on the same server, and submit both sites to the same search engines on the same day and at the same time.
With all these precautions two things could happen. The engines could
(a) ignore or ban one or both of the documents/urls for being carbon copies
(b) accept, crawl and index one or both web properties.
In this ideal and extreme scenario, even if (b) occurs, chances are that
(a) they will be indexed with different crawling timestamps.
(b) if just one engine indexes D1 but not D2, their linkage across the Web will no longer be identical.
The odds are they should rank differently across the Web. Now add to this that users most likely cannot control who could/could not link to them across the entire Web. An engine that uses the link structure of the Web in its ranking mix could see each Web property as having different “fingerprints” for their node connectivity (backlinks).
To sum up, the above experimental maps are questionable as they are futile exercises of “correlation by isolation”. In my view, one could only talk about probabilistic estimates and accept the fact that the odds are against the experiment.
Orion
PR Weaver
12-08-2004, 02:06 PM
There are too many variables affecting search engine ranking results.
Hi WebCEO, Orion and all of you,
Although I have been reading you for a few months, this is my first post on this great forum.
I agree with Orion, the conclusions of this test cannot be 100% sure because too many variables were involved.
I didn't try to elaborate such an analysis, but decided to focus only on one variable: do search engines really read keywords in URLs?
This article explains the test and gives the first results: Should You Use Targeted Keywords in URL? (http://www.prweaver.com/blog/2004/12/08/79-keywords-in-url)
As we were only focused on one item (keyword in URL) we have decided:
To create webpages with the same look and feel of our other webpages
To create webpages with the "word" rkpatjfg in the URL (directories and file names)
To avoid the word rkpatjfg in the body text and head text of these pages
To link to these new pages without using the word rkpatjfg in the anchor text
My test confirmed that the main search engines do take account of keywords in URL.
Danny, I hope I stayed on the topic of this thread :rolleyes:
Best,
Olivier Duffez
Web Rank Info - PR Weaver
strategicrankings
12-08-2004, 04:35 PM
Hi Olivier,
Interesting article indeed. I posted in a thread related to the same topic on your forum back in June 2004. http://www.webrankinfo.com/forums/viewtopic.php?t=11496&postdays=0&postorder=asc&start=0
Unfortunately this was my first and last post there. Nice to see you here however :)
Anthony Parsons
12-08-2004, 07:12 PM
It is good to see someone test and document what they have done. Good stuff PR Weaver.
orion
12-08-2004, 08:52 PM
Welcome to this thread PR Weaver. Feel at home.
Let’s look at the evidence.
The issue and thesis at stake here is how valid it is to try to correlate absolute ranking results with specific terms and sequences of terms that have been extracted from urls. The evidence from WebCeo's own experiment suggests that their experiment was flawed.
Term Weight Schemes
Many seem to miss some important facts about how search engines, including Google, utilize term vector tf*IDF schemes when computing the weight of a term in a document and in a database. It is not only about link building or keyword placing in urls.
Since its inception in the web scene, Google has been combining PageRank and term vector tf*IDF schemes (among many other things) in its ranking mix.
On the other hand, consider tf*IDF schemes where
IDF=log(D/d)
D=total # of documents in the collection
d=# of documents containing the queried term(s)
For very uncommon terms (like invented terms), d will always be very small; thus, such terms will always have a high IDF value and high term weights. For details, see the Term Vector (http://forums.searchenginewatch.com/showthread.php?t=489) thread.
Obviously this introduces bias to the experiment. This is what I call spurious ranking results. Being #1 out of 10 is not the same as being #4 out of 100,000. We can do better by comparing relative rankings for a given database.
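To make the IDF effect concrete, here is a small Python sketch of IDF=log(D/d). The collection size and document counts are invented for illustration, and the natural logarithm is an assumption, since the post does not state the base.

```python
import math

def idf(total_docs, docs_with_term):
    """IDF = log(D/d); natural log is assumed here."""
    return math.log(total_docs / docs_with_term)

D = 1_000_000  # hypothetical collection size

invented_term = idf(D, 4)        # e.g. a made-up word found only on the test pages
common_term = idf(D, 100_000)    # e.g. a real word used in 10% of the collection

print(f"IDF of invented term: {invented_term:.2f}")
print(f"IDF of common term:   {common_term:.2f}")
```

The invented term receives a much higher IDF than the common one simply because so few documents contain it, which is the bias the post describes.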
A better approach would be the use of real terms as used in natural language, but again this introduces other variables to the test.
Here is a simple test.
1. Design a web page and save it with a valid keyword term, k1.
2. Resave the same web page with a different term.
3. Submit both and check for their rankings.
The two pages will have the same look and feel, and more important, the same content. Check how both rank for k1.
Now let's assume that the engine reads terms in the url.
Soon we will realize that even in this simplified test, the weight of k1 in the submitted document will depend, among many other things, on how many documents in the engine's database use k1, by virtue of the IDF term. Consequently, complete isolation of keywords-in-url effects is not possible and the comparisons can no longer be sustained.
IDF effects on the weight of a term are another reason why the above correlations are risky. Page designers, submitters and searchers have no control over IDF effects (the size of the database and the number of documents in the database containing the term).
I hope this helps.
Orion
Anthony Parsons
12-08-2004, 11:13 PM
Well said Orion. PR Weaver has clearly stated though that the test was not about ranking weight, only that the search engine does read the URL terms, nothing more. Yes, an accurate test for ranking does need to be performed. I am doing one over at SEO Testing, when I have the time. It is formatted, just not uploaded as yet.
From previous tests, I have found that Google does little to nothing with them towards ranking bias, though Yahoo and MSN did clearly use them to include within their rankings.
What PR Weaver's test has concluded though, is that Google does use it to "some extent", as it did show up for a direct search. You can put the oddest phrase you want within a description, and it won't turn up when searched directly, as Google definitely does not use the description for ranking purposes, only display. That I have personally tested and documented at SEO Testing. So the test does show it actually weights it, just not to what extent, which is the real test.
I think the actual test needs to go something like this though:
#1 - Make page about anything and save.
#2 - Insert keyword within page.
#3 - Record results.
#4 - Now insert keyword into URL.
#5 - Measure to see if any noticeable ranking position was lost or gained.
#6 - Remove keyword from URL to double check to ensure it returns to near original position.
#7 - Repeat #5 & #6 as required to take a mean average if necessary.
#8 - Now repeat by changing the keyword into a folder and test, then into a filename and test, then hyphenated, underscored and no-punctuation variants.
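The averaging in step #7 could be sketched as follows. This is only an illustration: the function name `mean_rank_shift` and the rank numbers are hypothetical, and nothing here addresses whether the measured shift is actually caused by the URL change.

```python
from statistics import mean

def mean_rank_shift(ranks_without, ranks_with):
    """Average of (rank without keyword in URL - rank with keyword in URL).

    A positive value means the page sat higher (smaller rank number)
    while the keyword was in the URL.
    """
    if not ranks_without or len(ranks_without) != len(ranks_with):
        raise ValueError("need the same non-zero number of readings for each state")
    return mean(a - b for a, b in zip(ranks_without, ranks_with))

# Hypothetical readings from three rounds of steps #5 and #6.
without_kw = [14, 15, 13]
with_kw = [11, 12, 12]

shift = mean_rank_shift(without_kw, with_kw)
print(f"Mean rank shift: {shift:.2f} positions")
```

Pairing each reading with the keyword against the reading taken just before or after it keeps slow drifts in the index from dominating the average, though it cannot remove them.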
orion
12-09-2004, 12:28 AM
HI, all.
Whether a search engine reads keyword terms in urls or not, the problem with this type of testing (e.g., the mapping of ranking variations to a given parameter) is that there are too many parameters that web designers, testers, and web searchers cannot control, no matter how controlled a setting they use. These parameters affect the weight of terms and therefore term relevancy and ranking.
One of such parameters is the tf*IDF term vector scheme utilized by a search engine. There are different schemes per search engine and even a given search engine can tweak at any time its unique tf*IDF scheme.
Let's say one adds or removes a term from the url. At any given time during the addition or removal of the term
D, the database size, is changing
d, the number of documents in the database containing the term, is changing.
So, let's say an experimenter does a test adding k1=dogs to a url, and that the database size is X and the number of documents containing k1 is Y.
Let's say the user then removes k1 from the url and retests. X, Y and thus IDF can still go either up or down. Let's assume IDF goes down at the time of retesting, thus affecting the weight of the term. What kind of correlation could a tester arrive at? How does he/she know whether the scheme has been modified, upgraded or downgraded?
Since X, Y and IDF can either go up or down at any given time, the mapping and correlations are still more a “voodoo stat” thing than anything else.
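A short Python sketch of this drift, with invented values for X and Y at the two test times and the natural logarithm assumed for IDF=log(X/Y):

```python
import math

def idf(total_docs, docs_with_term):
    """IDF = log(X/Y); natural log is assumed here."""
    return math.log(total_docs / docs_with_term)

# Hypothetical snapshots of the engine's database at the two test times.
idf_first_test = idf(1_000_000, 50_000)   # X = 1,000,000, Y = 50,000
idf_retest = idf(1_050_000, 70_000)       # the index grew, and so did use of k1

drift = idf_retest - idf_first_test
print(f"IDF at first test: {idf_first_test:.3f}")
print(f"IDF at retest:     {idf_retest:.3f}")
print(f"Drift:             {drift:+.3f}")
```

Here the weight of k1 falls between the two measurements without the tester touching the page, so a ranking change observed at retest cannot be attributed to the url edit alone.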
I have mentioned only one scoring function (tf*IDF schemes) that testers cannot control. What if an engine at any given time tweaks or changes the cosine similarity function used to score similarities between documents and queries? What about modified similarity scores, parsing methods, etc?
To sum up, in my view, and others can disagree with me, there are too many factors that a search engine can tweak, or that the tester may not be aware of, which make such ranking correlations risky if not exercises in futility.
Orion
Anthony Parsons
12-09-2004, 02:02 AM
Absolutely Orion. I'm not disagreeing with you on that point. Testing SEO techniques is very much an exercise in futility, as the exact science behind it is not necessarily known unless you designed the algorithm you're testing.
In my testing, if I don't see enough improvement to differentiate an actual increase / decrease, then I leave it as an unknown for ranking purposes, though I can generally answer whether it is actually used or not. It's just to what extent it is used!
PR Weaver
12-09-2004, 05:18 AM
Yes Orion: I know it's nearly impossible to give test results with 100% reliable numbers.
My test had only one goal: analyzing whether search engines read keywords in URLs or not. Do you agree with my conclusion: "Google, (MSN beta), Yahoo and Exalead do read keywords in URL"?
Best,
Olivier Duffez
Web Rank Info - PR Weaver
orion
12-09-2004, 08:57 AM
Great discussion, guys.
I’m inclined to agree that some search engines do read keywords in urls. The question is why.
I may need to disclose some reasons. In the process I’ll describe a procedural recipe.
Why’s
Too many reasons can be invoked; from adding terms to their index terms, to internal testing, to business intelligence, to feeding back into other services (paid services). In the particular case of Google, MSN and others, who knows why, and I prefer not to speculate.
I do know that some very simple and other primitive systems used to create four types of arrays (titles, urls, descriptions, and keywords) out of metadata. The array entries were then correlated via a unique document ID. This IR architecture is computationally expensive.
How-to's
An alternative approach is an architecture in which the keywords array is eliminated altogether and as follows:
//create a three-dimensional array per document containing titles, urls and descriptions, so you have one entry per document
//parse each entry to extract keywords
//create the index terms
//apply the weighting scheme and scoring function
//use a loop to populate a temporary keywords array at query time.
What do you gain out of this approach at run time? You eliminate
(a) the need for indexing the keywords
(b) the need for building/maintaining a keywords array, thus saving resources.
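The outline above might look roughly like this in Python. Everything here is an assumption made for illustration: the sample documents, the tokenizer, and the names `extract_terms`, `index` and `keywords_for_query` are hypothetical, and the weighting and scoring step is left out because the post does not specify it.

```python
import re
from collections import defaultdict

# One entry per document: (title, url, description). Hypothetical sample data.
documents = {
    1: ("Leather belts", "http://example.com/womens-belts.html", "Belts for women"),
    2: ("Shoe care", "http://example.com/shoes/care.html", "How to polish leather shoes"),
}

def extract_terms(entry):
    """Parse an entry (a tuple of strings) into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", " ".join(entry).lower())

# Index terms: term -> {document ID -> number of occurrences across the entry}.
index = defaultdict(dict)
for doc_id, entry in documents.items():
    for term in extract_terms(entry):
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def keywords_for_query(query):
    """Populate a temporary keywords array at query time instead of storing one."""
    return {t: index[t] for t in extract_terms((query,)) if t in index}

print(keywords_for_query("womens belts"))
```

Note that in this sketch "womens" is found only because it appears in the url of document 1, which is one way a term in a url can end up among the index terms without any separate keywords array.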
However, this approach creates other problems, some a bit off-topic.
So, those are the why’s and how-to’s. Now to say that terms that were extracted specifically from the url portions of the entries bump up or down ranking results or to correlate rankings with specific terms in urls is the questionable part of the tests.
I hope this sheds some light.
Orion
Beltzilla
06-24-2008, 10:16 PM
We just added keywords in URL to our women's belts site, and therefore this post was a real eye-opener. I always assumed that it would be beneficial, but never saw any evidence. Thank you for sharing.