How come Google ranks both static and dynamic pages for duplicate content?
callback
03-21-2005, 05:07 PM
How come Google ranks both static and dynamic pages for duplicate content? The static page is used as a mirror page that redirects to the dynamic page. Google not only indexes both pages but also ranks them. The two pages have a little difference in title tags but the contents are the same.
These paired pages are seen on e-commerce sites, for example bestbuy.com. A bestbuy.com product, e.g. the Compaq "Presario 2800 Notebook", can be found in Google through the static page http://www.bestbuy.com/products/1099389840386.jsp and also through the real, lengthy dynamic URL containing "?".
I can't understand it. I thought Google detected mirror pages to reject spamming, or that it simply indexed the destination URL instead of the mirror if the dynamic page can be indexed. But now I see that both have been indexed.
I think Amazon has a similar practice for getting its products indexed: it creates static mirror pages corresponding to the dynamic ones, so whenever a new product page is made, a static page is made too, and both can be crawled.
If this can be done, you save the energy and money of changing every dynamic URL to a static one by simply creating mirror pages.
Is this practical? Will Google penalize it, given that you could create ten or twenty mirror pages for one dynamic page?
Mikkel deMib Svendsen
03-21-2005, 07:18 PM
I thought Google detected mirror pages to reject spamming
Yes, they "try" to do so, but they are far from perfect. Just imagine for a second the kind of resources it takes to do this perfectly. With 8 billion documents in the index it is virtually impossible to compare all pages one by one, and they would have to do it again for every new update to catch everything. It's just not realistic at this stage. So you will find many examples of such duplicate content. However, in my experience much of it does in fact disappear over time; it just doesn't always happen right away.
On top of this you should be aware that search engines have sometimes made special arrangements with major sites that have important information. This can also confuse what you see.
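To put a rough number on the one-by-one comparison described above: comparing every page against every other page means n(n-1)/2 comparisons. The sketch below is back-of-the-envelope arithmetic only. The 8 billion figure comes from the post; the function name and the comparison rate are assumptions made for illustration and say nothing about how any search engine actually works.

```python
def pairwise_comparisons(n_docs: int) -> int:
    """Number of distinct page pairs among n_docs pages: n(n-1)/2."""
    return n_docs * (n_docs - 1) // 2


INDEX_SIZE = 8_000_000_000  # "8 billion documents", as stated in the post

# Assumed rate, purely hypothetical: one billion page comparisons per second.
COMPARISONS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600

pairs = pairwise_comparisons(INDEX_SIZE)
years = pairs / COMPARISONS_PER_SECOND / SECONDS_PER_YEAR

print(f"{pairs:.2e} pairs")  # 3.20e+19 pairs
print(f"about {years:,.0f} years at the assumed rate")
```

Even at that generous assumed rate, a single full pass would take on the order of a thousand years, which is why a brute-force comparison of all pages is not realistic.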
callback
03-21-2005, 07:24 PM
Great points. As far as I can imagine, even if the dup filter finally catches a duplicate page and deletes it, that may be a few months late. By that time the products at bestbuy.com may no longer be on sale, and it really doesn't matter because new pages have been generated. This is a short-term strategy, and it only works on sites that are updated that frequently.
lots0
03-21-2005, 08:20 PM
The two pages have a little difference in title tags but the contents are the same.
I think what you're talking about here are "near duplicates". If any of the code or text is different, they really can't be called "duplicates". Duplicate pages would be pages that are identical in every respect except for the URL.
I don't believe that Google (or any search engine) has the ability or desire to detect near duplicates.
Mikkel deMib Svendsen
03-21-2005, 08:29 PM
As far as I am aware, all major search engines use various kinds of near-duplicate checks for all such analysis. There are simply too many page variables that can change on every single request, such as rotating content, time stamps, news flashes and advertising, for it to work if you don't.
Having said that, I usually find that it's only "real" duplicate content that gets booted. A very small difference in title alone might not be enough. I would definitely not feel safe.
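As an illustration of what a near-duplicate check can look like, here is a minimal sketch using word shingles and Jaccard similarity, a standard textbook technique. It is not claimed to be what any search engine actually runs. The function names, the shingle size and the sample page text are all made up for the example.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared shingles divided by all shingles: 1.0 identical, 0.0 disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Made-up sample pages: same body text, only the time stamp differs.
page_a = "this notebook ships with a case and a one year warranty updated 10:01"
page_b = "this notebook ships with a case and a one year warranty updated 10:02"

print(round(jaccard(page_a, page_b), 2))  # 0.83
```

A changing time stamp only disturbs the few shingles that contain it, so the score stays high. A real system would also need to pick a threshold above which two pages count as "the same", and that threshold is a judgment call.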
I think what you're talking about here are "near duplicates". If any of the code or text is different, they really can't be called "duplicates". Duplicate pages would be pages that are identical in every respect except for the URL.
I don't believe that Google (or any search engine) has the ability or desire to detect near duplicates.
Hi Lotso
While I have no evidence that they have implemented it, Google do have a patent on a duplicate page detection algo (http://www.chat11.com/Google_Patent_On_Detecting_Duplicate_Files_-_Description) which fingerprints both the entire page and portions of the page.
If they are not doing it now, they have at least shown an interest in identifying and possibly penalizing pages that use great chunks of text that are identical on other pages.
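The general idea of fingerprinting both a whole page and portions of it can be sketched as below. This is only an illustration of the concept, not the algorithm from the patent. Splitting on blank lines, using SHA-256, and all function names are assumptions chosen for the example.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def page_fingerprints(page: str) -> tuple[str, set[str]]:
    """Fingerprint of the whole page plus one fingerprint per paragraph."""
    paragraphs = [p for p in page.split("\n\n") if p.strip()]
    return fingerprint(page), {fingerprint(p) for p in paragraphs}


def compare(page_a: str, page_b: str) -> tuple[bool, float]:
    """Return (exact duplicate?, share of page_a's paragraphs found in page_b)."""
    whole_a, parts_a = page_fingerprints(page_a)
    whole_b, parts_b = page_fingerprints(page_b)
    if not parts_a:
        return whole_a == whole_b, 0.0
    return whole_a == whole_b, len(parts_a & parts_b) / len(parts_a)
```

With this kind of check, two pages that differ only in their title would not match as whole pages, but most of their portions would still match, which is the case described at the start of this thread.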
AussieWebmaster
03-21-2005, 09:57 PM
As far as I am aware, all major search engines use various kinds of near-duplicate checks for all such analysis. There are simply too many page variables that can change on every single request, such as rotating content, time stamps, news flashes and advertising, for it to work if you don't.
Having said that, I usually find that it's only "real" duplicate content that gets booted. A very small difference in title alone might not be enough. I would definitely not feel safe.
I agree these are not mirror pages. Considering the differences in title tags, the dynamic category headings, and the many different ways the dynamic code generates the page, the differences keep the pages just below the bar for a penalty. The other thing mentioned above is that duplicates are not flushed at the start: by the time the bot gets around to fully checking them, the items have changed prices etc. and are refreshed and reindexed as updated pages.
Sites where inventory is updated and model numbers change slightly, and which therefore generate new pages, have an edge.
Friends,
Google considers many factors to find duplicate pages / content.
Within two PageRank update periods, duplicate pages will be removed :D regardless of how big or small the site is.
Jag