Effects of near identical pages?
critter
03-10-2005, 01:20 PM
MODERATOR NOTE: Merged from similar thread.
Quick question.
We have a very large affiliate network of websites.
It seems many of our webmasters do whatever they like, and will often rip off our website in a few different ways.
A. They create a framed site and load our URL in the body frame.
B. Affiliates will just copy our entire site and load it up on their own domain.
Now we all know duplicate content is not great in terms of SEO and potential rankings. My question is about webmasters who are doing this without our permission. Do the various search engines penalize us for it? Should we be going after these affiliates aggressively and getting them to stop the plagiarism and theft of our sites?
Cheers
CRITTER
enormo
03-10-2005, 06:11 PM
MODERATOR NOTE: Original first post.
I have an e-commerce site that is totally database driven. Not the best thing for search engines.
I am going to write a script that can generate a static version of the site so that it will be faster (I don't update the site enough to warrant hitting the DB each time) and more search engine friendly.
My potential problem is that I'm going to have multiple near identical pages...
That is, the way the site is designed now, the individual product pages (each featuring a single product) also have a menu on each page with only a partial listing of items. So while looking at item A you also see items A, B, C, D, E in the menu. If you click the button to advance the menu you will see items F, G, H, I, J in the menu, BUT you will still be looking at item A as the individual featured item. (To get an idea, you can go to tiffany.com ... click what region you are in and it will take you to my example.)
So, the potential search engine problem is I will have multiple pages for each item... the only difference between the two is that the images in the menu will be changed. Right now there might only be two near-identical pages per item but that could increase if the inventory grows.
Would this dilute my search engine rankings?
Thanks for sticking with the long explanation. If you need more info or a clarification I would be happy to provide it!
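To put rough numbers on how the page count could grow, here is a minimal sketch. It assumes the menu shows five items per window (as in the A-E / F-J example above), that the windows do not overlap, and that the static generator writes one page per product per menu window. The function names are made up for illustration.

```python
import math


def variants_per_item(inventory_size: int, menu_size: int = 5) -> int:
    """Pages generated per product: one for each menu window (assumed non-overlapping)."""
    if inventory_size <= 0:
        return 0
    return math.ceil(inventory_size / menu_size)


def total_static_pages(inventory_size: int, menu_size: int = 5) -> int:
    """Total static pages if every product is written once per menu window."""
    return inventory_size * variants_per_item(inventory_size, menu_size)


if __name__ == "__main__":
    for n in (10, 25, 50):
        print(n, "items:", variants_per_item(n), "pages per item,",
              total_static_pages(n), "pages in total")
```

Under these assumptions the number of near-identical pages per item grows linearly with inventory, and the total page count grows roughly with its square.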
Mikkel deMib Svendsen
03-10-2005, 06:51 PM
Duplicate content can, and most likely will in the long run, hurt your INDEXING - and if pages are not indexed they won't, of course, rank for anything. So indirectly it can hurt your rankings, but only because such pages can, and often do, get removed.
The problem is you never know which version the engines end up removing - or whether they remove both, or more of the site. Your best solution is to NOT make them have to make that decision - make it for them by only giving them access to index one version. You can either exclude the duplicate version with META-robots or robots.txt, or detect the spiders and remove the links to the duplicate pages for them. The first method is probably more "secure" and standard.
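As a rough illustration of the first method, the sketch below assumes the static generator writes the extra menu variants under a hypothetical /variants/ directory. It checks a robots.txt rule with Python's standard urllib.robotparser and builds a META-robots tag for the page head. The paths, the "ExampleBot" agent name, and the helper names are assumptions, not anything taken from the site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: main product pages live under /products/,
# the extra menu-window copies under /variants/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /variants/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def robots_meta(is_duplicate_variant: bool) -> str:
    """META-robots tag for a page: keep duplicates out of the index but let links be followed."""
    content = "noindex,follow" if is_duplicate_variant else "index,follow"
    return f'<meta name="robots" content="{content}">'


if __name__ == "__main__":
    print(is_crawlable(ROBOTS_TXT, "http://example.com/products/item-a.html"))
    print(is_crawlable(ROBOTS_TXT, "http://example.com/variants/item-a-menu-2.html"))
    print(robots_meta(True))
```

Either mechanism on its own should be enough. Note that a page blocked in robots.txt is never fetched, so a META tag on that same page would not be seen by the spider.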
lots0
03-10-2005, 09:01 PM
near identical
Isn't that kinda like being, close to nowhere?
I know this is really obvious but; Near identical is not identical.
Connie
03-10-2005, 10:05 PM
Isn't that kinda like being, close to nowhere?
I know this is really obvious but; Near identical is not identical.
I may not have understood what you meant, but to me if the only thing you change on a page is a graphic, then near identical to a spider would probably be identical. Even if some text is changing, at what point would near identical not be identical to a spider? :)
I agree the change would benefit the visitor, but I don't think a spider would see it that way.
Marcia
03-10-2005, 10:18 PM
"Near duplicate" is enough to cause a problem. It's not only identical text - pages can be picked up as near dups even with some unique text if other factors are identical, including but not limited to page structure, page names, file paths, etc. - some pertaining to detecting mirrors on different sites but still worth considering.
Before creating additional pages, check into mod_rewrite - or consider creating additional static content pages for the site that are useful to users and help the engines determine that the site is relevant for the topic or products.
ThouShaltSeo
03-10-2005, 10:27 PM
I know this is really obvious but; Near identical is not identical.
It all depends on what Google calls "identical", not what it means to us, or how it's defined in a dictionary. Maybe even 80% similarity=Identical to Google
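For a sense of what a raw "80% similarity" rule could look like, here is a minimal sketch using word shingles and Jaccard overlap. This is only one textbook way to score similarity. Nothing in this thread confirms that any search engine works this way, and the 0.8 threshold is simply the guess above turned into a parameter.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences ("shingles") from the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def looks_identical(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: treat pages as duplicates at or above the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    page_a = "the quick brown fox jumps"
    page_b = "the quick brown fox sleeps"
    print(similarity(page_a, page_b), looks_identical(page_a, page_b))
```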
Mikkel deMib Svendsen
03-11-2005, 01:37 AM
Maybe even 80% similarity=Identical
If it was this simple it would be easy. Unfortunately that's not my experience. Using a "raw" percentage would give engines far too many false positives, and with the everlasting focus on index size engines just don't like to do that.
In general, I usually never see problems with near-identical pages. Not that I am saying it would never be a problem, but most of the problems I see are with real duplicate content - multiple URLs pointing to the same pages (for example with different parameter values - or static and dynamic versions).
New graphics, however, might just be one of those situations where pages could count as duplicate if all the difference is the images.
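One way to spot the "multiple URLs pointing to the same page" case on your own site is to reduce each URL to a canonical form and group the ones that collide. The sketch below is only an illustration: the parameter names in IGNORED_PARAMS (sid, sessionid, menu_page) are hypothetical examples of parameters that do not change the main content, and you would have to pick your own.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the main content of a page.
IGNORED_PARAMS = {"sid", "sessionid", "menu_page"}


def canonical_url(url: str, ignored=IGNORED_PARAMS) -> str:
    """Lowercase scheme/host, drop ignored parameters, sort the rest."""
    parts = urlsplit(url)
    params = [(k, v)
              for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k.lower() not in ignored]
    params.sort()
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(params), ""))


def group_duplicates(urls):
    """Map each canonical URL to the original URLs that collapse onto it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    crawl = [
        "http://example.com/item?id=7&menu_page=1",
        "http://example.com/item?menu_page=2&id=7&sid=abc",
        "http://example.com/item?id=8",
    ]
    for canonical, originals in group_duplicates(crawl).items():
        print(canonical, "<-", originals)
```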
lots0
03-11-2005, 03:11 AM
In general, I usually never see problems with near-identical pages.
I agree.
In my experience, for a page to receive a duplicate penalty it must be a duplicate (you know, exactly the same).
If the onpage URL that is calling the graphic changes when the image changes then the page is not “identical” or a “duplicate”, it is in fact a different page.
If you're worried about these pages hurting your ranking, you might want to look into banning the bots from them.
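The two views in this thread can be put side by side with a small sketch. It fingerprints a page twice: once over the raw HTML (where a changed image URL makes the pages differ) and once over the visible text only (where they come out the same). Which of these, if either, a spider actually compares is not something anyone here has confirmed. The sample pages and function names are made up.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def raw_fingerprint(html: str) -> str:
    """Hash of the full markup, including image URLs."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def text_fingerprint(html: str) -> str:
    """Hash of the visible text only, with whitespace normalized."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two made-up pages that differ only in the menu image.
PAGE_1 = '<h1>Item A</h1><p>Silver ring.</p><img src="/img/menu-b.jpg">'
PAGE_2 = '<h1>Item A</h1><p>Silver ring.</p><img src="/img/menu-g.jpg">'

if __name__ == "__main__":
    print("raw markup equal:  ", raw_fingerprint(PAGE_1) == raw_fingerprint(PAGE_2))
    print("visible text equal:", text_fingerprint(PAGE_1) == text_fingerprint(PAGE_2))
```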
Andy1969
03-11-2005, 04:56 AM
"Near duplicate" is enough to cause a problem. It's not only identical text - pages can be picked up as near dups even with some unique text if other factors are identical, including but not limited to page structure, page names, file paths, etc. - some pertaining to detecting mirrors on different sites but still worth considering.
Thanks Mikkel. I am interested to know more about the research behind this. Is there a white paper or anywhere you could point me to, to find out more?
Andy
Mikkel deMib Svendsen
03-11-2005, 05:08 AM
I believe you can find theoretical papers on identification of duplicate or near-duplicate strings, but they may, or may not, directly apply to the engines in question. I have not seen any valid studies that prove what kind of matching engines across the board use for this. So personally I am just speaking from experience working on a high number of dynamic sites with these problems.
kunalg
03-11-2005, 07:00 AM
MODERATOR NOTE: Merged from similar thread.
Hi buddy,
You don't need to worry if another webmaster copies your site or content, because all the search engines follow first come, first served (FCFS). That means if Google at any time finds two sites with the same content, it removes or bans the site that was registered later or is newer. If you want to report this to the search engine, you can also try that.
dannysullivan
03-11-2005, 07:04 AM
I merged some posts on a similar thread into this one, as the subject was so near identical, pun intended :)
To the above, it's not as clear cut as just going with the "oldest" page. I believe Google, for example, has said that it will tend to prefer the most "linked to" version. So if you had content, someone else took your content, your content would still likely come up first if more people linked to it. I haven't reviewed the official rundown by search engines on how they determine which duplicate copy to go with for a bit, so I'll put that on my list. In the meantime, perhaps others who have heard can also share. The age of the document certainly could be one of those factors.
Michael Martinez
03-11-2005, 04:01 PM
Amazon's rankings have not been injured by duplicate content from some of their larger associates. I believe there may be two principal reasons for Google's tolerance:
1) Amazon is what I call a Trusted Content Site. It's large, it's old, and Google has been crawling it for years.
2) Amazon has so many associates out there that the duplicate content probably constitutes only a fraction of the sites linking to Amazon, so its own natural link popularity helps Google determine that Amazon is the best site to select from any collection of duplicate pages.
If you have the capability, you should consider offering datafeeds to your affiliates (I prefer .CSV files myself, but other people may have different preferences). If your power affiliates can build their own pages from their own templates, they may prefer doing that anyway.
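A datafeed along those lines can be very small. The sketch below writes and reads a CSV feed with Python's csv module. The column names (sku, name, price, url) and the sample product are hypothetical, since no feed format is specified here.

```python
import csv
import io

# Hypothetical feed columns; a real feed would use whatever the affiliates need.
FIELDS = ["sku", "name", "price", "url"]


def build_feed(products) -> str:
    """Serialize product dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for product in products:
        writer.writerow({field: product[field] for field in FIELDS})
    return buffer.getvalue()


def read_feed(feed_text: str):
    """What an affiliate's script might do: parse the feed back into dicts."""
    return list(csv.DictReader(io.StringIO(feed_text)))


if __name__ == "__main__":
    feed = build_feed([
        {"sku": "A1", "name": "Ring, silver", "price": "49.00",
         "url": "http://example.com/products/a1"},
    ])
    print(feed)
    print(read_feed(feed))
```

Affiliates who build pages from a feed like this write their own templates and surrounding copy, so the result is less likely to be a straight copy of the merchant's pages.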
Database-driven sites are now usually crawlable. All of the major search engines are actively crawling and indexing dynamically generated forums, product catalogues, and blogs on a regular basis (as well as news sites). Content management systems are usually written to store their data in SQL databases. All you really need be concerned about is keeping your dynamic URLs as clean as possible (session IDs can be a problem, but Amazon gets them indexed all the time) and ensuring that your internal link navigation is logical and persistent enough to enable a spider to get to all the pages.
Marcia
03-11-2005, 04:16 PM
is there a white paper or anywhere you could point me to, to find out more?
There are several, and one in particular I've just recently come across that's one of the best yet. I'll dig them out and collect them over the weekend to post.
AussieWebmaster
03-11-2005, 04:37 PM
Here's a few to look over at various levels of intensity:
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,658,423.WKU.&OS=PN/6,658,423&RS=PN/6,658,423
http://www.cse.lehigh.edu/~lopresti/Publications/1999/icdar99.pdf
http://csdl.computer.org/comp/proceedings/icdar/1999/0318/00/0318toc.htm
The above should scramble your brains.