Is A Trailing / On A Directory Seen As A Different File By Google?
bobmutch
12-06-2004, 12:36 AM
I would like to open this subject up for discussion. Is a trailing / on a directory seen as a different file by Google? Does Apache or IIS see them as different files? Can you 301 example.com/dir to example.com/dir/ ? Please post examples of URLs with and without a trailing / that have different PR.
Here is a sample I got from the ranking folders instead of pages (http://forums.searchenginewatch.com/showthread.php?t=1532) thread.
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
ThouShaltSeo
12-06-2004, 12:50 AM
I believe so (that they're separate).
Also, if you link to domain-com/file.html?track it will be seen as a different page from domain-com/file.html and who knows, possibly trip a dupe filter since it's the same exact content.
bobmutch
12-06-2004, 01:16 AM
ThouShaltSeo: Yes, I understand that a query string ? on the end of a file will produce an additional file as far as Google is concerned, but not as far as Apache or IIS are concerned. I just finished fixing a site that had a directory/? in the source on about 6 pages. Each one of the directory/? pages had a PR7. Pretty big bleed.
Same thing with example.com/dir/ and www*example.com/dir/ . They are seen as two different locations by Google, but the nice thing is that they are seen as two different locations by IIS and Apache also, so you just 301 one into the other.
ThouShaltSeo
12-06-2004, 01:32 AM
Can you please tell me how you fixed the /? thing? It can't be 301'd.
Thanks in advance,
ThouShaltSeo
bobmutch
12-06-2004, 01:45 AM
ThouShaltSeo: On the site I just fixed, I changed the code in a javascript function that put a /directory/? in the source of each offending page so it no longer added the ?.
If you are dealing with a dynamic site where you have example.com/inventory.cfm?id=777, then you just use a replace() function that changes the ? and = to a / in your source. Then add the SafeSpiderURL DLL to IIS if your hosting is on a Win32 box, or, if your hosting is on Apache, use mod_rewrite to convert the changed form in the source back to what you want the server to read.
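To make the replace() idea concrete, here is a minimal Python sketch of the two directions: turning the query-string form into a slash-separated form for the page source, and turning it back, which is the job the post assigns to SafeSpiderURL or mod_rewrite. The function names and the single key/value assumption are illustrative only; they are not taken from the site bobmutch describes.

```python
def to_path_style(url: str) -> str:
    """Replace the first '?' and every '=' with '/', as the post describes.

    Assumes a single key=value pair, like inventory.cfm?id=777.
    """
    return url.replace("?", "/", 1).replace("=", "/")


def to_query_style(url: str, script: str = "inventory.cfm") -> str:
    """Reverse step: roughly what a server-side rewrite would have to do.

    `script` is a hypothetical parameter naming the dynamic page.
    """
    head, sep, tail = url.partition(script + "/")
    if not sep:
        return url  # not a path-style URL for this script; leave it alone
    key, _, value = tail.partition("/")
    return f"{head}{script}?{key}={value}"


if __name__ == "__main__":
    spider_friendly = to_path_style("example.com/inventory.cfm?id=777")
    print(spider_friendly)
    print(to_query_style(spider_friendly))
```

The link in the page source carries no ? at all, while the server still receives the query string it expects.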
bobmutch
12-24-2004, 12:50 PM
Has anyone else come up with an example of how a trailing / can be seen as a different file by the search engines, as in this case?
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
I have been looking and haven't seen anything. I am wondering now if this is just a case of mod rewrite.
JohnW
12-24-2004, 01:10 PM
It's not as clear-cut, to me, as some suggest. There are examples both supporting and contradicting this: some where the page (folder) with the trailing / was considered to be a different page with different PR and BLs, and others where Google has figured it out and treated the pages as if they were in fact the same page regardless of the /. IMO consistency is the only thing worth worrying about.
bobmutch
12-24-2004, 02:06 PM
JohnW: "IMO consistency is the only thing worth worrying about." Right, sounds good to me. But if people are giving you links with no / on them and you are doing links with /'s, then you could end up with 2 different files according to the search engines, IF they are considered two different files. Now I am quite sure this is not the case. Personally I think the examples in this thread are caused by incorrect mod rewrite code.
I have noticed that all the directories, or most of them, use / on ALL their entries, and I have been doing that myself for some time.
orion
12-24-2004, 03:03 PM
I hope this helps.
URLs need to be normalized and then hashed to form a unique page identifier. Hashing is done to avoid collisions when documents are mapped to unique page ids.
The procedure is pretty much straightforward. It is also described in WebBase : A repository of web pages (http://newdbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1999-26&format=pdf&compression=&name=1999-26.pdf)
Hector Garcia-Molina’s group writes:
"2.2 Page identifier
Since a web page is the fundamental logical unit being managed by the repository, it is important to have a
well-defined mechanism that all modules can use to uniquely refer to a specific page. In the WebBase system,
a page identifier is constructed by computing a signature (e.g., checksum or cyclic redundancy check) of the
URL associated with that page. However, a given URL can have multiple text string representations. For
example, http://www.stanford.edu:80/ and http://www.stanford.edu both represent the same web page but
would give rise to different signatures. To avoid this problem, we first normalize the URL string and derive a
canonical representation. We then compute the page identifier as a signature of this canonical representation.
The details are as follows:
Normalization: A URL string is normalized by executing the following steps:
a. Removal of the protocol prefix (http://) if present
b. Removal of a :80 port number specification if present (However, non-standard port number specifications are retained)
c. Conversion of the server name to lower case
d. Removal of all trailing slashes ("/")
The resulting text string is hashed using a signature computation to yield a 64-bit page identifier.
The use of a hashing function implies that there is a non-zero collision probability. Nevertheless, a good hash
function along with a large space of hashes makes this a very unlikely occurrence. For example, with 64 bit
identifiers and 100 million pages in the repository, the probability of collision is 0.0003. That is, 3 out of
10,000 repositories would have a collision. With 128 bit identifiers and a 10 billion page collection, the
probability of collision is 10^-18. See [CGM98] for more discussion and a derivation of a general formula for
estimating collisions."
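The 64-bit figure in the quote can be sanity-checked with the standard birthday approximation. This is only a sketch: the formula below is the textbook estimate and is an assumption on my part, and it may not be the exact general formula derived in [CGM98].

```python
import math


def collision_probability(num_pages: int, id_bits: int) -> float:
    """Birthday-bound estimate of at least one collision.

    p ~= 1 - exp(-n(n-1) / (2 * 2**bits)), assuming ids are uniformly
    distributed over the whole id space.
    """
    space = 2 ** id_bits
    exponent = -num_pages * (num_pages - 1) / (2 * space)
    return -math.expm1(exponent)  # numerically stable 1 - exp(x) for tiny x


if __name__ == "__main__":
    # 100 million pages with 64-bit identifiers, as in the quoted passage.
    print(f"{collision_probability(100_000_000, 64):.6f}")
```

For 100 million pages and 64-bit ids this comes out at roughly 0.00027, which rounds to the 0.0003 quoted above.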
In a general sense, the idea of this procedure is to recognize some URL text string cases as having the same signature. An example is given below
http://www.doc1.com:80
http://www.doc1.com:80/
http://www.doc1.com
http://www.doc1.com/
After the treatment they all should have the same page id.
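The four normalization steps quoted above can be sketched in a few lines of Python. This is an illustration only, not WebBase's code: the paper says "signature (e.g., checksum or cyclic redundancy check)", and the truncated SHA-256 used here is a stand-in I chose to get a 64-bit value. The function names are also mine.

```python
import hashlib


def normalize_url(url: str) -> str:
    """Apply the four steps from the quoted WebBase section, in order."""
    s = url.strip()
    # 1. Remove the protocol prefix (http://) if present.
    if s.lower().startswith("http://"):
        s = s[len("http://"):]
    # Separate host[:port] from the rest of the path.
    hostport, slash, rest = s.partition("/")
    host, _, port = hostport.partition(":")
    # 2. Drop a :80 port; non-standard ports are retained.
    # 3. Convert the server name to lower case.
    host = host.lower()
    hostport = host if port in ("", "80") else f"{host}:{port}"
    # 4. Remove all trailing slashes.
    return (hostport + slash + rest).rstrip("/")


def page_id(url: str) -> int:
    """64-bit identifier of the canonical form (truncated SHA-256 stand-in)."""
    digest = hashlib.sha256(normalize_url(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    for u in ("http://www.doc1.com:80", "http://www.doc1.com:80/",
              "http://www.doc1.com", "http://www.doc1.com/"):
        print(u, "->", normalize_url(u), page_id(u))
```

Under this recipe a directory URL with and without its trailing slash also normalizes to the same string, so both would receive the same page id.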
Since text affects the checksums, different directory paths leading to the same document content should produce different page ids.
http://www.doc1.com/aaa/docA.hml
http://www.doc1.com/bbb/docA.hml
Whether this is detected or not depends on the "spam patrol" (human or algorithmic) of the target system.
Orion
JohnW
12-24-2004, 03:46 PM
Bob, it's hopeful that you have some control over how pages link to you ;-) so when it is possible, you do what you can. I like using the / because it saves the web server processing time, and, if volunteer linkers copy/paste from the URL line it will usually have the /.
bobmutch
12-25-2004, 12:43 AM
orion: Very interesting info. Now how would you apply this to the subject at hand?
orion
12-25-2004, 11:36 AM
My pleasure, bob.
The originator of this thread asked several different questions. The first one was
"Is a trailing / on a directory seen as a different file by Google?"
File recognition with large-scale search architectures is a non-trivial task, as the system needs to put in place url-to-document maps via page identifiers.
With WebBase and similar systems, these ids are constructed from the text strings in the directory associated to each file. Whether the slash is in a URL does not matter for file identification purposes since during normalization these are removed anyway. The same would happen with other special characters. They do not contribute to the checksum.
That’s why the researchers assert
“For example, http://www.stanford.edu:80/ and http://www.stanford.edu both represent the same web page but would give rise to different signatures. To avoid this problem, we first normalize the URL string and derive a canonical representation. We then compute the page identifier as a signature of this canonical representation.”
Removal of all trailing slashes ("/") ensures that http://www.stanford.edu:80/ and http://www.stanford.edu, which represent the same file, are mapped to the same page id signature. When scoring documents, there is no reason for these two to have different scores.
In my view, a large-scale search engine that uses URL normalization to assign a unique page id but then still scores the files differently has a huge architectural flaw.
URL normalization is a non-trivial task within a search architecture. Indeed, poor URL normalization or the lack of it is one of several things that can be used to
a. spam a system
b. distinguish between true search engines and oversized site search tools (e.g., javascript "search engines") and search-in-a-CD tools marketed as “search engines”.
I hope this helps.
For additional details on WebBase, see this thread http://forums.searchenginewatch.com/showthread.php?p=28630#post28630
Orion
bobmutch
12-25-2004, 01:01 PM
orion: OK, I got all that on the first round. I realize that Apache and IIS see example.com/?, example.com/, example.com, and even example.com:80 and example.com:80/ as all the same thing. The post was good information, as I didn't know the process for stripping.
When I asked you to apply that to the subject at hand I wasn't looking for further explanation; I was wondering how your post relates to the question at hand.
The subject at hand is whether Google sees the trailing / as different. We know it sees the query string as a different file (example.com/ vs. example.com/?), and we know it sees www*.example.com and example.com as different files. Personally I don't think Google sees example.com and example.com/ as different files, but I wanted someone in the know to clearly prove that, as some here on SEW have said otherwise.
Hence the examples at the beginning of the post:
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
If you have comments on that, please post them too. And don't get me wrong, your other info was very good.
Also, while we are at it, what about www*example.com and example.com? What is the process for dealing with the host name in this process?
"Normalization: A URL string is normalized by executing the following steps:
a. Removal of the protocol prefix (http://) if present
b. Removal of a :80 port number specification if present (However, non-standard port number specifications are retained)
c. Conversion of the server name to lower case
d. Removal of all trailing slashes ("/")"
I don't see anything in the above that deals with it.
orion
12-25-2004, 11:35 PM
Hi, bob.
I hope this helps.
"We know it sees the query string as a different file example.com/ example.com/?, we know it sees www*.example.com and example.com as different files."
Text strings for constructing checksums and text strings for queries do not necessarily go hand in hand. Indeed, text string searches and checksum searches involve different resources and are of a different nature.
"Personally I don't think Google sees example.com and example.com/ as different files..."
I don’t think so, either. Do they score both cases differently? They should not. Are they actually scoring the two case scenarios differently? Now that’s a different question. To answer this I would need to conduct controlled experiments or to examine any valid evidence and try to replicate the cases.
"Hence the examples at the beginning of the post:
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
If you have comments on that, please post them too."
These cases need to be examined a bit closer. This should include time-based parameters.
I still view a search engine that scores a document differently because it belongs to two different urls that were already normalized to the same page id checksum as an architectural failure (others could view this as a gaming opportunity).
I don’t see any valid reason, from the relevancy standpoint, to score the two cases above differently.
I would accept a few instances due to collisions, but it will be hard to invoke a “collision probability” defense if there is a pattern.
"Also, while we are at it, what about www*example.com and example.com? What is the process for dealing with the host name in this process?"
URL normalization conveys the idea of conforming urls to a common format. Each system has its recipe. In WebBase, all trailing slashes are removed. I remember that the old AltaVista implementation consisted of adding trailing slashes at the end of urls, so as to conform to the generic format http://www.domain.com/
Normalization to a common format is a pretty straightforward regexp find-replace task. If you need a particular regexp for doing this I can PM you examples.
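As a rough illustration of the find-replace idea, here are two regular expressions in Python, one for each convention mentioned: stripping trailing slashes (the WebBase recipe) and appending a slash to a bare host (the AltaVista behaviour as recalled above, which I have not verified). The patterns and function names are my own sketch, not orion's.

```python
import re


def strip_trailing_slashes(url: str) -> str:
    """WebBase-style: remove every slash at the end of the string."""
    return re.sub(r"/+$", "", url)


def add_root_slash(url: str) -> str:
    """Append '/' only when the URL is a bare http://host with no path."""
    return re.sub(r"^(http://[^/]+)$", r"\1/", url)


if __name__ == "__main__":
    print(strip_trailing_slashes("http://www.domain.com/dir/"))
    print(add_root_slash("http://www.domain.com"))
```

Either convention works for mapping variants to one identifier, as long as the same one is applied to every URL before hashing.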
Orion
Chris_D
12-28-2004, 09:45 AM
Bobmutch,
Here are GoogleGuy's comments on the trailing slash and the 301 redirect you avoid:
http://www.webmasterworld.com/forum3/15894.htm
Hope that helps
Chris_D
bobmutch
12-28-2004, 12:33 PM
Chris_D: Thanks:
GG quote from WMW listed above:
"I've got a few minutes free, so let's go into detective mode for a bit. Most webservers are configured to append the "/" automatically via a 301 redirect. For example, if you try to fetch www*google.com/webmasters , our web server will do a permanent 301 redirect to the canonical page, which is www*google.com/webmasters/ (note the trailing slash). "
That thread is pretty old so I am not going to revive it at WMW. Is GG saying that most webservers are configured out of the box to append a / via a 301 redirect, or that people are configuring their servers that way?
If it is not an out-of-the-box configuration, is there some code to stick in your .htaccess or httpd.conf to make those changes?
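One way to find out how a particular server behaves is to request the slashless URL and look at the raw response without following redirects. The sketch below uses Python's standard http.client for that; the helper names are hypothetical, and the check only encodes the behaviour GG describes (a 301 whose Location is the same path plus a slash).

```python
import http.client
from urllib.parse import urlsplit


def fetch_status_and_location(host: str, path: str) -> tuple[int, str]:
    """HEAD request over plain HTTP; http.client does not follow redirects."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location", "")
    finally:
        conn.close()


def is_trailing_slash_redirect(requested: str, status: int, location: str) -> bool:
    """True if the response is a 301 to the same URL with '/' appended."""
    if status != 301:
        return False
    req, loc = urlsplit(requested), urlsplit(location)
    # A relative Location (no host) is treated as pointing at the same host.
    same_host = not loc.netloc or loc.netloc.lower() == req.netloc.lower()
    return same_host and not req.path.endswith("/") and loc.path == req.path + "/"


if __name__ == "__main__":
    status, location = fetch_status_and_location("www.example.com", "/dir")
    print(status, location,
          is_trailing_slash_redirect("http://www.example.com/dir", status, location))
```

Running it against a directory on your own server shows whether the slash is being appended by a 301, by some other status code, or not at all.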
pageoneresults
12-28-2004, 03:07 PM
Great topic and some very interesting discussion.
Yes, URLs with and without a trailing forward slash are treated as two different files unless otherwise specified at the server level. Now, bear with me as I can really only explain this in layman's terms...
Content Negotiation
The W3C and other large website structures are now utilizing content negotiation. That means that this...
www.example.com/sub
...could be different than this...
www.example.com/sub/
With the use of content negotiation, there are no file extensions. Basically you are cleaning the URI of all underlying identifying technologies.
example.com is the root domain. www.example.com is a sub-domain of the root and both should be treated as different locations. Unless of course you've implemented a 301 to redirect the root to the sub-domain of www.
When doing rewrites on IIS, we always force the trailing forward slash. The server seems to interpret things differently when using an ini file on IIS; it doesn't automatically append the trailing forward slash. We don't mind, as it forces us to produce a rewrite that is flawless in its function. We'll typically back our way through a URI and make sure that the proper server headers and content are being presented depending on where you are in the string. That part gets deep for me! ;)
So yes, these are all different locations based on the specifications...
example.com
www.example.com
www.example.com/name
www.example.com/name/
orion
12-28-2004, 11:57 PM
ISAPI and Apache url rewrite works fine for making clean urls, but is not the same as url normalization (from the search engine side).
After normalizing the urls the system should assign a unique checksum to each of these in order to map urls to documents. Thus, from the search engine side the common urls (with/without slashes) should map to the same page identifier. So far that's how the IR implementations I'm familiar with work.
Orion
rustybrick
12-29-2004, 04:59 PM
Thought this might be of interest to this thread. I was doing some searching on client sites at Yahoo and I noticed a problem with the way Yahoo was handling the normalization of the URLs. I posted a real life example and put the thread in the Yahoo Web Search forum with the title Trailing Slash Issues - Normalizing URLs the Wrong Way (http://forums.searchenginewatch.com/showthread.php?t=3530).
ISO9000 Guy
12-30-2004, 12:08 PM
There is also the aspect of:
www.example.com (http://www.example.com)
vs.
example.com
These also will give different results.
EDIT: Oops - I see it was noted about 3 posts up.
bobmutch
01-01-2005, 10:51 PM
Another 3 examples:
http://services.google.com/ads_inquiry/en PR10
http://services.google.com/ads_inquiry/en/ PR0
http://services.google.com/appliance/request_info/site PR10
http://services.google.com/appliance/request_info/site/ PR3
http://www.google.com/support/toolbar PR5
http://www.google.com/support/toolbar/ PR0
JohnGalt
01-12-2005, 05:25 PM
What do you think will happen (will the page get spidered at all) if it has this type of URL parameter?
www.site.com/index.html?paramter1=alskdjf/lasdkjf/laksdjf/&paramter2=lakfjs/lasjdf/lajfds/lksdjf
Dave Hawley
01-12-2005, 11:48 PM
Just to throw a spanner in the works (http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=GGLD,GGLD:2004-44,GGLD:en&q=%22Our+sales+team+can+help+optimize+AdWords+for+ your+business%22)
seomike
01-13-2005, 02:43 PM
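RewriteEngine On
RewriteBase /
# Redirect URLs with a trailing slash to the slashless form:
# www.domain.com/root/ -> www.domain.com/root
# www.domain.com/root/sub/ -> www.domain.com/root/sub
# Only act on requests that end in / (the bare root / is left alone).
RewriteCond %{REQUEST_URI} ^/.*/$
# The slash sits outside the group so $1 excludes it; a greedy (.*)/? would keep it.
RewriteRule ^(.*)/$ /$1 [R=301,L]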
This will detect whether the request has a trailing /. If it does, it will 301 redirect to the normalized url.
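A capture pattern for a rule like this can be checked in a regex engine before it goes live. The Python sketch below (my own illustration, using Python's re module, which behaves like mod_rewrite's regex engine for patterns this simple) compares a greedy group followed by an optional slash with a pattern that requires the trailing slash outside the group.

```python
import re

# Greedy group followed by an optional slash: the group swallows the slash.
GREEDY = re.compile(r"^(.*)/?$")
# Required slash outside the group: the group stops before it.
EXPLICIT = re.compile(r"^(.*)/$")


def captured(pattern: re.Pattern, path: str):
    """Return what $1 would hold for this path, or None if there is no match."""
    match = pattern.match(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name, pattern in (("greedy", GREEDY), ("explicit", EXPLICIT)):
        print(name, repr(captured(pattern, "root/sub/")))
```

If the captured text still ends in a slash, the redirect target is the same URL that was requested, and the browser loops.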
If you have page scores in Google with URLs that are not normalized, you can kiss those rankings goodbye if you implement this, since all users/spiders/bots will be redirected to the normalized urls.
Even with the current knowledge we have that Google will transfer page scores if the redirect is a 301, I wouldn't do it. There will be a switch-over time where the site will need to be crawled, scores passed, old urls made supplemental and new urls indexed in their place. I'd say you could see a drop in rankings for anywhere from 3-6 months, and longer if you screw something up, like not changing the <a href> links on your site :)
I say keep your urls as is and just make your <a href> links uniform for all projects/sites ahead of time. If you are going to absolute link using trailing /'s, then do it for all urls. If you are going to have normalized urls in your <a href> tags, then do it for all, and so on.
Note that Yahoo normalizes all urls its spider brings back no matter how you do your <a href> linking. So if you have a mod rewrite, make sure you add a /?$ to the end of all your rewrite rules so that your system picks that up and doesn't send all Yahoo searchers to a 404 page.
example:
RewriteRule ^([^/]+)/?$ /$1 [L]
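To see what the optional /?$ buys you, the pattern from the example rule can be run against both forms of a URL. This is a small Python sketch of my own; the rewrite_target helper is hypothetical and only mimics how $1 would be substituted.

```python
import re

# Same pattern as the example rule: one path segment, optional trailing slash.
RULE = re.compile(r"^([^/]+)/?$")


def rewrite_target(path: str):
    """Return the substituted target '/$1', or None when the rule does not match."""
    match = RULE.match(path)
    return "/" + match.group(1) if match else None


if __name__ == "__main__":
    for path in ("page", "page/", "dir/page"):
        print(repr(path), "->", rewrite_target(path))
```

Both "page" and "page/" reach the same target, so a spider that strips or adds the slash still lands on the page; a deeper path such as "dir/page" is left for other rules.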
Make sure your link partners or directory listings are uniform.
Don't allow link partners to link like this
http://yoursite.com/
http://yoursite.com
http://www.yoursite.com/
http://www.yoursite.com
http://www.yoursite.com/index.ext
Pick 1 way and stick with it FOREVER. Otherwise you aren't doing anything good for your site as far as link pop goes.
bobmutch
01-14-2005, 12:38 AM
seomike: I agree that it makes no sense to 301 URLs with non-trailing /'s into URLs with trailing /'s. Personally I don't see that there is any difference between www*example.com/sub and www*example.com/sub/ when it comes to the PR the page has. Nor do I think that Google sees them as different pages.
The examples in this thread that show different PR are, I am guessing, produced by monkeying with mod_rewrite and are not the norm.
ThouShaltSeo
01-14-2005, 12:47 AM
I tend to agree. Google "finds" a 301 redirect to /directory/ when it hits /directory at least on my sites