#1
Is A Trailing / On A Directory Seen As A Different File By Google?
I would like to open this subject up for discussion. Is a trailing / on a directory seen as a different file by Google? Do Apache or IIS see them as different files? Can you 301 an example.com/dir into an example.com/dir/ ? Please post examples of URLs with and without a trailing / that have different PR.
Here is a sample I got from the "ranking folders instead of pages" thread:
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
#2
I believe so (that they're separate).
Also, if you link to domain-com/file.html?track it will be seen as a different page from domain-com/file.html and, who knows, possibly trip a dupe filter since it's the exact same content.
#3
ThouShaltSeo: Yes, I understand that a query string ? on the end of a file will produce an additional file as far as Google is concerned, but not as far as Apache or IIS are concerned. I just finished fixing a site that had a directory/? in the source on about 6 pages. Each one of the directory/? pages had a PR7. Pretty big bleed.
Same thing with example.com/dir/ and www*example.com/dir/ . They are seen as 2 different locations by Google, but the nice thing is they are seen as two different locations by IIS and Apache also, so you just 301 one into the other.
#4
Can you please tell me how you fixed the /? thing? It can't be 301'd.
Thanks in advance.
#5
ThouShaltSeo: In the site I just fixed, I just changed the code in a JavaScript function that put a /directory/? in the source of each offending page so it no longer added the ?.
If you are dealing with a dynamic site where you have example.com/inventory.cfm?id=777, then you just use a replace() function that changes the ? and = to a / in your source. Then add the SafeSpiderURL DLL to IIS if your hosting is on a Win32 box, or, if your hosting is on Apache, use mod_rewrite to convert the changed form in the source back to what you want the server to read.
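To make the replace() idea concrete, here is a rough Python sketch of both directions: the link rewriting done in the page source, and the reverse mapping that a server-side rewrite rule would have to perform. The function names and the pattern are made up for illustration, and it assumes a single key=value pair on a .cfm script as in the inventory.cfm example; it is not the SafeSpiderURL or mod_rewrite syntax itself.

```python
import re


def to_path_form(url: str) -> str:
    """Replace the '?' and '=' of a one-parameter query string with '/'."""
    return url.replace("?", "/").replace("=", "/")


# Assumed shape of a rewritten URL: script name, then one key and one value
# as two extra path segments (hypothetical convention for this sketch).
_PATH_FORM = re.compile(r"^(?P<script>.+\.cfm)/(?P<key>[^/]+)/(?P<value>[^/]+)$")


def to_query_form(url: str) -> str:
    """Undo to_path_form, as a server-side rewrite rule would have to."""
    match = _PATH_FORM.match(url)
    if match is None:
        return url  # not in the rewritten form, leave it untouched
    return f"{match['script']}?{match['key']}={match['value']}"


if __name__ == "__main__":
    clean = to_path_form("example.com/inventory.cfm?id=777")
    print(clean)
    print(to_query_form(clean))
```

With more than one parameter the pattern would need extending, which is where a real rewrite rule gets more involved.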
#6
Has anyone else come up with an example of a trailing / being seen as a different file by the search engines, as in this case?
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
I have been looking and haven't seen anything. I am wondering now if this is not just a case of mod_rewrite?
#7
It's not as completely clear-cut, to me, as some suggest. There are examples both supporting and contradicting this: some where the page (folder) with the trailing / was considered to be a different page with different PR and BLs, and other examples where Google has figured it out and treated the pages as if they were in fact the same page regardless of the /. IMO consistency is the only thing worth worrying about.
#8
JohnW: "IMO consistency is the only thing worth worrying about." Right, sounds good to me. But if people are giving you links with no / on them and you are doing links with /'s, then you could end up with 2 different files according to the search engines, IF they are considered two different files. Now I am quite sure this is not the case. Personally I think the examples in this thread are caused by incorrect mod_rewrite code.
I have noticed that all the directories, or most of them, use a / on ALL their entries, and I have been doing that myself for some time.
#9
I hope this helps.
URLs need to be normalized and then hashed to form a unique page identifier. Hashing is done to avoid collisions when documents are mapped to unique page ids. The procedure is pretty much straightforward. It is also described in "WebBase: A repository of web pages", where Hector Garcia-Molina's group writes:
"2.2 Page identifier
Since a web page is the fundamental logical unit being managed by the repository, it is important to have a well-defined mechanism that all modules can use to uniquely refer to a specific page. In the WebBase system, a page identifier is constructed by computing a signature (e.g., checksum or cyclic redundancy check) of the URL associated with that page. However, a given URL can have multiple text string representations. For example, http://www.stanford.edu:80/ and http://www.stanford.edu both represent the same web page but would give rise to different signatures. To avoid this problem, we first normalize the URL string and derive a canonical representation. We then compute the page identifier as a signature of this canonical representation. The details are as follows:
Normalization: A URL string is normalized by executing the following steps:
1. Removal of the protocol prefix (http://) if present
2. Removal of a :80 port number specification if present (However, non-standard port number specifications are retained)
3. Conversion of the server name to lower case
4. Removal of all trailing slashes ("/")
The resulting text string is hashed using a signature computation to yield a 64-bit page identifier. The use of a hashing function implies that there is a non-zero collision probability. Nevertheless, a good hash function along with a large space of hashes makes this a very unlikely occurrence. For example, with 64 bit identifiers and 100 million pages in the repository, the probability of collision is 0.0003. That is, 3 out of 10,000 repositories would have a collision. With 128 bit identifiers and a 10 billion page collection, the probability of collision is 10^-18. See [CGM98] for more discussion and a derivation of a general formula for estimating collisions."
In a general sense, the idea of this procedure is to recognize that some URL text strings should have the same signature. An example is given below:
http://www.doc1.com:80
http://www.doc1.com:80/
http://www.doc1.com
http://www.doc1.com/
After the treatment they all should have the same page id. Since text affects the checksums, different directory paths leading to the same document content should produce different page ids:
http://www.doc1.com/aaa/docA.hml
http://www.doc1.com/bbb/docA.hml
Whether this is detected or not depends on the "spam patrol" (human or algorithmic) of the target system.
Orion
Last edited by orion : 12-24-2004 at 05:25 PM. Reason: Adding Some New Lines
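For anyone who wants to try the four normalization steps quoted above, a minimal Python sketch follows. It is only an illustration: the function names are invented here, the 64-bit id is taken from the first 8 bytes of a SHA-256 digest as a stand-in for whatever signature WebBase really computes, and the collision estimate uses the standard birthday approximation n^2 / 2^(bits+1), which may not be the exact formula derived in [CGM98].

```python
import hashlib


def normalize_url(url: str) -> str:
    """Apply the four normalization steps from the quoted WebBase passage."""
    # Step 1: remove the protocol prefix (http://) if present.
    if url.lower().startswith("http://"):
        url = url[len("http://"):]
    # Split the server part (host and optional port) from the rest of the path.
    server, sep, rest = url.partition("/")
    # Step 2: remove a :80 port; non-standard ports are retained.
    if server.endswith(":80"):
        server = server[: -len(":80")]
    # Step 3: convert the server name to lower case.
    server = server.lower()
    # Step 4: remove all trailing slashes.
    return (server + sep + rest).rstrip("/")


def page_id(url: str) -> int:
    """64-bit page identifier (assumption: truncated SHA-256 as the signature)."""
    digest = hashlib.sha256(normalize_url(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def collision_probability(pages: int, bits: int) -> float:
    """Birthday approximation for at least one collision among `pages` ids."""
    return pages * pages / 2 ** (bits + 1)


if __name__ == "__main__":
    same = ["http://www.doc1.com:80", "http://www.doc1.com:80/",
            "http://www.doc1.com", "http://www.doc1.com/"]
    print({page_id(u) for u in same})  # a single id for all four strings
    print(page_id("http://www.doc1.com/aaa/docA.hml")
          == page_id("http://www.doc1.com/bbb/docA.hml"))
    print(collision_probability(100_000_000, 64))
```

Under these assumptions the four doc1.com strings collapse to one id, the /aaa/ and /bbb/ paths stay distinct, and the 64-bit, 100-million-page estimate comes out near the 0.0003 figure in the quote.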
#10
Bob, hopefully you have some control over how pages link to you ;-) so when it is possible, you do what you can. I like using the / because it saves the web server processing time, and if volunteer linkers copy/paste from the URL line it will usually have the /.
#11
orion: Very interesting info. Now how would you apply this to the subject at hand?
#12
My pleasure, bob.
The originator of this thread asked several different questions. The first one was whether a trailing / on a directory is seen as a different file by Google.
With WebBase and similar systems, these ids are constructed from the text string of the URL associated with each file. Whether the slash is in a URL does not matter for file identification purposes, since trailing slashes are removed during normalization anyway. The same would happen with other special characters. They do not contribute to the checksum. That's why the researchers assert: "For example, http://www.stanford.edu:80/ and http://www.stanford.edu both represent the same web page but would give rise to different signatures. To avoid this problem, we first normalize the URL string and derive a canonical representation. We then compute the page identifier as a signature of this canonical representation."
Removal of all trailing slashes ("/") ensures that http://www.stanford.edu:80/ and http://www.stanford.edu, which represent the same file, are mapped to the same page id signature. When scoring documents, there is no reason for these two to have different scores. In my view, a large-scale search engine that uses URL normalization to assign a unique page id but then still scores the files differently has a huge architectural flaw.
URL normalization is a non-trivial task within a search architecture. Indeed, poor URL normalization or the lack of it is one of several things that can be used to
a. spam a system
b. distinguish between true search engines and oversized site search tools (e.g., javascript "search engines") and search-in-a-CD tools marketed as "search engines".
I hope this helps. For additional details on WebBase, see this thread: http://forums.searchenginewatch.com/...8630#post28630
Orion
Last edited by orion : 12-25-2004 at 11:39 AM.
#13
orion: OK, I got all that on the first round. I realize that Apache and IIS see example.com/?, example.com, example.com/ and even example.com:80 and example.com:80/ as all the same thing. The post was good information, as I didn't know the process for stripping.
When I asked you to apply that to the subject at hand I wasn't looking for further explanation; I was wondering how your post relates to the question at hand. The subject at hand is: does Google see the trailing / as different? We know it sees the query string as a different file (example.com/ vs. example.com/?), and we know it sees www*.example.com and example.com as different files. Personally I don't think Google sees example.com and example.com/ as different files, but I wanted someone in the know to clearly prove that, as we have had some here on SEW say differently. Hence the examples at the beginning of the thread:
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
If you have comments on that, please post them too. And don't get me wrong, your other info was very good. Also, while we are at it, what about www*example.com and example.com? What is the process for dealing with the host name in normalization?
#14
Hi, bob.
I hope this helps.
I still view as an architectural failure (others could view this as a gaming opportunity) a search engine that scores a document differently because it belongs to two different URLs that were already normalized to the same page id checksum. I don't see any valid reason, from the relevancy standpoint, to score the two cases above differently. I would accept a few instances due to collisions, but it will be hard to invoke a "collision probability" defense if there is a pattern.
Normalization to a common format is pretty much a straightforward regexp find-replace task. If you need a particular regexp for doing this I can PM you examples.
Orion
Last edited by orion : 12-27-2004 at 10:00 PM. Reason: typos
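As a rough idea of what such a find-replace could look like for the host name question, here is a small Python sketch. It is not one of Orion's regexps and not something any search engine is known to use: the function name is invented, and it assumes the convention that a leading "www." is simply dropped so that www.example.com and example.com end up as the same string. Whether that is safe depends on the site, since the two hosts can legitimately serve different content.

```python
import re

# Optional scheme, then a leading "www." at the very start of the string.
_WWW_PREFIX = re.compile(r"^(https?://)?www\.", re.IGNORECASE)


def strip_www(url: str) -> str:
    """Drop a leading 'www.' from the host, keeping the scheme if one is given."""
    # Group 1 is the scheme and may be absent, hence the `or ""`.
    return _WWW_PREFIX.sub(lambda m: m.group(1) or "", url, count=1)


if __name__ == "__main__":
    for candidate in ("http://www.example.com/dir/", "www.example.com", "example.com"):
        print(candidate, "->", strip_www(candidate))
```

The pattern is anchored at the start of the string, so a "www." that appears later in the path is left alone.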
#15
Bobmutch,
Here are GoogleGuy's comments on the trailing slash and the 301 redirect you avoid: http://www.webmasterworld.com/forum3/15894.htm
Hope that helps,
Chris_D
#16
Chris_D: Thanks.
GG quote from the WMW thread listed above: "I've got a few minutes free, so let's go into detective mode for a bit. Most webservers are configured to append the "/" automatically via a 301 redirect. For example, if you try to fetch www*google.com/webmasters , our web server will do a permanent 301 redirect to the canonical page, which is www*google.com/webmasters/ (note the trailing slash)."
That thread is pretty old so I am not going to revive it at WMW. Is GG saying here that most web servers are configured out of the box to append a / via a 301 redirect, or that people are configuring their servers that way? If it is not an out-of-the-box configuration, is there some code to stick in your .htaccess or httpd.conf to make those changes?
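The behaviour GG describes boils down to a small decision, sketched here in Python rather than as server configuration. This is only an illustration of the logic, not .htaccess or httpd.conf code: the function name is invented, and the `directories` set is a hypothetical stand-in for the server checking the filesystem. On Apache, as far as I know, this redirect comes from mod_dir's DirectorySlash setting, which is on by default.

```python
from typing import Optional, Set, Tuple


def trailing_slash_redirect(path: str, directories: Set[str]) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) if `path` names a directory without its slash."""
    # `directories` holds the known directory paths without the trailing slash.
    if not path.endswith("/") and path in directories:
        return 301, path + "/"
    return None  # serve the request as-is


if __name__ == "__main__":
    known_dirs = {"/webmasters"}
    print(trailing_slash_redirect("/webmasters", known_dirs))
    print(trailing_slash_redirect("/webmasters/", known_dirs))
    print(trailing_slash_redirect("/file.html", known_dirs))
```

Only the slashless directory request gets a permanent redirect; the canonical URL and ordinary files are served directly, so a spider that follows the 301 ends up with a single URL for the directory.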
#17
Great topic and some very interesting discussion.
Yes, URLs with and without a trailing forward slash are treated as two different files unless otherwise specified at the server level. Now, bear with me as I can really only explain this in layman's terms...
Content Negotiation
The W3C and other large websites are now utilizing content negotiation. That means that this...
www.example.com/sub
...could be different than this...
www.example.com/sub/
With the use of content negotiation, there are no file extensions. Basically you are cleaning the URI of all underlying identifying technologies.
example.com is the root domain. www.example.com is a sub-domain of the root, and both should be treated as different locations, unless of course you've implemented a 301 to redirect the root to the www sub-domain.
When doing rewrites on IIS, we always force the trailing forward slash, as the server seems to interpret things differently when using an ini file on IIS: it doesn't automatically append the trailing forward slash. We don't mind, as it forces us to produce a rewrite that is flawless in its function. We'll typically back our way through a URI and make sure that the proper server headers and content are being presented depending on where you are in the string. That part gets deep for me!
So yes, these are all different locations based on the specifications...
example.com
www.example.com
www.example.com/name
www.example.com/name/
Last edited by pageoneresults : 12-28-2004 at 03:12 PM.
#18
ISAPI and Apache URL rewriting works fine for making clean URLs, but it is not the same as URL normalization (from the search engine side).
After normalizing the URLs, the system should assign a unique checksum to each of them in order to map URLs to documents. Thus, from the search engine side, the common URLs (with/without slashes) should map to the same page identifier. So far that's how the IR implementations I'm familiar with work.
Orion
Last edited by orion : 12-29-2004 at 12:10 AM. Reason: typos
#19
Thought this might be of interest to this thread. I was doing some searching on client sites at Yahoo and I noticed a problem with the way Yahoo was handling the normalization of URLs. I posted a real-life example in a thread in the Yahoo Web Search forum titled "Trailing Slash Issues - Normalizing URLs the Wrong Way".
#20
There is also the aspect of:
www.example.com vs. example.com
These also will give different results.
EDIT: Oops - I see it was noted about 3 posts up.
Last edited by ISO9000 Guy : 12-30-2004 at 12:15 PM.