Search Engine Watch
Old 12-06-2004   #1
bobmutch
seocomapny.ca|Project Support Open Source|Top 40 Dirs rated by Inbound Link Quality
 
Join Date: Aug 2004
Location: london.on.ca
Posts: 575
Is A Trailing / On A Directory Seen As A Different File By Google?

I would like to open this subject up for discussion. Is a trailing / on a directory seen as a different file by Google? Does Apache or IIS see them as different files? Can you 301 example.com/dir to example.com/dir/ ? Please post examples of URLs with and without a trailing / that have different PR.

Here is a sample I got from the "ranking folders instead of pages" thread.

http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0
Old 12-06-2004   #2
ThouShaltSeo
Member
 
Join Date: Dec 2004
Posts: 206
I believe so (that they're separate).
Also, if you link to domain-com/file.html?track it will be seen as a different page from domain-com/file.html and, who knows, it may trip a dupe filter since it's the exact same content.

Old 12-06-2004   #3
bobmutch
ThouShaltSeo: Yes, I understand that a query string ? on the end of a file will produce an additional file as far as Google is concerned, but not as far as Apache or IIS are concerned. I just finished fixing a site that had a directory/? in the source on about 6 pages. Each one of the directory/? pages had a PR7. Pretty big bleed.

Same thing with example.com/dir/ and www*example.com/dir/ . They are seen as 2 different locations by Google, but the nice thing is that they are seen as two different locations by IIS and Apache also, so you just 301 one into the other.
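
To make the "301 one into the other" idea concrete, here is a minimal Python sketch of the decision a server-side rule would make. It is only an illustration: the choice of www.example.com as the preferred host and the function name canonical_redirect are assumptions, not anything a particular server ships with.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: the www form is the preferred (canonical) host.
CANONICAL_HOST = "www.example.com"
BARE_HOST = "example.com"


def canonical_redirect(url):
    """Return (301, location) when the bare host is requested, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname drops any port
    if host != BARE_HOST:
        return None  # already canonical, or not our site
    target = urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
    )
    return 301, target


print(canonical_redirect("http://example.com/dir/"))
print(canonical_redirect("http://www.example.com/dir/"))
```

A real setup would express the same rule in the server's own rewrite or redirect configuration; the point is only that every request for the non-preferred host gets a permanent redirect to the same path on the preferred one.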
Old 12-06-2004   #4
ThouShaltSeo
Can you please tell me how you fixed the /? thing? It can't be 301'd.
Thanks in advance,

Quote:
Originally Posted by bobmutch
I just finished fixing a site that had a directory/? in the source on about 6 pages. Each one of the directory/? pages had a PR7. Pretty big bleed.
Old 12-06-2004   #5
bobmutch
ThouShaltSeo: On the site I just fixed, I changed the code in a JavaScript function that put a /directory/? in the source of each offending page so that it no longer added the ?.

If you are dealing with a dynamic site where you have example.com/inventory.cfm?id=777, then you just use a replace() function that changes the ? and = to a / in your source. Then, if your hosting is on a Win32 box, add the SafeSpiderURL DLL to IIS; if your hosting is on Apache, use mod_rewrite to convert the changed form in the source back to what you want the server to read.
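
As a rough illustration of the two halves of that approach, here is a small Python sketch. The function names and the single key/value pattern are assumptions made for the example; the second function only mimics what a server-side rewrite rule would do and is not mod_rewrite or SafeSpiderURL itself.

```python
import re


def to_spider_safe(url):
    """Publishing side: replace '?' and '=' with '/' in links written to the page source."""
    return url.replace("?", "/").replace("=", "/")


# Assumption: one script name followed by exactly one key/value pair.
_REWRITTEN = re.compile(r"^(?P<script>/[^/]+\.cfm)/(?P<key>[^/]+)/(?P<value>[^/]+)$")


def to_real_path(path):
    """Server side: map the slash form back to the query-string form the application expects."""
    match = _REWRITTEN.match(path)
    if match is None:
        return path  # not a rewritten URL, leave it alone
    return f"{match['script']}?{match['key']}={match['value']}"


print(to_spider_safe("/inventory.cfm?id=777"))
print(to_real_path("/inventory.cfm/id/777"))
```

The links in the page source carry the slash form, and the server quietly translates it back, so neither visitors nor spiders ever see the query string.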
Old 12-24-2004   #6
bobmutch
Has anyone else come up with an example of how a trailing / can be seen as a different file by the search engines, as in this case?

http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0

I have been looking and haven't seen anything. I am now wondering whether this is just a case of mod_rewrite.
Old 12-24-2004   #7
JohnW
 
Join Date: Jun 2004
Location: Virginia Beach, VA.
Posts: 976
It's not as completely clear-cut, to me, as some suggest. There are examples both supporting and contradicting this: some where the page (folder) with the trailing / was considered to be a different page with different PR and BLs, and others where Google has figured it out and treated the pages as the same page regardless of the /. IMO consistency is the only thing worth worrying about.
Old 12-24-2004   #8
bobmutch
JohnW: "IMO consistency is the only thing worth worrying about." Right, sounds good to me. But if people are giving you links with no / on them and you are building links with /'s, then you could end up with 2 different files according to the search engines, IF they are considered two different files. Now, I am quite sure this is not the case. Personally, I think the examples in this thread are caused by incorrect mod_rewrite code.

I have noticed that all the directories, or most of them, use / on ALL their entries, and I have been doing that myself for some time.
Old 12-24-2004   #9
orion
 
Join Date: Jun 2004
Posts: 1,044
Unique page ids-urls

I hope this helps.

URLs need to be normalized and then hashed to form a unique page identifier. Hashing is done to avoid collisions when documents are mapped to unique page ids.

The procedure is pretty straightforward. It is also described in "WebBase: A Repository of Web Pages".

Hector Garcia-Molina’s group writes:

“2.2 Page identifier”

"Since a web page is the fundamental logical unit being managed by the repository, it is important to have a well-defined mechanism that all modules can use to uniquely refer to a specific page. In the WebBase system, a page identifier is constructed by computing a signature (e.g., checksum or cyclic redundancy check) of the URL associated with that page. However, a given URL can have multiple text string representations. For example, http://www.stanford.edu:80/ and http://www.stanford.edu both represent the same web page but would give rise to different signatures. To avoid this problem, we first normalize the URL string and derive a canonical representation. We then compute the page identifier as a signature of this canonical representation.

The details are as follows:

Normalization: A URL string is normalized by executing the following steps:
Removal of the protocol prefix (http://) if present
Removal of a :80 port number specification if present (However, non-standard port number specifications are retained)
Conversion of the server name to lower case
Removal of all trailing slashes ("/")

The resulting text string is hashed using a signature computation to yield a 64-bit page identifier.

The use of a hashing function implies that there is a non-zero collision probability. Nevertheless, a good hash function along with a large space of hashes makes this a very unlikely occurrence. For example, with 64 bit identifiers and 100 million pages in the repository, the probability of collision is 0.0003. That is, 3 out of 10,000 repositories would have a collision. With 128 bit identifiers and a 10 billion page collection, the probability of collision is 10^-18. See [CGM98] for more discussion and a derivation of a general formula for estimating collisions."
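
The collision figures in the quote can be sanity-checked with the usual birthday-problem approximation, p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)). This is a sketch under the assumption that this standard approximation is close to what the authors used; the exact formula in [CGM98] may differ, and the function name is made up for the example.

```python
import math


def collision_probability(num_pages, id_bits):
    """Birthday-bound estimate of at least one collision among num_pages random ids."""
    space = 2 ** id_bits
    exponent = num_pages * (num_pages - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), accurate for tiny values


p64 = collision_probability(100_000_000, 64)       # 100 million pages, 64-bit ids
p128 = collision_probability(10_000_000_000, 128)  # 10 billion pages, 128-bit ids
print(f"64-bit:  {p64:.4f}")
print(f"128-bit: {p128:.2e}")
```

For the 64-bit case this rounds to the quoted 0.0003. For the 128-bit case the approximation comes out somewhat below the quoted 10^-18, so treat that second figure as an order-of-magnitude statement.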

In general, the idea of this procedure is to treat different text string representations of the same URL as having the same signature. An example is given below:

http://www.doc1.com:80
http://www.doc1.com:80/
http://www.doc1.com
http://www.doc1.com/

After the treatment they all should have the same page id.
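
A minimal Python sketch of the four normalization steps, applied to the URLs above, might look like this. The paper does not name the signature function, so truncating SHA-256 to 64 bits is purely an assumption for illustration, as are the function names.

```python
import hashlib


def normalize_url(url):
    """Apply the four WebBase-style normalization steps quoted above."""
    # 1. Remove the protocol prefix (http://) if present.
    if url.lower().startswith("http://"):
        url = url[len("http://"):]
    # Split the server name (with optional port) from the rest of the URL.
    server, slash, rest = url.partition("/")
    # 2. Remove a :80 port specification; non-standard ports are retained.
    if server.endswith(":80"):
        server = server[: -len(":80")]
    # 3. Convert the server name to lower case.
    server = server.lower()
    # 4. Remove all trailing slashes.
    return (server + slash + rest).rstrip("/")


def page_id(url):
    """64-bit page identifier; SHA-256 truncated to 8 bytes is an assumed stand-in."""
    digest = hashlib.sha256(normalize_url(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


for u in ("http://www.doc1.com:80", "http://www.doc1.com:80/",
          "http://www.doc1.com", "http://www.doc1.com/"):
    print(u, "->", normalize_url(u), page_id(u))
```

All four variants print the same normalized string and the same id, while URLs that differ in their path still normalize to different strings and so get different ids.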

Since text affects the checksums, different directory paths leading to the same document content should produce different page ids.

http://www.doc1.com/aaa/docA.hml
http://www.doc1.com/bbb/docA.hml

Whether this is detected or not depends on the "spam patrol" (human or algorithmic) of the target system.

Orion

Last edited by orion : 12-24-2004 at 05:25 PM. Reason: Adding Some New Lines
Old 12-24-2004   #10
JohnW
 
Bob, hopefully you have some control over how pages link to you ;-) so when it is possible, you do what you can. I like using the / because it saves the web server processing time and, if volunteer linkers copy/paste from the URL line, it will usually have the /.
Old 12-25-2004   #11
bobmutch
orion: Very interesting info. Now how would you apply this to the subject at hand?
Old 12-25-2004   #12
orion
 
Url Normalization

My pleasure, bob.

The originator of this thread asked several different questions. The first one was:

Quote:
Is a trailing / on a directory seen as a different file by Google?
File recognition in large-scale search architectures is a nontrivial task, as the system needs to put in place URL-to-document maps via page identifiers.

With WebBase and similar systems, these ids are constructed from the URL text string associated with each file. Whether a trailing slash is in a URL does not matter for file identification purposes, since trailing slashes are removed during normalization anyway. The same would happen with other special characters. They do not contribute to the checksum.

That’s why the researchers assert:

“For example, http://www.stanford.edu:80/ and http://www.stanford.edu both represent the same web page but would give rise to different signatures. To avoid this problem, we first normalize the URL string and derive a canonical representation. We then compute the page identifier as a signature of this canonical representation.”

Removal of all trailing slashes ("/") ensures that http://www.stanford.edu:80/ and http://www.stanford.edu, which represent the same file, are mapped to the same page id signature. When documents are scored, there is no reason for these two to have different scores.

In my view, a large-scale search engine that uses URL normalization to assign a unique page id but then still scores the files differently has a huge architectural flaw.

URL normalization is a non-trivial task within a search architecture. Indeed, poor URL normalization or the lack of it is one of several things that can be used to

a. spam a system
b. distinguish between true search engines and oversized site search tools (e.g., javascript "search engines") and search-in-a-CD tools marketed as “search engines”.


I hope this helps.

For additional details on WebBase, see this thread http://forums.searchenginewatch.com/...8630#post28630

Orion

Last edited by orion : 12-25-2004 at 11:39 AM.
Old 12-25-2004   #13
bobmutch
orion: OK, I got all that on the first round. I realize that Apache and IIS see example.com/?, example.com, example.com/ and even example.com:80 and example.com:80/ as all the same thing. The post was good information, as I didn't know the process for stripping.

When I asked you to apply that to the subject at hand, I wasn't looking for a further explanation; I was wondering how your post relates to the question at hand.

The subject at hand is whether Google sees the trailing / as different. We know it sees the query string as a different file (example.com/ vs. example.com/?), and we know it sees www*.example.com and example.com as different files. Personally, I don't think Google sees example.com and example.com/ as different files, but I wanted someone in the know to clearly prove that, as some here on SEW have said otherwise.

Hence the examples at the beginning of the post:
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0

If you have comments on that, please post them too. And don't get me wrong, your other info was very good.

Also, while we are at it, what about www*example.com and example.com? What is the process for dealing with the host name during normalization?

Quote:
Normalization: A URL string is normalized by executing the following steps:
Removal of the protocol prefix (http://) if present
Removal of a :80 port number specification if present (However, non-standard port number specifications are retained)
Conversion of the server name to lower case
Removal of all trailing slashes ("/")
I don't see anything in the above that deals with it.
Old 12-25-2004   #14
orion
 

Hi, bob.

I hope this helps.

Quote:
We know it sees the query string as a different file example.com/ example.com/?, we know it sees www*.example.com and example.com as different files.
Text strings for constructing checksums and queries do not necessarily go hand in hand. Indeed, text string searches and checksum searches involve different resources and are of a different nature.



Quote:
Personally I don't think Google sees example.com and example.com/ as different files...
I don’t think so, either. Do they score both cases differently? They should not. Are they actually scoring the two case scenarios differently? Now that’s a different question. To answer this I would need to conduct controlled experiments or to examine any valid evidence and try to replicate the cases.


Quote:
Hence the examples at the beginning of the post:
http*://www.avismauritius.com/en/locations/ PR=3
http*://www.avismauritius.com/en/locations PR=0

If you have comments on that I would please post them too.
These cases need to be examined a bit more closely. This should include time-based parameters.

I still view a search engine that scores a document differently because it belongs to two different URLs that were already normalized to the same page id checksum as an architectural failure (others could view this as a gaming opportunity).

I don’t see any valid reason, from the relevancy standpoint, to score the two cases above differently.

I would accept a few instances due to collisions, but it will be hard to invoke a “collision probability” defense if there is a pattern.


Quote:
Also while we are at it what about www*example.com and example.com , what is the process for dealing with the host name in this process.
URL normalization conveys the idea of conforming URLs to a common format. Each system has its own recipe. In WebBase, all trailing slashes are removed. I remember that the old AltaVista implementation consisted of adding trailing slashes at the end of URLs, so as to conform to the generic format http://www.domain.com/

Normalization to a common format is a pretty straightforward regexp find-and-replace task. If you need a particular regexp for doing this, I can PM you examples.
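
For illustration only, here is what such a find-and-replace could look like in Python for the "add a trailing slash" convention. The patterns and the function name are assumptions made for this sketch; they are not the regexps offered above, nor a reconstruction of any search engine's actual rules.

```python
import re

# Drop a default :80 port when it is followed by "/" or the end of the string.
_DEFAULT_PORT = re.compile(r"^(http://[^/]+?):80(?=/|$)", re.IGNORECASE)
# A URL that is only scheme + host (no path at all) gets a trailing slash.
_BARE_HOST = re.compile(r"^(http://[^/]+)$", re.IGNORECASE)


def normalize_with_trailing_slash(url):
    """Conform host-only URLs to the generic format http://www.domain.com/"""
    url = _DEFAULT_PORT.sub(r"\1", url)
    url = _BARE_HOST.sub(r"\1/", url)
    return url


for u in ("http://www.domain.com", "http://www.domain.com:80",
          "http://www.domain.com:80/", "http://www.domain.com/"):
    print(u, "->", normalize_with_trailing_slash(u))
```

Whether slashes are added or stripped matters less than applying one convention to every URL before the checksum is computed.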


Orion

Last edited by orion : 12-27-2004 at 10:00 PM. Reason: typos
Old 12-28-2004   #15
Chris_D
 
Join Date: Jun 2004
Location: Sydney Australia
Posts: 1,099
Bobmutch,

Here are GoogleGuy's comments on the trailing slash and the 301 redirect you avoid:

http://www.webmasterworld.com/forum3/15894.htm

Hope that helps

Chris_D
Old 12-28-2004   #16
bobmutch
Chris_D: Thanks:

GG quote from WMW listed above:
"I've got a few minutes free, so let's go into detective mode for a bit. Most webservers are configured to append the "/" automatically via a 301 redirect. For example, if you try to fetch www*google.com/webmasters , our web server will do a permanent 301 redirect to the canonical page, which is www*google.com/webmasters/ (note the trailing slash). "

That thread is pretty old, so I am not going to revive it at WMW. Is GG saying that out of the box most web servers are configured to append a / via a 301 redirect, or that people are configuring their servers that way?

If it is not an out-of-the-box configuration, is there some code to stick in your .htaccess or httpd.conf to make those changes?
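
For reference, the behavior GG describes boils down to something like the Python sketch below. It is not server configuration, and the set of known directories and the function name are hypothetical. As far as I know, Apache's mod_dir issues this kind of redirect for real directories by default (its DirectorySlash setting), but check that against your own server's documentation.

```python
def directory_slash_redirect(path, directories):
    """Return (301, path + '/') when path names a known directory but lacks the slash."""
    if not path.endswith("/") and path in directories:
        return 301, path + "/"
    return None  # files and already-canonical directory URLs are served as-is


# Hypothetical set of directory paths that exist on the server.
known_dirs = {"/webmasters"}

print(directory_slash_redirect("/webmasters", known_dirs))
print(directory_slash_redirect("/webmasters/", known_dirs))
```

The first call yields a permanent redirect to the slash form and the second yields nothing, which matches the www*google.com/webmasters example in the quote.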
Old 12-28-2004   #17
pageoneresults
Member
 
Join Date: Jun 2004
Location: California
Posts: 102
Great topic and some very interesting discussion.

Yes, URLs with and without a trailing forward slash are treated as two different files unless otherwise specified at the server level. Now, bear with me, as I can really only explain this in layman's terms...

Content Negotiation

The W3C and other large website structures are now utilizing content negotiation. That means that this...

www.example.com/sub

...could be different than this...

www.example.com/sub/

With the use of content negotiation, there are no file extensions. Basically you are cleaning the URI of all underlying identifying technologies.

example.com is the root domain. www.example.com is a sub-domain of the root, and both should be treated as different locations, unless of course you've implemented a 301 to redirect the root to the www sub-domain.

When doing rewrites on IIS, we always force the trailing forward slash, because the server seems to interpret things differently when using an ini file: it doesn't automatically append the trailing forward slash. We don't mind, as it forces us to produce a rewrite that is flawless in its function. We'll typically back our way through a URI and make sure that the proper server headers and content are being presented depending on where you are in the string. That part gets deep for me!

So yes, these are all different locations based on the specifications...

example.com
www.example.com
www.example.com/name
www.example.com/name/
__________________
pageoneresults.com

Last edited by pageoneresults : 12-28-2004 at 03:12 PM.
Old 12-28-2004   #18
orion
 

ISAPI and Apache URL rewrites work fine for making clean URLs, but that is not the same as URL normalization (on the search engine side).

After normalizing the urls the system should assign a unique checksum to each of these in order to map urls to documents. Thus, from the search engine side the common urls (with/without slashes) should map to the same page identifier. So far that's how the IR implementations I'm familiar with work.

Orion

Last edited by orion : 12-29-2004 at 12:10 AM. Reason: typos
Old 12-29-2004   #19
rustybrick
 
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Thought this might be of interest to this thread. I was doing some searching on client sites at Yahoo and I noticed a problem with the way Yahoo was handling the normalization of the URLs. I posted a real life example and put the thread in the Yahoo Web Search forum with the title Trailing Slash Issues - Normalizing URLs the Wrong Way.
Old 12-30-2004   #20
ISO9000 Guy
Member
 
Join Date: Jun 2004
Location: Southern Ohio
Posts: 10
There is also the aspect of:

www.example.com
vs.
example.com

These also will give different results.

EDIT: Oops - I see it was noted about 3 posts up.

Last edited by ISO9000 Guy : 12-30-2004 at 12:15 PM.