Google indexing session-IDs?!


softplus
09-11-2005, 03:09 PM
A site I check every now and then suddenly jumped from a realistic 1,000 or so to over 8,000 indexed URLs in Google's site: query (same for inurl:). Looking at the indexed URLs, I noticed that Google is grabbing session-IDs and making separate URLs out of them. The pages with the session-IDs are no longer valid in the meantime :-(, and I'm sure that in the next round Google will kill the site for publishing tons of "duplicate content" :-(((

Has anyone else noticed this or found an explanation? If you search for "inurl:sid" or "inurl:sessionid" you get over 37 million / 11 million URLs! A very few are normal pages with "sid" or "sessionid" in the URL, but most are URLs with session-IDs attached... I thought Google had learned to remove those?

To make matters more mysterious, the site I found it in is using Google Sitemaps, and their sitemap-file doesn't use session-IDs (which is correct, of course) - so Google must have found the session-IDs in normal crawls. If this is generally the case, no wonder Google's number of indexed URLs has risen so much :-))
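For anyone wanting to keep session-IDs out of the links a site hands to crawlers, here is a minimal Python sketch. It assumes the session-ID travels as a query parameter; the parameter names in `SESSION_PARAMS` and the example URL are made up for illustration and are not taken from the site discussed here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names; adjust to whatever the site actually uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def strip_session_id(url):
    """Return url with any session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Rebuild the URL with only the remaining parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_id("http://www.example.com/page.php?id=7&sid=abc123"))
```

The same function could be applied when generating links for requests that arrive without cookies, so that a crawler only ever sees the clean form of each URL.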

Anyone with similar/related observations?

L
09-12-2005, 02:16 PM
We've been using sitemaps for about a month now for two sites, and I've been tracking the # of indexed pages for these sites just about daily.

Around 9/7 - 9/8 our indexed pages jumped drastically from around 200,000-300,000 to 1.2 million for one site, and from about 1.5 million to 5.2 million for another site.

In the meantime, Googlebot seems to be making requests to some of our pages at the rate of 1300+ in 5 minutes recently.

I don't see many URLs with anything after the .html in the regular search results for site:www.site.com inurl:www.site.com. Even with the omitted results shown, our URLs look mostly clean, although I can only look at 10 pages of results.

L
09-12-2005, 02:30 PM
I do believe that the aggressive spidering of one page type is due to Googlebot getting caught in a spider trap - hence possibly following session IDs from somewhere - looking into it more....

L
09-13-2005, 09:15 PM
It is most likely due to G & Y's "size matters" battle. It seems Google has a new way of determining the size of their index, which would explain what we're seeing. Now we have more pages indexed in Google than in Yahoo, and it always used to be the other way around.

BTW, more pages indexed has had no noticeable effect on our SE traffic.

softplus
09-14-2005, 04:37 AM
>BTW, more pages indexed has had no noticeable effect on our SE traffic.

But wouldn't it have an effect on the quality of your traffic? In other words, if people see you in the SERPs with an old session-ID, they could easily assume your site is not "professional" since it lists "private" URLs.

Also, depending on your site, it might even "bomb" with a generic "this session has expired - I'm returning you to the home page" error (ouch!).

Perhaps similar to this -- Google is listing URLs with case-differences separately now, and flagging them as duplicate content - pushing some sites into the sandbox. I bet this is just a symptom of the same "issue".
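To check whether a site exposes such case variants, one could group a list of its URLs (from server logs, say) by their lowercased form. This is only a rough Python sketch: the function name and the example URLs are hypothetical, and it assumes the server really does serve the same page for every spelling.

```python
from collections import defaultdict

def case_duplicates(urls):
    """Group URLs that are identical except for letter case."""
    groups = defaultdict(set)
    for url in urls:
        groups[url.lower()].add(url)
    # Keep only the groups that have more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }

# Hypothetical URLs for illustration.
seen = [
    "http://www.example.com/Page.html",
    "http://www.example.com/page.html",
    "http://www.example.com/other.html",
]
for variants in case_duplicates(seen).values():
    print(variants)
```

Any group it reports is a candidate for a redirect to a single spelling.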

It seems strange that Google should suddenly jump over to "mine is bigger" from their old "quality over quantity" philosophy. Oh well, there goes the neighborhood.

L
09-14-2005, 01:47 PM
We're having some issues with our URLs not all having their parameters stripped anyway - so it's hard for me to say just yet whether the parameters I see on our URLs in the Google index are their fault or ours, but none of them, as far as I have seen (in search results), are session IDs...

With millions of pages across multiple sites, it's also hard for me to tell which new pages are being indexed, if any, or if it's just a matter of Google counting their index in a different way, like this guy says: http://addict3d.org/index.php?page=viewarticle&type=news&ID=10263

Here's a good thread on SEW too about the size deal:
http://forums.searchenginewatch.com/showthread.php?t=7685

It does make me wonder though - it seems there might be some correlation with Googlebot freaking out on one of our "add to list" pages on a shopping search engine. We just blocked Googlebot from those pages.
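The post doesn't say how the block was done. One common approach is a robots.txt rule, and a rule like that can be checked offline with Python's standard `urllib.robotparser`. In this sketch the `/addtolist` path and the example.com URLs are placeholders, not the real site's paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; "/addtolist" stands in for the real path.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /addtolist
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# True means the named crawler may fetch the URL under these rules.
blocked_page = parser.can_fetch(
    "Googlebot", "http://www.example.com/addtolist?item=42")
normal_page = parser.can_fetch(
    "Googlebot", "http://www.example.com/product.html")

print("add-to-list page allowed:", blocked_page)
print("product page allowed:", normal_page)
```

Running a check like this against a handful of real URLs before deploying the file helps confirm that only the intended pages are excluded.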