Old 09-22-2005   #1
amye247
 
Help! Google saturation / no. of pages indexed is too high!!

I have a client who has always had good coverage of their site in Google. Since Aug 28, 2005, I have been watching the number of indexed pages increase to levels I would never have imagined. He is now sweating bullets, worried that something bad (penalties/bans) will happen.

The indexed page count was sitting at about 89K but has exploded to 1.28 million. This is not possible, as the site does not contain that many pages.

I explained that something is probably wrong with the way Google has been counting this, but obviously I need a more definitive answer.

Has anyone else experienced this, or does anyone understand what is going on?

Thanks
Amye
Old 09-22-2005   #2
softplus
You're probably getting URLs with session IDs indexed... not good... Make sure that the site doesn't hand out session IDs to bots :-). I thought Google was good at keeping session IDs out of the index, but in the last month or so I've seen a lot of the same game...
Old 09-22-2005   #3
Alan Perkins
You could be suffering any one of a number of problems.

For a start, try doing a phrase search on a phrase that should appear on only one page on your site. The syntax is:

site:www.yoursite.com "insert your unique phrase here"

If multiple pages appear in the search results when you would only expect one, look at the URLs of those multiple pages. Which URL did you expect to be there? Is it there? How did the others get there (i.e. what path could the robots have followed to see those other URLs)?

Once you know why the extra URLs are appearing, you can fix it. You may be able to use robots.txt or the robots meta tag, or you may have to adapt your server software.
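
For example, if the duplicates turn out to be printer-friendly or other secondary versions of real pages, a robots meta tag in the <head> of those versions is often the simplest fix - something like this (just an illustration; the "printer-friendly page" here is hypothetical):

Code:
<!-- in the <head> of a printer-friendly or other duplicate page -->
<meta name="robots" content="noindex,follow">
The noindex keeps the duplicate out of the index, while follow still lets the robot pass through any links on it.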
Old 09-22-2005   #4
amye247
 
So far I have found....

I ran a few tests like you guys suggested.

First result
I have 1 page of content and found 3 pages indexed pointing to the same page with three different session IDs:

1. Set-Cookie: ASPSESSIONIDQCDRQRRC=CDCMEGIAMDKILAGFBLECHPGB; path=/
2. Set-Cookie: ASPSESSIONIDQCDRQRRC=BJBMEGIAIIIKMBDLGFBINCOB; path=/
3. Set-Cookie: ASPSESSIONIDQCDRQRRC=PFBMEGIAANGPCLIPKBCIGNPG; path=/

So I note two things: it looks like the engine is getting the session IDs, and I have a duplicate content issue.

They have done it in this manner all along (using a rewrite on the URLs), but I have never seen Google index their site in this way.

I'll note that I also found two URLs pointing to the same page, which also have two different session IDs as you would expect. However, they have done this multiple times: if a particular product falls into two categories, they list it as two different products with two different URLs, both pointing to the same content.

In addition, they have also introduced a third problem - "printer friendly" versions of these pages (fired by JavaScript). None of these has a robots exclusion.

Are we on the right track here? Possibly?

Amye
Old 09-22-2005   #5
softplus
Bingo!

I thought Google ignored / removed session IDs earlier, but according to some people I've asked, it has been indexing them since the beginning... even though they mention that they sometimes remove parameters ("don't use anything with 'id' in your parameters..."). My guess is that this has to do with the Sitemaps beta - get as many URLs listed as possible - but personally I don't think linking to expired sessions raises the QUALITY of the index :-)

The print-preview is also a problem, but it shouldn't be the main issue (+ it's easy to add meta tags to the header). Another thing with forums is that you can usually access whole threads or the single posts separately; if this is the case for your client's site as well, you should consider "banning" robots on one of the two types (meta tags, etc.).

You can even go so far as to actively force the engines to index the way you want. When you notice a bot (through user agent or IP):
- if a URL with a session ID is accessed, 301-redirect to the same page without the session ID
- if a "non-indexable" page is accessed, 301-redirect to the indexable copy
(there's a rough sketch of this below)
Old 09-22-2005   #6
amye247
 
And the plot thickens!

Yes, there is a session ID problem. Unfortunately I can't convince the powers that be, because the session ID is in the HTTP header and not the URL.

But this gets better - Google has indexed 160,000 iframes that have been specifically excluded in the robots.txt file. What the? There is no exclusion in the pages themselves, and Google obeyed the robots file until now.

While this doesn't account for the 1.2 million figure, I believe it is only one of a multitude of problems.

So there are the session IDs, and previously excluded pages.

Then something weird - when doing a search for URLs on Google.com (one data centre), the URLs returned are something like:

http://www.site.com/dir/dir/ dir/dir (note the space)

When doing the same search on Google.com.au (another index), the URL is without the space. The one with the space returns a 400 error and the other a 302.

My head is starting to hurt
Old 09-22-2005   #7
PhilC
It's not uncommon for Google to report far more of a site's pages as being in the index than is possible. For one of my sites, more than double the maximum was reported for a long time. But jumping from 89K to 1.28M is a bit radical.

It's session IDs in the URLs that matter, and not session IDs in the response headers.

Google often, maybe usually, adds a space in the printed URLs in the listings. It's normal, and nothing to be concerned about.

I'm surprised that they are crawling iframe sources, because they didn't until recently.

Sorry - no solutions - just comments.
Old 09-22-2005   #8
amye247
 
Ruling out session ID

Thanks Phil - so are you saying that the Set-Cookie value in the server header, which is different every time I check, will not be a problem because it is not hardcoded in the URL?

OK - I can live with that. I ran another quick test and found that Google has indexed everything that the robots.txt file said not to. This comes to a total of 555,195 - nearly half of my 1.2 million saturation level.

The exclusions are for particular pages - page.asp. This is where the session ID is problematic, because there is no URL rewrite on these - presumably because they were excluded in the robots file. So now the index has loads of:
page.asp/map=1&fid=537502 (where the id is different)

So I understand that there are multiple listings of the page because of the session ID. But why would Google be ignoring the robots file? It validates and has been in place for a long while, but suddenly all previously excluded pages are now included.

How do I fix this?

Amye
Old 09-23-2005   #9
softplus
Quote:
Originally Posted by amye247
How do I fix this?
Fdisk + Format (on Google's servers)

There is a small site I'm watching; at the moment it has 12 pages physically online, yet Google is listing over 900 URLs... I can't really believe that they did this kind of thing earlier, but I'm pretty sure the people at Google are either not taking it seriously or working their b*** off to make sure it gets cleaned up soon. (I'd prefer the second.)

Besides "cloaking" your site to Google, there isn't much you can do but wait. If you want to try cloaking, you'll need Googles IPs and useragents; whenever a bot accesses one of the "blocked" pages, return 404 and let him be off. However, it will not clean up your listings very quickly, I really doubt it will happen automatically before 6 months (I have pages that have 404ed for over 2 years that were still online until I used the manual remove url request, and even that is limited to 180 days removal, even though the URL has been gone for years).

One thing you can try, though, that might have an impact: make sure the bad URLs return 404 to bots and then create a Google Sitemap file for your complete site. Sometimes - possibly when the number of URLs in your sitemap represents a significant portion of the URLs indexed - Google will flush the site from its index before reindexing based on your sitemap file. There doesn't seem to be a pattern to it and I can't guarantee that it will always work (or even keep on working). People have usually complained when it does that, but perhaps in your case it would be a good idea :-). "Value" (PR, etc.) will of course be kept, but it might take a few days for Google to get your site back into the SERPs.
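
For what it's worth, a minimal Sitemap file is just a list of your real URLs in Google's XML format - roughly like this (the URLs are placeholders, and the namespace is the one I've seen in the Sitemaps beta documentation):

Code:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.yoursite.com/</loc>
    <lastmod>2005-09-23</lastmod>
  </url>
  <url>
    <loc>http://www.yoursite.com/somepage.asp?map=1</loc>
  </url>
</urlset>
Submit it through the Sitemaps interface once the bad URLs are returning 404s to bots.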

How I wish Google would allow some sort of simple "I'm the webmaster, remove this and that" functionality... Perhaps it will be the next step for Sitemaps - now that we can verify that we are the webmasters, it would be a logical step. It's a pain to have to wait and watch like this...
Old 09-23-2005   #10
amye247
 
How about this though?

If the client has the following disallow in the robots.txt:

/link.asp

I would imagine that the page itself would not be indexed. However, Google has this page indexed 27,000 times, because while the page has been disallowed, the actual listings in Google are:

link.asp?id=1245
link.asp?id=4756

So Google obeyed the robots.txt - there isn't a specific exclusion for all the session IDs. I would surmise that it would be better to exclude the directory, so that any page in that directory - which would include all instances - would not be indexed?

Amye
Old 09-23-2005   #11
Alan Perkins
If you had:

Code:
User-agent: *
Disallow: /link.asp
in your robots.txt file, then "/link.asp?id=1245", "/link.asp?id=4756", etc. would all be disallowed too. Robots.txt treats each Disallow path as a prefix match against the beginning of the URL - not the whole URL. That's why
Code:
User-agent: *
Disallow: /
disallows the entire site - because every URL on the site starts with "/".

I'm guessing that Google has not fully indexed the URLs that are disallowed, but has merely seen links to them and indexed those links. Such URLs show up in the SERPs with the URL in place of the title, no snippet, no cache, etc. If that's the case, don't worry about it. Google won't actually crawl and index the content, because when it tries to do so, it will see it's disallowed by your robots.txt file.
Old 09-23-2005   #12
PhilC
Quote:
Originally Posted by amye247
Thanks Phil - so are you saying that the Set-Cookie value in the server header, which is different every time I check, will not be a problem because it is not hardcoded in the URL?
That's right. The header doesn't matter in this respect. It's only session IDs in the URLs that matter.
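
To put the two side by side (the second URL is made up purely as an example):

Code:
# Session ID delivered in a response header only - the indexed URL stays the same:
Set-Cookie: ASPSESSIONIDQCDRQRRC=CDCMEGIAMDKILAGFBLECHPGB; path=/

# Session ID embedded in the URL itself - every new session looks like a new page:
http://www.site.com/page.asp?map=1&sessionid=CDCMEGIAMDKILAGFBLECHPGB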
Old 09-24-2005   #13
amye247
 
Darn! I thought I had this nailed!

OK - the session ID in the header does not matter. The disallowed pages are being indexed with session IDs even though the robots.txt has an exclusion.

Alan, interesting statement - you are right. The listings just show the page name: no title, no description, no cache. This is because the pages have no metadata. In most cases they are iframes with just a phone number or something like that, so they would obviously have links pointing to them from the pages in which they appear.

You also said Google won't crawl and index the content. But, haven't they? I'd imagine I'd have trouble finding the pages in a natural search but they are there if I do a specific search for the page and I can see the content.

If this ends up being why Google's saturation of the site seems so large, I can understand it. However, I might not have discovered these extra pages had the number not been so high, prompting me to research. So had Google been counting them previously? Who knows. Is this why the index count got so high - that these extra pages are suddenly being counted in the total? If so, why?

I know that there is no magic answer but all of your opinions are valued.

Thanks - Amye
Old 09-24-2005   #14
Alan Perkins
Quote:
Originally Posted by amye247
Alan, interesting statement - you are right. The listings just show the page name: no title, no description, no cache. This is because the pages have no metadata. In most cases they are iframes with just a phone number or something like that, so they would obviously have links pointing to them from the pages in which they appear.
The page content has not been indexed. If the page content had been indexed, you would see a cache link (assuming you had not specifically prevented this using a NOARCHIVE value in the robots meta tag). You would still see a title (assuming your iframe document had a title) and, in most circumstances, a snippet.

Quote:
You also said Google won't crawl and index the content. But, haven't they? I'd imagine I'd have trouble finding the pages in a natural search but they are there if I do a specific search for the page and I can see the content.
No, Google has not indexed the content. Only the URLs. Check your log files and see if you can find any example of Googlebot actually accessing those pages. You won't.
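
A rough way to check, assuming you can get at the raw access logs (the file name, paths and log format here are assumptions - adjust them to whatever your server actually writes):

Code:
# Sketch: print any access-log lines where Googlebot requested a disallowed page.
LOG_FILE = "access.log"                       # hypothetical log file name
DISALLOWED = ("/link.asp", "/page.asp")       # the paths disallowed in robots.txt

with open(LOG_FILE, encoding="utf-8", errors="ignore") as log:
    for line in log:
        if "Googlebot" in line and any(path in line for path in DISALLOWED):
            print(line.rstrip())              # any hit means the page really was fetched
If that prints nothing for the disallowed URLs, the content was never fetched - Google only knows those URLs from links.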
Old 09-24-2005   #15
PhilC
There are a number of reasons why Google shows a URL instead of a title and snippet/description. One of them is that it is a page that Google knows about because of a link, or links, pointing to it, but that they haven't indexed. That's the reason they give us for the URL-only listings. They do sometimes show up in the SERPs because of the link text that points to them.

It sounds likely that the pages haven't been crawled, but are showing up on a site: search because they are known about due to the links.
Old 09-25-2005   #16
amye247
 
If that's the case...

So these pages aren't crawled or indexed, just known about because of a link, which is why the robots.txt file only appears to be disregarded.

These pages have been counted in the total number of indexed pages (saturation) for this domain, but their content has not been indexed. Would this be a correct statement?

Amye
-Only half confused now
Old 09-25-2005   #17
PhilC
It looks that way. Maybe Google is including all the pages that it knows about so that they can announce a huge increase in the number of pages in the index.
Old 09-25-2005   #18
amye247
 
Thanks Phil

The index count this morning has gone up again, to 1.33 million. I have also noted that some of the pages in question have been cached, but not all of them.

Amye
Old 09-25-2005   #19
amye247
 
I'll elaborate

One of the excluded pages shows URL-only listings 24,000 times - based on what was said earlier, these are not crawled or indexed but appear because of a link.

Another excluded page shows 172,000 listings. Again, these pages have no metadata, and the link text Google shows is the text from the page (i.e. just a phone number). However, all of these pages have a "Cached" version available. So have these been crawled and indexed? Obviously I'll need to check the log files to be sure, but this just seems to confuse the matter.

Amye
Old 09-26-2005   #20
Alan Perkins
Quote:
Originally Posted by amye247
Another excluded page shows 172,000 listings. Again, these pages have no metadata, and the link text Google shows is the text from the page (i.e. just a phone number). However, all of these pages have a "Cached" version available. So have these been crawled and indexed?
Probably, yes.

If you have a specific example of 172,000 URLs that have been indexed when robots.txt should have prevented it (i.e. your robots.txt file, at the time the URLs were crawled, contained Disallow lines for those URLs), I'd be very interested - as I'm sure Google would be. Check your robots.txt file. Is the URL spelt correctly? Does it begin with "/"? Is "Disallow" spelt correctly? Does the record in which you disallow those URLs apply to Googlebot? Is the robots.txt file valid? And so on...
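
For reference, a record that covers Googlebot along with every other robot would look like this (using the paths discussed in this thread as examples):

Code:
User-agent: *
Disallow: /link.asp
Disallow: /page.asp
Googlebot falls under "User-agent: *" unless there is a more specific "User-agent: Googlebot" record, in which case it follows only that record - so if your file has a Googlebot-specific section, the Disallow lines need to be repeated there.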