Help Google Saturation/No of pages indexed is too high!!
amye247
09-22-2005, 03:48 AM
I have a client who has always had good coverage of their site in Google. Since Aug 28, 2005 I have been watching that number increase to levels I would never have imagined. He is now sweating bullets and worried that something bad (penalties/bans) will happen.
The index was sitting at about 89K but has exploded to 1.28Million. This is not possible as the site does not contain that many pages.
I explained that something is probably wrong with the way Google has been counting this but obviously I need something more definitive as an answer.
Has anyone else experienced this with any type of understanding or knows what is going on?
Thanks
Amye
softplus
09-22-2005, 05:08 AM
You're probably indexing URLs with Session-IDs... not good ... Make sure that the site doesn't give session-ids to bots :-). I thought Google was good at making sure that session-ids don't get indexed, but in the last month or so I've seen a lot of the same game...
Alan Perkins
09-22-2005, 06:03 AM
You could be suffering any one of a number of problems.
For a start, try doing a phrase search on a phrase that should appear on only one page on your site. The syntax is:
site:www.yoursite.com "insert your unique phrase here"
If multiple pages appear in the search results when you would expect only one, look at the URLs of those pages. Which URL did you expect to be there? Is it there? How did the others get there (i.e. what path could the robots have followed to see those other URLs)?
Once you know why the extra URLs are appearing, you can fix it. You may be able to use robots.txt or the robots meta tag (http://www.robotstxt.org), or you may have to adapt your server software.
amye247
09-22-2005, 06:43 AM
I ran a few tests like you guys suggested.
First result
I have 1 page of content and found 3 pages indexed pointing to same page with three different session IDs:
1. Set-Cookie: ASPSESSIONIDQCDRQRRC=CDCMEGIAMDKILAGFBLECHPGB; path=/
2. Set-Cookie: ASPSESSIONIDQCDRQRRC=BJBMEGIAIIIKMBDLGFBINCOB; path=/
3. Set-Cookie: ASPSESSIONIDQCDRQRRC=PFBMEGIAANGPCLIPKBCIGNPG; path=/
So I note 2 things. Looks like the engine is getting the session IDs and I have a duplicate content issue.
They have also done this in this manner all along (using a rewrite on the URLs) but I have never seen Google index in this way on their site.
I'll note that I also found two URLs pointing to the same page but also have two different session IDs as you would expect. However, they have done this multiple times if a particular product falls into two categories. They list it as two different products, with two different URLs but are pointing to the same content.
In addition they have also implemented a third problem - providing "printer friendly" versions of these pages (fired by javascript). None of which has a robots exclusion.
Are we on the right track here? Possibly?
Amye
softplus
09-22-2005, 08:49 AM
Bingo!
I thought Google ignored / removed session-ids earlier, but according to some people I've asked, it's been indexing them since the beginning... even though they mention that they sometimes remove parameters ("don't use anything with 'id' in your parameters..."). My guess is that this has to do with the sitemaps-beta, get as many URLs listed as possible... but personally I don't think it's raising the QUALITY of the index by linking to expired sessions :-)
The print-preview is also a problem, however it shouldn't be the main issue (+ it's easy to add meta-tags to the header); another thing with forums is that you can usually access whole threads or the single posts separately; if this is the case for your client's site as well, you should consider "banning" robots on one of the two types (meta-tags, etc.).
You can even go so far as to actively force the engines to index the way you want:
when you notice a bot (through useragent, ip):
- if a session id is accessed, 301-redirect to the same page without the session-id
- if a "non-indexable" page is accessed, 301-redirect to the indexable copy
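The two rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the parameter names in SESSION_PARAMS, the user-agent markers, and the print.asp to product.asp mapping are all hypothetical, and a real setup would also check bot IPs rather than trusting the user-agent string alone.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names: replace with whatever the site really uses.
SESSION_PARAMS = {"sessionid", "sid"}
BOT_MARKERS = ("googlebot", "slurp", "msnbot")
# Hypothetical map from a "non-indexable" page to its indexable copy.
CANONICAL_PATHS = {"/print.asp": "/product.asp"}


def is_bot(user_agent):
    """Crude check on the user-agent string only."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def canonical_url(url):
    """Drop session-id parameters and swap in the indexable path, if needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k.lower() not in SESSION_PARAMS]
    path = CANONICAL_PATHS.get(parts.path, parts.path)
    if len(kept) == len(pairs) and path == parts.path:
        return url  # nothing to change; leave the URL untouched
    return urlunsplit(parts._replace(path=path, query=urlencode(kept)))


def redirect_for(url, user_agent):
    """Return (301, location) when a bot should be redirected, else None."""
    if not is_bot(user_agent):
        return None
    target = canonical_url(url)
    return (301, target) if target != url else None
```

Normal visitors are never redirected here, so their sessions keep working; only requests that look like a crawler get sent to the clean URL.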
amye247
09-22-2005, 09:56 PM
Yes, there is a session id problem. Unfortunately I can't convince the powers that be because the session id is in the HTTP header and not the URL.
But, this gets better - Google has indexed 160,000 iframes that have been specifically excluded in the robots.txt file. What the? There is no exclusion in the page itself and Google obeyed the robots file until now.
While this doesn't account for the 1.2mil figure I believe it is only one of a multitude of problems.
So there are the session IDs, and previously excluded pages.
Then something weird - when doing a search for URLs on Google.com (one data centre) the URLs returned are something like:
http://www.site.com/dir/dir/ dir/dir (note the space)
When doing the same search on Google.com.au (another index) the URL is without the space. The one with the space returns a 400 error and the other is a 302.
My head is starting to hurt :eek:
PhilC
09-22-2005, 10:55 PM
It's not uncommon for Google to report a lot more of a site's pages as being in the index than are possible. In one of my sites, more than double the maximum was reported for a long time. But jumping from 89k to 1.28M is a bit radical.
It's session IDs in the URLs that matter, and not session IDs in the response headers.
Google often, maybe usually, adds a space in the printed URLs in the listings. It's normal, and nothing to be concerned about.
I'm surprised that they are crawling iframe sources because they weren't doing so not long ago.
Sorry - no solutions - just comments.
amye247
09-22-2005, 11:20 PM
Thanks Phil - so are you saying that the fact that the server header has a Set-Cookie variable, that is different every time I check the header, it will not be a problem because it is not hardcoded in the URL?
OK - I can live with that. I ran another quick test and found that Google indexed everything that the robots.txt file said not to. This equals a total of 555,195 - nearly half of my 1.2mil saturation level.
The exclusions are for particular pages - page.asp. This is where the session ID is problematic because there is no URL rewrite on these. Presumably because they were excluded in the robots file. So now the index has loads of:
page.asp/map=1&fid=537502 (where the id is different)
So I understand that there are multiple listings of the page because of the session id. But why would Google be ignoring the robots file? It validates and has been in place for a long while but suddenly all previously excluded pages are now included.
How do I fix this?
Amye
softplus
09-23-2005, 03:52 AM
How do I fix this?
Fdisk + Format :D (on Google's servers)
There is a small site I'm watching, at the moment it has 12 pages physically online, Google is listing over 900 URLs... I can't really believe that they did this kind of thing earlier, but I'm pretty sure the people at Google either aren't taking it seriously or are working their b*** off to make sure it gets cleaned up soon. (I'd prefer the second)
Besides "cloaking" your site to Google, there isn't much you can do but wait. If you want to try cloaking, you'll need Google's IPs and useragents; whenever a bot accesses one of the "blocked" pages, return 404 and let it be off. However, it will not clean up your listings very quickly, I really doubt it will happen automatically before 6 months (I have pages that have 404ed for over 2 years that were still online until I used the manual remove url request, and even that is limited to 180 days removal, even though the URL has been gone for years).
One thing you can try, though, that might have an impact: Make sure the bad urls return 404 to bots and then create a Google Sitemap file for your complete site. Sometimes, possibly when the number of URLs in your sitemap represents a significant portion of the URLs indexed, Google will flush the site from its index before reindexing based on your sitemap file. There doesn't seem to be a pattern to that and I can't guarantee that it will always work (or even keep on working). People have usually complained that it does that, but perhaps in your case it would be a good idea :-). "Value", (PR, etc) will of course be kept, but it might take a few days for Google to get your site back into the serps.
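For the sitemap step, a minimal sketch of generating the file might look like this. The namespace used is the sitemaps.org 0.9 one, which is an assumption on my part (the 2005 beta may have used a different schema), and the example URLs are made up.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 schema; check what the engine expects.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML as a string, listing each URL once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical list of the pages that should be indexed.
pages = [
    "http://www.example.com/",
    "http://www.example.com/product.asp?id=12",
    "http://www.example.com/",  # duplicate, dropped by set()
]
print(build_sitemap(pages))
```

The list should contain only the clean URLs you want indexed, so no session-id or printer-friendly variants.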
How I wish Google would allow some sort of simple "I'm the webmaster, remove this and that" functionality .... Perhaps it will be the next step from Sitemaps, now that we can verify that we are the webmasters, it would be a logical step. It's a pain to have to wait and watch like that....
amye247
09-23-2005, 04:33 AM
If the client has the following disallow in the robots.txt:
/link.asp
I would imagine that the page itself would not be indexed. However Google has this page indexed 27,000 times because while the page has been disallowed the actual listings in Google are:
link.asp?id=1245
link.asp?id=4756
So Google obeyed the robots.txt - there isn't a specific exclusion for all the session ID's. I would surmise that it would be better to exclude the directory so that any page in that directory - which would include all instances - would not be indexed?
Amye
Alan Perkins
09-23-2005, 06:43 AM
If you had:
User-agent: *
Disallow: /link.asp
in your robots.txt file, then "/link.asp?id=1245", "link.asp?id=4756", etc. would all be disallowed too. Robots.txt treats disallow parameters as a partial match of the beginning of the URL - not the whole URL. That's why
User-agent: *
Disallow: /
disallows the entire site - because every URL on the site starts with "/".
I'm guessing that Google has not fully indexed the URLs that are disallowed, but has merely seen links to them and indexed those links. Such URLs show up in the SERPs with the URL in place of the title, no snippet, no cache, etc. If that's the case, don't worry about it. Google won't actually crawl and index the content, because when it tries to do so, it will see it's disallowed by your robots.txt file.
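As a rough illustration of that prefix rule (a sketch only, not how any particular crawler is implemented): a URL path is blocked when it starts with a Disallow value. The helper name is hypothetical, and the sketch ignores User-agent grouping and any non-standard extensions.

```python
def is_disallowed(url_path, disallow_values):
    """True if url_path starts with any non-empty Disallow value."""
    return any(value and url_path.startswith(value)
               for value in disallow_values)


rules = ["/link.asp"]
print(is_disallowed("/link.asp", rules))          # True
print(is_disallowed("/link.asp?id=1245", rules))  # True: same prefix
print(is_disallowed("/other.asp?id=1", rules))    # False
print(is_disallowed("/anything/at/all", ["/"]))   # True: every path starts with "/"
```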
PhilC
09-23-2005, 12:01 PM
Thanks Phil - so are you saying that the fact that the server header has a Set Cookie variable, that is different everytime I check the header, it will not be a problem because it is not hardcoded in the URL?
That's right. The header doesn't matter in this respect. It's only session IDs in the URLs that matter.
amye247
09-24-2005, 05:44 AM
OK - The session id in the header does not matter. The disallowed pages are being indexed by session id's even though the robots.txt has an exclusion.
Alan, interesting statement - you are right. The listings are just snippets showing the page name. No Title, no description, no cache. This is because the pages have none (metadata). In most cases they are iFrames with just a phone number or something of the like. So they would obviously have links pointing to them from some page in which they appear.
You also said Google won't crawl and index the content. But, haven't they? I'd imagine I'd have trouble finding the pages in a natural search but they are there if I do a specific search for the page and I can see the content.
If this ends up being why Google's saturation of the site seems so large I can understand. However, I might not have discovered these extra pages had the number not been so high that I went and researched it. So had Google been counting them previously? Who knows. Is this why the index got so high - that suddenly these extra pages are being counted in the total? If so, why?
I know that there is no magic answer but all of your opinions are valued.
Thanks - Amye
Alan Perkins
09-24-2005, 06:44 AM
Alan, interesting statement - you are right. The listings are just snippets showing the page name. No Title, no description, no cache. This is because the pages have none (metadata). In most cases they are iFrames with just a phone number or something of the like. So they would obviously have links pointing to them from some page in which they appear.
The page content has not been indexed. If the page content had been indexed, you would see a cache link (assuming you had not specifically prevented this using a NOARCHIVE attribute in the robots meta tag). You would still see a title (assuming your iframe document had a title) and, in most circumstances, a snippet.
You also said Google won't crawl and index the content. But, haven't they? I'd imagine I'd have trouble finding the pages in a natural search but they are there if I do a specific search for the page and I can see the content.
No, Google has not indexed the content. Only the URLs. Check your log files and see if you can find any example of Googlebot actually accessing those pages. You won't. :)
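One way to run that log check, as a minimal sketch: it assumes the server writes combined-format access log lines (an IIS site may well use the W3C extended format instead, in which case the pattern needs adapting), and the sample lines are made up for illustration.

```python
import re

# Picks the request path, status and user-agent out of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)


def googlebot_hits(log_lines, path_prefix):
    """Return (path, status) for each Googlebot request under path_prefix."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line.strip())
        if not match:
            continue  # not a line this sketch understands
        if ("googlebot" in match.group("agent").lower()
                and match.group("path").startswith(path_prefix)):
            hits.append((match.group("path"), int(match.group("status"))))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    '66.249.66.1 - - [24/Sep/2005:06:44:00 +0000] "GET /phone.asp?id=8585 HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [24/Sep/2005:06:45:00 +0000] "GET /phone.asp?id=7474 HTTP/1.1" 200 498 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]
print(googlebot_hits(sample, "/phone.asp"))
```

An empty list for a disallowed path would support the view that only the URLs, not the content, are listed. Any hits would be worth comparing against the dates robots.txt was fetched.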
PhilC
09-24-2005, 09:23 AM
There are a number of reasons why Google shows a URL instead of a Title and snippet/description. One of them is that it is a page that Google knows about because of a link, or links, pointing to it, but they haven't indexed it. That's the reason they give us for the URL only listings. They do sometimes show up in the serps because of the link text that points to them.
It sounds likely that the pages haven't been crawled, but are showing up on a site: search because they are known about due to the links.
amye247
09-25-2005, 08:12 PM
So, these pages aren't indexed or crawled but known because of a link which is why the robots.txt file seems to be disregarded.
These pages have been counted as the total amount of indexed pages (saturation) from this domain but the content has not been indexed. Would this be a correct statement?
Amye
-Only half confused now :)
PhilC
09-25-2005, 09:09 PM
It looks that way. Maybe Google is including all the pages that it knows about so that they can announce a huge increase in the number of pages in the index.
amye247
09-25-2005, 09:17 PM
Index this morning has gone up again to 1.33mil. I have also noted that some of the pages in question have been cached but not all of them.
Amye
amye247
09-25-2005, 11:11 PM
One of the excluded pages shows URL only listings 24,000 times - Based on what was said earlier these are not crawled or indexed but appear because of a link.
Another excluded page shows 172,000 listings. Again these pages have no metadata and the link Google shows is the text from the page (i.e. just a phone number). However, all of these pages have a "Cached" version available. So have these been crawled and indexed? Obviously I'll need to check the log files to be sure but this just seems to confuse the matter.
Amye
Alan Perkins
09-26-2005, 07:57 AM
Another excluded page shows 172,000 listings. Again these pages have no metadata and the link Google shows is the text from the page (i.e. just a phone number). However, all of these pages have a "Cached" version available. So have these been crawled and indexed?
Probably, yes.
If you have a specific example of 172,000 URLs that have been indexed when robots.txt should have prevented this (i.e. your robots.txt file at the time the URLs were crawled contained Disallow lines for those URLs), I'd be very interested. As I am sure would Google be. Check your robots.txt file. Is the URL spelt correctly? Does it begin with "/"? Is "Disallow" spelt correctly? Do the record(s) in which you disallow those URLs apply to Googlebot? Is the robots.txt file valid? And so on...
amye247
09-26-2005, 08:17 PM
I checked the robots file. Everything seems to be spelled correctly.
The only things I can note:
1. There is a #comment at the top of the file - which in theory is OK to be there.
2. The disallow (as explained earlier)
User-Agent: *
Disallow: /phone.asp
Now - that page http://www.domain.com/phone.asp is NOT in the index. However, http://www.domain.com/phone.asp?id=8585, http://www.domain.com/phone.asp?id=7474 etc (times 172K) are indexed and cached. I should also note that the ID's are doc IDs and not sessions.
Earlier in this thread it was stated that "in your robots.txt file, then "/link.asp?id=1245", "link.asp?id=4756", etc. would all be disallowed too."
This may not be correct? Another idea is maybe the robots file was unavailable at some point causing everything to get in?
PhilC
09-26-2005, 09:30 PM
I think that Alan is much more of an expert on the robots.txt file than me, but I've never seen it anywhere that a line like "Disallow: /phone.asp" would also disallow URLs like "/phone.asp?this=that". I've seen it written where "Disallow: /phone" will disallow anything that follows that part of the path, such as "/phone.asp?this=that" and "/phone/filename.html", but not when a full filename like "/phone.asp" is stated.
I'm not saying that Alan is mistaken - I'm saying that I haven't seen it written anywhere, and that it might be a mistake.
Alan Perkins
09-27-2005, 07:02 AM
Earlier in this thread it was stated that "in your robots.txt file, then "/link.asp?id=1245", "link.asp?id=4756", etc. would all be disallowed too."
This may not be correct?
It's definitely correct. :) See A Standard For Robot Exclusion : Format (http://www.robotstxt.org/wc/norobots.html#format):
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
Google interprets this standard correctly - to the letter, in fact.
Another idea is maybe the robots file was unavailable at some point causing everything to get in?
The file would have to be missing for this to occur. Try checking the cache dates of the cached pages to see when they were cached, and check your server logs to see if robots.txt was successfully accessed on those dates.
softplus
09-27-2005, 07:15 AM
Now - that page http://www.domain.com/phone.asp is NOT in the index. However, http://www.domain.com/phone.asp?id=8585, http://www.domain.com/phone.asp?id=7474 etc (times 172K) are indexed and cached. I should also note that the ID's are doc IDs and not sessions.
Looking at Google's http://www.google.com/webmasters/remove.html, perhaps this would work:
User-Agent: *
Disallow: phone.asp?
Though I didn't see any specific comments regarding the trailing ?, from Google's examples it might match those active pages...
Alan Perkins
09-27-2005, 08:29 AM
Looking at Google's http://www.google.com/webmasters/remove.html, perhaps this would work:
User-Agent: *
Disallow: phone.asp?
That would not work as you are missing a / before phone.asp. It should be:
User-Agent: *
Disallow: /phone.asp?
But that really won't make any difference, apart from allowing /phone.asp itself to be indexed.
PhilC
09-27-2005, 10:12 AM
I've read that "A Standard for Robot Exclusion" page and I'm still not convinced that you are correct, Alan. It gives this example:-
The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
At first glance, it seems to show exactly what you are saying, but a closer look reveals that it is saying something different, or at least it makes it very uncertain.
"... any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html" is a 2-element, comma-separated list, and either the grammar is poor, or the grammar is good and it says that spiders shouldn't visit:-
(1) any URL starting with "/cyberworld/map/" or "/tmp/" (partial paths)
or
(2) /foo.html (whole path)
It's the lack of a comma between the first two, and the comma between the second and third that make the difference. Also, the quotes around the first two but not around the third seem to differentiate between the first two (partial paths) and the third (whole path). We have to assume that the grammar is fine, and that the /foo.html element is intended as an example of a whole path, and not a partial one.
Having read that page, I'd say that "Disallow: /phone.asp" does not disallow "/phone.asp?this=that" and similar URLs. To disallow them, "Disallow: /phone" would be needed.
Alan Perkins
09-27-2005, 10:37 AM
If I were to put money on it, I'd back that the /foo.html element is intended as an example of a whole URL, and not a part of a URL.
Sorry, Phil, but you'd be backing a loser. :)
You're playing semantics with the commas in an example. What part of this definition from the standard is not clear:
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
"Any" is pretty unequivocal there.
I have plenty of practical experience which tells me that what I'm posting is factually accurate. There aren't many black-and-whites in SEO, but this is one: Google will correctly interpret "Disallow: /phone.asp" as "Don't crawl any URL on this site, including any dynamic URL, that starts with /phone.asp". In extremely rare circumstances I've heard that Googlebot can make a mistake. This could be one of those circumstances. But I've never seen it myself, and I've been watching these things for longer than Google has been around.
PhilC
09-27-2005, 10:45 AM
Yes, that is pretty unequivocal. So it makes me wonder why the part that I quoted is written the way it is. I'm not playing with semantics, because I don't believe that the grammar is bad, and it really does say what I wrote. It really does treat /foo.html as a whole URL. Your quote trumps it, though, and the example is lacking a little bit.