It's Official: Google Now Searching 8,058,044,651 web pages
Chris Sherman
11-10-2004, 10:13 PM
Moments ago, Google quietly changed the number of pages it's reporting on its home page. Now "searching 8,058,044,651 web pages." (http://www.google.com). According to spokesperson Nate Tyler, these are "real pages," meaning they've been fully indexed. This suggests that the total number of "items" may exceed 10 billion, if you count images, groups postings and pages inferred from links.
Don't expect Yahoo or Microsoft to counter with larger numbers any time soon--even if they do increase index size. Instead, expect statements along the lines of "our search results are competitive because they are high quality," which ultimately is both a valid assertion and the only thing that matters in the long run.
AussieWebmaster
11-10-2004, 10:24 PM
Don't expect Yahoo or Microsoft to counter with larger numbers any time soon--even if they do increase index size. Instead, expect statements along the lines of "our search results are competitive because they are high quality," which ultimately is both a valid assertion and the only thing that matters in the long run.
And it's the only counter they have.... Adding more pages does not automatically mean more relevance... though with tight filtering and sorting it should allow for more variety.
craig34
11-10-2004, 11:34 PM
Google is now reporting 57,000 pages for my site. Considering I was at 26,000 earlier in the week, this is awesome news - except for one thing. I know for a fact that my site doesn't contain more than 30,000 pages at the very most...
mcanerin
11-10-2004, 11:38 PM
How many images on that site are indexed, craig?
You would not happen to have about 25,000, would you? I'm starting to wonder if G is counting an image in its image search as a "page"....
Random thought, no proof. Just wondering.
Ian
craig34
11-10-2004, 11:53 PM
There's no way I have 25,000 separate images. Maybe 5,000 max. Good thought though.
bobmutch
11-11-2004, 12:05 AM
Just dropped down to 8,000,000,000 now when you search for the word "the". Nov 10 2300 GMT-5.
projectphp
11-11-2004, 12:17 AM
42,000,000 pages that have the word *a* but not the word *the* (http://www.google.com/search?hl=en&lr=&c2coff=1&q=-the+a&btnG=Search).
So, that is 8,042,000,000 :)
Robert_Charlton
11-11-2004, 12:21 AM
http://www.google.com/googleblog/2004/11/googles-index-nearly-doubles.html
Hmmm... the afternoon before the new MSN Search beta goes live. :rolleyes:
I'm not yet seeing any big ranking shifts in areas I monitor, and sandboxed sites are still sandboxed. But soon? Can they be on a 64 bit architecture yet???
bobmutch
11-11-2004, 02:09 AM
Projectphp:
8,000,000,000 the
41,900,000 -the a
29,000,000 -the -a to
35,300,000 -the -a -to de
37,100,000 -the -a -to -de 1
22,900,000 -the -a -to -de -1 2
18,700,000 -the -a -to -de -1 -2 3
14,900,000 -the -a -to -de -1 -2 -3 4
19,600,000 -the -a -to -de -1 -2 -3 -4
23,800,000 -the -a -to -de -1 -2 -3 -4 com
~8,243,200,000 in the index from Latin-character words and numbers
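The running total in the post above can be checked by summing the reported counts. This is a minimal sketch that uses only the figures posted in this thread; the variable names are illustrative, and the counts themselves are Google's own estimates, so the sum is no more reliable than they are.

```python
# Result counts exactly as posted above, keyed by the query that produced them.
reported_counts = {
    "the": 8_000_000_000,
    "-the a": 41_900_000,
    "-the -a to": 29_000_000,
    "-the -a -to de": 35_300_000,
    "-the -a -to -de 1": 37_100_000,
    "-the -a -to -de -1 2": 22_900_000,
    "-the -a -to -de -1 -2 3": 18_700_000,
    "-the -a -to -de -1 -2 -3 4": 14_900_000,
    "-the -a -to -de -1 -2 -3 -4": 19_600_000,
    "-the -a -to -de -1 -2 -3 -4 com": 23_800_000,
}

# Each query excludes the terms matched by the earlier ones, so the posted
# counts are treated as disjoint and simply added together.
total = sum(reported_counts.values())
print(f"{total:,}")  # prints 8,243,200,000
```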
GoogleGuy
11-11-2004, 02:24 AM
So in case anyone was still wondering, we're not limited by four-byte docids. But I suppose that was pretty clear. :)
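The four-byte docid remark can be made concrete with a little arithmetic. The sketch below assumes one unsigned integer docid per indexed page, which is an illustrative assumption and not something the thread confirms about Google's internals.

```python
# Number of distinct values an unsigned four-byte (32-bit) integer can hold.
FOUR_BYTE_DOCID_LIMIT = 2 ** 32  # 4,294,967,296

# Page count reported on Google's home page, per the first post in this thread.
reported_pages = 8_058_044_651

# Assumption for illustration: every page needs its own docid.
exceeds_four_bytes = reported_pages > FOUR_BYTE_DOCID_LIMIT

# Smallest whole number of bytes that can represent that many ids.
bytes_needed = (reported_pages.bit_length() + 7) // 8

print(exceeds_four_bytes, bytes_needed)  # prints True 5
```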
GoogleGuy
11-11-2004, 02:27 AM
If you missed it, the Google blog also discusses the new Google Advertising Professionals program. That could be interesting to the folks on this forum.
I'm helping out the Google blog folks a little bit. If there's any particular topic you want me to talk about feel free to post. Or maybe we should start a separate thread? I'm still getting the hang of this new-fangled forum. :)
seobook
11-11-2004, 02:46 AM
I'm helping out the Google blog folks a little bit. If there's any particular topic you want me to talk about feel free to post.
any and all upcoming algorithm shifts...perhaps a subscribe feature with one month advance notification? ;)
notice that this index size increase coincides with news from MSN Search. you guys don't just time it that way to spoil the news for the other search engines do you? :D
bobmutch
11-11-2004, 03:17 AM
GoogleGuy: I have lost many nights of sleep over the 4 questions below. If you would give me any kind of hints, ideas or even answers, I promise to jump up and down with joy!
1. Why is there a drop in PR when there is a toolbar PR update? (37 of the 152 sites on my PR10 pages list dropped off on the Oct 5th update).
2. What is the real PR range of the toolbar PR scale? From 0.15 to the real PR of the highest PR10 page?
3. During June 22 to Oct 5th when there was no toolbar PR update for 106 days, there were 4 BL updates. Would there have been a real PR update at the BL update times? I keep a list of the BL/GD/TB/Algo updates.
4. Is the Google Directory PR scale a scale with 7 units or 8 units?
cleardot.gif 5/35, 11/29, 16/24, 22/18, 27/13, 32/8, 38/2 (pos.gif/neg.gif).
And why does google.com have a GD PR of 44/0? Note link below in GD where www.google.com (http://www.google.com/) has pos.gif 44.
http://directory.google.com/Top/Computers/Internet/Searching/Search_Engines/Google/
Thanks!
Robert_Charlton
11-11-2004, 05:03 AM
So in case anyone was still wondering, we're not limited by four-byte docids. But I suppose that was pretty clear. :)
Sure... everything's been pretty clear for the last six months. :)
rustybrick
11-11-2004, 09:22 AM
I'm helping out the Google blog folks a little bit. If there's any particular topic you want me to talk about feel free to post. Or maybe we should start a separate thread? I'm still getting the hang of this new-fangled forum. :)
Why subject yourself to such torture? :D
iamrussell
11-11-2004, 10:31 AM
GoogleGuy,
Please go and flip off the sandbox switch. I know there must be one. I picture it as a giant electrical throw switch. Thank you.
NetinsertGuy
11-11-2004, 11:08 AM
Google is now reporting 57,000 pages for my site. Considering I was at 26,000 earlier in the week, this is awesome news - except for one thing. I know for a fact that my site doesn't contain more than 30,000 pages at the very most...
I too have noticed a similar rise in the page count. I doubt that the page count number is accurate. We don't have that many categories in the directory.
Nacho
11-11-2004, 12:23 PM
Someone over there must be saying, "if we could only let our crawlers breathe for a little bit". :p
Everyman
11-11-2004, 01:21 PM
So in case anyone was still wondering, we're not limited by four-byte docids. But I suppose that was pretty clear.
It's true. There are so many things broken now that the old docID theory alone cannot possibly explain what's going on at the Googleplex.
My site has 129,000 pages. Google reports 173,000 with the site: command. All images are in a disallowed directory, so they don't count. This site has been very stable for over a year now, with less than three percent variation in total pages, or the content of those pages.
If I use the site: command and exclude some words that are on nearly every page, Google reports 171,000 URL-only links. If I use the site: command and include the same words, I get 84,700 fully-indexed pages.
A couple months ago I decided that all numbers reported by Google over 1,000 are utterly unreliable and meaningless. The same is true at Yahoo. So I have a secret word that I use with the site: command that should bring up 765 pages when the site is fully indexed. I've been tracking the percentage of inclusion for Google, Yahoo, and Microsoft using this secret word.
Google: 71 percent (it was the same in September, lower in October, and now is back to September levels).
Yahoo: Between 89 and 95 percent, and much more stable than Google.
Old MSN (Yahoo's crawler): 80 percent, and very stable.
New MSN beta: 21 percent. (This sucks, but at least they grab the top level site map pages instead of randomly grabbing deeper pages.)
Unfortunately, the 8 billion figure will convince all the pundits, even if it convinces no webmasters. Now that Google is a public company, the uninformed Wall Street pundits are the only people who matter.
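The inclusion percentages in the post above come from dividing the hits for the secret word by the 765 pages it should match. A minimal sketch of that calculation follows; the hit count of 543 is a hypothetical value chosen to reproduce the 71 percent figure, since the post gives only percentages.

```python
# Pages that contain the secret word when the site is fully indexed (from the post).
EXPECTED_PAGES = 765


def inclusion_percent(hits: int, expected: int = EXPECTED_PAGES) -> float:
    """Return the share of expected pages an engine actually returns, in percent."""
    if expected <= 0:
        raise ValueError("expected page count must be positive")
    return 100.0 * hits / expected


# Hypothetical hit count, not taken from the thread.
hypothetical_hits = 543
print(round(inclusion_percent(hypothetical_hits)))  # prints 71
```

Keeping the expected count below 1,000 matters here because, as the post argues, every matching result can then be paged through and checked by hand.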
mcanerin
11-11-2004, 01:50 PM
Just wanted to point out that although you could probably use search words like "the" and "a" to perform apples-to-apples comparisons between SEs, there are many indexed pages in completely non-Latin alphabets that would not show up.
For example, this very visible, well-indexed PR8 site and most sites linked from it do not contain "a" or "the", etc.
http://cn.yahoo.com/
Just a warning about assuming that the pages you understand are the same as the pages Google indexes. The "the" search is a good general measurement, but I would be careful about stating anything that sounded like it was anywhere near accurate.
Cheers,
Ian
bobmutch
11-11-2004, 02:56 PM
mcanerin: I agree. Those results numbers we are given by the search engines are estimates anyway. It was kind of fun for me to find the top 10 Latin-character words and numbers in the Google index.
jfbruns
11-11-2004, 03:44 PM
Hi,
I too have noticed a similar rise in the page count. I doubt that the page count number is accurate. We don't have that many categories in the directory.
I guess the reason for that is that google is estimating the result count. Especially if you do boolean AND searches, they won't go through the whole index, that is far too expensive. Sometimes if you move on in the number of search results from page to page and the total number of results is below 1000 (they only display 1000 results), the number of results changes because it is recalculated.
Example:
Enter the following search example at google.com: google msn beta search relaunch
I got 868 results. When you move to page #10 there are only 865 results.
Since the total number of indexed documents doubled, the algorithm changed as well.
That might explain why the number of pages some users see has doubled.
Nevertheless, I also realised that some pages of mine have suddenly been indexed and I have been waiting for that to happen for a long time.
BTW: The google result information only says 'about' .
:cool:
Jan
vicaya
11-11-2004, 08:17 PM
Their number _includes_ dup entries (aliases to the same content). If your site redirects invalid links, especially with timestamps (Yahoo at least tries to remove some dups at index time), the dups in the index could be significant. They dedup at display time. That might explain the over-reporting of numbers.
One way to find out some of these stats is to use two secret id words, one that's the same on all the pages (to use with the site:* command) and one that's unique on every page. Then you'll know your dup profiles by querying these words.
My quick rare word/phrase test indicates that they have indeed about double the amount of valid results from either Yahoo (about 3-4 billion served, 10+ billion in webmap) or MSN beta (5 billion claimed)
Good job Google.
Nacho
11-12-2004, 01:25 AM
Without any doubt GOOGLE.COM is the #1 search engine!
I just launched a new website a couple days ago, result:
Google: Has indexed 2 levels deep, 78 pages indexed & cached as of 11/11/2004.
MSN Search (beta): Has indexed only the homepage as of 11/3/2004.
Yahoo! Search: Has indexed only the homepage as of before 11/9/2004.
You're talking about a SUPER spider that is capable of finding this one homepage with only ONE inbound link to it, and continuing to crawl it to index and analyze more pages almost instantaneously. All to have it ready for users to search the engine for a related keyword and give it an accurate rank on the SERPs in just a matter of 48 hrs.
You want FREE immediate inclusion . . . well here you go! Why spend your money on other PFI programs because of other search engines' lack of efficiency? They should all follow this leadership.
That is just amazing!
GoogleGuy, please congratulate everyone at the plex for me. To me Google continues to be the best engine by far. I can only wish your competitors can some day be capable of this level of scientific engineering with algorithmic search.
Saludos,
Nacho
vicaya
11-12-2004, 03:16 AM
Well, there is actually a lot of room for improvement; for example, they only index about the first 100k of a page. Given Google's resources, this cap is just silly. It's one of the pet peeves of literary types who want to search online books. Hope they'll pick this one up :)
zareef
11-12-2004, 03:59 AM
Yesterday Google reported a total page count of around 12,000 for one of my sites, while the actual page count on the site is only around 5,000.
Today they are reporting 4,480 again.
The same goes for my other sites.
I think Google is facing some kind of technical or logical problem with page counts.
mcanerin
11-12-2004, 04:09 AM
I think I discovered how they did it - TIME TRAVEL! I came across this site while checking my backlinks (they stole my home page while they were at it :mad: )
This is Google's cache of <spam site removed to protect the guilty - copyright infringers need love too...> as retrieved on 31 Dec 1969 23:59:59 GMT.
Google's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time. Click here for the current page without highlighting.
This cached page may reference images which are no longer available. Click here for the cached text only.
To link to or bookmark this page, use the following url: http://www.google.com/search?
Bold Added - 1969? Wow, they were stealing my content when I was only 2 years old! That's amazing! Truly stealing candy from a baby....
Ian
Marcia
11-12-2004, 04:55 AM
31 Dec 1969 23:59:59 GMT.
It's the Google Hippie Bug, mcanerin. Some can't forget the good old days and a few must have sneaked their way in there. ;)
Did the index all of a sudden increase in size, which would help a bit to confirm suspicions of multiple iterations and batch processing going on back in the "sandbox"? Or has it been progressively increasing and is just now being publicly reported as having increased in size?
Neither would totally preclude the possibility of multiple iterations and batch processing taking place, which it's been looking like (to me, anyway) with the protracted intervals between periodic "updates." At least it can look that way to anyone somewhat or even slightly familiar with a batch processing environment, with a bit of mainframe programming background.
Kind of funny that some data centers went off line from public viewing just about the same time the timed-release Toolbar PR update capsules started being released, isn't it? Hate to sound too existentialistic, but nothing can be looked at as just an "in itself" disconnected from everything else that takes place.
Nacho
11-12-2004, 12:20 PM
I found a great post by a Member on a thread titled Google Cache Date: Dec 31, 1969 (http://www.webmasterworld.com/forum3/26062.htm) at WebmasterWorld:
Found this definition of that timestamp:
"Sometimes a non-existing time, such as the time of creation of something that does not exist, is indicated as 23:59, 31 dec 1969" on UNIX based systems.
That figures the "Google Hippie Bug" <---- excellent name Marcia :)
Probably nothing to do with the "sandbox", but I could be wrong.
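The cached date discussed above is exactly one second before the Unix epoch (1 Jan 1970 00:00:00 GMT), which is what a stored timestamp of -1 decodes to. The sketch below only demonstrates that conversion. That Google stored -1 for a missing crawl time is an assumption, although -1 is a common "unknown" or error value for time functions (C's `mktime`, for instance, returns -1 on failure).

```python
from datetime import datetime, timedelta, timezone

# Unix time counts seconds from this instant.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_unix_seconds(seconds: int) -> datetime:
    """Convert a Unix timestamp (which may be negative) to a UTC datetime."""
    return UNIX_EPOCH + timedelta(seconds=seconds)


# Assumed sentinel value for "no valid time"; not confirmed by the thread.
cache_time = from_unix_seconds(-1)
formatted = cache_time.strftime("%d %b %Y %H:%M:%S GMT")
print(formatted)  # prints 31 Dec 1969 23:59:59 GMT
```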
Nacho
11-14-2004, 02:11 AM
Here's something interesting . . .
You do a search for britney spears (http://www.google.com/search?hl=en&q=britney+spears) and what do you get?
Results 1 - 10 of about 9,260,000 for britney spears. (0.13 seconds)
Will Google really serve 9,260,000 documents? I found out today that they don't. They will actually serve only the first 1,000 (http://www.google.com/search?q=britney+spears&hl=en&lr=&c2coff=1&start=990&sa=N&filter=0) documents. I tried to cheat a little bit and tweak the URL from start=990 to start=1090, but I had no luck; instead I got this (http://www.google.com/search?q=britney+spears&hl=en&lr=&c2coff=1&start=1090&sa=N&filter=0).
OK, as usual I'm not the first geek to spot this, but the only interesting discussion I could find was over at WebmasterWorld: Google won't let me see search results (http://www.webmasterworld.com/forum3/24288.htm). That was back in June of this year.
So, what's the point of displaying a result count of 9,260,000??? Is it just to show off between search engines? Is it just to confuse webmasters/site owners? Is it for real?
It's probably for real, I think. However, most probably it's an optimization of their servers and bandwidth. Who knows, perhaps GoogleGuy can shed some light here.
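The behaviour described above, where start=990 works and start=1090 does not, is consistent with a hard cap on how deep into a result list the engine will page. The sketch below models only that observed behaviour; the function name and its handling of out-of-range offsets are hypothetical, and how Google actually enforces the limit is not known from this thread.

```python
# Cap on served results, as observed in the post above.
MAX_SERVED_RESULTS = 1000


def result_window(start, per_page=10, max_served=MAX_SERVED_RESULTS):
    """Return the (first, last) 1-based result positions for a start offset.

    Returns None when the offset lies beyond the served cap.
    """
    if start < 0 or per_page <= 0:
        raise ValueError("start must be >= 0 and per_page must be > 0")
    if start >= max_served:
        return None
    return start + 1, min(start + per_page, max_served)


print(result_window(990))   # prints (991, 1000)
print(result_window(1090))  # prints None
```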
wiseMouse
11-14-2004, 06:49 AM
Just dropped down to 8,000,000,000 now when you search for the word "the". Nov 10 2300 GMT-5.
As if the whole world was speaking English...
bobmutch
11-14-2004, 01:59 PM
wiseMouse: We know that the whole world doesn't speak English. But those of us who do speak English check out the results in Google for the most commonly used English word, "the".
Everyman
11-14-2004, 11:47 PM
Let's say the user is using the default of 10 results per page.
The user puts in "Britney Spears". You go to the inverted index and see that there are 9 million docIDs for the first word, "Britney." (No, you don't count them. They are precounted from the last major update.)
You only need 10 results to satisfy the request. The docIDs are sorted by PageRank, and most likely also by how often the word appears in anchor text in backlinks to that page. Some combination of these two would be my guess. Freshness counts too, but those are probably inserted from a separate index.
You grab the top 30 results for sampling. You then look under "Spears" to see how many of those 30 docIDs also show up under "Spears." You discover that 80 percent of the "Britney" pages also contain the word "Spears." Bingo! You're finished.
This particular search can even be done by merely searching the "fancy" inverted index, as Sergey and Larry termed it in their original paper. This index is much smaller than the full-text inverted index, because it only includes the most important words related to a web page. If the words are obscure, then you'll need a larger sample for dual-word searches, and you may even have to do your lookups in the larger, full-text inverted index.
The size of the initial sample is a trade-off. If you sample too deep for the number of results requested, then you waste CPU cycles. If you sample too shallow, you may have to increase your sample and make a second pass, which also wastes CPU cycles. You might even save CPU cycles by keeping a dictionary of commonly-paired words, such as "Britney Spears," so that you know the optimum depth for sampling on these searches. Heck, for Britney they probably have it all pre-searched, and they only have to repeat the search once after every major update.
Any numbers over 1000 are very crude estimates. You know how deep you had to go into the inverted indexes to get your first 10 results, and you know the total number of docIDs after each word in the inverted index. You extrapolate from this what the total hits might be. It's extremely crude. Why bother with anything more complex? No one outside the Googleplex will ever be able to prove that any number you offer larger than 1000 is inaccurate. All you need is some number that's roughly believable.
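The sampling idea described in this post can be sketched in a few lines. This is an illustration of the poster's guess, not Google's actual method: the function name, the toy docID lists, and the fixed sample size of 30 are all assumptions made for the example.

```python
def estimate_joint_hits(first_term_docids, second_term_docids,
                        first_term_total, sample_size=30):
    """Estimate how many documents contain both terms from a small sample.

    first_term_docids: docIDs for the first term, best-ranked first.
    second_term_docids: set of docIDs that contain the second term.
    first_term_total: precounted number of docIDs listed under the first term.
    """
    sample = first_term_docids[:sample_size]
    if not sample:
        return 0
    # Fraction of sampled first-term documents that also contain the second term.
    overlap = sum(1 for doc_id in sample if doc_id in second_term_docids)
    # Extrapolate that fraction to the whole posting list.
    return round(first_term_total * overlap / len(sample))


# Toy data: 24 of the top 30 "britney" docIDs also appear under "spears" (80%).
britney_top = list(range(30))
spears_docids = set(range(24))

estimate = estimate_joint_hits(britney_top, spears_docids, 9_000_000)
print(f"{estimate:,}")  # prints 7,200,000
```

The estimate depends entirely on the top-ranked sample being representative of the full posting list, which is why the post calls numbers above 1000 very crude.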
If you are more interested in what journalists think than in what programmers think, you may even want to double any numbers over 1000. Use a sliding scale to reduce suspicion, or phase it in over time. Journalists aren't smart enough to ask, "Wow, how did the total pages increase from 4 billion to 8 billion overnight?" A few webmasters will point out that the totals are screwy for their own websites, but journalists won't have a clue.
Anyone who thinks that when Google says they have 8 million of anything, or 8 billion of anything, that they went through and examined every one of the pages and counted them just for you, has very little idea of how software engineers optimize searches. The number could be off by an order of magnitude.
Who can call them on it? No one. So who believes the number? Almost the entire world, apparently.
bobmutch
11-26-2004, 02:19 AM
Everyman: Google index doubled from 5bil to 8bil the same way their index doubled for the search "bob mutch" from 23k to 41K in a matter of 1 week. They didn't fool me. I thought it was funny though.
I just happen to know that there didn't all of a sudden end up 18k more documents on the net with "bob mutch" in it. I know I am a bit of a spam machine but I am quite sure I didn't do 18k of posts in a week or two.
Of course they couldn't have MSN come online claiming to have the biggest index (5 billion) so they, um, went out and crawled all the sites they had not been crawling, I guess. And all in one week. See, M$ is good for the net. They caused Google to crawl the other half of the net a week before they went online.
Dave Hawley
11-26-2004, 04:15 AM
Google index doubled from 5bil to 8bil the same way their index doubled for the search "bob mutch" from 23k to 41K in a matter of 1 week. They didn't fool me. I thought it was funny though.
I don't get it. Are you saying that Google's database is not 8 billion pages? If so, how would you ever know???
vicaya
11-26-2004, 07:55 AM
Google index doubled from 5bil to 8bil the same way their index doubled for the search "bob mutch" from 23k to 41K in a matter of 1 week. They didn't fool me. I thought it was funny though.
Ever heard of sandboxing? AFAIK, all search engines have multiple clusters. They update/test one until it passes internal QA and then flip it live. Google must have anticipated the msn launch and pulled a good one.
BTW, 8 billion docs searchable index is no big f* deal, given the resources G/Y/M have. The current state of the art technology can scale up to 80 billion or more with reasonable amount of hardware. The current crawlable content is about 25-30 billion and is growing at 5-7% rate per month.
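The 5-7% monthly growth figure quoted above compounds quickly. As a rough sketch, and assuming the monthly rate stays constant (the post gives only the range), the doubling time is ln 2 / ln(1 + r):

```python
import math


def doubling_months(monthly_rate: float) -> float:
    """Months for a quantity to double at a constant compound monthly growth rate."""
    if monthly_rate <= 0:
        raise ValueError("monthly_rate must be positive")
    return math.log(2) / math.log(1 + monthly_rate)


# Growth range quoted in the post above.
for rate in (0.05, 0.07):
    print(f"{rate:.0%} per month: doubles in about {doubling_months(rate):.1f} months")
```

At those rates the crawlable web would double roughly every 10 to 14 months.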
bobmutch
11-26-2004, 02:27 PM
vicaya: Yes I have heard of sandboxing. But what does sandboxing have to do with the doubling of the number of reported pages over a one-week period for an obscure phrase like 'Bob Mutch'? I don't think having 8 billion documents is a big deal either. That was not my point.
Dave Hawley: I noted that an obscure phrase doubled in number over a one-week period. Also, during the time they were "updating" the size of the index, a search on the word 'the' reported 9 billion entries. I wonder where the other 1 billion went. Do I think Google has an index of 8 billion? I really don't know. I just thought it was interesting how the number of documents for 2 key phrases I was following doubled.
vicaya
11-26-2004, 07:28 PM
Yes I have heard of sandboxing. But what does sandboxing have to do with the doubling of the number of reported pages over a one-week period for an obscure phrase like 'Bob Mutch'?
Sandboxing means that the crawling doesn't really happen over a week but probably over many months. I wouldn't be surprised if some SE doubled its serving size within a day due to sandboxing. If you really monitor a search index like I do, you monitor it by using a set of random (in the statistical sense) rare queries (bob mutch without quotes is hardly a rare query) that return fewer than 200 (or so) results, so that every result can be verified. When these result counts increase and are verified, you know the index has grown. The larger the random query set, the more confidence you have in your estimate.
BTW, Thompson's book on statistical sampling (http://www.amazon.com/exec/obidos/external-search?search-type=ss&tag=theendsofthew-20&keyword=sampling&mode=books) is a very good and practical reference.
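The monitoring approach described above can be sketched as follows. This is one plausible reading of the post, not the poster's actual tooling: the per-query ratio and the standard-error calculation are assumptions chosen for illustration, and the counts are hypothetical.

```python
from statistics import mean, stdev


def growth_estimate(before, after):
    """Estimate index growth from verified result counts for rare queries.

    before, after: verified result counts for the same queries at two points
    in time. Returns (mean growth ratio, standard error of that mean).
    """
    ratios = [a / b for b, a in zip(before, after) if b > 0]
    if len(ratios) < 2:
        raise ValueError("need at least two queries with non-zero 'before' counts")
    # The standard error shrinks with the square root of the number of queries,
    # which is why a larger random query set gives more confidence.
    return mean(ratios), stdev(ratios) / len(ratios) ** 0.5


# Hypothetical verified counts for five rare queries, before and after an update.
before_counts = [40, 120, 75, 10, 160]
after_counts = [76, 250, 150, 21, 310]

ratio, std_err = growth_estimate(before_counts, after_counts)
print(f"growth ratio ~ {ratio:.2f} (standard error {std_err:.2f})")
```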
bobmutch
11-26-2004, 07:44 PM
vicaya: I am still not sure I understand why you think sandboxing would have something to do with the reported pages of 'Bob Mutch' doubling over a one week period. It's been at 20,000 for a long time. I have not added to that number. I just don't see how you can say sandboxing has anything to do with the index adding 20k to their results for a key phrase that has been out there in internet land for months.
What is your reasoning on that one?
Dave Hawley
11-26-2004, 08:23 PM
a search on the word 'the' reported 9 billion entries. I wonder where the other 1 billion went.
These estimates have never been accurate. They cannot be used for anything.
vicaya
11-29-2004, 12:38 AM
vicaya: I am still not sure I understand why you think sandboxing would have something to do with the reported pages of 'Bob Mutch' doubling over a one week period. It's been at 20,000 for a long time. I have not added to that number. I just don't see how you can say sandboxing has anything to do with the index adding 20k to their results for a key phrase that has been out there in internet land for months.
What is your reasoning on that one?
Sandboxing means that the updating and serving clusters are two distinct clusters with two different roles. A serving cluster is never updated until it switches state with the updating cluster. So for a long time the serving cluster returns results r + r' (where r' << r is mixed in from a small, fast, daily-updating cluster), which is what you see (the 20k or so results), while the updating cluster is crawling and updating. You'll never see the results from the updating cluster while it's being updated and tested. Technically the switch can happen instantaneously. G basically timed the switch and stole the thunder from M.
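The serving/updating arrangement described above can be sketched as a toy model. The class and method names below are hypothetical, the small daily-updating cluster (r') is left out for brevity, and nothing here is known to match how any real search engine is built.

```python
class SearchService:
    """Toy model of two index clusters that swap roles."""

    def __init__(self, serving_docs, updating_docs):
        self.serving = set(serving_docs)    # answers live queries
        self.updating = set(updating_docs)  # crawled and tested offline

    def crawl(self, doc_id):
        # New documents only ever reach the updating cluster.
        self.updating.add(doc_id)

    def result_count(self):
        # Users only ever see what the serving cluster holds.
        return len(self.serving)

    def flip(self):
        # The switch is a role swap, so it can look instantaneous from outside.
        self.serving, self.updating = self.updating, self.serving


service = SearchService(range(20), range(20))
for doc_id in range(20, 43):   # months of crawling, invisible to users
    service.crawl(doc_id)

count_before_flip = service.result_count()
service.flip()
count_after_flip = service.result_count()
print(count_before_flip, count_after_flip)  # prints 20 43
```

The reported count stays flat while crawling goes on and then jumps at the moment of the flip, which is the pattern the post offers as an explanation for the sudden doubling.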
Dave Hawley
11-29-2004, 12:46 AM
G basically timed the switch and stole the thunder from M.
Again. :p to m$
bobmutch
11-29-2004, 11:31 AM
vicaya: I again seem to be missing your point. "Bob Mutch" has been 20k for a long time. Are you saying that Google has "sandboxed" 22,700 documents of "Bob Mutch" and that 22,700 of "Bob Mutch" was unsandboxed and added to the index at the time they "updated" their index?
This exchange has been concerning your comment "Ever heard of sandboxing? AFAIK, all search engines have multiple clusters. They update/test one until it passes internal QA and then flip it live." I just don't see where sandboxing has anything to do with the increase of the number of documents shown for "Bob Mutch" from 20k to 22,700 when it has been at 20k for months.
The 3 main theories on sandboxing are: 1) new sites are sandboxed and not able to get high rankings with Google for about 90 days; 2) new links are sandboxed and do not give a site any ranking weight for about 90 days; 3) sandboxing is a myth.
I don't see at all where sandboxing can have anything to do with how many results Google shows. My point in my post was just to note that at the time that Google doubled their index, some of the key phrases I was watching that have been very static doubled. You came back with this "ever heard of sandboxing," which still makes no sense to me at all.
vicaya
11-30-2004, 04:15 AM
vicaya: I again seem to be missing your point. "Bob Mutch" has been 20k for a long time. Are you saying that Google has "sandboxed" 22,700 documents of "Bob Mutch" and that 22,700 of "Bob Mutch" was unsandboxed and added to the index at the time they "updated" their index?
That's right. It's a very plausible explanation. The additional 22700 results became searchable by the time they "switched on/unsandboxed" the index with the content that's been crawled for many months.
The 3 main theories on sandboxing are: 1) new sites are sandboxed and not able to get high rankings with Google for about 90 days; 2) new links are sandboxed and do not give a site any ranking weight for about 90 days; 3) sandboxing is a myth.
Ah, looks like the term is the source of the confusion. I was using sandboxing as a generic engineering term, not specific to page rank updates, but as a way to roll out a new service without disrupting existing services. Basically the sandboxed new sites are not searchable at all until they're unsandboxed.
I don't see at all where sandboxing can have anything to do with how many results Google shows.
It has everything to do with how to roll out a new service, which could be a brand new search cluster without the rumored 4G (32-bit) doc limitation. Hopefully, my above explanation makes some sense.
bobmutch
11-30-2004, 10:47 AM
vicaya: Yes, it all makes sense now. It was the term 'sandbox' that threw me for a loop. I just couldn't figure out what sandboxing, as we use the term in SEO, had to do with it all : )
Well it is possible that Google came up with twice as many pages for 'bob mutch' and for 'mutch' also. It just seems very fishy to me as 'mutch' would be a pretty static word. It's a last name and there are not very many of us around : )
Also, for those that think 'Mutch' is a dictionary word with a definition, it isn't; it's a Scottish surname, though surnames are of course considered words.
Dave Hawley
11-30-2004, 09:21 PM
It's a last name and there are not very many of us around
It is also a word, which is part of what I tried to point out before.