#1
How is this possible?
According to Google's home page right now, it is "Searching 4,285,199,774 web pages."
But when I search for the term "the" in Google, it reports 5,780,000,000 pages. If it has only 4,285,199,774 web pages in its index, how can it display 5,780,000,000 pages? A guess: are the remaining pages from the Supplemental Index? Another speculation is that Google uses an unsigned long integer in ANSI C to assign a unique ID to every web page on the Internet. This variable is four bytes long and can hold values up to 4,294,967,295. So if it is true that Google uses this method, how can it display 5,780,000,000 pages for the search term "the"? Can anyone explain?
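To make the arithmetic in the question concrete, here is a minimal sketch in Python. It assumes a platform where a C unsigned long is 32 bits wide, as the speculation does; the variable and function names are made up for illustration and say nothing about what Google actually runs.

```python
# Largest value a 4-byte unsigned integer can hold. This assumes a
# 32-bit C "unsigned long", as in the speculation above.
UINT32_MAX = 2**32 - 1

# Figures quoted in this post (names are hypothetical).
home_page_count = 4_285_199_774   # "Searching ... web pages" on the home page
the_result_count = 5_780_000_000  # count reported for the query "the"


def fits_in_uint32(n: int) -> bool:
    """Return True if n can be stored in 4 unsigned bytes."""
    return 0 <= n <= UINT32_MAX


print(UINT32_MAX)                        # 4294967295
print(fits_in_uint32(home_page_count))   # True
print(fits_in_uint32(the_result_count))  # False
print(UINT32_MAX - home_page_count)      # 9767521 IDs left under a 4-byte scheme
```

The home page figure sits just under the 4-byte ceiling, while the count for "the" is well past it, which is exactly the mismatch being asked about.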
#2
Probably because the count on Google's home page isn't live but instead manually updated every so often. Sort of like the old McDonald's signs that said 200 million served. They might have actually served 300 million, but no one bothered to change the sign.
#3
I agree that they probably have not changed the number on their home page. But what about this?
Another speculation is that Google uses an unsigned long integer in ANSI C to assign a unique ID to every web page on the Internet in order to identify it. This variable is four bytes long and can hold values up to 4,294,967,295. So if it is true that Google uses this method, how can it display 5,780,000,000 pages for the search term "the"?
#4
That's being discussed over here: Google malfunctioning?
I also posted a summary article yesterday for our SEW paid members here: Google Out Of Index Space? It's a short recap wrapping up some earlier articles about that topic.
#5
I've noticed some referrals from that "members only" piece, but I'm not a member so I haven't read it. Since I haven't read it, I can't comment on it.
But I hope you do a better job of criticism than Chris Riding, the owner of Search Guild, who has written an article critical of me at http://www.searchguild.com/article215.html

I tried to point out the deficiencies in his technical assumptions on his forum over a year ago and got nowhere. Then I tried again last week, and after a few exchanges he locked the thread. Now he has written the above article, which I feel deserves an answer. Since he locked that thread, I can't post my answer on his site.

First of all, Chris has made remarkable progress in his understanding of inverted indexes. A year ago his arguments were quite absurd. Now his description of how an inverted index works is fairly good. However, he's obviously spinning things. I'd be less suspicious of his motives if Google Adsense ads weren't plastered all over his site. But I'll leave that aside for now.

Chris and I finally agree that if Google is still using a 4-byte docID, then they have to go to 5 bytes. I suggested in my piece over a year ago that they could read in an extra byte, mask off the bits they don't need for the new docID, and use this as a multiplier for the old 4-byte integer. Chris says I'm full of it because he would use a "long long" integer of 8 bytes, and strip off the unused 3 bytes on read and write. Both methods require several extra lines of code for reading and writing. Sure, I'll do it the way Chris suggests. Six of one, half a dozen of the other. No big deal. Undoubtedly Google would study the respective CPU cycles for each method and pick the one that is most efficient. I wouldn't hazard a guess about the relative efficiency of each method without experimenting.

This issue of methodology that Chris uses as a basis for his entire criticism is a red herring. Each method requires extra code, and that is the essential point. Lots of code has to be changed.
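For readers who want to see the two approaches side by side, here is a rough Python sketch. It is only an illustration under stated assumptions: a little-endian layout with the four old bytes first and the extra byte after them, all 8 bits of the extra byte in use, and invented function names. Nobody outside Google knows their actual format.

```python
import struct

# Hypothetical mask: keep all 8 bits of the extra byte. A real format
# might reserve some of those bits for other purposes.
HIGH_MASK = 0xFF


def write_docid(doc_id: int) -> bytes:
    """Pack as an 8-byte integer, then strip the 3 unused high bytes."""
    if not 0 <= doc_id < 2**40:
        raise ValueError("docID does not fit in 5 bytes")
    return struct.pack("<Q", doc_id)[:5]


def read_docid_multiplier(buf: bytes) -> int:
    """Read the old 4-byte value plus one extra byte used as a multiplier."""
    low, extra = struct.unpack("<IB", buf[:5])
    return (extra & HIGH_MASK) * 2**32 + low


def read_docid_long_long(buf: bytes) -> int:
    """Pad the 5 stored bytes back to 8 and read them as one 64-bit integer."""
    return struct.unpack("<Q", buf[:5] + b"\x00\x00\x00")[0]


encoded = write_docid(5_780_000_000)
print(len(encoded))                    # 5
print(read_docid_multiplier(encoded))  # 5780000000
print(read_docid_long_long(encoded))   # 5780000000
```

Both readers recover the same value from the same five bytes, which fits the point that the choice between them is a matter of efficiency rather than of principle.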
Since the docIDs have now gone from 4 to 5 bytes, and they are all packed tightly back-to-back, all the offsets have now shifted. That means all the code that scans the inverted index for docIDs has to be changed.

And the space problem is not trivial, as Chris wants us to believe. Every word on every web page in Google's index gets its own docID. In fact, since Google uses two inverted indexes, this means that every word on every web page uses, on average, two docIDs. That's a lot of space: about 2.4 terabytes of added space per copy of the inverted index. There are many copies of this index in RAM on Google's distributed system at any one time, because it's the first index that has to be consulted when handling any search request.

Chris's argument is an exercise in generating fog. I expect this sort of thing from the Googleplex, but not from a mere fan of Google's Adsense program.
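The offset problem can be shown with a toy example. This Python sketch assumes docIDs are stored back-to-back as fixed-width little-endian integers with nothing in between; the helper names and sample IDs are invented, and a real inverted index carries much more than bare docIDs.

```python
def pack_ids(doc_ids, width):
    """Store docIDs back-to-back, each as a fixed-width little-endian integer."""
    return b"".join(d.to_bytes(width, "little") for d in doc_ids)


def read_nth(buf, n, width):
    """Read the n-th docID, assuming every entry is exactly `width` bytes."""
    start = n * width
    return int.from_bytes(buf[start:start + width], "little")


doc_ids = [7, 1_000_000, 4_000_000_000]  # made-up sample IDs
old = pack_ids(doc_ids, 4)  # old 4-byte layout
new = pack_ids(doc_ids, 5)  # widened 5-byte layout

print(read_nth(old, 1, 4) == 1_000_000)  # True: old scanner, old data
print(read_nth(new, 1, 5) == 1_000_000)  # True: updated scanner, new data
print(read_nth(new, 1, 4) == 1_000_000)  # False: old scanner lands mid-entry
print(len(new) - len(old))               # 3 extra bytes, one per docID
```

Any scanner that still steps through the data four bytes at a time reads from the wrong place after the first entry, and the storage grows by one byte for every docID stored.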
#6
Quote:
#7
Quote:
search for the = 5,780,000,000
search for allinurl: the = 5,780,000,000
search for allintitle: the = 5,780,000,000
search for allintext: the = 5,780,000,000

I don't believe the "Searching 4,285,199,774 web pages" either, and have never used this number as a major argument in support of my position, even though it would seem to support my position.
#8
Quote:
Quote:
Quote:
I don't specify what system or format Google uses for docIDs, since how would anyone know for sure? The article merely counters the incorrect arguments made by people who say it's stuck on 4 bytes and that this is a problem.
Quote:
Quote:
Quote:
Okay, so maybe this post is a little bit naughty. But honestly, if the only arguments offered for the case now rest on personal attacks against me, then I consider my point proved. If the article is a better explanation and a better thinking-through of the issues than I made before (a year ago, a week ago, whatever), then it's solely because the subject only becomes worth thinking about properly when people start to believe it. I make no apologies for that. The logic stands alone, and logic either is or isn't.
Last edited by chris : 09-10-2004 at 09:45 AM.