Old 09-09-2004   #1
hajith
Member
 
Join Date: Jun 2004
Location: Chennai
Posts: 12
How is this possible?

Google's home page currently says "Searching 4,285,199,774 web pages."

But when I search for the term "the" in Google, it shows 5,780,000,000 pages.

If it has only 4,285,199,774 web pages in its index, how can it display 5,780,000,000 pages?

My guess: the remaining pages come from the Supplemental Index?

One more speculation: Google uses an unsigned long integer in ANSI C to assign a unique ID to each and every web page on the Internet. That variable is four bytes long and can hold values only up to 4,294,967,295. So if it is true that Google uses this method, how can it display 5,780,000,000 pages for the search term "the"?
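For illustration, here is a minimal C sketch of that ceiling (whether Google actually stores docIDs in a 4-byte unsigned integer is pure speculation; I use C99's uint32_t, which is guaranteed to be 4 bytes, since an unsigned long can be 8 bytes on some systems):

Code:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* The ceiling of a 4-byte unsigned integer is 4,294,967,295.  One
 * past it, the value wraps around to 0: no room for a 5-billionth ID. */
int main(void)
{
    uint32_t max = UINT32_MAX;               /* 4,294,967,295 */
    printf("largest 4-byte docID: %" PRIu32 "\n", max);
    printf("max + 1 wraps to:     %" PRIu32 "\n", (uint32_t)(max + 1u));
    return 0;
}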

Can anyone explain?
Old 09-09-2004   #2
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
 
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
Probably because the count on Google's home page isn't live but is instead updated manually every so often. It's sort of like the old McDonald's signs that said 200 million served: they might actually have served 300 million, but no one bothered to change the sign.
Old 09-09-2004   #3
hajith
Member
 
Join Date: Jun 2004
Location: Chennai
Posts: 12
I agree that they probably haven't updated the number on their home page. But what about this:

One more speculation: Google uses an unsigned long integer in ANSI C to assign a unique ID to each and every web page on the Internet in order to identify them. That variable is four bytes long and can hold values only up to 4,294,967,295.

So if it is true that Google uses this method, how can it display 5,780,000,000 pages for the search term "the"?
Old 09-09-2004   #4
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
 
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
That's being discussed over here: Google malfunctioning?

I also posted a summary article yesterday for our SEW paid members here:

Google Out Of Index Space?

It's a short recap wrapping up some previous articles on that topic.
Old 09-09-2004   #5
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
I've noticed some referrals from that "members only" piece, but I'm not a member so I haven't read it. Since I haven't read it, I can't comment on it.

But I hope you do a better job of criticism than Chris Riding, the owner of Search Guild, who has written an article critical of me at http://www.searchguild.com/article215.html

I tried to point out the deficiencies in his technical assumptions on his forum over a year ago and got nowhere.

Then I tried again last week, and after a few exchanges he locked the thread. Now he has written the above article, which I feel deserves an answer. Since he locked that thread, I can't post my answer on his site.

First of all, Chris has made remarkable progress in his understanding of inverted indexes. A year ago his arguments were quite absurd. Now his description of how an inverted index works is fairly good.

However, he's obviously spinning things. I'd be less suspicious of his motives if Google Adsense ads weren't plastered all over his site. But I'll leave that aside for now.

Chris and I finally agree that if Google is still using a 4-byte docID, then they have to go to 5 bytes. I suggested in my piece over a year ago that they could read in an extra byte, mask off the bits they don't need for the new docID, and use this as a multiplier for the old 4-byte integer.

Chris says I'm full of it because he would use a "long long" integer of 8 bytes, and strip off the unused 3 bytes on read and write.

Both methods require several extra lines of code for reading and writing. Sure, I'll do it the way Chris suggests. Six of one, half a dozen of the other. No big deal. Undoubtedly Google would study the respective CPU cycles for each method and pick the more efficient one. I wouldn't hazard a guess about the relative efficiency of each without experimenting.
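For illustration only (nobody outside Google knows the real on-disk format, and the function names and the mask here are made up), the two read routines might look something like this:

Code:
#include <stdint.h>
#include <string.h>

/* My approach: read the old 4-byte integer plus one extra byte,
 * mask off the bits the format doesn't use, and treat the extra
 * byte as a multiplier on top of the old 32-bit value. */
uint64_t read_docid_mask(const uint8_t *p)
{
    uint32_t low;
    memcpy(&low, p, 4);              /* the original 4-byte docID  */
    uint64_t high = p[4] & 0x3F;     /* 0x3F is a placeholder mask */
    return (high << 32) | low;
}

/* Chris's approach: hold the ID in an 8-byte "long long" in memory,
 * but read and write only 5 of its bytes, stripping the unused 3. */
uint64_t read_docid_longlong(const uint8_t *p)
{
    uint64_t id = 0;
    memcpy(&id, p, 5);               /* little-endian assumption   */
    return id;                       /* top 3 bytes remain zero    */
}

Either way, you end up with a handful of extra instructions on every docID read and write.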

This issue of methodology, which Chris uses as the basis for his entire criticism, is a red herring. Each method requires extra code, and that is the essential point. Lots of code has to be changed. Since the docIDs have now gone from 4 to 5 bytes, and they are packed tightly back-to-back, all the offsets have shifted. That means all the code that scans the inverted index for docIDs has to be changed.
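Concretely (again, a purely hypothetical sketch), every scanner over a packed run of docIDs bakes the width into its offsets, so widening it from 4 to 5 bytes touches all of them:

Code:
#include <stdint.h>
#include <string.h>

#define DOCID_BYTES 5   /* was 4; every offset below depends on it */

/* Walk a tightly packed run of docIDs.  Change DOCID_BYTES and every
 * piece of code like this has to be found and updated. */
void scan_docids(const uint8_t *list, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t id = 0;
        memcpy(&id, list + i * DOCID_BYTES, DOCID_BYTES);
        (void)id;   /* ... look up postings for this docID ... */
    }
}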

And the space problem is not trivial, whatever Chris wants us to believe. Every occurrence of every word on every web page in Google's index is recorded against a docID. In fact, since Google uses two inverted indexes, each occurrence carries, on average, two docIDs. One extra byte per docID adds up fast: about 2.4 terabytes of added space per copy of the inverted index. There are many copies of this index in RAM on Google's distributed system at any one time, because it's the first index consulted when handling any search request.
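A back-of-envelope check (the page count is from this thread; the 280-words-per-page average is my own assumption, picked only to show how a figure of that size could arise):

Code:
#include <stdio.h>

int main(void)
{
    double pages   = 4.285e9;  /* pages, per Google's home page     */
    double words   = 280.0;    /* assumed average words per page    */
    double indexes = 2.0;      /* two inverted indexes              */
    double extra   = 1.0;      /* one extra byte per docID (4 -> 5) */

    double added_bytes = pages * words * indexes * extra;
    printf("added space: %.1f TB\n", added_bytes / 1e12);  /* ~2.4 */
    return 0;
}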

Chris's argument is an exercise in generating fog. I expect this sort of thing from the Googleplex, but not from a mere fan of Google's Adsense program.
Old 09-09-2004   #6
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
 
Join Date: May 2004
Location: Search Engine Land
Posts: 2,085
Quote:
But I hope you do a better job of criticism than Chris Riding, the owner of Search Guild, who has written an article critical of me.
As I said, it was a short recap of the allegations put out lately, not an in-depth analysis like you're talking about. I'll send you a copy.
Old 09-09-2004   #7
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
Quote:
So if it is true that Google uses this method, how can it display 5,780,000,000 pages for the search term "the"?
Because Google isn't counting them.

search for the = 5,780,000,000
search for allinurl: the = 5,780,000,000
search for allintitle: the = 5,780,000,000
search for allintext: the = 5,780,000,000

I don't believe the "Searching 4,285,199,774 web pages" figure either, and I have never made it a major argument, even though it would seem to support my position.
Old 09-10-2004   #8
chris
www.searchguild.com - like Threadwatch.org but with more threads :)
 
Join Date: Jun 2004
Posts: 21
Quote:
First of all, Chris has made remarkable progress in his understanding of inverted indexes. A year ago his arguments were quite absurd. Now his description of how an inverted index works is fairly good.
I really can't accept credit. It must've been the computer adding in bits when I copied and pasted that section from the articles I wrote a year ago (here). <checks> No... it's the same.

Quote:
However, he's obviously spinning things. I'd be less suspicious of his motives if Google Adsense ads weren't plastered all over his site. But I'll leave that aside for now.
How very decent of you. Let's not mention that, I'd rather it secret. Oh, hang on...you have mentioned it. Bottoms...and I put those Adsense ads on so discreetly too.

Quote:
Chris and I finally agree that if Google is still using a 4-byte docID
No we don't. I never said that. I never will. Read again; I cunningly hid my disagreement in phrases like "Google docIDs ain't broke", "I'm fed up with it so I'm going to tell you why the theory is wrong", "There is strong evidence to suggest that in the original Google code they used a 4 byte docID (they wrote it in their papers!). That of course doesn't mean that's what they continued with after that point. Nor does it mean it can't be changed (we'll go into that later; for now I just want to explain where the 4.2 billion comes from).", and "I've shown that Google likely has more than 4.2 billion pages which also means that they likely have more than a four byte docID". Maybe I should have been clearer?

I don't specify what system or format Google uses for docIDs, because how would anyone know for sure? The article merely counters the incorrect arguments being made by people who say it's stuck at 4 bytes and that that's a problem.

Quote:
Both methods require several extra lines of code for reading and writing.
I apologise for forgetting to put line numbers next to my code experiments given as examples.

Quote:
That's a lot of space -- about 2.4 terabytes
Or about 2.4 times the size of the original BackRub in Larry's dorm room. Maybe we could all club together and buy them some lego and a couple of hard drives? Who's in?

Quote:
Chris's argument is an exercise in generating fog. I expect this sort of thing from the Googleplex, but not from a mere fan of Google's Adsense program.
Okay, okay, I confess. The Googleplex called me up and told me that they had this really big problem with docIDs and it was getting out, and they thought that, seeing as they'd been sending me Adsense checks every month, I should help them out. I tried to say no, but they were just oh so sweet and mentioned something like "do I want my sites to rank for xyz". I found myself agreeing with everything they said. Logic, of course, has nothing to do with it. I tell you what: call me up, speak nicely to me, offer me a few hundred dollars, and I'll generate some fog for you. How does "Google docIDs are broke - absolute proof" work for you?


Okay, so maybe this post is a little bit naughty. But honestly, if the only arguments provided for the case now rest on personal attacks against me, then I consider my point proved. If the article explains and thinks through the issues better than anything I wrote before (a year ago, a week ago, whatever), that's solely because the theory only became worth thinking about properly once people started to believe it. I make no apologies for that; the logic stands alone, and logic either holds or it doesn't.

Last edited by chris : 09-10-2004 at 09:45 AM.