Search Engine Watch Forums
Old 03-09-2005   #1
inlogicalbearer
SEO and Marketing News from North
 
Join Date: Aug 2004
Location: Montreal, Quebec, Canada
Posts: 66
inlogicalbearer is on a distinguished road
Google Index Size Numbers have decreased

My quick estimate is a drop of around 50%. If you take the list Veronis made 6 weeks ago and check the same queries on Google today, you'll see a huge difference.

http://aixtal.blogspot.com/2005/01/web-googles-counts-faked.html

I also checked the 12 most-used words in English, and the numbers show almost the same drop.

http://inlogicalbearer.blogspot.com/2005/03/something-have-changed-at-google.html

Did they start using the new clustering algo?

http://news.com.com/Google+practices+dividing+to+conquer/2100-1024_3-5605127.html?tag=cd.top

Last edited by Marcia : 03-11-2005 at 11:27 AM. Reason: Minor modifications.
Old 03-09-2005   #2
Michael Martinez
Member
 
Join Date: Jul 2004
Posts: 336
Michael Martinez is on a distinguished road

Mr. Veronis doesn't seem to understand that "about" means the query tool took a guess at how many actual pages may contain the word "the" (which is also a word in other languages, not simply English -- it can be a name in some Asian languages, for example).

"to", "in", and "it" are also found in other languages. So are "for", "that", "a", and "of".

This is a flawed metric, although it may still be an indication of an algorithmic change.

The query tool cannot possibly process every Web site in the space of a few seconds. It actually cuts off results after a certain amount of time or number of iterations in collecting data, and then just works on the abbreviated data set.
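Roughly, as a toy sketch (my own illustration, nothing to do with Google's actual code), the cutoff and the "about" figure could relate like this:

Code:
# Toy sketch of an early-terminated count estimate; my illustration only.
# Doc ids are assumed sorted and roughly uniform across the index.
def estimate_count(posting_list, index_size, cutoff=1000):
    matches = 0
    last_doc_id = 0
    for doc_id in posting_list:
        matches += 1
        last_doc_id = doc_id
        if matches >= cutoff:
            # `cutoff` matches appeared within the first `last_doc_id`
            # docs; assume the same density across the whole index.
            return matches * index_size // last_doc_id
    return matches  # short posting list: the count is exact

# estimate_count(range(2, 10_000_000, 2), 8_000_000_000) -> 4,000,000,000

Small shifts in where the scan stops, or in which machines answer, would move the extrapolated number around, which fits the "about" wobble people see.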

You cannot use the search engine to accurately measure the extent of the indexing of common words in any popular language.
Old 03-09-2005   #3
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
PhilC has much to be proud of
The numbers are constantly changing. Google continually changes the datacenters from which you receive the results. The datacenter (and the numbers) very often change from page to page of the results.

Of the 45 datacenters that I know about, there are currently (as I write this) 8 different sets of results, each giving different "about" numbers. They have been different for so long that I am certain it has nothing to do with converging the datacenters.

I can think of 2 reasons for it. One is that different datacenter sets use different algos, and the other is that they have different indexes. I see no reason for them to have different indexes for such a long period of time (weeks), so I assume they employ slightly different algos. It does make some sense to me - they can test new tweaks without affecting all the datacenters.

Matt Cutts has said that Google uses several different algos at random. I didn't believe it until I started watching the datacenters, but now I am inclined to believe it.
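If that's right, it needn't even be truly random. One deterministic scheme (pure speculation on my part, with made-up names) would hash each DC into a bucket that maps to an algo variant, so a single DC stays consistent while DCs differ from each other:

Code:
# Pure speculation: run several algos "at random" by hashing each
# datacenter (or query) into a bucket, one bucket per algo variant.
import hashlib

ALGO_VARIANTS = ["baseline", "tweak_a", "tweak_b"]  # hypothetical names

def pick_variant(datacenter_id):
    digest = hashlib.md5(datacenter_id.encode()).hexdigest()
    return ALGO_VARIANTS[int(digest, 16) % len(ALGO_VARIANTS)]

# pick_variant("64.233.161.104") always returns the same variant, so one
# DC's results stay stable while different DCs can disagree.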
Old 03-09-2005   #4
inlogicalbearer
SEO and Marketing News from North
 
Join Date: Aug 2004
Location: Montreal, Quebec, Canada
Posts: 66
inlogicalbearer is on a distinguished road
I have monitored Google since '99, and apart from Florida, I have never seen Google's counts drop to half, or even a third, of what they were a few weeks earlier across so many keywords. "linux", for example, declined from 222 million results to 81.3 million!

Even the brand "Firefox", which has been on a growth curve for months, has lost 3.8 million pages in 6 weeks. Nah, that isn't just the datacenters.
Old 03-09-2005   #5
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
PhilC has much to be proud of
I don't think I've seen the "abouts" halve between the DCs, but I've seen close to that in the last few weeks - 10.6 million in some DCs compared with 6.x and 7.x million in others. Right now they aren't so spread, but the numbers keep changing in all the DCs anyway.
Old 03-09-2005   #6
Michael Martinez
Member
 
Join Date: Jul 2004
Posts: 336
Michael Martinez is on a distinguished road

As long as there are considerable inconsistencies between the various data centers, it's generally impossible to draw any reliable conclusions about what is going on with Google. It's best to just wait until they finish doing whatever it is that they are doing and then pick up the pieces and start over.
Old 03-09-2005   #7
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
PhilC has much to be proud of
Believable conclusions, no. But it's still interesting to consider what might be happening.
Old 03-09-2005   #8
jmandrake
Member
 
Join Date: Mar 2005
Location: San Diego, CA
Posts: 9
jmandrake is on a distinguished road
Quote:
Originally Posted by Michael Martinez
As long as there are considerable inconsistencies between the various data centers, it's generally impossible to draw any reliable conclusions about what is going on with Google. It's best to just wait until they finish doing whatever it is that they are doing and then pick up the pieces and start over.
It would be nice to know when they're "finished doing whatever they're doing". Seems they're always in a state of flux somewhere.

Just a thought: seeing that they have so many datacenters, wouldn't it make sense that each one follows its own array of links to spider? Then they'd just compare and combine results on an ongoing basis... If they all follow their own paths on the web with their own spiders it would make sense that each DC's index would be different from the next at any given time even if they all follow the same algorithms.
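Sketching that out (just my reading of the idea, not anything Google has published, and all names made up), the ongoing compare-and-combine step might look like:

Code:
# Sketch of the idea above: every DC crawls its own frontier, and the
# per-DC indexes get merged on an ongoing basis.
def merge_indexes(dc_indexes):
    """dc_indexes: list of dicts mapping url -> (fetch_time, document)."""
    merged = {}
    for index in dc_indexes:
        for url, (fetch_time, doc) in index.items():
            # keep whichever DC fetched the page most recently
            if url not in merged or fetch_time > merged[url][0]:
                merged[url] = (fetch_time, doc)
    return merged

# Between merges, each DC's view of the web differs, which alone would
# produce different "about" numbers even under identical algorithms.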
Old 03-10-2005   #9
Marcia
 
 
Join Date: Jun 2004
Location: Los Angeles, CA
Posts: 5,476
Marcia has a reputation beyond repute
Truth be told, I think we need to chill out, wind down and not jump to any premature conclusions or suppositions.

Google has been very big on link analysis in its algo. PageRank (and HITS, from right around the same time) are the grand-daddies of link-based relevancy, and PR and links as relevancy indicators were foundational for Google. There's also conjecture (with substantial credibility) that they keep a separate database for titles and anchor text.

We're moving into what some call the "third generation" of search technology. It may be that they are working on correlations between lexical and link analysis for relevancy. Even a surface look at the published papers on those kinds of algorithms and systems shows how complex they are, and how difficult it would be to implement them while still maintaining relevancy in the search results.

I personally don't believe Google or any search engine will ever be "finished" refining; they will keep moving toward the closest they can get to results that are relevant from the perspective of the everyday user.

If search were dealing with fixed entities, as in a lab environment, it could be considered a "science", but considering the volatility of the web and the nature of the sites that any engine indexes, it's no surprise when it turns from science into art. Anyone conjecturing beyond that is standing on shifting sands. We can't expect anything less than constant "testing" as search technology evolves.

We just have to keep evolving our own thinking enough to adapt to whatever is going on inside theirs.

I think any of us who think Google, or any engine for that matter, will ever be finished are dreaming. It'll never be finished.

Last edited by Marcia : 03-10-2005 at 01:28 AM.
Old 03-10-2005   #10
Michael Martinez
Member
 
Join Date: Jul 2004
Posts: 336
Michael Martinez is on a distinguished road

Quote:
Originally Posted by jmandrake
It would be nice to know when they're "finished doing whatever they're doing". Seems they're always in a state of flux somewhere.
The last time they went through something like this, it took a couple of months, give or take, until things settled down again.

Of course, there is always the chance that they have turned the box upside down for good (they have done that a couple of times in the past).

Quote:
Just a thought: seeing that they have so many datacenters, wouldn't it make sense that each one follows its own array of links to spider? Then they'd just compare and combine results on an ongoing basis... If they all follow their own paths on the web with their own spiders it would make sense that each DC's index would be different from the next at any given time even if they all follow the same algorithms.
That would really depend on what is done with the spidering results. If the retrieved documents are shared between data centers immediately, so that each can process the documents with its own servers, then they should be able to maintain pretty good synchronization.

If they are indeed using more than one algorithm across the various data centers, then one should expect variant results (at least for some searches).

Each document is broken down into components and the content is indexed so that it can be identified for relevance to queries:

Quote:
Important, high-quality sites receive a higher PageRank, which Google remembers each time it conducts a search. Of course, important pages mean nothing to you if they don't match your query. So, Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines all aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.
Emphasis is mine.

Source: http://www.google.com/technology/

Doing that for 8,000,000,000 pages is going to take a while even with their vast resources.
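Read naively (a toy sketch of my own, certainly nothing like Google's real scoring), that passage amounts to a final score combining a query-independent PageRank with a query-dependent text-match score that looks beyond raw term counts:

Code:
# Toy reading of the quoted passage; my sketch only, invented weights.
def text_match_score(body_hits, anchor_hits, title_hit):
    # "far beyond the number of times a term appears": damp raw body
    # counts and fold in anchor text from linking pages plus the title.
    score = min(body_hits, 10) / 10.0
    score += 0.5 * min(anchor_hits, 10) / 10.0
    score += 0.5 if title_hit else 0.0
    return min(score, 1.0)

def final_score(pagerank, match_score, pr_weight=0.3):
    # both inputs normalized to [0, 1]
    return pr_weight * pagerank + (1.0 - pr_weight) * match_score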
Old 03-10-2005   #11
hardball
Member
 
Join Date: Oct 2004
Posts: 83
hardball will become famous soon enough
Quote:
Originally Posted by inlogicalbearer
Did they start using the new clustering algo ?
Looks like it. Most clustering engines show clustered results to the side or as a supplemental feature; I believe Google has integrated clustered results directly into the SERPs.

Using some open source indexing and clustering tools, I've been able to get pretty close to Google's result sets in topic areas with data sets in the 1/2 million range. Clustering would explain most of the speculation regarding LSI, and it goes a long way toward explaining the less-than-pinpoint keyword accuracy, yet oddly relevant, results Google is now showing.
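To make "clustering the results" concrete, here's a stripped-down leader-style clustering over result titles (my own toy code, nothing like what Google or those open source tools actually run):

Code:
# Toy leader-style clustering of a result set by title-word overlap.
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cluster_results(titles, threshold=0.3):
    clusters = []  # list of (leader_token_set, member_titles)
    for title in titles:
        tokens = set(title.lower().split())
        for leader_tokens, members in clusters:
            if jaccard(tokens, leader_tokens) >= threshold:
                members.append(title)
                break
        else:
            clusters.append((tokens, [title]))
    return [members for _, members in clusters]

# cluster_results(["linux kernel download", "download linux kernel source",
#                  "firefox browser review"]) -> two clusters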
Old 03-10-2005   #12
PhilC
Member
 
Join Date: Oct 2004
Location: UK
Posts: 1,657
PhilC has much to be proud of
Just out of interest:-

For the last 3 weeks or so (since I started watching closely), the DCs have been split into 2 roughly equal groups. The DCs within a group show reasonably close results, although each group contains sets of DCs that produce different "about" figures, so the results within a group are not all identical.

Until 2 days ago, the 2 groups showed no real signs of coming together, but then one group started to grow. Right now it comprises about 2 thirds of the DCs, and the growth seems to be lasting. It's too early to say that the DCs are slowly converging, and I actually expect things to move the other way, but maybe they won't.

One thing to know if you watch the DCs is that you don't always receive the results from the DC that you request them from. Sometimes a request to a DC will be redirected to another DC, perhaps while an update is being done, or changes are being made, or for load balancing. So some DCs can appear to be 'dancing' when they aren't. That's my belief from watching them, anyway.

The separate index for titles and anchor text was described in Brin and Page's "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper, and it's been assumed that the paper generally described the architecture of Google. The paper said that, when Google receives a search query, it first tries to get a sufficiently large result set (about 40,000) from the index that contains the titles and anchor text, and it only goes to the main index if a sufficiently large set can't be obtained from there. That is one very big reason why titles and anchor text are so important to rankings.
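In pseudocode, the lookup described there goes something like this (my paraphrase of the paper, with the indexes reduced to plain dicts for illustration):

Code:
# Paraphrase of the Anatomy paper's lookup: search the small title/anchor
# index first; fall back to the full-text index only if it can't supply a
# large enough result set (about 40,000 documents).
TARGET_SET_SIZE = 40_000

def retrieve(query, title_anchor_index, full_text_index):
    results = title_anchor_index.get(query, [])
    if len(results) >= TARGET_SET_SIZE:
        return results  # titles and anchor text alone were enough
    # not enough short-index hits, so widen to the full-text index
    return results + full_text_index.get(query, [])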
Old 03-11-2005   #13
inlogicalbearer
SEO and Marketing News from North
 
Join Date: Aug 2004
Location: Montreal, Quebec, Canada
Posts: 66
inlogicalbearer is on a distinguished road
Seems Google is back to normal today. The main stopwords are back to around 8 billion results, and the counts for the other keywords are back too.