Search Engine Watch
Old 12-23-2004   #1
rustybrick
 
 
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
When Does Google Really Index a Page?

In an earlier thread, Google Not Obeying the NoIndex NoFollow Meta Tag, we drifted into a discussion about when a page is considered to be in the Google index.

As you can see by that thread, it is not black and white. If any of you have documentation from Google or a quote from GoogleGuy on the official definition of when a page is really indexed, please state it here.

But to make this a fun thread, let's discuss the factors that could logically justify saying a page is in the index versus not in the index.

One example is when the URL alone is listed in the SERPs and nothing else. In the example from the Google Not Obeying the NoIndex NoFollow Meta Tag thread, I had a page with the noindex, nofollow meta tag, and that page's URL still appeared on the Google search results page. The thread then moved into a discussion of whether a URL showing up in the SERPs is enough for the page to be considered in the Google index.

So what are your thoughts?
Old 12-23-2004   #2
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
I think url only listings appear for a variety of reasons and not all can be put in the same category.

I remember reading something somewhere on google.com that "url listings only" means google is aware of the pages but for some reason hasn't fully spidered them. I don't think that really explains all "url only" listings or that it means that they are not in the google index.

Last edited by zamolxes : 12-23-2004 at 09:50 AM.
Old 12-23-2004   #3
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
I found it, here it is:
Quote:
"Where is my page's title?

Unlike many search engines, Googlebot can return results for pages that are known but haven't been crawled yet. Since we haven't looked at those pages yet, their titles aren't shown; the Google results page displays the URL instead."
http://www.google.com/webmasters/faq.html

As I said above I don't believe that explains all "url only" listings.
In fact I often think Google doesn't have a very good/up to date/accurate webmaster section at google.com
Old 12-23-2004   #4
martinuboo
Free Directory Listings Reviews
 
Join Date: Nov 2004
Location: Michigan
Posts: 118
When Does Google Really Index a Page?

I had 2 under development sites appear as URL only listings in the SERPs. I wrote to googlebot@google.com and this was their reply:
Quote:
Thank you for your note. Although your robots.txt file prevents our robots
from crawling your pages, it will not prevent our robots from adding a
link to your page without crawling it. This is why the pages you have
mentioned do not have detailed titles or descriptions.

Although a robots.txt file usually prevents pages from appearing in our
search results, the only fool-proof ways to keep them out of our index are
to make sure that no sites link to them, password protect them, or remove
the robots.txt file and use a NOINDEX meta tag instead. For more
information on meta tags, please visit
http://www.google.com/remove.html#exclude_pages<snip>
Although I had excluded the entire site in robots.txt and had NOINDEX, NOFOLLOW, and NOARCHIVE meta tags on the index page, googlebot still picked up the link (since there were not any incoming links, I think the Google ToolBar phoned home...but I know that was in another thread).

I do not consider this "to be indexed" though, since there is nothing listed, except the URL, that is searchable.
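The robots.txt side of that reply can be sketched with Python's standard `urllib.robotparser`. This is only an illustration, assuming a hypothetical robots.txt that excludes a whole site; the `example.com` URLs and the `may_crawl` helper are made up for the example and say nothing about how Googlebot is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes the entire site (an assumption
# for this sketch, modeled on the setup described in the post).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A crawler that honors this file never downloads the page, so it never
# reads any NOINDEX meta tag inside it. The URL itself can still be
# learned from other sources and listed without a title or description.
print(may_crawl(ROBOTS_TXT, "Googlebot", "http://example.com/index.html"))
```

This is why the reply suggests removing the robots.txt block when relying on a NOINDEX meta tag: the tag can only be obeyed if the page may be fetched.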

Last edited by martinuboo : 12-23-2004 at 10:45 AM. Reason: typo
Old 12-23-2004   #5
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
That's why I said before that not all "url only" listings can be put in the same category.

There are plenty of other "url only listings" though that not only appear but rank well in google results. Those obviously must be indexed. (otherwise google would be totally surreal)

I for one don't like the fact that google still lists pages/files excluded in the robots.txt

Whatever they say, the robots.txt file is there for search engines, not for someone else. If they make a page/file available, even if only as a "url only" link, in my opinion they are not fully obeying the webmaster's wishes.

And that's how we get into these debates on when/if a page is indexed or not!

Google seem to be getting better and better at confusing everyone lately: pagerank, sandbox, url only listings.... What's next?

Answering rustybrick:
Quote:
let's discuss the different factors that might make for a logical reason to say a page is not in the index versus is in the index.
I would say if a page ranks in search results (other than for its own name or a "site:domain.com" search) it is definitely indexed - if not it might be just a ghost!

Last edited by zamolxes : 12-23-2004 at 11:25 AM.
Old 12-23-2004   #6
I, Brian
Whitehat on...Whitehat off...Whitehat on...Whitehat off...
 
Join Date: Jun 2004
Location: Scotland
Posts: 940
It's maybe not simply a question of "when" but of "how".

For example, it's easy to think that Google simply follows links from pages to pages, and that's simply how Google finds content - which is almost certainly not the case, as anyone else who has had their WS_FTP logs indexed will probably know (which specifically excludes toolbar referrals).

The whole issue of "how" begins with an understanding of the HTTP protocol, and IMO it's only then that the question of "when" can be applied. Even then, there are marked differences in the activity of Googlebots that still somehow seem to mirror the old "freshbot" + "deepbot" behaviours.
Old 12-23-2004   #7
Mel
Just the facts ma'm
 
Join Date: Jun 2004
Location: Malaysia
Posts: 793
Quote:
Originally Posted by zamolxes
....

I would say if a page ranks in search results (other than for its own name or a "site:domain.com" search) it is definitely indexed - if not it might be just a ghost!
While it may not be intuitive to think that Google can know about pages which it has not yet indexed, a look at the Google indexing process can show how and why this can occur.

First let's make the assumption (justified, I think) that a page is not indexed until the indexer parses the document and distributes what it finds there to various locations within the databases. As part of this operation all links are parsed out and placed in the anchors file. From there the URL resolver reads the anchors file and places the anchor text into the forward index, associated with the docID that the anchor points to. At this point the page that the link points to may not yet have been spidered, but it has an entry in the forward index and all subsequent indexes for that word.

Here we have a situation where the page has not yet been spidered, but it does have an entry in the word barrels and can be returned for a search for that word, even though Google knows nothing else about that page. This is largely a result of the fact that anchor text is associated with the page it points to, not the page it's on.

If Google does return such a page in the search results, the only information about that page it has is the URL and thus that is all it can show in the search results.

This is not in violation of the Robots.txt file or the no index meta, as the page has not been spidered or indexed as requested.
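The mechanism described above can be sketched with a toy index. This is only a rough illustration of the idea that anchor text is credited to the link's target; the `ToyIndex` class, its method names, and the example URLs are invented for the sketch and are not Google's actual data structures.

```python
from collections import defaultdict

class ToyIndex:
    """Toy word index where anchor text is credited to the link target."""

    def __init__(self):
        self.crawled = {}                 # url -> title, for fetched pages only
        self.postings = defaultdict(set)  # word -> set of urls

    def index_page(self, url, title, body, links):
        """Index a fetched page; links is a list of (anchor_text, target_url)."""
        self.crawled[url] = title
        for word in (title + " " + body).lower().split():
            self.postings[word].add(url)
        # Anchor text is stored against the page the link points to,
        # whether or not that page has been crawled yet.
        for anchor_text, target in links:
            for word in anchor_text.lower().split():
                self.postings[word].add(target)

    def search(self, word):
        """Return (url, title) pairs; title is None for a URL-only listing."""
        hits = sorted(self.postings.get(word.lower(), ()))
        return [(url, self.crawled.get(url)) for url in hits]

index = ToyIndex()
index.index_page(
    "http://a.example/", "Home", "welcome to our site",
    links=[("secret widgets", "http://b.example/hidden.html")],
)
# The target page was never fetched, yet it matches a search for a word
# from the anchor text, and all the index can show for it is the URL.
print(index.search("widgets"))
```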
__________________
Mel Nelson
Expert SEO Dont settle for average SEO
Singapore Search Engine Optimization and web design
Old 12-23-2004   #8
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
And then we come back to what I was saying in another thread: as long as you have the right anchor text pointing to you nothing else matters! That's a terrible ranking algorithm!

Quote:
This is not in violation of the Robots.txt file or the no index meta, as the page has not been spidered or indexed as requested.
It has not been spidered but it has been at least partially indexed as it comes up in certain restricted searches! (As far as I knew Google results were "served" from Google's index?!)

Last edited by zamolxes : 12-23-2004 at 12:41 PM.
Old 12-23-2004   #9
Mel
Just the facts ma'm
 
Join Date: Jun 2004
Location: Malaysia
Posts: 793
The only thing that has been indexed is a link pointing to that page, not the page itself.
Old 12-23-2004   #10
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
Quote:
Originally Posted by Mel
The only thing that has been indexed is a link pointing to that page, not the page itself.
I agree, but that link can appear in certain searches. What many want when using robots.txt is also not to make public certain pages/files.

I still don't understand why google keeps all those ghost urls in their index. So that they can say they have the biggest index?!
There are many "url only" listings that don't get updated or dropped for ages.

At some point, when they doubled their index, there was even some talk of duplicate listings in Google. A site:domain.com search was returning over 10,000 pages for one site that never had more than 4,000 pages, all static, always with the same URLs. (Now it is back to normal.)

Last edited by zamolxes : 12-23-2004 at 01:03 PM.
Old 12-23-2004   #11
Mel
Just the facts ma'm
 
Join Date: Jun 2004
Location: Malaysia
Posts: 793
That may be what users want, but it is not what robots.txt or the noindex meta tag provides. The first says do not spider the page, and the second says do not index the page. It may not be exactly what some folks want, but that's what they are set up to do and they seem to do it pretty well.

Wanting these tags to do things that they were not designed to do is like buying an SUV and being unhappy that it won't carry a two ton load - it just wasn't designed to do that.

If there are files you want to be kept secret you might consider putting them in a password protected directory.
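To make the noindex half of that distinction concrete, here is a minimal sketch of how a crawler might read the robots meta tag, using Python's standard `html.parser`. The `RobotsMetaParser` class and `robots_directives` helper are hypothetical names for this example. The point it illustrates is that the directive lives inside the page, so a bot has to fetch the page before it can see and obey it.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)

def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
# A bot can only honor "noindex" after it has downloaded this markup.
print("noindex" in robots_directives(page))
```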
Old 12-23-2004   #12
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
What's the point, google will still link to them - people will still know they are there! It's just annoying that nowadays it's harder and harder to keep anything private. What is the point of Google showing a link to them? In fact what is the point behind this idea at google (concerning the "url only" listings): "we know the pages are there but we haven't or can't or can't be bothered or won't crawl them (well, maybe we will .... sometime)"


Maybe we should discuss more about the "url only" listings, I think it's more interesting - I still don't see the point of most of them in the index (some have been there, unchanged for so many months!). However I leave it up to you and others, I'm flying to Spain in 2hrs and I haven't slept at all this night!

Last edited by zamolxes : 12-23-2004 at 10:59 PM.
Old 12-23-2004   #13
Mel
Just the facts ma'm
 
Join Date: Jun 2004
Location: Malaysia
Posts: 793
Yes people may still know that there is a page with such and such a page name at such and such a site, but they will not have the faintest idea what is on that page.
Old 12-23-2004   #14
zamolxes
Member
 
Join Date: Dec 2004
Posts: 23
Quote:
Originally Posted by Mel
Yes people may still know that there is a page with such and such a page name at such and such a site, but they will not have the faintest idea what is on that page.
Unless they click on it. Which they probably wouldn't if the link was not displayed.
Old 12-23-2004   #15
Mel
Just the facts ma'm
 
Join Date: Jun 2004
Location: Malaysia
Posts: 793
If as I suggested it is in a password protected directory they will see nothing but a login.
Old 12-24-2004   #16
rustybrick
 
 
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
OK, so we know that the only real way to keep the bot off your pages is to password protect them or block the bot's IP address.

I am more interested in the distinction of when a page is indexed.

A page is obviously not indexed if the bot cannot even get to the page.

If a bot gets to the page and puts the URL in the database, is that considered "indexed"?
Old 12-24-2004   #17
Mel
Just the facts ma'm
 
Join Date: Jun 2004
Location: Malaysia
Posts: 793
IMO a page is indexed only after it has been crawled, the page stored in the repository, and the indexer has parsed the page to the various databases and files, so that the entire content of the words and links on the page can be searched. In most cases it will also have a cache available, but this will not be the case for those who have requested that a page not be cached.
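As a rough sketch of that definition, the stages can be written as a small state classifier. This follows the opinion given in this post, not any official Google definition, and the `PageState` names and `classify` function are invented for the example.

```python
from enum import Enum

class PageState(Enum):
    UNKNOWN = "unknown"    # no link to the page has been seen
    URL_ONLY = "url only"  # known from links, but never crawled
    STORED = "stored"      # crawled and kept in the repository, not yet parsed
    INDEXED = "indexed"    # parsed by the indexer, content is searchable

def classify(linked_to: bool, in_repository: bool, parsed: bool) -> PageState:
    """Classify a page using the crawl -> store -> parse sequence above."""
    if in_repository and parsed:
        return PageState.INDEXED
    if in_repository:
        return PageState.STORED
    if linked_to:
        return PageState.URL_ONLY
    return PageState.UNKNOWN

# A page blocked by robots.txt but linked from elsewhere stays URL-only.
print(classify(linked_to=True, in_repository=False, parsed=False))
```

Under this reading, a URL-only listing is a page Google knows about, not a page Google has indexed.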
Old 12-24-2004   #18
orion
 
 
Join Date: Jun 2004
Posts: 1,044
Indexing

Hi, Rusty

There is a difference between indexing and storing (indexers and storage managers).

You may want to check WebBase: A Repository of Web Pages, under the "3: Storage Manager" section.

Orion
Old 12-24-2004   #19
rustybrick
 
 
Join Date: Jun 2004
Location: New York, USA
Posts: 2,810
Thank you Orion!
Old 12-25-2004   #20
bobmutch
seocomapny.ca|Project Support Open Source|Top 40 Dirs rated by Inbound Link Quality
 
Join Date: Aug 2004
Location: london.on.ca
Posts: 575
For those who dislike PDFs, here is WebBase: A Repository of Web Pages in HTML: http://www9.org/w9cdrom/296/296.html