#1
When Does Google Really Index a Page?
In an earlier thread, "Google Not Obeying the NoIndex NoFollow Meta Tag", we drifted into a discussion about when a page is considered to be in the Google index.
As you can see from that thread, it is not black and white. If any of you have documentation from Google, or a quote from GoogleGuy, giving the official definition of when a page is really indexed, please post it here. But to make this a fun thread, let's discuss the different factors that might give a logical reason to say a page is or is not in the index. One example is when the URL alone is listed in the SERPs and nothing else. In the example from the "Google Not Obeying the NoIndex NoFollow Meta Tag" thread, I had a page with the noindex, nofollow meta tag, yet that page's URL was found on the Google search results page. The thread moved into a discussion of whether it is enough for a page's URL to be shown in the SERPs for the page to be considered in the Google index. So what are your thoughts?
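To make the noindex, nofollow meta tag concrete, here is a minimal Python sketch of how a crawler might read those directives out of a fetched page. The class `RobotsMetaParser`, the helper `robots_directives`, and the sample page are all hypothetical names made up for illustration; nothing here is claimed to be how Googlebot actually works.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                part = part.strip().lower()
                if part:
                    self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page carrying the tag discussed in this thread.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Example</title></head><body>Hello</body></html>"""

directives = robots_directives(SAMPLE_PAGE)
print(sorted(directives))
```

Note that the page has to be downloaded before the tag can be read at all, which is part of why a bare URL can be known to a search engine independently of the tag.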
#2
I think URL-only listings appear for a variety of reasons and not all can be put in the same category.
I remember reading something somewhere on google.com saying that a "URL only" listing means Google is aware of the page but for some reason hasn't fully spidered it. I don't think that really explains all "URL only" listings, or that it means they are not in the Google index.
Last edited by zamolxes : 12-23-2004 at 08:50 AM.
#3
I found it, here it is:
Quote:
As I said above, I don't believe that explains all "URL only" listings. In fact, I often think Google doesn't have a very good, up-to-date, or accurate webmaster section at google.com.
#4
When Does Google Really Index a Page?
I had two under-development sites appear as URL-only listings in the SERPs. I wrote to googlebot@google.com and this was their reply:
Quote:
I do not consider this "to be indexed", though, since nothing searchable is listed except the URL.
Last edited by martinuboo : 12-23-2004 at 09:45 AM. Reason: typo
#5
That's why I said before that not all "URL only" listings can be put in the same category.
There are plenty of other "URL only" listings, though, that not only appear but rank well in Google results. Those obviously must be indexed (otherwise Google would be totally surreal).
I for one don't like the fact that Google still lists pages and files excluded in robots.txt. Whatever they say, the robots.txt file is there for search engines, not for anyone else. If they make a page or file available, even if only as a "URL only" link, in my opinion they are not fully obeying the webmaster's wishes. And that's how we get into these debates on whether and when a page is indexed!
Google seems to be getting better and better at confusing everyone lately: PageRank, sandbox, URL-only listings... What's next?
Answering rustybrick:
Quote:
Last edited by zamolxes : 12-23-2004 at 10:25 AM.
#6
It's maybe not simply a question of "when" but of "how".
For example, it's easy to think that Google simply follows links from page to page, and that this is the only way Google finds content. That is almost certainly not the case, as anyone else who has had their WS_FTP logs indexed will probably know (which specifically excludes toolbar referrals). The whole issue of "how" begins with an understanding of the HTTP protocol, and IMO it's only then that the question of "when" can be applied. Even then, there are marked differences in the activity of Googlebots that still somehow seem to mirror the old "freshbot" + "deepbot" behaviours.
#7
Quote:
First, let's make the assumption (justified, I think) that a page is not indexed until the indexer parses the document and distributes what it finds there to various locations within the databases. As part of this operation, all links are parsed out and placed in the anchors file. The URL resolver then reads the anchors file and places the anchor text into the forward index, associated with the docID that the anchor points to. At this point the page that the link points to may not yet have been spidered, but it has an entry in the forward index and all subsequent indexes for that word.
Here we have a situation where the page has not yet been spidered, but it does have an entry in the word barrels and can be returned for a search on that word, even though Google knows nothing else about that page. This is largely because anchor text is associated with the page it points to, not the page it is on. If Google does return such a page in the search results, the only information it has about that page is the URL, and thus that is all it can show in the search results. This does not violate the robots.txt file or the noindex meta tag, as the page has not been spidered or indexed, as requested.
__________________
Mel Nelson Expert SEO Dont settle for average SEO Singapore Search Engine Optimization and web design
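The mechanism described in the post above can be sketched in a few lines of Python. This is a toy model only: `ToyIndex`, its method names, and the example.com URLs are hypothetical, and the single `postings` dictionary stands in for the anchors file, forward index and barrels that the post mentions. The point it illustrates is that anchor text is credited to the link target, so a page that was never fetched can still come back for a query.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index where anchor text is credited to the link target."""

    def __init__(self):
        self.doc_ids = {}                 # URL -> docID
        self.crawled = set()              # docIDs whose content was parsed
        self.postings = defaultdict(set)  # word -> docIDs

    def doc_id(self, url):
        # Assign the next free docID the first time a URL is seen.
        return self.doc_ids.setdefault(url, len(self.doc_ids))

    def index_page(self, url, text, links):
        """Index a fetched page. `links` is a list of (anchor_text, target_url)."""
        source = self.doc_id(url)
        self.crawled.add(source)
        for word in text.lower().split():
            self.postings[word].add(source)
        for anchor_text, target_url in links:
            # The target gets a docID and postings even though it was not fetched.
            target = self.doc_id(target_url)
            for word in anchor_text.lower().split():
                self.postings[word].add(target)

    def search(self, word):
        """Return (url, kind) pairs; kind is 'full' or 'url-only'."""
        urls = {doc: url for url, doc in self.doc_ids.items()}
        results = []
        for doc in sorted(self.postings.get(word.lower(), ())):
            kind = "full" if doc in self.crawled else "url-only"
            results.append((urls[doc], kind))
        return results


index = ToyIndex()
index.index_page(
    "http://example.com/a.html",
    "a page about widgets",
    [("blue widgets", "http://example.com/private.html")],
)
print(index.search("blue"))
```

Here `private.html` is returned for "blue" purely on the strength of the link pointing at it, and the only thing the index can show for it is its URL.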
#8
And then we come back to what I was saying in another thread: as long as you have the right anchor text pointing to you, nothing else matters! That's a terrible ranking algorithm!
Quote:
Last edited by zamolxes : 12-23-2004 at 11:41 AM.
#9
The only thing that has been indexed is a link pointing to that page, not the page itself.
__________________
Mel Nelson Expert SEO Dont settle for average SEO Singapore Search Engine Optimization and web design
#10
Quote:
I still don't understand why Google keeps all those ghost URLs in its index. So that they can say they have the biggest index?! There are many "URL only" listings that don't get updated or dropped for ages. At one point, when they doubled their index, there was even some talk of duplicate listings in Google: I was seeing over 10000 pages come up for one site for site:domain.com when the site never had more than 4000 pages, all static, always with the same URLs, etc. (it is now back to normal).
Last edited by zamolxes : 12-23-2004 at 12:03 PM.
#11
That may be what users want, but it is not what robots.txt or the noindex meta tag provides. The first says do not spider the page, and the second says do not index the page. It may not be exactly what some folks want, but that's what they are set up to do and they seem to do it pretty well.
Wanting these tags to do things that they were not designed to do is like buying an SUV and being unhappy that it won't carry a two-ton load: it just wasn't designed to do that. If there are files you want kept secret, you might consider putting them in a password-protected directory.
__________________
Mel Nelson Expert SEO Dont settle for average SEO Singapore Search Engine Optimization and web design
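The "do not spider" half of that distinction can be shown with Python's standard `urllib.robotparser` module. This is a small sketch under assumptions: the robots.txt contents, the `ExampleBot` user agent, the `may_fetch` helper, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="ExampleBot"):
    """True if robots.txt permits this (hypothetical) agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_fetch("http://example.com/private/report.html"))
print(may_fetch("http://example.com/index.html"))
```

A crawler that honours the first answer never downloads the page, so it also never sees any noindex meta tag inside it. Nothing in robots.txt stops the crawler from learning the URL itself from links on other sites.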
#12
What's the point? Google will still link to them, and people will still know they are there! It's just annoying that nowadays it's harder and harder to keep anything private. What is the point of Google showing a link to them? In fact, what is the point behind this idea at Google concerning the "URL only" listings: "we know the pages are there but we haven't, or can't, or can't be bothered to, or won't crawl them (well, maybe we will... sometime)"?
Maybe we should discuss the "URL only" listings more; I think that's more interesting. I still don't see the point of most of them being in the index (some have been there, unchanged, for so many months!). However, I leave it up to you and others. I'm flying to Spain in 2 hrs and I haven't slept at all tonight!
Last edited by zamolxes : 12-23-2004 at 09:59 PM.
#13
Yes, people may still know that there is a page with such and such a name at such and such a site, but they will not have the faintest idea what is on that page.
__________________
Mel Nelson Expert SEO Dont settle for average SEO Singapore Search Engine Optimization and web design
#14
Quote:
#15
If, as I suggested, it is in a password-protected directory, they will see nothing but a login.
__________________
Mel Nelson Expert SEO Dont settle for average SEO Singapore Search Engine Optimization and web design
#16
|
||||
|
||||
|
OK, so we know that the only real way to keep the bot off your pages is to password-protect them or block the bot's IP address.
I am more interested in the distinction of when a page is indexed. A page is obviously not indexed if the bot cannot even get to the page. If a bot gets to the page and puts the URL in the database, is that considered "indexed"?
#17
IMO a page is indexed only after it has been crawled, the page has been stored in the repository, and the indexer has parsed the page into the various databases and files, so that all of the words and links on the page can be searched. In most cases a cached copy will also be available, but not for pages whose owners have requested that they not be cached.
__________________
Mel Nelson Expert SEO Dont settle for average SEO Singapore Search Engine Optimization and web design
#18
Hi, Rusty
There is a difference between indexing and storing (indexers and storage managers). You may want to check "WebBase: A Repository of Web Pages", under the "3: Storage Manager" section.
Orion
#19
Thank you Orion!
#20
For those who dislike PDFs, here is "WebBase: A Repository of Web Pages" in HTML: http://www9.org/w9cdrom/296/296.html