Number of pages indexed based on PageRank?
Jeremy_Goodrich
06-15-2004, 02:54 PM
After doing a bit of digging this morning, I can't seem to find any academic research on this subject - is the number of pages indexed for any given tld limited by the pagerank for that tld?
There are many pointers to "yes" but it would be nice to have another case study aside from my own to point to as "proof".
AussieWebmaster
06-15-2004, 02:57 PM
>After doing a bit of digging this morning, I can't seem to find any academic research on this subject - is the number of pages indexed for any given tld limited by the pagerank for that tld?
>There are many pointers to "yes" but it would be nice to have another case study aside from my own to point to as "proof".
Actually it works the other way... the more pages the more potential PR to pass around...
Jeremy_Goodrich
06-15-2004, 03:12 PM
Roflmao, you don't create pagerank by generating pages ;)
If that was true, then I could increase my PR easily enough, and not worry about the limit of my pages in Google.
Try doing a few searches - before you throw out false conclusions like that, please.
dannysullivan
06-15-2004, 03:21 PM
PageRank is assigned on a page by page basis, so a given site shouldn't have a PageRank.
The home page will have a PR value, of course. And Google will use this to create an estimate of PR values for other pages within the web site, for the Google Toolbar, if it doesn't have an actual value for that page.
I've never seen Google say that a high home page PR value will mean more internal pages will be indexed. However, I do think it would help.
Google and other search engines have said that they'll tend to index pages that have more links pointing at them or that are seen to be more important. If you have a high PR home page, point at your internal pages, you pass on some of that value to them -- and so on. That would likely give them an edge on being indexed more rapidly than someone with a low PR home page.
AussieWebmaster
06-15-2004, 03:38 PM
>Roflmao, you don't create pagerank by generating pages ;)
>If that was true, then I could increase my PR easily enough, and not worry about the limit of my pages in Google.
>Try doing a few searches - before you throw out false conclusions like that, please.
Okay... purely as a number, no, you will not get more PR... however, the volume of relevant content at your site will contribute to the authority level of the site.
If you need me to post links to the numerous references to increasing the content at your site as a part of overall optimization strategy I will... but I will wait for a reply as possibly we mistook each other's comments.
Rereading your post I realized I had missed the question... happens a lot when I start answering a question before having read all that was being stated.
seomike
06-15-2004, 03:47 PM
I agree with danny and aussiewebmaster but also remember that page mass has a lot to do with it too. Good crosslinking in a site with a lot of pages is like shooting a laser into a hall of mirrors. PR is passed more easily to pages that are farther from the root that way, and inbound links that go into internal pages trickle PR back to the top.
Structure plays a big part too. Don't skip directories in linking. I've noticed in the last 5 sites I've done that spiders will come in and crawl the root/ documents, then come back and crawl the second level directory root/directory2/ and so on.
If you skip from root/ to root/directory/directory/directory/ I've noticed it will take a higher PR to get that spider to go there. Don't ask me why, I'm just reporting on what I've been watching. I ran into trouble with one of my directories and had to change the entire site structure. I was making the jump from domain.com to domain.com/us/smallbusiness/category/ and couldn't get G to index anything. I changed it from that to domain.com/category and the entire site of 1,000+ pages was crawled in about a month.
Jeremy_Goodrich
06-15-2004, 04:17 PM
M'kay, far afield of the topic - is the number of pages indexed on a particular domain dependent on the amount of pagerank for that domain?
Taking a look @ http://www.webmasterworld.com/forum3/8467.htm
(thread over a year old) the 2 theories being floated are 1) server response time and 2) pagerank as the limit of pages to index.
It's only logical that Google would limit the number of pages by pagerank that they include in their index - or time it takes to crawl the site. Say your site was very highly connected in the webmap, and you had a billion pages...should google index them all? Or, would time stop them before they crawled the whole lot? Or does their rule set indicate that they should stop crawling the page set, after a certain connectivity limit has been passed?
I'm looking for some proof, though, if there is such a thing...and, let's forget about lasers & such seomike :)
AussieWebmaster
06-15-2004, 04:31 PM
I would nod towards server response time and give the thumbs down to PR.
The crawler is greedy so it tries to crawl and index as many pages as it can. PR does come into play for frequency of crawl. For example, a full crawl cycle is usually between 20 and 30 days. A higher PR page (not site) will be crawled more frequently (closer to 20 days)... a low PR page less so (closer to 30 days). Not a huge difference either way.
If the spider encounters problems it may not get to the other pages and so be restricted that way.
seomike
06-15-2004, 05:15 PM
You looking for something like this :confused:
I have a tracking script on just about every page.
http://www.mouse-house-tour.com/crawl-report.php
I just built this real quick so if the average PR of the site goes up to a 5 I guess we'll see if the site gets more pages crawled per visit.
>Roflmao, you don't create pagerank by generating pages ;)
>If that was true, then I could increase my PR easily enough, and not worry about the limit of my pages in Google.
>Try doing a few searches - before you throw out false conclusions like that, please.
IMO the pot should do the research before calling the kettle black. Every page has its own pagerank to pass on as its webmaster chooses, and if you understand how to structure your site the more pages you have the more PR potential you control and the higher PR you can generate for a particular page.
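A minimal sketch of that claim, assuming the formula from the original published PageRank paper, PR(A) = (1 - d) + d * sum(PR(T) / C(T)), a damping factor d of 0.85, and a made-up closed site where the home page links to every internal page and each internal page links back. The site shape, the page names and the helper star_site are hypothetical, and nothing here says how Google's live system behaves or how many pages it will index.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T that link to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Each linking page shares its current value equally among its outlinks.
            inbound = sum(
                pr[src] / len(outs) for src, outs in links.items() if page in outs
            )
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


def star_site(n_internal):
    """Hypothetical site: home links to every internal page, each links back to home."""
    internal = [f"page{i}" for i in range(n_internal)]
    links = {"home": internal}
    links.update({page: ["home"] for page in internal})
    return links


small = pagerank(star_site(2))   # home + 2 internal pages
large = pagerank(star_site(9))   # home + 9 internal pages

print(round(small["home"], 2), round(large["home"], 2))
```

Under these assumptions the home page settles at about 1.46 with two internal pages and about 4.68 with nine, and the total across the site equals the number of pages. These are raw formula values for a site with no outside links, not toolbar numbers.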
>M'kay, far afield of the topic - is the number of pages indexed on a particular domain dependent on the amount of pagerank for that domain?
>Taking a look @ http://www.webmasterworld.com/forum3/8467.htm
>(thread over a year old) the 2 theories being floated are 1) server response time and 2) pagerank as the limit of pages to index.
>It's only logical that Google would limit the number of pages by pagerank that they include in their index - or time it takes to crawl the site. Say your site was very highly connected in the webmap, and you had a billion pages...should google index them all? Or, would time stop them before they crawled the whole lot? Or does their rule set indicate that they should stop crawling the page set, after a certain connectivity limit has been passed?
>I'm looking for some proof, though, if there is such a thing...and, let's forget about lasers & such seomike :)
Seems like everyone has forgotten that Googlebot uses the If-Modified-Since conditional GET when spidering a site, and IMO sites that implement proper responses to IMS will get more pages crawled as opposed to those that do not.
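For anyone unsure what a "proper response to IMS" means, here is a rough sketch of the server-side decision, assuming the standard HTTP behaviour: answer 304 Not Modified with no body when the page has not changed since the date the client sent, otherwise answer 200 with the full page. The function name conditional_get and the dates are made up for illustration, and whether this actually gets more pages crawled is the opinion above, not something the code shows.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET that may carry If-Modified-Since.

    last_modified must be a timezone-aware UTC datetime.
    """
    # HTTP dates only carry whole seconds, so drop anything finer.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it and send the full page
    if since is not None and since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)

    if since is not None and last_modified <= since:
        return 304, headers  # unchanged: the crawler can skip the download
    return 200, headers      # changed or no usable header: send the page


page_changed = datetime(2004, 6, 15, 14, 54, 0, tzinfo=timezone.utc)
print(conditional_get(page_changed, "Tue, 15 Jun 2004 14:54:00 GMT"))
print(conditional_get(page_changed, "Mon, 14 Jun 2004 00:00:00 GMT"))
```

The first call answers 304 because the page has not changed since the date the crawler supplied; the second answers 200 because it has.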
Ever notice that some sites get a higher percentage of their pages indexed than others of the same PR?
Jeremy_Goodrich
06-16-2004, 01:17 PM
The "toolbar pr" is a rounded, whole number...NOT the real PR for a site. So you're saying that you know two sites with exactly the same REAL pagerank value have different numbers of pages indexed...? Very few people (if any) outside Google would be able to pinpoint the real pagerank score of a page...tell me - I'm sure everybody would love to know the secret ;)
Dude, I've done research, yesterday I spent four hours refreshing my memory on stuff from Henzinger, Bharat, etc...and nope, didn't find any explicit mention of what I'm after.
This thread has been fun, but unhelpful.
St0n3y
06-16-2004, 02:20 PM
>Roflmao, you don't create pagerank by generating pages
>If that was true, then I could increase my PR easily enough, and not worry about the limit of my pages in Google.
>Try doing a few searches - before you throw out false conclusions like that, please.
Crosslinking within a large site in and of itself is not enough to generate a PR; however, I have seen multiple examples of large high PR sites where their Google backlinks are over 75% from their own internal pages.
Is this a definitive conclusion? No, but I strongly believe that more pages of quality and unique content will both build a site's authoritative status and help improve PR.
>academic research on this subject
http://www7.scu.edu.au/programme/fullpapers/1919/com1919.htm is a good place to start, covers many aspects of the decision tree regarding crawling.
All theories about crawling have to have a couple of modifiers imho. Firstly, you must assume that there is a scarcity of resources [bandwidth in this case] before it becomes a factor; in what seems to be the new dot com goldrush, search engines may choose to ignore the financial costs of crawling. Secondly, time is a big factor: if you choose to crawl in a friendly way then only so many requests per server can be made in a given time frame, and if you have a set update schedule [e.g. monthly crawls] server response time must have an impact.
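To put rough numbers on the second point, here is a back-of-envelope sketch. It assumes a crawler that fetches one page at a time from a server and waits a fixed delay between requests; the 30-day cycle, the 10-second delay and the response times are all made-up figures, not anything a search engine has published.

```python
def max_pages_per_cycle(cycle_days, delay_seconds, response_seconds):
    """Upper bound on pages fetched from one server in one crawl cycle,
    assuming strictly sequential requests with a fixed pause between them."""
    cycle_seconds = cycle_days * 24 * 60 * 60
    return int(cycle_seconds // (delay_seconds + response_seconds))


# Hypothetical 30-day cycle with a 10-second pause between requests.
fast_server = max_pages_per_cycle(30, 10, 0.5)  # pages answer in half a second
slow_server = max_pages_per_cycle(30, 10, 5)    # pages take five seconds each

print(fast_server, slow_server)
```

With those assumed figures the slow server tops out at 172,800 pages per cycle and the fast one at a little under 247,000, so on a very large site response time alone could cap how much gets fetched.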
[waves to Jeremy :)]
Jeremy_Goodrich
06-16-2004, 04:22 PM
Hi, NFFC - nice to "see you".
FYI, I'll be in London later this month...don't know if I'll have time to meet, but thought I'd mention it.
I remember reading that one; when I was digging yesterday, I couldn't find it.
URL ordering, in a nutshell, you'd go from most to least connected, yes? Logically, I mean...so, it would seem that sites / pages with least connectivity wouldn't be a priority.
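As a rough sketch of "most to least connected": the crawler below always fetches the known URL with the most links pointing at it from pages crawled so far. It is loosely modelled on the backlink-count ordering idea and is not the paper's actual algorithm or Google's; the tiny link graph, the function name crawl_order and the alphabetical tie-break are assumptions for illustration.

```python
from collections import Counter


def crawl_order(links, seed, limit=None):
    """Crawl from seed, always taking the frontier URL with the most
    backlinks seen so far (ties broken alphabetically)."""
    frontier = Counter({seed: 0})  # URL -> backlinks seen from crawled pages
    done = set()
    order = []
    while frontier and (limit is None or len(order) < limit):
        url = min(frontier, key=lambda u: (-frontier[u], u))
        del frontier[url]
        done.add(url)
        order.append(url)
        for target in links.get(url, []):
            if target not in done:
                frontier[target] += 1
    return order


# Hypothetical site: "z" is linked from three pages, "a" and "b" from one each.
links = {
    "home": ["a", "b", "z"],
    "a": ["z"],
    "b": ["z"],
}

print(crawl_order(links, "home"))
print(crawl_order(links, "home", limit=3))
```

In this toy graph "z" jumps ahead of "b" once a second link to it has been seen, and with a budget of three fetches "b" is never crawled at all, which is the "least connected isn't a priority" effect.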
AussieWebmaster
06-16-2004, 04:40 PM
I would also look into how good the linking structure of a site is... if the crawler is pushed to a dead end your other pages may be missed.
Jeff Martin
06-16-2004, 04:59 PM
I won't argue that PR isn't a factor in the frequency of crawling a site, however I launched a news website with two inbound links (a tad shallow - maybe 15 pages) and I updated the home page every day with new content and archived the old home pages. After about two months, the site was lightly crawled almost every day (1-5 pages with a deep crawl 1-2x a month) and I had the freshness stamp on my listing in the SERPs. My indexed time stamp was usually never older than three days.
G bot likes it fresh! Just because you're a PR3 doesn't mean you can't get G bot to visit more often. Try updating your home page every day (get a news section or something you can add to) for two months, see if that helps.
>don't know if I'll have time to meet
You best make time!
>URL ordering, in a nutshell
Myself, I think in terms of what I can know about the URL without spidering it, what signals of quality it sends.
I'm assuming that we all know how to structure a site to get the spiders in there but when you have a million plus URLs [think how much PR that would create, be at least a 12 ;)] you have to think differently.