Getting millions of dynamic pages indexed


Mikkel deMib Svendsen
08-25-2004, 10:01 AM
I have worked on a couple of sites that have hundreds of thousands or millions of good content pages, and I would like to share some experience in getting that many pages indexed, ranked well and, not least, benchmarked.

In most cases I have been fortunate in that these large sites already had good or decent link popularity/PageRank. They were prominent sites, but with a site architecture so messed up that they hardly got any pages indexed and ranked.

Removing indexing barriers on such sites is often not a trivial task. Not least, the "human element" can be a huge limitation on the radical changes it often takes. However, it is indeed possible. But be prepared: It takes time! And I have found that the kind of companies that operate million-page sites often require extensive documentation before they change as much as a comma in the web code.

Once the indexing barriers are removed, I have found it very important to implement a solid robots.txt file to block all the agents that you don't want in and that respect this file. And that's actually quite a few. Hundreds. Imagine you have just "opened up" your website to spiders (removed the indexing barriers) and all these hundreds of agents start to spider your millions of pages. Not good. Not good at all. This can in fact take down your servers or force you to invest in heavier servers and bandwidth upgrades. So get protected.

I have often used a slightly modified version of Brett Tabke's robots.txt file, which he publishes under the GNU license. This one is very strict - I usually take a few of the bots off the list, but that's up to you.

www.webmasterworld.com/robots.txt
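
To illustrate the blocking approach, here is a minimal Python sketch that builds a robots.txt shutting out a list of unwanted agents and then checks the result with the standard library's urllib.robotparser. The agent names and the example.com URL are made up for illustration; they are not taken from the WebmasterWorld file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent names, for illustration only.
BLOCKED_AGENTS = ["ExampleScraperBot", "ExampleHarvester"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that shuts out each listed agent and allows the rest."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # An empty Disallow means nothing is disallowed for all other agents.
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url):
    """Check whether a well-behaved agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(BLOCKED_AGENTS)
```

Checking the generated file this way before deploying it helps catch a rule that accidentally locks out a spider you do want. It only protects against agents that actually respect robots.txt.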


After the indexing barriers are removed the pages should get indexed, right? Well, they don't always. At least not all of them - the millions you have. I have found that it takes a lot of work on your linking structure and on how you update the content (freshness). And this is a tricky part ...

It is not so difficult to create site maps on a site with a few hundred or thousand pages in a hierarchy that makes sense and with a reasonable number of levels. With millions of pages that gets tricky. Often I find that some site maps, and links to pages, end up so deep that spiders don't prioritise them highly enough in crawling.

There are a couple of things I have found that help: theming areas and getting external links to the entry points of those areas. This way you can create a number of internal hubs that you link to from the main site map and also point external links at. From those hubs you can start the nested site map for each section. This will give you fewer pages to deal with in each site-map hierarchy.
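
To put rough numbers on the depth problem, here is a small Python sketch. The figure of 100 links per site-map page and the page counts are assumptions picked for the arithmetic, not measurements from any of the sites mentioned.

```python
def levels_needed(total_pages, links_per_page):
    """Smallest number of site-map levels whose links can reach every page."""
    if links_per_page < 2:
        raise ValueError("links_per_page must be at least 2")
    levels, reachable = 1, links_per_page
    # Each extra level multiplies the number of reachable pages.
    while reachable < total_pages:
        reachable *= links_per_page
        levels += 1
    return levels


def split_into_pages(urls, links_per_page):
    """Chunk a section's URLs into site-map pages of at most links_per_page links."""
    return [urls[i:i + links_per_page] for i in range(0, len(urls), links_per_page)]


flat_depth = levels_needed(1_000_000, 100)  # one hierarchy for the whole site
hub_depth = levels_needed(10_000, 100)      # one themed hub of 10,000 pages
```

Under these assumptions a single hierarchy for a million pages needs three levels of site-map pages, while a themed hub covering 10,000 pages needs two. If that hub also gets external links of its own, its deepest pages sit one level closer to a crawler's entry point.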

Before this post gets too long, I want to hear others' experience in getting that many pages indexed. Next we can go into how we rank them well too :)

Nick W
08-25-2004, 10:47 AM
>>I want to hear others experience in getting that many pages indexed

Pagination: See pages: 1 2 3 - 50

next page (2)
See pages: 3 4 5 - 51

etc...

Works well for me..

Nick

Mikkel deMib Svendsen
08-25-2004, 10:52 AM
Yes, pagination works too. However, in some cases I have had problems making sure the pages do not become too identical. In any case, I always adjust titles and META-tags for paged pages, so that as a minimum titles get a "... page 2" added. I just don't want hundreds of pages with near-identical content and identical headers indexed. It's not healthy :)
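
A minimal Python sketch of the title adjustment, assuming a hypothetical helper called paged_title and a " - page N" suffix format (the exact wording of the suffix is up to you):

```python
def paged_title(base_title, page):
    """Return a title that is unique per page of a paginated listing."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 keeps the plain title; later pages get a "page N" suffix.
    return base_title if page == 1 else f"{base_title} - page {page}"
```

The same idea can be applied to the META description so that paged pages do not share identical headers.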

rustybrick
08-25-2004, 11:18 AM
Rotating featured articles/products on the homepage and landing pages on a daily basis works well.

Having links to related articles/products from articles and products also works well.
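
One possible way to do the daily rotation, sketched in Python. The function name featured_for_day and the idea of stepping through the list by calendar date are assumptions for illustration, not a description of any particular site's setup.

```python
from datetime import date


def featured_for_day(items, day, slots):
    """Pick `slots` items for `day`, advancing through the list once per day."""
    if not items:
        return []
    slots = min(slots, len(items))
    # The date's ordinal moves the starting position forward by `slots` each day.
    start = (day.toordinal() * slots) % len(items)
    return [items[(start + i) % len(items)] for i in range(slots)]


# Hypothetical article IDs for illustration.
articles = ["a", "b", "c", "d", "e", "f"]
today_picks = featured_for_day(articles, date(2004, 8, 25), 2)
tomorrow_picks = featured_for_day(articles, date(2004, 8, 26), 2)
```

Because the selection depends only on the date, every visitor and every spider sees the same featured set on a given day, and the set changes the next day.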

Mikkel deMib Svendsen
08-26-2004, 08:32 AM
How much do you think freshness impacts indexing? Personally I do think it helps.

rustybrick
08-26-2004, 08:36 AM
It's hard to say. Of course you want the search bots revisiting your pages as frequently as possible (if you have the server resources), so that they can pick up on new pages more frequently. In that case, I feel freshness is very important. So if you have a sub-category landing page that rotates articles/products on a daily basis, then you can be sure that when you add a new article/product to that page, the search bots will pick it up more rapidly than on a page that does not have a dynamic portion to it.

But with regard to ranking, it's really hard to say. I have some static pages (built dynamically, but really static in nature) that rank very well too. I am unsure how much freshness plays into ranking for existing pages... :confused:

seomike
08-26-2004, 11:22 AM
I hear ya when it comes to server load. I know there is a meta tag that I've seen that tells spiders to revisit every 10, 15, 20... days. Has anyone ever used that or can someone tell me if it works LOL.

Mikkel deMib Svendsen
08-26-2004, 01:53 PM
I do not know of any search engines that support the revisit-after META-tag. I think it's just as effective as the pagerank=10 tag :)

seobook
08-26-2004, 05:58 PM
>>I do not know of any search engines that support the revisit-after META-tag. I think it's just as effective as the pagerank=10 tag :)

exactly, since everyone is using the pagerank=10 tag you now need to use pagerank=11 to get the same effect out of it ;)

Mikkel deMib Svendsen
08-29-2004, 10:13 AM
Yes, and if you want to be evil you can even use this trick:

<a href="YourCompetitor.com" PageRank="-1">Negative link</a>

:eek:

Nicky
09-22-2004, 06:11 AM
Ohhh....That sounds naughty ;)

Nacho
11-16-2004, 01:21 AM
Mikkel, this is such a great post! Gotta give it a bump <<

Here's another thread here at SEW related to yours: Let's discuss ROBOTS.TXT (http://forums.searchenginewatch.com/showthread.php?t=2060&highlight=robots.txt).