SEO & Server Load Balancing: How To Do It Right?


seoloo
02-10-2005, 03:37 PM
Hi. Quick question. I'm helping out a company that's implementing load balancing. In other words, they need to split traffic coming into www onto different servers--www1, www2, www3, etc.

Problem is, if the GoogleBot gets redirected to www2, www3, etc., that is going to cause Google to start having Search Results for those sites. We would prefer to have all search results be on the www site.

I've noticed other companies who implement load balancing this way have been successful at having www spidered, but keeping Googlebot off www2 and www3 servers:

www.cnn.com - 399,000 results
www2.cnn.com - 9,280 results
www3.cnn.com - 5,900 results

www.jcpenney.com - 57,700 results
www1.jcpenney.com - 164 results
www2.jcpenney.com - 76 results
www3.jcpenney.com - 207 results

Can someone give insight on how they're doing this? I checked, and it doesn't look like CNN has a different robots exclusion file on each server.

Can someone share best practices for SEO for this kind of setup? I know it's been done, but I can't find too much information on the Web.

Thanks.

JohnW
02-10-2005, 10:31 PM
Load balancing is done at least 3 ways - with hardware, software or DNS. What is your implementation?

dannysullivan
02-11-2005, 06:57 AM
FYI, I've moved this thread to the general SEO area, as the issue applies to engines beyond Google as well. I also slightly renamed the title.

seoloo
02-11-2005, 09:01 AM
Thanks, John and Danny.

I believe our load balancing is mainly done with DNS, or perhaps a combination of DNS and hardware. When a user types in www, the user is automatically round-robined to a cluster of servers at www1, www2, or www3.

Any advice you could provide would be most appreciated!

JohnW
02-11-2005, 09:08 AM
If using DNS, this is not really load balancing, as it does not account for a down server and does not exactly balance but distributes. It is a popular method, however. It can also cause caching problems, so you will need to look at adjusting your TTL. Without knowing the specifics, you might look at rewriting the URLs to make all of the servers show up as www to the outside.
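To illustrate the rewriting idea, here is a minimal Python sketch of mapping any backend hostname back to www before a URL is shown to visitors or spiders. The hostnames and the `canonicalize` helper are made up for the example; a real setup would more likely do this in the web server or proxy configuration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical hostnames, used only for illustration.
CANONICAL_HOST = "www.example.com"
BACKEND_HOSTS = {"www1.example.com", "www2.example.com", "www3.example.com"}


def canonicalize(url: str) -> str:
    """Return the URL with any backend host replaced by the canonical www host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in BACKEND_HOSTS:
        # Keep an explicit port if one was given; any userinfo is dropped.
        if parts.port is None:
            netloc = CANONICAL_HOST
        else:
            netloc = f"{CANONICAL_HOST}:{parts.port}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)
```

A server receiving a request for a www2 URL could answer with a permanent redirect to `canonicalize(url)`, so that links and search results all collect on the www host.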

seoloo
02-11-2005, 08:35 PM
Thanks John. Re-writing is a good idea, but I'm afraid they *may* need to keep the URLs distinct.

I've been doing some analysis of different retail sites, and have found that many do still serve their site on multiple subdomains, and yet they seem to be performing okay vis a vis Google. For example:

www.jcpenney.com - I'd mentioned that they have 57,700 pages spidered on "www", and yet fewer than 500 on their combined www2, www3, www4 servers. It seems that the way they accomplish this is through "cloaking" (surfing to their www site as the GoogleBot, my URL stays "www"; through a regular browser user agent, I get bounced to different servers). Now, I've always heard that cloaking can result in several penalties...my question is...is this legal, or are they playing with fire? (I've observed the same thing for www.flower.com...Googlebot stays on www, while regular users go to another server).

Here's another intriguing one. When I type site:products.proflowers.com, I get results on www.proflowers.com, www.protulips.com, and a bunch of URLs that are NOT "products.proflowers.com". I've never seen this before--do you know what ProFlowers is doing to achieve this? On the surface, it looks like even though users end up on their "products.proflowers.com" site, Google views these pages as if they were on the "www.proflowers.com" site.

I have more examples, but I'm wondering if you could shed some light on these questions?

Thanks so much,
seoloo

Mikkel deMib Svendsen
02-12-2005, 05:29 AM
This is not just a spider issue - it is also a linking issue! If you send users to different URLs (www, www1, www2, etc.), then those are the ones they will link to, and not just your www domain.

I have seen this kind of DNS "load balancing" and it is, I must say, a very risky strategy for SEO. You don't want the engines to decide, or guess, what version to index - YOU should control that. You also want to direct all inbound links to one domain.

I would definitely advise you to either rewrite the URLs or move to a better kind of load balancing.

JohnW
02-12-2005, 12:56 PM
If you have the resources to do it, I would suggest a reverse proxy with rewrite. If this is not possible, you could look at products like LocalDirector from Cisco, or there is a product called Squid to look at. However you do it, you really must keep all of the servers resolving as www or you will have problems.

JCPenney does not rank very well (being polite here) for search, so it really does not make a good example. My guess is that their internet business model is to support catalog sales or whatever, so maybe they do not care much about search results.
I am not going to really comment on your other question, but something to remember is that just because someone else is getting away with something, that does not make it a good long term idea.

lots0
02-12-2005, 04:36 PM
...It seems that the way they accomplish this is through "cloaking" (surfing to their www site as the GoogleBot, my URL stays "www"; through a regular browser user agent, I get bounced to different servers). Now, I've always heard that cloaking can result in several penalties...my question is...is this legal, or are they playing with fire?
Sure it is "legal"; there is no law against IP redirection anywhere in the world.
Are they playing with fire? No way, the googlies would never ban or penalize JCPenney, just like Google will never penalize Amazon, flower.com, Coke.com or any of the many, many large, well-known sites that use IP redirection (cloaking). Google even uses the same technology for their geo-targeting.

The thing is, these large and well-known sites/companies are not going to be "tricking" people into going to porn sites when they are searching for kids' toys or other such deceptions that the SEs (and everyone else) frown on. So why should any SE penalize them if they use IP redirection to better their users' experience?

seoloo
03-03-2005, 09:03 PM
Hi again,

Thanks for your insights. I have some new information about our implementation. It is GSLB: Global Server Load Balancing. To make a long story short, I'm told there is no way around the issue of servers resolving as www1, www2, etc. You can do a search on "GSLB" on Google to see what I mean.

Unfortunately, since very few firms have implemented this, there's not a whole lot of case material on the Web.

Given this, can anyone recommend a specific solution? I assume it will involve some kind of rewriting or redirecting (i.e. acceptable "cloaking"), but I could use some specific recommendations from the experts out there.

Thanks very much.

NevDull
03-04-2005, 02:23 AM
Apparently you need to talk to the vendor of the load balancing solution. "GSLB" as a term specifies few details, but in general, multiple site load balancing, when done by companies that aren't super-cheap, is done by devices like the F5 Networks 3DNS product with BigIP, or the Cisco CSS content switches, etc. They work by having the DNS for the www.foo.com delegated to the load balancers. The load balancers, when answering queries, respond with the IP address to hit... based on several things like current load, which servers are up or down, and proximity of the data center to that user, network-wise.
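As a rough illustration of what such a device does when it answers a DNS query for www, here is a small Python sketch. The `Site` fields, the cost formula and the `load_weight` value are assumptions made for the example, not how 3DNS, BigIP or the Cisco CSS actually compute their answers.

```python
from dataclasses import dataclass


@dataclass
class Site:
    ip: str
    up: bool        # result of the latest health check
    load: float     # 0.0 (idle) to 1.0 (saturated)
    rtt_ms: float   # estimated network distance to the client's resolver


def pick_answer(sites, load_weight=100.0):
    """Pick the IP to hand back for www: skip down sites, then minimise a simple cost."""
    candidates = [s for s in sites if s.up]
    if not candidates:
        return None
    # Assumed cost: network distance plus a penalty proportional to current load.
    best = min(candidates, key=lambda s: s.rtt_ms + load_weight * s.load)
    return best.ip
```

The point for SEO is that every client asks for the same name (www) and only the IP address in the answer changes, so no www1/www2 hostname is ever exposed.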

Rather than ending up looking like an ass to your client, find out what hardware/software they're using for the load balancing, and contact the vendor yourself. Do this with the assumption that you need www.foo.com and not www1 www2 www3, and don't even mention that kind of a solution. It's not a solution, it's a "hack". And for SEO, it's crap.

MANY companies do balancing across data centers. You're looking at this the wrong way.

risk
04-05-2005, 04:16 AM
Load balancing is done at least 3 ways - with hardware, software or DNS. What is your implementation?

this is a bad classification scheme. ignore whether load balancing is done in hardware or software, because that has no bearing on what the users (and the spiders) see. in fact, most load balancers which are considered hardware based (presumably because they use embedded platforms) are indeed software only running on top of a custom platform. the only load balancers which should be called hardware based are the ones which implement load balancing logic in asics and even then a lot of packets will hit the cpu anyway. these technical details are only interesting when discussing reliability, performance and capacity at traffic levels approaching the aggregate line rate.

for the purposes of this discussion, we are interested in classifying load balancing techniques, such as dns, inline layer 3, anycast etc.


If using DNS, this is not really load balancing as it does not account for a down server and does not exactly balance but distributes.


load balancing involves distributing traffic over a cluster of nodes. failure detection is *not* part of the definition. you are thinking of failover, which would be a feature of a high availability cluster. failover features are indeed often present in load balancing solutions in practice, but only because they're a natural extension of load balancing weighting algorithms (if you're testing which server responds the fastest, it's easy to determine if a server doesn't respond at all). there is, of course, much more complexity to a complete high availability solution.

there is no single 'DNS load balancing' technique as it applies to http/https. what you are thinking of is round robin dns, which was popular at one time because it was (and still is) trivial to implement. round robin results in very even distribution of connections (a record is selected from a record set randomly, for the purposes of this discussion at least, and the distribution of choices will thus be even over a statistically valid set of datapoints), which fits into the definition of load balancing very intuitively.

however, you don't really want even distribution of connections at all. you want a new connection to be directed to the node which is most suitable at a given time with suitability being determined using a given set of metrics (most often you want the shortest response time). this is not possible with vanilla round robin because the zone being served is static.
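a quick python sketch of the difference, for illustration only. the node names and the `response_ms` measurements are made up, and the rotation below is a simplification of round robin (the post describes a random pick from the record set, which evens out the same way over many lookups).

```python
import itertools


def round_robin(nodes):
    """Hand out nodes in a fixed rotation, ignoring how busy each one is."""
    return itertools.cycle(nodes)


def fastest_node(response_ms):
    """Pick the node with the lowest measured response time.

    response_ms maps node name -> latest response time in ms, or None when the
    probe got no answer. Nodes with None are skipped, which is the failover
    behaviour that falls out of measuring response times.
    """
    alive = {node: ms for node, ms in response_ms.items() if ms is not None}
    if not alive:
        raise RuntimeError("no healthy nodes")
    return min(alive, key=alive.get)
```

with a static zone only the first function is possible. the second needs something that keeps measuring the nodes and changes its answers accordingly.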

there are plenty of commercial solutions which do not have the above limitations. when you connect to a large irc network through eg irc.efnet.net or irc.dal.net, you're seeing one in action (there's more going on of course, since some may be doing geotargeting as well). akamai's got one (but theirs is infinitely more complex), google's got one (same caveat as with akamai) etc. heck, i wrote one and it was good enough to withstand multiple slashdottings and flash mobs.

Also it can cause caching problems so you will need to look at adjusting your TTL.


you definitely want your TTL below 60 seconds. however, it won't help in all cases. it is important to understand that caching issues will affect your ability to distribute new connections and compensate for node failure, but users will still be able to access your site under normal circumstances. the easiest edge case to dismiss is isps operating over high latency backhaul links such as satellite which employ aggressive dns response caching to improve performance for their customers and may ignore ttls - this usually happens in developing countries in africa and you probably don't care if you can't force one connection you get from the sahara to the least busy cluster node. the most inconvenience is caused by the fact that windows xp caches dns responses for a minimum of 15 or 30 minutes by default (i forget which) regardless of the TTL. yes, microsoft has deliberately fubarred an internet standard in their implementation *again*.
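to make the caching problem concrete, here is a toy python model of a resolver cache. the `DnsCache` class and the 900 second floor are assumptions for illustration (900 is just the "15 minutes" figure from above), not a description of any real resolver.

```python
class DnsCache:
    """Toy resolver cache; min_ttl models a client that ignores short TTLs."""

    def __init__(self, min_ttl=0):
        self.min_ttl = min_ttl
        self._entries = {}  # name -> (ip, expiry time in seconds)

    def put(self, name, ip, ttl, now):
        # A well-behaved cache uses the TTL as given; a sticky one raises it.
        effective_ttl = max(ttl, self.min_ttl)
        self._entries[name] = (ip, now + effective_ttl)

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None  # expired or unknown: the client must ask again
        return entry[0]
```

with `min_ttl=0` a 30 second record is gone after 30 seconds and the next lookup can be steered elsewhere. with `min_ttl=900` the same client keeps hitting the old address for 15 minutes, whatever the zone says.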


Without knowing the specifics, you might look at rewriting the urls to make all of the servers show up as www to the outside.


if he's using some form of dns load balancing, there's no reason to use www1 and www2 in the first place - the distribution of connections occurs when a fqdn (eg www.domain.com) gets resolved to an ip address. url rewriting would have to happen at an earlier stage. i'll explore this in a subsequent post.

-rsk

risk
04-05-2005, 04:33 AM
If you have the resources to do it, I would suggest a reverse proxy with rewrite. If this is not possible, you could look at products like LocalDirector from Cisco, or there is a product called Squid to look at. However you do it, you really must keep all of the servers resolving as www or you will have problems.


neither reverse proxying nor squid are good load balancing solutions, they are hacks folks put together a while ago because there were no other good low cost solutions, the capacity/performance requirements were much more modest and last but not least because they had that diy bug ;]

cisco localdirector is an absolutely horrific product. cisco do certain specific things well, but the rest of their product line is chaff (mainly sourced through acquisitions) that they can push into fortune 1000 accounts, where it's easier for the folks involved to buy crap from an approved vendor than try to buy the appropriate solution for the job and have to fight for approval. the proverbial "no one gets fired for buying ibm" applies to cisco in the networking space nowadays.


I am not going to really comment on your other question, but something to remember is that just because someone else is getting away with something, that does not make it a good long term idea.

excellent piece of advice. i'll take it one step further and apply it to your current situation:

1. big companies rarely make the best technological choices. in fact, most of their purchasing decisions have nothing to do with technology. you would probably fare better by doing the exact opposite of what they're doing.

i won't buy a car because my girlfriend likes the colour, i'll buy the one that delivers the most vroom-vroooooom ;]

2. you should not be looking at large players such as ebay and google - your deployment is nothing like theirs, your requirements are different and you have a heck of a lot less money than them.

following up on the automotive analogy theme we've got going, trying to compare your car's engine to the one in michael schumacher's formula 1 ride isn't very helpful at all.

in practical terms, large players don't just need to load balance their traffic for performance and ensure high availability within a cluster. they also need to ensure availability in case of a site failure (ie google can stay up if one of their datacenters is offline). this is hard to achieve because a lot of applications (such as databases) have very explicit consistency requirements and that's what's eating up the engineering dollars the likes of ebay spend.

-rsk

risk
04-05-2005, 04:56 AM
Hi again,

Thanks for your insights. I have some new information about our implementation. It is GSLB: Global Server Load Balancing. To make a long story short, I'm told there is no way around the issue of servers resolving as www1, www2, etc. You can do a search on "GSLB" on Google to see what I mean.

Unfortunately, since very few firms have implemented this, there's not a whole lot of case material on the Web.

Given this, can anyone recommend a specific solution? I assume it will involve some kind of rewriting or redirecting (i.e. acceptable "cloaking"), but I could use some specific recommendations from the experts out there.


this is bollocks. first off, gslb is an incredibly bad fit for what you likely need. secondly, none of the gslb implementations i've worked with and would recommend for your needs have this requirement. my first suggestion is to punch whoever is perpetrating this horror on you in the nose at the earliest opportunity.

now, on to the meaty details:

gslb, or global server load balancing, involves distributing traffic not only across a cluster of nodes but rather over several clusters which are usually geographically diverse. the two most common implementations are dns-based (*not* vanilla round robin) and anycast. unless your client is a very large company or outsources this to a cdn vendor such as speedera or akamai, chances are they aren't using anycast. a good reference on how both techniques work can be found in the documentation for the foundry networks serveriron range of lbs. when we don't use in-house implementations, we use foundry and they've pretty much got the textbook implementation for those features.

first, find out whether there is a specific requirement for *global* load balancing. if they don't have one, tell them not to do gslb because it's suboptimal for all cases where there is no geographic diversity requirement. if you're successful with this, you're likely going to get an inline layer 3-7 solution which will be invisible to you, the visitors and search engine spiders. if you still end up with this rewriting crap, tell them to hire new engineers and stop trying to sell you the worst engineered solution possible. rinse, repeat.

if you must use gslb, see above for not using the worst solution possible. punch people in the nose as appropriate. there is very little you can do to make what they're proposing se friendly and it's also horrible from an engineering perspective.

-rsk