[SSI] re[2]: HA and process migration
From: Greg F. <fre...@No...> - 2001-07-18 22:34:28
Greg,

I have answered some of your "why" questions below.

My question for everyone is:
Can process migration be implemented in such a way as to move a process and its associated open sockets so that, when the original node fails (or is shut down), there are no client-observable errors?

If the answer is yes, then I think it should be a goal of HA clusters to support this.

If the answer is no, then I agree with Alan Robertson that process migration is anathema to HA clusters.

FYI: The SSI for Linux project has Process Migration listed as a goal, but I don't know if it is to facilitate failover, to chase idle CPUs, or both.

>> On Tue, Jul 17, 2001 at 06:57:13PM -0400, Greg Freemyer wrote:
>> > I'm afraid you overestimate the quality of today's HA clusters.
>> Well, it depends on which kind you're talking about. Many do better
>> than "fallover".

I am aware there are several technology-specific solutions out there, but the general solution used in HA clusters today is failover.

>> And I must say it's a complete mystery why you don't
>> want to change your client to mitigate the problem.

First, we don't always control the client (e.g. browsers, Java plug-ins, database engines, report generators, etc.).

Even when we can, the best we can do is "mitigate the problem"; rarely can we solve it.

In the HA world, there are lots of solutions and techniques that mitigate the loss of state during failover/restart.

Unfortunately, there always seem to be a few situations in which the current session must be dropped and restarted, and the end-user is made aware of the situation.

The controlled process migration solution is merely one more arrow in the quiver, but I think it is a powerful arrow and a great ally to the cluster administrator.

For instance, in the main HA cluster I work with there are typically 250 end-users actively connected.
If there is a need to perform some form of maintenance on the primary server that will cause it to come down, I know it is likely that at least one of the users will be in one of their windows of vulnerability.

(The good news is that, for the few end-users we take down, it only takes a small effort on their part to reconnect to the backup, but we still try hard to avoid the situation.)

Therefore, we do all serious maintenance in the middle of the night (patches, upgrades, OS reconfigurations, etc.).

This works okay for now because we only have users connected from 6am to midnight.

But, of course, the whole goal of HA clusters is 7x24x365, so the above is not good enough.

If a process migration solution can be achieved that would allow an HA cluster administrator to move the services with no external impact, we will have made a major improvement in the operation and administration of HA clusters.

>> > I believe that the above has caused many HA cluster developers to
>> > look into using process migration as a well-behaved method of
>> > maintaining state information in the controlled failover situation. A
>> > few even have it working, but none I have worked with.
>> I'd think that they'd just fix their software to failover properly. If
>> your failover doesn't work, how can you deal with failures?

Unfortunately, the current HA cluster solutions are not bulletproof in the general case. There are cracks in the armor and probably always will be.

I am an application architect designing applications for HA clusters. Some of the components are custom written, some are off the shelf.

As such, I do my best to cost-effectively reduce the likelihood of end-user-observable failures, and when they must occur, I do my best to make them as painless as possible for the end-user.

Greg Freemyer
Internet Engineer
Deployment and Integration Specialist
The Norcross Group
www.NorcrossGroup.com