[SSI] re[2]: HA and process migration

Greg,

I have answered some of your "why" questions below.

My question for everyone is:

Can process migration be implemented in such a way as to move a process and its
associated open sockets so that, when the original node fails (or is shut
down), there are no client-observable errors?

If the answer is yes, then I think it should be a goal of HA clusters to
support this.

If the answer is no, then I agree with Alan Robertson that process migration is
anathema to HA clusters.

FYI: The SSI for Linux project has process migration listed as a goal, but I
don't know whether it is meant to facilitate failover, to chase idle CPUs, or
both.

 >>  On Tue, Jul 17, 2001 at 06:57:13PM -0400, Greg Freemyer wrote:

 >>  > I'm afraid you overestimate the quality of today's HA clusters.

 >>  Well, it depends on which kind you're talking about. Many do better
 >>  than "fallover".

I am aware there are several technology-specific solutions out there, but the
general solution used in HA clusters today is failover.

 >>  And I must say it's a complete mystery why you don't
 >>  want to change your client to mitigate the problem.

First, we don't always control the client (e.g. browsers, Java plug-ins,
database engines, report generators, etc.).

Even when we do, the best we can do is "mitigate the problem"; rarely can we
solve it.

In the HA world, there are lots of solutions and techniques that mitigate the
loss of state during failover/restart.

Unfortunately, there always seem to be a few situations in which the
current session must be dropped and restarted, and the end-user is made aware
of the situation.

The controlled process migration solution is merely one more arrow in the
quiver, but I think it is a powerful arrow and a great ally to the cluster
administrator.

For instance, in the main HA cluster I work with, there are typically 250
end-users actively connected.

If there is a need to perform some form of maintenance on the primary server
that will cause it to come down, I know that at least one of the users is
likely to be in one of their windows of vulnerability.

(The good news is that for the few end-users we take down it only takes a small
effort on their part to reconnect to the backup, but we still try hard to avoid
the situation.)

Therefore, we do all serious maintenance (patches, upgrades, OS
reconfigurations, etc.) in the middle of the night.

This works okay for now because we only have users connected from 6 a.m. to
midnight.

But, of course, the whole goal of HA clusters is 7x24x365, so the above is not
good enough.

If a process migration solution can be achieved that would allow an HA cluster
administrator to move the services with no external impact, we will have made a
major improvement in the operation and administration of HA clusters.

 >>  > I believe that the above has caused many HA cluster developers to
 >>  > look into using process migration as a well-behaved method of
 >>  > maintaining state information in the controlled failover situation.  A
 >>  > few even have it working, but none I have worked with.

 >>  I'd think that they'd just fix their software to failover properly. If
 >>  your failover doesn't work, how can you deal with failures?

Unfortunately, the current HA cluster solutions are not bulletproof in the
general case.  There are cracks in the armor and probably always will be.

I am an application architect designing applications for HA clusters.  Some of
the components are custom written, some are off the shelf.

As such, I do my best to cost-effectively reduce the likelihood of
end-user-observable failures, and when they must occur, I do my best to make
them as painless as possible for the end-user.

Greg Freemyer
Internet Engineer
Deployment and Integration Specialist
The Norcross Group
www.NorcrossGroup.com