From: Bryan T. <br...@sy...> - 2010-07-20 22:04:58
Fred,
If we assume that the deployment installs a full stack suitable for running each of the different kinds of services, and that the machines are capable of running those services, then we can certainly have an operator decide to allocate a hot spare to a specific logical service.
The HotSpareXXXService could provide some indirection for that purpose. I assume that this would be a trivial service which discloses, via Jini, the kinds of services it is willing to start. The operator could then use a console or web application to list the known hot spares and the service types each could support, and make the decision to allocate a hot spare from there.
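Just to make that concrete, here is a minimal sketch of what such a service might look like (nothing like this exists yet; the interface name and methods below are only assumptions):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.Set;

    /**
     * Hypothetical remote interface for a hot spare. A registered instance
     * advertises which service types it is able to start, so an operator
     * console can list the spares and assign one to a logical service.
     */
    public interface HotSpareService extends Remote {

        /** Fully qualified class names of the services this spare can start. */
        Set<String> getSupportedServiceTypes() throws RemoteException;

        /**
         * Start an instance of the named service type and have it join the
         * given logical service (e.g., the quorum for a specific data service).
         */
        void activate(String serviceType, String logicalServiceId)
                throws RemoteException;
    }

The operator console would discover these through the usual Jini lookup and call activate() once the decision is made.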
In terms of the complexity of decision making, hot spare allocation (under the design that we have been targeting) would occur after a suitable unplanned downtime interval - on the order of one or two minutes. At that point a pre-imaged node capable of starting the desired service would start the service and enter into the resynchronization protocol with the quorum. The purpose of the interval before automated allocation is to hide transient failures. If a node were experiencing repeated transient failures, then you would probably want to force the allocation of the hot spare. On the other hand, if the node came back online and was able to resynchronize, then the hot spare would be retired.
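To sketch just the timing behavior described above (the class and its methods are hypothetical, not anything in the codebase):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    /**
     * Illustration only: delays automated hot spare allocation for a fixed
     * interval so that transient failures do not trigger an allocation. If
     * the failed node rejoins and resynchronizes before the interval expires,
     * the pending allocation is cancelled.
     */
    public class DelayedSpareAllocator {

        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        private ScheduledFuture<?> pending;

        /** Called when a quorum member is observed to have failed. */
        public synchronized void onMemberFailure(Runnable allocateSpare,
                long delaySeconds) {
            pending = scheduler.schedule(allocateSpare, delaySeconds,
                    TimeUnit.SECONDS);
        }

        /** Called if the failed member comes back and resynchronizes in time. */
        public synchronized void onMemberRejoin() {
            if (pending != null && pending.cancel(false)) {
                pending = null; // Transient failure: no spare allocated.
            }
        }
    }

Forcing the allocation for a continually flapping node would just mean invoking allocateSpare immediately rather than waiting out the interval.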
I agree that hot spare (de-)allocation adds complexity to the HA milestone. A related issue which I had planned to defer beyond the initial HA milestone is dynamically changing the size of the quorum, e.g., migrating a cluster from k=1 (no failover) to k=3, or from k=3 to k=5 (increased redundancy).
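A rough illustration of the bookkeeping such a resize would involve, assuming k is kept odd so that a strict majority stays well defined (again, nothing here is actual code, just a sketch):

    /**
     * Hypothetical helper illustrating a change of the replication factor,
     * e.g. k=1 (no failover) to k=3, or k=3 to k=5.
     */
    public final class QuorumResize {

        private QuorumResize() {}

        /** Number of joined services required for a quorum of size k to meet. */
        public static int majority(int k) {
            return (k + 1) / 2;
        }

        /** Validate a proposed change of the replication factor. */
        public static void validateTarget(int current, int target) {
            if (target < 1 || target % 2 == 0) {
                throw new IllegalArgumentException(
                        "k must be a positive odd number: " + target);
            }
            if (target == current) {
                throw new IllegalArgumentException(
                        "replication factor unchanged: " + target);
            }
            // Growing the quorum (e.g., 1 -> 3) means recruiting and
            // resynchronizing (target - current) additional services before
            // the new k takes effect; shrinking retires the excess services.
        }
    }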
Bryan
> -----Original Message-----
> From: Fred Oliver [mailto:fko...@gm...]
> Sent: Tuesday, July 20, 2010 5:41 PM
> To: Bryan Thompson
> Cc: Bigdata Developers
> Subject: Re: [Bigdata-developers] Why zookeeper?
>
> Bryan,
>
> We have no plan to use hot spares. That is, we expect an HA
> system to stay up and running long enough for an operator to
> diagnose a problem, and either fix it or manually configure
> and start a cold spare if necessary. I feel that automating
> this process when the actual faults are not understood is
> more likely to cause harm than help.
>
> Otherwise, why not have a set of HotSpare{Data,Metadata,...}Service
> instances configured and running on the hot spare machines,
> ready to become full participants when an HA quorum leader
> (or whatever
> mechanism) identifies a need? When a new XXXService is
> needed, a HotSpareXXXService is discovered and activated,
> registering a real XXXService. (Credit to Sean.)
>
> Fred
>
>
> On Tue, Jul 20, 2010 at 5:16 PM, Bryan Thompson
> <br...@sy...> wrote:
> > Fred,
> >
> > If you are not running the SMS, then you can simply start
> whatever services you want to start locally. The SMS is not
> used for anything besides actually starting the various
> services. Alternatively, a simpler SMS implementation could
> be used which read the list of services to start from a local
> configuration and ignored zookeeper.
> >
> > How would you propose to handle HA in that scenario? There
> is still a problem with dynamic recruitment from the pool of
> hot spares.
> >
> > Thanks,
> > Bryan
>