|
From: Fred O. <fko...@gm...> - 2010-07-20 03:29:58
|
Why does bigdata use zookeeper? Isn't the bigdata system more complex with zookeeper than it would be without it? Or is zookeeper going away as part of the "shared nothing" effort?

I have the impression that zookeeper is used for three things:

Zookeeper appears to be used to communicate information between bigdata services. (Like spies passing messages in dead drops?) Wouldn't it be simpler and more straightforward for services to ask each other for the information they need?

Zookeeper appears to be used to persist data about each service. Why wouldn't each service persist its own state privately? What is the benefit of allowing any service to see another service's persisted state?

Zookeeper has something to do with locks and synchronization? What needs to be synchronized? For example, ticket #111 shows that zookeeper is involved in preventing a DataService from starting if a TransactionService is not running. But the DataService should be robust enough to operate, at least to some extent, even if the TransactionService fails or becomes unreachable. And if the DataService is robust enough to wait for the TransactionService to come online, why impose this artificial synchronization?

Could zookeeper be evolved out of the system, simplifying it?

Thank you,

Fred
|
From: Bryan T. <br...@sy...> - 2010-07-20 13:26:43
|
Fred,

Bigdata uses zookeeper for:

- configuration management.
- synchronous distributed locks.
- coordination of the service leader elections for HA.

At this time, I believe that the only information persisted in zookeeper is the configuration metadata and the description of distributed jobs (such as the bulk loader). You can see all of this using the DumpZookeeper utility (dumpZoo.sh). I prefer to think of zookeeper as providing robust, REST-like data transparency with preconditions on data updates and triggers.

BrianM and I have discussed that capabilities for creating global synchronous locks and leader election patterns are available in some Jini distributions, but not in the open source Jini release that we are bundling. However, using Jini for this would only move the state from one distributed system to another, not eliminate "dead drops". Also, one of the attractive propositions of zookeeper is the ability to integrate distributed systems which share no other infrastructure, and it supports a number of language bindings (C, Java, etc.) for that purpose. In this regard it is more flexible than a Jini system with support for creating global synchronous locks.

Services store their configuration state locally for restart in the service directory. However, we also need to know which persistent services exist, even if they are not running at the moment. That information is captured in zookeeper. For example, if the target #of logical data services (LDS) is 10, then we do not want to start a new data service if one goes down, because the persistent state of the data service can be terabytes of files on local disks and is part of the semantics of its service "identity".

Likewise, in HA, we have to manage the decision to start a new physical data service (PDS) in a given bank of logical data services. For example, if we are going to update the installed s/w or h/w on a PDS, then we can schedule down time to differentiate it from a node failure. The LDS can maintain its quorum in the face of scheduled down time (e.g., 2 out of 3) without triggering the recruitment and synchronization of a new PDS instance for that LDS node. These are distinctions which could not be captured by local state on the node, since the node could be powered off during a hardware update.

Bryan
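To make "preconditions on data updates and triggers" concrete, here is a minimal sketch against the plain ZooKeeper Java client: a versioned setData() is the precondition, and a watch is the trigger. The znode path and the stored value are hypothetical, not bigdata's actual layout (DumpZookeeper shows the real one).

    import java.nio.charset.StandardCharsets;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    /**
     * Illustrates "preconditions on data updates" (versioned writes) and
     * "triggers" (watches) against a configuration znode. The znode path and
     * value are hypothetical, purely for illustration.
     */
    public class ConfigZNodeExample {

        public static void main(String[] args) throws Exception {

            final ZooKeeper zk = new ZooKeeper("localhost:2181", 30000,
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            // Default watcher: receives connection events and
                            // any watches registered below ("triggers").
                            System.out.println("event: " + event);
                        }
                    });

            // Hypothetical configuration znode, assumed to already exist.
            final String zpath = "/example/config/targetLogicalDataServiceCount";

            // Read the current value and leave a watch for the next change.
            final Stat stat = new Stat();
            final byte[] data = zk.getData(zpath, true/* watch */, stat);
            final int target = Integer.parseInt(new String(data, StandardCharsets.UTF_8));

            try {
                // "Precondition": the update only succeeds if nobody else has
                // modified the znode since we read version stat.getVersion().
                zk.setData(zpath,
                        Integer.toString(target + 1).getBytes(StandardCharsets.UTF_8),
                        stat.getVersion());
            } catch (KeeperException.BadVersionException ex) {
                // Someone else updated the configuration first; re-read and retry.
            }

            zk.close();
        }
    }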
|
From: Fred O. <fko...@gm...> - 2010-07-20 14:59:12
|
On Tue, Jul 20, 2010 at 9:24 AM, Bryan Thompson <br...@sy...> wrote:

> In this regard it is more flexible than a Jini system with support for creating global synchronous locks.

I believe that global synchronous locks are more harmful than helpful, in general. That's why I would like to know how global synchronous locks help bigdata (not that zookeeper is a bad way to do it, if necessary).

> Services store their configuration state locally for restart in the service directory. However, we also need to know which persistent services exist, even if they are not running at the moment. That information is captured in zookeeper. For example, if the target #of logical data services (LDS) is 10, then we do not want to start a new data service if one goes down, because the persistent state of the data service can be terabytes of files on local disks and is part of the semantics of its service "identity".

I'm not clear about the problem you are describing. Say we have a DataService #5 configured on host H with a persistence directory D containing its UUID, journals, indices, etc. It is either running and registered with the lookup service (with that UUID) or not. If the service starter/manager on host H needs this service to be running and it is not registered, start it. (There is a strictly local problem of starting a duplicate java process because of race conditions, but the service itself should detect and prevent that.) I don't see how global synchronization is involved here.

Can you give another example of the need for global synchronization (excluding HA), or point out what I am missing?

Fred
|
From: Bryan T. <br...@sy...> - 2010-07-20 16:38:58
|
Fred,

We should have a separate discussion concerning how bigdata allocates and starts services. I am a bit crushed for time right now, but maybe we could take that up next week?

We only use global synchronous locks at the moment in service startup logic, in HA, and to lock out different masters for the same distributed job. However, I think that specific applications of bigdata may well want to use global synchronous locks to make operations such as life cycle management of a triple or quad store atomic.

Concerning your example, service #5 is either running or it is not, but we also need to know whether or not it has been created. Bigdata services have huge amounts of persistent state. We cannot simply substitute another service as a replacement for #5 if it should fail. Instead we have to recruit a new service, synchronize it with the state of the quorum to which #5 belonged, and then bring the new service online atomically once it is caught up with the quorum. This "hot spare" allocation process could take hours to bring a newly recruited node into full synchronization. We speed that up by working backwards in history, so the data service quickly has a view of the current commit point and then builds up its history over time.

Bryan
|
From: Fred O. <fko...@gm...> - 2010-07-20 17:20:11
|
On Tue, Jul 20, 2010 at 12:38 PM, Bryan Thompson <br...@sy...> wrote:

> We should have a separate discussion concerning how bigdata allocates and starts services. I am a bit crushed for time right now, but maybe we could take that up next week?

I understand you'll be unavailable. Please pick up the conversation when you can.

> We only use global synchronous locks at the moment in service startup logic, in HA, and to lock out different masters for the same distributed job. However, I think that specific applications of bigdata may well want to use global synchronous locks to make operations such as life cycle management of a triple or quad store atomic.
>
> Concerning your example, service #5 is either running or it is not, but we also need to know whether or not it has been created. Bigdata services have huge amounts of persistent state. We cannot simply substitute another service as a replacement for #5 if it should fail. Instead we have to recruit a new service, synchronize it with the state of the quorum to which #5 belonged, and then bring the new service online atomically once it is caught up with the quorum. This "hot spare" allocation process could take hours to bring a newly recruited node into full synchronization. We speed that up by working backwards in history, so the data service quickly has a view of the current commit point and then builds up its history over time.

I specifically meant to disregard HA for this conversation, as bigdata has a long history with zookeeper outside of HA. HA is a much longer conversation, and I don't want to confuse tomorrow's issues with today's code.

With that said, I don't understand what "whether or not it has been created" means. DataService #5 could be called "created" when the operator wrote (DataService, host H, persistence directory D) in a config file on host H and created directory D with #5 in it. After that, the service is either running or not running.

Can you explain where global synchronous locks are needed in the service startup logic?

Fred
|
From: Bryan T. <br...@sy...> - 2010-07-20 18:19:57
|
Fred,

Zookeeper is used by the ServiceManagerServices (SMS) to make shared decisions about which nodes will start which services. The configuration information for the services includes optional constraints on the nodes on which they can run, on the services which must be running before they can start (and I agree with you that services should start without such preconditions and await the appropriate events), etc.

Each SMS enters into a queue for each logical service type registered against zookeeper. When an SMS is at the head of that queue it has the "lock" (this is the zookeeper lock pattern). It then decides whether or not it can start that service type given the node's capabilities and the constraints (if any) for that logical service type. If it can, it starts the service. This decision making process needs to be globally atomic in order to prevent concurrent starts of the same service on different nodes.

Bryan
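For reference, the "zookeeper lock pattern" mentioned above is the standard recipe built on ephemeral sequential znodes. The sketch below is illustrative only (the lock path is hypothetical, and it gives up rather than watching its predecessor); it is not the bigdata implementation.

    import java.util.Collections;
    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    /**
     * Standard ZooKeeper lock recipe: each contender adds an ephemeral
     * sequential child under the lock znode; the contender with the lowest
     * sequence number holds the lock. The lock path passed in is hypothetical.
     */
    public class GlobalLockSketch {

        public static boolean tryAcquire(ZooKeeper zk, String lockZPath) throws Exception {

            // Enter the queue: ephemeral (released if our session dies) and
            // sequential (zookeeper appends a monotonically increasing counter).
            final String ourZNode = zk.create(lockZPath + "/member-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // We hold the "lock" iff our znode sorts first among the children.
            final List<String> children = zk.getChildren(lockZPath, false/* watch */);
            Collections.sort(children);

            final String ourName = ourZNode.substring(ourZNode.lastIndexOf('/') + 1);
            if (ourName.equals(children.get(0))) {
                return true; // head of the queue: safe to decide/start the service.
            }

            // A real implementation would watch the predecessor znode and wait;
            // here we simply give up our place in the queue.
            zk.delete(ourZNode, -1/* any version */);
            return false;
        }
    }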
|
From: Bryan T. <br...@sy...> - 2010-07-20 21:17:05
|
Fred,

If you are not running the SMS, then you can simply start whatever services you want to start locally. The SMS is not used for anything besides actually starting the various services. Alternatively, a simpler SMS implementation could be used which reads the list of services to start from a local configuration and ignores zookeeper.

How would you propose to handle HA in that scenario? There is still a problem with dynamic recruitment from the pool of hot spares.

Thanks,
Bryan

> -----Original Message-----
> From: Fred Oliver [mailto:fko...@gm...]
> Sent: Tuesday, July 20, 2010 5:07 PM
> To: Bryan Thompson
> Subject: Re: [Bigdata-developers] Why zookeeper?
>
> I see. You've created a dynamic assignment of services to hosts. I understand how this might work for transient services (no persistence), but I don't understand why persistent services would be dynamically assigned. I assume you intended to create or allow a non-deterministic behaviour.
>
> For a production environment, we want to have as deterministic a behaviour as possible. This allows for simpler code, simpler and more complete tests, and more easily diagnosed faults. I would prefer a startup scheme where each host has its own configuration, which describes exactly those services which are to run on that host. The service layout was already determined in order to acquire the hardware.
>
> Would you consider changing the startup mechanism to something this simple? There would be some convenience in a tool which is given a cluster configuration and spits out configurations specific for each participating host.
>
> Fred
|
From: Fred O. <fko...@gm...> - 2010-07-20 21:40:39
|
Bryan,
We have no plan to use hot spares. That is, we expect an HA system to
stay up and running long enough for an operator to diagnose a problem,
and either fix it or manually configure and start a cold spare if
necessary. I feel that automating this process when the actual faults
are not understood is more likely to cause harm than help.
Otherwise, why not have a set of HotSpare{Data,Metadata,...}Service
instances configured and running on the hot spare machines, ready to
become full participants when an HA quorum leader (or whatever
mechanism) identifies a need? When a new XXXService is needed, a
HotSpareXXXService is discovered and activated, registering a real
XXXService. (Credit to Sean.)
Fred
|
|
From: Bryan T. <br...@sy...> - 2010-07-20 22:04:58
|
Fred,
If we assume that the deployment consists of installing a full stack suitable for running each of the different kinds of services and that the machines are capable of running those services, then we can certainly have an operator make a decision to allocate a hot spare to a specific logical service.
The HotSpareXXXService could be used to provide some indirection for that purpose. I assume that this would be a trivial service which discloses, via Jini, the kinds of services it is willing to start. The operator could then use a console or web application to list the known hot spares and the service types each could support, and then make the decision to allocate a hot spare.
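As a purely illustrative sketch (hypothetical names and signatures; no such interface exists in the code base), that trivial service might expose no more than this:

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.Set;
    import java.util.UUID;

    /**
     * Hypothetical remote interface for the proposed HotSpareXXXService: a
     * trivial, stateless service that a console or the SMS could discover via
     * Jini and ask to start a real service of a given type. Names and
     * signatures are illustrative only.
     */
    public interface HotSpareService extends Remote {

        /** The service types this spare node is able to start (e.g. DataService). */
        Set<Class<?>> getSupportedServiceTypes() throws RemoteException;

        /**
         * Ask the spare to start a real service of the given type, joining the
         * identified logical service (quorum). Returns the new service's UUID.
         */
        UUID activate(Class<?> serviceType, UUID logicalServiceId) throws RemoteException;
    }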
In terms of the complexity of decision making, hot spare allocation (under the design that we have been targeting) would occur after a suitable unplanned downtime interval - on the order of one or two minutes. At that point a pre-imaged node capable of starting the desired service would start the service and enter into the resynchronization protocol with the quorum. The purpose of the interval before automated allocation is to hide transient failures. If a node was having continuing transient failures then you would probably want to force the allocation of the hot spare. On the other hand, if the node came back online and was able to resynchronize, then the hot spare would be retired.
I agree that hot spare (de-)allocation adds complexity to the HA milestone. A related issue which I had planned to defer beyond the initial HA milestone is dynamically changing the size of the quorum, e.g., migrating a cluster from k=1 (no failover) to k=3, or from k=3 to k=5 (increased redundancy).
Bryan
|
|
From: Fred O. <fko...@gm...> - 2010-07-20 22:26:42
|
Bryan,

On Tue, Jul 20, 2010 at 6:04 PM, Bryan Thompson <br...@sy...> wrote:

> The HotSpareXXXService could be used to provide some indirection for that purpose. I assume that this would be a trivial service which discloses, via Jini, the kinds of services it is willing to start. The operator could then use a console or web application to list the known hot spares and the service types each could support, and then make the decision to allocate a hot spare.

The operator can create and install whatever configuration file is necessary on the cold spare host and start the failed services. The HotSpareDataService (etc.) idea is for the automated approach, without SMS synchronization.

Would you consider changing the startup mechanism to a simple, per-host configuration, non-synchronizing mechanism?

Fred
|
From: Bryan T. <br...@sy...> - 2010-07-20 23:09:34
|
Fred,

As I said, there is no real dependency on the zookeeper / SMS mechanisms for service start outside of the existing bigdata init.d style script and HA. HA does have dependencies on zookeeper, but could run with other mechanisms for service start. It is important for the HA/Quorums that the logical services declare their target replication count. Whether an operator or the SMS handles the service start is less important.

However, let me ask why you would introduce a different mechanism when there is an existing implementation which is a more flexible solution. Certainly there are a few outstanding issues against the current SMS/Zookeeper integration, but these are readily resolved. Further, the zookeeper/SMS approach can be templated just as you are suggesting for this alternative. In fact, the bigdata configuration file essentially provides that template. So, why should we explore a new mechanism? I suggest that it is easy enough to reduce the existing mechanism to the kind of simple template you are describing, but the reverse is not true. Why would this be a step forward? Also, for people who want the ability to "flex" the cluster, the existing mechanism is more flexible. Would your proposal in fact lead to two different service configuration/start mechanisms, and if so, might that make the system less maintainable?

I will also note that the existing bigdata init.d style script can be run without shared state. All it requires is a mechanism, such as puppet, to execute 'bigdata XXX', where XXX is the target run state for that node. At present this is done with cron, but that is certainly not a requirement of the script. Likewise, there is no requirement to aggregate logs to a logging service, etc.

Bryan
|
From: Fred O. <fko...@gm...> - 2010-07-26 18:18:53
|
"Flexibility" seems to be in the eyes of the beholder. I would characterize the existing mechanism as more adaptive, but less flexible. That is, you get "flexing", but you lose fine grained control. I think that adaptivity and fine grained control are mutually exclusive goals. The finer grained control has a number of advantages: * It is deterministic. Services always run on the same host and with the same configuration every time. Easier to diagnose faults. Easier to test. Every time the system starts up, it starts up the same way, and deviations are errors. * There is no need for locking. That is, I think the locking is needed as a result of the non-deterministic behavior. (No need for a service manager to wait to find out if another host's service manager grabbed a lock to start a service before attempting to start one itself.) * It allows for matching the distribution of services to heterogeneous hardware obtained to run them. The operators knew which machines were purchased to run which services and why, and should be able to specify that the services run on specific machines. * If the operator added hardware to a cluster for a specific need, the operator should be able to specify that the hardware be used to address the need. * It allows for more specific control of individual services. (eg. How would you separate the service directory from the mass storage directory? How do you configure N data services per host to run on N independent drives instead of a RAID? How would that perform?) On the flip side, I am not clear on what the benefits of the adaptivity or "flexing" really are in this context. Flexing seems more related to "cloud" environments where hardware is instantly available but managing persistent data is very difficult. Could you elaborate on the benefits you perceive from adaptivity? Removing or separating out the adaptive behavior (and zookeeper, or limiting zookeeper's use to HA leader election) removes moving parts, increases visibility and understanding of the code, and improves maintainability significantly. We would like to see bigdata become modular, to the point where the service manager (and its use of zookeeper) can be implemented in its own optional module. Is the starting of each of the services individually from the command line or script possible without need for zookeeper (if it really is limited to the services manager service)? If so, then this isn't a second mechanism at all. In either case, supporting a second service starting arrangement seems like a small price relative to the simplicity gained. Fred |
|
From: Bryan T. <br...@sy...> - 2010-07-26 18:38:44
|
Fred,

I've been out for a bit with my head wrapped around other things. Can you remind me how we are going to handle the assignment of physical service nodes to logical service nodes in this design?

Concerning your points below, either scheme can be made fully deterministic. It is only a matter of specifying that a specific service must run on a specific host (a constraint on which services can run on a given host).

Maybe another way to look at this is instance-level specification versus rules-based specification. With instance-level specification, you directly state what services are running on each node. With rules-based specification, you describe which kinds of services can run on which kinds of nodes. I see rules-based specification as more flexible because you can administer the rule set rather than making a decision for each node that you bring online. I agree that it is more adaptive, since the constraints are being specified at a level above the individual node. I see the rules as globally transparent because they are just data, which could be edited using, for example, a web browser backed by an application looking at the data in zookeeper, whereas the instance-level specification must be edited on each node. I think of rules as more scalable because you do not have to figure out what you are going to do with each node. The node will be put to a purpose for which it is fit and for which there is a need.

However, as long as we have a reasonable path for HA service allocation which respects the need to associate specific physical service instances with specific logical service instances, then it seems reasonable that either approach could be used. It just becomes a matter of how we describe what services the node will start and whether or not we run the ServicesManagerService on that node.

Bryan

PS: Concerning "flex", the big leverage for flexing the cluster will come with a shared disk architecture (rather than the present shared nothing architecture). Using a shared disk architecture, the nodes can then be started or stopped without regard to the persistent state, which would be on managed storage. That would make it possible to trade off dynamically which nodes were assigned to which application, where the application might be bigdata, hadoop, etc. In that kind of scenario I find it difficult to imagine that an operator will be in the loop when a node is torn down and then repurposed to a different application. However, this kind of "flex" is outside the scope of the current effort.
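As a toy contrast of the two styles (hypothetical names only; this is not bigdata's configuration model):

    import java.util.List;
    import java.util.Map;

    /**
     * Instance-level vs. rules-based specification, in miniature.
     * Everything here is hypothetical and purely illustrative.
     */
    public class SpecificationStyles {

        /** Instance-level: state exactly which services run on which host. */
        static final Map<String, List<String>> INSTANCE_SPEC = Map.of(
                "host1", List.of("TransactionService", "MetadataService"),
                "host2", List.of("DataService#1", "DataService#2"));

        /** Rules-based: describe which kinds of services may run on which kinds of nodes. */
        interface Rule {
            boolean canRun(String serviceType, Node node);
        }

        static class Node {
            final int cores;
            final long localDiskBytes;
            Node(int cores, long localDiskBytes) {
                this.cores = cores;
                this.localDiskBytes = localDiskBytes;
            }
        }

        /** e.g., "a DataService may run on any node with at least 1TB of local disk". */
        static final Rule DATA_SERVICE_RULE = (serviceType, node) ->
                "DataService".equals(serviceType) && node.localDiskBytes >= 1_000_000_000_000L;
    }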
|
From: Fred O. <fko...@gm...> - 2010-07-26 22:05:22
|
On Mon, Jul 26, 2010 at 2:37 PM, Bryan Thompson <br...@sy...> wrote:

> I've been out for a bit with my head wrapped around other things. Can you remind me how we are going to handle the assignment of physical service nodes to logical service nodes in this design?

What is a node (physical or logical) in this context? I think you mean that a physical node is a machine. If so, then what is a logical node?

As I understand the services, all service instances are physical. The logical service construct exists only as an abstraction on which the rules in the rules-based specification may operate. If I understand correctly, then with the instance-level specification there are no logical services and no logical nodes. But still, what's a logical node?

> Concerning your points below, either scheme can be made fully deterministic. It is only a matter of specifying that a specific service must run on a specific host (a constraint on which services can run on a given host).

If you can make the rules-based scheme deterministic, then please do! But I think you meant only that you can write rules that constrain the behavior such that the results in those particular cases are deterministic, which is an entirely different matter. The rules-based scheme makes for much more code, locking and synchronization, very difficult testing, and a less maintainable environment.

> I see rules-based specification as more flexible because you can administer the rule set rather than making a decision for each node that you bring online. I agree that it is more adaptive, since the constraints are being specified at a level above the individual node. I see the rules as globally transparent because they are just data, which could be edited using, for example, a web browser backed by an application looking at the data in zookeeper, whereas the instance-level specification must be edited on each node. I think of rules as more scalable because you do not have to figure out what you are going to do with each node. The node will be put to a purpose for which it is fit and for which there is a need.

I think we're going to disagree about the merits of instance vs. rules schemes, and I hope we can modularize the system so that the schemes are separate modules and independent of the core functionality (which wouldn't need zookeeper).

My biggest concern about that last paragraph (or the whole message?) is that this use of zookeeper seems unnecessary and confusing. That is, why wouldn't the web app interact with the service instances directly to get/set configurations using well defined, testable public interfaces, rather than use zookeeper as a hub? (That's the secret messages in dead drops thing.)

> However, as long as we have a reasonable path for HA service allocation which respects the need to associate specific physical service instances with specific logical service instances, then it seems reasonable that either approach could be used. It just becomes a matter of how we describe what services the node will start and whether or not we run the ServicesManagerService on that node.

Clearly HA needs a set of like service instances to work in active/active or active/passive arrangements. The term "logical service" seems overloaded in that (as far as I have figured out) it has different meanings in the pre-HA and post-HA discussions. I can see that an HA logical data service would refer to the group of data service instances which together host a single shard. But this definition is very specific and differs from the more general meaning in the rules-based specification discussion, which is confusing.

The instance-based scheme can be used for HA as well, as long as the service configurations are extended to indicate which "HA logical" group a service belongs to.

Fred

> PS: Concerning "flex", the big leverage for flexing the cluster will come with a shared disk architecture (rather than the present shared nothing architecture). Using a shared disk architecture, the nodes can then be started or stopped without regard to the persistent state, which would be on managed storage. That would make it possible to trade off dynamically which nodes were assigned to which application, where the application might be bigdata, hadoop, etc. In that kind of scenario I find it difficult to imagine that an operator will be in the loop when a node is torn down and then repurposed to a different application. However, this kind of "flex" is outside the scope of the current effort.

OK. I see the primary benefit of this arrangement as making hot spares become operational much more quickly, but I don't see how this applies to the rules vs. instance based specifications discussion. Both schemes can handle this arrangement.

Fred
|
From: Bryan T. <br...@sy...> - 2010-07-26 22:17:18
|
Fred,

> The term "logical service" seems overloaded in that (as far as I have figured out) it has different meanings in the pre-HA and post-HA discussions.

It is the same usage. The pre-HA code base was designed with the HA feature set in mind. A logical service corresponds to some collection of actual service instances which provide the highly available logical service.

I get that you are not fond of the rules-based scheme. What I would like to know is how HA will be handled within the scheme that you are proposing.

Thanks,
Bryan
|
From: Bryan T. <br...@sy...> - 2010-07-26 22:22:10
Attachments:
Bigdata-HA-Quorum-Detailed-Design.pdf
|
Fred,

Can you provide some examples of the HA-enabled per-instance configuration? I would like the instance-based configuration to be compatible with the rules-based approach. In addition, the services will need to publish certain metadata about the logical service instances into zookeeper for the HA quorums. I've attached a current copy of the HA/zookeeper integration document, which specifies the zookeeper paths that are used by the quorum. Everything is organized under the zpath of the logical service. Take a look and then let's see if we can bring this down to some concrete points for both mechanisms to operate on.

Bryan
|
From: Fred O. <fko...@gm...> - 2010-07-30 21:48:37
|
Bryan,

Services (started by per-instance configuration) can register with the lookup service under their physical ID, and include the logical ID of their group (quorum), if any, in their service entries, allowing them to be discovered by logical ID. I can imagine a new stateless service whose sole function is binding logical IDs to unbound physical services, taking into account hardware configuration, rack/network awareness, etc.

Now that we have groups of physical services mapped to logical services, there is still the need for leader election. I'm willing to accept using zookeeper for that limited purpose, though I think I'd prefer some embedded paxos implementation if I had one on hand. Leaders can identify themselves with service entries in their registrations, allowing service leaders to be easily discovered. (Timing and failure scenarios would have to be considered carefully. If the idea of registering as a service leader doesn't work, then clients can always ask a service who its leader is.) It is important to me that clients need not consult zookeeper directly, limiting the number of processes being coordinated to the smallest possible number and decreasing complexity.

Once service leaders have been established, each leader can communicate directly with its group members to complete initialization, identify and share the most recent updates if needed, etc. I think that using zookeeper as an information conduit where direct service-to-service communication would suffice drives up the complexity unnecessarily.

Fred
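A rough sketch of what discovery by logical ID could look like with a plain Jini attribute (the entry class and field names are hypothetical, purely to illustrate the proposal):

    import net.jini.core.entry.Entry;
    import net.jini.core.lookup.ServiceItem;
    import net.jini.core.lookup.ServiceMatches;
    import net.jini.core.lookup.ServiceRegistrar;
    import net.jini.core.lookup.ServiceTemplate;

    /**
     * A service advertises the logical service (quorum) it belongs to as a
     * Jini attribute, so peers can discover all members of a logical service
     * by lookup. Entry class and field names are hypothetical.
     */
    public class LogicalGroupLookupSketch {

        /** Hypothetical attribute naming the logical service a physical instance belongs to. */
        public static class LogicalServiceGroup implements Entry {
            private static final long serialVersionUID = 1L;
            public String logicalServiceId; // public field, as Jini entries require
            public LogicalServiceGroup() {}
            public LogicalServiceGroup(String logicalServiceId) {
                this.logicalServiceId = logicalServiceId;
            }
        }

        /** Find all physical services registered under the given logical service ID. */
        public static ServiceItem[] findMembers(ServiceRegistrar registrar,
                String logicalServiceId) throws java.rmi.RemoteException {

            final ServiceTemplate template = new ServiceTemplate(
                    null/* any serviceID */,
                    null/* any service type */,
                    new Entry[] { new LogicalServiceGroup(logicalServiceId) });

            final ServiceMatches matches = registrar.lookup(template, Integer.MAX_VALUE);
            return matches.items;
        }
    }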
|
From: Bryan T. <br...@sy...> - 2010-08-02 14:12:33
|
Fred, You are positing a new service which handles the binding of the available physical services to the required logical services. How do you plan to make that logical to physical binding service highly available? It seems to me that this centralizes an aspect of the distributed decision making which is currently performed by the ServicesManagerServer. If you make this binding service highly available, will you have recreated a distributed decision making service? It would of course be a simpler service since it does not handle service start decisions, but only service bind decisions. Clients in this context refers to (a) ClientServices participating in a bulk load; (b) nodes providing SPARQL end points; and (c) the DataServices themselves, which use the same discovery mechanisms to locate other DataServices when moving shards or executing pipeline joins. The hosts which originate SPARQL requests are HTTP clients, but they are not bigdata federation clients and do not have any visibility into zookeeper, jini, or the bigdata services. Given that "clients" are themselves bigdata services and that Zookeeper scales nicely to 1000s of nodes, why have clients go directly to the services rather than watching the quorum state in zookeeper? Bryan |
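To make "watching the quorum state in zookeeper" concrete, a minimal sketch of a ZooKeeper child watch; the zpath shown is illustrative only (the real paths are the ones defined in the HA/zookeeper integration document).

    import java.util.List;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Minimal client that watches the children of a quorum znode and is notified
    // (rather than polling) whenever members join, leave, or the leader changes.
    public class QuorumWatcher implements Watcher {
        private final ZooKeeper zk;
        private final String quorumZPath; // e.g. "/bigdata/logicalService/LDS-3/quorum" (illustrative)

        public QuorumWatcher(ZooKeeper zk, String quorumZPath) {
            this.zk = zk;
            this.quorumZPath = quorumZPath;
        }

        public void start() throws Exception {
            // Registers this object as the watch; ZooKeeper fires process() on the next change.
            List<String> members = zk.getChildren(quorumZPath, this);
            onQuorumChange(members);
        }

        public void process(WatchedEvent event) {
            // Watches are one-shot: re-register by reading the children again.
            try {
                List<String> members = zk.getChildren(quorumZPath, this);
                onQuorumChange(members);
            } catch (Exception e) {
                // real code must handle connection loss and session expiry here
            }
        }

        private void onQuorumChange(List<String> members) {
            System.out.println("quorum members now: " + members);
        }
    }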
|
From: Bryan T. <br...@sy...> - 2010-08-02 17:55:33
|
Fred, Rather than create a new service for which we then have to devise a new HA approach, why don't you write a simple Java program or script that will generate the configuration files for each node, including their physical to logical service assignments, and then use puppet to SSH those files onto the deployment cluster? Since you want a static configuration, it seems that this approach is much more likely to be error free since the script "knows" which node gets which configuration file and the assignments can all be done easily within nested loops using the IP address pool for the target cluster. Bryan |
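A sketch of the kind of generator suggested here; the property names, file naming, and IP pool are invented for illustration and would have to be adapted to the actual per-instance configuration syntax.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.UUID;

    // Generates one configuration file per node, statically assigning each physical
    // data service to a logical service (quorum) with a fixed replication factor.
    public class GenConfig {
        public static void main(String[] args) throws IOException {
            String[] ipPool = { "10.0.0.11", "10.0.0.12", "10.0.0.13",
                                "10.0.0.14", "10.0.0.15", "10.0.0.16" }; // target cluster
            int replicationFactor = 3; // physical services per logical service

            int node = 0;
            for (int lds = 0; lds < ipPool.length / replicationFactor; lds++) {
                UUID logicalId = UUID.randomUUID(); // stable ID for this logical data service
                for (int replica = 0; replica < replicationFactor; replica++, node++) {
                    String ip = ipPool[node];
                    try (FileWriter w = new FileWriter("bigdata-" + ip + ".config")) {
                        w.write("host = " + ip + "\n");
                        w.write("serviceType = DataService\n");
                        w.write("logicalServiceId = " + logicalId + "\n");
                        w.write("physicalServiceRank = " + replica + "\n");
                    }
                    // A deployment tool (puppet, scp, ...) then pushes each file to its node.
                }
            }
        }
    }

Every physical-to-logical assignment is thereby fixed before any service starts, which is the "static binding" case discussed below.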
|
From: Fred O. <fko...@gm...> - 2010-08-02 22:39:26
|
Bryan, > You are positing a new service which handles the binding of the available physical services to the required logical services. How do you plan to make that logical to physical binding service highly available? It seems to me that this centralizes an aspect of the distributed decision making which is currently performed by the ServicesManagerServer. If you make this binding service highly available, will you have recreated a distributed decision making service? It would of course be a simpler service since it does not handle service start decisions, but only service bind decisions. The new (simple, stateless) service I proposed in passing is useful only when new unbound (physical) data services are added to the HA cluster. Once new physical data services are bound into logical data services, this new service has no further useful function and can be shut down. It does not need to be highly available. > Clients in this context refers to (a) ClientServices participating in a bulk load; (b) nodes providing SPARQL end points; and (c) the DataServices themselves, which use the same discovery mechanisms to locate other DataServices when moving shards or executing pipeline joins. The hosts which originate SPARQL requests are HTTP clients, but they are not bigdata federation clients and do not have any visibility into zookeeper, jini, or the bigdata services. > Given that "clients" are themselves bigdata services and that Zookeeper scales nicely to 1000s of nodes, why have clients go directly to the services rather than watching the quorum state in zookeeper? Removing one of two (mostly) redundant discovery mechanisms reduces code complexity. The River service discovery manager, already in use, scales even better to 1000s of nodes because persistence isn't needed and the redundant copies don't need to cooperate. After discovery, why not go directly to the service to ask for state if necessary? It is the same cost as going to zookeeper, right? The benefit is that going direct to the service involves an easily documented, testable, maintainable interface, and any state being persisted is persisted by the service. Other than general service startup and HA group startup, what information is passed through zookeeper? Fred |
|
From: Bryan T. <br...@sy...> - 2010-08-03 00:07:18
|
Fred, > After discovery, why not go directly to the service to ask for state if necessary? It is the same cost as going to zookeeper, right? The benefit is that going direct to the service involves an easily documented, testable, maintainable interface, and any state being persisted is persisted by the service. No, it is not the same cost. Zookeeper is a trigger based mechanism. Clients register a watch (aka trigger) and are notified if the watched condition is satisfied. It depends on how you set things up of course, but polling a service to discover whether it is the leader can incur more cost and can be broken because Jini/River is not providing atomic decision making about leader elections. Concerning a stateless service to handle binding, I would question whether it could make the binding decisions correctly in the case of a network partition. You might wind up with two such instances making their decisions independently, at which point all bets are off. I think that my suggestion of generating and deploying the configuration files automatically removes everything associated with rules based configuration versus instance based configuration. You can not safely step beyond that point and erode the core functionality of zookeeper for distributed synchronous decision making dynamics without proposing a suitable replacement. Jini/River does not handle this, which is why Zookeeper is part of the services architecture. Bryan |
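For reference, the standard ZooKeeper election recipe that provides the atomicity referred to above looks roughly like the following sketch (not necessarily how bigdata's quorum code does it): each candidate creates an ephemeral sequential znode under an election zpath, the lowest sequence number is the leader, and the others watch their immediate predecessor instead of polling.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative election step; error handling and watch re-registration omitted.
    public class LeaderElection {
        public static boolean tryBecomeLeader(ZooKeeper zk, String electionPath, byte[] serviceId)
                throws Exception {
            // Ephemeral: the znode (and thus the candidacy) disappears if this service dies.
            // Sequential: ZooKeeper appends a monotonically increasing counter, atomically.
            String myZNode = zk.create(electionPath + "/candidate-", serviceId,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            List<String> candidates = zk.getChildren(electionPath, false);
            Collections.sort(candidates); // lexical sort == sequence order for a fixed prefix

            String myName = myZNode.substring(myZNode.lastIndexOf('/') + 1);
            // Lowest sequence number wins; non-leaders should set a watch on their
            // immediate predecessor so they are notified instead of polling.
            return candidates.get(0).equals(myName);
        }
    }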
|
From: Bryan T. <br...@sy...> - 2010-08-03 13:13:11
|
Fred, I also do not see how this service can be stateless. It needs to know how many instances of the logical services exist, how many need to be created, and it needs to bind physical instances to logical instances in a manner which protects against network partitions. That assignment really needs to be either static or governed by zookeeper in order to ensure that we do not assign too many instances to a quorum. If you have more than the target number of instances for a quorum, then you have broken the simple majority semantics. Handling this becomes the same as the problem of handling hot spare allocation, which is more than I think you want to get into right now. So, I think that we need to pair your proposal for instance level configuration with static configuration and static binding of physical services to logical instances. Bryan |
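One way (a sketch only, not a description of the existing code) that zookeeper can "govern" the binding so a quorum never exceeds its target replication factor k: every candidate joins under the logical service's zpath with an ephemeral sequential znode, and only the k lowest sequence numbers count as bound members; later joiners remain spares.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative membership check: am I one of the k bound members of this logical service?
    public class QuorumBinding {
        public static boolean joinAndCheckBound(ZooKeeper zk, String membersPath,
                byte[] physicalServiceId, int k) throws Exception {
            String me = zk.create(membersPath + "/member-", physicalServiceId,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            List<String> members = zk.getChildren(membersPath, false);
            Collections.sort(members); // sequence order is assigned atomically by ZooKeeper

            String myName = me.substring(me.lastIndexOf('/') + 1);
            int rank = members.indexOf(myName);
            return rank < k; // the first k are the quorum; the rest are spares, not members
        }
    }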
|
From: Fred O. <fko...@gm...> - 2010-08-03 18:11:22
|
Bryan, > No, it is not the same cost. Zookeeper is a trigger based mechanism. Clients register a watch (aka trigger) and are notified if the watched condition is satisfied. It depends on how you set things up of course, but polling a service to discover whether it is the leader can incur more cost and can be broken because Jini/River is not providing atomic decision making about leader elections. I don't see the difference. The River lookup service has a better event system, and its clients will receive notification of changes of service entries (including leadership changes if services place that information there). I agree that polling is not reasonable. The election of a leader may be (mostly) atomic, but the distribution of the result of the election is asynchronous in either case. > Concerning a stateless service to handle binding, I would question whether it could make the binding decisions correctly in the case of a network partition. You might wind up with two such instances making their decisions independently, at which point all bets are off. I expect that only one such service instance need exist in the cluster and only for a short time (until physical services are bound to logical services). It is probably desirable to wait for the partition to be healed in order to bind physical services on opposite sides of the fault which caused the partition. The delay in bringing a new logical service into an existing cluster will not interfere with the operation of the cluster in the short term. > I think that my suggestion of generating and deploying the configuration files automatically removes everything associated with rules based configuration versus instance based configuration. You can not safely step beyond that point and erode the core functionality of zookeeper for distributed synchronous decision making dynamics without proposing a suitable replacement. Jini/River does not handle this, which is why Zookeeper is part of the services architecture. I don't propose replacing zookeeper for making the group leadership decisions. Beyond that, removing the "distributed synchronous decision making dynamics" where it is not absolutely needed is the significant simplification I propose. Removing synchronous behavior from asynchronous processes makes the code easier to understand, reason about, write, test, and maintain. Generating and deploying individual configuration files is a plan, but that does not disconnect JiniFederation (which is still needed) from its use of zookeeper (which is not needed in the instance-based configuration). Other than the leadership election mechanism (which can be limited to the group from whom the leader is elected) and the rules-based startup mechanism, what is it that River does not handle? Fred |
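A sketch of the River event path being described: a LookupCache pushes serviceAdded/serviceChanged/serviceRemoved events to a listener whenever a matching registration or its attributes change, so no polling is involved. Everything except the net.jini classes is assumed for illustration.

    import net.jini.core.lookup.ServiceTemplate;
    import net.jini.lookup.LookupCache;
    import net.jini.lookup.ServiceDiscoveryEvent;
    import net.jini.lookup.ServiceDiscoveryListener;
    import net.jini.lookup.ServiceDiscoveryManager;

    // Listens for changes to service registrations, including attribute (entry) changes
    // such as a service marking itself as leader.
    public class LeadershipListener implements ServiceDiscoveryListener {
        public void serviceAdded(ServiceDiscoveryEvent e) {
            System.out.println("service appeared: " + e.getPostEventServiceItem());
        }
        public void serviceChanged(ServiceDiscoveryEvent e) {
            // Fired when a registration's attributes change, e.g. a leadership entry is updated.
            System.out.println("attributes changed: " + e.getPostEventServiceItem());
        }
        public void serviceRemoved(ServiceDiscoveryEvent e) {
            System.out.println("service gone: " + e.getPreEventServiceItem());
        }

        // 'sdm' is an already-configured ServiceDiscoveryManager; 'tmpl' matches the services
        // of interest (e.g. data services carrying a hypothetical logical-service entry).
        public static LookupCache watch(ServiceDiscoveryManager sdm, ServiceTemplate tmpl)
                throws Exception {
            return sdm.createLookupCache(tmpl, null /* no filter */, new LeadershipListener());
        }
    }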
|
From: Bryan T. <br...@sy...> - 2010-08-03 19:48:23
|
Fred, I really do not want to get into the Jini/Zookeeper thing. Both are great frameworks. They do overlap a bit, but they are being used in ways which emphasize their respective strengths. I think that the runtime binding approach you are outlining may cause problems with the quorum semantics and definitely will cause problems with encapsulating instance based versus rules based functionality within the architecture. However, static binding will reduce risk above the basic instance based configuration you have advocated and reflects a separation of concerns between static instance configuration (or dynamic rules based configuration) on the one hand and quorum dynamics on the other hand. I am not willing to go the additional step to dynamic binding since I believe that it could undermine the quorum dynamics and it will make it difficult to encapsulate the rules based versus instance based mechanisms. I think that this separation of concerns is a reasonable compromise in the architecture and I think that it gives you the freedom you are asking for from the "complexity" of the rules based configuration mechanisms. At this point, I would like to get beyond this discussion and focus on completing the HA milestone. Thanks, Bryan |
|
From: Fred O. <fko...@gm...> - 2010-08-03 18:32:26
|
Bryan, > I also do not see how this service can be stateless. It needs to know how many instances of the logical services exist, how many need to be created, and it needs to bind physical instances to logical instances in a manner which protects against network partitions. That assignment really needs to be either static or governed by zookeeper in order to ensure that we do not assign too many instances to a quorum. The number of physical instances per logical instance would be a configuration parameter. The number of logical services to be created is the number of unbound physical services discovered divided by that parameter (provided the rack/network awareness, etc. constraints are met). For a group size of N, collect N compatible unbound physical services to be a group, make a list of the ids, and send that list to each group member along with the logical id in a "bind" command. The newly bound services re-register with the new information. I don't see where any question of "too many" comes up. Fred |
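A sketch of that grouping step, with the bind command and the view of an unbound service invented for illustration (rack/network-awareness constraints omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    // Partition the discovered, unbound physical services into groups of size N and
    // tell each member which logical service it now belongs to and who its peers are.
    public class Binder {

        // Hypothetical minimal view of an unbound physical service.
        public interface UnboundService {
            UUID getPhysicalId();
            void bind(UUID logicalId, List<UUID> groupMemberIds); // the "bind" command
        }

        public static void bindAll(List<UnboundService> unbound, int n) {
            // Only whole groups are formed; leftovers stay unbound until more services appear.
            for (int start = 0; start + n <= unbound.size(); start += n) {
                List<UnboundService> group = unbound.subList(start, start + n);
                UUID logicalId = UUID.randomUUID();

                List<UUID> memberIds = new ArrayList<UUID>();
                for (UnboundService s : group) {
                    memberIds.add(s.getPhysicalId());
                }
                for (UnboundService s : group) {
                    s.bind(logicalId, memberIds); // members then re-register with the logical ID
                }
            }
        }
    }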