Re: [Zookeeper-user] [Q] Why doesn't ZooKeeper provide global consistency?


We added the sync operation exactly for the OOB communication scenario you 
bring up. If client A receives a communication from client B and then wants to 
query something from ZooKeeper that might be causally related to the OOB 
communication, A just has to do an asynchronous sync() operation (that sounds 
funny huh :) before it does any reads. By doing the sync(), A will see any 
updates completed before the OOB communication was received.
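
The sync-before-read pattern can be sketched with a toy model. This is not
the ZooKeeper client API: the class SyncBeforeReadSketch, its Follower type
and the shared log list are made up for illustration, and sync() here is a
plain blocking call. In the real Java client the call is asynchronous,
ZooKeeper.sync(path, callback, ctx), and the read would be issued after it
on the same session.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical names): one committed log, one lagging follower.
class SyncBeforeReadSketch {

    static class Follower {
        private final List<String> log;  // updates in the leader's total order
        private int applied = 0;         // how many entries this server has applied
        private String value = null;     // current value of the single znode

        Follower(List<String> log) { this.log = log; }

        // Local read: answered from this server's state, so it may be stale.
        String read() { return value; }

        // Models sync(): apply everything committed up to the moment of the call.
        void sync() {
            while (applied < log.size()) {
                value = log.get(applied++);
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Follower serverOfA = new Follower(log);

        log.add("V1");
        serverOfA.sync();   // A's server has caught up to V1
        log.add("V2");      // B's update completes; A's server has not applied it

        // B now tells A out of band that V2 exists.
        System.out.println("read without sync: " + serverOfA.read());
        serverOfA.sync();
        System.out.println("read after sync:   " + serverOfA.read());
    }
}
```

The first read returns the older value because it is served locally; the
read issued after sync() cannot miss an update that completed before the
sync.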

ben

On Tuesday 15 January 2008 08:38:26 Jacob Levy wrote:
> Flavio and Ben
>
> Thanks for the extensive answer.
>
> This is very good -- it is nearly perfect. The only problem arises if
> clients communicate OOB about values they read from ZK: they can then be
> out of sync, because one client may have seen Vn while the other can see
> only Vn-1.
>
> But I suspect that is easy to detect by comparing zkid's for the
> reads.
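
A minimal sketch of that check, assuming "zkid" refers to the transaction
id that ZooKeeper attaches to each update (zxid; the real Java client
exposes the id of a znode's last modification as Stat.getMzxid()). The
Versioned type and staleSide helper are hypothetical, and the comparison
only makes sense for two reads of the same znode.

```java
// Hypothetical helper types; not part of the ZooKeeper client.
class ZxidCompareSketch {

    // What a client keeps from a read: the value and the id of the update
    // that last modified the znode.
    static final class Versioned {
        final String value;
        final long zxid;

        Versioned(String value, long zxid) {
            this.value = value;
            this.zxid = zxid;
        }
    }

    // Reports which of two reads of the same znode is behind, if any.
    static String staleSide(Versioned first, Versioned second) {
        int c = Long.compare(first.zxid, second.zxid);
        if (c == 0) {
            return "neither";
        }
        return c < 0 ? "first" : "second";
    }

    public static void main(String[] args) {
        Versioned seenByA = new Versioned("Vn-1", 41L); // A read the older value
        Versioned seenByB = new Versioned("Vn", 42L);   // B read the newer value
        System.out.println("stale side: " + staleSide(seenByA, seenByB));
    }
}
```

Because updates are totally ordered, a smaller id means the read reflects
an earlier state, so the client holding it knows it should sync() and read
again.
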
>
> To put this in perspective, this is the stuff that made previous
> attempts to implement this technology (such as Isis and Horus)
> non-scalable. The difference here (in ZK) is that there's a good
> separation of concerns -- you can be a client and not have to
> participate in the global consistency protocol, it's someone else's (the
> ZK servers) job.
>
> I'm curious how well this strong guarantee of ordering writes will work
> in distributed ZK, when clusters of ZK servers are federated. I suspect
> there will be limitations and scaling will not be as good. But that's
> life :)
>
> --Jacob
>
> -----Original Message-----
> From: Benjamin Reed [mailto:br...@ya...]
> Sent: Tuesday, January 15, 2008 7:49 AM
> To: Flavio Junqueira
> Cc: zoo...@li...
> Subject: Re: [Q] Why doesn't ZooKeeper provide global consistency?
>
> [** I'm moving this discussion to the sourceforge mailing list, because
> this is really a good piece of general information. **]
>
> Just in case there are some that did not follow this description, here
> are the main points:
>
> 1) ZooKeeper provides a total order of all updates: all clients DO see
> the same order of changes for all znodes, not just the same order of
> changes on a single znode.
> 2) ZooKeeper provides a precedence order for all updates and sync
> operations: any update or sync that completed before the client's update
> or sync operation completes will be visible to the client before the
> client's update or sync operation completes.
> 3) ZooKeeper does not provide precedence order for read operations: one
> client's update operation may complete on one ZooKeeper server and
> another client's read operation may later complete on a different
> ZooKeeper server and not see the update. If this kind of ordering is
> needed for reads, issue a sync() asynchronously before the read(). The
> read() will then have precedence order with respect to the sync().
>
> Bottom line: all update operations will be ordered and will see the
> result of earlier operations in "real-time". Reads may see old data. If
> you want to be sure that a read sees the very latest data, issue a sync()
> before the read().
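
To illustrate the difference between "old" and "out of order", here is a
toy sketch (hypothetical Replica class, not ZooKeeper code) in which two
servers apply the same log at different speeds. The slower one is behind,
but what it has seen is always a prefix of what the faster one has seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical names): replicas applying one totally ordered log.
class TotalOrderSketch {

    static class Replica {
        private final List<String> log;                        // shared total order
        private final List<String> observed = new ArrayList<>();

        Replica(List<String> log) { this.log = log; }

        // Apply up to n further entries, strictly in log order.
        int catchUp(int n) {
            int done = 0;
            while (done < n && observed.size() < log.size()) {
                observed.add(log.get(observed.size()));
                done++;
            }
            return done;
        }

        List<String> history() { return observed; }
    }

    // True if a is a prefix of b.
    static boolean isPrefix(List<String> a, List<String> b) {
        return a.size() <= b.size() && b.subList(0, a.size()).equals(a);
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("/a=1", "/b=1", "/a=2", "/b=2");
        Replica fast = new Replica(log);
        Replica slow = new Replica(log);
        fast.catchUp(4);
        slow.catchUp(2);

        System.out.println("fast saw " + fast.history());
        System.out.println("slow saw " + slow.history());
        System.out.println("slow is a prefix of fast: "
                + isPrefix(slow.history(), fast.history()));
    }
}
```

A client reading from the slow replica sees older data for /a and /b, yet
never a state in which /b=2 is visible without /a=2.
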
>
> ben
>
> On Tuesday 15 January 2008 00:33:51 Flavio Junqueira wrote:
> > Jacob, ZooKeeper provides total order of updates, meaning that all servers
> > execute the same set of updates in the same order. One property that
> > ZooKeeper doesn't provide is precedence order because read operations are
> > not ordered along with write operations using the atomic broadcast
> > protocol. A system that satisfies precedence order is one in which the
> > state left by a write that finishes before some read operation starts
> > (before here is relative to some global time reference; think of it as
> > wallclock time) must be observable by the read operation.
> >
> >
> >
> > Just to give you an example, suppose that we have two clients, C1 and C2.
> > C1 submits two write operations and C2 one read operation that starts after
> > w2 has finished. Suppose that all operations are to the same zk node.
> >
> >
> >
> > C1    |---w1---|   |----w2----|
> >
> > C2                                  |----r1----|
> >
> >
> >
> > With precedence order, r1 has to return the value left by w2. Without
> > precedence order, r1 can return either the value left by w1 or the value
> > left by w2.
> >
> >
> >
> > The reason why it is like this is performance. If there is a large fraction
> > of read operations, then throughput is much higher when ignoring the
> > precedence order for reads. We get better performance by having read
> > operations executing in one server instead of being ordered along with
> > writes. Because there could be pending updates that a server hasn't seen,
> > although they have completed according to the atomic broadcast protocol,
> > reads can return slightly older values as in the example.
> >
> >
> >
> > Now, this is true if you use the regular zk operations. We have implemented
> > an asynchronous primitive called "sync()" that flushes the channel between
> > a follower and the current leader. If you execute sync() followed by r1, as
> > in the example above, then you must see the result of w2 and not the one of
> > w1.
> >
> >
> >
> > Is it clear?
> >
> >
> >
> > -Flavio
> >
> >
> >
> >
> >
> >   _____
> >
> > From: Jacob Levy [mailto:jy...@ya...]
> > Sent: Tuesday, January 15, 2008 5:59 AM
> > Subject: [Q] Why doesn't ZooKeeper provide global consistency?
> >
> >
> >
> > In discussions today with Michael and Bart, I tried to figure out why
> > ZooKeeper (if I understood right) does not provide global consistency. By
> > global consistency, I mean that all clients see the same order of changes
> > for all znodes, not just the same order of changes on a single znode.
> >
> >
> >
> > It seems this could be relatively efficiently provided (perhaps as an
> > optional mode) by ZooKeeper, since there's a single writer anyways, so in
> > fact there *is* a global ordering of all mutator events.
> >
> >
> >
> > Can someone explain why (if I understood correctly, again) ZooKeeper does
> > not provide this stronger consistency? Is it too costly, or do we lose some
> > opportunities for parallelism, or is there some other reason?
> >
> >
> >
> > If global consistency were available, we could write many more powerful
> > algorithms using ZooKeeper, and some things like implementing 2-phase
> > commit become simpler.
> >
> >
> >
> > Thanks!
> >
> >
> >
> > --Jacob