Re: [Zookeeper-user] [Q] Why doesn't ZooKeeper provide global consistency?

Flavio and Ben

Thanks for the extensive answer.

This is very good -- it is nearly perfect. The only problem arises when
clients communicate out of band (OOB) about values they read from ZK:
they can be out of sync, because one client may have seen Vn while the
other can still see only Vn-1.

But I suspect that is easy to detect by comparing the zxids of the
reads.
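
A rough sketch of that check, assuming each read hands back, along with
the value, the id of the transaction that last modified the node (in the
ZooKeeper Java API the Stat structure carries such a field, mzxid). The
names ReadResult and is_behind are made up for illustration and are not
part of any ZooKeeper API:

```python
from typing import NamedTuple


class ReadResult(NamedTuple):
    """Hypothetical record of one read: the value plus the id of the
    transaction that last modified the node, as seen by that read."""
    value: str
    zxid: int


def is_behind(mine: ReadResult, theirs: ReadResult) -> bool:
    """True if the peer's read reflects a later update than mine."""
    return theirs.zxid > mine.zxid


# Two clients read the same znode from different servers and then
# compare notes out of band. The zxid values are arbitrary.
c1 = ReadResult(value="Vn", zxid=42)
c2 = ReadResult(value="Vn-1", zxid=41)

c2_is_stale = is_behind(c2, c1)   # c2 learns that it read an older version
c1_is_stale = is_behind(c1, c2)
```

A client that finds itself behind could then issue a sync() and read
again before acting on the value.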

To put this in perspective, this is the stuff that made previous
attempts to implement this technology (such as Isis and Horus)
non-scalable. The difference here (in ZK) is that there's a good
separation of concerns -- you can be a client and not have to
participate in the global consistency protocol; that is someone else's
job (the ZK servers').

I'm curious how well this strong guarantee of ordering writes will work
in distributed ZK, when clusters of ZK servers are federated. I suspect
there will be limitations and scaling will not be as good. But that's
life :)

--Jacob

-----Original Message-----
From: Benjamin Reed [mailto:br...@ya...]
Sent: Tuesday, January 15, 2008 7:49 AM
To: Flavio Junqueira
Cc: zoo...@li...
Subject: Re: [Q] Why doesn't ZooKeeper provide global consistency?

[** I'm moving this discussion to the sourceforge mailing list, because
this is really a good piece of general information. **]

Just in case there are some that did not follow this description, here
are the main points:

1) ZooKeeper provides a total order of all updates: all clients DO see
the same order of changes for all znodes, not just the same order of
changes on a single znode.
2) ZooKeeper provides a precedence order for all updates and sync
operations: any update or sync that completed before the client's update
or sync operation completes will be visible to the client before the
client's update or sync operation completes.
3) ZooKeeper does not provide precedence order for read operations: one
client's update operation may complete on one ZooKeeper server and
another client's read operation may later complete on a different
ZooKeeper server and not see the update. If this kind of ordering is
needed for reads, issue a sync() asynchronously before the read(). The
read() will then have precedence order with respect to the sync().

Bottom line: all update operations will be ordered and will see the
result of earlier operations in "real-time". Reads may see old data. If
you want to be sure that a read sees the very latest data issue a sync()
before the read().
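
To make the stale read and the effect of sync() concrete, here is a toy
model. It is not ZooKeeper code: Leader, Follower, commit, catch_up and
the in-memory log are all invented for the sketch, and the real protocol
(quorums, atomic broadcast, sessions) is left out. It only shows a
follower serving a read from local state that lags the leader, and a
sync-like step bringing it up to date first.

```python
class Leader:
    """Toy leader: puts every update into one total order (the zxid)."""

    def __init__(self):
        self.log = []  # committed updates, in zxid order

    def commit(self, path, value):
        zxid = len(self.log) + 1
        self.log.append((zxid, path, value))
        return zxid


class Follower:
    """Toy follower: serves reads locally, so it may lag the leader."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # zxid of the last update applied locally
        self.data = {}

    def catch_up(self):
        # Apply, in order, every committed update not yet applied here.
        for zxid, path, value in self.leader.log[self.applied:]:
            self.data[path] = value
            self.applied = zxid

    def read(self, path):
        # Local read: not ordered with respect to writes.
        return self.data.get(path)

    def sync(self):
        # Stand-in for sync(): flush the channel from the leader.
        self.catch_up()


leader = Leader()
fast = Follower(leader)
slow = Follower(leader)

leader.commit("/x", "v1")  # w1
fast.catch_up()
slow.catch_up()

leader.commit("/x", "v2")  # w2 reaches `fast` but not yet `slow`
fast.catch_up()

stale = slow.read("/x")    # still the value left by w1
slow.sync()
fresh = slow.read("/x")    # after the sync, the value left by w2
```

In this model the read on the lagging follower returns the older value
until the sync step runs, which is the behaviour described in point 3.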

ben

On Tuesday 15 January 2008 00:33:51 Flavio Junqueira wrote:
> Jacob, ZooKeeper provides total order of updates, meaning that all
> servers execute the same set of updates in the same order. One property
> that ZooKeeper doesn't provide is precedence order because read
> operations are not ordered along with write operations using the atomic
> broadcast protocol. A system that satisfies precedence order is one in
> which the state left by a write that finishes before some read operation
> starts (before here is relative to some global time reference; think of
> it as wallclock time) must be observable by the read operation.
>
>
>
> Just to give you an example, suppose that we have two clients, C1 and
> C2. C1 submits two write operations and C2 one read operation that
> starts after w2 has finished. Suppose that all operations are to the
> same zk node.
>
>
>
> C1    |---w1---|   |----w2----|
>
> C2                                  |----r1----|
>
>
>
> With precedence order, r1 has to return the value left by w2. Without
> precedence order, r1 can return either the value left by w1 or the
> value left by w2.
>
>
>
> The reason why it is like this is performance. If there is a large
> fraction of read operations, then throughput is much higher when
> ignoring the precedence order for reads. We get better performance by
> having read operations executing in one server instead of being ordered
> along with writes. Because there could be pending updates that a server
> hasn't seen, although they have completed according to the atomic
> broadcast protocol, reads can return slightly older values as in the
> example.
>
>
>
> Now, this is true if you use the regular zk operations. We have
> implemented an asynchronous primitive called "sync()" that flushes the
> channel between a follower and the current leader. If you execute sync()
> followed by r1, as in the example above, then you must see the result of
> w2 and not the one of w1.
>
>
>
> Is it clear?
>
>
>
> -Flavio
>
>
>
>
>
>   _____
>
> From: Jacob Levy [mailto:jy...@ya...]
> Sent: Tuesday, January 15, 2008 5:59 AM
> Subject: [Q] Why doesn't ZooKeeper provide global consistency?
>
>
>
> In discussions today with Michael and Bart, I tried to figure out why
> ZooKeeper (if I understood right) does not provide global consistency.
> By global consistency, I mean that all clients see the same order of
> changes for all znodes, not just the same order of changes on a single
> znode.
>
>
>
> It seems this could be relatively efficiently provided (perhaps as an
> optional mode) by ZooKeeper, since there's a single writer anyways, so
> in fact there *is* a global ordering of all mutator events.
>
>
>
> Can someone explain why (if I understood correctly, again) ZooKeeper
> does not provide this stronger consistency? Is it too costly, or do we
> lose some opportunities for parallelism, or is there some other reason?
>
>
>
> If global consistency were available, we could write many more powerful
> algorithms using ZooKeeper, and some things like implementing 2-phase
> commit become simpler.
>
>
>
> Thanks!
>
>
>
> --Jacob