From: Jacob L. <jy...@ya...> - 2008-01-15 16:39:48
Flavio and Ben,

Thanks for the extensive answer. This is very good -- it is nearly perfect. The only problem arises if clients communicate OOB about values they read from ZK: they can then be out of sync, because one client may have seen Vn while the other can see only Vn-1. But I suspect that is easy to detect by comparing zxids for the reads.

To put this in perspective, this is the stuff that made previous attempts to implement this technology (such as Isis and Horus) non-scalable. The difference here (in ZK) is that there's a good separation of concerns -- you can be a client and not have to participate in the global consistency protocol; it's someone else's (the ZK servers') job.

I'm curious how well this strong guarantee of ordering writes will work in distributed ZK, when clusters of ZK servers are federated. I suspect there will be limitations and scaling will not be as good. But that's life :)

--Jacob

-----Original Message-----
From: Benjamin Reed [mailto:br...@ya...]
Sent: Tuesday, January 15, 2008 7:49 AM
To: Flavio Junqueira
Cc: zoo...@li...
Subject: Re: [Q] Why doesn't ZooKeeper provide global consistency?

[** I'm moving this discussion to the sourceforge mailing list, because this is really a good piece of general information. **]

Just in case there are some that did not follow this description, here are the main points:

1) ZooKeeper provides a total order of all updates: all clients DO see the same order of changes for all znodes, not just the same order of changes on a single znode.
2) ZooKeeper provides a precedence order for all updates and sync operations: any update or sync that completed before the client's update or sync operation completes will be visible to the client before the client's update or sync operation completes.
3) ZooKeeper does not provide precedence order for read operations: one client's update operation may complete on one ZooKeeper server and another client's read operation may later complete on a different ZooKeeper server and not see the update. If this kind of ordering is needed for reads, issue a sync() asynchronously before the read(). The read() will then have precedence order with respect to the sync().

Bottom line: all update operations will be ordered and will see the result of earlier operations in "real-time". Reads may see old data. If you want to be sure that a read sees the very latest data, issue a sync() before the read().

ben

On Tuesday 15 January 2008 00:33:51 Flavio Junqueira wrote:
> Jacob, ZooKeeper provides total order of updates, meaning that all servers
> execute the same set of updates in the same order. One property that
> ZooKeeper doesn't provide is precedence order because read operations are
> not ordered along with write operations using the atomic broadcast
> protocol. A system that satisfies precedence order is one in which the
> state left by a write that finishes before some read operation starts
> (before here is relative to some global time reference; think of it as
> wallclock time) must be observable by the read operation.
>
> Just to give you an example, suppose that we have two clients, C1 and C2.
> C1 submits two write operations and C2 one read operation that starts after
> w2 has finished. Suppose that all operations are to the same zk node.
>
> C1 |---w1---|  |----w2----|
> C2                          |----r1----|
>
> With precedence order, r1 has to return the value left by w2. Without
> precedence order, r1 can return either the value left by w1 or the value
> left by w2.
>
> The reason why it is like this is performance. If there is a large fraction
> of read operations, then throughput is much higher when ignoring the
> precedence order for reads.
> We get better performance by having read
> operations executing in one server instead of being ordered along with
> writes. Because there could be pending updates that a server hasn't seen,
> although they have completed according to the atomic broadcast protocol,
> reads can return slightly older values as in the example.
>
> Now, this is true if you use the regular zk operations. We have implemented
> an asynchronous primitive called "sync()" that flushes the channel between
> a follower and the current leader. If you execute sync() followed by r1, as
> in the example above, then you must see the result of w2 and not the one of
> w1.
>
> Is it clear?
>
> -Flavio
>
> _____
>
> From: Jacob Levy [mailto:jy...@ya...]
> Sent: Tuesday, January 15, 2008 5:59 AM
> Subject: [Q] Why doesn't ZooKeeper provide global consistency?
>
> In discussions today with Michael and Bart, I tried to figure out why
> ZooKeeper (if I understood right) does not provide global consistency. By
> global consistency, I mean that all clients see the same order of changes
> for all znodes, not just the same order of changes on a single znode.
>
> It seems this could be relatively efficiently provided (perhaps as an
> optional mode) by ZooKeeper, since there's a single writer anyways, so in
> fact there *is* a global ordering of all mutator events.
>
> Can someone explain why (if I understood correctly, again) ZooKeeper does
> not provide this stronger consistency? Is it too costly, or do we lose some
> opportunities for parallelism, or is there some other reason?
>
> If global consistency were available, we could write many more powerful
> algorithms using ZooKeeper, and some things like implementing 2-phase
> commit become simpler.
>
> Thanks!
>
> --Jacob
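
For readers who want to see the C1/C2 example from this thread in action, here is a minimal toy simulation. It is not the ZooKeeper client API. The classes `Leader` and `Follower` and the methods `write`, `catch_up`, `sync` and `read` are hypothetical names invented for illustration. It also simplifies: writes go straight to the leader, and `sync()` is modelled as a blocking catch-up rather than an asynchronous flush of the follower-to-leader channel. It shows a read on a lagging server returning the value left by w1, the same read after `sync()` returning the value left by w2, and two clients comparing the transaction ids of their reads to detect that one of them is behind.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Toy leader: totally orders writes by assigning increasing ids."""
    log: list = field(default_factory=list)  # committed (zxid, path, value)

    def write(self, path, value):
        zxid = len(self.log) + 1  # assumption: ids are simply 1, 2, 3, ...
        self.log.append((zxid, path, value))
        return zxid


@dataclass
class Follower:
    """Toy server: answers reads locally from the log prefix it has applied."""
    leader: Leader
    applied: int = 0                      # id of the last update applied here
    data: dict = field(default_factory=dict)

    def catch_up(self, upto=None):
        # Apply committed updates in log order, optionally only a prefix.
        upto = len(self.leader.log) if upto is None else upto
        for zxid, path, value in self.leader.log[self.applied:upto]:
            self.data[path] = value
            self.applied = zxid

    def sync(self):
        # Modelled as: apply everything the leader has committed so far.
        self.catch_up()

    def read(self, path):
        # Local read, never ordered with writes; also report how far we are.
        return self.data.get(path), self.applied


leader = Leader()
s1 = Follower(leader)  # server that C1 is connected to
s2 = Follower(leader)  # server that C2 is connected to

leader.write("/x", "v1")            # w1
s1.catch_up()
s2.catch_up()
leader.write("/x", "v2")            # w2 completes...
s1.catch_up()                       # ...but s2 has not applied it yet

stale_value, stale_zxid = s2.read("/x")   # r1 without sync: may be old
c1_value, c1_zxid = s1.read("/x")

# Out-of-band check: the client with the smaller id knows it is behind.
c2_is_behind = stale_zxid < c1_zxid

s2.sync()                                 # sync() before the read
fresh_value, fresh_zxid = s2.read("/x")
```

In this sketch the first read on `s2` returns the value left by w1 and the read after `sync()` returns the value left by w2, which is the behaviour described in the thread. The id comparison is only a way for two clients to notice that they disagree; the lagging client still has to sync and re-read to catch up.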