From: Benjamin R. <br...@ya...> - 2008-01-15 17:04:51
We added the sync operation exactly for the OOB communication scenario you
bring up. If client A receives a communication from client B and then wants
to query something from ZooKeeper that might be causally related to the OOB
communication, A just has to do an asynchronous sync() operation (that sounds
funny huh :) before it does any reads. By doing the sync(), A will see any
updates completed before the OOB communication was received.

ben

On Tuesday 15 January 2008 08:38:26 Jacob Levy wrote:
> Flavio and Ben
>
> Thanks for the extensive answer.
>
> This is very good -- it is nearly perfect. The only problem arises if
> clients communicate OOB about values they read from ZK, then they can be
> out of sync, because one client may have seen Vn while the other can see
> only Vn-1.
>
> But I suspect that is easy to detect with comparing zkid's for the
> reads.
>
> To put this in perspective, this is the stuff that made previous
> attempts to implement this technology (such as Isis and Horus)
> non-scalable. The difference here (in ZK) is that there's a good
> separation of concerns -- you can be a client and not have to
> participate in the global consistency protocol, it's someone else's (the
> ZK servers) job.
>
> I'm curious how well this strong guarantee of ordering writes will work
> in distributed ZK, when clusters of ZK servers are federated. I suspect
> there will be limitations and scaling will not be as good. But that's
> life :)
>
> --Jacob
>
> -----Original Message-----
> From: Benjamin Reed [mailto:br...@ya...]
> Sent: Tuesday, January 15, 2008 7:49 AM
> To: Flavio Junqueira
> Cc: zoo...@li...
> Subject: Re: [Q] Why doesn't ZooKeeper provide global consistency?
>
> [** I'm moving this discussion to the sourceforge mailing list, because
> this is really a good piece of general information.
> **]
>
> Just in case there are some that did not follow this description, here
> are the main points:
>
> 1) ZooKeeper provides a total order of all updates: all clients DO see
>    the same order of changes for all znodes, not just the same order of
>    changes on a single znode.
> 2) ZooKeeper provides a precedence order for all updates and sync
>    operations: any update or sync that completed before the client's
>    update or sync operation completes will be visible to the client
>    before the client's update or sync operation completes.
> 3) ZooKeeper does not provide precedence order for read operations: one
>    client's update operation may complete on one ZooKeeper server and
>    another client's read operation may later complete on a different
>    ZooKeeper server and not see the update. If this kind of ordering is
>    needed for reads, issue a sync() asynchronously before the read().
>    The read() will then have precedence order with respect to the sync().
>
> Bottom line: all update operations will be ordered and will see the
> result of earlier operations in "real-time". Reads may see old data. If
> you want to be sure that a read sees the very latest data issue a sync()
> before the read().
>
> ben
>
> On Tuesday 15 January 2008 00:33:51 Flavio Junqueira wrote:
> > Jacob, ZooKeeper provides total order of updates, meaning that all
> > servers execute the same set of updates in the same order. One property
> > that ZooKeeper doesn't provide is precedence order because read
> > operations are not ordered along with write operations using the atomic
> > broadcast protocol. A system that satisfies precedence order is one in
> > which the state left by a write that finishes before some read
> > operation starts (before here is relative to some global time
> > reference; think of it as wallclock time) must be observable by the
> > read operation.
> >
> > Just to give you an example, suppose that we have two clients, C1 and
> > C2.
> > C1 submits two write operations and C2 one read operation that starts
> > after w2 has finished. Suppose that all operations are to the same zk
> > node.
> >
> > C1 |---w1---|   |----w2----|
> > C2                            |----r1----|
> >
> > With precedence order, r1 has to return the value left by w2. Without
> > precedence order, r1 can return either the value left by w1 or the
> > value left by w2.
> >
> > The reason why it is like this is performance. If there is a large
> > fraction of read operations, then throughput is much higher when
> > ignoring the precedence order for reads. We get better performance by
> > having read operations executing in one server instead of being ordered
> > along with writes. Because there could be pending updates that a server
> > hasn't seen, although they have completed according to the atomic
> > broadcast protocol, reads can return slightly older values as in the
> > example.
> >
> > Now, this is true if you use the regular zk operations. We have
> > implemented an asynchronous primitive called "sync()" that flushes the
> > channel between a follower and the current leader. If you execute
> > sync() followed by r1, as in the example above, then you must see the
> > result of w2 and not the one of w1.
> >
> > Is it clear?
> >
> > -Flavio
> >
> > _____
> >
> > From: Jacob Levy [mailto:jy...@ya...]
> > Sent: Tuesday, January 15, 2008 5:59 AM
> > Subject: [Q] Why doesn't ZooKeeper provide global consistency?
> >
> > In discussions today with Michael and Bart, I tried to figure out why
> > ZooKeeper (if I understood right) does not provide global consistency.
> > By global consistency, I mean that all clients see the same order of
> > changes for all znodes, not just the same order of changes on a single
> > znode.
> >
> > It seems this could be relatively efficiently provided (perhaps as an
> > optional mode) by ZooKeeper, since there's a single writer anyways, so
> > in fact there *is* a global ordering of all mutator events.
> >
> > Can someone explain why (if I understood correctly, again) ZooKeeper
> > does not provide this stronger consistency? Is it too costly, or do we
> > lose some opportunities for parallelism, or is there some other reason?
> >
> > If global consistency were available, we could write many more powerful
> > algorithms using ZooKeeper, and some things like implementing 2-phase
> > commit become simpler.
> >
> > Thanks!
> >
> > --Jacob
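
The sync()-before-read pattern discussed in this thread can be illustrated
with a small toy model. This is only a sketch under stated assumptions, not
ZooKeeper code: the Leader and Follower classes, the txn_id counter, and the
catch_up() helper are all hypothetical names made up for the illustration,
and sync() here is a plain blocking call, whereas the real primitive is
asynchronous. The model only shows why a read served by a lagging follower
can return the value left by w1, and why a sync() issued first makes the same
read return the value left by w2.

```python
class Leader:
    """Hypothetical leader: keeps one totally ordered log of all updates."""

    def __init__(self):
        self.log = []  # entries are (txn_id, path, value), in commit order

    def write(self, path, value):
        txn_id = len(self.log) + 1
        self.log.append((txn_id, path, value))
        return txn_id


class Follower:
    """Hypothetical follower: serves reads locally and may lag the leader."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0  # number of log entries applied so far
        self.data = {}

    def catch_up(self, upto):
        # Apply log entries in order, up to (and including) entry number `upto`.
        for _txn_id, path, value in self.leader.log[self.applied:upto]:
            self.data[path] = value
        self.applied = max(self.applied, upto)

    def read(self, path):
        # Reads are answered from local state and are not ordered with writes.
        return self.data.get(path)

    def sync(self):
        # Toy stand-in for sync(): apply everything the leader has committed.
        self.catch_up(len(self.leader.log))


def demo():
    leader = Leader()
    follower = Follower(leader)

    leader.write("/cfg", "v1")  # w1
    follower.catch_up(1)        # the follower has seen w1
    leader.write("/cfg", "v2")  # w2 completes, follower has not applied it

    stale = follower.read("/cfg")  # r1 without sync: may return the old value
    follower.sync()
    fresh = follower.read("/cfg")  # r1 after sync: sees the result of w2
    return stale, fresh


if __name__ == "__main__":
    print(demo())
```

In this model every write goes through the single leader log, which mirrors
the total order of updates described above, while reads touch only the
follower's local state, which is why they are cheap but may be stale.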