#7 Unique DB identifiers for servers and clients

service (6)

There should be a persistent unique identifier for an instance of ZooKeeper. Currently, if you bring a cluster down without stopping clients and reinitialize the servers, the servers will start logging client zxid errors because the clients have seen a later transaction than the server has. In reality, the clients should detect that they are now talking to a new instance of the database and close the session.

A similar problem occurs when a server fails in a cluster of three machines, and the other two machines are reinitialized and restarted. If the failed machine starts up again, there is a chance that the old machine may get elected leader (since it will have the highest zxid) and overwrite new data.

A unique random id should probably be generated when a new cluster comes up. (This is easy to detect, since the zxid will be zero.) Leader election and the leader should validate that the peers have the same database id. Clients should also validate that they are talking to servers with the same database id during a session.
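The proposal above can be sketched roughly as follows. This is a hypothetical illustration, not actual ZooKeeper code: the `DbIdValidator` class and its method names are invented for this sketch, which only shows the idea of minting a random id on a fresh database and rejecting peers or clients that present a different one.

```java
import java.security.SecureRandom;

// Hypothetical sketch of the proposed database-id check. A fresh cluster
// (detected by lastZxid == 0) mints a new random id; an existing database
// reuses the id persisted alongside it.
public class DbIdValidator {
    private final long dbId;

    private DbIdValidator(long dbId) {
        this.dbId = dbId;
    }

    public static DbIdValidator forDatabase(long lastZxid, long persistedId) {
        if (lastZxid == 0) {
            // New database instance: generate a fresh random id.
            return new DbIdValidator(new SecureRandom().nextLong());
        }
        return new DbIdValidator(persistedId);
    }

    public long getDbId() {
        return dbId;
    }

    // A peer (during leader election) or a client (during a session) that
    // presents a different id is talking to a different database instance
    // and should close the connection/session.
    public boolean matches(long remoteDbId) {
        return remoteDbId == dbId;
    }
}
```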


  • Benjamin Reed

    Benjamin Reed - 2008-05-12
    • milestone: --> 3.0.0
    • assigned_to: nobody --> akornev
  • Benjamin Reed

    Benjamin Reed - 2008-05-12
    • assigned_to: akornev --> phunt
  • fpj

    fpj - 2008-05-29

    File Added: patch-srv-chk.txt

  • Nobody/Anonymous

Ben, Pat, and I have discussed this problem a bit, and there seem to be two main options. The first is to use an external tool that sets the DB id either when we start a new ZK service instance, or when we perform a service migration. A service migration could happen, for example, when we upgrade ZK servers to a new version. The second option is to implement a mechanism that automatically selects a new DB id every time it is necessary.

The former is the simplest to implement, but it is certainly not transparent. The latter is more convenient for operators, but its implementation is tricky for the following reason.
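The external-tool option could look roughly like the sketch below: an operator runs a small program once when bootstrapping or migrating the service, writing the chosen id into each server's data directory. The `dbid` file name and the `SetDbId` class are assumptions made for this illustration, not part of ZooKeeper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical operator tool for the external-configuration option:
// persist a service-wide database id as a plain file in the data directory.
public class SetDbId {

    // Write the agreed-upon id; run once per server at bootstrap/migration.
    public static void writeDbId(Path dataDir, long dbId) throws IOException {
        Files.createDirectories(dataDir);
        Files.write(dataDir.resolve("dbid"), List.of(Long.toString(dbId)));
    }

    // Servers would read this at startup and compare ids during handshakes.
    public static long readDbId(Path dataDir) throws IOException {
        return Long.parseLong(
                Files.readAllLines(dataDir.resolve("dbid")).get(0).trim());
    }
}
```

The obvious cost, as noted above, is that this is not transparent: every migration or reinitialization requires the operator to rerun the tool consistently across all servers.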

The problem of having all correct processes agree upon a value proposed by one of them, despite failures, is known as distributed consensus. We currently have two agreement protocols implemented to perform different tasks. The first is leader election. Leader election is a kind of agreement, but our implementation is not sufficient to implement consensus by itself, although it enables an efficient implementation of consensus. We hence have to embed our leader election protocol in a protocol like Paxos to achieve distributed consensus. The second protocol we have is exactly that: an optimized variant of Paxos that uses the leader election protocol to implement an atomic broadcast protocol. The atomic broadcast protocol is what we use to guarantee that all ZK replicas receive the same set of write operations in the same order.

Theoretically, we could use our atomic broadcast implementation to have the correct processes agree upon a new DB id value whenever necessary. However, our implementation requires each operation to have a zxid (transaction id), and zxids are relative to a particular database version, so we can't use it as it is.
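To make the zxid objection concrete: a zxid packs an epoch into the high 32 bits and a per-epoch counter into the low 32 bits, so a reinitialized cluster starts generating zxids that overlap with those of the old database. A minimal sketch of that packing (the method names here are illustrative):

```java
// Sketch of the zxid layout: epoch in the high 32 bits, counter in the
// low 32 bits. Two distinct database instances can produce identical
// zxids, which is why a zxid only identifies a transaction relative to
// one database instance.
public class ZxidSketch {

    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    public static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    public static long counterOf(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }
}
```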

Since we'd like to have the 3.0.0 release soon, it seems better to go with the external configuration solution for now, and perhaps pursue a transparent solution some other time.


  • Patrick Hunt

    Patrick Hunt - 2008-06-10
    • status: open --> closed