|
From: Doug C. <cu...@ap...> - 2008-04-08 20:29:32
|
Ning Li wrote: > I thought only the master writes to Zookeeper. Hosts read from Zookeeper. > We should still have a light-weight master, no? Which keeps track of > the heartbeats from the hosts, detects host failures and responds > accordingly, and decides state changes (add/remove nodes) when > appropriate... As a thought experiment, I'm trying to see whether, with Zookeeper, we actually need such a master. We could replace heartbeats with an ephemeral file per node containing its status. (Ephemeral files disappear if their owner goes offline.) Any host could (a) grab a lock; (b) analyze the ring for potential add/removes; (c) post these requests to a directory. In effect, getting the lock is master election, and while a node is doing this analysis, it is the master. But the master moves around: each host has a "master" thread, and remains "master" only during one analysis/action cycle. This analysis cycle should not be very compute or i/o intensive. The advantage of this is that we wouldn't have to explicitly test or otherwise engineer master failover, since it would be happening all the time. All global state would be redundantly stored in Zookeeper. Could this work? > In this example, say B synced with A to A's entry 11 and > with C to C's entry 12 (which includes adding doc X) before > it went offline. Because A expunged its log, now A's log > entry number starts from 21. B comes back online and > finds A's 21 is newer than last time B synced with it (11). > So B cannot recover but has to sync from scratch. My concern was that A might try to sync from B and get stale data. I think you're arguing that B should not go online, accepting sync requests, until it has itself sync'd with its neighbors. In this case, incremental sync would fail and B would sync from scratch, removing any stale adds. How does this work at system startup? How does B know not to go online, providing its stale adds, yet A is permitted to provide its data? Perhaps, at startup a node can respond to requests for its first log entry number, but refuse requests for the logs themselves. Then B can see that it must reload, and refuse requests from A to sync until reload is complete. Does that work? Doug |