From: Doug C. <cu...@ap...> - 2008-04-10 23:42:15
Ning Li wrote:
> Yes, this would work. We simply put the code for master into host
> and make sure only one host is executing the master code at a time.
>
> Interesting. :) But let's use a master for now since it's easier to
> debug this way?

Keeping all the state in ZooKeeper may also make it easy to debug, since it can be read from any client.

I actually don't think it should be hard to have it both ways. We should write the master as a fail-over loop in any case, so that multiple master-daemon processes can be running on a cluster at once, with only one acting as master. Switching masters frequently should be as simple as adding a counter to the top-level daemon loop to decide when to give up the lock.

The master should be a different class from the host, with its own main() method, so that it can be run independently, but it should also be possible to run the master in the same JVM as a host. (We'll want this for testing anyway.) These two simple properties, lightweight master failover and the ability to run the master on a host, mean that we can choose either to dedicate a machine to the master or to run the master daemon on every host, switching frequently. I'd certainly like to preserve the latter option, so let's keep it in mind as we code.

>> How does this work at system startup? How does B know not to go online,
>> providing its stale adds, yet A is permitted to provide its data?
>> Perhaps, at startup a node can respond to requests for its first log
>> entry number, but refuse requests for the logs themselves. Then B can
>> see that it must reload, and refuse requests from A to sync until reload
>> is complete. Does that work?
>
> What do you mean by "a node can respond to requests for its first log
> entry number, but refuse requests for the logs themselves"?

From A's perspective, when B comes online, B's data looks fine, since B's log is complete. But B discovers, when it talks to A, that B's data is obsolete.
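The fail-over loop I have in mind might look something like the sketch below. The DistributedLock class here is a hypothetical in-memory stand-in for a ZooKeeper-based lock (the tryAcquire/release names are my own, not a real ZooKeeper API); the point is only the shape of the loop: try for the lock, act as master until the counter expires, then release it so another daemon can rotate in.

```java
import java.util.concurrent.atomic.AtomicReference;

public class MasterDaemon {

  /** Stand-in for a ZooKeeper lock: at most one owner at a time. */
  static class DistributedLock {
    private final AtomicReference<String> owner = new AtomicReference<>();
    boolean tryAcquire(String id) { return owner.compareAndSet(null, id); }
    void release(String id) { owner.compareAndSet(id, null); }
  }

  private final String id;
  private final DistributedLock lock;
  private final int maxRounds;  // counter deciding when to give up the lock

  MasterDaemon(String id, DistributedLock lock, int maxRounds) {
    this.id = id; this.lock = lock; this.maxRounds = maxRounds;
  }

  /** One pass of the top-level daemon loop; returns rounds served as master. */
  int runOnce() {
    if (!lock.tryAcquire(id)) {
      return 0;                      // someone else is master; stand by
    }
    int rounds = 0;
    try {
      while (rounds < maxRounds) {   // act as master until the counter expires
        // ... master duties: assign ranges, monitor hosts, etc. ...
        rounds++;
      }
    } finally {
      lock.release(id);              // give up mastership so others can take over
    }
    return rounds;
  }

  // The master is its own class with its own main(), so it can run
  // standalone or be started inside a host JVM for testing.
  public static void main(String[] args) {
    DistributedLock lock = new DistributedLock();
    MasterDaemon a = new MasterDaemon("daemon-a", lock, 3);
    MasterDaemon b = new MasterDaemon("daemon-b", lock, 3);
    System.out.println("a served " + a.runOnce() + " rounds");
    System.out.println("b served " + b.runOnce() + " rounds");
  }
}
```

With the lock factored out this way, the same class works whether we dedicate a machine to the master or run the daemon on every host.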
If A retrieves B's data before B discovers that it is obsolete, then A would get stale adds. So B must block retrieval of its data until it has determined that it is valid with respect to all of its neighbors. At system startup this means that all nodes must wait for their neighbors to come online, so that they can determine whether their own state is valid, before permitting any synchronization. Does that make any sense?

> Maybe at system startup, master (or a host) analyzes, for each node,
> its first log entry number and the log entry numbers it has synced with
> its overlapping nodes. Then it is derived from this analysis which nodes
> need complete reloads? Is this what you meant?

I think each node can determine that on its own at startup. It:

1. Posts its range and log start number.
2. Waits a bit, so all other nodes have had a chance to post their data.
3. Decides whether its index is valid, by comparing each overlapping node's log start number with the last log entry it synced from that node, to see whether any of them has compacted its log past that point.
4. Starts syncing with its neighbors.

I worry that there might be pathological cases where different nodes were offline and/or compacted at different times, causing all replicas of a range to be discarded. Does this make any sense?

Doug
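Step 3 of the startup procedure above could look roughly like this. The board map is a hypothetical stand-in for wherever nodes post their state (e.g. ZooKeeper znodes), and all names here are illustrative assumptions, not proposed APIs.

```java
import java.util.Map;

public class StartupCheck {

  /** What each node posts at startup: the first entry still in its log. */
  static class Posting {
    final long logStart;
    Posting(long logStart) { this.logStart = logStart; }
  }

  /**
   * Step 3: after waiting for neighbors to post (step 2), decide whether
   * this node's index is valid. For each overlapping node, compare its
   * posted log start with the last entry we synced from it; if a neighbor
   * has compacted away entries we never received, our index is stale and
   * needs a complete reload.
   */
  static boolean indexValid(Map<String, Posting> board,
                            Map<String, Long> lastSynced) {
    for (Map.Entry<String, Long> e : lastSynced.entrySet()) {
      Posting p = board.get(e.getKey());
      if (p == null) {
        return false;              // neighbor not posted yet: keep waiting
      }
      if (p.logStart > e.getValue() + 1) {
        return false;              // neighbor compacted past our sync point
      }
    }
    return true;                   // safe to start syncing (step 4)
  }
}
```

A node whose check fails would refuse sync requests and reload, which is exactly the behavior we want from B in the scenario above.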