From: Ning L. <nin...@gm...> - 2008-04-04 22:18:40
On Fri, Apr 4, 2008 at 12:22 PM, Doug Cutting <cu...@ap...> wrote:
> It would be nice to be able to eliminate long-offline nodes, but I don't
> yet see how.
>
> At startup we want nodes to announce their content to the master. Not
> all nodes will start at exactly the same time. (Note also that, if the
> master fails, the nodes will re-elect a new master and post their state
> there. Search and indexing should continue uninterrupted through master
> moves.) So, when a master first starts, it needs to avoid modifying the
> ring for a time, until it can assume that all nodes are up. We might
> even have nodes randomly delay their first report, so that the master
> isn't overwhelmed.
>
> If the network is partitioned, then the master would allocate new nodes
> to underserviced regions. When the network is repaired, we have the
> choice of ignoring the data on the nodes that were replaced, or
> synchronizing it with what has transpired in their absence. In the case
> where all replicas of a region were offline, we would want to use their
> data when they come back online (like the system-restart case), but
> when only a single replica was offline we might simply ignore its data
> and let it sync from scratch. However, it may not be easy to
> distinguish these cases. If all replicas go offline and we then add new
> nodes to the region, we'd need to remember that, at some point in the
> past, all nodes in that region were offline. If the master was
> restarted during this time, it would be even harder to keep track of
> this.

We were thinking of storing the "metadata" in ZooKeeper, right? That is: every node ever created, its range, its current log entry number, the starting entry numbers of its overlapping nodes, and the current ring. So when a master is restarted, it starts from exactly where it was before it went down by retrieving the metadata from ZooKeeper. When an offline node comes back up, we verify with ZooKeeper that it is a valid node in a valid state. Then we check whether we can patch up its data, by checking whether the overlapping nodes and their logs still have the entries back to the node's starting entry numbers. If not, we sync from scratch.

What do you think?

Ning
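The patch-up decision described above could be sketched roughly as follows. Everything here is hypothetical illustration, not any real ZooKeeper API: `NodeMeta` stands in for the per-node metadata the email proposes to keep in ZooKeeper, and `can_patch` is the check that the overlapping nodes' logs still reach back to the entries the returning node missed.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeMeta:
    """Hypothetical model of the per-node metadata kept in ZooKeeper."""
    name: str
    current_entry: int            # last log entry this node has applied
    oldest_retained_entry: int    # oldest entry still present in its log
    overlaps: List[str] = field(default_factory=list)  # overlapping node names

def can_patch(returning: NodeMeta, peers: Dict[str, NodeMeta]) -> bool:
    """True if the returning node can be caught up from its overlapping
    nodes' logs; False means it must sync from scratch."""
    first_needed = returning.current_entry + 1  # first entry it missed
    for peer_name in returning.overlaps:
        peer = peers.get(peer_name)
        if peer is None:
            return False  # an overlapping node is itself gone
        if peer.oldest_retained_entry > first_needed:
            return False  # that peer's log has been truncated too far
    return True
```

The design choice this encodes is the one in the email: catch-up from peers is possible only while every overlapping node still retains log entries back to the point where the returning node stopped; once any peer has truncated past that point, the cheap path is gone and a full resync is the only safe option.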