From: Doug C. <cu...@ap...> - 2008-04-04 16:22:04
Ning Li wrote:
> Database as the application interface and RangedDatabase as the
> service-provider interface sounds good. The methods in RangedDatabase
> will be very similar to those in the current RangedDatabase?

Yes. HeapDatabase should extend this now though.

> I've been thinking about this. Finally, I think this is a possibility:
> 1 The database records and logs a deleted document and its version.
> 2 If all the replicas have recorded and logged the deleted document
>   with the same version number, the document can be removed from
>   the database. This is because any new versions of the document
>   that come after will have a larger version number.
>
> Does this sound right? Things are more complicated when we
> consider state changes...

It sounds right except for the case of a long-offline node coming back
online. More on that below...

> Do we allow a node to go offline for a long time? I thought we'd consider
> the node goes down and pick a replacement for it.

It would be nice to be able to eliminate long-offline nodes, but I don't
yet see how.

At startup we want nodes to announce their content to the master. Not
all nodes will start at exactly the same time. (Note also that, if the
master fails, then nodes will re-elect a new master and post their state
there. Search and indexing should continue uninterrupted through master
moves.) So, when a master first starts, it needs to avoid modifying the
ring for a time, until it can assume that all nodes are up. We might even
have nodes randomly delay their first report, so that the master isn't
overwhelmed.

If the network is partitioned, then the master would allocate new nodes
to underserviced regions. When the network is repaired, we have the
choice of ignoring the data on the nodes that were replaced, or
synchronizing it with what has transpired in their absence.
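For what it's worth, the delete-tracking rule from steps 1 and 2 above could be sketched roughly like this. This is only an illustrative sketch, not code from the project; all names (DeleteLog, record_delete, ack_delete, purgeable) are made up for the example. The idea is just that a tombstone keeps its version and the set of replicas that have acknowledged it, and it becomes safe to purge once every replica has logged that same version, since any later update to the document will carry a larger version number:

```python
class DeleteLog:
    """Hypothetical tombstone tracker for one region's replicas."""

    def __init__(self, replicas):
        self.replicas = set(replicas)   # replica ids serving this region
        self.tombstones = {}            # doc_id -> (version, set of acked replicas)

    def record_delete(self, doc_id, version):
        """Step 1: record and log a deleted document with its version."""
        self.tombstones[doc_id] = (version, set())

    def ack_delete(self, replica_id, doc_id, version):
        """A replica reports it has recorded the delete at this version."""
        entry = self.tombstones.get(doc_id)
        if entry is None or entry[0] != version:
            return False                # stale or unknown acknowledgement
        entry[1].add(replica_id)
        return True

    def purgeable(self, doc_id):
        """Step 2: the tombstone may be removed once all replicas have
        logged the delete with the same version number."""
        entry = self.tombstones.get(doc_id)
        return entry is not None and entry[1] >= self.replicas

    def purge(self, doc_id):
        if self.purgeable(doc_id):
            del self.tombstones[doc_id]
```

Note that this sketch deliberately ignores the long-offline-node case discussed below: a replica that was absent when the tombstone was purged would never see the delete, which is exactly the gap the rest of this message is about.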
In the case where all replicas of a region were offline, we would want
to use their data when they come back online (like the system-restart
case), but when only a single replica was offline we might simply ignore
its data and let it sync from scratch. However, it may not be easy to
distinguish these cases. If all replicas go offline and we then add new
nodes to the region, we'd need to remember that, at some point in the
past, all nodes in that region were offline. If the master was restarted
during this time, it will be even harder to keep track of this. I'm
still hopeful that we can come up with a heuristic for this, but I need
to think more about what it should be.

Doug