From: Doug C. <cu...@ap...> - 2008-04-10 22:38:13
Ning Li wrote:
> Let's say the replication is 2 and node A and node B are the two replicas
> of range [0, 4). Now we decide to split [0, 4) into two: [0, 2) and [2, 4).
> After the split, node A and node B will be the replicas of [0, 2) and
> node C and node D will be created to be the replicas of [2, 4).
>
> The split process works as follows:
> 1 Master tells some hosts to create node C and node D of [2, 4)
>   from node A or node B
> 2 C and D tell the master they are ready
> 3 Master changes the map to A and B serving [0, 2) and C and D
>   serving [2, 4)
> 4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
> 5 A and B tell the master they are done
>
> Care should be taken in synchronizing logs in the process.
>
> Here, it doesn't matter if at one point in time, node X serves R1,
> node Y serves R2, and R1 and R2 overlap. What does matter is
> that there is no overlap in the map which the master maintains.
> And because clients get the map from the master, clients do
> not see any overlap in the map, either.

Yes, that's one way to do it. Perhaps it is even the best way. But if C
reports it is ready to start serving [2, 4) before D is, then this might be
revealed to clients, to more rapidly reduce the load on A and B.

Might this also permit a less stateful master? My hunch is that the more
queues of pending splits and such that the master has to maintain, the
greater the chances of confusion in the face of failure. Perhaps the master
could simply issue step (1), asking C and D to start serving [2, 4), and
remember only that it had issued this request so that it doesn't issue it
again too soon. Then, when C & D tell the master they're ready, the master
can notice that [2, 4) is over-replicated, and issue requests to A and B to
stop serving this. In this model, the master would not wait for both C & D
to report, nor would it wait for A and B to report their narrowed ranges
before republishing the ring.

In this approach, if D fails while replicating [2, 4) then the master could
still trim A to [0, 2), but not trim B yet. Also, if previously-offline E
comes back online already serving [0, 2) or [2, 3), or some other
overlapping range from some prior merge/split state, it can be more easily
re-integrated into the ring, without discarding all of its data.

So, even in the case of a mostly-disjoint partitioning, there may be
benefits to permitting overlap.

> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.

Great!

Doug
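
[Editorial note] To make the reactive, overlap-tolerant master described above concrete, here is a rough Java sketch. It is not code from Ning's patch or from this project; the names (Range, Master, CREATE, STOP_SERVING) and the fixed replication factor of 2 are made up for illustration. The master keeps no per-split queue: it publishes each new replica as soon as it reports ready, and only trims the old, wider replicas when it notices a sub-range is over-replicated.

import java.util.*;

class Range {
    final int lo, hi;                              // half-open interval [lo, hi)
    Range(int lo, int hi) { this.lo = lo; this.hi = hi; }
    boolean covers(Range r) { return lo <= r.lo && r.hi <= hi; }
    @Override public String toString() { return "[" + lo + ", " + hi + ")"; }
}

class Master {
    static final int REPLICATION = 2;
    // node name -> range it serves; entries may overlap while a split is in flight
    final Map<String, Range> map = new LinkedHashMap<>();
    final Set<String> splitsIssued = new HashSet<>();  // the only split state kept

    /** Step (1): ask the new nodes to copy the child range from an existing replica. */
    void requestSplit(Range child, String... newNodes) {
        if (!splitsIssued.add(child.toString())) return;      // don't re-issue too soon
        for (String n : newNodes) send(n, "CREATE " + child);
    }

    /** A new node reports ready: publish it immediately, then look for excess. */
    void onReady(String node, Range child) {
        map.put(node, child);                                 // clients may now see overlap
        trimIfOverReplicated(child);
    }

    /** If more than REPLICATION nodes cover the child range, narrow the older, wider replicas. */
    void trimIfOverReplicated(Range child) {
        List<String> covering = new ArrayList<>();
        for (Map.Entry<String, Range> e : map.entrySet())
            if (e.getValue().covers(child)) covering.add(e.getKey());
        int excess = covering.size() - REPLICATION;
        for (String node : covering) {
            if (excess <= 0) break;                           // never drop below the target
            Range r = map.get(node);
            if (r.lo < child.lo || child.hi < r.hi) {         // a wider, older replica (A or B)
                send(node, "STOP_SERVING " + child);
                // assumes the child was split off the top of the range, as in the example
                map.put(node, new Range(r.lo, child.lo));     // e.g. A: [0, 4) -> [0, 2)
                excess--;
            }
        }
    }

    void send(String node, String msg) { System.out.println(node + " <- " + msg); }
}

Seeding map with A and B both serving [0, 4) and then calling onReady("C", new Range(2, 4)) trims A to [0, 2); a later onReady("D", new Range(2, 4)) trims B. If D never reports, B simply keeps serving [2, 4), which matches the failure case Doug describes.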