From: Ning L. <nin...@gm...> - 2008-04-22 22:52:16

It's been quiet for a while. Any comments/thoughts?

Cheers,
Ning

On Wed, Apr 16, 2008 at 7:12 PM, Ning Li <nin...@gm...> wrote:
> > I'm coding the add-node process with the overlapping replication
> > right now. It's taking some time. After it's finished, we can review
> > the code and have a complete picture of the complexity and the
> > performance implications.
>
> Finally checked in the first version for adding a node, with log
> propagation and cleanup... :)
>
>
> On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> > Yes, that's one way to do it. Perhaps it is even the best way. But if
> > C reports it is ready to start serving [2, 4) before D is, then this
> > might be revealed to clients, to more rapidly reduce the load on A and
> > B. Might this also permit a less stateful master? My hunch is that the
> > more queues of pending splits and such that the master has to maintain,
> > the greater the chances of confusion in the face of failure.
> >
> > Perhaps the master could simply issue step (1), asking C and D to start
> > serving [2, 4), and remember only that it had issued this request so
> > that it doesn't issue it again too soon. Then, when C & D tell the
> > master they're ready, the master can notice that [2, 4) is
> > over-replicated, and issue requests to A and B to stop serving it. In
> > this model, the master would not wait for both C & D to report, nor
> > would it wait for A and B to report their narrowed ranges before
> > republishing the ring.
> >
> > In this approach, if D fails while replicating [2, 4) then the master
> > could still trim A to [0, 2), but not trim B yet. Also, if
> > previously-offline E comes back online already serving [0, 2) or
> > [2, 3), or some other overlapping range from some prior merge/split
> > state, it can be more easily re-integrated into the ring, without
> > discarding all of its data.
> >
> > So, even in the case of a mostly-disjoint partitioning, there may be
> > benefits to permitting overlap.
>
> You have a number of good points.
>
> I was worried about the complexity of a state change "transaction"
> (i.e. adding/removing a node, plus the changes to neighbors' ranges
> and to more neighbors' log propagation that come with it). As you
> said, "the more queues of pending splits" (or pending range changes)
> "and such that the master has to maintain, the greater the chances of
> confusion in the face of failure."
>
> Your email made me think that we can make the complexity manageable
> by breaking a state change "transaction" into smaller ones.
>
> Here is how I think of a state change "transaction": a new node X' is
> added at position X', between X and X+1. The affected nodes switch
> from:
>
>   node X-2 serves range [X-2, X+1)
>   node X-1 serves range [X-1, X+2)
>   node X   serves range [X, X+3)
>
> to:
>
>   node X-2 serves range [X-2, X')
>   node X-1 serves range [X-1, X+1)
>   node X   serves range [X, X+2)
>   node X'  serves range [X', X+3)
>
> And the log propagation of even more nodes is affected.
>
> We break this "transaction" into four smaller ones:
>
>   1. node X' serves range [X', X+3)
>      Now range [X', X+3) is replicated 4 times - over-replicated.
>   2. node X-2 goes from [X-2, X+1) to [X-2, X')
>      Now only range [X+1, X+3) is still over-replicated.
>   3. node X-1 goes from [X-1, X+2) to [X-1, X+1)
>      Now only range [X+2, X+3) is still over-replicated.
>   4. node X goes from [X, X+3) to [X, X+2)
>      Now no range is over-replicated.
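[Editor's note: for concreteness, the quoted four-step decomposition can be checked mechanically. Below is a small, self-contained Java sketch, not code from the project, that maps the ring positions to integers (X-2 = 0, X-1 = 1, X = 2, X' = 3, X+1 = 4, X+2 = 5, X+3 = 6), assumes a replication level of 3, applies the four smaller "transactions" in order, and prints the replication level of each key in [X, X+3) after each step. All class and method names are made up for illustration.]

    import java.util.ArrayList;
    import java.util.List;

    public class AddNodeSketch {

        // One node's assignment: a half-open range [start, end) on the ring.
        static final class Assignment {
            final String node;
            int start, end;          // end is exclusive
            Assignment(String node, int start, int end) {
                this.node = node; this.start = start; this.end = end;
            }
            boolean covers(int key) { return key >= start && key < end; }
            public String toString() { return node + " serves [" + start + ", " + end + ")"; }
        }

        // Replication level of a key = number of nodes currently serving it.
        static int replication(List<Assignment> ring, int key) {
            int n = 0;
            for (Assignment a : ring) if (a.covers(key)) n++;
            return n;
        }

        static void report(String step, List<Assignment> ring) {
            System.out.println("-- " + step);
            for (Assignment a : ring) System.out.println("   " + a);
            for (int key = 2; key < 6; key++)   // keys in [X, X+3)
                System.out.println("   key " + key + ": " + replication(ring, key) + "x");
        }

        public static void main(String[] args) {
            List<Assignment> ring = new ArrayList<>(List.of(
                    new Assignment("X-2", 0, 4),    // [X-2, X+1)
                    new Assignment("X-1", 1, 5),    // [X-1, X+2)
                    new Assignment("X",   2, 6),    // [X,   X+3)
                    new Assignment("X+1", 4, 8),    // [X+1, X+4), unchanged by the join
                    new Assignment("X+2", 5, 9)));  // [X+2, X+5), unchanged by the join
            report("before X' joins", ring);

            // 1. X' starts serving [X', X+3): all of [X', X+3) becomes 4x.
            ring.add(new Assignment("X'", 3, 6));
            report("1: X' serves [X', X+3)", ring);

            // 2. X-2 narrows to [X-2, X'): only [X+1, X+3) is still 4x.
            ring.get(0).end = 3;
            report("2: X-2 -> [X-2, X')", ring);

            // 3. X-1 narrows to [X-1, X+1): only [X+2, X+3) is still 4x.
            ring.get(1).end = 4;
            report("3: X-1 -> [X-1, X+1)", ring);

            // 4. X narrows to [X, X+2): everything is back to exactly 3x.
            ring.get(2).end = 5;
            report("4: X -> [X, X+2)", ring);
        }
    }

At every intermediate step each key in [X, X+3) is served by at least 3 and at most 4 nodes, which appears to be the point of doing step 1 first and the three narrowing steps afterwards.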
> So in each of these smaller "transactions", only one node
> and some neighbors' log propagation are affected. Failure
> is now easier to deal with.
>
> It's worth pointing out that the latter three "transactions"
> are scheduled to balance the level of replication.
>
> Is this a good idea?
>
> Ning
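[Editor's note: Doug's earlier suggestion and Ning's decomposition both point toward a master that keeps little per-transaction state: it asks the new node to start serving, and then, whenever it observes that a range is over-replicated by ready nodes, it asks a single node to narrow its range. The following is a rough Java sketch of one way such a reconciliation pass might look; it is not the project's actual master code, and the interfaces NodeHandle and RingState are hypothetical.]

    import java.util.List;

    // Hypothetical view of a node as seen by the master; not project code.
    interface NodeHandle {
        int rangeStart();                   // inclusive ring position
        int rangeEnd();                     // exclusive ring position
        boolean readyToServe();             // finished copying / replaying logs?
        void serve(int start, int end);     // ask the node to serve [start, end)
        void narrowTo(int start, int end);  // ask the node to shrink its range
    }

    // Hypothetical read-only view of the ring.
    interface RingState {
        List<NodeHandle> nodes();
        // Replication level of the least-replicated key in [start, end),
        // counting only nodes that report readyToServe().
        int minReplication(int start, int end);
    }

    final class MasterSketch {
        private final int targetReplication;

        MasterSketch(int targetReplication) { this.targetReplication = targetReplication; }

        // Step 1: ask the new node to serve its full range. The master keeps no
        // multi-step plan; it only avoids re-issuing the request too soon.
        void admit(NodeHandle newNode, int start, int end) {
            newNode.serve(start, end);
        }

        // Periodic pass: narrow at most one node, and only if the part it gives
        // up would still be fully replicated by ready nodes afterwards. Steps
        // 2-4 of the decomposition fall out of repeated passes of this loop.
        void reconcile(RingState ring, List<int[]> desiredRanges) {
            List<NodeHandle> nodes = ring.nodes();
            for (int i = 0; i < nodes.size(); i++) {
                NodeHandle n = nodes.get(i);
                int wantStart = desiredRanges.get(i)[0];
                int wantEnd = desiredRanges.get(i)[1];
                if (wantEnd >= n.rangeEnd()) continue;   // nothing to give up
                // The part this node would stop serving is [wantEnd, rangeEnd()).
                // Trim only if that part is currently over-replicated, i.e. it
                // keeps targetReplication copies even without this node.
                if (ring.minReplication(wantEnd, n.rangeEnd()) > targetReplication) {
                    n.narrowTo(wantStart, wantEnd);
                    return;                              // one small "transaction" per pass
                }
            }
        }
    }

Because each narrowing request is only issued once the range it gives up is already covered by enough ready nodes, a failure between passes leaves the ring over-replicated rather than under-replicated, which matches the thread's reasoning about making failures easier to handle.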