|
From: Doug C. <cu...@ap...> - 2008-04-22 23:56:43
|
Ning Li wrote: > Your email made me think that we can make the complexity > manageable by breaking a state change "transaction" into > smaller ones. > > Here is what I think of a state change "transaction": > A new node X' is added at the position X' between X and X+1. > The affected nodes switch from: > node X-2 serves range [X-2, X+1) > node X-1 serves range [X-1, X+2) > node X serves range [X, X+3) > to: > node X-2 serves range [X-2, X') > node X-1 serves range [X-1, X+1) > node X serves range [X, X+2) > node X' serves range [X', X+3) > And the log propagation of even more nodes are affected. > > We break this "transaction" into four smaller ones: > 1 node X' serves range [X', X+3) > Now range [X', X+3) is replicated 4 times - over-replicated. > 2 node X-2 from [X-2, X+1) to [X-2, X') > Now only range [X+1, X+3) is still over-replicated. > 3 node X-1 from [X-1, X+2) to [X-1, X+1) > Now only range [X+2, X+3) is still over-replicated > 4 node X from [X, X+3) to [X, X+2) > Now no range is over-replicated. > So in each of these smaller "transactions", only one node > and some neighbors' log propagation are affected. Failure > is now easier to deal with. > > It's worth pointing out that the latter three "transactions" > are scheduled to balance the level of replication. > > Is this a good idea? This is nice. Ideally the flow wouldn't require a lot of state. So adding X' is triggered first (by turning on a new node or by splitting a hot spot or somesuch). Once X' is online, then (2), (3) and (4) are triggered by the over-replication, with a lock to permit only one transaction at a time. There's no explicit plan or worklist. Instead, the state can be evaluated at each point, and the highest priority transaction can be executed. Repeat. So the algorithm might look like: condition: action: ----------------------------------------------------- under-replication start a replication over-replication start a reduction new node available add @ hottest spot hot spot free a node These are priority-ordered. Only one action may be in progress at a time in a given region of the ring. Does that work? Doug |