From: Ning L. <nin...@gm...> - 2008-04-16 23:12:15
> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.
Finally checked in the first version for adding a node, with log
propagation and cleanup... :)
On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> Yes, that's one way to do it. Perhaps it is even the best way. But if
> C reports it is ready to start serving [2, 4) before D is, then this
> might be revealed to clients, to more rapidly reduce the load on A and
> B. Might this also permit a less stateful master? My hunch is that the
> more queues of pending splits and such that the master has to maintain,
> the greater the chances of confusion in the face of failure.
>
> Perhaps the master could simply issue step (1), asking C and D to start
> serving [2, 4), and remember only that it had issued this request so that
> it doesn't issue it again too soon. Then, when C & D tell the master
> they're ready, the master can notice that [2, 4) is over-replicated, and
> issue requests to A and B to stop serving this. In this model, the
> master would not wait for both C & D to report, nor would it wait for A
> and B to report their narrowed ranges before republishing the ring.
>
> In this approach, if D fails while replicating [2, 4) then the master
> could still trim A to [0, 2), but not trim B yet. Also, if
> previously-offline E comes back online already serving [0, 2) or [2, 3),
> or some other overlapping range from some prior merge/split state, it
> can be more easily re-integrated into the ring, without discarding all
> of its data.
>
> So, even in the case of a mostly-disjoint partitioning, there may be
> benefits to permitting overlap.
You have a number of good points.
I was worried about the complexity of a state change "transaction"
(i.e. adding/removing a node and the changes to neighbors' ranges
and the changes to more neighbors' log propagation that come
with it). As you said, "the more queues of pending splits" (or
pending range changes) "and such that the master has to maintain,
the greater the chances of confusion in the face of failure."
Your email made me think that we can make the complexity
manageable by breaking a state change "transaction" into
smaller ones.
Here is an example of a state change "transaction":
A new node X' is added at a position between node X and node X+1.
The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
And the log propagation of even more nodes is affected.
We break this "transaction" into four smaller ones:
1. Node X' starts serving range [X', X+3).
   Now range [X', X+3) is replicated 4 times - over-replicated.
2. Node X-2 narrows from [X-2, X+1) to [X-2, X').
   Now only range [X+1, X+3) is still over-replicated.
3. Node X-1 narrows from [X-1, X+2) to [X-1, X+1).
   Now only range [X+2, X+3) is still over-replicated.
4. Node X narrows from [X, X+3) to [X, X+2).
   Now no range is over-replicated.
So in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected. Failure
is now easier to deal with.
It's worth pointing out that the latter three "transactions"
are scheduled to balance the level of replication.
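The four smaller "transactions" above can be sanity-checked with a toy
simulation (a hypothetical Python sketch, not project code; ring positions
are mapped to numbers, with X-2=0, X-1=1, X=2, X+1=3, X+2=4 and X'=2.5):

```python
# Hypothetical model of the add-node steps; all names are illustrative.
# Each node serves a half-open range [start, end); target replication is 3.

def coverage(ranges, point):
    """Number of ranges [start, end) that contain the given point."""
    return sum(1 for start, end in ranges.values() if start <= point < end)

def max_coverage(ranges, lo, hi, step=0.25):
    """Highest replication level over sample points in [lo, hi)."""
    points = [lo + i * step for i in range(int((hi - lo) / step))]
    return max(coverage(ranges, p) for p in points)

# Initial state: node i serves [i, i+3), i.e. replication factor 3.
ranges = {i: (i, i + 3) for i in range(5)}

# 1. New node X' starts serving [X', X+3).
ranges[2.5] = (2.5, 5)
assert max_coverage(ranges, 2.5, 5) == 4   # [X', X+3) over-replicated

# 2. Node X-2 narrows from [X-2, X+1) to [X-2, X').
ranges[0] = (0, 2.5)
assert max_coverage(ranges, 2.5, 3) == 3
assert max_coverage(ranges, 3, 5) == 4     # [X+1, X+3) still over-replicated

# 3. Node X-1 narrows from [X-1, X+2) to [X-1, X+1).
ranges[1] = (1, 3)
assert max_coverage(ranges, 3, 4) == 3
assert max_coverage(ranges, 4, 5) == 4     # only [X+2, X+3) over-replicated

# 4. Node X narrows from [X, X+3) to [X, X+2).
ranges[2] = (2, 4)
assert max_coverage(ranges, 2.5, 5) == 3   # back to uniform replication
```

The assertions mirror the replication levels claimed after each step.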
Is this a good idea?
Ning
PS: Here is what happens when a node is removed:
Node X+1 is removed. The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
node X+1 serves range [X+1, X+4)
to:
node X-2 serves range [X-2, X+2)
node X-1 serves range [X-1, X+3)
node X serves range [X, X+4)
And the log propagation of even more nodes is affected.
Again, we break this "transaction" into four smaller ones:
1. Node X+1 is removed.
   If node X+1 has already failed, this step may require doing nothing.
   Now range [X+1, X+4) is replicated 2 times - under-replicated.
2. Node X-2 widens from [X-2, X+1) to [X-2, X+2).
   Now only range [X+2, X+4) is still under-replicated.
3. Node X-1 widens from [X-1, X+2) to [X-1, X+3).
   Now only range [X+3, X+4) is still under-replicated.
4. Node X widens from [X, X+3) to [X, X+4).
   Now no range is under-replicated.
Again, in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected.
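The removal case can be checked the same way (again a hypothetical Python
sketch; integer positions with X-2=0, X-1=1, X=2, and the removed node
X+1=3, with further neighbors out to position 6):

```python
# Hypothetical model of the remove-node steps; all names are illustrative.
# Range boundaries are integers here, so sampling integer points suffices.

def coverage(ranges, point):
    """Number of ranges [start, end) that contain the given point."""
    return sum(1 for start, end in ranges.values() if start <= point < end)

def min_coverage(ranges, lo, hi):
    """Lowest replication level over integer points in [lo, hi)."""
    return min(coverage(ranges, p) for p in range(lo, hi))

# Initial state: node i serves [i, i+3), i.e. replication factor 3.
ranges = {i: (i, i + 3) for i in range(7)}

# 1. Node X+1 is removed (or has failed).
del ranges[3]
assert min_coverage(ranges, 3, 6) == 2   # [X+1, X+4) under-replicated

# 2. Node X-2 widens from [X-2, X+1) to [X-2, X+2).
ranges[0] = (0, 4)
assert min_coverage(ranges, 3, 4) == 3
assert min_coverage(ranges, 4, 6) == 2   # [X+2, X+4) still under-replicated

# 3. Node X-1 widens from [X-1, X+2) to [X-1, X+3).
ranges[1] = (1, 5)
assert min_coverage(ranges, 4, 5) == 3
assert min_coverage(ranges, 5, 6) == 2   # only [X+3, X+4) under-replicated

# 4. Node X widens from [X, X+3) to [X, X+4).
ranges[2] = (2, 6)
assert min_coverage(ranges, 3, 6) == 3   # back to full replication
```

Note how each widening step restores one unit range, mirroring how each
narrowing step in the add-node case drained one unit of over-replication.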