From: Ning L. <nin...@gm...> - 2008-04-02 00:10:01
On Mon, Mar 31, 2008 at 5:35 PM, Doug Cutting <cu...@ap...> wrote:
> I'd thought mostly in terms of adding/removing. If a hot spot is
> identified on the ring, then it should be split by adding a new node in
> its range. The node can be allocated by finding a cool spot and
> removing a node there.
>
> But moving a node from a hot host to a cooler host might be faster,
> since the index could be copied rather than reconstructed from logs.
> However this might not be enough to handle uneven loading of the ring.
> For that we really need to be able to change the partitioning by placing
> new nodes in hot spots.
I thought both types are useful, and I started with the simpler one.
But support for adding/removing nodes is definitely necessary.
> That sounds right. In terms of hostToMaster protocol, there are four
> messages:
> - Master tells B to start copying N (in response to heartbeat)
> - B tells Master that it now serves N
> - Master tells A to remove N (in heartbeat response)
> - A tells Master it no longer serves N
>
> The master will have to keep track of what moves are in flight, so that
> it doesn't schedule too many in one region and risk losing data, right?
Yes. For simplicity and to reduce the risk of losing data, no other
move or add/remove should be started for the nodes overlapping
with node N.
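For illustration, a minimal sketch of how the master might
track the ring positions frozen by an in-flight move and
refuse to start another operation that overlaps them (all
names are hypothetical, just a sketch):

import java.util.HashSet;
import java.util.Set;

/** Sketch of master-side bookkeeping for in-flight moves.
 *  Ring positions are modeled as integers for simplicity. */
class MoveTracker {
  // positions involved in some in-flight move or add/remove
  private final Set<Integer> frozen = new HashSet<Integer>();

  /** Try to start an operation on node n given its overlapping
   *  nodes; returns false if any of them is already frozen. */
  synchronized boolean tryStart(int n, Set<Integer> overlapping) {
    if (frozen.contains(n)) return false;
    for (int pos : overlapping) {
      if (frozen.contains(pos)) return false;
    }
    frozen.add(n);
    frozen.addAll(overlapping);
    return true;
  }

  /** Called when the hosts report the operation is complete. */
  synchronized void finish(int n, Set<Integer> overlapping) {
    frozen.remove(n);
    frozen.removeAll(overlapping);
  }
}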
In designing the addition/removal or the move of a node, I think
we must achieve:
G1 There is no data loss if no host goes down in the process,
even if the replication level is 1.
And we should achieve:
G2 When a new node is added or a node adds a new range,
the node can build the database on the range from the
node(s) serving that range. That is, the node does not
have to construct the database on the range from the
log(s).
G3 When a (new) node starts to serve a new range, it
knows where to start processing the log entries from
the overlapping nodes of the range.
G2 and G3 aim to avoid processing a log from the
beginning, which becomes less feasible as a log grows.
If we achieve G2 and G3, it also becomes possible to
remove old log entries.
In the design of how to move a node:
- We achieve G2 by B copying a checkpoint of N from A.
- We achieve G3 by B also copying, as part of the
checkpoint, the starting entry numbers of the
overlapping nodes. That tells B where to start
processing the log entries from the overlapping nodes.
- We achieve G1 by keeping N's log on A after A stops
serving N and stops processing logs from the nodes
overlapping N: N (now on B) and the nodes overlapping
N continue to process N's log on A until all entries
are processed. Only then can A remove N and its log.
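A minimal sketch of what that checkpoint might carry,
assuming log positions can be expressed as simple entry
numbers (all names hypothetical):

import java.util.Map;

/** Sketch of the checkpoint B copies from A when taking over
 *  node N. Besides the database snapshot of N's range (G2), it
 *  carries, per overlapping node, the log entry number from
 *  which B must resume processing (G3). */
class NodeCheckpoint {
  final String nodeId;                  // the node N being moved
  final byte[] databaseSnapshot;        // snapshot of N's range
  final Map<String, Long> resumePoints; // overlapping node -> next entry

  NodeCheckpoint(String nodeId, byte[] databaseSnapshot,
                 Map<String, Long> resumePoints) {
    this.nodeId = nodeId;
    this.databaseSnapshot = databaseSnapshot;
    this.resumePoints = resumePoints;
  }
}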
> I think I'd rather see us implement add/remove balancing rather than
> copying first, since, as mentioned above, I think it is more general.
> It also means that there's much less chance that the same range/node
> will exist on two hosts at once.
>
> The communication with the master would be much the same. Assume that N
> is an underloaded node, and that M is an overloaded point on the ring.
> 1. Master tells N's neighbors to expand to cover N
> 2. N's neighbors tell Master that N is now redundant
> 3. Master tells A to stop serving N
> 4. Master tells A to start serving M
> 5. A tells master it's now serving M
>
> The master must keep track of what nodes it is trying to decommission,
> and avoid trying to decommission more than one in any overlapping region
> at a time. It also has to keep track of new node requests that it has
> made, so that it doesn't issue them to multiple nodes. Heartbeats
> should refresh this state, i.e., a host should list its pending ranges
> as well as its complete ranges. I think that's sufficient state for the
> master. Does that sound right?
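For illustration, a heartbeat listing both pending and
complete ranges might look roughly like the following
sketch (all names hypothetical):

import java.util.List;

/** Sketch of a heartbeat that lists both complete and pending
 *  ranges, so the master can refresh its view of in-flight work. */
class Heartbeat {
  static class Range {
    final double start, end;   // [start, end) on the ring
    Range(double start, double end) { this.start = start; this.end = end; }
  }

  final String hostId;
  final List<Range> completeRanges; // ranges the host fully serves
  final List<Range> pendingRanges;  // ranges still being copied/built

  Heartbeat(String hostId, List<Range> complete, List<Range> pending) {
    this.hostId = hostId;
    this.completeRanges = complete;
    this.pendingRanges = pending;
  }
}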
Would it be clearer if we described adding a node and
removing a node separately? I think it's important to
figure out how G1, G2 & G3 above are achieved in the
design.
Assume the replication level is 3. Let's look at adding
a node first. Assume a new node X' is added at a
position between X and X+1 and assigned to host A.
The neighboring nodes will be affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
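For concreteness, the ranges above follow from a simple
rule: with replication level R, a node serves from its own
position up to the position R hops ahead on the sorted
ring. A small sketch (hypothetical names; positions shown
as numbers with X = 10 and X' = 10.5):

import java.util.SortedSet;
import java.util.TreeSet;

/** Sketch: with replication level r, a node's range runs from its
 *  own position to the position r hops ahead on the sorted ring. */
class RingRanges {
  static void printRanges(SortedSet<Double> positions, int r) {
    Double[] ring = positions.toArray(new Double[0]);
    for (int i = 0; i < ring.length; i++) {
      double end = ring[(i + r) % ring.length]; // position r hops ahead
      System.out.println("node " + ring[i]
          + " serves [" + ring[i] + ", " + end + ")");
    }
  }

  public static void main(String[] args) {
    SortedSet<Double> ring = new TreeSet<Double>();
    for (double p : new double[] {8, 9, 10, 11, 12, 13}) ring.add(p);
    printRanges(ring, 3); // node 10 serves [10.0, 13.0), i.e. [X, X+3)
    ring.add(10.5);       // add X' between X and X+1
    printRanges(ring, 3); // node 10 now serves [10.0, 12.0),
                          // node 10.5 serves [10.5, 13.0)
  }
}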
The process works as follows:
1 Master tells host A to start serving node X'
2 Host A copies [X', X+3) from node X (or other
node(s) which cover(s) [X', X+3))
2.1 Host A records the starting entry numbers of
the overlapping nodes of node X
2.2 Host A retrieves documents in range [X', X+3)
from node X (a new functionality)
2.3 Host A builds node X' from retrieved documents
2.4 Host A processes log entries from the
overlapping nodes starting from the entry
numbers recorded in 2.1
3 Host A tells master it's now serving node X'
4 Master changes the map and tells X, X-1, X-2
their new smaller ranges
5 Nodes X, X-1, X-2 gradually remove the
documents no longer in their ranges
Step 2.2 achieves G2 and step 2.1 achieves G3.
Achieving G1 is a bit complicated but possible.
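To make steps 2.1 through 2.4 concrete, here is a rough
sketch of the work on host A. All interfaces and methods
below are hypothetical placeholders, not existing APIs:

/** Sketch of steps 2.1-2.4 on host A when bringing up node X'. */
class AddNodeSketch {
  interface RemoteNode {
    /** 2.1: current log entry numbers of the node's overlapping nodes. */
    java.util.Map<String, Long> getOverlappingEntryNumbers();
    /** 2.2: documents in the given range (the new functionality). */
    Iterable<byte[]> getDocumentsInRange(double start, double end);
    /** Log entries of this node, starting at the given entry number. */
    Iterable<byte[]> getLogEntriesFrom(long entryNumber);
  }

  interface LocalNode {
    void addDocument(byte[] doc);     // 2.3: build the database
    void applyLogEntry(byte[] entry); // 2.4: catch up from logs
  }

  static void bringUp(LocalNode xPrime, RemoteNode x,
                      java.util.Map<String, RemoteNode> overlapping,
                      double start, double end) {
    // 2.1: record where to resume in each overlapping node's log (G3)
    java.util.Map<String, Long> resume = x.getOverlappingEntryNumbers();
    // 2.2 + 2.3: build X' from documents already indexed on node X (G2)
    for (byte[] doc : x.getDocumentsInRange(start, end)) {
      xPrime.addDocument(doc);
    }
    // 2.4: replay log entries written since the recorded entry numbers
    for (java.util.Map.Entry<String, Long> e : resume.entrySet()) {
      RemoteNode neighbor = overlapping.get(e.getKey());
      for (byte[] entry : neighbor.getLogEntriesFrom(e.getValue())) {
        xPrime.applyLogEntry(entry);
      }
    }
    // 3: host A would now tell the master it serves X'
  }
}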
The process to remove a node is similar and again
the algorithm to achieve G1 is complicated.
What do you think?
The algorithm to achieve G1 can be simpler if we
change to a simpler replication scheme: instead
of nodes having overlapping ranges, nodes do not
overlap and there are R replicas of each node.
The downside is that when we add a new node,
we need to create R replicas for it.
> The master might try to keep the cluster less than fully occupied so
> that it can more quickly respond to hotspots and failures. For example,
> if each host can serve five nodes, it might normally use only four, so
> that each always has a spare slot. Then the above steps would happen as
> 4, 5, 1, 2, 3, first addressing the hot spot, then the cool spot.
Agree.
Cheers,
Ning