From: Doug C. <cu...@ap...> - 2008-04-08 21:39:08
Ning Li wrote:
> Actually, the mechanism I want copy and ranged copy to use is
> not checkpoint copying. This is because ranged copy may only
> want a small sub-range of the checkpoint (in removing a node).
> So, what if a RangedDatabase supports getDocs(Range range),
> where full documents in the range are retrieved? To be conservative,
> the starting entry numbers are set to the ones at the beginning
> of the retrieval. Of course we need to re-parse the documents,
> but we do that in log propagation as well. So, which approach
> do you think is better?

I agree. We shouldn't copy checkpoints, at least not initially, but
rather copy documents & reindex them.

>> Another downside is that load balancing is trickier. With disjoint
>> ranges, how would we split a hot range? BTW, I don't think we'll
>> actually randomly assign nodes, but rather try to initially add them
>> evenly around the ring, then use load balancing to split hot spots.
>
> Halve a hot range? Maybe some more sophisticated algorithm.
> We can still have nodes evenly around the ring at the start.
> The partitioning algorithm doesn't change. Only the replication
> algorithm is different.

But we'll need to dynamically balance things, since load may be uneven
around the ring, especially with user-provided ring positions. And once
we permit splitting, don't we have to handle overlapping ranges?

>> So the reasons for overlap are: (1) we can't split ranges without
>> permitting it, and (2) it makes search faster. Do you agree?
>
> What do you mean by "without permitting it"?

Just that, if we split a node in a disjoint partitioning, then we'll have
overlapping ranges during the split. Switching all clients to search the
new half-sized nodes won't be atomic. So I'm not sure that disjoint
ranges really simplify things fundamentally.

> As to 2), I'm not sure overlap makes search faster.
> Assume the scope of the positions is [0, 400) and the replication
> level is 2.
> Let's say we have 4 nodes with overlap:
> - node 0 serves [0, 200)
> - node 1 serves [100, 300)
> - node 2 serves [200, 400)
> - node 3 serves [300, 100)
> Without overlap, it's only fair that nodes serve ranges of the same
> size as with overlap, so we only have 2 nodes:
> - node 0 serves [0, 200)
> - node 1 serves [200, 400)
> Each node is replicated twice. So it's 4 nodes in total.
>
> I think this is a fair comparison so the search performance should
> be comparable. What do you think?

Good point. One other difference is that, with overlapping ranges, when a
node fails, its load is spread more evenly over the remaining nodes. In
this example, the load of any single node would become shared by two
other nodes rather than one, so each node only needs 50% headroom rather
than 100%.

Doug
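[Editor's note: the headroom argument above can be checked with a small simulation. The sketch below is an illustration written for this archive, not code from the thread; the node layouts are taken directly from the example ([0, 400) ring, replication level 2), and each position's query load is assumed to be split evenly among the live nodes covering it.]

```python
# Compare load redistribution after one node failure under the two
# layouts discussed above: overlapping half-shifted ranges vs. disjoint
# ranges with two identical replicas each.

RING = 400  # positions are the integers [0, 400)

def covers(node_range, pos):
    """True if pos falls in node_range, which may wrap around the ring."""
    start, end = node_range
    if start <= end:
        return start <= pos < end
    return pos >= start or pos < end  # wrapped range like [300, 100)

def load_per_node(nodes):
    """nodes: dict name -> (start, end). Returns dict name -> load,
    splitting each position's unit load evenly among its holders."""
    load = {name: 0.0 for name in nodes}
    for pos in range(RING):
        holders = [n for n, r in nodes.items() if covers(r, pos)]
        for n in holders:
            load[n] += 1.0 / len(holders)
    return load

# Overlapping layout from the example: every position has 2 replicas.
overlap = {"n0": (0, 200), "n1": (100, 300),
           "n2": (200, 400), "n3": (300, 100)}

# Disjoint layout: 2 ranges, each held by 2 identical replicas.
disjoint = {"a0": (0, 200), "a1": (0, 200),
            "b0": (200, 400), "b1": (200, 400)}

for name, nodes in [("overlap", overlap), ("disjoint", disjoint)]:
    healthy = load_per_node(nodes)
    survivors = dict(nodes)
    survivors.pop(next(iter(nodes)))  # kill one node
    degraded = load_per_node(survivors)
    worst = max(degraded[n] / healthy[n] for n in degraded)
    print(f"{name}: worst surviving node at {worst:.0%} of healthy load")
# -> overlap:  worst surviving node at 150% of healthy load
# -> disjoint: worst surviving node at 200% of healthy load
```

With overlap, the failed node's range is picked up half by each neighbor, so the worst survivor runs at 150% of its healthy load (50% headroom); with disjoint ranges, the surviving replica absorbs the whole range and runs at 200% (100% headroom), matching the argument above.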