From: Yonik S. <yo...@ap...> - 2008-02-26 18:21:49
I think one problem we are running into is due to using the same hash
ring for both partitioning the key space and selecting placement of
those keys. Putting many virtual nodes on the hash ring to make the
hash function fair, in conjunction with using that hash ring to do
replication, causes all the indices to be mixed together.
We may need to separate partitioning of the key space into shards, and
the decision of where to put those shards.
One example using consistent hashing:
When a new system ("newhost") is added to the cluster, one
consistent hash ring could be used to define a new shard (many virtual
nodes to make all shards about the same size... newhost_shard.1
newhost_shard.2 newhost_shard.3...), and a different consistent hash
ring used to select what other shards to replicate in addition to its
own (add "newhost_shard" to the hash ring, and if N=3 then newhost
will also mirror the previous two shards on the ring.)
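As a concrete illustration (just a sketch I'm making up here, not anything that exists in Solr; the HashRing class and its method names are hypothetical), the same ring structure could play either role: add many virtual entries per shard for even partitioning, or add one entry per shard and walk clockwise to pick replica placement:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class HashRing {
  private final TreeMap<Integer, String> ring = new TreeMap<>();

  // Hash each virtual node separately, but map it back to the base name.
  // Many virtual nodes => roughly equal slices of the key space.
  void add(String name, int virtualNodes) {
    for (int i = 0; i < virtualNodes; i++) {
      ring.put((name + "." + i).hashCode(), name);
    }
  }

  // The next 'count' distinct entries clockwise from 'hash'.  On a placement
  // ring with one entry per shard, successors(shardHash, N) names the shard
  // positions whose owners hold a copy of that shard; equivalently, a newly
  // added entry ends up mirroring the previous N-1 shards on the ring.
  List<String> successors(int hash, int count) {
    List<String> result = new ArrayList<>();
    for (String s : ring.tailMap(hash).values()) {
      if (result.size() == count) break;
      if (!result.contains(s)) result.add(s);
    }
    for (String s : ring.values()) {  // wrap around the start of the ring
      if (result.size() == count) break;
      if (!result.contains(s)) result.add(s);
    }
    return result;
  }
}

On the partitioning ring a key's shard would be successors(key.hashCode(), 1).get(0); on the placement ring successors(shardHash, 3) picks the three hosts for N=3.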
I'm no longer sure of the benefit of using consistent hashing though
(at least for both parts).
A simpler way to split up the key space would be to just divide it
into Q equal-sized pieces, keeping Q sufficiently larger than S
(the number of systems). If one didn't want to pick a maximum Q
beforehand, then Q could always be doubled to keep it sufficiently
large.
A shard could always be uniquely identified by a single integer... the
top 5 bits being how many times the key space was divided by 2, and
the bottom 27 bits being the slice number.
So, to calculate what shard a key belongs in:
sliceNumber = key.hashCode >>> (32 - nSplits)
shardId = (nSplits << 27) | sliceNumber
No global state necessary to define what a shard is, doubling the
number of shards is easy, anyone can easily determine if a key is in a
shard or not, etc.
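Here's a rough sketch of what those operations might look like (again, purely illustrative; the ShardId helper and its method names are made up, and it assumes nSplits <= 27 so the slice number fits in the bottom 27 bits):

class ShardId {

  static int shardId(int nSplits, int sliceNumber) {
    return (nSplits << 27) | sliceNumber;
  }

  // Which shard does a key fall into at a given split level?  The top
  // nSplits bits of the hash are the slice number.
  static int shardForKey(Object key, int nSplits) {
    int sliceNumber = (nSplits == 0) ? 0 : key.hashCode() >>> (32 - nSplits);
    return shardId(nSplits, sliceNumber);
  }

  // Membership is a purely local computation: recompute the key's shard at
  // the shard's own split level and compare.  No global map required.
  static boolean contains(int id, Object key) {
    int nSplits = id >>> 27;
    return shardForKey(key, nSplits) == id;
  }

  // Doubling the number of shards is a local split: shard (nSplits, s)
  // covers exactly the keys of (nSplits+1, 2*s) and (nSplits+1, 2*s+1).
  static int[] children(int id) {
    int nSplits = id >>> 27;
    int slice = id & ((1 << 27) - 1);
    return new int[] { shardId(nSplits + 1, 2 * slice),
                       shardId(nSplits + 1, 2 * slice + 1) };
  }
}

So a node holding a shard could answer whether a doc belongs to it without consulting anyone, and splitting would just replace a shard id with its two children.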
That still leaves the problem of allocating shards to nodes unsolved
for the time being... but I'm just brainstorming out loud at this
point.
-Yonik