From: Doug C. <cu...@ap...> - 2008-04-02 17:55:48
Ning Li wrote:
> Assume the replication level is 3. Let's look at adding
> a node first. Assume a new node X' is added at the
> position X' between X and X+1 and assigned to host A.
> The neighboring nodes will be affected and switch from:
>   node X-2 serves range [X-2, X+1)
>   node X-1 serves range [X-1, X+2)
>   node X   serves range [X, X+3)
> to:
>   node X-2 serves range [X-2, X')
>   node X-1 serves range [X-1, X+1)
>   node X   serves range [X, X+2)
>   node X'  serves range [X', X+3)
>
> The process works as follows:
> 1   Master tells host A to start serving node X'
> 2   Host A copies [X', X+3) from node X (or other
>     node(s) which cover(s) [X', X+3))
> 2.1 Host A records the starting entry numbers of
>     the overlapping nodes of node X
> 2.2 Host A retrieves documents in range [X', X+3)
>     from node X (a new functionality)
> 2.3 Host A builds node X' from retrieved documents
> 2.4 Host A processes log entries from the
>     overlapping nodes starting from the entry
>     numbers recorded in 2.1
> 3   Host A tells master it's now serving node X'
> 4   Master changes the map and tells X, X-1, X-2
>     their new smaller ranges
> 5   Nodes X, X-1, X-2 gradually remove the
>     documents no longer in their ranges
>
> Step 2.2 achieves G2 and step 2.1 achieves G3.
> Achieving G1 is a bit complicated but possible.
>
> The process to remove a node is similar, and again
> the algorithm to achieve G1 is complicated.
>
> What do you think?

This looks great! We can implement node copying with this too, right, just by setting X' = X? And 2.2 could be done as a database checkpoint copy & filter. In the case where X' = X, this would result in the same algorithm as a node copy, since the filter would be a no-op. Or am I missing some fundamental difference?

Initially we can avoid implementing checkpoint copying, since I don't think it fits easily into an RPC protocol. Perhaps it will work by specifying a set of URLs to fetch, or something like that?
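The range shifts above can be sketched with a toy ring. This is only an illustrative sketch, not the project's actual code: the node positions, the `served_range` helper, and the choice of X' = 35 are all invented for the example; the only facts taken from the proposal are that replication level R = 3 and that each node serves the half-open span from its position to the position R steps clockwise.

```python
from bisect import insort

R = 3  # replication level: each node serves R consecutive ranges


def served_range(ring, pos):
    """Half-open range served by the node at `pos`: from pos to the
    position R steps clockwise, wrapping around the ring."""
    i = ring.index(pos)
    return (pos, ring[(i + R) % len(ring)])


# Toy positions: X-2 = 10, X-1 = 20, X = 30, X+1 = 40, ...
ring = [10, 20, 30, 40, 50, 60]
before = {p: served_range(ring, p) for p in ring}

insort(ring, 35)  # add node X' = 35 between X and X+1
after = {p: served_range(ring, p) for p in ring}

# Only X' and its R-1 predecessors change:
#   node 10 (X-2): [10, 40) -> [10, 35)   i.e. [X-2, X+1) -> [X-2, X')
#   node 20 (X-1): [20, 50) -> [20, 40)   i.e. [X-1, X+2) -> [X-1, X+1)
#   node 30 (X):   [30, 60) -> [30, 50)   i.e. [X, X+3)   -> [X, X+2)
#   node 35 (X'):             [35, 60)   i.e. [X', X+3)
```

Setting X' = X in this picture makes the new node's range identical to X's old range, which is why the same procedure doubles as a node copy with a no-op filter.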
> The algorithm to achieve G1 can be simpler if we
> change to a simpler replication scheme - instead
> of nodes having overlapping ranges, nodes do not
> overlap and there are R replicas for each node.
> The down side is that when we add a new node,
> we need to create R replicas for the new node.

Another downside is that load balancing is trickier. With disjoint ranges, how would we split a hot range? BTW, I don't think we'll actually assign nodes randomly, but rather try to add them evenly around the ring initially, then use load balancing to split hot spots.

Also, with disjoint ranges, each search would require searching more indexes, so the search performance impact would be significant.

So the reasons for overlap are: (1) ranges can't be split without it, and (2) it makes searches faster. Do you agree?

Doug
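The search fan-out argument can be made concrete with the same kind of toy ring. Again a sketch under invented assumptions (six evenly placed nodes, R = 3), not real code: with overlapping ranges every R-th node already covers the whole key space, so a full search needs roughly N/R indexes, while with disjoint ranges plus R replicas a full search must hit one replica of every one of the N ranges.

```python
R = 3
ring = [10, 20, 30, 40, 50, 60]  # toy node positions on the ring

# Overlapping scheme: the node at index i covers everything up to
# ring[(i + R) % len(ring)], so querying every R-th node covers the
# full key space -- ceil(N / R) indexes per search.
overlap_query = ring[::R]        # nodes 10 and 40 together cover the ring

# Disjoint scheme: len(ring) non-overlapping ranges, and a full
# search must consult one replica of each -- N indexes per search.
disjoint_query = list(ring)
```

With N = 6 and R = 3 the overlapping scheme touches 2 indexes per full-ring search versus 6 for the disjoint scheme, which is the performance difference Doug is pointing at.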