
From: Ning L. <nin...@gm...> - 2008-03-31 20:39:50

First, a summary of the terms used:

1 A node consists of a key/hash range, a database, and a log.
2 A node periodically processes the logs of the nodes with overlapping
  ranges. A node remembers the entry number Ei it has processed of an
  overlapping node Ni and starts from Ei+1 next time.
3 A node checkpoint consists of the node's range, a database checkpoint,
  a log entry number E, and the log entry numbers Ei for the overlapping
  nodes. A checkpoint provides a consistent view of the node.

There are two types of load balancing: moving a node from one host to
another, and adding/removing a node. Let's see if we can agree on the
simpler one first.

Protocol for moving a node (N) from one host (A) to another (B):
Note: No other load balancing can be started for the nodes overlapping
with node N.

1 Copy the latest checkpoint C of node N to host B.
2 Node N on host B has an empty log. It starts to process the logs of
  the nodes with overlapping ranges, including node N on host A.
3 The master commits the change to the ring - officially node N is on
  host B, not on host A.
4 Node N on host A stops processing the logs of the nodes with
  overlapping ranges.
5 After all the overlapping nodes finish processing node N's log on
  host A, node N is removed from host A.

Step 3 is the commit point. If host A or host B goes down before that,
the move fails. Otherwise, the move succeeds. Should the move also be
considered failed if any overlapping node goes down before the commit
point, to reduce the complexity? Sounds right?

Ning
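
A minimal sketch of the master-side bookkeeping this protocol implies,
written as a simple state machine. This is illustrative only: none of
the class, state, or event names below come from the actual code, and
the events are assumed to arrive via heartbeats.

    // Hypothetical master-side record for one in-flight move of node N
    // from host A to host B; the states mirror steps 1-5 above.
    enum MoveState { COPYING, CATCHING_UP, COMMITTED, DRAINING, DONE, FAILED }

    class NodeMove {
        final String node, fromHost, toHost;
        MoveState state = MoveState.COPYING;

        NodeMove(String node, String fromHost, String toHost) {
            this.node = node;
            this.fromHost = fromHost;
            this.toHost = toHost;
        }

        void onCheckpointCopied() {              // step 1 done
            if (state == MoveState.COPYING) state = MoveState.CATCHING_UP;
        }
        void onCaughtUp() {                      // step 2 done; step 3 commits
            if (state == MoveState.CATCHING_UP) state = MoveState.COMMITTED;
        }
        void onSourceStoppedLogProcessing() {    // step 4 done
            if (state == MoveState.COMMITTED) state = MoveState.DRAINING;
        }
        void onLogDrained() {                    // step 5 done
            if (state == MoveState.DRAINING) state = MoveState.DONE;
        }

        // A failure of A, B, or any overlapping node before the commit
        // point aborts the move; after it, the move has already succeeded.
        void onRelevantHostDown() {
            if (state == MoveState.COPYING || state == MoveState.CATCHING_UP)
                state = MoveState.FAILED;
        }
    }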

From: Doug C. <cu...@ap...> - 2008-03-31 21:41:34

Ning Li wrote:
> There are two types of load balancing: moving a node from one
> host to another, and adding/removing a node. Let's see if we
> can agree on the simpler one first.

I'd thought mostly in terms of adding/removing. If a hot spot is
identified on the ring, then it should be split by adding a new node in
its range. The node can be allocated by finding a cool spot and
removing a node there.

But moving a node from a hot host to a cooler host might be faster,
since the index could be copied rather than reconstructed from logs.
However this might not be enough to handle uneven loading of the ring.
For that we really need to be able to change the partitioning by placing
new nodes in hot spots.

> Protocol for moving a node (N) from one host (A) to another (B):
> Note: No other load balancing can be started for the nodes
> overlapping with node N.
> 1 Copy the latest checkpoint C of node N to host B.
> 2 Node N on host B has an empty log. It starts to process the logs
>   of the nodes with overlapping ranges, including node N on host A.
> 3 The master commits the change to the ring - officially node N
>   is on host B, not on host A.
> 4 Node N on host A stops processing the logs of the nodes with
>   overlapping ranges.
> 5 After all the overlapping nodes finish processing node N's
>   log on host A, node N is removed from host A.
>
> Step 3 is the commit point. If host A or host B goes down
> before that, the move fails. Otherwise, the move succeeds.
> Should the move also be considered failed if any overlapping
> node goes down before the commit point, to reduce the
> complexity?

That sounds right. In terms of hostToMaster protocol, there are four
messages:
- Master tells B to start copying N (in response to heartbeat)
- B tells Master that it now serves N
- Master tells A to remove N (in heartbeat response)
- A tells Master it no longer serves N

The master will have to keep track of what moves are in flight, so that
it doesn't schedule too many in one region and risk losing data, right?

I think I'd rather see us implement add/remove balancing rather than
copying first, since, as mentioned above, I think it is more general.
It also means that there's much less chance that the same range/node
will exist on two hosts at once.

The communication with the master would be much the same. Assume that N
is an underloaded node, and that M is an overloaded point on the ring.
1. Master tells N's neighbors to expand to cover N
2. N's neighbors tell Master that N is now redundant
3. Master tells A to stop serving N
4. Master tells A to start serving M
5. A tells master it's now serving M

The master must keep track of what nodes it is trying to decommission,
and avoid trying to decommission more than one in any overlapping region
at a time. It also has to keep track of new-node requests that it has
made, so that it doesn't issue them to multiple nodes. Heartbeats
should refresh this state, i.e., a host should list its pending ranges
as well as its complete ranges. I think that's sufficient state for the
master. Does that sound right?

The master might try to keep the cluster less than fully occupied so
that it can more quickly respond to hotspots and failures. For example,
if each host can serve five nodes, it might normally use only four, so
that each always has a spare slot. Then the above steps would happen as
4, 5, 1, 2, 3, first addressing the hot spot, then the cool spot.

Doug
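
The heartbeat exchange implied here might look something like the
following sketch. All of these names are assumptions for illustration,
not the actual protocol types.

    import java.util.List;

    // Hypothetical heartbeat payload: a host reports both the ranges it
    // fully serves and the ranges it is still building, so the master
    // can refresh its in-flight state from heartbeats alone.
    class Heartbeat {
        String hostId;
        List<Range> completeRanges;  // fully built and being served
        List<Range> pendingRanges;   // still being copied or replayed
    }

    // Hypothetical commands the master can return in a heartbeat
    // response, covering the four move messages plus add/remove balancing.
    enum Command {
        START_COPYING_NODE, REMOVE_NODE, EXPAND_TO_COVER,
        STOP_SERVING, START_SERVING
    }

    class HeartbeatResponse {
        Command command;
        Range range;                 // the node/range the command applies to
    }

    class Range { long start, end; } // [start, end) on the ring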

From: Ning L. <nin...@gm...> - 2008-04-02 00:10:01

On Mon, Mar 31, 2008 at 5:35 PM, Doug Cutting <cu...@ap...> wrote:
> I'd thought mostly in terms of adding/removing. If a hot spot is
> identified on the ring, then it should be split by adding a new node in
> its range. The node can be allocated by finding a cool spot and
> removing a node there.
>
> But moving a node from a hot host to a cooler host might be faster,
> since the index could be copied rather than reconstructed from logs.
> However this might not be enough to handle uneven loading of the ring.
> For that we really need to be able to change the partitioning by placing
> new nodes in hot spots.
I thought both types would be useful and started with the simpler one.
But support for adding/removing nodes is definitely necessary.
> That sounds right. In terms of hostToMaster protocol, there are four
> messages:
> - Master tells B to start copying N (in response to heartbeat)
> - B tells Master that it now serves N
> - Master tells A to remove N (in heartbeat response)
> - A tells Master it no longer serves N
>
> The master will have to keep track of what moves are in flight, so that
> it doesn't schedule too many in one region and risk losing data, right?
Yes. For simplicity and to reduce the risk of losing data, no other
move or add/remove should be started for the nodes overlapping
with node N.
In designing the addition/removal or the move of a node, I think
we must achieve:
G1 There is no data loss if no host goes down in the process,
even if the replication level is 1.
And we should achieve:
G2 When a new node is added or a node adds a new range,
the node can build the database on the range from the
node(s) serving that range. That is, the node does not
have to construct the database on the range from the
log(s).
G3 When a (new) node starts to serve a new range, it
knows where to start processing the log entries from
the overlapping nodes of the range.
G2 and G3 aim to avoid processing a log from the
beginning, which becomes less feasible as a log grows.
Achieving G2 and G3 also makes it possible to remove
old log entries.
In the design on how to move a node:
- We achieve G2 by B copying a checkpoint of N from A.
- We achieve G3 by B also copying the starting entry
numbers of the overlapping nodes as part of the
checkpoint. That tells B where to start processing the
log entries from the overlapping nodes.
- We achieve G1 by, after A stops serving N and stops
processing logs from the nodes overlapping N,
N and its overlapping nodes continuing to process
N's log on A until all entries are processed. Only then
can A remove N and its log.
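
To make this concrete, the checkpoint record that B copies might look
like the sketch below, following the terms summarized at the start of
the thread. The field names are illustrative assumptions.

    import java.util.Map;

    // Hypothetical checkpoint contents for node N: its range, a database
    // checkpoint reference, N's own log entry number E, and the
    // per-overlapping-node entry numbers Ei that tell the copy on host B
    // where to resume log processing (G3).
    class NodeCheckpoint {
        Range range;                            // N's key/hash range
        String databaseCheckpoint;              // DB checkpoint reference (G2)
        long logEntryNumber;                    // E
        Map<String, Long> overlapEntryNumbers;  // node id Ni -> entry number Ei
    }

    class Range { long start, end; }            // [start, end) on the ring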
> I think I'd rather see us implement add/remove balancing rather than
> copying first, since, as mentioned above, I think it is more general.
> It also means that there's much less chance that the same range/node
> will exist on two hosts at once.
>
> The communication with the master would be much the same. Assume that N
> is an underloaded node, and that M is an overloaded point on the ring.
> 1. Master tells N's neighbors to expand to cover N
> 2. N's neighbors tell Master that N is now redundant
> 3. Master tells A to stop serving N
> 4. Master tells A to start serving M
> 5. A tells master it's now serving M
>
> The master must keep track of what nodes it is trying to decommission,
> and avoid trying to decommission more than one in any overlapping region
> at a time. It also has to keep track of new nodes requests that it has
> made, so that it doesn't issue them to multiple nodes. Heartbeats
> should refresh this state, i.e., a host should list its pending ranges
> as well as its complete ranges. I think that's sufficient state for the
> master. Does that sound right?
Would it be clearer if we describe adding a node and
removing a node separately? I think it's important to
figure out how G1, G2 & G3 above are achieved in the
design.
Assume the replication level is 3. Let's look at adding
a node first. Assume a new node X' is added at the
position X' between X and X+1 and assigned to host A.
The neighboring nodes will be affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
The process works as follows:
1 Master tells host A to start serving node X'
2 Host A copies [X', X+3) from node X (or other
node(s) which cover(s) [X', X+3))
2.1 Host A records the starting entry numbers of
the overlapping nodes of node X
2.2 Host A retrieves documents in range [X', X+3)
from node X (a new functionality)
2.3 Host A builds node X' from retrieved documents
2.4 Host A processes log entries from the
overlapping nodes starting from the entry
numbers recorded in 2.1
3 Host A tells master it's now serving node X'
4 Master changes the map and tells X, X-1, X-2
their new smaller ranges
5 Nodes X, X-1, X-2 gradually remove the
documents no longer in their ranges
Step 2.2 achieves G2 and step 2.1 achieves G3.
Achieving G1 is a bit complicated but possible.
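
As a sanity check on the range changes above, here is a small
self-contained sketch. The concrete positions and ring layout are made
up for illustration: with replication level r, the node at positions[i]
serves from positions[i] up to (but not including) the position r slots
ahead, so inserting X' shrinks the ranges of exactly the r nodes that
precede it.

    import java.util.*;

    class RingRanges {
        // range of each node: from its position to the position r slots ahead
        static Map<Long, long[]> ranges(List<Long> pos, int r) {
            Map<Long, long[]> m = new LinkedHashMap<>();
            int n = pos.size();
            for (int i = 0; i < n; i++)
                m.put(pos.get(i), new long[] { pos.get(i), pos.get((i + r) % n) });
            return m;
        }

        public static void main(String[] args) {
            List<Long> before = new ArrayList<>(List.of(0L, 100L, 200L, 300L));
            print("before", ranges(before, 3));
            List<Long> after = new ArrayList<>(before);
            after.add(150L);   // X' = 150 between X = 100 and X+1 = 200
            Collections.sort(after);
            print("after ", ranges(after, 3));
        }

        static void print(String label, Map<Long, long[]> m) {
            StringBuilder b = new StringBuilder(label + ":");
            m.forEach((k, v) -> b.append(" ").append(k)
                .append("->[").append(v[0]).append(",").append(v[1]).append(")"));
            System.out.println(b);
        }
    }

Running it shows nodes 300, 0 and 100 (playing the roles of X-2, X-1
and X) shrinking, while the new node 150 takes over [150, 0).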
The process to remove a node is similar and again
the algorithm to achieve G1 is complicated.
What do you think?
The algorithm to achieve G1 can be simpler if we
change to a simpler replication scheme - instead
of nodes having overlapping ranges, nodes do not
overlap and there are R replicas for each node.
The downside is that when we add a new node,
we need to create R replicas for the new node.
> The master might try to keep the cluster less than fully occupied so
> that it can more quickly respond to hotspots and failures. For example,
> if each host can serve five nodes, it might normally use only four, so
> that each always has a spare slot. Then the above steps would happen as
> 4, 5, 1, 2, 3, first addressing the hot spot, then the cool spot.
Agree.
Cheers,
Ning

From: Doug C. <cu...@ap...> - 2008-04-02 17:55:48

Ning Li wrote:
> Assume the replication level is 3. Let's look at adding
> a node first. Assume a new node X' is added at the
> position X' between X and X+1 and assigned to host A.
> The neighboring nodes will be affected and switch from:
>   node X-2 serves range [X-2, X+1)
>   node X-1 serves range [X-1, X+2)
>   node X serves range [X, X+3)
> to:
>   node X-2 serves range [X-2, X')
>   node X-1 serves range [X-1, X+1)
>   node X serves range [X, X+2)
>   node X' serves range [X', X+3)
>
> The process works as follows:
> 1 Master tells host A to start serving node X'
> 2 Host A copies [X', X+3) from node X (or other
>   node(s) which cover(s) [X', X+3))
>   2.1 Host A records the starting entry numbers of
>       the overlapping nodes of node X
>   2.2 Host A retrieves documents in range [X', X+3)
>       from node X (a new functionality)
>   2.3 Host A builds node X' from retrieved documents
>   2.4 Host A processes log entries from the
>       overlapping nodes starting from the entry
>       numbers recorded in 2.1
> 3 Host A tells master it's now serving node X'
> 4 Master changes the map and tells X, X-1, X-2
>   their new smaller ranges
> 5 Nodes X, X-1, X-2 gradually remove the
>   documents no longer in their ranges
>
> Step 2.2 achieves G2 and step 2.1 achieves G3.
> Achieving G1 is a bit complicated but possible.
>
> The process to remove a node is similar and again
> the algorithm to achieve G1 is complicated.
>
> What do you think?

This looks great! We can implement node copying with this too, right,
just by setting X'=X? And 2.2 could be done as a database checkpoint
copy & filter. In the case where X'=X, this would result in the same
algorithm as a node copy, since the filter would be a no-op. Or am I
missing some fundamental difference?

Initially we can avoid implementing checkpoint copying, since I don't
think it easily fits into an RPC protocol. Perhaps it will work by
specifying a set of URLs to fetch or something?

> The algorithm to achieve G1 can be simpler if we
> change to a simpler replication scheme - instead
> of nodes having overlapping ranges, nodes do not
> overlap and there are R replicas for each node.
> The downside is that when we add a new node,
> we need to create R replicas for the new node.

Another downside is that load balancing is trickier. With disjoint
ranges, how would we split a hot range? BTW, I don't think we'll
actually randomly assign nodes, but rather try to initially add them
evenly around the ring, then use load balancing to split hot spots.

Also, with disjoint ranges, each search would require searching more
indexes. The search performance impact would thus be significant.

So the reasons for overlap are: (1) we can't split ranges without
permitting it, and (2) it makes search faster. Do you agree?

Doug

From: Ning L. <nin...@gm...> - 2008-04-02 23:35:53

On Wed, Apr 2, 2008 at 1:55 PM, Doug Cutting <cu...@ap...> wrote:
> This looks great! We can implement node copying with this too, right,
> just by setting X'=X? And 2.2 could be done as a database checkpoint
> copy & filter. In the case where X'=X, this would result in the same
> algorithm as a node copy, since the filter would be a no-op. Or am I
> missing some fundamental difference?

Yes, copy and ranged copy could/should use the same mechanism.

> Initially we can avoid implementing checkpoint copying, since I don't
> think it easily fits into an RPC protocol. Perhaps it will work by
> specifying a set of URLs to fetch or something?

Actually, the mechanism I want copy and ranged copy to use is not
checkpoint copying. This is because a ranged copy may only want a small
sub-range of the checkpoint (in removing a node). So, what if a
RangedDatabase supports getDocs(Range range), where full documents in
the range are retrieved? To be conservative, the starting entry numbers
are set to the ones at the beginning of the retrieval. Of course we
need to re-parse the documents. But we do that in log propagation as
well. So, which approach do you think is better?

> Another downside is that load balancing is trickier. With disjoint
> ranges, how would we split a hot range? BTW, I don't think we'll
> actually randomly assign nodes, but rather try to initially add them
> evenly around the ring, then use load balancing to split hot spots.

Halve a hot range? Or some more sophisticated algorithm. We can still
have nodes evenly around the ring at the start. The partitioning
algorithm doesn't change. Only the replication algorithm is different.

> Also, with disjoint ranges, each search would require searching more
> indexes. The search performance impact would thus be significant.
>
> So the reasons for overlap are: (1) we can't split ranges without
> permitting it, and (2) it makes search faster. Do you agree?

What do you mean by "without permitting it"? As to (2), I'm not sure
overlap makes search faster. Assume the scope of the positions is
[0, 400) and the replication level is 2. Let's say we have 4 nodes with
overlap:
- node 0 serves [0, 200)
- node 1 serves [100, 300)
- node 2 serves [200, 400)
- node 3 serves [300, 100)

Without overlap, it's only fair that nodes serve ranges of the same
size as with overlap, so we only have 2 nodes:
- node 0 serves [0, 200)
- node 1 serves [200, 400)
Each node is replicated twice, so it's 4 nodes in total.

I think this is a fair comparison, so the search performance should
be comparable. What do you think?

Ning
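
A sketch of what that API might look like. RangedDatabase and
getDocs(Range) are the names proposed above; Document, Range, and the
entry-number accessor are illustrative assumptions.

    import java.util.Iterator;
    import java.util.Map;

    interface RangedDatabase {
        // Full documents in the range; the caller re-parses and
        // re-indexes them, just as log propagation does.
        Iterator<Document> getDocs(Range range);

        // Log entry numbers of the overlapping nodes, captured at the
        // start of the retrieval (the conservative choice above), so the
        // new node knows where to resume log processing (G3).
        Map<String, Long> startingEntryNumbers(Range range);
    }

    class Range { long start, end; }              // [start, end) on the ring
    class Document { String key; byte[] body; }   // placeholder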

From: Doug C. <cu...@ap...> - 2008-04-08 21:39:08

Ning Li wrote:
> Actually, the mechanism I want copy and ranged copy to use is
> not checkpoint copying. This is because a ranged copy may only
> want a small sub-range of the checkpoint (in removing a node).
> So, what if a RangedDatabase supports getDocs(Range range),
> where full documents in the range are retrieved? To be conservative,
> the starting entry numbers are set to the ones at the beginning
> of the retrieval. Of course we need to re-parse the documents.
> But we do that in log propagation as well. So, which approach
> do you think is better?

I agree. We shouldn't copy checkpoints, at least not initially, but
rather copy documents & reindex them.

>> Another downside is that load balancing is trickier. With disjoint
>> ranges, how would we split a hot range? BTW, I don't think we'll
>> actually randomly assign nodes, but rather try to initially add them
>> evenly around the ring, then use load balancing to split hot spots.
>
> Halve a hot range? Or some more sophisticated algorithm.
> We can still have nodes evenly around the ring at the start.
> The partitioning algorithm doesn't change. Only the replication
> algorithm is different.

But we'll need to dynamically balance things, since load may be uneven
around the ring, especially with user-provided ring positions. And once
we permit splitting, don't we have to handle overlapping ranges?

>> So the reasons for overlap are: (1) we can't split ranges without
>> permitting it, and (2) it makes search faster. Do you agree?
>
> What do you mean by "without permitting it"?

Just that, if we split a node in a disjoint partitioning then we'll have
overlapping ranges during the split. Switching all clients to search
the new half-sized nodes won't be atomic. So I'm not sure that disjoint
ranges really simplify things fundamentally.

> As to (2), I'm not sure overlap makes search faster.
> Assume the scope of the positions is [0, 400) and the replication
> level is 2. Let's say we have 4 nodes with overlap:
> - node 0 serves [0, 200)
> - node 1 serves [100, 300)
> - node 2 serves [200, 400)
> - node 3 serves [300, 100)
> Without overlap, it's only fair that nodes serve ranges of the same
> size as with overlap, so we only have 2 nodes:
> - node 0 serves [0, 200)
> - node 1 serves [200, 400)
> Each node is replicated twice, so it's 4 nodes in total.
>
> I think this is a fair comparison, so the search performance should
> be comparable. What do you think?

Good point.

One other difference is that, with overlapping ranges, when a node
fails, its load is spread more evenly over the remaining nodes. In this
example, the load of any single node would become shared by two other
nodes rather than one, so each node only needs 50% headroom rather than
100%.

Doug

From: Ning L. <nin...@gm...> - 2008-04-10 17:11:17

On Tue, Apr 8, 2008 at 5:39 PM, Doug Cutting <cu...@ap...> wrote:
> > Halve a hot range? Or some more sophisticated algorithm.
> > We can still have nodes evenly around the ring at the start.
> > The partitioning algorithm doesn't change. Only the replication
> > algorithm is different.
>
> But we'll need to dynamically balance things, since load may be uneven
> around the ring, especially with user-provided ring positions. And once
> we permit splitting, don't we have to handle overlapping ranges?

With non-overlapping replication, if we decide to split a range, we
split all the replicas of the range. Same as with overlapping
replication, we use the master to coordinate the process.

> > What do you mean by "without permitting it"?
>
> Just that, if we split a node in a disjoint partitioning then we'll have
> overlapping ranges during the split. Switching all clients to search
> the new half-sized nodes won't be atomic. So I'm not sure that disjoint
> ranges really simplify things fundamentally.

Let's say the replication level is 2 and node A and node B are the two
replicas of range [0, 4). Now we decide to split [0, 4) into two:
[0, 2) and [2, 4). After the split, node A and node B will be the
replicas of [0, 2) and node C and node D will be created to be the
replicas of [2, 4).

The split process works as follows:
1 Master tells some hosts to create node C and node D of [2, 4)
  from node A or node B
2 C and D tell the master they are ready
3 Master changes the map to A and B serving [0, 2) and C and D
  serving [2, 4)
4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
5 A and B tell the master they are done

Care should be taken in synchronizing logs in the process.

Here, it doesn't matter if at one point in time node X serves R1,
node Y serves R2, and R1 and R2 overlap. What does matter is that
there is no overlap in the map which the master maintains. And because
clients get the map from the master, clients do not see any overlap in
the map, either. What do you think?

> > I think this is a fair comparison, so the search performance should
> > be comparable. What do you think?
>
> Good point.
>
> One other difference is that, with overlapping ranges, when a node
> fails, its load is spread more evenly over the remaining nodes. In this
> example, the load of any single node would become shared by two other
> nodes rather than one, so each node only needs 50% headroom rather than
> 100%.

Yes, that's the advantage of the overlapping replication scheme.

I'm coding the add-node process with the overlapping replication right
now. It's taking some time. After it's finished, we can review the code
and have a complete picture of the complexity and the performance
implications.

Ning
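
A minimal sketch of the "clients never see overlap" point: the master
mutates a private copy and publishes the whole map in one atomic swap at
step 3. Class and field names here are illustrative assumptions.

    import java.util.*;

    class MasterMap {
        // "[lo,hi)" -> replica hosts; hosts may transiently overlap, but
        // clients only ever read an atomically published snapshot.
        private volatile Map<String, List<String>> ring =
            Map.of("[0,4)", List.of("A", "B"));

        Map<String, List<String>> snapshot() { return ring; }

        // Step 3 of the split: remove the old range and add both halves
        // in a single publish, never exposing an overlapping map.
        synchronized void commitSplit() {
            Map<String, List<String>> next = new LinkedHashMap<>(ring);
            next.remove("[0,4)");
            next.put("[0,2)", List.of("A", "B"));
            next.put("[2,4)", List.of("C", "D"));
            ring = Collections.unmodifiableMap(next);
        }
    }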

From: Doug C. <cu...@ap...> - 2008-04-10 22:38:13

Ning Li wrote:
> Let's say the replication level is 2 and node A and node B are the two
> replicas of range [0, 4). Now we decide to split [0, 4) into two:
> [0, 2) and [2, 4). After the split, node A and node B will be the
> replicas of [0, 2) and node C and node D will be created to be the
> replicas of [2, 4).
>
> The split process works as follows:
> 1 Master tells some hosts to create node C and node D of [2, 4)
>   from node A or node B
> 2 C and D tell the master they are ready
> 3 Master changes the map to A and B serving [0, 2) and C and D
>   serving [2, 4)
> 4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
> 5 A and B tell the master they are done
>
> Care should be taken in synchronizing logs in the process.
>
> Here, it doesn't matter if at one point in time node X serves R1,
> node Y serves R2, and R1 and R2 overlap. What does matter is that
> there is no overlap in the map which the master maintains. And because
> clients get the map from the master, clients do not see any overlap in
> the map, either.

Yes, that's one way to do it. Perhaps it is even the best way. But if
C reports it is ready to start serving [2, 4) before D is, then this
might be revealed to clients, to more rapidly reduce the load on A and
B. Might this also permit a less stateful master? My hunch is that the
more queues of pending splits and such that the master has to maintain,
the greater the chances of confusion in the face of failure.

Perhaps the master could simply issue step (1), asking C and D to start
serving [2, 4), and remember only that it had issued this request so
that it doesn't issue it again too soon. Then, when C & D tell the
master they're ready, the master can notice that [2, 4) is
over-replicated, and issue requests to A and B to stop serving this. In
this model, the master would not wait for both C & D to report, nor
would it wait for A and B to report their narrowed ranges before
republishing the ring.

In this approach, if D fails while replicating [2, 4) then the master
could still trim A to [0, 2), but not trim B yet. Also, if
previously-offline E comes back online already serving [0, 2) or [2, 3),
or some other overlapping range from some prior merge/split state, it
can be more easily re-integrated into the ring, without discarding all
of its data.

So, even in the case of a mostly-disjoint partitioning, there may be
benefits to permitting overlap.

> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.

Great!

Doug

From: Ning L. <nin...@gm...> - 2008-04-16 23:12:15

> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.
Finally checked in the first version for adding a node, with log
propagation and cleanup... :)
On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> Yes, that's one way to do it. Perhaps it is even the best way. But if
> C reports it is ready to start serving [2 4) before D is, then this
> might be revealed to clients, to more rapidly reduce the load on A and
> B. Might this also permit a less stateful master? My hunch is that the
> more queues of pending splits and such that the master has to maintain,
> the greater the chances of confusion in the face of failure.
>
> Perhaps the master could simply issue step (1), asking C and D to start
> serving [2 4), and remember only that it had issued this request so that
> it doesn't issue it again too soon. Then, when C & D tell the master
> they're ready, the master can notice that [2, 4) is over-replicated, and
> issue requests to A and B to stop serving this. In this model, the
> master would not wait for both C & D to report, nor would it wait for A
> and B to report their narrowed ranges before republishing the ring.
>
> In this approach, if D fails while replicating [2, 4) then the master
> could still trim A to [0, 2), but not trim B yet. Also, if
> previously-offline E comes back online already serving [0, 2) or [2, 3),
> or some other overlapping range from some prior merge/split state, it
> can be more easily re-integrated into the ring, without discarding all
> of its data.
>
> So, even in the case of a mostly-disjoint partitioning, there may be
> benefits to permitting overlap.
You have a number of good points.
I was worried about the complexity of a state change "transaction"
(i.e. adding/removing a node and the changes to neighbors' ranges
and the changes to more neighbors' log propagation that come
with it). As you said, "the more queues of pending splits" (or
pending range changes) "and such that the master has to maintain,
the greater the chances of confusion in the face of failure."
Your email made me think that we can make the complexity
manageable by breaking a state change "transaction" into
smaller ones.
Here is what I think of as a state change "transaction":
A new node X' is added at the position X' between X and X+1.
The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
And the log propagation of even more nodes is affected.
We break this "transaction" into four smaller ones:
1 node X' serves range [X', X+3)
Now range [X', X+3) is replicated 4 times - over-replicated.
2 node X-2 from [X-2, X+1) to [X-2, X')
Now only range [X+1, X+3) is still over-replicated.
3 node X-1 from [X-1, X+2) to [X-1, X+1)
Now only range [X+2, X+3) is still over-replicated.
4 node X from [X, X+3) to [X, X+2)
Now no range is over-replicated.
So in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected. Failure
is now easier to deal with.
It's worth pointing out that the latter three "transactions"
are scheduled to balance the level of replication.
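
To make "over-replicated" concrete, here is a self-contained sketch of
the coverage check the master could run between these smaller
"transactions". The ring size and node positions are illustrative (the
[0, 400) ring from earlier in the thread); the same sweep flags the
under-replicated ranges in the removal case below.

    import java.util.*;

    class CoverageCheck {
        static final long RING = 400;   // assumed ring size

        // Count how many nodes cover each segment of the ring and report
        // segments whose replication differs from the target level.
        // Ranges are [start, end) pairs; start >= end means a wrap-around.
        static void report(List<long[]> ranges, int target) {
            TreeMap<Long, Integer> delta = new TreeMap<>();
            for (long[] r : ranges) {
                if (r[0] < r[1]) { add(delta, r[0], r[1]); }
                else { add(delta, r[0], RING); add(delta, 0, r[1]); }
            }
            delta.values().removeIf(v -> v == 0);   // drop no-op breakpoints
            long prev = 0;
            int count = 0;
            for (Map.Entry<Long, Integer> e : delta.entrySet()) {
                if (e.getKey() > prev && count != target)
                    System.out.println("[" + prev + ", " + e.getKey() + ") x" + count);
                count += e.getValue();
                prev = e.getKey();
            }
        }

        static void add(TreeMap<Long, Integer> d, long lo, long hi) {
            d.merge(lo, 1, Integer::sum);
            d.merge(hi, -1, Integer::sum);
        }

        public static void main(String[] args) {
            // Nodes at 0, 100, 200, 300 with replication level 3, plus the
            // new node X' = 150 serving [150, 0): prints "[150, 400) x4".
            report(List.of(new long[]{0, 300}, new long[]{100, 0},
                           new long[]{200, 100}, new long[]{300, 200},
                           new long[]{150, 0}), 3);
        }
    }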
Is this a good idea?
Ning
PS: Here is what happens when a node is removed:
Node X+1 is removed. The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
node X+1 serves range [X+1, X+4)
to:
node X-2 serves range [X-2, X+2)
node X-1 serves range [X-1, X+3)
node X serves range [X, X+4)
And the log propagation of even more nodes is affected.
Again, we break this "transaction" into four smaller ones:
1 node X+1 is removed
Depending on whether node X+1 failed, this step may require no action.
Now range [X+1, X+4) is replicated 2 times - under-replicated.
2 node X-2 from [X-2, X+1) to [X-2, X+2)
Now only range [X+2, X+4) is still under-replicated.
3 node X-1 from [X-1, X+2) to [X-1, X+3)
Now only range [X+3, X+4) is still under-replicated.
4 node X from [X, X+3) to [X, X+4)
Now no range is under-replicated.
Again, in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected.

From: Ning L. <nin...@gm...> - 2008-04-22 22:52:16

It's been quiet for a while. Any comments/thoughts?

Cheers,
Ning

On Wed, Apr 16, 2008 at 7:12 PM, Ning Li <nin...@gm...> wrote:
> [...]

From: Doug C. <cu...@ap...> - 2008-04-22 23:56:43

Ning Li wrote:
> Your email made me think that we can make the complexity
> manageable by breaking a state change "transaction" into
> smaller ones.
>
> Here is what I think of as a state change "transaction":
> A new node X' is added at the position X' between X and X+1.
> The affected nodes switch from:
>   node X-2 serves range [X-2, X+1)
>   node X-1 serves range [X-1, X+2)
>   node X serves range [X, X+3)
> to:
>   node X-2 serves range [X-2, X')
>   node X-1 serves range [X-1, X+1)
>   node X serves range [X, X+2)
>   node X' serves range [X', X+3)
> And the log propagation of even more nodes is affected.
>
> We break this "transaction" into four smaller ones:
> 1 node X' serves range [X', X+3)
>   Now range [X', X+3) is replicated 4 times - over-replicated.
> 2 node X-2 from [X-2, X+1) to [X-2, X')
>   Now only range [X+1, X+3) is still over-replicated.
> 3 node X-1 from [X-1, X+2) to [X-1, X+1)
>   Now only range [X+2, X+3) is still over-replicated.
> 4 node X from [X, X+3) to [X, X+2)
>   Now no range is over-replicated.
> So in each of these smaller "transactions", only one node
> and some neighbors' log propagation are affected. Failure
> is now easier to deal with.
>
> It's worth pointing out that the latter three "transactions"
> are scheduled to balance the level of replication.
>
> Is this a good idea?

This is nice. Ideally the flow wouldn't require a lot of state. So
adding X' is triggered first (by turning on a new node or by splitting a
hot spot or somesuch). Once X' is online, then (2), (3) and (4) are
triggered by the over-replication, with a lock to permit only one
transaction at a time. There's no explicit plan or worklist. Instead,
the state can be evaluated at each point, and the highest priority
transaction can be executed. Repeat.

So the algorithm might look like:

    condition:            action:
    -----------------------------------------------
    under-replication     start a replication
    over-replication      start a reduction
    new node available    add @ hottest spot
    hot spot              free a node

These are priority-ordered. Only one action may be in progress at a
time in a given region of the ring. Does that work?

Doug
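
A sketch of how the loop over that table might look; ClusterState-style
types, Region, and the action methods are all hypothetical. The point
is that each pass re-derives the conditions from current state instead
of keeping an explicit plan or worklist.

    import java.util.List;

    enum Condition { UNDER_REPLICATION, OVER_REPLICATION, NEW_NODE_AVAILABLE, HOT_SPOT }

    interface Region {
        boolean actionInProgress();     // at most one action per region
        Condition evaluate();           // highest-priority condition, or null
        void startReplication();
        void startReduction();
        void addNodeAtHottestSpot();
        void freeANode();
    }

    class Balancer {
        // One pass over the ring; the master repeats this as state
        // changes, e.g. after each round of heartbeats.
        void balanceOnce(List<Region> regions) {
            for (Region region : regions) {
                if (region.actionInProgress()) continue;
                Condition c = region.evaluate();
                if (c == null) continue;
                switch (c) {
                    case UNDER_REPLICATION:  region.startReplication();     break;
                    case OVER_REPLICATION:   region.startReduction();       break;
                    case NEW_NODE_AVAILABLE: region.addNodeAtHottestSpot(); break;
                    case HOT_SPOT:           region.freeANode();            break;
                }
            }
        }
    }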

From: Ning L. <nin...@gm...> - 2008-04-23 16:00:30

On Tue, Apr 22, 2008 at 7:56 PM, Doug Cutting <cu...@ap...> wrote:
> This is nice. Ideally the flow wouldn't require a lot of state. So
> adding X' is triggered first (by turning on a new node or by splitting a
> hot spot or somesuch). Once X' is online, then (2), (3) and (4) are
> triggered by the over-replication, with a lock to permit only one
> transaction at a time. There's no explicit plan or worklist. Instead,
> the state can be evaluated at each point, and the highest priority
> transaction can be executed. Repeat.
>
> So the algorithm might look like:
>
>     condition:            action:
>     -----------------------------------------------
>     under-replication     start a replication
>     over-replication      start a reduction
>     new node available    add @ hottest spot
>     hot spot              free a node
>
> These are priority-ordered. Only one action may be in progress at a
> time in a given region of the ring.
>
> Does that work?

Yep!

Before I change the code to do the above, what do you think of the
current implementation in terms of the master-host communication, the
host-host communication, and the logic of the cleanup threads which
make sure there is no data loss if no host goes down in the process?

BTW, you are still enthusiastic about Bailey, right? :)

Cheers,
Ning

From: Doug C. <cu...@ap...> - 2008-04-23 17:26:04

Ning Li wrote:
>> Does that work?
>
> Yep!

Great!

> Before I change the code to do the above, what do you think of the
> current implementation in terms of the master-host communication,
> the host-host communication, and the logic of the cleanup threads
> which make sure there is no data loss if no host goes down in the
> process?

I'll try to take a detailed look at it today.

> BTW, you are still enthusiastic about Bailey, right? :)

Very much so! I'm just having a hard time pulling away from Hadoop long
enough to give much time to Bailey. Plus, I've recently been asked to
start working on a third project! Sigh.

Doug

From: Ning L. <nin...@gm...> - 2008-04-28 16:28:36

> 2. Write a multi-threaded test program that uses this, simulating a
> large, active index. It randomly adds, removes and updates randomly
> created documents, and periodically queries and checks that results are
> correct.

Does the newly added TestMtHeapDb look alright?

On Wed, Apr 23, 2008 at 1:26 PM, Doug Cutting <cu...@ap...> wrote:
> > Before I change the code to do the above, what do you think of the
> > current implementation in terms of the master-host communication,
> > the host-host communication, and the logic of the cleanup threads
> > which make sure there is no data loss if no host goes down in the
> > process?
>
> I'll try to take a detailed look at it today.

Does it look good?

Cheers,
Ning