From: Doug C. <cu...@ap...> - 2008-05-09 22:08:43

Mike Klaas wrote:
> Have you guys seen this project? Sounds similar to some of the goals
> of Bailey.
>
> http://katta.wiki.sourceforge.net/

Looks related, but there's no code there yet... The diagram makes it
look like a read-only, search-time system: indexes are created offline
in Hadoop, then posted to Katta. Bailey's design is to support live
updates of documents.

Doug

From: Mike K. <mik...@gm...> - 2008-05-09 21:48:03

Have you guys seen this project? Sounds similar to some of the goals
of Bailey.

http://katta.wiki.sourceforge.net/

-Mike

From: Ning L. <nin...@gm...> - 2008-05-02 21:03:50

On Fri, May 2, 2008 at 4:37 PM, Yonik Seeley <yo...@ap...> wrote:
> On Fri, May 2, 2008 at 4:12 PM, Ning Li <nin...@gm...> wrote:
> > We need a W-way write to provide fault tolerance for the write.
> > Maybe we can return a flag indicating <W nodes did the write
> > and let an application decide whether it wants to redo the write?
>
> Returning the number of servers actually written sounds like the
> right approach. It does seem hard to try and cancel the write.

Because of the version number, a write is idempotent. So we don't need
to cancel the write, right?

> > > > - Last but not the least, a possible performance impact:
> > > > a node can receive the same write request from several
> > > > different nodes around the same time.
> > >
> > > It seems like this should be rare. If it is rare, we shouldn't do
> > > any extra work to handle it.
> >
> > I'm not sure if this is rare. I will run a test when we experiment
> > with a more realistic workload.
>
> Do you mean the exact same client write request, or different client
> write requests that happen to be for the same document? What would be
> the scenario that would trigger the first?

The exact same client write request. This can arise because of log
propagation. For example, a document is in the range served by nodes
A, B, C and D (W = 3, N = 4). Now let's say A and B finish writing and
logging the document while C is still processing it. C also syncs the
log entries from A and B, so it sees the same write via the log
propagation. Because C hasn't finished writing the document, it goes
ahead and retrieves the document from A and/or B. This is the wasted
work I was worried about. I'm not sure if this is rare; I think it
depends on how parallel the writes and log syncs are, and on how long
the document parsing/processing takes.

Cheers,
Ning

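A minimal sketch of the idempotence argument above, assuming a
per-document revision number; the class and field names are
illustrative, not Bailey's actual API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Re-applying the same write is harmless: the revision check makes the
// operation idempotent, so a redo after a partial failure needs no undo.
public class RevisionedStore {
    private final Map<String, Long> revisions = new ConcurrentHashMap<>();
    private final Map<String, String> docs = new ConcurrentHashMap<>();

    /** Applies the write only if the revision is newer; returns true if applied. */
    public synchronized boolean write(String id, long revision, String body) {
        long current = revisions.getOrDefault(id, -1L);
        if (revision <= current) {
            return false;  // stale or duplicate request: succeed silently
        }
        docs.put(id, body);
        revisions.put(id, revision);
        return true;
    }
}
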
From: Yonik S. <yo...@ap...> - 2008-05-02 20:37:06

On Fri, May 2, 2008 at 4:12 PM, Ning Li <nin...@gm...> wrote:
> We need a W-way write to provide fault tolerance for the write.
> Maybe we can return a flag indicating <W nodes did the write
> and let an application decide whether it wants to redo the write?

Returning the number of servers actually written sounds like the right
approach. It does seem hard to try and cancel the write.

> > > - Last but not the least, a possible performance impact:
> > > a node can receive the same write request from several
> > > different nodes around the same time.
> >
> > It seems like this should be rare. If it is rare, we shouldn't do any
> > extra work to handle it.
>
> I'm not sure if this is rare. I will run a test when we experiment
> with a more realistic workload.

Do you mean the exact same client write request, or different client
write requests that happen to be for the same document? What would be
the scenario that would trigger the first?

-Yonik

From: Ning L. <nin...@gm...> - 2008-05-02 20:12:49

On Thu, May 1, 2008 at 12:39 PM, Yonik Seeley <yo...@ap...> wrote:
> Google's distributed file system maximized link bandwidth by having a
> chain... one node would send to another node, which would send to
> another node, etc. The trick was that it was streamed (a node would
> start forwarding an update as soon as it started receiving it).
> That's probably too complex for now, so parallel looks like the right
> choice.

Yes, let's go with a simpler version first. But we should definitely
keep the chain design in mind.

> > - What happens if the coordinator node cannot "complete"
> > the write request because it cannot find W nodes that
> > handle the write request successfully?
>
> Good question. It seems like we should be able to operate in degraded
> mode, so shouldn't a write succeed if at least one node got it?
>
> Or perhaps a W-way write is a feature, but not the default? Seems
> like the desired replication factor shouldn't be strongly coupled to
> the number of nodes that we need to write to for success.

The number W of nodes that we need to "complete" a write and the
replication factor/level N are two separate parameters in the system
configuration and 1 <= W <= N (usually, 1 < W < N). We need a W-way
write to provide fault tolerance for the write. Maybe we can return a
flag indicating <W nodes did the write and let an application decide
whether it wants to redo the write?

> > - Last but not the least, a possible performance impact:
> > a node can receive the same write request from several
> > different nodes around the same time.
>
> It seems like this should be rare. If it is rare, we shouldn't do any
> extra work to handle it.

I'm not sure if this is rare. I will run a test when we experiment
with a more realistic workload.

Cheers,
Ning

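A rough sketch of the coordinator side of such a W-way write: fan the
request out to all N replicas in parallel, count the successes, and
return the count so the caller can compare it against W. The writeTo
RPC stub is hypothetical, not part of Bailey:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WriteCoordinator {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Sends the write to all replicas in parallel; returns how many succeeded. */
    public int replicate(List<String> replicas, String doc, long revision) {
        List<Future<Boolean>> results = new ArrayList<>();
        for (String node : replicas) {
            results.add(pool.submit(() -> writeTo(node, doc, revision)));
        }
        int succeeded = 0;
        for (Future<Boolean> f : results) {
            try {
                if (f.get(5, TimeUnit.SECONDS)) succeeded++;  // per-node timeout
            } catch (Exception e) {
                // a failed or timed-out node simply doesn't count
            }
        }
        return succeeded;  // the caller treats succeeded >= W as "complete"
    }

    private boolean writeTo(String node, String doc, long revision) {
        return true;  // placeholder for the actual RPC to one replica
    }
}
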
From: Yonik S. <yo...@ap...> - 2008-05-01 16:39:50

On Tue, Apr 29, 2008 at 11:21 AM, Ning Li <nin...@gm...> wrote:
> Previously, we decided that:
>
> - A write request is sent to multiple nodes, all of whose
> ranges must include the document.
>
> - When a node receives a write request of a document with
> revision R, it checks that revision R is newer than the
> current revision stored/indexed on the node. Then the node
> stores and indexes the new revision in memory and puts a
> log entry in the log.
>
> - A W-way write means at least W nodes have to
> successfully serve the write request for the write to be
> considered "complete". 1 < W <= N. W > 1 is to provide
> fault tolerance to write operations.
>
> Here are the details:
>
> - The node that receives a write request is the coordinator
> of the write request. It makes sure the write request is
> also handled on W-1 other nodes.
>
> - Should the coordinator node send the write request to
> W-1 other nodes in sequence or in parallel? Preferably
> in parallel?

Google's distributed file system maximized link bandwidth by having a
chain... one node would send to another node, which would send to
another node, etc. The trick was that it was streamed (a node would
start forwarding an update as soon as it started receiving it).
That's probably too complex for now, so parallel looks like the right
choice.

> - When the write request fails on a node, should the
> coordinator node retry that node or immediately try
> some other node?
>
> - What happens if the coordinator node cannot "complete"
> the write request because it cannot find W nodes that
> handle the write request successfully?

Good question. It seems like we should be able to operate in degraded
mode, so shouldn't a write succeed if at least one node got it?

Or perhaps a W-way write is a feature, but not the default? Seems
like the desired replication factor shouldn't be strongly coupled to
the number of nodes that we need to write to for success.

> On the one hand,
> we cannot undo the write request from the nodes that
> succeeded because we cannot increase the user's revision.
> On the other hand, the nodes that succeeded may
> propagate the write request to more nodes, so the write
> request may eventually be "completed". So users should
> be aware that even if a write request is "incomplete",
> it may or may not be "completed" eventually.
>
> - When a node receives a write request whose revision
> is older than that on the node, the node returns success
> without doing anything. Should a user be notified that
> a newer revision was encountered?
>
> - Last but not least, a possible performance impact:
> a node can receive the same write request from several
> different nodes around the same time.

It seems like this should be rare. If it is rare, we shouldn't do any
extra work to handle it.

-Yonik

> When it checks
> its database, the write request is considered new, so
> it parses and processes the request, several times, and
> in the end finds out that it's the same request. What is
> a good strategy to minimize this situation?
>
> What do you think?
>
> Cheers,
> Ning

From: Ning L. <nin...@gm...> - 2008-04-29 15:21:31

Previously, we decided that:

- A write request is sent to multiple nodes, all of whose
ranges must include the document.

- When a node receives a write request of a document with
revision R, it checks that revision R is newer than the
current revision stored/indexed on the node. Then the node
stores and indexes the new revision in memory and puts a
log entry in the log.

- A W-way write means at least W nodes have to
successfully serve the write request for the write to be
considered "complete". 1 < W <= N. W > 1 is to provide
fault tolerance to write operations.

Here are the details:

- The node that receives a write request is the coordinator
of the write request. It makes sure the write request is
also handled on W-1 other nodes.

- Should the coordinator node send the write request to
W-1 other nodes in sequence or in parallel? Preferably
in parallel?

- When the write request fails on a node, should the
coordinator node retry that node or immediately try
some other node?

- What happens if the coordinator node cannot "complete"
the write request because it cannot find W nodes that
handle the write request successfully? On the one hand,
we cannot undo the write request from the nodes that
succeeded because we cannot increase the user's revision.
On the other hand, the nodes that succeeded may
propagate the write request to more nodes, so the write
request may eventually be "completed". So users should
be aware that even if a write request is "incomplete",
it may or may not be "completed" eventually.

- When a node receives a write request whose revision
is older than that on the node, the node returns success
without doing anything. Should a user be notified that
a newer revision was encountered?

- Last but not least, a possible performance impact:
a node can receive the same write request from several
different nodes around the same time. When it checks
its database, the write request is considered new, so
it parses and processes the request, several times, and
in the end finds out that it's the same request. What is
a good strategy to minimize this situation?

What do you think?

Cheers,
Ning

From: Ning L. <nin...@gm...> - 2008-04-28 16:28:36

> 2. Write a multi-threaded test program that uses this, simulating a
> large, active index. It randomly adds, removes and updates randomly
> created documents, and periodically queries and checks that results are
> correct.

Does the newly added TestMtHeapDb look alright?

On Wed, Apr 23, 2008 at 1:26 PM, Doug Cutting <cu...@ap...> wrote:
> > Before I change the code to do the above, what do you think of the
> > current implementation in terms of the master-host communication,
> > the host-host communication, and the logic of the cleanup threads
> > which make sure there is no data loss if no host goes down in the
> > process?
>
> I'll try to take a detailed look at it today.

Does it look good?

Cheers,
Ning

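For context, a test of this shape usually runs writer threads against
both the database under test and a trusted in-memory model, then
compares the two. This is a minimal sketch only; TestMtHeapDb itself
lives in the repository and differs in detail, and the Db interface
here is an assumption:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Writer threads mutate random documents in the database under test and
// in a trusted in-memory model; a checker periodically compares the two.
public class StressTestSketch {
    interface Db {  // assumed minimal surface of the database under test
        void put(String id, String body);
        void remove(String id);
        String get(String id);
    }

    private final Db db;
    private final Map<String, String> model = new ConcurrentHashMap<>();

    StressTestSketch(Db db) { this.db = db; }

    /** One random add, update or remove, mirrored into the model. */
    synchronized void writeOnce(int i) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        String id = "doc" + rnd.nextInt(1000);
        if (rnd.nextInt(3) < 2) {          // add or update
            db.put(id, "v" + i);
            model.put(id, "v" + i);
        } else {                           // remove
            db.remove(id);
            model.remove(id);
        }
    }

    /** Periodic check: the database must agree with the model. */
    synchronized void check() {
        for (Map.Entry<String, String> e : model.entrySet()) {
            assert e.getValue().equals(db.get(e.getKey()));
        }
    }
}
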
From: <ni...@us...> - 2008-04-28 16:25:32
Revision: 29
http://bailey.svn.sourceforge.net/bailey/?rev=29&view=rev
Author: ning_li
Date: 2008-04-28 09:25:33 -0700 (Mon, 28 Apr 2008)
Log Message:
-----------
"Write a multi-threaded test program that uses this, simulating a large, active index. It randomly adds, removes and updates randomly created documents, and periodically queries and checks that results are correct."
Added Paths:
-----------
trunk/src/test/org/apache/bailey/TestMtHeapDb.java
From: <ni...@us...> - 2008-04-28 16:21:05
Revision: 28
http://bailey.svn.sourceforge.net/bailey/?rev=28&view=rev
Author: ning_li
Date: 2008-04-28 09:20:38 -0700 (Mon, 28 Apr 2008)
Log Message:
-----------
Make the "stipulate" method synchronized to avoid ConcurrentModificationException.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
From: <ni...@us...> - 2008-04-28 16:18:05
Revision: 27
http://bailey.svn.sourceforge.net/bailey/?rev=27&view=rev
Author: ning_li
Date: 2008-04-28 09:18:09 -0700 (Mon, 28 Apr 2008)
Log Message:
-----------
Make ServiceDatabase extend Database. ServiceDatabase defines additional methods that are needed to improve performance in the internal implementation.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/ddb/Client.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/provider/ServiceDatabase.java
From: Doug C. <cu...@ap...> - 2008-04-23 17:26:04

Ning Li wrote:
>> Does that work?
>
> Yep!

Great!

> Before I change the code to do the above, what do you think of the
> current implementation in terms of the master-host communication,
> the host-host communication, and the logic of the cleanup threads
> which make sure there is no data loss if no host goes down in the
> process?

I'll try to take a detailed look at it today.

> BTW, you are still enthusiastic about Bailey, right? :)

Very much so! I'm just having a hard time pulling away from Hadoop long
enough to give much time to Bailey. Plus, I've recently been asked to
start working on a third project! Sigh.

Doug

From: Ning L. <nin...@gm...> - 2008-04-23 16:00:30

On Tue, Apr 22, 2008 at 7:56 PM, Doug Cutting <cu...@ap...> wrote:
> This is nice. Ideally the flow wouldn't require a lot of state. So
> adding X' is triggered first (by turning on a new node or by splitting
> a hot spot or somesuch). Once X' is online, then (2), (3) and (4) are
> triggered by the over-replication, with a lock to permit only one
> transaction at a time. There's no explicit plan or worklist. Instead,
> the state can be evaluated at each point, and the highest priority
> transaction can be executed. Repeat.
>
> So the algorithm might look like:
>
> condition:           action:
> -----------------------------------------------------
> under-replication    start a replication
> over-replication     start a reduction
> new node available   add @ hottest spot
> hot spot             free a node
>
> These are priority-ordered. Only one action may be in progress at a
> time in a given region of the ring.
>
> Does that work?

Yep!

Before I change the code to do the above, what do you think of the
current implementation in terms of the master-host communication,
the host-host communication, and the logic of the cleanup threads
which make sure there is no data loss if no host goes down in the
process?

BTW, you are still enthusiastic about Bailey, right? :)

Cheers,
Ning

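The loop described above might look roughly like this; RingState and
the action bodies are assumptions for illustration, not Bailey's
classes:

// The priority-ordered maintenance loop from the table above. A real
// loop would pause between cycles and hold a per-region lock so only
// one action runs at a time in any given region of the ring.
interface RingState {
    boolean underReplicated();
    boolean overReplicated();
    boolean newNodeAvailable();
    boolean hasHotSpot();
}

class RingMaintainer {
    void maintain(RingState ring) {
        while (!Thread.currentThread().isInterrupted()) {
            // No explicit plan or worklist: re-evaluate the state every
            // cycle and run only the highest-priority applicable action.
            if (ring.underReplicated())       startReplication();
            else if (ring.overReplicated())   startReduction();
            else if (ring.newNodeAvailable()) addAtHottestSpot();
            else if (ring.hasHotSpot())       freeANode();
        }
    }
    private void startReplication() { /* copy a range to another node */ }
    private void startReduction()   { /* trim an over-replicated range */ }
    private void addAtHottestSpot() { /* bring a new node into the ring */ }
    private void freeANode()        { /* retire a node from a cool region */ }
}
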
From: Doug C. <cu...@ap...> - 2008-04-23 00:00:51

Ning Li wrote:
>> I think each node can determine that on its own at startup. It
>>
>> 1. Posts its range and log start number.
>> 2. Waits a bit, so all other nodes have had a chance to post their data.
>> 3. Decides if its index is valid, by checking all overlapping nodes'
>>    log start numbers and comparing them with the last sync'd log number
>>    to see if they've compacted their log.
>> 4. Starts syncing with its neighbors.
>
> Data in Zookeeper is persistent, right? So there are records in Zookeeper
> on the range, log start number and log start numbers on neighbors
> of a node. So what does it mean if, at startup, a node's posted numbers
> are newer or older than those in Zookeeper? Can the data in Zookeeper
> help during startup? Or do we discard those records in Zookeeper?

Zookeeper can store data that's persistent or that disappears when a
node disappears. I think the most natural representation in Zookeeper
would be to make the live ring be represented by ephemeral files.

Doug

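For illustration, registering a live node this way uses the standard
Zookeeper client API directly; the /bailey/ring path layout is an
assumption, and the parent path is presumed to already exist:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RingRegistration {
    public static void register(ZooKeeper zk, String nodeId, byte[] rangeAndLogStart)
            throws KeeperException, InterruptedException {
        // An ephemeral znode disappears automatically when this node's
        // session dies, so the live ring always reflects reachable nodes.
        zk.create("/bailey/ring/" + nodeId, rangeAndLogStart,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}
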
From: Doug C. <cu...@ap...> - 2008-04-22 23:56:43

Ning Li wrote:
> Your email made me think that we can make the complexity
> manageable by breaking a state change "transaction" into
> smaller ones.
>
> Here is what I think of a state change "transaction":
> A new node X' is added at the position X' between X and X+1.
> The affected nodes switch from:
> node X-2 serves range [X-2, X+1)
> node X-1 serves range [X-1, X+2)
> node X serves range [X, X+3)
> to:
> node X-2 serves range [X-2, X')
> node X-1 serves range [X-1, X+1)
> node X serves range [X, X+2)
> node X' serves range [X', X+3)
> And the log propagation of even more nodes is affected.
>
> We break this "transaction" into four smaller ones:
> 1 node X' serves range [X', X+3)
> Now range [X', X+3) is replicated 4 times - over-replicated.
> 2 node X-2 from [X-2, X+1) to [X-2, X')
> Now only range [X+1, X+3) is still over-replicated.
> 3 node X-1 from [X-1, X+2) to [X-1, X+1)
> Now only range [X+2, X+3) is still over-replicated.
> 4 node X from [X, X+3) to [X, X+2)
> Now no range is over-replicated.
> So in each of these smaller "transactions", only one node
> and some neighbors' log propagation are affected. Failure
> is now easier to deal with.
>
> It's worth pointing out that the latter three "transactions"
> are scheduled to balance the level of replication.
>
> Is this a good idea?

This is nice. Ideally the flow wouldn't require a lot of state. So
adding X' is triggered first (by turning on a new node or by splitting
a hot spot or somesuch). Once X' is online, then (2), (3) and (4) are
triggered by the over-replication, with a lock to permit only one
transaction at a time. There's no explicit plan or worklist. Instead,
the state can be evaluated at each point, and the highest priority
transaction can be executed. Repeat.

So the algorithm might look like:

condition:           action:
-----------------------------------------------------
under-replication    start a replication
over-replication     start a reduction
new node available   add @ hottest spot
hot spot             free a node

These are priority-ordered. Only one action may be in progress at a
time in a given region of the ring.

Does that work?

Doug

From: Ning L. <nin...@gm...> - 2008-04-22 22:52:16

It's been quiet for a while. Any comments/thoughts?

Cheers,
Ning

On Wed, Apr 16, 2008 at 7:12 PM, Ning Li <nin...@gm...> wrote:
> > I'm coding the add-node process with the overlapping replication
> > right now. It's taking some time. After it's finished, we can review
> > the code and have a complete picture of the complexity and the
> > performance implications.
>
> Finally checked in the first version for adding a node, with log
> propagation and cleanup... :)
>
> On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> > Yes, that's one way to do it. Perhaps it is even the best way. But if
> > C reports it is ready to start serving [2, 4) before D is, then this
> > might be revealed to clients, to more rapidly reduce the load on A and
> > B. Might this also permit a less stateful master? My hunch is that the
> > more queues of pending splits and such that the master has to maintain,
> > the greater the chances of confusion in the face of failure.
> >
> > Perhaps the master could simply issue step (1), asking C and D to start
> > serving [2, 4), and remember only that it had issued this request so
> > that it doesn't issue it again too soon. Then, when C & D tell the
> > master they're ready, the master can notice that [2, 4) is
> > over-replicated, and issue requests to A and B to stop serving this.
> > In this model, the master would not wait for both C & D to report, nor
> > would it wait for A and B to report their narrowed ranges before
> > republishing the ring.
> >
> > In this approach, if D fails while replicating [2, 4) then the master
> > could still trim A to [0, 2), but not trim B yet. Also, if
> > previously-offline E comes back online already serving [0, 2) or
> > [2, 3), or some other overlapping range from some prior merge/split
> > state, it can be more easily re-integrated into the ring, without
> > discarding all of its data.
> >
> > So, even in the case of a mostly-disjoint partitioning, there may be
> > benefits to permitting overlap.
>
> You have a number of good points.
>
> I was worried about the complexity of a state change "transaction"
> (i.e. adding/removing a node and the changes to neighbors' ranges
> and the changes to more neighbors' log propagation that come
> with it). As you said, "the more queues of pending splits" (or
> pending range changes) "and such that the master has to maintain,
> the greater the chances of confusion in the face of failure."
>
> Your email made me think that we can make the complexity
> manageable by breaking a state change "transaction" into
> smaller ones.
>
> Here is what I think of a state change "transaction":
> A new node X' is added at the position X' between X and X+1.
> The affected nodes switch from:
> node X-2 serves range [X-2, X+1)
> node X-1 serves range [X-1, X+2)
> node X serves range [X, X+3)
> to:
> node X-2 serves range [X-2, X')
> node X-1 serves range [X-1, X+1)
> node X serves range [X, X+2)
> node X' serves range [X', X+3)
> And the log propagation of even more nodes is affected.
>
> We break this "transaction" into four smaller ones:
> 1 node X' serves range [X', X+3)
> Now range [X', X+3) is replicated 4 times - over-replicated.
> 2 node X-2 from [X-2, X+1) to [X-2, X')
> Now only range [X+1, X+3) is still over-replicated.
> 3 node X-1 from [X-1, X+2) to [X-1, X+1)
> Now only range [X+2, X+3) is still over-replicated.
> 4 node X from [X, X+3) to [X, X+2)
> Now no range is over-replicated.
> So in each of these smaller "transactions", only one node
> and some neighbors' log propagation are affected. Failure
> is now easier to deal with.
>
> It's worth pointing out that the latter three "transactions"
> are scheduled to balance the level of replication.
>
> Is this a good idea?
>
> Ning

From: Ning L. <nin...@gm...> - 2008-04-17 21:34:46

On Thu, Apr 10, 2008 at 7:42 PM, Doug Cutting <cu...@ap...> wrote:
> So these two simple properties, lightweight master failover and the
> ability to run the master on a host, mean that we can choose to either
> dedicate a machine to the master, or run the master daemon on every
> host, switching frequently. I'd certainly like to preserve the latter
> option, so let's keep it in mind as we code.

Sounds good. I like the design. :)

> From A's perspective, when B comes online, B's data looks fine, since
> B's log is complete. But B discovers, when it talks to A, that B's data
> is obsolete. If A retrieves B's data before B discovers that it is
> obsolete, then A would get stale adds. So B must block retrieval of its
> data until it has determined whether it is valid to all of its
> neighbors. At system startup this means that all nodes must wait for
> their neighbors to come online so that they can determine whether their
> own state is valid before permitting any synchronization.
>
> Does that make any sense?

Yep. My point was also that a node needs to check the neighbors to
decide the validity.

> I think each node can determine that on its own at startup. It
>
> 1. Posts its range and log start number.
> 2. Waits a bit, so all other nodes have had a chance to post their data.
> 3. Decides if its index is valid, by checking all overlapping nodes'
>    log start numbers and comparing them with the last sync'd log number
>    to see if they've compacted their log.
> 4. Starts syncing with its neighbors.

Data in Zookeeper is persistent, right? So there are records in
Zookeeper on the range, log start number and log start numbers on
neighbors of a node. So what does it mean if, at startup, a node's
posted numbers are newer or older than those in Zookeeper? Can the data
in Zookeeper help during startup? Or do we discard those records in
Zookeeper?

> I worry that there might be pathological cases where different nodes
> were offline and/or compacted at different times, causing all replicas
> of a range to be discarded.

:) Hopefully this won't happen because of replication. In addition, we
only throw away/re-build a node when we know we can re-build its range
from some other nodes, right?

Ning

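Step 3 of the startup procedure quoted above reduces to a simple
comparison. A sketch, assuming each node records the last log entry
number it synced from each overlapping neighbor (names invented):

import java.util.Map;

public class StartupCheck {
    /**
     * A node's index is valid only if no overlapping neighbor has
     * compacted away log entries this node has not yet synced.
     */
    public static boolean indexValid(Map<String, Long> lastSynced,
                                     Map<String, Long> neighborLogStart) {
        for (Map.Entry<String, Long> e : neighborLogStart.entrySet()) {
            long synced = lastSynced.getOrDefault(e.getKey(), -1L);
            if (e.getValue() > synced + 1) {
                return false;  // gap in the log: reload from scratch
            }
        }
        return true;  // incremental sync from neighbors is safe
    }
}
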
From: Ning L. <nin...@gm...> - 2008-04-16 23:12:15
> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.
Finally checked in the first version for adding a node, with log
propagation and cleanup... :)
On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> Yes, that's one way to do it. Perhaps it is even the best way. But if
> C reports it is ready to start serving [2 4) before D is, then this
> might be revealed to clients, to more rapidly reduce the load on A and
> B. Might this also permit a less stateful master? My hunch is that the
> more queues of pending splits and such that the master has to maintain,
> the greater the chances of confusion in the face of failure.
>
> Perhaps the master could simply issue step (1), asking C and D to start
> serving [2 4), and remember only that it had issued this request so that
> it doesn't issue it again too soon. Then, when C & D tell the master
> they're ready, the master can notice that [2, 4) is over-replicated, and
> issue requests to A and B to stop serving this. In this model, the
> master would not wait for both C & D to report, nor would it wait for A
> and B to report their narrowed ranges before republishing the ring.
>
> In this approach, if D fails while replicating [2, 4) then the master
> could still trim A to [0, 2), but not trim B yet. Also, if
> previously-offline E comes back online already serving [0, 2) or [2, 3),
> or some other overlapping range from some prior merge/split state, it
> can be more easily re-integrated into the ring, without discarding all
> of its data.
>
> So, even in the case of a mostly-disjoint partitioning, there may be
> benefits to permitting overlap.
You have a number of good points.
I was worried about the complexity of a state change "transaction"
(i.e. adding/removing a node and the changes to neighbors' ranges
and the changes to more neighbors' log propagation that come
with it). As you said, "the more queues of pending splits" (or
pending range changes) "and such that the master has to maintain,
the greater the chances of confusion in the face of failure."
Your email made me think that we can make the complexity
manageable by breaking a state change "transaction" into
smaller ones.
Here is what I think of a state change "transaction":
A new node X' is added at the position X' between X and X+1.
The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
And the log propagation of even more nodes is affected.
We break this "transaction" into four smaller ones:
1 node X' serves range [X', X+3)
Now range [X', X+3) is replicated 4 times - over-replicated.
2 node X-2 from [X-2, X+1) to [X-2, X')
Now only range [X+1, X+3) is still over-replicated.
3 node X-1 from [X-1, X+2) to [X-1, X+1)
Now only range [X+2, X+3) is still over-replicated.
4 node X from [X, X+3) to [X, X+2)
Now no range is over-replicated.
So in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected. Failure
is now easier to deal with.
It's worth pointing out that the latter three "transactions"
are scheduled to balance the level of replication.
Is this a good idea?
Ning
PS: Here is what happens when a node is removed:
Node X+1 is removed. The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
node X+1 serves range [X+1, X+4)
to:
node X-2 serves range [X-2, X+2)
node X-1 serves range [X-1, X+3)
node X serves range [X, X+4)
And the log propagation of even more nodes is affected.
Again, we break this "transaction" into four smaller ones:
1 node X+1 is removed
Depending on whether node X+1 failed, we may do nothing.
Now range [X+1, X+4) is replicated 2 times - under-replicated.
2 node X-2 from [X-2, X+1) to [X-2, X+2)
Now only range [X+2, X+4) is still under-replicated.
3 node X-1 from [X-1, X+2) to [X-1, X+3)
Now only range [X+3, X+4) is still under-replicated.
4 node X from [X, X+3) to [X, X+4)
Now no range is under-replicated.
Again, in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected.
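
The four-step decomposition is easy to trace with concrete ring
positions. A toy walk-through with invented positions (N = 3; Range
here is not Bailey's actual class):

import java.util.LinkedHashMap;
import java.util.Map;

public class AddNodeDemo {
    record Range(double start, double end) {}  // half-open [start, end)

    public static void main(String[] args) {
        // positions: X-2 = 0, X-1 = 1, X = 2, X' = 2.5, X+1 = 3 ... X+3 = 5
        Map<String, Range> ring = new LinkedHashMap<>();
        ring.put("X-2", new Range(0.0, 3.0));  // [X-2, X+1)
        ring.put("X-1", new Range(1.0, 4.0));  // [X-1, X+2)
        ring.put("X",   new Range(2.0, 5.0));  // [X,   X+3)

        // 1: X' starts serving [X', X+3); part of the ring is now
        //    replicated four times instead of three.
        ring.put("X'",  new Range(2.5, 5.0));
        // 2-4: one node trims per step until over-replication is gone.
        ring.put("X-2", new Range(0.0, 2.5));  // [X-2, X')
        ring.put("X-1", new Range(1.0, 3.0));  // [X-1, X+1)
        ring.put("X",   new Range(2.0, 4.0));  // [X,   X+2)
        System.out.println(ring);
    }
}
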
From: <ni...@us...> - 2008-04-16 19:33:43
Revision: 26
http://bailey.svn.sourceforge.net/bailey/?rev=26&view=rev
Author: ning_li
Date: 2008-04-16 12:33:49 -0700 (Wed, 16 Apr 2008)
Log Message:
-----------
This is the first version supporting adding a node with log propagation and cleanup.
1 As a first version, only one node can be added at a time, for simplicity. That is, a node is added, the mapper is updated, and all hosts update their mapper before another node can be added. Cleanup ensures there is no loss of data if there is no host failure.
2 Though greatly simplified, it demonstrates the flow of events.
3 We may later break a state change "transaction" (e.g. adding/removing a node and the replication changes that come with it) into smaller "transactions". But the overall flow of events is similar, so hopefully this version is still useful.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Range.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostCommand.java
trunk/src/java/org/apache/bailey/ddb/HostStatus.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/HostToMasterProtocol.java
trunk/src/java/org/apache/bailey/ddb/Log.java
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/Master.java
trunk/src/java/org/apache/bailey/ddb/NodeHostMap.java
trunk/src/java/org/apache/bailey/ddb/NodeStatus.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleMaster.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleLog.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleMaster.java
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
trunk/src/java/org/apache/bailey/provider/ServiceDatabase.java
trunk/src/java/org/apache/bailey/util/SetValuedMap.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
trunk/src/test/org/apache/bailey/TestSimpleDbWithLog.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/ddb/NodeState.java
trunk/src/java/org/apache/bailey/ddb/RangeMismatchException.java
trunk/src/java/org/apache/bailey/ddb/simple/HostInstanceRegistry.java
trunk/src/java/org/apache/bailey/util/RefCounted.java
From: Doug C. <cu...@ap...> - 2008-04-10 23:42:15

Ning Li wrote:
> Yes, this would work. We simply put the code for master into host
> and make sure only one host is executing the master code at a time.
>
> Interesting. :) But let's use a master for now since it's easier to
> debug this way?

Keeping all the state in Zookeeper may also make it easy to debug,
since it can be read from any client. I actually don't think it should
be hard to have it both ways. We should write the master as a fail-over
loop in any case, so that multiple master-daemon processes can be
running on a cluster at once, but only one acting as master. Switching
masters frequently should be as simple as adding a counter in the
top-level daemon loop to decide when to give up the lock. The master
should be a different class from the host, and should have its own
main() method, so that it can be run independently, but it should also
be possible to run the master in the same JVM as a host. (We'll want
this for testing anyway.)

So these two simple properties, lightweight master failover and the
ability to run the master on a host, mean that we can choose to either
dedicate a machine to the master, or run the master daemon on every
host, switching frequently. I'd certainly like to preserve the latter
option, so let's keep it in mind as we code.

>> How does this work at system startup? How does B know not to go
>> online, providing its stale adds, yet A is permitted to provide its
>> data? Perhaps, at startup a node can respond to requests for its first
>> log entry number, but refuse requests for the logs themselves. Then B
>> can see that it must reload, and refuse requests from A to sync until
>> reload is complete. Does that work?
>
> What do you mean by "a node can respond to requests for its first log
> entry number, but refuse requests for the logs themselves"?

From A's perspective, when B comes online, B's data looks fine, since
B's log is complete. But B discovers, when it talks to A, that B's data
is obsolete. If A retrieves B's data before B discovers that it is
obsolete, then A would get stale adds. So B must block retrieval of its
data until it has determined whether it is valid to all of its
neighbors. At system startup this means that all nodes must wait for
their neighbors to come online so that they can determine whether their
own state is valid before permitting any synchronization.

Does that make any sense?

> Maybe at system startup, master (or a host) analyzes, for each node,
> its first log entry number and the log entry numbers it has synced with
> its overlapping nodes. Then it is derived from this analysis which
> nodes need complete reloads? Is this what you meant?

I think each node can determine that on its own at startup. It

1. Posts its range and log start number.
2. Waits a bit, so all other nodes have had a chance to post their data.
3. Decides if its index is valid, by checking all overlapping nodes'
   log start numbers and comparing them with the last sync'd log number
   to see if they've compacted their log.
4. Starts syncing with its neighbors.

I worry that there might be pathological cases where different nodes
were offline and/or compacted at different times, causing all replicas
of a range to be discarded.

Does this make any sense?

Doug

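The fail-over loop described above might be shaped like this;
DistributedLock is an assumed interface standing in for a
Zookeeper-backed lock recipe, not an actual Bailey class:

// Every host can run this daemon; only the lock holder acts as master,
// and a simple counter makes it give the lock up periodically so
// mastership moves around the cluster.
interface DistributedLock {
    void acquire() throws InterruptedException;  // blocks until held
    void release();
}

class MasterDaemon implements Runnable {
    private static final int CYCLES_BEFORE_YIELD = 10;
    private final DistributedLock masterLock;

    MasterDaemon(DistributedLock masterLock) { this.masterLock = masterLock; }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                masterLock.acquire();             // election: winner is master
                try {
                    for (int i = 0; i < CYCLES_BEFORE_YIELD; i++) {
                        analyzeRingAndAct();      // one analysis/action cycle
                    }
                } finally {
                    masterLock.release();         // hand mastership around
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // shut down cleanly
            }
        }
    }

    private void analyzeRingAndAct() { /* evaluate the ring state and act */ }
}
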
From: Doug C. <cu...@ap...> - 2008-04-10 22:38:13

Ning Li wrote:
> Let's say the replication is 2 and node A and node B are the two
> replicas of range [0, 4). Now we decide to split [0, 4) into two:
> [0, 2) and [2, 4). After the split, node A and node B will be the
> replicas of [0, 2) and node C and node D will be created to be the
> replicas of [2, 4).
>
> The split process works as follows:
> 1 Master tells some hosts to create node C and node D of [2, 4)
> from node A or node B
> 2 C and D tell the master they are ready
> 3 Master changes the map to A and B serving [0, 2) and C and D
> serving [2, 4)
> 4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
> 5 A and B tell the master they are done
>
> Care should be taken in synchronizing logs in the process.
>
> Here, it doesn't matter if at one point in time, node X serves R1,
> node Y serves R2, and R1 and R2 overlap. What does matter is
> that there is no overlap in the map which the master maintains.
> And because clients get the map from the master, clients do
> not see any overlap in the map, either.

Yes, that's one way to do it. Perhaps it is even the best way. But if
C reports it is ready to start serving [2, 4) before D is, then this
might be revealed to clients, to more rapidly reduce the load on A and
B. Might this also permit a less stateful master? My hunch is that the
more queues of pending splits and such that the master has to maintain,
the greater the chances of confusion in the face of failure.

Perhaps the master could simply issue step (1), asking C and D to start
serving [2, 4), and remember only that it had issued this request so
that it doesn't issue it again too soon. Then, when C & D tell the
master they're ready, the master can notice that [2, 4) is
over-replicated, and issue requests to A and B to stop serving this. In
this model, the master would not wait for both C & D to report, nor
would it wait for A and B to report their narrowed ranges before
republishing the ring.

In this approach, if D fails while replicating [2, 4) then the master
could still trim A to [0, 2), but not trim B yet. Also, if
previously-offline E comes back online already serving [0, 2) or
[2, 3), or some other overlapping range from some prior merge/split
state, it can be more easily re-integrated into the ring, without
discarding all of its data.

So, even in the case of a mostly-disjoint partitioning, there may be
benefits to permitting overlap.

> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.

Great!

Doug

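Noticing over-replication, as suggested above, can be as simple as
counting how many published ranges cover a point of the key space. A
sketch that ignores ring wrap-around for brevity:

import java.util.List;

public class ReplicationCheck {
    record Range(double start, double end) {}  // half-open, non-wrapping

    /** Counts how many published ranges cover the given ring position. */
    public static int copiesAt(List<Range> ranges, double point) {
        int copies = 0;
        for (Range r : ranges) {
            if (r.start() <= point && point < r.end()) copies++;
        }
        return copies;  // > N means over-replicated, < N under-replicated
    }
}
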
From: Ning L. <nin...@gm...> - 2008-04-10 17:34:42

On Tue, Apr 8, 2008 at 4:29 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > I thought only the master writes to Zookeeper. Hosts read from
> > Zookeeper. We should still have a light-weight master, no? Which
> > keeps track of the heartbeats from the hosts, detects host failures
> > and responds accordingly, and decides state changes (add/remove
> > nodes) when appropriate...
>
> As a thought experiment, I'm trying to see whether, with Zookeeper, we
> actually need such a master.
>
> We could replace heartbeats with an ephemeral file per node containing
> its status. (Ephemeral files disappear if their owner goes offline.)
>
> Any host could (a) grab a lock; (b) analyze the ring for potential
> add/removes; (c) post these requests to a directory. In effect, getting
> the lock is master election, and while a node is doing this analysis,
> it is the master. But the master moves around: each host has a "master"
> thread, and remains "master" only during one analysis/action cycle.
> This analysis cycle should not be very compute or i/o intensive.
>
> The advantage of this is that we wouldn't have to explicitly test or
> otherwise engineer master failover, since it would be happening all the
> time. All global state would be redundantly stored in Zookeeper.
>
> Could this work?

Yes, this would work. We simply put the code for master into host
and make sure only one host is executing the master code at a time.

Interesting. :) But let's use a master for now since it's easier to
debug this way?

> > In this example, say B synced with A to A's entry 11 and
> > with C to C's entry 12 (which includes adding doc X) before
> > it went offline. Because A expunged its log, now A's log
> > entry number starts from 21. B comes back online and
> > finds A's 21 is newer than last time B synced with it (11).
> > So B cannot recover but has to sync from scratch.
>
> My concern was that A might try to sync from B and get stale data. I
> think you're arguing that B should not go online, accepting sync
> requests, until it has itself sync'd with its neighbors. In this case,
> incremental sync would fail and B would sync from scratch, removing any
> stale adds.
>
> How does this work at system startup? How does B know not to go online,
> providing its stale adds, yet A is permitted to provide its data?
> Perhaps, at startup a node can respond to requests for its first log
> entry number, but refuse requests for the logs themselves. Then B can
> see that it must reload, and refuse requests from A to sync until reload
> is complete. Does that work?

What do you mean by "a node can respond to requests for its first log
entry number, but refuse requests for the logs themselves"?

Maybe at system startup, master (or a host) analyzes, for each node,
its first log entry number and the log entry numbers it has synced with
its overlapping nodes. Then it is derived from this analysis which nodes
need complete reloads? Is this what you meant?

Ning

From: Ning L. <nin...@gm...> - 2008-04-10 17:11:17

On Tue, Apr 8, 2008 at 5:39 PM, Doug Cutting <cu...@ap...> wrote:
> > Halve a hot range? Or some more sophisticated algorithm.
> > We can still have nodes evenly around the ring at the start.
> > The partitioning algorithm doesn't change. Only the replication
> > algorithm is different.
>
> But we'll need to dynamically balance things, since load may be uneven
> around the ring, especially with user-provided ring positions. And once
> we permit splitting, don't we have to handle overlapping ranges?

With non-overlapping replication, if we decide to split a range, we
split all the replicas of the range. Same as with overlapping
replication, we use the master to coordinate the process.

> > What do you mean by "without permitting it"?
>
> Just that, if we split a node in a disjoint partitioning then we'll
> have overlapping ranges during the split. Switching all clients to
> search the new half-sized nodes won't be atomic. So I'm not sure that
> disjoint ranges really simplify things fundamentally.

Let's say the replication is 2 and node A and node B are the two
replicas of range [0, 4). Now we decide to split [0, 4) into two:
[0, 2) and [2, 4). After the split, node A and node B will be the
replicas of [0, 2) and node C and node D will be created to be the
replicas of [2, 4).

The split process works as follows:
1 Master tells some hosts to create node C and node D of [2, 4)
from node A or node B
2 C and D tell the master they are ready
3 Master changes the map to A and B serving [0, 2) and C and D
serving [2, 4)
4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
5 A and B tell the master they are done

Care should be taken in synchronizing logs in the process.

Here, it doesn't matter if at one point in time, node X serves R1,
node Y serves R2, and R1 and R2 overlap. What does matter is
that there is no overlap in the map which the master maintains.
And because clients get the map from the master, clients do
not see any overlap in the map, either.

What do you think?

> > I think this is a fair comparison so the search performance should
> > be comparable. What do you think?
>
> Good point.
>
> One other difference is that, with overlapping ranges, when a node
> fails, its load is spread more evenly over the remaining nodes. In
> this example, the load of any single node would become shared by two
> other nodes rather than one, so each node only needs 50% headroom
> rather than 100%.

Yes, that's the advantage of the overlapping replication scheme.

I'm coding the add-node process with the overlapping replication
right now. It's taking some time. After it's finished, we can review
the code and have a complete picture of the complexity and the
performance implications.

Ning

From: <Yo...@us...> - 2008-04-10 03:40:46
Revision: 25
http://bailey.svn.sourceforge.net/bailey/?rev=25&view=rev
Author: Yonik
Date: 2008-04-09 20:40:49 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
Replace String.hashCode() with LOOKUP3 (fast and well studied) for getDefaultPosition. String.hashCode() isn't suitable if you are using all the bits, since most of the changes happen on the right. Also made a version of LOOKUP3 that is well defined for implementations in other languages.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Document.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/util/Hash.java
trunk/src/test/org/apache/bailey/util/
trunk/src/test/org/apache/bailey/util/TestHash.java
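
The clustering problem the log message describes is easy to demonstrate
with a few arbitrary keys (this demo is not part of the commit):

// For short, similar keys, String.hashCode() changes mostly in the
// low-order bits, so ring positions derived from it cluster together.
// A well-mixed hash such as Bob Jenkins' lookup3 (the Hash class added
// here) spreads changes across all 32 bits.
public class HashCodeDemo {
    public static void main(String[] args) {
        for (int i = 1; i <= 4; i++) {
            String key = "doc" + i;
            System.out.printf("%s -> %08x%n", key, key.hashCode());
        }
        // Prints four values that differ only in their lowest bits,
        // because the final character carries a multiplier of 1.
    }
}
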
From: <yo...@us...> - 2008-04-10 03:26:36
Revision: 24
http://bailey.svn.sourceforge.net/bailey/?rev=24&view=rev
Author: yonik
Date: 2008-04-09 20:26:41 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
one more time... test commit with new email
Modified Paths:
--------------
trunk/src/java/overview.html