From: Doug C. <cu...@ap...> - 2008-05-09 22:08:43

Mike Klaas wrote:
> Have you guys seen this project? Sounds similar to some of the goals
> of Bailey.
>
> http://katta.wiki.sourceforge.net/

Looks related, but there's no code there yet... The diagram makes it
look like a read-only, search-time system: indexes are created offline
in Hadoop, then posted to Katta. Bailey's design is to support live
updates of documents.

Doug

From: Mike K. <mik...@gm...> - 2008-05-09 21:48:03

Have you guys seen this project? Sounds similar to some of the goals
of Bailey.

http://katta.wiki.sourceforge.net/

-Mike

From: Ning L. <nin...@gm...> - 2008-05-02 21:03:50

On Fri, May 2, 2008 at 4:37 PM, Yonik Seeley <yo...@ap...> wrote:
> On Fri, May 2, 2008 at 4:12 PM, Ning Li <nin...@gm...> wrote:
> > We need a W-way write to provide fault tolerance for the write.
> > Maybe we can return a flag indicating <W nodes did the write
> > and let an application decide whether it wants to redo the write?
>
> Returning the number of servers actually written sounds like the
> right approach. It does seem hard to try and cancel the write.

Because of the version number, a write is idempotent. So we don't need
to cancel the write, right?

> > > > - Last but not the least, a possible performance impact:
> > > > a node can receive the same write request from several
> > > > different nodes around the same time.
> > >
> > > It seems like this should be rare. If it is rare, we shouldn't do
> > > any extra work to handle it.
> >
> > I'm not sure if this is rare. I will run a test when we experiment
> > with a more realistic workload.
>
> Do you mean the exact same client write request, or different client
> write requests that happen to be for the same document? What would be
> the scenario that would trigger the first?

The exact same client write request. This can arise because of log
propagation. For example, a document is in the range served by nodes
A, B, C and D (W = 3, N = 4). Now let's say A and B finish writing and
logging the document while C is still processing it. C also syncs the
log entries from A and B, so it sees the same write via the log
propagation. Because C hasn't finished writing the document, it goes
ahead and retrieves the document from A and/or B. This is the wasted
work I was worried about. I'm not sure if this is rare; I think it
depends on how parallel the writes and log syncs are, and on how long
the document parsing/processing takes.

Cheers,
Ning

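A minimal sketch of the idempotence argument above, assuming a
per-document revision number; the class and field names are
illustrative, not Bailey's actual API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Re-applying the same write is harmless: the revision check makes the
// operation idempotent, so a redo after a partial failure needs no undo.
public class RevisionedStore {
    private final Map<String, Long> revisions = new ConcurrentHashMap<>();
    private final Map<String, String> docs = new ConcurrentHashMap<>();

    /** Applies the write only if the revision is newer; returns true if applied. */
    public synchronized boolean write(String id, long revision, String body) {
        long current = revisions.getOrDefault(id, -1L);
        if (revision <= current) {
            return false;  // stale or duplicate request: succeed silently
        }
        docs.put(id, body);
        revisions.put(id, revision);
        return true;
    }
}
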
From: Yonik S. <yo...@ap...> - 2008-05-02 20:37:06

On Fri, May 2, 2008 at 4:12 PM, Ning Li <nin...@gm...> wrote:
> We need a W-way write to provide fault tolerance for the write.
> Maybe we can return a flag indicating <W nodes did the write
> and let an application decide whether it wants to redo the write?

Returning the number of servers actually written sounds like the right
approach. It does seem hard to try and cancel the write.

> > > - Last but not the least, a possible performance impact:
> > > a node can receive the same write request from several
> > > different nodes around the same time.
> >
> > It seems like this should be rare. If it is rare, we shouldn't do any
> > extra work to handle it.
>
> I'm not sure if this is rare. I will run a test when we experiment
> with a more realistic workload.

Do you mean the exact same client write request, or different client
write requests that happen to be for the same document? What would be
the scenario that would trigger the first?

-Yonik

From: Ning L. <nin...@gm...> - 2008-05-02 20:12:49

On Thu, May 1, 2008 at 12:39 PM, Yonik Seeley <yo...@ap...> wrote:
> Google's distributed file system maximized link bandwidth by having a
> chain... one node would send to another node, which would send to
> another node, etc. The trick was that it was streamed (a node would
> start forwarding an update as soon as it started receiving it).
> That's probably too complex for now, so parallel looks like the right
> choice.

Yes, let's go with a simpler version first. But we should definitely
keep the chain design in mind.

> > - What happens if the coordinator node cannot "complete"
> > the write request because it cannot find W nodes that
> > handle the write request successfully?
>
> Good question. It seems like we should be able to operate in degraded
> mode, so shouldn't a write succeed if at least one node got it?
>
> Or perhaps a W-way write is a feature, but not the default? Seems
> like the desired replication factor shouldn't be strongly coupled to
> the number of nodes that we need to write to for success.

The number W of nodes that we need to "complete" a write and the
replication factor/level N are two separate parameters in the system
configuration and 1 <= W <= N (usually, 1 < W < N). We need a W-way
write to provide fault tolerance for the write. Maybe we can return a
flag indicating <W nodes did the write and let an application decide
whether it wants to redo the write?

> > - Last but not the least, a possible performance impact:
> > a node can receive the same write request from several
> > different nodes around the same time.
>
> It seems like this should be rare. If it is rare, we shouldn't do any
> extra work to handle it.

I'm not sure if this is rare. I will run a test when we experiment
with a more realistic workload.

Cheers,
Ning

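A rough sketch of the coordinator side of such a W-way write: fan the
request out to all N replicas in parallel, count the successes, and
return the count so the caller can compare it against W. The writeTo
RPC stub is hypothetical, not part of Bailey:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class WriteCoordinator {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Sends the write to all replicas in parallel; returns how many succeeded. */
    public int replicate(List<String> replicas, String doc, long revision) {
        List<Future<Boolean>> results = new ArrayList<>();
        for (String node : replicas) {
            results.add(pool.submit(() -> writeTo(node, doc, revision)));
        }
        int succeeded = 0;
        for (Future<Boolean> f : results) {
            try {
                if (f.get(5, TimeUnit.SECONDS)) succeeded++;  // per-node timeout
            } catch (Exception e) {
                // a failed or timed-out node simply doesn't count
            }
        }
        return succeeded;  // the caller treats succeeded >= W as "complete"
    }

    private boolean writeTo(String node, String doc, long revision) {
        return true;  // placeholder for the actual RPC to one replica
    }
}
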
From: Yonik S. <yo...@ap...> - 2008-05-01 16:39:50

On Tue, Apr 29, 2008 at 11:21 AM, Ning Li <nin...@gm...> wrote:
> Previously, we decided that:
>
> - A write request is sent to multiple nodes, all of whose
> ranges must include the document.
>
> - When a node receives a write request of a document with
> revision R, it checks that revision R is newer than the
> current revision stored/indexed on the node. Then the node
> stores and indexes the new revision in memory and puts a
> log entry in the log.
>
> - A W-way write means at least W nodes have to
> successfully serve the write request for the write to be
> considered "complete". 1 < W <= N. W > 1 is to provide
> fault tolerance to write operations.
>
> Here are the details:
>
> - The node that receives a write request is the coordinator
> of the write request. It makes sure the write request is
> also handled on W-1 other nodes.
>
> - Should the coordinator node send the write request to
> W-1 other nodes in sequence or in parallel? Preferably
> in parallel?

Google's distributed file system maximized link bandwidth by having a
chain... one node would send to another node, which would send to
another node, etc. The trick was that it was streamed (a node would
start forwarding an update as soon as it started receiving it).
That's probably too complex for now, so parallel looks like the right
choice.

> - When the write request fails on a node, should the
> coordinator node retry that node or immediately try
> some other node?
>
> - What happens if the coordinator node cannot "complete"
> the write request because it cannot find W nodes that
> handle the write request successfully?

Good question. It seems like we should be able to operate in degraded
mode, so shouldn't a write succeed if at least one node got it?

Or perhaps a W-way write is a feature, but not the default? Seems
like the desired replication factor shouldn't be strongly coupled to
the number of nodes that we need to write to for success.

> On the one hand,
> we cannot undo the write request from the nodes that
> succeeded because we cannot increase the user's revision.
> On the other hand, the nodes that succeeded may
> propagate the write request to more nodes, so the write
> request may eventually be "completed". So users should
> be aware that even if a write request is "incomplete",
> it may or may not be "completed" eventually.
>
> - When a node receives a write request whose revision
> is older than that on the node, the node returns success
> without doing anything. Should a user be notified that
> a newer revision was encountered?
>
> - Last but not least, a possible performance impact:
> a node can receive the same write request from several
> different nodes around the same time.

It seems like this should be rare. If it is rare, we shouldn't do any
extra work to handle it.

-Yonik

> When it checks
> its database, the write request is considered new, so
> it parses and processes the request, several times, and
> in the end finds out that it's the same request. What is
> a good strategy to minimize this situation?
>
> What do you think?
>
> Cheers,
> Ning

From: Ning L. <nin...@gm...> - 2008-04-29 15:21:31

Previously, we decided that:

- A write request is sent to multiple nodes, all of whose
ranges must include the document.

- When a node receives a write request of a document with
revision R, it checks that revision R is newer than the
current revision stored/indexed on the node. Then the node
stores and indexes the new revision in memory and puts a
log entry in the log.

- A W-way write means at least W nodes have to
successfully serve the write request for the write to be
considered "complete". 1 < W <= N. W > 1 is to provide
fault tolerance to write operations.

Here are the details:

- The node that receives a write request is the coordinator
of the write request. It makes sure the write request is
also handled on W-1 other nodes.

- Should the coordinator node send the write request to
W-1 other nodes in sequence or in parallel? Preferably
in parallel?

- When the write request fails on a node, should the
coordinator node retry that node or immediately try
some other node?

- What happens if the coordinator node cannot "complete"
the write request because it cannot find W nodes that
handle the write request successfully? On the one hand,
we cannot undo the write request from the nodes that
succeeded because we cannot increase the user's revision.
On the other hand, the nodes that succeeded may
propagate the write request to more nodes, so the write
request may eventually be "completed". So users should
be aware that even if a write request is "incomplete",
it may or may not be "completed" eventually.

- When a node receives a write request whose revision
is older than that on the node, the node returns success
without doing anything. Should a user be notified that
a newer revision was encountered?

- Last but not least, a possible performance impact:
a node can receive the same write request from several
different nodes around the same time. When it checks
its database, the write request is considered new, so
it parses and processes the request, several times, and
in the end finds out that it's the same request. What is
a good strategy to minimize this situation?

What do you think?

Cheers,
Ning

From: Ning L. <nin...@gm...> - 2008-04-28 16:28:36

> 2. Write a multi-threaded test program that uses this, simulating a
> large, active index. It randomly adds, removes and updates randomly
> created documents, and periodically queries and checks that results are
> correct.

Does the newly added TestMtHeapDb look alright?

On Wed, Apr 23, 2008 at 1:26 PM, Doug Cutting <cu...@ap...> wrote:
> > Before I change the code to do the above, what do you think of the
> > current implementation in terms of the master-host communication,
> > the host-host communication, and the logic of the cleanup threads
> > which make sure there is no data loss if no host goes down in the
> > process?
>
> I'll try to take a detailed look at it today.

Does it look good?

Cheers,
Ning

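For context, a test of this shape usually runs writer threads against
both the database under test and a trusted in-memory model, then
compares the two. This is a minimal sketch only; TestMtHeapDb itself
lives in the repository and differs in detail, and the Db interface
here is an assumption:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Writer threads mutate random documents in the database under test and
// in a trusted in-memory model; a checker periodically compares the two.
public class StressTestSketch {
    interface Db {  // assumed minimal surface of the database under test
        void put(String id, String body);
        void remove(String id);
        String get(String id);
    }

    private final Db db;
    private final Map<String, String> model = new ConcurrentHashMap<>();

    StressTestSketch(Db db) { this.db = db; }

    /** One random add, update or remove, mirrored into the model. */
    synchronized void writeOnce(int i) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        String id = "doc" + rnd.nextInt(1000);
        if (rnd.nextInt(3) < 2) {          // add or update
            db.put(id, "v" + i);
            model.put(id, "v" + i);
        } else {                           // remove
            db.remove(id);
            model.remove(id);
        }
    }

    /** Periodic check: the database must agree with the model. */
    synchronized void check() {
        for (Map.Entry<String, String> e : model.entrySet()) {
            assert e.getValue().equals(db.get(e.getKey()));
        }
    }
}
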
From: <ni...@us...> - 2008-04-28 16:25:32
Revision: 29
http://bailey.svn.sourceforge.net/bailey/?rev=29&view=rev
Author: ning_li
Date: 2008-04-28 09:25:33 -0700 (Mon, 28 Apr 2008)
Log Message:
-----------
"Write a multi-threaded test program that uses this, simulating a large, active index. It randomly adds, removes and updates randomly created documents, and periodically queries and checks that results are correct."
Added Paths:
-----------
trunk/src/test/org/apache/bailey/TestMtHeapDb.java
From: <ni...@us...> - 2008-04-28 16:21:05
Revision: 28
http://bailey.svn.sourceforge.net/bailey/?rev=28&view=rev
Author: ning_li
Date: 2008-04-28 09:20:38 -0700 (Mon, 28 Apr 2008)
Log Message:
-----------
Make the "stipulate" method synchronized to avoid ConcurrentModificationException.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
From: <ni...@us...> - 2008-04-28 16:18:05
Revision: 27
http://bailey.svn.sourceforge.net/bailey/?rev=27&view=rev
Author: ning_li
Date: 2008-04-28 09:18:09 -0700 (Mon, 28 Apr 2008)
Log Message:
-----------
Make ServiceDatabase extend Database. ServiceDatabase defines additional methods that are needed to improve performance in the internal implementation.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/ddb/Client.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/provider/ServiceDatabase.java
From: Doug C. <cu...@ap...> - 2008-04-23 17:26:04

Ning Li wrote:
>> Does that work?
>
> Yep!

Great!

> Before I change the code to do the above, what do you think of the
> current implementation in terms of the master-host communication,
> the host-host communication, and the logic of the cleanup threads
> which make sure there is no data loss if no host goes down in the
> process?

I'll try to take a detailed look at it today.

> BTW, you are still enthusiastic about Bailey, right? :)

Very much so! I'm just having a hard time pulling away from Hadoop long
enough to give much time to Bailey. Plus, I've recently been asked to
start working on a third project! Sigh.

Doug

From: Ning L. <nin...@gm...> - 2008-04-23 16:00:30

On Tue, Apr 22, 2008 at 7:56 PM, Doug Cutting <cu...@ap...> wrote:
> This is nice. Ideally the flow wouldn't require a lot of state. So
> adding X' is triggered first (by turning on a new node or by splitting
> a hot spot or somesuch). Once X' is online, then (2), (3) and (4) are
> triggered by the over-replication, with a lock to permit only one
> transaction at a time. There's no explicit plan or worklist. Instead,
> the state can be evaluated at each point, and the highest priority
> transaction can be executed. Repeat.
>
> So the algorithm might look like:
>
> condition:           action:
> -----------------------------------------------------
> under-replication    start a replication
> over-replication     start a reduction
> new node available   add @ hottest spot
> hot spot             free a node
>
> These are priority-ordered. Only one action may be in progress at a
> time in a given region of the ring.
>
> Does that work?

Yep!

Before I change the code to do the above, what do you think of the
current implementation in terms of the master-host communication,
the host-host communication, and the logic of the cleanup threads
which make sure there is no data loss if no host goes down in the
process?

BTW, you are still enthusiastic about Bailey, right? :)

Cheers,
Ning

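The loop described above might look roughly like this; RingState and
the action bodies are assumptions for illustration, not Bailey's
classes:

// The priority-ordered maintenance loop from the table above. A real
// loop would pause between cycles and hold a per-region lock so only
// one action runs at a time in any given region of the ring.
interface RingState {
    boolean underReplicated();
    boolean overReplicated();
    boolean newNodeAvailable();
    boolean hasHotSpot();
}

class RingMaintainer {
    void maintain(RingState ring) {
        while (!Thread.currentThread().isInterrupted()) {
            // No explicit plan or worklist: re-evaluate the state every
            // cycle and run only the highest-priority applicable action.
            if (ring.underReplicated())       startReplication();
            else if (ring.overReplicated())   startReduction();
            else if (ring.newNodeAvailable()) addAtHottestSpot();
            else if (ring.hasHotSpot())       freeANode();
        }
    }
    private void startReplication() { /* copy a range to another node */ }
    private void startReduction()   { /* trim an over-replicated range */ }
    private void addAtHottestSpot() { /* bring a new node into the ring */ }
    private void freeANode()        { /* retire a node from a cool region */ }
}
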
From: Doug C. <cu...@ap...> - 2008-04-23 00:00:51

Ning Li wrote:
>> I think each node can determine that on its own at startup. It
>>
>> 1. Posts its range and log start number.
>> 2. Waits a bit, so all other nodes have had a chance to post their data.
>> 3. Decides if its index is valid, by checking all overlapping nodes'
>>    log start numbers and comparing them with the last sync'd log number
>>    to see if they've compacted their log.
>> 4. Starts syncing with its neighbors.
>
> Data in Zookeeper is persistent, right? So there are records in Zookeeper
> on the range, log start number and log start numbers on neighbors
> of a node. So what does it mean if, at startup, a node's posted numbers
> are newer or older than those in Zookeeper? Can the data in Zookeeper
> help during startup? Or do we discard those records in Zookeeper?

Zookeeper can store data that's persistent or that disappears when a
node disappears. I think the most natural representation in Zookeeper
would be to make the live ring be represented by ephemeral files.

Doug

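For illustration, registering a live node this way uses the standard
Zookeeper client API directly; the /bailey/ring path layout is an
assumption, and the parent path is presumed to already exist:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RingRegistration {
    public static void register(ZooKeeper zk, String nodeId, byte[] rangeAndLogStart)
            throws KeeperException, InterruptedException {
        // An ephemeral znode disappears automatically when this node's
        // session dies, so the live ring always reflects reachable nodes.
        zk.create("/bailey/ring/" + nodeId, rangeAndLogStart,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}
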
From: Doug C. <cu...@ap...> - 2008-04-22 23:56:43

Ning Li wrote:
> Your email made me think that we can make the complexity
> manageable by breaking a state change "transaction" into
> smaller ones.
>
> Here is what I think of a state change "transaction":
> A new node X' is added at the position X' between X and X+1.
> The affected nodes switch from:
> node X-2 serves range [X-2, X+1)
> node X-1 serves range [X-1, X+2)
> node X serves range [X, X+3)
> to:
> node X-2 serves range [X-2, X')
> node X-1 serves range [X-1, X+1)
> node X serves range [X, X+2)
> node X' serves range [X', X+3)
> And the log propagation of even more nodes is affected.
>
> We break this "transaction" into four smaller ones:
> 1 node X' serves range [X', X+3)
> Now range [X', X+3) is replicated 4 times - over-replicated.
> 2 node X-2 from [X-2, X+1) to [X-2, X')
> Now only range [X+1, X+3) is still over-replicated.
> 3 node X-1 from [X-1, X+2) to [X-1, X+1)
> Now only range [X+2, X+3) is still over-replicated.
> 4 node X from [X, X+3) to [X, X+2)
> Now no range is over-replicated.
> So in each of these smaller "transactions", only one node
> and some neighbors' log propagation are affected. Failure
> is now easier to deal with.
>
> It's worth pointing out that the latter three "transactions"
> are scheduled to balance the level of replication.
>
> Is this a good idea?

This is nice. Ideally the flow wouldn't require a lot of state. So
adding X' is triggered first (by turning on a new node or by splitting
a hot spot or somesuch). Once X' is online, then (2), (3) and (4) are
triggered by the over-replication, with a lock to permit only one
transaction at a time. There's no explicit plan or worklist. Instead,
the state can be evaluated at each point, and the highest priority
transaction can be executed. Repeat.

So the algorithm might look like:

condition:           action:
-----------------------------------------------------
under-replication    start a replication
over-replication     start a reduction
new node available   add @ hottest spot
hot spot             free a node

These are priority-ordered. Only one action may be in progress at a
time in a given region of the ring.

Does that work?

Doug

From: Ning L. <nin...@gm...> - 2008-04-22 22:52:16

It's been quiet for a while. Any comments/thoughts?

Cheers,
Ning

On Wed, Apr 16, 2008 at 7:12 PM, Ning Li <nin...@gm...> wrote:
> > I'm coding the add-node process with the overlapping replication
> > right now. It's taking some time. After it's finished, we can review
> > the code and have a complete picture of the complexity and the
> > performance implications.
>
> Finally checked in the first version for adding a node, with log
> propagation and cleanup... :)
>
> On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> > Yes, that's one way to do it. Perhaps it is even the best way. But if
> > C reports it is ready to start serving [2, 4) before D is, then this
> > might be revealed to clients, to more rapidly reduce the load on A and
> > B. Might this also permit a less stateful master? My hunch is that the
> > more queues of pending splits and such that the master has to maintain,
> > the greater the chances of confusion in the face of failure.
> >
> > Perhaps the master could simply issue step (1), asking C and D to start
> > serving [2, 4), and remember only that it had issued this request so
> > that it doesn't issue it again too soon. Then, when C & D tell the
> > master they're ready, the master can notice that [2, 4) is
> > over-replicated, and issue requests to A and B to stop serving this.
> > In this model, the master would not wait for both C & D to report, nor
> > would it wait for A and B to report their narrowed ranges before
> > republishing the ring.
> >
> > In this approach, if D fails while replicating [2, 4) then the master
> > could still trim A to [0, 2), but not trim B yet. Also, if
> > previously-offline E comes back online already serving [0, 2) or
> > [2, 3), or some other overlapping range from some prior merge/split
> > state, it can be more easily re-integrated into the ring, without
> > discarding all of its data.
> >
> > So, even in the case of a mostly-disjoint partitioning, there may be
> > benefits to permitting overlap.
>
> You have a number of good points.
>
> I was worried about the complexity of a state change "transaction"
> (i.e. adding/removing a node and the changes to neighbors' ranges
> and the changes to more neighbors' log propagation that come
> with it). As you said, "the more queues of pending splits" (or
> pending range changes) "and such that the master has to maintain,
> the greater the chances of confusion in the face of failure."
>
> Your email made me think that we can make the complexity
> manageable by breaking a state change "transaction" into
> smaller ones.
>
> Here is what I think of a state change "transaction":
> A new node X' is added at the position X' between X and X+1.
> The affected nodes switch from:
> node X-2 serves range [X-2, X+1)
> node X-1 serves range [X-1, X+2)
> node X serves range [X, X+3)
> to:
> node X-2 serves range [X-2, X')
> node X-1 serves range [X-1, X+1)
> node X serves range [X, X+2)
> node X' serves range [X', X+3)
> And the log propagation of even more nodes is affected.
>
> We break this "transaction" into four smaller ones:
> 1 node X' serves range [X', X+3)
> Now range [X', X+3) is replicated 4 times - over-replicated.
> 2 node X-2 from [X-2, X+1) to [X-2, X')
> Now only range [X+1, X+3) is still over-replicated.
> 3 node X-1 from [X-1, X+2) to [X-1, X+1)
> Now only range [X+2, X+3) is still over-replicated.
> 4 node X from [X, X+3) to [X, X+2)
> Now no range is over-replicated.
> So in each of these smaller "transactions", only one node
> and some neighbors' log propagation are affected. Failure
> is now easier to deal with.
>
> It's worth pointing out that the latter three "transactions"
> are scheduled to balance the level of replication.
>
> Is this a good idea?
>
> Ning

From: Ning L. <nin...@gm...> - 2008-04-17 21:34:46

On Thu, Apr 10, 2008 at 7:42 PM, Doug Cutting <cu...@ap...> wrote:
> So these two simple properties, lightweight master failover and the
> ability to run the master on a host, mean that we can choose to either
> dedicate a machine to the master, or run the master daemon on every
> host, switching frequently. I'd certainly like to preserve the latter
> option, so let's keep it in mind as we code.

Sounds good. I like the design. :)

> From A's perspective, when B comes online, B's data looks fine, since
> B's log is complete. But B discovers, when it talks to A, that B's data
> is obsolete. If A retrieves B's data before B discovers that it is
> obsolete, then A would get stale adds. So B must block retrieval of its
> data until it has determined whether it is valid to all of its
> neighbors. At system startup this means that all nodes must wait for
> their neighbors to come online so that they can determine whether their
> own state is valid before permitting any synchronization.
>
> Does that make any sense?

Yep. My point was also that a node needs to check the neighbors to
decide the validity.

> I think each node can determine that on its own at startup. It
>
> 1. Posts its range and log start number.
> 2. Waits a bit, so all other nodes have had a chance to post their data.
> 3. Decides if its index is valid, by checking all overlapping nodes'
>    log start numbers and comparing them with the last sync'd log number
>    to see if they've compacted their log.
> 4. Starts syncing with its neighbors.

Data in Zookeeper is persistent, right? So there are records in
Zookeeper on the range, log start number and log start numbers on
neighbors of a node. So what does it mean if, at startup, a node's
posted numbers are newer or older than those in Zookeeper? Can the data
in Zookeeper help during startup? Or do we discard those records in
Zookeeper?

> I worry that there might be pathological cases where different nodes
> were offline and/or compacted at different times, causing all replicas
> of a range to be discarded.

:) Hopefully this won't happen because of replication. In addition, we
only throw away/re-build a node when we know we can re-build its range
from some other nodes, right?

Ning

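Step 3 of the startup procedure quoted above reduces to a simple
comparison. A sketch, assuming each node records the last log entry
number it synced from each overlapping neighbor (names invented):

import java.util.Map;

public class StartupCheck {
    /**
     * A node's index is valid only if no overlapping neighbor has
     * compacted away log entries this node has not yet synced.
     */
    public static boolean indexValid(Map<String, Long> lastSynced,
                                     Map<String, Long> neighborLogStart) {
        for (Map.Entry<String, Long> e : neighborLogStart.entrySet()) {
            long synced = lastSynced.getOrDefault(e.getKey(), -1L);
            if (e.getValue() > synced + 1) {
                return false;  // gap in the log: reload from scratch
            }
        }
        return true;  // incremental sync from neighbors is safe
    }
}
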
From: Ning L. <nin...@gm...> - 2008-04-16 23:12:15
> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.
Finally checked in the first version for adding a node, with log
propagation and cleanup... :)
On Thu, Apr 10, 2008 at 6:38 PM, Doug Cutting <cu...@ap...> wrote:
> Yes, that's one way to do it. Perhaps it is even the best way. But if
> C reports it is ready to start serving [2 4) before D is, then this
> might be revealed to clients, to more rapidly reduce the load on A and
> B. Might this also permit a less stateful master? My hunch is that the
> more queues of pending splits and such that the master has to maintain,
> the greater the chances of confusion in the face of failure.
>
> Perhaps the master could simply issue step (1), asking C and D to start
> serving [2 4), and remember only that it had issued this request so that
> it doesn't issue it again too soon. Then, when C & D tell the master
> they're ready, the master can notice that [2, 4) is over-replicated, and
> issue requests to A and B to stop serving this. In this model, the
> master would not wait for both C & D to report, nor would it wait for A
> and B to report their narrowed ranges before republishing the ring.
>
> In this approach, if D fails while replicating [2, 4) then the master
> could still trim A to [0, 2), but not trim B yet. Also, if
> previously-offline E comes back online already serving [0, 2) or [2, 3),
> or some other overlapping range from some prior merge/split state, it
> can be more easily re-integrated into the ring, without discarding all
> of its data.
>
> So, even in the case of a mostly-disjoint partitioning, there may be
> benefits to permitting overlap.
You have a number of good points.
I was worried about the complexity of a state change "transaction"
(i.e. adding/removing a node and the changes to neighbors' ranges
and the changes to more neighbors' log propagation that come
with it). As you said, "the more queues of pending splits" (or
pending range changes) "and such that the master has to maintain,
the greater the chances of confusion in the face of failure."
Your email made me think that we can make the complexity
manageable by breaking a state change "transaction" into
smaller ones.
Here is what I think of a state change "transaction":
A new node X' is added at the position X' between X and X+1.
The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
And the log propagation of even more nodes is affected.
We break this "transaction" into four smaller ones:
1 node X' serves range [X', X+3)
Now range [X', X+3) is replicated 4 times - over-replicated.
2 node X-2 from [X-2, X+1) to [X-2, X')
Now only range [X+1, X+3) is still over-replicated.
3 node X-1 from [X-1, X+2) to [X-1, X+1)
Now only range [X+2, X+3) is still over-replicated.
4 node X from [X, X+3) to [X, X+2)
Now no range is over-replicated.
So in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected. Failure
is now easier to deal with.
It's worth pointing out that the latter three "transactions"
are scheduled to balance the level of replication.
Is this a good idea?
Ning
PS: Here is what happens when a node is removed:
Node X+1 is removed. The affected nodes switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
node X+1 serves range [X+1, X+4)
to:
node X-2 serves range [X-2, X+2)
node X-1 serves range [X-1, X+3)
node X serves range [X, X+4)
And the log propagation of even more nodes is affected.
Again, we break this "transaction" into four smaller ones:
1 node X+1 is removed
Depending on whether node X+1 failed, we may do nothing.
Now range [X+1, X+4) is replicated 2 times - under-replicated.
2 node X-2 from [X-2, X+1) to [X-2, X+2)
Now only range [X+2, X+4) is still under-replicated.
3 node X-1 from [X-1, X+2) to [X-1, X+3)
Now only range [X+3, X+4) is still under-replicated.
4 node X from [X, X+3) to [X, X+4)
Now no range is under-replicated.
Again, in each of these smaller "transactions", only one node
and some neighbors' log propagation are affected.
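
The four-step decomposition is easy to trace with concrete ring
positions. A toy walk-through with invented positions (N = 3; Range
here is not Bailey's actual class):

import java.util.LinkedHashMap;
import java.util.Map;

public class AddNodeDemo {
    record Range(double start, double end) {}  // half-open [start, end)

    public static void main(String[] args) {
        // positions: X-2 = 0, X-1 = 1, X = 2, X' = 2.5, X+1 = 3 ... X+3 = 5
        Map<String, Range> ring = new LinkedHashMap<>();
        ring.put("X-2", new Range(0.0, 3.0));  // [X-2, X+1)
        ring.put("X-1", new Range(1.0, 4.0));  // [X-1, X+2)
        ring.put("X",   new Range(2.0, 5.0));  // [X,   X+3)

        // 1: X' starts serving [X', X+3); part of the ring is now
        //    replicated four times instead of three.
        ring.put("X'",  new Range(2.5, 5.0));
        // 2-4: one node trims per step until over-replication is gone.
        ring.put("X-2", new Range(0.0, 2.5));  // [X-2, X')
        ring.put("X-1", new Range(1.0, 3.0));  // [X-1, X+1)
        ring.put("X",   new Range(2.0, 4.0));  // [X,   X+2)
        System.out.println(ring);
    }
}
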
From: <ni...@us...> - 2008-04-16 19:33:43
Revision: 26
http://bailey.svn.sourceforge.net/bailey/?rev=26&view=rev
Author: ning_li
Date: 2008-04-16 12:33:49 -0700 (Wed, 16 Apr 2008)
Log Message:
-----------
This is the first version supporting adding a node with log propagation and cleanup.
1 As a first version, only one node can be added at a time, for simplicity. That is, a node is added, the mapper is updated, and all hosts update their mapper before another node can be added. Cleanup ensures there is no loss of data if there is no host failure.
2 Though greatly simplified, it demonstrates the flow of events.
3 We may later break a state change "transaction" (e.g. adding/removing a node and the replication changes that come with it) into smaller "transactions". But the overall flow of events is similar, so hopefully this version is still useful.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Range.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostCommand.java
trunk/src/java/org/apache/bailey/ddb/HostStatus.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/HostToMasterProtocol.java
trunk/src/java/org/apache/bailey/ddb/Log.java
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/Master.java
trunk/src/java/org/apache/bailey/ddb/NodeHostMap.java
trunk/src/java/org/apache/bailey/ddb/NodeStatus.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleMaster.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleLog.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleMaster.java
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
trunk/src/java/org/apache/bailey/provider/ServiceDatabase.java
trunk/src/java/org/apache/bailey/util/SetValuedMap.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
trunk/src/test/org/apache/bailey/TestSimpleDbWithLog.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/ddb/NodeState.java
trunk/src/java/org/apache/bailey/ddb/RangeMismatchException.java
trunk/src/java/org/apache/bailey/ddb/simple/HostInstanceRegistry.java
trunk/src/java/org/apache/bailey/util/RefCounted.java
From: Doug C. <cu...@ap...> - 2008-04-10 23:42:15

Ning Li wrote:
> Yes, this would work. We simply put the code for master into host
> and make sure only one host is executing the master code at a time.
>
> Interesting. :) But let's use a master for now since it's easier to
> debug this way?

Keeping all the state in Zookeeper may also make it easy to debug,
since it can be read from any client. I actually don't think it should
be hard to have it both ways. We should write the master as a fail-over
loop in any case, so that multiple master-daemon processes can be
running on a cluster at once, but only one acting as master. Switching
masters frequently should be as simple as adding a counter in the
top-level daemon loop to decide when to give up the lock. The master
should be a different class from the host, and should have its own
main() method, so that it can be run independently, but it should also
be possible to run the master in the same JVM as a host. (We'll want
this for testing anyway.)

So these two simple properties, lightweight master failover and the
ability to run the master on a host, mean that we can choose to either
dedicate a machine to the master, or run the master daemon on every
host, switching frequently. I'd certainly like to preserve the latter
option, so let's keep it in mind as we code.

>> How does this work at system startup? How does B know not to go
>> online, providing its stale adds, yet A is permitted to provide its
>> data? Perhaps, at startup a node can respond to requests for its first
>> log entry number, but refuse requests for the logs themselves. Then B
>> can see that it must reload, and refuse requests from A to sync until
>> reload is complete. Does that work?
>
> What do you mean by "a node can respond to requests for its first log
> entry number, but refuse requests for the logs themselves"?

From A's perspective, when B comes online, B's data looks fine, since
B's log is complete. But B discovers, when it talks to A, that B's data
is obsolete. If A retrieves B's data before B discovers that it is
obsolete, then A would get stale adds. So B must block retrieval of its
data until it has determined whether it is valid to all of its
neighbors. At system startup this means that all nodes must wait for
their neighbors to come online so that they can determine whether their
own state is valid before permitting any synchronization.

Does that make any sense?

> Maybe at system startup, master (or a host) analyzes, for each node,
> its first log entry number and the log entry numbers it has synced with
> its overlapping nodes. Then it is derived from this analysis which
> nodes need complete reloads? Is this what you meant?

I think each node can determine that on its own at startup. It

1. Posts its range and log start number.
2. Waits a bit, so all other nodes have had a chance to post their data.
3. Decides if its index is valid, by checking all overlapping nodes'
   log start numbers and comparing them with the last sync'd log number
   to see if they've compacted their log.
4. Starts syncing with its neighbors.

I worry that there might be pathological cases where different nodes
were offline and/or compacted at different times, causing all replicas
of a range to be discarded.

Does this make any sense?

Doug

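The fail-over loop described above might be shaped like this;
DistributedLock is an assumed interface standing in for a
Zookeeper-backed lock recipe, not an actual Bailey class:

// Every host can run this daemon; only the lock holder acts as master,
// and a simple counter makes it give the lock up periodically so
// mastership moves around the cluster.
interface DistributedLock {
    void acquire() throws InterruptedException;  // blocks until held
    void release();
}

class MasterDaemon implements Runnable {
    private static final int CYCLES_BEFORE_YIELD = 10;
    private final DistributedLock masterLock;

    MasterDaemon(DistributedLock masterLock) { this.masterLock = masterLock; }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                masterLock.acquire();             // election: winner is master
                try {
                    for (int i = 0; i < CYCLES_BEFORE_YIELD; i++) {
                        analyzeRingAndAct();      // one analysis/action cycle
                    }
                } finally {
                    masterLock.release();         // hand mastership around
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // shut down cleanly
            }
        }
    }

    private void analyzeRingAndAct() { /* evaluate the ring state and act */ }
}
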
From: Doug C. <cu...@ap...> - 2008-04-10 22:38:13

Ning Li wrote:
> Let's say the replication is 2 and node A and node B are the two
> replicas of range [0, 4). Now we decide to split [0, 4) into two:
> [0, 2) and [2, 4). After the split, node A and node B will be the
> replicas of [0, 2) and node C and node D will be created to be the
> replicas of [2, 4).
>
> The split process works as follows:
> 1 Master tells some hosts to create node C and node D of [2, 4)
> from node A or node B
> 2 C and D tell the master they are ready
> 3 Master changes the map to A and B serving [0, 2) and C and D
> serving [2, 4)
> 4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
> 5 A and B tell the master they are done
>
> Care should be taken in synchronizing logs in the process.
>
> Here, it doesn't matter if at one point in time, node X serves R1,
> node Y serves R2, and R1 and R2 overlap. What does matter is
> that there is no overlap in the map which the master maintains.
> And because clients get the map from the master, clients do
> not see any overlap in the map, either.

Yes, that's one way to do it. Perhaps it is even the best way. But if
C reports it is ready to start serving [2, 4) before D is, then this
might be revealed to clients, to more rapidly reduce the load on A and
B. Might this also permit a less stateful master? My hunch is that the
more queues of pending splits and such that the master has to maintain,
the greater the chances of confusion in the face of failure.

Perhaps the master could simply issue step (1), asking C and D to start
serving [2, 4), and remember only that it had issued this request so
that it doesn't issue it again too soon. Then, when C & D tell the
master they're ready, the master can notice that [2, 4) is
over-replicated, and issue requests to A and B to stop serving this. In
this model, the master would not wait for both C & D to report, nor
would it wait for A and B to report their narrowed ranges before
republishing the ring.

In this approach, if D fails while replicating [2, 4) then the master
could still trim A to [0, 2), but not trim B yet. Also, if
previously-offline E comes back online already serving [0, 2) or
[2, 3), or some other overlapping range from some prior merge/split
state, it can be more easily re-integrated into the ring, without
discarding all of its data.

So, even in the case of a mostly-disjoint partitioning, there may be
benefits to permitting overlap.

> I'm coding the add-node process with the overlapping replication
> right now. It's taking some time. After it's finished, we can review
> the code and have a complete picture of the complexity and the
> performance implications.

Great!

Doug

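Noticing over-replication, as suggested above, can be as simple as
counting how many published ranges cover a point of the key space. A
sketch that ignores ring wrap-around for brevity:

import java.util.List;

public class ReplicationCheck {
    record Range(double start, double end) {}  // half-open, non-wrapping

    /** Counts how many published ranges cover the given ring position. */
    public static int copiesAt(List<Range> ranges, double point) {
        int copies = 0;
        for (Range r : ranges) {
            if (r.start() <= point && point < r.end()) copies++;
        }
        return copies;  // > N means over-replicated, < N under-replicated
    }
}
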
From: Ning L. <nin...@gm...> - 2008-04-10 17:34:42

On Tue, Apr 8, 2008 at 4:29 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > I thought only the master writes to Zookeeper. Hosts read from
> > Zookeeper. We should still have a light-weight master, no? Which
> > keeps track of the heartbeats from the hosts, detects host failures
> > and responds accordingly, and decides state changes (add/remove
> > nodes) when appropriate...
>
> As a thought experiment, I'm trying to see whether, with Zookeeper, we
> actually need such a master.
>
> We could replace heartbeats with an ephemeral file per node containing
> its status. (Ephemeral files disappear if their owner goes offline.)
>
> Any host could (a) grab a lock; (b) analyze the ring for potential
> add/removes; (c) post these requests to a directory. In effect, getting
> the lock is master election, and while a node is doing this analysis,
> it is the master. But the master moves around: each host has a "master"
> thread, and remains "master" only during one analysis/action cycle.
> This analysis cycle should not be very compute or i/o intensive.
>
> The advantage of this is that we wouldn't have to explicitly test or
> otherwise engineer master failover, since it would be happening all the
> time. All global state would be redundantly stored in Zookeeper.
>
> Could this work?

Yes, this would work. We simply put the code for master into host
and make sure only one host is executing the master code at a time.

Interesting. :) But let's use a master for now since it's easier to
debug this way?

> > In this example, say B synced with A to A's entry 11 and
> > with C to C's entry 12 (which includes adding doc X) before
> > it went offline. Because A expunged its log, now A's log
> > entry number starts from 21. B comes back online and
> > finds A's 21 is newer than last time B synced with it (11).
> > So B cannot recover but has to sync from scratch.
>
> My concern was that A might try to sync from B and get stale data. I
> think you're arguing that B should not go online, accepting sync
> requests, until it has itself sync'd with its neighbors. In this case,
> incremental sync would fail and B would sync from scratch, removing any
> stale adds.
>
> How does this work at system startup? How does B know not to go online,
> providing its stale adds, yet A is permitted to provide its data?
> Perhaps, at startup a node can respond to requests for its first log
> entry number, but refuse requests for the logs themselves. Then B can
> see that it must reload, and refuse requests from A to sync until reload
> is complete. Does that work?

What do you mean by "a node can respond to requests for its first log
entry number, but refuse requests for the logs themselves"?

Maybe at system startup, master (or a host) analyzes, for each node,
its first log entry number and the log entry numbers it has synced with
its overlapping nodes. Then it is derived from this analysis which nodes
need complete reloads? Is this what you meant?

Ning

From: Ning L. <nin...@gm...> - 2008-04-10 17:11:17

On Tue, Apr 8, 2008 at 5:39 PM, Doug Cutting <cu...@ap...> wrote:
> > Halve a hot range? Or some more sophisticated algorithm.
> > We can still have nodes evenly around the ring at the start.
> > The partitioning algorithm doesn't change. Only the replication
> > algorithm is different.
>
> But we'll need to dynamically balance things, since load may be uneven
> around the ring, especially with user-provided ring positions. And once
> we permit splitting, don't we have to handle overlapping ranges?

With non-overlapping replication, if we decide to split a range, we
split all the replicas of the range. Same as with overlapping
replication, we use the master to coordinate the process.

> > What do you mean by "without permitting it"?
>
> Just that, if we split a node in a disjoint partitioning then we'll
> have overlapping ranges during the split. Switching all clients to
> search the new half-sized nodes won't be atomic. So I'm not sure that
> disjoint ranges really simplify things fundamentally.

Let's say the replication is 2 and node A and node B are the two
replicas of range [0, 4). Now we decide to split [0, 4) into two:
[0, 2) and [2, 4). After the split, node A and node B will be the
replicas of [0, 2) and node C and node D will be created to be the
replicas of [2, 4).

The split process works as follows:
1 Master tells some hosts to create node C and node D of [2, 4)
from node A or node B
2 C and D tell the master they are ready
3 Master changes the map to A and B serving [0, 2) and C and D
serving [2, 4)
4 Master tells A and B to stop serving [2, 4) and only serve [0, 2)
5 A and B tell the master they are done

Care should be taken in synchronizing logs in the process.

Here, it doesn't matter if at one point in time, node X serves R1,
node Y serves R2, and R1 and R2 overlap. What does matter is
that there is no overlap in the map which the master maintains.
And because clients get the map from the master, clients do
not see any overlap in the map, either.

What do you think?

> > I think this is a fair comparison so the search performance should
> > be comparable. What do you think?
>
> Good point.
>
> One other difference is that, with overlapping ranges, when a node
> fails, its load is spread more evenly over the remaining nodes. In
> this example, the load of any single node would become shared by two
> other nodes rather than one, so each node only needs 50% headroom
> rather than 100%.

Yes, that's the advantage of the overlapping replication scheme.

I'm coding the add-node process with the overlapping replication
right now. It's taking some time. After it's finished, we can review
the code and have a complete picture of the complexity and the
performance implications.

Ning

From: <Yo...@us...> - 2008-04-10 03:40:46
Revision: 25
http://bailey.svn.sourceforge.net/bailey/?rev=25&view=rev
Author: Yonik
Date: 2008-04-09 20:40:49 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
Replace String.hashCode() with LOOKUP3 (fast and well studied) for getDefaultPosition. String.hashCode() isn't suitable if you are using all the bits, since most of the changes happen on the right. Also made a version of LOOKUP3 that is well defined for implementations in other languages.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Document.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/util/Hash.java
trunk/src/test/org/apache/bailey/util/
trunk/src/test/org/apache/bailey/util/TestHash.java
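
The clustering problem the log message describes is easy to demonstrate
with a few arbitrary keys (this demo is not part of the commit):

// For short, similar keys, String.hashCode() changes mostly in the
// low-order bits, so ring positions derived from it cluster together.
// A well-mixed hash such as Bob Jenkins' lookup3 (the Hash class added
// here) spreads changes across all 32 bits.
public class HashCodeDemo {
    public static void main(String[] args) {
        for (int i = 1; i <= 4; i++) {
            String key = "doc" + i;
            System.out.printf("%s -> %08x%n", key, key.hashCode());
        }
        // Prints four values that differ only in their lowest bits,
        // because the final character carries a multiplier of 1.
    }
}
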
From: <yo...@us...> - 2008-04-10 03:26:36
Revision: 24
http://bailey.svn.sourceforge.net/bailey/?rev=24&view=rev
Author: yonik
Date: 2008-04-09 20:26:41 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
one more time... test commit with new email
Modified Paths:
--------------
trunk/src/java/overview.html