From: <yo...@us...> - 2008-04-10 03:23:54
Revision: 23
http://bailey.svn.sourceforge.net/bailey/?rev=23&view=rev
Author: yonik
Date: 2008-04-09 20:23:56 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
test commit w/ new email
Modified Paths:
--------------
trunk/src/java/overview.html
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <yo...@us...> - 2008-04-10 03:18:12
Revision: 22
http://bailey.svn.sourceforge.net/bailey/?rev=22&view=rev
Author: yonik
Date: 2008-04-09 20:18:17 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
test commit
Modified Paths:
--------------
trunk/src/java/overview.html
From: Doug C. <cu...@ap...> - 2008-04-09 20:57:41

Yonik Seeley wrote:
> Everything seemed easy enough until you brought up network partitioning.
> In that event, could each partition elect a master? We would then
> have to handle merging of multiple different rings when the network
> was repaired, right?

Zookeeper is supposed to handle hard stuff like this. Multiple masters should be impossible. Clients and nodes will have stale rings at times, but they should be able to continue w/o compromising things.

Doug

From: Yonik S. <yo...@ap...> - 2008-04-09 17:45:02

Everything seemed easy enough until you brought up network partitioning. In that event, could each partition elect a master? We would then have to handle merging of multiple different rings when the network was repaired, right?

-Yonik

From: Yonik S. <yo...@ap...> - 2008-04-09 17:11:42

I'm trying to catch up with all the activity... please ignore if this has been addressed in later emails.

On Wed, Apr 2, 2008 at 1:36 PM, Doug Cutting <cu...@ap...> wrote:
> Second, I don't yet see a way around checking versions when documents
> are added or deleted. The ugliest bit is that we have to keep track of
> the version of every document that's ever been deleted, in case a
> long-offline node comes online and reports a stale addition. That table
> could grow without bound. Sigh. Do you see a way around this?
>
> Perhaps a node could discard old deletions after a time, keeping track
> of the log entry number of the oldest retained deletion. Attempts to
> sync starting with an older entry number should be rejected and should
> trigger a complete copy-based replacement of the stale index.

Right, this seems like the right approach. We shouldn't have to support incremental syncs forever.

> The hazard is that, if a document is added to a single node, then that
> node goes offline for a long time, then, when it comes online, the
> addition will be lost. Not great.

If a document is added to a single node, it could be lost permanently anyway (the node may never come back). That's why an add should go to multiple nodes.

-Yonik

From: Doug C. <cu...@ap...> - 2008-04-08 21:39:08

Ning Li wrote:
> Actually, the mechanism I want copy and ranged copy to use is
> not checkpoint copying. This is because ranged copy may only
> want a small sub-range of the checkpoint (in removing a node).
> So, what if a RangedDatabase supports getDocs(Range range),
> where full documents in the range are retrieved? To be conservative,
> the starting entry numbers are set to the ones at the beginning
> of the retrieval. Of course we need to re-parse the documents,
> but we do that in log propagation as well. So, which approach
> do you think is better?

I agree. We shouldn't copy checkpoints, at least not initially, but rather copy documents & reindex them.

>> Another downside is that load balancing is trickier. With disjoint
>> ranges, how would we split a hot range? BTW, I don't think we'll
>> actually randomly assign nodes, but rather try to initially add them
>> evenly around the ring, then use load balancing to split hot spots.
>
> Halve a hot range? Or some more sophisticated algorithm.
> We can still have nodes evenly around the ring at the start.
> The partitioning algorithm doesn't change. Only the replication
> algorithm is different.

But we'll need to dynamically balance things, since load may be uneven around the ring, especially with user-provided ring positions. And once we permit splitting, don't we have to handle overlapping ranges?

>> So the reasons for overlap are: (1) we can't split ranges without
>> permitting it, and (2) it makes search faster. Do you agree?
>
> What do you mean by "without permitting it"?

Just that, if we split a node in a disjoint partitioning, then we'll have overlapping ranges during the split. Switching all clients to search the new half-sized nodes won't be atomic. So I'm not sure that disjoint ranges really simplify things fundamentally.

> As to 2), I'm not sure overlap makes search faster.
> Assume the scope of the positions is [0, 400) and the replication
> level is 2. Let's say we have 4 nodes with overlap:
> - node 0 serves [0, 200)
> - node 1 serves [100, 300)
> - node 2 serves [200, 400)
> - node 3 serves [300, 100)
> Without overlap, it's only fair that nodes serve ranges of the same
> size as with overlap, so we only have 2 nodes:
> - node 0 serves [0, 200)
> - node 1 serves [200, 400)
> Each node is replicated twice, so it's 4 nodes in total.
>
> I think this is a fair comparison, so the search performance should
> be comparable. What do you think?

Good point. One other difference is that, with overlapping ranges, when a node fails, its load is spread more evenly over the remaining nodes. In this example, the load of any single node would become shared by two other nodes rather than one, so each node only needs 50% headroom rather than 100%.

Doug

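Doug's headroom observation can be checked numerically. The sketch below (Python for brevity; Bailey itself is Java, and all names here are illustrative, not project API) uses the four overlapping ranges from the thread and assumes each position's traffic splits evenly among its live replicas: killing node 1 raises the load of its two overlapping neighbors by exactly 50% and leaves node 3 untouched.

```python
RING = 400  # circular position space [0, RING)

# Overlapping layout from the thread: 4 nodes, each serving a
# half-ring range offset by a quarter (replication level 2).
OVERLAP = {0: (0, 200), 1: (100, 300), 2: (200, 400), 3: (300, 100)}

def in_range(pos, rng):
    """True if pos falls in the (possibly wrapping) range [start, end)."""
    start, end = rng
    if start < end:
        return start <= pos < end
    return pos >= start or pos < end  # range wraps past RING

def load(nodes, failed=()):
    """Fraction of ring traffic each live node serves, assuming each
    position's queries are split evenly among its live replicas."""
    shares = {n: 0.0 for n in nodes if n not in failed}
    for pos in range(RING):
        live = [n for n, r in nodes.items()
                if n not in failed and in_range(pos, r)]
        for n in live:
            shares[n] += 1.0 / len(live)
    return {n: s / RING for n, s in shares.items()}

# Healthy ring: every node carries 1/4 of the traffic. After node 1
# fails, [100, 200) falls entirely on node 0 and [200, 300) entirely
# on node 2, so each grows from 0.25 to 0.375 (+50%), while node 3
# stays at 0.25.
```

With disjoint ranges and whole-node replicas, the one surviving replica of a failed node would instead absorb 100% more load, which is the contrast Doug is drawing.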
From: Doug C. <cu...@ap...> - 2008-04-08 20:29:32

Ning Li wrote:
> I thought only the master writes to Zookeeper. Hosts read from Zookeeper.
> We should still have a light-weight master, no? Which keeps track of
> the heartbeats from the hosts, detects host failures and responds
> accordingly, and decides state changes (add/remove nodes) when
> appropriate...

As a thought experiment, I'm trying to see whether, with Zookeeper, we actually need such a master. We could replace heartbeats with an ephemeral file per node containing its status. (Ephemeral files disappear if their owner goes offline.) Any host could (a) grab a lock; (b) analyze the ring for potential adds/removes; (c) post these requests to a directory. In effect, getting the lock is master election, and while a node is doing this analysis, it is the master. But the master moves around: each host has a "master" thread, and remains "master" only during one analysis/action cycle. This analysis cycle should not be very compute- or i/o-intensive.

The advantage of this is that we wouldn't have to explicitly test or otherwise engineer master failover, since it would be happening all the time. All global state would be redundantly stored in Zookeeper. Could this work?

> In this example, say B synced with A to A's entry 11 and
> with C to C's entry 12 (which includes adding doc X) before
> it went offline. Because A expunged its log, now A's log
> entry number starts from 21. B comes back online and
> finds A's 21 is newer than last time B synced with it (11).
> So B cannot recover but has to sync from scratch.

My concern was that A might try to sync from B and get stale data. I think you're arguing that B should not go online, accepting sync requests, until it has itself sync'd with its neighbors. In this case, incremental sync would fail and B would sync from scratch, removing any stale adds. How does this work at system startup? How does B know not to go online, providing its stale adds, yet A is permitted to provide its data?

Perhaps, at startup, a node can respond to requests for its first log entry number, but refuse requests for the logs themselves. Then B can see that it must reload, and refuse requests from A to sync until the reload is complete. Does that work?

Doug

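Doug's "roving master" idea (ephemeral status files plus a lock, where whoever holds the lock briefly *is* the master) can be sketched as a toy in-process model. This is Python rather than Bailey's Java, and the class, method names, and repair policy are all invented for illustration; a real deployment would use Zookeeper ephemeral znodes and a Zookeeper lock recipe rather than an in-process dict and `threading.Lock`.

```python
import threading

class ToyCoordinator:
    """In-process stand-in for the Zookeeper pieces described in the
    thread: per-node ephemeral status "files" plus one analysis lock.
    Illustrative only -- not Bailey's or Zookeeper's actual API."""

    def __init__(self):
        self.status = {}       # node -> status "file" (ephemeral)
        self.requests = []     # add/remove requests posted to a "directory"
        self._lock = threading.Lock()

    def publish(self, node, info):
        self.status[node] = info          # written by each live node

    def node_died(self, node):
        self.status.pop(node, None)       # ephemeral file vanishes

    def analysis_cycle(self, replication=2):
        """One master cycle: grab the lock, inspect the ring, post
        repair requests. Whoever runs this is the master, briefly."""
        with self._lock:
            live = len(self.status)
            if live < replication:        # toy stand-in for real ring analysis
                self.requests.append(("add-nodes", replication - live))

coord = ToyCoordinator()
coord.publish("hostA", "ok")
coord.publish("hostB", "ok")
coord.analysis_cycle()       # healthy ring: nothing posted
coord.node_died("hostB")     # hostB's ephemeral status disappears
coord.analysis_cycle()       # the next host to grab the lock repairs it
```

The point of the design, as the message says, is that failover needs no special engineering: mastership is re-acquired on every cycle, so there is no long-lived master to fail over from.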
From: Ning L. <nin...@gm...> - 2008-04-07 17:55:26

On Fri, Apr 4, 2008 at 7:26 PM, Doug Cutting <cu...@ap...> wrote:
> I've been going back-and-forth in my head to try to decide how we want
> to use Zookeeper and whether we should store the current ring there.
> This is certainly an argument for that. It makes for a more stateful
> master than I'd hoped, but that might not be bad.
>
> An extreme idea would be to make Zookeeper *be* the master. There would
> be no XxxToMaster protocols: instead everything would be coordinated
> through Zookeeper files. Letting nodes directly access Zookeeper makes
> the master less of a bottleneck, since Zookeeper itself replicates data.

I thought only the master writes to Zookeeper. Hosts read from Zookeeper. We should still have a light-weight master, no? Which keeps track of the heartbeats from the hosts, detects host failures and responds accordingly, and decides state changes (add/remove nodes) when appropriate...

> I don't see it yet. Log entry numbers are per-node, and can't be
> compared across nodes, right?
>
> Node A syncs from B log entries 1-10.
> Document X is added to C.
> A and B both sync X from C.
> B goes offline.
> X is deleted from C.
> A syncs X's deletion from C.
> A expunges its logs.
> B comes back online.
> A tries to sync events after 10 from B.
>
> How does A know to ignore the addition of X?

Log entry numbers are per-node. Each node remembers the log entry numbers it has synced with its overlapping nodes. In this example, say B synced with A to A's entry 11 and with C to C's entry 12 (which includes adding doc X) before it went offline. Because A expunged its log, A's log entry numbers now start from 21. B comes back online and finds A's 21 is newer than the last entry B synced from it (11). So B cannot recover incrementally but has to sync from scratch. Does it make sense?

Ning

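Ning's rule — each node remembers, per neighbor, the last entry number it synced, and falls back to a full copy whenever the neighbor has expunged past that point — can be sketched as follows (Python for brevity; Bailey is Java, and `NodeLog`/`plan_sync` are illustrative names, not project API):

```python
class NodeLog:
    """Toy per-node log with expungement. Entry numbers are per-node
    and only meaningful against the same node's log."""

    def __init__(self):
        self.entries = {}   # entry number -> operation
        self.next_no = 1

    def append(self, op):
        self.entries[self.next_no] = op
        self.next_no += 1

    def expunge_through(self, n):
        """Discard entries <= n (e.g. old deletions past the retention window)."""
        for k in [k for k in self.entries if k <= n]:
            del self.entries[k]

    def oldest_retained(self):
        return min(self.entries, default=self.next_no)

def plan_sync(synced_to, neighbor_log):
    """B remembers it synced neighbor A through entry `synced_to`.
    If A has expunged beyond that point, incremental replay would
    silently miss events, so B must rebuild from scratch instead."""
    if neighbor_log.oldest_retained() > synced_to + 1:
        return "full-copy"
    return [no for no in sorted(neighbor_log.entries) if no > synced_to]

a = NodeLog()
for op in ["add X", "del X", "add Y"]:
    a.append(op)
incr = plan_sync(0, a)      # nothing expunged: replay entries 1..3
a.expunge_through(2)        # A discards its old entries
stale = plan_sync(0, a)     # entries 1-2 are gone: must rebuild
```

This is exactly the scenario in the quoted exchange: after A expunges through entry 20, a B that last synced at 11 sees A's oldest retained entry is newer and rebuilds rather than recovering incrementally.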
From: <ni...@us...> - 2008-04-07 17:35:40
Revision: 21
http://bailey.svn.sourceforge.net/bailey/?rev=21&view=rev
Author: ning_li
Date: 2008-04-07 10:35:38 -0700 (Mon, 07 Apr 2008)
Log Message:
-----------
Create the "provider" package and the "ServiceDatabase" class in it. This is the interface that a service provider for a single-node database should implement. The commit also includes various changes related to it.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/LogEntry.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
trunk/src/test/org/apache/bailey/TestHeapDb.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/ddb/WriteAction.java
trunk/src/java/org/apache/bailey/provider/
trunk/src/java/org/apache/bailey/provider/ServiceDatabase.java
Removed Paths:
-------------
trunk/src/java/org/apache/bailey/ddb/IndexAction.java
From: Doug C. <cu...@ap...> - 2008-04-04 23:26:18

Ning Li wrote:
> We were thinking of storing the "metadata" in Zookeeper, right?
> All the nodes (ever) created, their ranges, their current log
> entry number, the starting entry numbers of their overlapping
> nodes, the current ring... So when a master is restarted,
> it starts from exactly where it was before it went down by
> retrieving the "metadata" from Zookeeper.

I've been going back-and-forth in my head to try to decide how we want to use Zookeeper and whether we should store the current ring there. This is certainly an argument for that. It makes for a more stateful master than I'd hoped, but that might not be bad.

An extreme idea would be to make Zookeeper *be* the master. There would be no XxxToMaster protocols: instead everything would be coordinated through Zookeeper files. Letting nodes directly access Zookeeper makes the master less of a bottleneck, since Zookeeper itself replicates data.

> When an offline node comes back up, we verify with Zookeeper
> that it is a valid node in a valid state. Then we check if we
> can patch up its data by checking whether the overlapping
> nodes and their logs still have the entries back to the
> starting entry numbers of the node. If not, we sync from
> scratch.

I don't see it yet. Log entry numbers are per-node, and can't be compared across nodes, right?

Node A syncs from B log entries 1-10.
Document X is added to C.
A and B both sync X from C.
B goes offline.
X is deleted from C.
A syncs X's deletion from C.
A expunges its logs.
B comes back online.
A tries to sync events after 10 from B.

How does A know to ignore the addition of X?

Doug

From: Ning L. <nin...@gm...> - 2008-04-04 22:18:40

On Fri, Apr 4, 2008 at 12:22 PM, Doug Cutting <cu...@ap...> wrote:
> It would be nice to be able to eliminate long-offline nodes, but I don't
> yet see how.
>
> At startup we want nodes to announce their content to the master. Not
> all nodes will start at exactly the same time. (Note also that, if the
> master fails then nodes will also re-elect a new master and post their
> state there. Search and indexing should continue uninterrupted through
> master moves.) So, when a master first starts it needs to avoid
> modifying the ring for a time, until it assumes that all nodes are up.
> We might even have nodes randomly delay their first report, so that the
> master isn't overwhelmed.
>
> If the network is partitioned then the master would allocate new nodes
> to underserviced regions. When the network is repaired, we have the
> choice of ignoring the data on the nodes that were replaced, or
> synchronizing it with what has transpired in their absence. In the case
> where all replicas of a region were offline, we would want to use
> their data when they come online (like the system restart case), but
> when only a single replica was offline we might simply ignore its data
> and let it sync from scratch. However, it may not be easy to distinguish
> these cases. If all replicas go offline, and then we add new nodes to the
> region, we'd need to remember that, at some point in the past, all nodes
> in that region were offline. If the master was restarted during this
> time, it will be even harder to keep track of this.

We were thinking of storing the "metadata" in Zookeeper, right? All the nodes (ever) created, their ranges, their current log entry number, the starting entry numbers of their overlapping nodes, the current ring... So when a master is restarted, it starts from exactly where it was before it went down by retrieving the "metadata" from Zookeeper.

When an offline node comes back up, we verify with Zookeeper that it is a valid node in a valid state. Then we check if we can patch up its data by checking whether the overlapping nodes and their logs still have the entries back to the starting entry numbers of the node. If not, we sync from scratch. What do you think?

Ning

From: Ning L. <nin...@gm...> - 2008-04-04 19:04:01

On Fri, Apr 4, 2008 at 12:22 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > Database as the application interface and RangedDatabase as the
> > service-provider interface sounds good. The methods in RangedDatabase
> > will be very similar to those in the current RangedDatabase?
>
> Yes. HeapDatabase should extend this now though.

Yes. I'm thinking we should put RangedDatabase in a separate package? Is "provider" a good name for the package? Hmm, it should be different from the current RangedDatabase in that it should not include the node status and the log?

I need to think about the other topic of long-offline nodes...

Ning

From: Doug C. <cu...@ap...> - 2008-04-04 16:22:04

Ning Li wrote:
> Database as the application interface and RangedDatabase as the
> service-provider interface sounds good. The methods in RangedDatabase
> will be very similar to those in the current RangedDatabase?

Yes. HeapDatabase should extend this now though.

> I've been thinking about this. Finally, I think this is a possibility:
> 1 The database records and logs a deleted document and its version.
> 2 If all the replicas have recorded and logged the deleted document
>   with the same version number, the document can be removed from
>   the database. This is because any new version of the document
>   that comes after will have a larger version number.
>
> Does this sound right? Things are more complicated when we
> consider state changes...

It sounds right except for the case of a long-offline node coming back online. More on that below...

> Do we allow a node to go offline for a long time? I thought we'd consider
> the node gone down and pick a replacement for it.

It would be nice to be able to eliminate long-offline nodes, but I don't yet see how.

At startup we want nodes to announce their content to the master. Not all nodes will start at exactly the same time. (Note also that, if the master fails then nodes will also re-elect a new master and post their state there. Search and indexing should continue uninterrupted through master moves.) So, when a master first starts it needs to avoid modifying the ring for a time, until it assumes that all nodes are up. We might even have nodes randomly delay their first report, so that the master isn't overwhelmed.

If the network is partitioned then the master would allocate new nodes to underserviced regions. When the network is repaired, we have the choice of ignoring the data on the nodes that were replaced, or synchronizing it with what has transpired in their absence. In the case where all replicas of a region were offline, we would want to use their data when they come online (like the system restart case), but when only a single replica was offline we might simply ignore its data and let it sync from scratch. However, it may not be easy to distinguish these cases. If all replicas go offline, and then we add new nodes to the region, we'd need to remember that, at some point in the past, all nodes in that region were offline. If the master was restarted during this time, it will be even harder to keep track of this.

I'm still hopeful that we can come up with a heuristic for this, but I need to think more about what it should be.

Doug

From: Ning L. <nin...@gm...> - 2008-04-03 20:40:07

On Wed, Apr 2, 2008 at 1:36 PM, Doug Cutting <cu...@ap...> wrote:
> First, I think we need to add an abstract Database service-provider
> interface, called perhaps RangeDatabase, that's different from Database,
> adding methods that will be critical to good performance and that must
> be implemented by, e.g., HeapDatabase and LuceneDatabase.

Database as the application interface and RangedDatabase as the service-provider interface sounds good. The methods in RangedDatabase will be very similar to those in the current RangedDatabase?

> Second, I don't yet see a way around checking versions when documents
> are added or deleted. The ugliest bit is that we have to keep track of
> the version of every document that's ever been deleted, in case a
> long-offline node comes online and reports a stale addition. That table
> could grow without bound. Sigh. Do you see a way around this?

I've been thinking about this. Finally, I think this is a possibility:

1. The database records and logs a deleted document and its version.
2. If all the replicas have recorded and logged the deleted document
   with the same version number, the document can be removed from
   the database. This is because any new version of the document
   that comes after will have a larger version number.

Does this sound right? Things are more complicated when we consider state changes...

> Perhaps a node could discard old deletions after a time, keeping track
> of the log entry number of the oldest retained deletion. Attempts to
> sync starting with an older entry number should be rejected and should
> trigger a complete copy-based replacement of the stale index. The
> hazard is that, if a document is added to a single node, then that node
> goes offline for a long time, then, when it comes online, the addition
> will be lost. Not great.

Do we allow a node to go offline for a long time? I thought we'd consider the node gone down and pick a replacement for it.

Ning

From: <ni...@us...> - 2008-04-03 16:35:32
Revision: 20
http://bailey.svn.sourceforge.net/bailey/?rev=20&view=rev
Author: ning_li
Date: 2008-04-03 09:35:33 -0700 (Thu, 03 Apr 2008)
Log Message:
-----------
Add "position" to Document. Add getDoc(id, position) and search(range, query, maxHits) to Database. With all the changes because of the two additions.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/Document.java
trunk/src/java/org/apache/bailey/ddb/Client.java
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Log.java
trunk/src/java/org/apache/bailey/ddb/LogEntry.java
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/NodeInfo.java
trunk/src/java/org/apache/bailey/ddb/RangeResults.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleLog.java
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
trunk/src/test/org/apache/bailey/TestHeapDb.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
trunk/src/test/org/apache/bailey/TestSimpleDbWithLog.java
trunk/src/test/org/apache/bailey/ddb/TestRange.java
trunk/src/test/org/apache/bailey/ddb/TestRing.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/Range.java
Removed Paths:
-------------
trunk/src/java/org/apache/bailey/ddb/Range.java
trunk/src/java/org/apache/bailey/util/HashUtil.java
From: Ning L. <nin...@gm...> - 2008-04-02 23:35:53

On Wed, Apr 2, 2008 at 1:55 PM, Doug Cutting <cu...@ap...> wrote:
> This looks great! We can implement node copying with this too, right,
> just by setting X'=X? And 2.2 could be done as a database checkpoint
> copy & filter. In the case where X'=X, this would result in the same
> algorithm as a node copy, since the filter would be a no-op. Or am I
> missing some fundamental difference?

Yes, copy and ranged copy could/should use the same mechanism.

> Initially we can avoid implementing checkpoint copying, since I don't
> think it easily fits into an RPC protocol. Perhaps it will work by
> specifying a set of URLs to fetch or something?

Actually, the mechanism I want copy and ranged copy to use is not checkpoint copying. This is because ranged copy may only want a small sub-range of the checkpoint (in removing a node). So, what if a RangedDatabase supports getDocs(Range range), where full documents in the range are retrieved? To be conservative, the starting entry numbers are set to the ones at the beginning of the retrieval. Of course we need to re-parse the documents, but we do that in log propagation as well. So, which approach do you think is better?

> Another downside is that load balancing is trickier. With disjoint
> ranges, how would we split a hot range? BTW, I don't think we'll
> actually randomly assign nodes, but rather try to initially add them
> evenly around the ring, then use load balancing to split hot spots.

Halve a hot range? Or some more sophisticated algorithm. We can still have nodes evenly around the ring at the start. The partitioning algorithm doesn't change. Only the replication algorithm is different.

> Also, with disjoint ranges, each search would require searching more
> indexes. The search performance impact would thus be significant.
>
> So the reasons for overlap are: (1) we can't split ranges without
> permitting it, and (2) it makes search faster. Do you agree?

What do you mean by "without permitting it"?

As to 2), I'm not sure overlap makes search faster. Assume the scope of the positions is [0, 400) and the replication level is 2. Let's say we have 4 nodes with overlap:

- node 0 serves [0, 200)
- node 1 serves [100, 300)
- node 2 serves [200, 400)
- node 3 serves [300, 100)

Without overlap, it's only fair that nodes serve ranges of the same size as with overlap, so we only have 2 nodes:

- node 0 serves [0, 200)
- node 1 serves [200, 400)

Each node is replicated twice, so it's 4 nodes in total. I think this is a fair comparison, so the search performance should be comparable. What do you think?

Ning

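Ning's fairness argument can be checked mechanically: under both layouts, every ring position is held by exactly two of the four nodes, so a query over any position touches the same number of copies. A small sketch (Python for brevity; Bailey is Java, and these helper names are invented for illustration):

```python
RING = 400  # position space [0, RING), as in the example

def covers(pos, start, end):
    """Membership in a wrapping half-open range on the ring."""
    return start <= pos < end if start < end else (pos >= start or pos < end)

# Ning's two layouts, replication level 2 (ranges from the thread):
overlap = [(0, 200), (100, 300), (200, 400), (300, 100)]
disjoint = [(0, 200), (200, 400)] * 2   # two disjoint ranges, two replicas each

def replicas_at(pos, layout):
    """How many nodes hold a copy of the given position."""
    return sum(covers(pos, s, e) for s, e in layout)

# Both layouts place exactly 2 copies of every position across 4 nodes,
# which is why the raw search cost comes out comparable.
```

What the count does not capture is Doug's follow-up point: the two layouts differ in how a failed node's load redistributes, not in steady-state search cost.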
From: Doug C. <cu...@ap...> - 2008-04-02 18:01:17

Ning Li wrote:
> On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
>> I think we should add a Range element to Query that narrows it. But we
>> first need to define what it means in terms of other public API
>> elements. I think we define it in terms of the document's "position"
>> field, which is the hashCode of its id by default, but can be explicitly
>> specified. Does that sound right?
>
> I realize we'll have to add getDoc(String id, int position) in addition
> to adding search(Range range, Query q, int maxHits) to Database?

Yes. getDoc(id) should not be abstract, but should use the default hash function and call getDoc(id, position). Applications that explicitly specify positions when adding documents must always call getDoc(id, position). We should perhaps mark the methods that take an explicit position as "Expert" or somesuch?

Doug

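The convention being settled here — a document's ring position defaults to a hash of its id, with getDoc(id) delegating to the "expert" getDoc(id, position) — can be sketched as follows. This is Python rather than Bailey's Java, the hash function is a stand-in (Bailey would presumably use Java's String.hashCode), and the class and method names are illustrative only:

```python
import hashlib

class Database:
    """Sketch of the position convention from the thread: position
    defaults to a hash of the id, but callers that chose explicit
    positions when adding must pass them back when getting."""

    def __init__(self):
        self._docs = {}   # (id, position) -> document

    @staticmethod
    def default_position(doc_id, ring=400):
        # Deterministic stand-in for Java's String.hashCode().
        digest = hashlib.md5(doc_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % ring

    def add_doc(self, doc_id, doc, position=None):
        if position is None:
            position = self.default_position(doc_id)
        self._docs[(doc_id, position)] = doc

    def get_doc(self, doc_id, position=None):
        # "Expert" variant takes an explicit position; the plain
        # getDoc(id) just delegates with the default hash position.
        if position is None:
            position = self.default_position(doc_id)
        return self._docs.get((doc_id, position))

db = Database()
db.add_doc("doc1", {"body": "hello"})             # default position
db.add_doc("doc2", {"body": "pinned"}, position=7)  # explicit position
```

Note how "doc2" can only be fetched via `get_doc("doc2", position=7)`, which is exactly why Doug wants the explicit-position methods flagged as expert-only.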
From: Doug C. <cu...@ap...> - 2008-04-02 17:55:48

Ning Li wrote:
> Assume the replication level is 3. Let's look at adding
> a node first. Assume a new node X' is added at the
> position X' between X and X+1 and assigned to host A.
> The neighboring nodes will be affected and switch from:
>   node X-2 serves range [X-2, X+1)
>   node X-1 serves range [X-1, X+2)
>   node X serves range [X, X+3)
> to:
>   node X-2 serves range [X-2, X')
>   node X-1 serves range [X-1, X+1)
>   node X serves range [X, X+2)
>   node X' serves range [X', X+3)
>
> The process works as follows:
> 1 Master tells host A to start serving node X'
> 2 Host A copies [X', X+3) from node X (or other
>   node(s) which cover(s) [X', X+3))
>   2.1 Host A records the starting entry numbers of
>       the overlapping nodes of node X
>   2.2 Host A retrieves documents in range [X', X+3)
>       from node X (a new functionality)
>   2.3 Host A builds node X' from retrieved documents
>   2.4 Host A processes log entries from the
>       overlapping nodes starting from the entry
>       numbers recorded in 2.1
> 3 Host A tells master it's now serving node X'
> 4 Master changes the map and tells X, X-1, X-2
>   their new smaller ranges
> 5 Nodes X, X-1, X-2 gradually remove the
>   documents no longer in their ranges
>
> Step 2.2 achieves G2 and step 2.1 achieves G3.
> Achieving G1 is a bit complicated but possible.
>
> The process to remove a node is similar and again
> the algorithm to achieve G1 is complicated.
>
> What do you think?

This looks great! We can implement node copying with this too, right, just by setting X'=X? And 2.2 could be done as a database checkpoint copy & filter. In the case where X'=X, this would result in the same algorithm as a node copy, since the filter would be a no-op. Or am I missing some fundamental difference?

Initially we can avoid implementing checkpoint copying, since I don't think it easily fits into an RPC protocol. Perhaps it will work by specifying a set of URLs to fetch or something?

> The algorithm to achieve G1 can be simpler if we
> change to a simpler replication scheme - instead
> of nodes having overlapping ranges, nodes do not
> overlap and there are R replicas for each node.
> The down side is that when we add a new node,
> we need to create R replicas for the new node.

Another downside is that load balancing is trickier. With disjoint ranges, how would we split a hot range? BTW, I don't think we'll actually randomly assign nodes, but rather try to initially add them evenly around the ring, then use load balancing to split hot spots.

Also, with disjoint ranges, each search would require searching more indexes. The search performance impact would thus be significant.

So the reasons for overlap are: (1) we can't split ranges without permitting it, and (2) it makes search faster. Do you agree?

Doug

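The before/after range tables in Ning's quoted proposal follow one rule: with replication level R, each node serves from its own position up to the position R steps clockwise around the ring. A sketch of that rule reproduces the tables mechanically (Python for brevity; Bailey is Java, and the function name and the concrete positions standing in for X-2 ... X+3 are illustrative):

```python
def ranges(positions, replication=3):
    """Each node serves [own position, position `replication` steps
    clockwise) -- the rule behind the before/after tables above."""
    pts = sorted(positions)
    n = len(pts)
    return {p: (p, pts[(i + replication) % n]) for i, p in enumerate(pts)}

# Stand-in integers for the symbolic positions in the example:
# X-2=8, X-1=9, X=10, X+1=11, X+2=12, X+3=13, with neighbors around.
before = ranges([7, 8, 9, 10, 11, 12, 13, 14])
after = ranges([7, 8, 9, 10, 10.5, 11, 12, 13, 14])  # X' = 10.5 inserted

# Matching the thread's tables: before the insert, node X=10 serves
# [10, 13) = [X, X+3); afterwards it shrinks to [10, 12) = [X, X+2),
# node X-2=8 shrinks to [8, 10.5) = [X-2, X'), and the new node
# X'=10.5 serves [10.5, 13) = [X', X+3).
```

Steps 4 and 5 of the proposal then amount to publishing the `after` map and letting the shrunken nodes lazily drop documents outside their new ranges.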
From: Doug C. <cu...@ap...> - 2008-04-02 17:36:40

Ning Li wrote:
> Yes, we have to log and propagate deletes correctly.
>
> What I'm worried about is the impact of the version check on the index
> build performance. As you said, for general synchronization, we always
> need to check versions. After we check a database/log and decide to
> add/delete a document, we call Database's addDoc/removeDoc method.
> In this addDoc/removeDoc method, we first parse the document for
> addDoc. Then, in the same critical section, we have to check again if
> it is the latest version and apply the add/delete. Is checking the log
> for the version of a delete expensive here? And a log is not part of
> the Database abstraction, but part of RangedDatabase...

First, I think we need to add an abstract Database service-provider interface, called perhaps RangeDatabase, that's different from Database, adding methods that will be critical to good performance and that must be implemented by, e.g., HeapDatabase and LuceneDatabase.

Second, I don't yet see a way around checking versions when documents are added or deleted. The ugliest bit is that we have to keep track of the version of every document that's ever been deleted, in case a long-offline node comes online and reports a stale addition. That table could grow without bound. Sigh. Do you see a way around this?

Perhaps a node could discard old deletions after a time, keeping track of the log entry number of the oldest retained deletion. Attempts to sync starting with an older entry number should be rejected and should trigger a complete copy-based replacement of the stale index. The hazard is that, if a document is added to a single node, then that node goes offline for a long time, then, when it comes online, the addition will be lost. Not great.

Doug

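The tombstone scheme Doug describes — remembering the version of every deleted document so a stale add replayed by a long-offline node can be recognized and dropped — can be sketched like this (Python for brevity; Bailey is Java, and `VersionedIndex` and its methods are invented names, not project API):

```python
class VersionedIndex:
    """Sketch of versioned adds/deletes with tombstones. Without the
    tombstone table, a stale add replayed after a delete would
    resurrect the document."""

    def __init__(self):
        self.docs = {}        # id -> version of the live document
        self.tombstones = {}  # id -> version at which it was deleted

    def apply_add(self, doc_id, version):
        # A stale add loses to a newer tombstone or a newer live doc.
        if self.tombstones.get(doc_id, -1) >= version:
            return False
        if self.docs.get(doc_id, -1) >= version:
            return False
        self.docs[doc_id] = version
        self.tombstones.pop(doc_id, None)  # the add supersedes the delete
        return True

    def apply_delete(self, doc_id, version):
        if self.docs.get(doc_id, -1) > version:
            return False   # delete is older than the indexed doc: ignore
        self.docs.pop(doc_id, None)
        self.tombstones[doc_id] = version
        return True

idx = VersionedIndex()
idx.apply_add("X", 1)
idx.apply_delete("X", 2)
late = idx.apply_add("X", 1)   # long-offline node replays the stale add
```

The unbounded-growth worry is visible here: `tombstones` never shrinks on its own. Doug's mitigation is to prune it after a time and record the log entry number of the oldest retained deletion, rejecting any sync that starts earlier and forcing a full copy instead.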
From: Ning L. <nin...@gm...> - 2008-04-02 16:26:51

On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
> I think we should add a Range element to Query that narrows it. But we
> first need to define what it means in terms of other public API
> elements. I think we define it in terms of the document's "position"
> field, which is the hashCode of its id by default, but can be explicitly
> specified. Does that sound right?

I realize we'll have to add getDoc(String id, int position) in addition to adding search(Range range, Query q, int maxHits) to Database?

Ning

From: Ning L. <nin...@gm...> - 2008-04-02 16:24:00

On Wed, Apr 2, 2008 at 11:49 AM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
>> On Mon, Mar 31, 2008 at 6:03 PM, Doug Cutting <cu...@ap...> wrote:
>>> The naive approach is simply to db.getDoc() and check the version, no?
>>
>> Yes. The current implementation does that. I was worried about
>> its performance because I thought db.getDoc() means retrieving
>> a full document. But you mentioned the semantics of retrieving
>> the outline of a document. That would work. It's time to have
>> db.getDoc(id) and db.getDoc(id, fields)?
>
> Maybe we need 'getVersion()' for this, that returns a special value for
> deleted documents?
>
>> Do we index a delete?
>
> No, but we need to log them, and, when replaying logs, decide whether to
> execute them. If you delete a doc on one node and subsequently add it
> on another then, when the deletion is propagated to the former, it should
> be ignored. But if you just delete it then the propagated delete should
> be executed. We distinguish the cases by comparing the version in the
> deletion log event to the version in the index -- just like for adds.

Yes, we have to log and propagate deletes correctly.

What I'm worried about is the impact of the version check on the index build performance. As you said, for general synchronization, we always need to check versions. After we check a database/log and decide to add/delete a document, we call Database's addDoc/removeDoc method. In this addDoc/removeDoc method, we first parse the document for addDoc. Then, in the same critical section, we have to check again if it is the latest version and apply the add/delete. Is checking the log for the version of a delete expensive here? And a log is not part of the Database abstraction, but part of RangedDatabase...

Ning

From: Doug C. <cu...@ap...> - 2008-04-02 15:49:14
|
Ning Li wrote:
> On Mon, Mar 31, 2008 at 6:03 PM, Doug Cutting <cu...@ap...> wrote:
>> The naive approach is simply to db.getDoc() and check the version, no?
>
> Yes. The current implementation does that. I was worried about
> its performance because I thought db.getDoc() means retrieving
> a full document. But you mentioned the semantics of retrieving
> the outline of a document. That would work. Is it time to have
> db.getDoc(id) and db.getDoc(id, fields)?

Maybe we need 'getVersion()' for this, that returns a special value for
deleted documents?

> Do we index a delete?

No, but we need to log them, and, when replaying logs, decide whether to
execute them. If you delete a doc on one node and subsequently add it
on another, then when the deletion is propagated to the former it should
be ignored. But if you just delete it, then the propagated delete should
be executed. We distinguish the cases by comparing the version in the
deletion log event to the version in the index -- just like for adds.

Does that make sense?

Doug |
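The version-comparison rule Doug describes (ignore an add or a delete if the index already holds a newer version, and keep deletions as versioned entries) can be sketched as below. This is a minimal in-memory stand-in, not Bailey's implementation; the tie-break choice that an equal version also counts as stale is an assumption:

```java
// Last-write-wins reconciliation by version: both adds and deletes
// carry a version, and a replayed operation is ignored if the index
// already holds an entry with an equal or newer version. Deletions
// are kept as versioned tombstones so a stale add can be detected.
import java.util.HashMap;
import java.util.Map;

public class VersionedStore {
    private static final class Entry {
        final long version;
        final String doc; // null marks a deletion ("tombstone")
        Entry(long version, String doc) { this.version = version; this.doc = doc; }
    }

    private final Map<String, Entry> index = new HashMap<>();

    /** Apply an add; ignored if an equal or newer version is indexed. */
    public boolean add(String id, long version, String doc) {
        Entry e = index.get(id);
        if (e != null && e.version >= version) return false; // stale add
        index.put(id, new Entry(version, doc));
        return true;
    }

    /** Apply a delete; ignored if an equal or newer add is indexed. */
    public boolean delete(String id, long version) {
        Entry e = index.get(id);
        if (e != null && e.version >= version) return false; // stale delete
        index.put(id, new Entry(version, null));             // keep tombstone
        return true;
    }

    /** The live document, or null if absent or deleted. */
    public String get(String id) {
        Entry e = index.get(id);
        return (e == null || e.doc == null) ? null : e.doc;
    }
}
```

The tombstones are exactly the unbounded deletion table worried about elsewhere in this thread; discarding old ones is what forces the stale-sync cutoff.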
|
From: <ni...@us...> - 2008-04-02 14:59:26
|
Revision: 19
http://bailey.svn.sourceforge.net/bailey/?rev=19&view=rev
Author: ning_li
Date: 2008-04-02 07:59:32 -0700 (Wed, 02 Apr 2008)
Log Message:
-----------
Change the getDocs() method to be an implementation level API instead of a top-level API in Database.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
|
|
From: Ning L. <nin...@gm...> - 2008-04-02 00:39:50
|
On Mon, Mar 31, 2008 at 6:03 PM, Doug Cutting <cu...@ap...> wrote:
> The naive approach is simply to db.getDoc() and check the version, no?

Yes. The current implementation does that. I was worried about
its performance because I thought db.getDoc() means retrieving
a full document. But you mentioned the semantics of retrieving
the outline of a document. That would work. Is it time to have
db.getDoc(id) and db.getDoc(id, fields)?

> If the existing version is newer, then we ignore. Note that deletes
> come with a version, so if the indexed version is newer than the
> deletion, we ignore the delete, just like an add.

Do we index a delete?

> Yes, in general. We might be able to avoid it when copying new ranges
> into a node, if the log is de-duplicated. But for general
> synchronization I think we always need to check versions.
>
> So it might be worth optimizing the doc->version implementation. So the
> method to add might be db.getVersion(String id). This doesn't make sense
> for end users, so should be added to the Database API we use in a node,
> not to the base abstract class.

So we index deletes and keep them?

> Yes, that sounds reasonable. I was just questioning whether this
> belongs in the top-level Database API, or rather just in
> HostToHostProtocol. I just want to hide as much as we can from
> end-users, so that we can more freely re-arrange things underneath later.
>
> Eventually the top-level API might add a method like getDocs(String[]
> ids, String[] fields), which looks similar, but has a different purpose.

I see.

Ning |
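A minimal sketch of the getVersion idea, assuming deletions are kept as tombstones. The thread only says getVersion should return "a special value" for deleted documents; the encoding below (0 for never seen, -v for "deleted at version v", with versions assumed positive) is a hypothetical choice:

```java
// Per-node version lookup that never touches field data, so the
// add/delete version check stays cheap. Sentinel encoding is an
// assumption: 0 = never seen, negative = deleted at that version.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VersionTable {
    /** Versions are assumed positive; 0 means "never seen". */
    public static final long NEVER_SEEN = 0L;

    private final Map<String, Long> versions = new HashMap<>();
    private final Set<String> deleted = new HashSet<>();

    public void recordAdd(String id, long version) {
        versions.put(id, version);
        deleted.remove(id);
    }

    public void recordDelete(String id, long version) {
        versions.put(id, version);
        deleted.add(id);
    }

    /**
     * Version of a document without retrieving any field data.
     * A negative result -v means "deleted at version v".
     */
    public long getVersion(String id) {
        Long v = versions.get(id);
        if (v == null) return NEVER_SEEN;
        return deleted.contains(id) ? -v : v;
    }
}
```

As discussed above, this would sit on the node-internal Database API (or HostToHostProtocol), not on the end-user abstract class.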
|
From: Ning L. <nin...@gm...> - 2008-04-02 00:10:01
|
On Mon, Mar 31, 2008 at 5:35 PM, Doug Cutting <cu...@ap...> wrote:
> I'd thought mostly in terms of adding/removing. If a hot spot is
> identified on the ring, then it should be split by adding a new node in
> its range. The node can be allocated by finding a cool spot and
> removing a node there.
>
> But moving a node from a hot host to a cooler host might be faster,
> since the index could be copied rather than reconstructed from logs.
> However this might not be enough to handle uneven loading of the ring.
> For that we really need to be able to change the partitioning by placing
> new nodes in hot spots.
I thought both types were useful, and I started with the simpler one.
But support for adding/removing nodes is definitely necessary.
> That sounds right. In terms of hostToMaster protocol, there are four
> messages:
> - Master tells B to start copying N (in response to heartbeat)
> - B tells Master that it now serves N
> - Master tells A to remove N (in heartbeat response)
> - A tells Master it no longer serves N
>
> The master will have to keep track of what moves are in flight, so that
> it doesn't schedule too many in one region and risk losing data, right?
Yes. For simplicity and to reduce the risk of losing data, no other
move or add/remove should be started for the nodes overlapping
with node N.
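The four messages and the no-overlapping-moves rule can be sketched as below. All class and method names are hypothetical; the overlap test uses plain integer distance between ring positions (wraparound ignored), and a real master would persist this state rather than hold it in memory:

```java
// Master-side bookkeeping for moving a node N from host A to host B.
// The master refuses to start a move whose node overlaps (is within
// the replication level of) a move already in flight, so overlapping
// ranges are never in motion at once and data-loss risk is bounded.
import java.util.HashSet;
import java.util.Set;

public class MoveTracker {
    /** The four host/master messages described in the thread. */
    public enum Message {
        START_COPY,        // master -> B, in heartbeat response
        NOW_SERVING,       // B -> master
        REMOVE_NODE,       // master -> A, in heartbeat response
        NO_LONGER_SERVING  // A -> master
    }

    private final Set<Integer> inFlight = new HashSet<>(); // positions being moved
    private final int replication; // nodes within this distance overlap

    public MoveTracker(int replication) {
        this.replication = replication;
    }

    /** Begin moving the node at this ring position, unless an
     *  overlapping move is already in flight. */
    public boolean tryStartMove(int position) {
        for (int p : inFlight)
            if (Math.abs(p - position) < replication) return false;
        return inFlight.add(position);
    }

    /** Called when A reports NO_LONGER_SERVING: the move completed. */
    public void finishMove(int position) {
        inFlight.remove(position);
    }
}
```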
In designing the addition/removal or the move of a node, I think
we must achieve:
G1 There is no data loss if no host goes down in the process,
even if the replication level is 1.
And we should achieve:
G2 When a new node is added or a node adds a new range,
the node can build the database on the range from the
node(s) serving that range. That is, the node does not
have to construct the database on the range from the
log(s).
G3 When a (new) node starts to serve a new range, it
knows where to start processing the log entries from
the overlapping nodes of the range.
G2 and G3 aim to avoid processing a log from the
beginning, which becomes less feasible as a log grows.
If we achieve G2 and G3, it becomes possible to remove
old log entries.
In the design on how to move a node:
- We achieve G2 by B copying a checkpoint of N from A.
- We achieve G3 by B also copying the starting entry
numbers of the overlapping nodes as part of the
checkpoint. That tells B where to start processing the
log entries from the overlapping nodes.
- We achieve G1 by, after A stops serving N and stops
processing logs from the nodes overlapping N,
N and its overlapping nodes continuing to process
N's log on A until all entries are processed. Only then
can A remove N and its log.
> I think I'd rather see us implement add/remove balancing rather than
> copying first, since, as mentioned above, I think it is more general.
> It also means that there's much less chance that the same range/node
> will exist on two hosts at once.
>
> The communication with the master would be much the same. Assume that N
> is an underloaded node, and that M is an overloaded point on the ring.
> 1. Master tells N's neighbors to expand to cover N
> 2. N's neighbors tell Master that N is now redundant
> 3. Master tells A to stop serving N
> 4. Master tells A to start serving M
> 5. A tells master it's now serving M
>
> The master must keep track of what nodes it is trying to decommission,
> and avoid trying to decommission more than one in any overlapping region
> at a time. It also has to keep track of new-node requests that it has
> made, so that it doesn't issue them to multiple nodes. Heartbeats
> should refresh this state, i.e., a host should list its pending ranges
> as well as its complete ranges. I think that's sufficient state for the
> master. Does that sound right?
Would it be clearer if we describe adding a node and
removing a node separately? I think it's important to
figure out how G1, G2 & G3 above are achieved in the
design.
Assume the replication level is 3. Let's look at adding
a node first. Assume a new node X' is added at the
position X' between X and X+1 and assigned to host A.
The neighboring nodes will be affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
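The before/after assignments follow one rule: with replication level R, the node at a position serves from that position up to (but excluding) the R-th next position on the ring. A sketch with small integer positions standing in for X-2 ... X+3 (the modulo index handles wrapping around the sorted array, but splitting a wrapped range is not modeled):

```java
// Range served by each node on the ring, given the sorted node
// positions and replication level r. Inserting a new position X'
// automatically shrinks the ranges of the r nodes before it, which
// is exactly the before/after shift shown above. Illustrative only.
public class RingRanges {
    /**
     * Returns {start, end} of the half-open range served by
     * positions[i]: from its own position to the r-th next position.
     */
    public static int[] servedRange(int[] positions, int i, int r) {
        return new int[] { positions[i], positions[(i + r) % positions.length] };
    }
}
```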
The process works as follows:
1 Master tells host A to start serving node X'
2 Host A copies [X', X+3) from node X (or other
node(s) which cover(s) [X', X+3))
2.1 Host A records the starting entry numbers of
the overlapping nodes of node X
2.2 Host A retrieves documents in range [X', X+3)
from node X (a new functionality)
2.3 Host A builds node X' from retrieved documents
2.4 Host A processes log entries from the
overlapping nodes starting from the entry
numbers recorded in 2.1
3 Host A tells master it's now serving node X'
4 Master changes the map and tells X, X-1, X-2
their new smaller ranges
5 Nodes X, X-1, X-2 gradually remove the
documents no longer in their ranges
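Steps 2.1 and 2.4 are the heart of G3: record where each overlapping node's log currently ends before the bulk copy, then replay only entries appended after that point. A toy in-memory stand-in (the real log format and the copy in steps 2.2-2.3 are elided; all names are hypothetical):

```java
// New-host bootstrap bookkeeping: record per-node log positions
// before copying (step 2.1), then replay only what arrived during
// the copy (step 2.4), so logs never need replaying from the start.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RangeBootstrap {
    /** A source node's log; entries are just doc ids in this sketch. */
    private final Map<String, List<String>> logs = new HashMap<>();

    public void append(String node, String docId) {
        logs.computeIfAbsent(node, k -> new ArrayList<>()).add(docId);
    }

    /** Step 2.1: record the current end of each overlapping node's log. */
    public Map<String, Integer> recordStartingEntries(List<String> overlapping) {
        Map<String, Integer> starts = new HashMap<>();
        for (String n : overlapping)
            starts.put(n, logs.getOrDefault(n, List.of()).size());
        return starts;
    }

    /** Step 2.4: replay only entries appended since the recorded points. */
    public List<String> replaySince(Map<String, Integer> starts) {
        List<String> replayed = new ArrayList<>();
        for (Map.Entry<String, Integer> e : starts.entrySet()) {
            List<String> log = logs.getOrDefault(e.getKey(), List.of());
            replayed.addAll(log.subList(e.getValue(), log.size()));
        }
        return replayed;
    }
}
```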
Step 2.2 achieves G2 and step 2.1 achieves G3.
Achieving G1 is a bit complicated but possible.
The process to remove a node is similar and again
the algorithm to achieve G1 is complicated.
What do you think?
The algorithm to achieve G1 can be simpler if we
change to a simpler replication scheme - instead
of nodes having overlapping ranges, nodes do not
overlap and there are R replicas for each node.
The down side is that when we add a new node,
we need to create R replicas for the new node.
> The master might try to keep the cluster less than fully occupied so
> that it can more quickly respond to hotspots and failures. For example,
> if each host can serve five nodes, it might normally use only four, so
> that each always has a spare slot. Then the above steps would happen as
> 4, 5, 1, 2, 3, first addressing the hot spot, then the cool spot.
Agree.
Cheers,
Ning
|