From: <yo...@us...> - 2008-04-10 03:23:54
Revision: 23
http://bailey.svn.sourceforge.net/bailey/?rev=23&view=rev
Author: yonik
Date: 2008-04-09 20:23:56 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
test commit w/ new email
Modified Paths:
--------------
trunk/src/java/overview.html
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <yo...@us...> - 2008-04-10 03:18:12
Revision: 22
http://bailey.svn.sourceforge.net/bailey/?rev=22&view=rev
Author: yonik
Date: 2008-04-09 20:18:17 -0700 (Wed, 09 Apr 2008)
Log Message:
-----------
test commit
Modified Paths:
--------------
trunk/src/java/overview.html
From: Doug C. <cu...@ap...> - 2008-04-09 20:57:41

Yonik Seeley wrote:
> Everything seemed easy enough until you brought up network partitioning.
> In that event, could each partition elect a master? We would then
> have to handle merging of multiple different rings when the network
> was repaired, right?

Zookeeper is supposed to handle hard stuff like this. Multiple masters should be impossible. Clients and nodes will have stale rings at times, but they should be able to continue w/o compromising things.

Doug

From: Yonik S. <yo...@ap...> - 2008-04-09 17:45:02

Everything seemed easy enough until you brought up network partitioning. In that event, could each partition elect a master? We would then have to handle merging of multiple different rings when the network was repaired, right?

-Yonik

From: Yonik S. <yo...@ap...> - 2008-04-09 17:11:42

I'm trying to catch up with all the activity... please ignore if this has been addressed in later emails.

On Wed, Apr 2, 2008 at 1:36 PM, Doug Cutting <cu...@ap...> wrote:
> Second, I don't yet see a way around checking versions when documents
> are added or deleted. The ugliest bit is that we have to keep track of
> the version of every document that's ever been deleted, in case a
> long-offline node comes online and reports a stale addition. That table
> could grow without bound. Sigh. Do you see a way around this?
>
> Perhaps a node could discard old deletions after a time, keeping track
> of the log entry number of the oldest retained deletion. Attempts to
> sync starting with an older entry number should be rejected and should
> trigger a complete copy-based replacement of the stale index.

Right, this seems like the right approach. We shouldn't have to support incremental syncs forever.

> The hazard is that, if a document is added to a single node, then that
> node goes offline for a long time, then, when it comes online, the
> addition will be lost. Not great.

If a document is added to a single node, it could be lost permanently anyway (the node may never come back). That's why an add should go to multiple nodes.

-Yonik

From: Doug C. <cu...@ap...> - 2008-04-08 21:39:08

Ning Li wrote:
> Actually, the mechanism I want copy and ranged copy to use is
> not checkpoint copying. This is because ranged copy may only
> want a small sub-range of the checkpoint (in removing a node).
> So, what if a RangedDatabase supports getDocs(Range range),
> where full documents in the range are retrieved? To be conservative,
> the starting entry numbers are set to the ones at the beginning
> of the retrieval. Of course we need to re-parse the documents,
> but we do that in log propagation as well. So, which approach
> do you think is better?

I agree. We shouldn't copy checkpoints, at least not initially, but rather copy documents & reindex them.

>> Another downside is that load balancing is trickier. With disjoint
>> ranges, how would we split a hot range? BTW, I don't think we'll
>> actually randomly assign nodes, but rather try to initially add them
>> evenly around the ring, then use load balancing to split hot spots.
>
> Halve a hot range? Or some more sophisticated algorithm.
> We can still have nodes evenly around the ring at the start.
> The partitioning algorithm doesn't change. Only the replication
> algorithm is different.

But we'll need to dynamically balance things, since load may be uneven around the ring, especially with user-provided ring positions. And once we permit splitting, don't we have to handle overlapping ranges?

>> So the reasons for overlap are: (1) we can't split ranges without
>> permitting it, and (2) it makes search faster. Do you agree?
>
> What do you mean by "without permitting it"?

Just that, if we split a node in a disjoint partitioning, then we'll have overlapping ranges during the split. Switching all clients to search the new half-sized nodes won't be atomic. So I'm not sure that disjoint ranges really simplify things fundamentally.

> As to 2), I'm not sure overlap makes search faster.
> Assume the scope of the positions is [0, 400) and the replication
> level is 2. Let's say we have 4 nodes with overlap:
> - node 0 serves [0, 200)
> - node 1 serves [100, 300)
> - node 2 serves [200, 400)
> - node 3 serves [300, 100)
> Without overlap, it's only fair that nodes serve ranges of the same
> size as with overlap, so we only have 2 nodes:
> - node 0 serves [0, 200)
> - node 1 serves [200, 400)
> Each node is replicated twice, so it's 4 nodes in total.
>
> I think this is a fair comparison, so the search performance should
> be comparable. What do you think?

Good point. One other difference is that, with overlapping ranges, when a node fails, its load is spread more evenly over the remaining nodes. In this example, the load of any single node would become shared by two other nodes rather than one, so each node only needs 50% headroom rather than 100%.

Doug

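Doug's headroom observation can be checked numerically. The sketch below (Python for brevity; Bailey itself is Java, and all names here are illustrative, not project API) uses the four overlapping ranges from the thread and assumes each position's traffic splits evenly among its live replicas: killing node 1 raises the load of its two overlapping neighbors by exactly 50% and leaves node 3 untouched.

```python
RING = 400  # circular position space [0, RING)

# Overlapping layout from the thread: 4 nodes, each serving a
# half-ring range offset by a quarter (replication level 2).
OVERLAP = {0: (0, 200), 1: (100, 300), 2: (200, 400), 3: (300, 100)}

def in_range(pos, rng):
    """True if pos falls in the (possibly wrapping) range [start, end)."""
    start, end = rng
    if start < end:
        return start <= pos < end
    return pos >= start or pos < end  # range wraps past RING

def load(nodes, failed=()):
    """Fraction of ring traffic each live node serves, assuming each
    position's queries are split evenly among its live replicas."""
    shares = {n: 0.0 for n in nodes if n not in failed}
    for pos in range(RING):
        live = [n for n, r in nodes.items()
                if n not in failed and in_range(pos, r)]
        for n in live:
            shares[n] += 1.0 / len(live)
    return {n: s / RING for n, s in shares.items()}

# Healthy ring: every node carries 1/4 of the traffic. After node 1
# fails, [100, 200) falls entirely on node 0 and [200, 300) entirely
# on node 2, so each grows from 0.25 to 0.375 (+50%), while node 3
# stays at 0.25.
```

With disjoint ranges and whole-node replicas, the one surviving replica of a failed node would instead absorb 100% more load, which is the contrast Doug is drawing.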
From: Doug C. <cu...@ap...> - 2008-04-08 20:29:32

Ning Li wrote:
> I thought only the master writes to Zookeeper. Hosts read from Zookeeper.
> We should still have a light-weight master, no? Which keeps track of
> the heartbeats from the hosts, detects host failures and responds
> accordingly, and decides state changes (add/remove nodes) when
> appropriate...

As a thought experiment, I'm trying to see whether, with Zookeeper, we actually need such a master. We could replace heartbeats with an ephemeral file per node containing its status. (Ephemeral files disappear if their owner goes offline.) Any host could (a) grab a lock; (b) analyze the ring for potential adds/removes; (c) post these requests to a directory. In effect, getting the lock is master election, and while a node is doing this analysis, it is the master. But the master moves around: each host has a "master" thread, and remains "master" only during one analysis/action cycle. This analysis cycle should not be very compute- or i/o-intensive.

The advantage of this is that we wouldn't have to explicitly test or otherwise engineer master failover, since it would be happening all the time. All global state would be redundantly stored in Zookeeper. Could this work?

> In this example, say B synced with A to A's entry 11 and
> with C to C's entry 12 (which includes adding doc X) before
> it went offline. Because A expunged its log, now A's log
> entry number starts from 21. B comes back online and
> finds A's 21 is newer than last time B synced with it (11).
> So B cannot recover but has to sync from scratch.

My concern was that A might try to sync from B and get stale data. I think you're arguing that B should not go online, accepting sync requests, until it has itself sync'd with its neighbors. In this case, incremental sync would fail and B would sync from scratch, removing any stale adds. How does this work at system startup? How does B know not to go online, providing its stale adds, yet A is permitted to provide its data?

Perhaps, at startup, a node can respond to requests for its first log entry number, but refuse requests for the logs themselves. Then B can see that it must reload, and refuse requests from A to sync until the reload is complete. Does that work?

Doug

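Doug's "roving master" idea (ephemeral status files plus a lock, where whoever holds the lock briefly *is* the master) can be sketched as a toy in-process model. This is Python rather than Bailey's Java, and the class, method names, and repair policy are all invented for illustration; a real deployment would use Zookeeper ephemeral znodes and a Zookeeper lock recipe rather than an in-process dict and `threading.Lock`.

```python
import threading

class ToyCoordinator:
    """In-process stand-in for the Zookeeper pieces described in the
    thread: per-node ephemeral status "files" plus one analysis lock.
    Illustrative only -- not Bailey's or Zookeeper's actual API."""

    def __init__(self):
        self.status = {}       # node -> status "file" (ephemeral)
        self.requests = []     # add/remove requests posted to a "directory"
        self._lock = threading.Lock()

    def publish(self, node, info):
        self.status[node] = info          # written by each live node

    def node_died(self, node):
        self.status.pop(node, None)       # ephemeral file vanishes

    def analysis_cycle(self, replication=2):
        """One master cycle: grab the lock, inspect the ring, post
        repair requests. Whoever runs this is the master, briefly."""
        with self._lock:
            live = len(self.status)
            if live < replication:        # toy stand-in for real ring analysis
                self.requests.append(("add-nodes", replication - live))

coord = ToyCoordinator()
coord.publish("hostA", "ok")
coord.publish("hostB", "ok")
coord.analysis_cycle()       # healthy ring: nothing posted
coord.node_died("hostB")     # hostB's ephemeral status disappears
coord.analysis_cycle()       # the next host to grab the lock repairs it
```

The point of the design, as the message says, is that failover needs no special engineering: mastership is re-acquired on every cycle, so there is no long-lived master to fail over from.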
From: Ning L. <nin...@gm...> - 2008-04-07 17:55:26

On Fri, Apr 4, 2008 at 7:26 PM, Doug Cutting <cu...@ap...> wrote:
> I've been going back-and-forth in my head to try to decide how we want
> to use Zookeeper and whether we should store the current ring there.
> This is certainly an argument for that. It makes for a more stateful
> master than I'd hoped, but that might not be bad.
>
> An extreme idea would be to make Zookeeper *be* the master. There would
> be no XxxToMaster protocols: instead everything would be coordinated
> through Zookeeper files. Letting nodes directly access Zookeeper makes
> the master less of a bottleneck, since Zookeeper itself replicates data.

I thought only the master writes to Zookeeper. Hosts read from Zookeeper. We should still have a light-weight master, no? Which keeps track of the heartbeats from the hosts, detects host failures and responds accordingly, and decides state changes (add/remove nodes) when appropriate...

> I don't see it yet. Log entry numbers are per-node, and can't be
> compared across nodes, right?
>
> Node A syncs from B log entries 1-10.
> Document X is added to C.
> A and B both sync X from C.
> B goes offline.
> X is deleted from C.
> A syncs X's deletion from C.
> A expunges its logs.
> B comes back online.
> A tries to sync events after 10 from B.
>
> How does A know to ignore the addition of X?

Log entry numbers are per-node. Each node remembers the log entry numbers it has synced with its overlapping nodes. In this example, say B synced with A to A's entry 11 and with C to C's entry 12 (which includes adding doc X) before it went offline. Because A expunged its log, A's log entry numbers now start from 21. B comes back online and finds A's 21 is newer than the last entry B synced from it (11). So B cannot recover incrementally but has to sync from scratch. Does it make sense?

Ning

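Ning's rule — each node remembers, per neighbor, the last entry number it synced, and falls back to a full copy whenever the neighbor has expunged past that point — can be sketched as follows (Python for brevity; Bailey is Java, and `NodeLog`/`plan_sync` are illustrative names, not project API):

```python
class NodeLog:
    """Toy per-node log with expungement. Entry numbers are per-node
    and only meaningful against the same node's log."""

    def __init__(self):
        self.entries = {}   # entry number -> operation
        self.next_no = 1

    def append(self, op):
        self.entries[self.next_no] = op
        self.next_no += 1

    def expunge_through(self, n):
        """Discard entries <= n (e.g. old deletions past the retention window)."""
        for k in [k for k in self.entries if k <= n]:
            del self.entries[k]

    def oldest_retained(self):
        return min(self.entries, default=self.next_no)

def plan_sync(synced_to, neighbor_log):
    """B remembers it synced neighbor A through entry `synced_to`.
    If A has expunged beyond that point, incremental replay would
    silently miss events, so B must rebuild from scratch instead."""
    if neighbor_log.oldest_retained() > synced_to + 1:
        return "full-copy"
    return [no for no in sorted(neighbor_log.entries) if no > synced_to]

a = NodeLog()
for op in ["add X", "del X", "add Y"]:
    a.append(op)
incr = plan_sync(0, a)      # nothing expunged: replay entries 1..3
a.expunge_through(2)        # A discards its old entries
stale = plan_sync(0, a)     # entries 1-2 are gone: must rebuild
```

This is exactly the scenario in the quoted exchange: after A expunges through entry 20, a B that last synced at 11 sees A's oldest retained entry is newer and rebuilds rather than recovering incrementally.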
From: <ni...@us...> - 2008-04-07 17:35:40
Revision: 21
http://bailey.svn.sourceforge.net/bailey/?rev=21&view=rev
Author: ning_li
Date: 2008-04-07 10:35:38 -0700 (Mon, 07 Apr 2008)
Log Message:
-----------
Create the "provider" package and the "ServiceDatabase" class in it. This is the interface that a service provider for a single-node database should implement. The commit also includes various changes related to it.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/LogEntry.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
trunk/src/test/org/apache/bailey/TestHeapDb.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/ddb/WriteAction.java
trunk/src/java/org/apache/bailey/provider/
trunk/src/java/org/apache/bailey/provider/ServiceDatabase.java
Removed Paths:
-------------
trunk/src/java/org/apache/bailey/ddb/IndexAction.java
From: Doug C. <cu...@ap...> - 2008-04-04 23:26:18

Ning Li wrote:
> We were thinking of storing the "metadata" in Zookeeper, right?
> All the nodes (ever) created, their ranges, their current log
> entry number, the starting entry numbers of their overlapping
> nodes, the current ring... So when a master is restarted,
> it starts from exactly where it was before it went down by
> retrieving the "metadata" from Zookeeper.

I've been going back-and-forth in my head to try to decide how we want to use Zookeeper and whether we should store the current ring there. This is certainly an argument for that. It makes for a more stateful master than I'd hoped, but that might not be bad.

An extreme idea would be to make Zookeeper *be* the master. There would be no XxxToMaster protocols: instead everything would be coordinated through Zookeeper files. Letting nodes directly access Zookeeper makes the master less of a bottleneck, since Zookeeper itself replicates data.

> When an offline node comes back up, we verify with Zookeeper
> that it is a valid node in a valid state. Then we check if we
> can patch up its data by checking whether the overlapping
> nodes and their logs still have the entries back to the
> starting entry numbers of the node. If not, we sync from
> scratch.

I don't see it yet. Log entry numbers are per-node, and can't be compared across nodes, right?

Node A syncs from B log entries 1-10.
Document X is added to C.
A and B both sync X from C.
B goes offline.
X is deleted from C.
A syncs X's deletion from C.
A expunges its logs.
B comes back online.
A tries to sync events after 10 from B.

How does A know to ignore the addition of X?

Doug

From: Ning L. <nin...@gm...> - 2008-04-04 22:18:40

On Fri, Apr 4, 2008 at 12:22 PM, Doug Cutting <cu...@ap...> wrote:
> It would be nice to be able to eliminate long-offline nodes, but I don't
> yet see how.
>
> At startup we want nodes to announce their content to the master. Not
> all nodes will start at exactly the same time. (Note also that, if the
> master fails then nodes will also re-elect a new master and post their
> state there. Search and indexing should continue uninterrupted through
> master moves.) So, when a master first starts it needs to avoid
> modifying the ring for a time, until it assumes that all nodes are up.
> We might even have nodes randomly delay their first report, so that the
> master isn't overwhelmed.
>
> If the network is partitioned then the master would allocate new nodes
> to underserviced regions. When the network is repaired, we have the
> choice of ignoring the data on the nodes that were replaced, or
> synchronizing it with what has transpired in their absence. In the case
> where all replicas of a region were offline, we would want to use
> their data when they come online (like the system restart case), but
> when only a single replica was offline we might simply ignore its data
> and let it sync from scratch. However, it may not be easy to distinguish
> these cases. If all replicas go offline, and then we add new nodes to the
> region, we'd need to remember that, at some point in the past, all nodes
> in that region were offline. If the master was restarted during this
> time, it will be even harder to keep track of this.

We were thinking of storing the "metadata" in Zookeeper, right? All the nodes (ever) created, their ranges, their current log entry number, the starting entry numbers of their overlapping nodes, the current ring... So when a master is restarted, it starts from exactly where it was before it went down by retrieving the "metadata" from Zookeeper.

When an offline node comes back up, we verify with Zookeeper that it is a valid node in a valid state. Then we check if we can patch up its data by checking whether the overlapping nodes and their logs still have the entries back to the starting entry numbers of the node. If not, we sync from scratch. What do you think?

Ning

From: Ning L. <nin...@gm...> - 2008-04-04 19:04:01

On Fri, Apr 4, 2008 at 12:22 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > Database as the application interface and RangedDatabase as the
> > service-provider interface sounds good. The methods in RangedDatabase
> > will be very similar to those in the current RangedDatabase?
>
> Yes. HeapDatabase should extend this now though.

Yes. I'm thinking we should put RangedDatabase in a separate package? Is "provider" a good name for the package? Hmm, it should be different from the current RangedDatabase in that it should not include the node status and the log?

I need to think about the other topic of long-offline nodes...

Ning

From: Doug C. <cu...@ap...> - 2008-04-04 16:22:04

Ning Li wrote:
> Database as the application interface and RangedDatabase as the
> service-provider interface sounds good. The methods in RangedDatabase
> will be very similar to those in the current RangedDatabase?

Yes. HeapDatabase should extend this now though.

> I've been thinking about this. Finally, I think this is a possibility:
> 1 The database records and logs a deleted document and its version.
> 2 If all the replicas have recorded and logged the deleted document
>   with the same version number, the document can be removed from
>   the database. This is because any new version of the document
>   that comes after will have a larger version number.
>
> Does this sound right? Things are more complicated when we
> consider state changes...

It sounds right except for the case of a long-offline node coming back online. More on that below...

> Do we allow a node to go offline for a long time? I thought we'd consider
> the node gone down and pick a replacement for it.

It would be nice to be able to eliminate long-offline nodes, but I don't yet see how.

At startup we want nodes to announce their content to the master. Not all nodes will start at exactly the same time. (Note also that, if the master fails then nodes will also re-elect a new master and post their state there. Search and indexing should continue uninterrupted through master moves.) So, when a master first starts it needs to avoid modifying the ring for a time, until it assumes that all nodes are up. We might even have nodes randomly delay their first report, so that the master isn't overwhelmed.

If the network is partitioned then the master would allocate new nodes to underserviced regions. When the network is repaired, we have the choice of ignoring the data on the nodes that were replaced, or synchronizing it with what has transpired in their absence. In the case where all replicas of a region were offline, we would want to use their data when they come online (like the system restart case), but when only a single replica was offline we might simply ignore its data and let it sync from scratch. However, it may not be easy to distinguish these cases. If all replicas go offline, and then we add new nodes to the region, we'd need to remember that, at some point in the past, all nodes in that region were offline. If the master was restarted during this time, it will be even harder to keep track of this.

I'm still hopeful that we can come up with a heuristic for this, but I need to think more about what it should be.

Doug

From: Ning L. <nin...@gm...> - 2008-04-03 20:40:07

On Wed, Apr 2, 2008 at 1:36 PM, Doug Cutting <cu...@ap...> wrote:
> First, I think we need to add an abstract Database service-provider
> interface, called perhaps RangeDatabase, that's different from Database,
> adding methods that will be critical to good performance and that must
> be implemented by, e.g., HeapDatabase and LuceneDatabase.

Database as the application interface and RangedDatabase as the service-provider interface sounds good. The methods in RangedDatabase will be very similar to those in the current RangedDatabase?

> Second, I don't yet see a way around checking versions when documents
> are added or deleted. The ugliest bit is that we have to keep track of
> the version of every document that's ever been deleted, in case a
> long-offline node comes online and reports a stale addition. That table
> could grow without bound. Sigh. Do you see a way around this?

I've been thinking about this. Finally, I think this is a possibility:

1. The database records and logs a deleted document and its version.
2. If all the replicas have recorded and logged the deleted document
   with the same version number, the document can be removed from
   the database. This is because any new version of the document
   that comes after will have a larger version number.

Does this sound right? Things are more complicated when we consider state changes...

> Perhaps a node could discard old deletions after a time, keeping track
> of the log entry number of the oldest retained deletion. Attempts to
> sync starting with an older entry number should be rejected and should
> trigger a complete copy-based replacement of the stale index. The
> hazard is that, if a document is added to a single node, then that node
> goes offline for a long time, then, when it comes online, the addition
> will be lost. Not great.

Do we allow a node to go offline for a long time? I thought we'd consider the node gone down and pick a replacement for it.

Ning

From: <ni...@us...> - 2008-04-03 16:35:32
Revision: 20
http://bailey.svn.sourceforge.net/bailey/?rev=20&view=rev
Author: ning_li
Date: 2008-04-03 09:35:33 -0700 (Thu, 03 Apr 2008)
Log Message:
-----------
Add "position" to Document. Add getDoc(id, position) and search(range, query, maxHits) to Database. With all the changes because of the two additions.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/Document.java
trunk/src/java/org/apache/bailey/ddb/Client.java
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Log.java
trunk/src/java/org/apache/bailey/ddb/LogEntry.java
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/NodeInfo.java
trunk/src/java/org/apache/bailey/ddb/RangeResults.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleLog.java
trunk/src/java/org/apache/bailey/heap/HeapDatabase.java
trunk/src/test/org/apache/bailey/TestHeapDb.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
trunk/src/test/org/apache/bailey/TestSimpleDbWithLog.java
trunk/src/test/org/apache/bailey/ddb/TestRange.java
trunk/src/test/org/apache/bailey/ddb/TestRing.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/Range.java
Removed Paths:
-------------
trunk/src/java/org/apache/bailey/ddb/Range.java
trunk/src/java/org/apache/bailey/util/HashUtil.java
From: Ning L. <nin...@gm...> - 2008-04-02 23:35:53

On Wed, Apr 2, 2008 at 1:55 PM, Doug Cutting <cu...@ap...> wrote:
> This looks great! We can implement node copying with this too, right,
> just by setting X'=X? And 2.2 could be done as a database checkpoint
> copy & filter. In the case where X'=X, this would result in the same
> algorithm as a node copy, since the filter would be a no-op. Or am I
> missing some fundamental difference?

Yes, copy and ranged copy could/should use the same mechanism.

> Initially we can avoid implementing checkpoint copying, since I don't
> think it easily fits into an RPC protocol. Perhaps it will work by
> specifying a set of URLs to fetch or something?

Actually, the mechanism I want copy and ranged copy to use is not checkpoint copying. This is because ranged copy may only want a small sub-range of the checkpoint (in removing a node). So, what if a RangedDatabase supports getDocs(Range range), where full documents in the range are retrieved? To be conservative, the starting entry numbers are set to the ones at the beginning of the retrieval. Of course we need to re-parse the documents, but we do that in log propagation as well. So, which approach do you think is better?

> Another downside is that load balancing is trickier. With disjoint
> ranges, how would we split a hot range? BTW, I don't think we'll
> actually randomly assign nodes, but rather try to initially add them
> evenly around the ring, then use load balancing to split hot spots.

Halve a hot range? Or some more sophisticated algorithm. We can still have nodes evenly around the ring at the start. The partitioning algorithm doesn't change. Only the replication algorithm is different.

> Also, with disjoint ranges, each search would require searching more
> indexes. The search performance impact would thus be significant.
>
> So the reasons for overlap are: (1) we can't split ranges without
> permitting it, and (2) it makes search faster. Do you agree?

What do you mean by "without permitting it"?

As to 2), I'm not sure overlap makes search faster. Assume the scope of the positions is [0, 400) and the replication level is 2. Let's say we have 4 nodes with overlap:

- node 0 serves [0, 200)
- node 1 serves [100, 300)
- node 2 serves [200, 400)
- node 3 serves [300, 100)

Without overlap, it's only fair that nodes serve ranges of the same size as with overlap, so we only have 2 nodes:

- node 0 serves [0, 200)
- node 1 serves [200, 400)

Each node is replicated twice, so it's 4 nodes in total. I think this is a fair comparison, so the search performance should be comparable. What do you think?

Ning

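Ning's fairness argument can be checked mechanically: under both layouts, every ring position is held by exactly two of the four nodes, so a query over any position touches the same number of copies. A small sketch (Python for brevity; Bailey is Java, and these helper names are invented for illustration):

```python
RING = 400  # position space [0, RING), as in the example

def covers(pos, start, end):
    """Membership in a wrapping half-open range on the ring."""
    return start <= pos < end if start < end else (pos >= start or pos < end)

# Ning's two layouts, replication level 2 (ranges from the thread):
overlap = [(0, 200), (100, 300), (200, 400), (300, 100)]
disjoint = [(0, 200), (200, 400)] * 2   # two disjoint ranges, two replicas each

def replicas_at(pos, layout):
    """How many nodes hold a copy of the given position."""
    return sum(covers(pos, s, e) for s, e in layout)

# Both layouts place exactly 2 copies of every position across 4 nodes,
# which is why the raw search cost comes out comparable.
```

What the count does not capture is Doug's follow-up point: the two layouts differ in how a failed node's load redistributes, not in steady-state search cost.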
From: Doug C. <cu...@ap...> - 2008-04-02 18:01:17

Ning Li wrote:
> On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
>> I think we should add a Range element to Query that narrows it. But we
>> first need to define what it means in terms of other public API
>> elements. I think we define it in terms of the document's "position"
>> field, which is the hashCode of its id by default, but can be explicitly
>> specified. Does that sound right?
>
> I realize we'll have to add getDoc(String id, int position) in addition
> to adding search(Range range, Query q, int maxHits) to Database?

Yes. getDoc(id) should not be abstract, but should use the default hash function and call getDoc(id, position). Applications that explicitly specify positions when adding documents must always call getDoc(id, position). We should perhaps mark the methods that take an explicit position as "Expert" or somesuch?

Doug

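The convention being settled here — a document's ring position defaults to a hash of its id, with getDoc(id) delegating to the "expert" getDoc(id, position) — can be sketched as follows. This is Python rather than Bailey's Java, the hash function is a stand-in (Bailey would presumably use Java's String.hashCode), and the class and method names are illustrative only:

```python
import hashlib

class Database:
    """Sketch of the position convention from the thread: position
    defaults to a hash of the id, but callers that chose explicit
    positions when adding must pass them back when getting."""

    def __init__(self):
        self._docs = {}   # (id, position) -> document

    @staticmethod
    def default_position(doc_id, ring=400):
        # Deterministic stand-in for Java's String.hashCode().
        digest = hashlib.md5(doc_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % ring

    def add_doc(self, doc_id, doc, position=None):
        if position is None:
            position = self.default_position(doc_id)
        self._docs[(doc_id, position)] = doc

    def get_doc(self, doc_id, position=None):
        # "Expert" variant takes an explicit position; the plain
        # getDoc(id) just delegates with the default hash position.
        if position is None:
            position = self.default_position(doc_id)
        return self._docs.get((doc_id, position))

db = Database()
db.add_doc("doc1", {"body": "hello"})             # default position
db.add_doc("doc2", {"body": "pinned"}, position=7)  # explicit position
```

Note how "doc2" can only be fetched via `get_doc("doc2", position=7)`, which is exactly why Doug wants the explicit-position methods flagged as expert-only.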
From: Doug C. <cu...@ap...> - 2008-04-02 17:55:48

Ning Li wrote:
> Assume the replication level is 3. Let's look at adding
> a node first. Assume a new node X' is added at the
> position X' between X and X+1 and assigned to host A.
> The neighboring nodes will be affected and switch from:
>   node X-2 serves range [X-2, X+1)
>   node X-1 serves range [X-1, X+2)
>   node X serves range [X, X+3)
> to:
>   node X-2 serves range [X-2, X')
>   node X-1 serves range [X-1, X+1)
>   node X serves range [X, X+2)
>   node X' serves range [X', X+3)
>
> The process works as follows:
> 1 Master tells host A to start serving node X'
> 2 Host A copies [X', X+3) from node X (or other
>   node(s) which cover(s) [X', X+3))
>   2.1 Host A records the starting entry numbers of
>       the overlapping nodes of node X
>   2.2 Host A retrieves documents in range [X', X+3)
>       from node X (a new functionality)
>   2.3 Host A builds node X' from retrieved documents
>   2.4 Host A processes log entries from the
>       overlapping nodes starting from the entry
>       numbers recorded in 2.1
> 3 Host A tells master it's now serving node X'
> 4 Master changes the map and tells X, X-1, X-2
>   their new smaller ranges
> 5 Nodes X, X-1, X-2 gradually remove the
>   documents no longer in their ranges
>
> Step 2.2 achieves G2 and step 2.1 achieves G3.
> Achieving G1 is a bit complicated but possible.
>
> The process to remove a node is similar and again
> the algorithm to achieve G1 is complicated.
>
> What do you think?

This looks great! We can implement node copying with this too, right, just by setting X'=X? And 2.2 could be done as a database checkpoint copy & filter. In the case where X'=X, this would result in the same algorithm as a node copy, since the filter would be a no-op. Or am I missing some fundamental difference?

Initially we can avoid implementing checkpoint copying, since I don't think it easily fits into an RPC protocol. Perhaps it will work by specifying a set of URLs to fetch or something?

> The algorithm to achieve G1 can be simpler if we
> change to a simpler replication scheme - instead
> of nodes having overlapping ranges, nodes do not
> overlap and there are R replicas for each node.
> The down side is that when we add a new node,
> we need to create R replicas for the new node.

Another downside is that load balancing is trickier. With disjoint ranges, how would we split a hot range? BTW, I don't think we'll actually randomly assign nodes, but rather try to initially add them evenly around the ring, then use load balancing to split hot spots.

Also, with disjoint ranges, each search would require searching more indexes. The search performance impact would thus be significant.

So the reasons for overlap are: (1) we can't split ranges without permitting it, and (2) it makes search faster. Do you agree?

Doug

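The before/after range tables in Ning's quoted proposal follow one rule: with replication level R, each node serves from its own position up to the position R steps clockwise around the ring. A sketch of that rule reproduces the tables mechanically (Python for brevity; Bailey is Java, and the function name and the concrete positions standing in for X-2 ... X+3 are illustrative):

```python
def ranges(positions, replication=3):
    """Each node serves [own position, position `replication` steps
    clockwise) -- the rule behind the before/after tables above."""
    pts = sorted(positions)
    n = len(pts)
    return {p: (p, pts[(i + replication) % n]) for i, p in enumerate(pts)}

# Stand-in integers for the symbolic positions in the example:
# X-2=8, X-1=9, X=10, X+1=11, X+2=12, X+3=13, with neighbors around.
before = ranges([7, 8, 9, 10, 11, 12, 13, 14])
after = ranges([7, 8, 9, 10, 10.5, 11, 12, 13, 14])  # X' = 10.5 inserted

# Matching the thread's tables: before the insert, node X=10 serves
# [10, 13) = [X, X+3); afterwards it shrinks to [10, 12) = [X, X+2),
# node X-2=8 shrinks to [8, 10.5) = [X-2, X'), and the new node
# X'=10.5 serves [10.5, 13) = [X', X+3).
```

Steps 4 and 5 of the proposal then amount to publishing the `after` map and letting the shrunken nodes lazily drop documents outside their new ranges.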
From: Doug C. <cu...@ap...> - 2008-04-02 17:36:40

Ning Li wrote:
> Yes, we have to log and propagate deletes correctly.
>
> What I'm worried about is the impact of the version check on the index
> build performance. As you said, for general synchronization, we always
> need to check versions. After we check a database/log and decide to
> add/delete a document, we call Database's addDoc/removeDoc method.
> In this addDoc/removeDoc method, we first parse the document for
> addDoc. Then, in the same critical section, we have to check again if
> it is the latest version and apply the add/delete. Is checking the log
> for the version of a delete expensive here? And a log is not part of
> the Database abstraction, but part of RangedDatabase...

First, I think we need to add an abstract Database service-provider interface, called perhaps RangeDatabase, that's different from Database, adding methods that will be critical to good performance and that must be implemented by, e.g., HeapDatabase and LuceneDatabase.

Second, I don't yet see a way around checking versions when documents are added or deleted. The ugliest bit is that we have to keep track of the version of every document that's ever been deleted, in case a long-offline node comes online and reports a stale addition. That table could grow without bound. Sigh. Do you see a way around this?

Perhaps a node could discard old deletions after a time, keeping track of the log entry number of the oldest retained deletion. Attempts to sync starting with an older entry number should be rejected and should trigger a complete copy-based replacement of the stale index. The hazard is that, if a document is added to a single node, then that node goes offline for a long time, then, when it comes online, the addition will be lost. Not great.

Doug

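The tombstone scheme Doug describes — remembering the version of every deleted document so a stale add replayed by a long-offline node can be recognized and dropped — can be sketched like this (Python for brevity; Bailey is Java, and `VersionedIndex` and its methods are invented names, not project API):

```python
class VersionedIndex:
    """Sketch of versioned adds/deletes with tombstones. Without the
    tombstone table, a stale add replayed after a delete would
    resurrect the document."""

    def __init__(self):
        self.docs = {}        # id -> version of the live document
        self.tombstones = {}  # id -> version at which it was deleted

    def apply_add(self, doc_id, version):
        # A stale add loses to a newer tombstone or a newer live doc.
        if self.tombstones.get(doc_id, -1) >= version:
            return False
        if self.docs.get(doc_id, -1) >= version:
            return False
        self.docs[doc_id] = version
        self.tombstones.pop(doc_id, None)  # the add supersedes the delete
        return True

    def apply_delete(self, doc_id, version):
        if self.docs.get(doc_id, -1) > version:
            return False   # delete is older than the indexed doc: ignore
        self.docs.pop(doc_id, None)
        self.tombstones[doc_id] = version
        return True

idx = VersionedIndex()
idx.apply_add("X", 1)
idx.apply_delete("X", 2)
late = idx.apply_add("X", 1)   # long-offline node replays the stale add
```

The unbounded-growth worry is visible here: `tombstones` never shrinks on its own. Doug's mitigation is to prune it after a time and record the log entry number of the oldest retained deletion, rejecting any sync that starts earlier and forcing a full copy instead.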
From: Ning L. <nin...@gm...> - 2008-04-02 16:26:51

On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
> I think we should add a Range element to Query that narrows it. But we
> first need to define what it means in terms of other public API
> elements. I think we define it in terms of the document's "position"
> field, which is the hashCode of its id by default, but can be explicitly
> specified. Does that sound right?

I realize we'll have to add getDoc(String id, int position) in addition to adding search(Range range, Query q, int maxHits) to Database?

Ning

From: Ning L. <nin...@gm...> - 2008-04-02 16:24:00

On Wed, Apr 2, 2008 at 11:49 AM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
>> On Mon, Mar 31, 2008 at 6:03 PM, Doug Cutting <cu...@ap...> wrote:
>>> The naive approach is simply to db.getDoc() and check the version, no?
>>
>> Yes. The current implementation does that. I was worried about
>> its performance because I thought db.getDoc() means retrieving
>> a full document. But you mentioned the semantics of retrieving
>> the outline of a document. That would work. It's time to have
>> db.getDoc(id) and db.getDoc(id, fields)?
>
> Maybe we need 'getVersion()' for this, that returns a special value for
> deleted documents?
>
>> Do we index a delete?
>
> No, but we need to log them, and, when replaying logs, decide whether to
> execute them. If you delete a doc on one node and subsequently add it
> on another then, when the deletion is propagated to the former, it should
> be ignored. But if you just delete it then the propagated delete should
> be executed. We distinguish the cases by comparing the version in the
> deletion log event to the version in the index -- just like for adds.

Yes, we have to log and propagate deletes correctly.

What I'm worried about is the impact of the version check on the index build performance. As you said, for general synchronization, we always need to check versions. After we check a database/log and decide to add/delete a document, we call Database's addDoc/removeDoc method. In this addDoc/removeDoc method, we first parse the document for addDoc. Then, in the same critical section, we have to check again if it is the latest version and apply the add/delete. Is checking the log for the version of a delete expensive here? And a log is not part of the Database abstraction, but part of RangedDatabase...

Ning

From: Doug C. <cu...@ap...> - 2008-04-02 15:49:14
|
Ning Li wrote:
> On Mon, Mar 31, 2008 at 6:03 PM, Doug Cutting <cu...@ap...> wrote:
>> The naive approach is simply to db.getDoc() and check the version, no?
>
> Yes. The current implementation does that. I was worried about
> its performance because I thought db.getDoc() means retrieving
> a full document. But you mentioned the semantics of retrieving
> the outline of a document. That would work. Is it time to have
> db.getDoc(id) and db.getDoc(id, fields)?

Maybe we need 'getVersion()' for this, that returns a special value for
deleted documents?

> Do we index a delete?

No, but we need to log them, and, when replaying logs, decide whether to
execute them. If you delete a doc on one node and subsequently add it
on another, then when the deletion is propagated to the former it should
be ignored. But if you just delete it, then the propagated delete should
be executed. We distinguish the cases by comparing the version in the
deletion log event to the version in the index -- just like for adds.

Does that make sense?

Doug |
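The version-comparison rule Doug describes (ignore an add or a delete if the index already holds a newer version, and keep deletions as versioned entries) can be sketched as below. This is a minimal in-memory stand-in, not Bailey's implementation; the tie-break choice that an equal version also counts as stale is an assumption:

```java
// Last-write-wins reconciliation by version: both adds and deletes
// carry a version, and a replayed operation is ignored if the index
// already holds an entry with an equal or newer version. Deletions
// are kept as versioned tombstones so a stale add can be detected.
import java.util.HashMap;
import java.util.Map;

public class VersionedStore {
    private static final class Entry {
        final long version;
        final String doc; // null marks a deletion ("tombstone")
        Entry(long version, String doc) { this.version = version; this.doc = doc; }
    }

    private final Map<String, Entry> index = new HashMap<>();

    /** Apply an add; ignored if an equal or newer version is indexed. */
    public boolean add(String id, long version, String doc) {
        Entry e = index.get(id);
        if (e != null && e.version >= version) return false; // stale add
        index.put(id, new Entry(version, doc));
        return true;
    }

    /** Apply a delete; ignored if an equal or newer add is indexed. */
    public boolean delete(String id, long version) {
        Entry e = index.get(id);
        if (e != null && e.version >= version) return false; // stale delete
        index.put(id, new Entry(version, null));             // keep tombstone
        return true;
    }

    /** The live document, or null if absent or deleted. */
    public String get(String id) {
        Entry e = index.get(id);
        return (e == null || e.doc == null) ? null : e.doc;
    }
}
```

The tombstones are exactly the unbounded deletion table worried about elsewhere in this thread; discarding old ones is what forces the stale-sync cutoff.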
|
From: <ni...@us...> - 2008-04-02 14:59:26
|
Revision: 19
http://bailey.svn.sourceforge.net/bailey/?rev=19&view=rev
Author: ning_li
Date: 2008-04-02 07:59:32 -0700 (Wed, 02 Apr 2008)
Log Message:
-----------
Change the getDocs() method to be an implementation level API instead of a top-level API in Database.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
|
|
From: Ning L. <nin...@gm...> - 2008-04-02 00:39:50
|
On Mon, Mar 31, 2008 at 6:03 PM, Doug Cutting <cu...@ap...> wrote:
> The naive approach is simply to db.getDoc() and check the version, no?

Yes. The current implementation does that. I was worried about
its performance because I thought db.getDoc() means retrieving
a full document. But you mentioned the semantics of retrieving
the outline of a document. That would work. Is it time to have
db.getDoc(id) and db.getDoc(id, fields)?

> If the existing version is newer, then we ignore. Note that deletes
> come with a version, so if the indexed version is newer than the
> deletion, we ignore the delete, just like an add.

Do we index a delete?

> Yes, in general. We might be able to avoid it when copying new ranges
> into a node, if the log is de-duplicated. But for general
> synchronization I think we always need to check versions.
>
> So it might be worth optimizing the doc->version implementation. So the
> method to add might be db.getVersion(String id). This doesn't make sense
> for end users, so should be added to the Database API we use in a node,
> not to the base abstract class.

So we index deletes and keep them?

> Yes, that sounds reasonable. I was just questioning whether this
> belongs in the top-level Database API, or rather just in
> HostToHostProtocol. I just want to hide as much as we can from
> end-users, so that we can more freely re-arrange things underneath later.
>
> Eventually the top-level API might add a method like getDocs(String[]
> ids, String[] fields), which looks similar, but has a different purpose.

I see.

Ning |
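A minimal sketch of the getVersion idea, assuming deletions are kept as tombstones. The thread only says getVersion should return "a special value" for deleted documents; the encoding below (0 for never seen, -v for "deleted at version v", with versions assumed positive) is a hypothetical choice:

```java
// Per-node version lookup that never touches field data, so the
// add/delete version check stays cheap. Sentinel encoding is an
// assumption: 0 = never seen, negative = deleted at that version.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VersionTable {
    /** Versions are assumed positive; 0 means "never seen". */
    public static final long NEVER_SEEN = 0L;

    private final Map<String, Long> versions = new HashMap<>();
    private final Set<String> deleted = new HashSet<>();

    public void recordAdd(String id, long version) {
        versions.put(id, version);
        deleted.remove(id);
    }

    public void recordDelete(String id, long version) {
        versions.put(id, version);
        deleted.add(id);
    }

    /**
     * Version of a document without retrieving any field data.
     * A negative result -v means "deleted at version v".
     */
    public long getVersion(String id) {
        Long v = versions.get(id);
        if (v == null) return NEVER_SEEN;
        return deleted.contains(id) ? -v : v;
    }
}
```

As discussed above, this would sit on the node-internal Database API (or HostToHostProtocol), not on the end-user abstract class.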
|
From: Ning L. <nin...@gm...> - 2008-04-02 00:10:01
|
On Mon, Mar 31, 2008 at 5:35 PM, Doug Cutting <cu...@ap...> wrote:
> I'd thought mostly in terms of adding/removing. If a hot spot is
> identified on the ring, then it should be split by adding a new node in
> its range. The node can be allocated by finding a cool spot and
> removing a node there.
>
> But moving a node from a hot host to a cooler host might be faster,
> since the index could be copied rather than reconstructed from logs.
> However this might not be enough to handle uneven loading of the ring.
> For that we really need to be able to change the partitioning by placing
> new nodes in hot spots.
I thought both types were useful, and I started with the simpler one.
But support for adding/removing nodes is definitely necessary.
> That sounds right. In terms of hostToMaster protocol, there are four
> messages:
> - Master tells B to start copying N (in response to heartbeat)
> - B tells Master that it now serves N
> - Master tells A to remove N (in heartbeat response)
> - A tells Master it no longer serves N
>
> The master will have to keep track of what moves are in flight, so that
> it doesn't schedule too many in one region and risk losing data, right?
Yes. For simplicity and to reduce the risk of losing data, no other
move or add/remove should be started for the nodes overlapping
with node N.
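The four messages and the no-overlapping-moves rule can be sketched as below. All class and method names are hypothetical; the overlap test uses plain integer distance between ring positions (wraparound ignored), and a real master would persist this state rather than hold it in memory:

```java
// Master-side bookkeeping for moving a node N from host A to host B.
// The master refuses to start a move whose node overlaps (is within
// the replication level of) a move already in flight, so overlapping
// ranges are never in motion at once and data-loss risk is bounded.
import java.util.HashSet;
import java.util.Set;

public class MoveTracker {
    /** The four host/master messages described in the thread. */
    public enum Message {
        START_COPY,        // master -> B, in heartbeat response
        NOW_SERVING,       // B -> master
        REMOVE_NODE,       // master -> A, in heartbeat response
        NO_LONGER_SERVING  // A -> master
    }

    private final Set<Integer> inFlight = new HashSet<>(); // positions being moved
    private final int replication; // nodes within this distance overlap

    public MoveTracker(int replication) {
        this.replication = replication;
    }

    /** Begin moving the node at this ring position, unless an
     *  overlapping move is already in flight. */
    public boolean tryStartMove(int position) {
        for (int p : inFlight)
            if (Math.abs(p - position) < replication) return false;
        return inFlight.add(position);
    }

    /** Called when A reports NO_LONGER_SERVING: the move completed. */
    public void finishMove(int position) {
        inFlight.remove(position);
    }
}
```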
In designing the addition/removal or the move of a node, I think
we must achieve:
G1 There is no data loss if no host goes down in the process,
even if the replication level is 1.
And we should achieve:
G2 When a new node is added or a node adds a new range,
the node can build the database on the range from the
node(s) serving that range. That is, the node does not
have to construct the database on the range from the
log(s).
G3 When a (new) node starts to serve a new range, it
knows where to start processing the log entries from
the overlapping nodes of the range.
G2 and G3 aim to avoid processing a log from the
beginning, which becomes less feasible as a log grows.
If we achieve G2 and G3, it becomes possible to remove
old log entries.
In the design on how to move a node:
- We achieve G2 by B copying a checkpoint of N from A.
- We achieve G3 by B also copying the starting entry
numbers of the overlapping nodes as part of the
checkpoint. That tells B where to start processing the
log entries from the overlapping nodes.
- We achieve G1 by, after A stops serving N and stops
processing logs from the nodes overlapping N,
N and its overlapping nodes continuing to process
N's log on A until all entries are processed. Only then
can A remove N and its log.
> I think I'd rather see us implement add/remove balancing rather than
> copying first, since, as mentioned above, I think it is more general.
> It also means that there's much less chance that the same range/node
> will exist on two hosts at once.
>
> The communication with the master would be much the same. Assume that N
> is an underloaded node, and that M is an overloaded point on the ring.
> 1. Master tells N's neighbors to expand to cover N
> 2. N's neighbors tell Master that N is now redundant
> 3. Master tells A to stop serving N
> 4. Master tells A to start serving M
> 5. A tells master it's now serving M
>
> The master must keep track of what nodes it is trying to decommission,
> and avoid trying to decommission more than one in any overlapping region
> at a time. It also has to keep track of new-node requests that it has
> made, so that it doesn't issue them to multiple nodes. Heartbeats
> should refresh this state, i.e., a host should list its pending ranges
> as well as its complete ranges. I think that's sufficient state for the
> master. Does that sound right?
Would it be clearer if we describe adding a node and
removing a node separately? I think it's important to
figure out how G1, G2 & G3 above are achieved in the
design.
Assume the replication level is 3. Let's look at adding
a node first. Assume a new node X' is added at the
position X' between X and X+1 and assigned to host A.
The neighboring nodes will be affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
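The before/after assignments follow one rule: with replication level R, the node at a position serves from that position up to (but excluding) the R-th next position on the ring. A sketch with small integer positions standing in for X-2 ... X+3 (the modulo index handles wrapping around the sorted array, but splitting a wrapped range is not modeled):

```java
// Range served by each node on the ring, given the sorted node
// positions and replication level r. Inserting a new position X'
// automatically shrinks the ranges of the r nodes before it, which
// is exactly the before/after shift shown above. Illustrative only.
public class RingRanges {
    /**
     * Returns {start, end} of the half-open range served by
     * positions[i]: from its own position to the r-th next position.
     */
    public static int[] servedRange(int[] positions, int i, int r) {
        return new int[] { positions[i], positions[(i + r) % positions.length] };
    }
}
```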
The process works as follows:
1 Master tells host A to start serving node X'
2 Host A copies [X', X+3) from node X (or other
node(s) which cover(s) [X', X+3))
2.1 Host A records the starting entry numbers of
the overlapping nodes of node X
2.2 Host A retrieves documents in range [X', X+3)
from node X (a new functionality)
2.3 Host A builds node X' from retrieved documents
2.4 Host A processes log entries from the
overlapping nodes starting from the entry
numbers recorded in 2.1
3 Host A tells master it's now serving node X'
4 Master changes the map and tells X, X-1, X-2
their new smaller ranges
5 Nodes X, X-1, X-2 gradually remove the
documents no longer in their ranges
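Steps 2.1 and 2.4 are the heart of G3: record where each overlapping node's log currently ends before the bulk copy, then replay only entries appended after that point. A toy in-memory stand-in (the real log format and the copy in steps 2.2-2.3 are elided; all names are hypothetical):

```java
// New-host bootstrap bookkeeping: record per-node log positions
// before copying (step 2.1), then replay only what arrived during
// the copy (step 2.4), so logs never need replaying from the start.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RangeBootstrap {
    /** A source node's log; entries are just doc ids in this sketch. */
    private final Map<String, List<String>> logs = new HashMap<>();

    public void append(String node, String docId) {
        logs.computeIfAbsent(node, k -> new ArrayList<>()).add(docId);
    }

    /** Step 2.1: record the current end of each overlapping node's log. */
    public Map<String, Integer> recordStartingEntries(List<String> overlapping) {
        Map<String, Integer> starts = new HashMap<>();
        for (String n : overlapping)
            starts.put(n, logs.getOrDefault(n, List.of()).size());
        return starts;
    }

    /** Step 2.4: replay only entries appended since the recorded points. */
    public List<String> replaySince(Map<String, Integer> starts) {
        List<String> replayed = new ArrayList<>();
        for (Map.Entry<String, Integer> e : starts.entrySet()) {
            List<String> log = logs.getOrDefault(e.getKey(), List.of());
            replayed.addAll(log.subList(e.getValue(), log.size()));
        }
        return replayed;
    }
}
```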
Step 2.2 achieves G2 and step 2.1 achieves G3.
Achieving G1 is a bit complicated but possible.
The process to remove a node is similar and again
the algorithm to achieve G1 is complicated.
What do you think?
The algorithm to achieve G1 can be simpler if we
change to a simpler replication scheme - instead
of nodes having overlapping ranges, nodes do not
overlap and there are R replicas for each node.
The down side is that when we add a new node,
we need to create R replicas for the new node.
> The master might try to keep the cluster less than fully occupied so
> that it can more quickly respond to hotspots and failures. For example,
> if each host can serve five nodes, it might normally use only four, so
> that each always has a spare slot. Then the above steps would happen as
> 4, 5, 1, 2, 3, first addressing the hot spot, then the cool spot.
Agree.
Cheers,
Ning
|