From: Doug C. <cu...@ap...> - 2008-03-31 22:15:14

Ning Li wrote:
> It is a single propagator per host. But the question is, when should
> the synchronizer be started?

Hosts will both (1) respond to RPCs and (2) have a control loop. So maybe we should have them implement Runnable. The RPC listener could be started at the beginning of the run() implementation, and stopped when it exits. Then programs (like unit tests) can manage them using the Runnable API & Thread.interrupt(). They can also have a main() method that runs them in the foreground, for command-line use.

Does that work?

Doug
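The Runnable-host scheme Doug describes could look roughly like this. This is an illustrative sketch only; the class and method names are invented here and are not the actual bailey API:

```java
// Sketch: the RPC listener lives only for the duration of run(), and
// Thread.interrupt() is the one way to ask the host to stop.
class Host implements Runnable {
    private volatile boolean listening;

    private void startRpcListener() { listening = true; }   // stand-in for real RPC setup
    private void stopRpcListener()  { listening = false; }  // stand-in for real teardown

    public boolean isListening() { return listening; }

    @Override
    public void run() {
        startRpcListener(); // started at the beginning of run()
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // control loop: synchronize with neighbors, heartbeat the master, ...
                Thread.sleep(1000);
            }
        } catch (InterruptedException e) {
            // treat interruption as the signal to exit
            Thread.currentThread().interrupt();
        } finally {
            stopRpcListener(); // stopped when run() exits
        }
    }

    // command-line entry point: run in the foreground
    public static void main(String[] args) {
        new Host().run();
    }
}
```

A unit test would then manage the host with `Thread t = new Thread(host); t.start(); ... t.interrupt(); t.join();`.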
|
From: Doug C. <cu...@ap...> - 2008-03-31 22:04:08

Ning Li wrote:
>> - the point where we have a new log event from a neighbor, and need to
>> resolve it against ourselves seems like a good point for a method call.
>
> The database should decide whether a version of a doc already
> exists, right? Should we add a method to the database class?

The naive approach is simply to db.getDoc() and check the version, no? If the existing version is newer, then we ignore. Note that deletes come with a version, so if the indexed version is newer than the deletion, we ignore the delete, just like an add.

> Also, there will be a delay between checking if a version of a doc
> already exists and adding the version of the doc into the database
> (after retrieving it from a neighbor). Does this mean we always
> have to check if a version of a doc already exists before adding it?

Yes, in general. We might be able to avoid it when copying new ranges into a node, if the log is de-duplicated. But for general synchronization I think we always need to check versions. So it might be worth optimizing the doc->version implementation. So the method to add might be db.getVersion(String id). This doesn't make sense for end users, so it should be added to the Database API we use in a node, not to the base abstract class.

> I thought getDoc() means getting the full document. I want to use
> getDocs() in log processing - after we process a number of log
> entries from a node and identify the docs we should retrieve, we
> call getDocs() to get those docs. What do you think?

Yes, that sounds reasonable. I was just questioning whether this belongs in the top-level Database API, or rather just in HostToHostProtocol. I just want to hide as much as we can from end users, so that we can more freely re-arrange things underneath later. Eventually the top-level API might add a method like getDocs(String[] ids, String[] fields), which looks similar, but has a different purpose.

Doug
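The version-based conflict resolution under discussion can be sketched as follows. The class and method names (including getVersion()) are the suggestion being discussed above, not a settled API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a log entry -- add or delete, both carry a version -- is
// applied only if its version is newer than whatever the database
// already holds for that id.
class VersionedDb {
    private final Map<String, Long> versions = new HashMap<>();

    /** Version currently held for id, or -1 if the id is unknown. */
    public long getVersion(String id) {
        return versions.getOrDefault(id, -1L);
    }

    /** True if an entry with this version should be applied. */
    public boolean isNewer(String id, long version) {
        return version > getVersion(id);
    }

    /** Apply an add (or a delete, which also carries a version). */
    public boolean apply(String id, long version) {
        if (!isNewer(id, version)) {
            return false; // existing version is newer (or equal): ignore
        }
        versions.put(id, version);
        return true;
    }
}
```

Because there is a delay between the check and the add, the check has to be repeated (or done atomically) at apply time, which is why it runs inside apply() here.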
|
From: Doug C. <cu...@ap...> - 2008-03-31 21:41:34

Ning Li wrote:
> There are two types of load balancing: moving a node from one
> host to another, and adding/removing a node. Let's see if we
> can agree on the simpler one first.

I'd thought mostly in terms of adding/removing. If a hot spot is identified on the ring, then it should be split by adding a new node in its range. The node can be allocated by finding a cool spot and removing a node there. But moving a node from a hot host to a cooler host might be faster, since the index could be copied rather than reconstructed from logs. However, this might not be enough to handle uneven loading of the ring. For that we really need to be able to change the partitioning by placing new nodes in hot spots.

> Protocol for moving a node (N) from one host (A) to another (B):
> Note: No other load balancing can be started for the nodes
> overlapping with node N.
> 1 Copy the latest checkpoint C of node N to host B.
> 2 Node N on host B has an empty log. It starts to process logs
>   of the nodes with overlapping ranges, including node N on
>   host A.
> 3 The master commits the change to the ring - officially node N
>   is on host B, not on host A.
> 4 Node N on host A stops processing logs of the nodes with
>   overlapping ranges.
> 5 After all the overlapping nodes finish processing node N's
>   log on host A, node N is removed from host A.
>
> Step 3 is the commit point. If host A or host B goes down
> before that, the move fails. Otherwise, the move succeeds.
> Should the move also be considered failed if any overlapping
> node goes down before the commit point, to reduce the
> complexity?

That sounds right.

In terms of hostToMaster protocol, there are four messages:

- Master tells B to start copying N (in response to heartbeat)
- B tells Master that it now serves N
- Master tells A to remove N (in heartbeat response)
- A tells Master it no longer serves N

The master will have to keep track of what moves are in flight, so that it doesn't schedule too many in one region and risk losing data, right?

I think I'd rather see us implement add/remove balancing rather than copying first, since, as mentioned above, I think it is more general. It also means that there's much less chance that the same range/node will exist on two hosts at once. The communication with the master would be much the same. Assume that N is an underloaded node, and that M is an overloaded point on the ring.

1. Master tells N's neighbors to expand to cover N
2. N's neighbors tell Master that N is now redundant
3. Master tells A to stop serving N
4. Master tells A to start serving M
5. A tells master it's now serving M

The master must keep track of what nodes it is trying to decommission, and avoid trying to decommission more than one in any overlapping region at a time. It also has to keep track of new-node requests that it has made, so that it doesn't issue them to multiple nodes. Heartbeats should refresh this state, i.e., a host should list its pending ranges as well as its complete ranges. I think that's sufficient state for the master. Does that sound right?

The master might try to keep the cluster less than fully occupied so that it can more quickly respond to hotspots and failures. For example, if each host can serve five nodes, it might normally use only four, so that each always has a spare slot. Then the above steps would happen as 4, 5, 1, 2, 3, first addressing the hot spot, then the cool spot.

Doug
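The master-side bookkeeping Doug describes (track in-flight moves, never decommission two overlapping regions at once) could be sketched like this. All names are hypothetical, and the sketch uses a simple non-wrapping range for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the master records which ring regions have a rebalancing
// operation in flight and refuses to start another one that overlaps,
// so it never risks losing data by decommissioning two neighbors at once.
class RebalanceTracker {
    /** A half-open range [begin, end) on a non-wrapping part of the ring. */
    static final class Region {
        final int begin, end;
        Region(int begin, int end) { this.begin = begin; this.end = end; }
        boolean overlaps(Region o) { return begin < o.end && o.begin < end; }
    }

    private final List<Region> inFlight = new ArrayList<>();

    /** Try to start a move or decommission touching this region. */
    public boolean tryStart(Region r) {
        for (Region active : inFlight) {
            if (active.overlaps(r)) return false; // another op already in flight here
        }
        inFlight.add(r);
        return true;
    }

    /** Called when a heartbeat confirms the operation finished (or failed). */
    public void finish(Region r) {
        inFlight.removeIf(active -> active.begin == r.begin && active.end == r.end);
    }
}
```

Heartbeats refreshing pending and complete ranges would be the natural place to call finish().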
|
From: Ning L. <nin...@gm...> - 2008-03-31 20:39:50

First, to summarize the terms used:

1 A node consists of a key/hash range, a database, and a log.
2 A node periodically processes logs of the nodes with overlapping
  ranges. A node remembers the entry number Ei it processes of an
  overlapping node Ni and would start from Ei+1 next time.
3 A node checkpoint consists of the node's range, a database checkpoint,
  a log entry number E, and the log entry numbers Ei for the overlapping
  nodes. A checkpoint provides a consistent view of the node.

There are two types of load balancing: moving a node from one host to another, and adding/removing a node. Let's see if we can agree on the simpler one first.

Protocol for moving a node (N) from one host (A) to another (B):
Note: No other load balancing can be started for the nodes overlapping with node N.

1 Copy the latest checkpoint C of node N to host B.
2 Node N on host B has an empty log. It starts to process logs of the
  nodes with overlapping ranges, including node N on host A.
3 The master commits the change to the ring - officially node N is on
  host B, not on host A.
4 Node N on host A stops processing logs of the nodes with overlapping
  ranges.
5 After all the overlapping nodes finish processing node N's log on
  host A, node N is removed from host A.

Step 3 is the commit point. If host A or host B goes down before that, the move fails. Otherwise, the move succeeds. Should the move also be considered failed if any overlapping node goes down before the commit point, to reduce the complexity?

Sounds right?

Ning
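The per-neighbor log cursor in point 2 above can be sketched as follows (names are hypothetical, not the bailey API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a node remembers the last entry number Ei it has processed
// from each overlapping node Ni, and resumes from Ei + 1 next time.
// The map of Ei values is exactly what a checkpoint would persist.
class LogCursors {
    private final Map<String, Long> lastProcessed = new HashMap<>();

    /** Entry number to start from next time for neighbor ni. */
    public long nextEntry(String ni) {
        return lastProcessed.getOrDefault(ni, -1L) + 1;
    }

    /** Record that entries up to e from neighbor ni have been processed. */
    public void processedUpTo(String ni, long e) {
        lastProcessed.put(ni, e);
    }
}
```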
|
From: <ni...@us...> - 2008-03-31 16:51:25

Revision: 18
http://bailey.svn.sourceforge.net/bailey/?rev=18&view=rev
Author: ning_li
Date: 2008-03-31 09:51:31 -0700 (Mon, 31 Mar 2008)
Log Message:
-----------
1 Move RangeResults to the ddb implementation package. The top-level package has the end-user API, and RangeResults are an implementation detail.
2 Follow Hadoop's way of stopping threads: use Thread.interrupt() rather than setting a flag. In the thread, always treat InterruptedException as a signal to exit. The run() loop checks !this.isInterrupted().
3 Some possible name improvements: Tuple -> NodeState, logMap -> neighborMap, propagator -> synchronizer.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/HostToMasterProtocol.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/test/org/apache/bailey/TestSimpleDbWithLog.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/ddb/RangeResults.java
Removed Paths:
-------------
trunk/src/java/org/apache/bailey/RangeResults.java
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
|
From: Ning L. <nin...@gm...> - 2008-03-31 16:06:46

On Mon, Mar 31, 2008 at 11:55 AM, Ning Li <nin...@gm...> wrote:
> On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
> > - Should we have a single propagator per host, instead of per node?
> > That would conserve calls to the master, and a single propagation thread
> > would throttle things, so that indexing doesn't overwhelm search
> > performance. OTOH, we might sometimes want to propagate changes faster
> > than a single thread can. But that's probably better dealt with
> > explicitly rather than having a thread per node...
>
> Agree. That part needs to be optimized - e.g. currently the synchronizer
> retrieves one doc at a time after processing the log.

It is a single propagator per host. But the question is, when should the synchronizer be started?

Ning
|
From: Ning L. <nin...@gm...> - 2008-03-31 15:55:20

On Fri, Mar 28, 2008 at 5:42 PM, Doug Cutting <cu...@ap...> wrote:
> - RangeResults belongs in the ddb implementation package. The top-level
> package has the end-user API, and RangeResults are an implementation detail.

Agree. I'll make the change.

> - in Hadoop, the way we handle threads is to stop them with
> Thread.interrupt() rather than by setting a flag. In the thread, always
> treat InterruptedException as a signal to exit. The run() loop should
> check !this.isInterrupted().

OK. We can use the same practice.

> - shouldn't the host get the hostMap from the master? And shouldn't it
> periodically refresh both the hostMap and the logMap from the master?
> For this, and instead of using ClientToMaster within a host, we need to
> add a method to the HostToMaster protocol that returns a Mapper for the
> subset of the ring that concerns the calling node or host. Initially
> it might return the full mapper, but, eventually, it should only
> transmit the node's neighborhood, or perhaps the host's neighborhoods.

Agree on all the points. I wrote the withlog package with the minimal functionality, just enough to demonstrate the use of the log interface. Much more needs to be done for host-to-master interactions. I figure we should come up with a detailed design for how to carry out the two types of load balancing - moving a node from one host to another, and adding/removing a node?

> - Should we have a single propagator per host, instead of per node?
> That would conserve calls to the master, and a single propagation thread
> would throttle things, so that indexing doesn't overwhelm search
> performance. OTOH, we might sometimes want to propagate changes faster
> than a single thread can. But that's probably better dealt with
> explicitly rather than having a thread per node...

Agree. That part needs to be optimized - e.g. currently the synchronizer retrieves one doc at a time after processing the log.

> - some possible name improvements:
> Tuple -> NodeState ?
> logMap -> overlappingNodes or neighbors?
> propagator -> retriever? synchronizer?

I like better names. :) I said very early on that I'm not good at names. :)

> - the point where we have a new log event from a neighbor, and need to
> resolve it against ourselves seems like a good point for a method call.

The database should decide whether a version of a doc already exists, right? Should we add a method to the database class?

Also, there will be a delay between checking if a version of a doc already exists and adding the version of the doc into the database (after retrieving it from a neighbor). Does this mean we always have to check if a version of a doc already exists before adding it?

> I think we should add a Range element to Query that narrows it. But we
> first need to define what it means in terms of other public API
> elements. I think we define it in terms of the document's "position"
> field, which is the hashCode of its id by default, but can be explicitly
> specified. Does that sound right?

Sounds good.

> Does getDocs() need to be in the top-level application API? At some
> point we need to distinguish between full documents and "outline"
> documents. E.g., if we're storing full-text then we don't want to
> transmit that to search clients when they're just displaying hits. We
> might, e.g., add a list of fields to be retrieved to Query. But I don't
> yet see a case where an application will need to fetch a set of
> documents by id. Except for search results, one-at-a-time access will
> be more typical, no?

I thought getDoc() means getting the full document. I want to use getDocs() in log processing - after we process a number of log entries from a node and identify the docs we should retrieve, we call getDocs() to get those docs. What do you think?

Ning
|
From: Doug C. <cu...@ap...> - 2008-03-28 21:42:11

ni...@us... wrote:
> 1 Add the protocol and the classes related to log propagation. Also add
> a simple implementation in the withlog package and a test case
> TestSimpleDbWithLog.

This stuff looks great! You're a tour de force! A few minor comments:

- RangeResults belongs in the ddb implementation package. The top-level package has the end-user API, and RangeResults are an implementation detail.

- In Hadoop, the way we handle threads is to stop them with Thread.interrupt() rather than by setting a flag. In the thread, always treat InterruptedException as a signal to exit. The run() loop should check !this.isInterrupted().

- Shouldn't the host get the hostMap from the master? And shouldn't it periodically refresh both the hostMap and the logMap from the master? For this, and instead of using ClientToMaster within a host, we need to add a method to the HostToMaster protocol that returns a Mapper for the subset of the ring that concerns the calling node or host. Initially it might return the full mapper, but, eventually, it should only transmit the node's neighborhood, or perhaps the host's neighborhoods.

- Should we have a single propagator per host, instead of per node? That would conserve calls to the master, and a single propagation thread would throttle things, so that indexing doesn't overwhelm search performance. OTOH, we might sometimes want to propagate changes faster than a single thread can. But that's probably better dealt with explicitly rather than having a thread per node...

- Some possible name improvements:
  Tuple -> NodeState ?
  logMap -> overlappingNodes or neighbors?
  propagator -> retriever? synchronizer?

- The point where we have a new log event from a neighbor, and need to resolve it against ourselves seems like a good point for a method call.

> 2 Add the RangedDatabase class which contains NodeStatus, Database and Log.
> 3 Add "getDocs" to the Database class to retrieve a number of documents.
> This will be used to improve performance during the log propagation.
> Q: Should Database be aware of Range to support filtered queries based
> on Range? Or do we make RangedDatabase add a clause to a query before
> passing it down to Database?

I think we should add a Range element to Query that narrows it. But we first need to define what it means in terms of other public API elements. I think we define it in terms of the document's "position" field, which is the hashCode of its id by default, but can be explicitly specified. Does that sound right?

Does getDocs() need to be in the top-level application API? At some point we need to distinguish between full documents and "outline" documents. E.g., if we're storing full-text then we don't want to transmit that to search clients when they're just displaying hits. We might, e.g., add a list of fields to be retrieved to Query. But I don't yet see a case where an application will need to fetch a set of documents by id. Except for search results, one-at-a-time access will be more typical, no?

Sorry I've not been more involved this week...

Doug
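The "position" idea Doug proposes (hashCode of the id by default, narrowed by a Range on Query) can be sketched as follows. The names here are illustrative, and the sketch uses a simple non-wrapping range; the ring's wrap-around case is handled by the Range class itself:

```java
// Sketch: a document's position on the ring defaults to the hash of
// its id, and a Range added to a Query keeps only documents whose
// position falls inside it.
class Position {
    /** Default position: the hash of the document id. */
    static int of(String id) {
        return id.hashCode();
    }

    /** Would a range filter [begin, end) keep this document? */
    static boolean inRange(String id, int begin, int end) {
        int p = of(id);
        return begin <= p && p < end;
    }
}
```

On Lucene, the naive implementation mentioned above would index this value as a field and constrain queries with a range filter over it.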
|
From: <ni...@us...> - 2008-03-28 20:21:07

Revision: 17
http://bailey.svn.sourceforge.net/bailey/?rev=17&view=rev
Author: ning_li
Date: 2008-03-28 13:21:12 -0700 (Fri, 28 Mar 2008)
Log Message:
-----------
1 Add the protocol and the classes related to log propagation. Also add a simple implementation in the withlog package and a test case TestSimpleDbWithLog.
2 Add the RangedDatabase class which contains NodeStatus, Database and Log.
3 Add "getDocs" to the Database class to retrieve a number of documents. This will be used to improve performance during the log propagation. Q: Should Database be aware of Range to support filtered queries based on Range? Or do we make RangedDatabase add a clause to a query before passing it down to Database?
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/Database.java
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/Range.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleMaster.java
trunk/src/test/org/apache/bailey/TestHeapDb.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/ddb/HostProtocol.java
trunk/src/java/org/apache/bailey/ddb/IndexAction.java
trunk/src/java/org/apache/bailey/ddb/Log.java
trunk/src/java/org/apache/bailey/ddb/LogEntry.java
trunk/src/java/org/apache/bailey/ddb/RangedDatabase.java
trunk/src/java/org/apache/bailey/ddb/withlog/
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleHost.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleLog.java
trunk/src/java/org/apache/bailey/ddb/withlog/SimpleMaster.java
trunk/src/java/org/apache/bailey/util/HashUtil.java
trunk/src/test/org/apache/bailey/TestSimpleDbWithLog.java
Removed Paths:
-------------
trunk/src/java/org/apache/bailey/ddb/Logger.java
|
From: <ni...@us...> - 2008-03-27 17:07:16

Revision: 16
http://bailey.svn.sourceforge.net/bailey/?rev=16&view=rev
Author: ning_li
Date: 2008-03-27 09:57:55 -0700 (Thu, 27 Mar 2008)
Log Message:
-----------
Modify the ring to be a real ring. The ring was actually [Integer.MIN_VALUE, Integer.MAX_VALUE) before.
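What "a real ring" implies for range membership can be sketched as follows. The real Range class lives in org.apache.bailey.ddb; this stand-in is only illustrative:

```java
// Sketch: when begin > end the range wraps past Integer.MAX_VALUE
// around to Integer.MIN_VALUE, which the old flat interval
// [Integer.MIN_VALUE, Integer.MAX_VALUE) could not express.
class RingRange {
    final int begin, end; // half-open: [begin, end)

    RingRange(int begin, int end) { this.begin = begin; this.end = end; }

    boolean contains(int p) {
        if (begin <= end) {
            return begin <= p && p < end; // ordinary interval
        }
        return begin <= p || p < end;     // wrapped interval
    }
}
```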
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/Range.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/java/org/apache/bailey/util/Pair.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
Added Paths:
-----------
trunk/src/test/org/apache/bailey/ddb/
trunk/src/test/org/apache/bailey/ddb/TestRange.java
trunk/src/test/org/apache/bailey/ddb/TestRing.java
|
From: <ni...@us...> - 2008-03-21 23:15:44

Revision: 15
http://bailey.svn.sourceforge.net/bailey/?rev=15&view=rev
Author: ning_li
Date: 2008-03-21 16:15:50 -0700 (Fri, 21 Mar 2008)
Log Message:
-----------
Change the type of begin/end in Range to integer. I forgot it in the last change. I still dream of strings in the future. :) Also add getCoverage(Range) to the Ring API.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/ddb/Mapper.java
trunk/src/java/org/apache/bailey/ddb/Range.java
trunk/src/java/org/apache/bailey/ddb/Ring.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/test/org/apache/bailey/Generator.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
|
From: <ni...@us...> - 2008-03-21 15:36:02

Revision: 14
http://bailey.svn.sourceforge.net/bailey/?rev=14&view=rev
Author: ning_li
Date: 2008-03-21 08:36:06 -0700 (Fri, 21 Mar 2008)
Log Message:
-----------
Change search() in the Client-to-Host protocol to return an array of RangeResults. Make id in NodeID a long and change HostID to be a name and a port.
Modified Paths:
--------------
trunk/src/java/org/apache/bailey/ddb/ClientToHostProtocol.java
trunk/src/java/org/apache/bailey/ddb/Host.java
trunk/src/java/org/apache/bailey/ddb/HostID.java
trunk/src/java/org/apache/bailey/ddb/NodeID.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleClient.java
trunk/src/java/org/apache/bailey/ddb/simple/SimpleHost.java
trunk/src/test/org/apache/bailey/TestSimpleDb.java
Added Paths:
-----------
trunk/src/java/org/apache/bailey/RangeResults.java
|
From: Ning L. <nin...@gm...> - 2008-03-20 23:35:20

On Wed, Mar 19, 2008 at 4:13 PM, Doug Cutting <cu...@ap...> wrote:
> Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the
> following reasons:
>
> - performance: by keeping indexes local search & indexing will be faster
> - reliability: bailey replicates already, so hdfs replication is redundant
> - continuous growth: consistent hashing lets us add and remove nodes
> without fundamentally changing the way the index is partitioned. a
> host-independent partitioning in HDFS would be too static.

Don't we still need something like consistent hashing or range partitioning to partition the index even if HDFS is used? What is a static host-independent partitioning in HDFS?

> He countered:
> - for decent search performance, the majority of the index must be in
> memory anyway. i conceded that much of the benefit of local indexes
> might come from the filesystem buffer cache, which hdfs lacks.

If HDFS is used and a partition index cannot fit in memory, caching in the local file system (or a complete copy in the local FS) is definitely necessary.

> - for decent indexing performance, we could persist only logs + index
> checkpoints to HDFS (once it supports append).

I'll do some calculation later in this email.

> - even consistent hashing will require the master to be somewhat
> involved in indexing as nodes are added and removed. is that really
> inherently more complicated than having the master dole out
> subdirectories from a central hdfs repository, merging and splitting
> them as needed?

Using HDFS won't be more complicated.

Now let's compute how much data is stored using HDFS vs. using the local FS. Let's assume the replication level in HDFS is h, and the read/write replication level for the index is r. A document d means the original (text) copy of the document. The index form of document d means the inverted document in the index (typically much smaller than the original copy).

Case 1: a simple example case. In my design of a scalable indexer on HDFS, only batch update is supported, and a batch update is carried out using a Map/Reduce job. No logging is required. So we have h+r copies for each document in its index form - assume each index server caches a copy in its local FS.

Case 2: Now, what about a scalable index supporting incremental update using HDFS? If multiple readers of a partition share the same copy of the partition in HDFS (which is possible if single-writer-multi-reader), we also have h+r copies for each document in its index form. In addition, we'll have h copies for each document in the log if we don't want to lose data. So the total is h+r copies in index form plus h copies of a doc.

Case 3: If multiple readers of a partition do not share the same copy of the partition in HDFS (as in Bailey), we'd have r * (1+h) copies for each document in its index form and r * (1+h) copies for each document.

Case 4: In the current Bailey design, we have r copies for each document in its index form and r copies for each document.

Cases 1 and 2 will have more copies for each document if the getDoc() operation is to be supported. I think cases 1, 2 and 4 all make sense. But probably not case 3.

Well, did I totally misinterpret how Bailey would use HDFS?

Ning
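The copy counts in the four cases above, worked for concrete numbers (say h = 3 HDFS replicas and r = 2 index replicas, chosen here only for illustration). This is pure arithmetic transcribed from the email; nothing below is Bailey code:

```java
// Copies of each document's index form, per case.
class CopyCount {
    static int indexCopiesCase1(int h, int r) { return h + r; }       // batch M/R indexer
    static int indexCopiesCase2(int h, int r) { return h + r; }       // shared HDFS partition
    static int indexCopiesCase3(int h, int r) { return r * (1 + h); } // unshared HDFS partitions
    static int indexCopiesCase4(int h, int r) { return r; }           // current Bailey design

    // Copies of the raw document kept for durability, per case.
    static int docCopiesCase2(int h, int r) { return h; }             // log replicated in HDFS
    static int docCopiesCase3(int h, int r) { return r * (1 + h); }
    static int docCopiesCase4(int h, int r) { return r; }
}
```

With h = 3 and r = 2, that gives 5 index copies for cases 1 and 2, 8 for case 3, and 2 for case 4, which is the gap Ning is pointing at.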
|
From: Ning L. <nin...@gm...> - 2008-03-20 21:24:17

On Thu, Mar 20, 2008 at 5:10 PM, Doug Cutting <cu...@ap...> wrote:
> Exactly, but the client needs to know what the subset is in order to
> figure out where gaps are. So maybe we should add a RangeResult
> subclass of Result that includes a Range, and change
> ClientToHost#search() to return Result[], as you indicate above. Does
> that sound right?

Yep. I think we are on the same page. :) I'll make the changes and add comments to the code.

Ning
|
From: Doug C. <cu...@ap...> - 2008-03-20 21:11:10

Ning Li wrote:
> The current Client to Host API is
> Results search(NodeID[] nodeIDs, Range[] ranges, Query q, int maxHits)
>
> It means for each node n in nodeIDs and the corresponding range r
> in ranges, execute query q on node n with range r. The range r should
> be a subset of the range that node n serves. In most cases, the range r
> is the same as the range that node n serves. An exception will be
> thrown if either node n is not on the host or range r is not a subset
> of the range that node n serves. I probably should change the return
> type to Results[] and let the client decide how to combine the results.

Yes, this is what I expected, but thanks for clarifying.

> Are you suggesting range r be what the client thinks node n serves?
> So if only a subset of range r is needed, the results are filtered on
> the client? And when range r is different from the range that node n
> serves, execute the query anyway and return the results with the
> actual range that node n serves?

The particular case I'm concerned about is if a node now serves a subset of the desired range. We could either have such queries fail altogether, or instead indicate what subset of the desired range was searched.

> I think returning partial results is good but is it more efficient to
> do filtering on the node/host? In this case, again, range r should
> be a subset of the range that node n serves. If range r is not a
> subset of the range that node n serves, execute the query on
> the overlap range and return the results with the overlap range.

Exactly, but the client needs to know what the subset is in order to figure out where gaps are. So maybe we should add a RangeResult subclass of Result that includes a Range, and change ClientToHost#search() to return Result[], as you indicate above. Does that sound right?

We don't want to do filtering at the client, since that will substantially complicate facet counts. All filtering must be at the node so that the client can simply sum facet counts.

Doug
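The RangeResult idea could be sketched as below. The classes and fields are hypothetical stand-ins, not the bailey API, and the gap check assumes sorted, non-wrapping ranges for simplicity:

```java
import java.util.List;

// Sketch: each per-node result reports the range it actually searched,
// so the client can sum facet counts directly and detect coverage gaps
// against the range it wanted searched.
class RangeResult {
    final int begin, end;   // half-open [begin, end) actually searched
    final int facetCount;   // facet count computed on the node, post-filtering

    RangeResult(int begin, int end, int facetCount) {
        this.begin = begin; this.end = end; this.facetCount = facetCount;
    }

    /** Sum facet counts across nodes; valid because all filtering happened on the nodes. */
    static int totalFacetCount(List<RangeResult> results) {
        int total = 0;
        for (RangeResult r : results) total += r.facetCount;
        return total;
    }

    /** True if the (sorted, non-wrapping) results cover [begin, end) with no gap. */
    static boolean covers(List<RangeResult> sorted, int begin, int end) {
        int next = begin;
        for (RangeResult r : sorted) {
            if (r.begin > next) return false; // gap: re-query [next, r.begin)
            next = Math.max(next, r.end);
        }
        return next >= end;
    }
}
```

When covers() fails, the client would consult the ring for the nodes serving the gap and re-query, as discussed in the thread below.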
|
From: Ning L. <nin...@gm...> - 2008-03-20 18:15:59

On Wed, Mar 19, 2008 at 4:00 PM, Doug Cutting <cu...@ap...> wrote:
> Should search results indicate the ranges actually searched? That way,
> if the client's Mapper is out of date, the client can detect and repair
> this.

The current Client to Host API is

Results search(NodeID[] nodeIDs, Range[] ranges, Query q, int maxHits)

It means for each node n in nodeIDs and the corresponding range r in ranges, execute query q on node n with range r. The range r should be a subset of the range that node n serves. In most cases, the range r is the same as the range that node n serves. An exception will be thrown if either node n is not on the host or range r is not a subset of the range that node n serves. I probably should change the return type to Results[] and let the client decide how to combine the results.

Are you suggesting range r be what the client thinks node n serves? So if only a subset of range r is needed, the results are filtered on the client? And when range r is different from the range that node n serves, execute the query anyway and return the results with the actual range that node n serves?

I think returning partial results is good, but is it more efficient to do the filtering on the node/host? In this case, again, range r should be a subset of the range that node n serves. If range r is not a subset of the range that node n serves, execute the query on the overlap range and return the results with the overlap range.

> To repair, given a set of search results, we need to check for gaps,
> then figure out what to re-query to fill those gaps. Should this be
> done through the Ring API?

Yep, we can use the ring to find out the nodes serving the gaps. At the same time, the client can contact the master for a new version of the ring.

> Will host ids ever be other than an ipaddress+port? Should we represent
> them that way instead? Hosts are not on the ring, so we don't need a
> numeric value here.

Agree.

> Will node ids ever be other than a large (128 bit?) numeric value?
> They need to be unique, but should be host-independent. E.g., it should
> be possible to copy a node's data to a new host and have that new host
> start serving that data.

Agree.

Ning
|
From: Yonik S. <yo...@ap...> - 2008-03-20 13:51:41
|
Performance is certainly critical to many people and they don't always have the option to just add more boxes. >From what I know of some Solr users, it is also be very important to scale down well (all the way down to 1-2 nodes). I would expect a 2 node configuration to be very popular for minimum fault tolerance. Some users are also very sensitive to RAM consumption (they apparently run on hosted servers where more memory == more $$$ per month). -Yonik On Wed, Mar 19, 2008 at 4:13 PM, Doug Cutting <cu...@ap...> wrote: > Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the > following reasons: > > - performance: by keeping indexes local search & indexing will be faster > - reliability: bailey replicates already, so hdfs replication is redundant > - continuous growth: consistent hashing lets us add and remove nodes > without fundamentally changing the way the index is partitioned. a > host-independent partitioning in HDFS would be too static. > > He countered: > - for decent search performance, the majority of the index must be in > memory anyway. i conceded that much of the benefit of local indexes > might come from the filesystem buffer cache, which hdfs lacks. > - for decent indexing performance, we could persist only logs + index > checkpoints to HDFS (once it supports append). > - even consistent hashing will require the master to be somewhat > involved in indexing as nodes are added and removed. is that really > inherently more complicated than having the master dole out > subdirectories from a central hdfs repository, merging and splitting > them as needed? > > The advantage of HDFS-based indexes is that nodes have less state. The > disadvantage is that you have to run HDFS (if you're not already), and > that performance will probably always be a bit less. I don't see a > clear advantage either way, and thus tend towards fewer dependencies and > better performance. > > Other thoughts? 
>
> Doug |
|
From: Yonik S. <yo...@ap...> - 2008-03-19 22:07:35
|
On Wed, Mar 19, 2008 at 5:38 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > I'm not exactly sure yet either... but it seems like a node does need
> > to identify all documents within certain arbitrary ranges at some
> > point for rebalancing (and perhaps for filtering too).
> > Will the hash be indexed, stored somehow, or calculated on the fly on
> > the node?
>
> The naive way to implement this on Lucene is to make the position be an
> indexed field, then use a RangeFilter to constrain queries, if that's
> what you're asking.

Right. If it's indexed, it seems advantageous to use 32 bits rather than
64 (esp thinking about the .tii file size). Longer term it might make
sense to use a column-stride field:
https://issues.apache.org/jira/browse/LUCENE-1231

> Mostly we hope that queries will span the entire
> range of an index & with no need for filtering. But sometimes the
> Filter will be needed.
>
> When replaying the log to a neighbor node we'll also need to filter by
> position range. Or just log the position along with the Id and version.
> So we'll be touching these a lot, but I don't yet see a
> case where we'd, e.g., want to create a bit vector of occupied
> positions. They'll be pretty sparse for that, even if only 32 bit.

Sorry for the confusion, I never meant that. I just meant the ability to
map from range to documents in that range on the node.

-Yonik |
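[Editor's note: the range filtering discussed here reduces to a membership test on a circular position space, which has to handle wraparound when a node's range crosses zero. A minimal sketch, assuming unsigned 32-bit positions; the class and method names are illustrative, not from the Bailey code:]

```java
// Sketch of a wraparound-aware range test on a 32-bit ring.
// Positions are ints compared as unsigned values, so the ring
// runs from 0x00000000 around to 0xFFFFFFFF and back to zero.
public class RingRange {
    private final int start; // inclusive
    private final int end;   // exclusive

    public RingRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /** True if position falls in [start, end), allowing for wraparound. */
    public boolean contains(int position) {
        long s = start & 0xFFFFFFFFL;    // treat as unsigned
        long e = end & 0xFFFFFFFFL;
        long p = position & 0xFFFFFFFFL;
        if (s <= e) {                    // ordinary range
            return s <= p && p < e;
        }
        return p >= s || p < e;          // range wraps past zero
    }
}
```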
|
From: Doug C. <cu...@ap...> - 2008-03-19 21:39:10
|
Yonik Seeley wrote:
> On Wed, Mar 19, 2008 at 4:46 PM, Doug Cutting <cu...@ap...> wrote:
>> A Document is uniquely identified by a String externalId and a long
>> version. The position is not assumed to uniquely identify it. So I'm
>> not sure where the size of the position will be significant.
>
> I'm not exactly sure yet either... but it seems like a node does need
> to identify all documents within certain arbitrary ranges at some
> point for rebalancing (and perhaps for filtering too).
> Will the hash be indexed, stored somehow, or calculated on the fly on
> the node?

The naive way to implement this on Lucene is to make the position be an
indexed field, then use a RangeFilter to constrain queries, if that's
what you're asking. Mostly we hope that queries will span the entire
range of an index & with no need for filtering. But sometimes the
Filter will be needed.

When replaying the log to a neighbor node we'll also need to filter by
position range. So we'll be touching these a lot, but I don't yet see a
case where we'd, e.g., want to create a bit vector of occupied
positions. They'll be pretty sparse for that, even if only 32 bit.

> True... I guess it depends on if those 32 bits will be used for
> anything other than a hash (or position) by the application level.

Well, we've talked of applications encoding the user in the high 29 bits
and the message in the low three. Frankly, I have a hard time seeing
where 32 bits would be a problem here. Typically what you'll want to do
is have a primary field (e.g., user) that can limit the amount of the
ring that must be queried, and trade that against the chances that a
single value of that field will overwhelm that portion of the ring. If
the master rebalances by load, this will be easier.

A single node should probably never index more than a few million items,
so if you know that you might have, e.g., 100M items with a given
primary field value, then you'd want to make sure that there are at
least 100 distinct values within that. But I think 32 bits gives plenty
of room for such things. Hey, did I just switch sides again?

Doug |
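[Editor's note: the encoding discussed here, user hash in the high 29 bits and message in the low three, keeps all of one user's documents contiguous on the ring. A sketch, with illustrative names, not Bailey's actual position computation:]

```java
// Sketch: pack a 32-bit ring position from a user hash and a message
// hash. The user occupies the high 29 bits, so one user's documents
// fall in a contiguous run of at most 2^3 = 8 positions on the ring.
public class PositionCodec {
    static int pack(int userHash, int messageHash) {
        return (userHash & 0x1FFFFFFF) << 3 | (messageHash & 0x7);
    }

    static int userBits(int position) {
        return position >>> 3;       // high 29 bits
    }

    static int messageBits(int position) {
        return position & 0x7;       // low 3 bits
    }
}
```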
|
From: Yonik S. <yo...@ap...> - 2008-03-19 21:06:19
|
On Wed, Mar 19, 2008 at 4:46 PM, Doug Cutting <cu...@ap...> wrote:
> A Document is uniquely identified by a String externalId and a long
> version. The position is not assumed to uniquely identify it. So I'm
> not sure where the size of the position will be significant.

I'm not exactly sure yet either... but it seems like a node does need
to identify all documents within certain arbitrary ranges at some point
for rebalancing (and perhaps for filtering too). Will the hash be
indexed, stored somehow, or calculated on the fly on the node?

> > 32 bits provides 4B slices of the ring!
>
> But, like IP addresses, having enough doesn't make them easy to divide.

True... I guess it depends on if those 32 bits will be used for
anything other than a hash (or position) by the application level.

-Yonik |
|
From: Doug C. <cu...@ap...> - 2008-03-19 20:46:03
|
Yonik Seeley wrote:
> Going with a long might be premature optimism that no one will need to
> create or track by hash :-)

I don't follow this. You mean create an array of all positions?

> Going from 4 to 8 bytes per document could be significant.

A Document is uniquely identified by a String externalId and a long
version. The position is not assumed to uniquely identify it. So I'm
not sure where the size of the position will be significant.

We will be slinging around representations of the ring: the client will
need to refresh its copy frequently. These will have a
<position,position> range per node, so these would get bigger. But the
ring also has to represent a host per node, plus each node's unique id
and its ring position, so the size of the range is not dominant in the
size of the ring.

> 32 bits provides 4B slices of the ring!

But, like IP addresses, having enough doesn't make them easy to divide.

Doug |
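[Editor's note: the point that the range does not dominate the ring representation can be made concrete with a rough per-node entry. The field names and sizes here are illustrative assumptions, not Bailey's actual wire format:]

```java
// Rough sketch of one entry in a client's copy of the ring.
// With int positions the <start,end> range costs 8 bytes; with longs,
// 16. Either way the host string and node id dominate the entry size.
public class RingEntry {
    String host;     // e.g. "node7.example.com:8020" -- ~20+ bytes
    String nodeId;   // unique, host-independent id   -- ~16+ bytes
    int rangeStart;  // inclusive ring position, 4 bytes
    int rangeEnd;    // exclusive ring position, 4 bytes

    /** Approximate serialized size in bytes, assuming ASCII strings. */
    int approxBytes() {
        return host.length() + nodeId.length() + 2 * Integer.BYTES;
    }
}
```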
|
From: Yonik S. <yo...@ap...> - 2008-03-19 20:38:48
|
On Wed, Mar 19, 2008 at 4:03 PM, Ning Li <nin...@gm...> wrote:
> One user may have one document. Another may have a lot.
> Is 29 bits for username enough? Maybe. But is 3 bits for the
> documents of a user enough? That means a user's documents
> cannot span more than 8 nodes.

With that particular application split, a user's documents would span
1/8 of the ring (or 29 bits). In a system with 100 nodes, a user's
email would span ~13 nodes, and it can easily change by changing the
split.

-Yonik |
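[Editor's note: the ~13 estimate is simple arithmetic: a contiguous 1/8 slice of the ring, over 100 equally sized node ranges, covers about 100/8 = 12.5, rounded up to 13 nodes. A sketch, assuming evenly sized node ranges, which a real ring will not have exactly:]

```java
// Sketch: how many node ranges a contiguous 1/k slice of the ring can
// require, assuming the ring is divided into numNodes equal ranges.
public class SliceSpan {
    static int nodesSpanned(int numNodes, int sliceDenominator) {
        // ceil(numNodes / sliceDenominator): the minimum number of
        // whole node ranges needed to cover the slice.
        return (numNodes + sliceDenominator - 1) / sliceDenominator;
    }
}
```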
|
From: Yonik S. <yo...@ap...> - 2008-03-19 20:33:21
|
On Wed, Mar 19, 2008 at 4:28 PM, Doug Cutting <cu...@ap...> wrote:
> On the other hand, squeezing the most out of bits is often a premature
> optimization that's later regretted. Long might be more future-proof.

Going with a long might be premature optimism that no one will need to
create or track by hash :-) Going from 4 to 8 bytes per document could
be significant.

32 bits provides 4B slices of the ring!

-Yonik |
|
From: Doug C. <cu...@ap...> - 2008-03-19 20:28:23
|
Ning Li wrote:
> The ring distribution won't be uniform in this case. But we have
> to deal with this case anyway. So the main downside I see is
> the performance cost with strings - computation, memory...
> That's why I'm fine with a separate 'position' value.

Also, having a well-known place for the application-specified external
id is useful too. Lucene lacks this, which makes things like deletion
and duplicate detection more complicated than they ought to be. So I
think <externalId, version, position, <field>*> is better than Lucene's
minimalist <field>*.

> One user may have one document. Another may have a lot.
> Is 29 bits for username enough? Maybe. But is 3 bits for the
> documents of a user enough? That means a user's documents
> cannot span more than 8 nodes.

I only have 50k emails in my archives. Even if I had 500k, one node
would be plenty. I've heard that gmail handles all per-user requests on
a single node, and gmail allows up to around 500k messages.

On the other hand, squeezing the most out of bits is often a premature
optimization that's later regretted. Long might be more future-proof.

Doug |
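[Editor's note: the <externalId, version, position, <field>*> shape could look roughly like this. A sketch only; the class and method names are illustrative, not Bailey's actual Document API. With the version carried alongside the id, duplicate detection becomes a simple comparison: the newer version wins:]

```java
import java.util.Map;

// Sketch of the document tuple <externalId, version, position, fields*>.
public class Doc {
    final String externalId;  // application-specified unique id
    final long version;       // newer versions win on conflict
    final int position;       // ring position, e.g. a hash of externalId
    final Map<String, String> fields;

    Doc(String externalId, long version, int position,
        Map<String, String> fields) {
        this.externalId = externalId;
        this.version = version;
        this.position = position;
        this.fields = fields;
    }

    /** True if this doc should replace an existing version of itself. */
    boolean supersedes(Doc other) {
        return externalId.equals(other.externalId)
            && version > other.version;
    }
}
```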
|
From: Doug C. <cu...@ap...> - 2008-03-19 20:13:47
|
Someone at Y! last week asked why Bailey doesn't use HDFS. I gave the
following reasons:

- performance: by keeping indexes local search & indexing will be faster
- reliability: bailey replicates already, so hdfs replication is redundant
- continuous growth: consistent hashing lets us add and remove nodes
without fundamentally changing the way the index is partitioned. a
host-independent partitioning in HDFS would be too static.

He countered:

- for decent search performance, the majority of the index must be in
memory anyway. i conceded that much of the benefit of local indexes
might come from the filesystem buffer cache, which hdfs lacks.
- for decent indexing performance, we could persist only logs + index
checkpoints to HDFS (once it supports append).
- even consistent hashing will require the master to be somewhat
involved in indexing as nodes are added and removed. is that really
inherently more complicated than having the master dole out
subdirectories from a central hdfs repository, merging and splitting
them as needed?

The advantage of HDFS-based indexes is that nodes have less state. The
disadvantage is that you have to run HDFS (if you're not already), and
that performance will probably always be a bit less. I don't see a
clear advantage either way, and thus tend towards fewer dependencies and
better performance.

Other thoughts?

Doug |
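[Editor's note: the "continuous growth" point rests on the standard consistent hashing scheme: each node owns the arc of the ring up to its position, so adding or removing a node only moves the documents in the adjacent arc. A minimal sketch using a sorted map; it is illustrative only, since Bailey's ring also carries ranges, replicas, and host metadata:]

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: each node owns positions from its
// predecessor (exclusive) up to its own position (inclusive).
// Positions are compared as signed ints here for simplicity; a real
// 32-bit ring would compare them as unsigned.
public class Ring {
    private final TreeMap<Integer, String> nodes = new TreeMap<>();

    void addNode(int position, String nodeId) {
        nodes.put(position, nodeId);
    }

    void removeNode(int position) {
        nodes.remove(position);
    }

    /** Node responsible for a document position: the first node at or
     *  after it, wrapping to the lowest node past the end. */
    String nodeFor(int docPosition) {
        SortedMap<Integer, String> tail = nodes.tailMap(docPosition);
        return tail.isEmpty() ? nodes.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```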