From: Doug C. <cu...@ap...> - 2008-02-29 22:18:16

Ning Li wrote:
> I think this is how it works with replication=3:
> Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> on range D-E, and syncs with nodes D & E on range E-F.

You've switched to counter-clockwise replication, but I think that's
generally more intuitive anyway. I also think of syncing as directional
and pulled, not pushed. So I think I'd state it (equivalently) as:

C syncs C-D from A, C-E from B, D-F from D, and E-F from F.

In other words, from each host it overlaps it syncs the overlapping
range. Numbers might be simpler:

X has X to X+1 and syncs:
X to X+1 from X-2
X to X+2 from X-1
X+1 to X+3 from X+1
X+2 to X+3 from X+2

Does that sound right?

Doug

From: Ning L. <nin...@gm...> - 2008-02-29 22:38:02

On Fri, Feb 29, 2008 at 5:18 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > I think this is how it works with replication=3:
> > Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> > Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> > on range D-E, and syncs with nodes D & E on range E-F.
>
> You've switched to counter-clockwise replication, but I think that's

Oops, I always used clockwise replication. :)

> generally more intuitive anyway. I also think of syncing as directional
> and pulled, not pushed. So I think I'd state it (equivalently) as:
>
> C syncs C-D from A, C-E from B, D-F from D, and E-F from F.

It's not node F, but node E: C syncs C-D from A, C-E from B, D-F from D,
and E-F from E.

> In other words, from each host it overlaps it syncs the overlapping
> range. Numbers might be simpler:
>
> X has X to X+1 and syncs:

It should be "X has X to X+3 and syncs".

> X to X+1 from X-2
> X to X+2 from X-1
> X+1 to X+3 from X+1
> X+2 to X+3 from X+2

The rest looks correct.

Ning

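With Ning's corrections applied, the rule is mechanical: with
replication=3, node X serves the half-open range (X, X+3], and from each
of the four neighbors whose range overlaps its own it pulls exactly the
overlap. A minimal sketch of that arithmetic in Java; the integer
positions and the Range/served names are invented for illustration, not
project code:

    public class SyncRanges {

        /** A half-open range (from, to] of ring positions. */
        static final class Range {
            final int from, to;
            Range(int from, int to) { this.from = from; this.to = to; }
            /** The overlap of two ranges, or null if they don't intersect. */
            Range overlap(Range other) {
                int lo = Math.max(from, other.from), hi = Math.min(to, other.to);
                return lo < hi ? new Range(lo, hi) : null;
            }
            public String toString() { return from + " to " + to; }
        }

        /** With replication r, the node at position x serves (x, x+r]. */
        static Range served(int x, int r) { return new Range(x, x + r); }

        public static void main(String[] args) {
            int r = 3, x = 0;  // think of x as "X" in Doug's notation
            Range mine = served(x, r);
            // Only nodes within r-1 positions on either side overlap x's range.
            for (int y = x - (r - 1); y <= x + (r - 1); y++) {
                if (y == x) continue;
                Range o = mine.overlap(served(y, r));
                if (o != null)
                    System.out.println("sync " + o + " from node " + y);
            }
        }
    }

For x = 0 this prints exactly the four syncs above: 0 to 1 from node -2,
0 to 2 from node -1, 1 to 3 from node 1, and 2 to 3 from node 2.
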
From: Doug C. <cu...@ap...> - 2008-02-28 22:01:17

Yonik Seeley wrote:
> On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
>> At the same time, should we also discuss what the update
>> model should be:
>> 1 One updatable replica vs. all updatable replicas. The former
>> is simple. The latter is powerful. Is there sufficient need for
>> the latter to justify its complexity?
>
> We should always be able to update (so if the "updateable" replica is
> down, we need to be able to update another replica).

I agree. To support network partitions, all nodes must accept updates
for documents in their range.

>> 2 The atomicity of an insert/delete/update operation. When
>> an insert/delete/update operation is done, does it mean:
>> 1) the new doc is indexed in the memory of the node
>> 2) the new doc is indexed on the local disk of the node
>> 3) the new doc is logged on the local disk of the node
>> 4) the new doc is logged in some fault-tolerant shared FS
>> (e.g. HDFS)
>> 5) the new doc is indexed in the memory of at least X nodes
>> The probability of the operation getting lost is from high
>> to low: 1), then 2) and 3), then 4) and 5).
>
> [ ... ] I think the decision partially depends on how much of
> a document storage system this is, vs just an index that can be
> rebuilt.

I'd like to position this against document databases, so I'm hoping it
can be used as primary storage.

> I'm tempted to lean toward #3 since logs are needed to sync up nodes
> (back to question #1).

It would be a nice feature if we could arrange so that, in most cases,
the client that adds a document sees it in search results immediately.
We cannot guarantee that all other clients will see it. Some sort of
immediate indexing of the document is required to support this feature,
but in-memory is sufficient. We may not implement this feature right
off, but we should keep it in mind.

Logging is attractive, since it permits easy replaying of logs when
shipping updates between nodes. Perhaps we can instead use queries to
enumerate changes, but that requires more thought.

As for disk versus memory: if we only send updates to a single node in
a document's range, then we should sync them to disk. If we instead
send updates to multiple nodes in the range, then it's probably okay
not to sync, since we already assume that not all nodes in a range will
fail at once. The downside of this is that documents could be lost in
the case of a datacenter-wide power failure, but I think that's
acceptable.

Performance will suffer considerably if we have to sync on each add. So
my inclination is to attempt to add documents to several nodes in the
range and not require a sync per add, buffering things in memory as
required for good performance. Replication in memory provides
fault-tolerance.

Doug

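A rough sketch of the write path Doug leans toward: fan each add out to
all nodes covering the document's range, count it as accepted once a
quorum has buffered it in memory, and never fsync per add. Every name
below (Replica, bufferAdd, the quorum rule) is an assumption made for
illustration, not an agreed API:

    import java.io.IOException;
    import java.util.List;

    interface Replica {
        /** Index the doc into this node's RAM buffer only; false if the node is down. */
        boolean bufferAdd(String docId, String body);
    }

    class ReplicatedWriter {
        private final List<Replica> replicas;  // the N nodes covering the doc's range
        private final int quorum;              // e.g. 2 of 3

        ReplicatedWriter(List<Replica> replicas, int quorum) {
            this.replicas = replicas;
            this.quorum = quorum;
        }

        void add(String docId, String body) throws IOException {
            int acks = 0;
            for (Replica r : replicas)
                if (r.bufferAdd(docId, body))
                    acks++;  // durability comes from replication in memory, not fsync
            if (acks < quorum)
                throw new IOException("add accepted by only " + acks + " replicas");
        }
    }
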
From: Yonik S. <yo...@ap...> - 2008-02-28 22:24:46

On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as primary storage.

Are you thinking of storage outside of Lucene stored fields too then?

> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.

Unless all fields are stored, it would be a lengthy process trying to
extract a single document that had been added to an index. Using log
replay would seem to be more general purpose, as it could more easily
accommodate other side effects in a system (any changes made outside of
a Lucene index).

> As for disk versus memory: if we only send updates to a single node in
> a document's range, then we should sync them to disk. If we instead
> send updates to multiple nodes in the range, then it's probably okay
> not to sync, since we already assume that not all nodes in a range will
> fail at once. The downside of this is that documents could be lost in
> the case of a datacenter-wide power failure, but I think that's
> acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.

Sounds good. I recall from the Google filesystem paper how they sent to
multiple nodes in a chain (client->A->B->C) rather than having the
client send in parallel (client->(A,B,C)), which made a lot of sense at
the time (maximizing single NIC bandwidth, etc). Perhaps too much
detail right now, but it's worth keeping in mind.

-Yonik

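The chained scheme from the GFS paper, sketched with invented names: the
client hands an add to the first replica and each replica forwards it to
the next, so no single NIC uploads the document more than once:

    class ChainNode {
        private final ChainNode next;  // null at the tail of the chain

        ChainNode(ChainNode next) { this.next = next; }

        void add(String docId, String body) {
            indexLocally(docId, body);  // buffer in this node's in-memory index
            if (next != null)
                next.add(docId, body);  // forward down the chain: A -> B -> C
        }

        private void indexLocally(String docId, String body) {
            // invert and buffer the document; omitted in this sketch
        }
    }
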
From: Ning L. <nin...@gm...> - 2008-02-29 00:28:49

On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as primary storage.

A copy of a document will be stored in a stored field, right? I think
positioning this against document databases is nice, but here are a
couple of things worth noting:

First, keeping both a doc and its inverted form in an index means
storing the doc and indexing the doc are done in the same
"transaction". A traditional document database often stores a doc first
and then indexes it later (hopefully soon).

Second, a traditional document database often supports updating a doc's
"metadata" such as author or date. Do we not support this, or do we say
a document is name-value pairs, reconstruct it from stored fields, and
support such updates?

> > I'm tempted to lean toward #3 since logs are needed to sync up nodes
> > (back to question #1).
>
> It would be a nice feature if we could arrange so that, in most cases,
> the client that adds a document sees it in search results immediately.
> We cannot guarantee that all other clients will see it. Some sort of
> immediate indexing of the document is required to support this feature,
> but in-memory is sufficient. We may not implement this feature right
> off, but we should keep it in mind.
>
> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.

If we have a copy of each doc in a stored field, then as you said
later, we can just log the operation id and revision, then retrieve the
doc as necessary from the index.

> As for disk versus memory: if we only send updates to a single node in
> a document's range, then we should sync them to disk. If we instead
> send updates to multiple nodes in the range, then it's probably okay
> not to sync, since we already assume that not all nodes in a range will
> fail at once. The downside of this is that documents could be lost in
> the case of a datacenter-wide power failure, but I think that's
> acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.

#5 with non-sync light-weight logging? Should work.

Ning

From: Doug C. <cu...@ap...> - 2008-02-29 21:10:21

Ning Li wrote:
> A copy of a document will be stored in a stored field, right?

If we don't keep complete documents in the log, then I figured that all
fields would be stored, so that a document retrieved from the index
could be directly added to another index.

> First, keeping both a doc and its inverted form in an index means
> storing the doc and indexing the doc are done in the same
> "transaction". A traditional document database often stores a doc first
> and then indexes it later (hopefully soon).

It should be possible to configure Lucene so that adding a document
never triggers any blocking merges. So folks would have to wait for the
single-document inversion, but no more. Today, it won't yet be visible
to search for a bit, until a commit, but perhaps in a future version of
Lucene it might be possible to search unflushed segments.

> Second, a traditional document database often supports updating a
> doc's "metadata" such as author or date. Do we not support this, or do
> we say a document is name-value pairs, reconstruct it from stored
> fields, and support such updates?

I assume that incrementally updatable fields must not be indexed? In
any case, I hope that indexing is fast enough that updating by
add+delete is okay. CouchDB seems to get away with it...

> #5 with non-sync light-weight logging? Should work.

Yes.

Doug

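Update-by-add+delete is already a single call in Lucene:
IndexWriter.updateDocument deletes whatever matches the given term and
adds the replacement. A minimal sketch using the Lucene 2.x-era field
API, assuming every document carries a unique "id" field:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class Updater {
        /** Replace the document whose "id" field equals id with a new version. */
        static void update(IndexWriter writer, String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("text", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.updateDocument(new Term("id", id), doc);  // delete old, add new
        }
    }
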
From: Doug C. <cu...@ap...> - 2008-02-28 21:39:41

Yonik Seeley wrote:
> I think one problem we are running into is due to using the same hash
> ring for both partitioning the key space and selecting placement of
> those keys. Putting many virtual nodes on the hash ring to make the
> hash function fair, in conjunction with using that hash ring to do
> replication, causes all the indices to be mixed together.

That's a feature, not a bug! Seriously, I'm not sure what your concern
here is. If documents are indexed in the N nodes clockwise from their
point, then we do get documents indexed together, but we need only
search every Nth index (in most cases). So I don't see the problem.
What am I missing?

When a new node is introduced to the ring, it will need to ask its
neighbors for the documents assigned to it, and ranges for search
should not be re-assigned until it has completed retrieving and
indexing these. Similarly, when a node is removed or declared dead,
others will be assigned new ranges to construct from neighbors.

Doug

From: Yonik S. <yo...@ap...> - 2008-02-28 22:06:56

On Thu, Feb 28, 2008 at 4:39 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > I think one problem we are running into is due to using the same hash
> > ring for both partitioning the key space and selecting placement of
> > those keys. Putting many virtual nodes on the hash ring to make the
> > hash function fair, in conjunction with using that hash ring to do
> > replication, causes all the indices to be mixed together.
>
> That's a feature, not a bug! Seriously, I'm not sure what your concern
> here is. If documents are indexed in the N nodes clockwise from their
> point, then we do get documents indexed together, but we need only
> search every Nth index (in most cases). So I don't see the problem.
> What am I missing?

Given that you and Ning don't see it, it's highly likely I'm the one
with the disconnect. Without virtual nodes, everything makes sense to
me... I can see how we can query every 3rd host to cover the entire
index once.

I'm using "virtual nodes" in the sense used here:
http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html

To be explicit, with no virtual nodes, and N=3, I think this is what we
would end up with if we added a single node per host:

HostA: indexA, indexC, indexD
HostB: indexB, indexD, indexA
HostC: indexC, indexA, indexB
HostD: indexD, indexB, indexC

So we can query two nodes and get complete coverage (with appropriate
filtering). Now if we added 100 nodes (virtual nodes) per host in order
to even out the key distribution, only a single host will have all of
the keys for a node (defined as all of its virtual nodes):

HostA: indexA, big mix, big mix
HostB: indexB, big mix, big mix
HostC: indexC, big mix, big mix
HostD: indexD, big mix, big mix

We could no longer say that nodeC is available from both hostA and
hostB, because there are 100 nodeCs. So now it seems like we must query
all hosts to get complete coverage, and if HostA goes off-line, we must
query all other hosts to get coverage of indexA.

-Yonik

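For reference, the virtual-node scheme from that post boils down to a
sorted map of hash points. A minimal sketch; the hash function and
point counts are placeholders (a real ring would use MD5 or similar
rather than String.hashCode):

    import java.util.SortedMap;
    import java.util.TreeMap;

    class ConsistentHashRing {
        private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();
        private final int virtualNodes;  // e.g. 100 points per host

        ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

        void addHost(String host) {
            for (int i = 0; i < virtualNodes; i++)
                ring.put(hash(host + "#" + i), host);  // scatter the host around the ring
        }

        /** The host owning the first virtual node clockwise from the key. */
        String hostFor(String key) {
            if (ring.isEmpty()) throw new IllegalStateException("empty ring");
            SortedMap<Integer, String> tail = ring.tailMap(hash(key));
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static int hash(String s) {
            return s.hashCode();  // placeholder; use MD5/SHA-1 in practice
        }
    }

With many points per host, the points of different hosts interleave
around the ring, which is the "big mix" Yonik describes: the N
successors of any given point land on an arbitrary set of hosts.
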
From: Doug C. <cu...@ap...> - 2008-02-28 23:00:32

Yonik Seeley wrote:
> Now if we added 100 nodes (virtual nodes) per host in order to even out
> the key distribution, only a single host will have all of the keys for
> a node (defined as all of its virtual nodes):
>
> HostA: indexA, big mix, big mix
> HostB: indexB, big mix, big mix
> HostC: indexC, big mix, big mix
> HostD: indexD, big mix, big mix

I'm assuming a 1:1 mapping between index and node. So if a host has 100
nodes on it, then it would have 100 indexes. I'm also imagining more
like 4 nodes/indexes per host, not 100. Does that address your concern?

A related issue is that we don't want to place nodes with overlapping
ranges on the same host, since that would defeat the purpose of
replication. The master must assign new nodes to the ring with this
restriction. It should generally split the largest interval that's also
distant from other intervals assigned to that host. On a tiny cluster
it might refuse to add ranges if that would result in overlapping
ranges on the same host; however, this still might occur if hosts fail.

Doug

From: Yonik S. <yo...@ap...> - 2008-02-28 23:34:16

On Thu, Feb 28, 2008 at 6:00 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > Now if we added 100 nodes (virtual nodes) per host in order to even out
> > the key distribution, only a single host will have all of the keys for
> > a node (defined as all of its virtual nodes):
> >
> > HostA: indexA, big mix, big mix
> > HostB: indexB, big mix, big mix
> > HostC: indexC, big mix, big mix
> > HostD: indexD, big mix, big mix
>
> I'm assuming a 1:1 mapping between index and node. So if a host has 100
> nodes on it, then it would have 100 indexes. I'm also imagining more
> like 4 nodes/indexes per host, not 100. Does that address your concern?

The issue then would be that there are not enough nodes per host to
evenly partition the keys. That's why I suggested separate key
partitioning. AFAIK, they ended up separating out key partitioning in
Dynamo too (but for different reasons). I'm still missing what we gain
by having them coupled.

> A related issue is that we don't want to place nodes with overlapping
> ranges on the same host, since that would defeat the purpose of
> replication. The master must assign new nodes to the ring with this
> restriction. It should generally split the largest interval that's also
> distant from other intervals assigned to that host. On a tiny cluster
> it might refuse to add ranges if that would result in overlapping
> ranges on the same host; however, this still might occur if hosts fail.

This isn't consistent hashing in my understanding of it, though, since
placement of some nodes depends on others. With just a list of nodes,
it doesn't seem like a different host would be able to reconstruct the
ring.

-Yonik

From: Doug C. <cu...@ap...> - 2008-02-28 21:31:18

Getting back to this at long last! Sorry for the delays...

Ning Li wrote:
> First, let's sync a couple of index-related terms:

Good idea. I think "node" and "shard" are confusing, and perhaps can be
avoided in this context.

> 1) Shard: The distinct starting positions of all the virtual nodes
> divide the ring into shards. For example, starting positions
> (A B C D E) divide the ring into 5 shards: AB, BC, CD, DE, EA.

I prefer to call these "ranges". "Shard" to me sounds like something
physical, and ranges are not physical.

> 2) Index on a virtual node (suggest a name?): A virtual node
> serves a number of contiguous shards. For example, with
> 3-way replication, the indexes on the virtual nodes are:
> AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.

I was just using "index" for this. An index is a set of files on a host
that corresponds to a range of ids.

Let's also use "node" for "virtual node" and "host" for a set of
virtual nodes running on the same host. A node corresponds to a point
on the ring and may be assigned a range and maintain an index for that
range. The range assigned to a node may change over time, and it will
have to adjust its index accordingly.

> Now, should an index on a virtual node be implemented as
> one Lucene index or N Lucene indexes (one per shard)?

My hunch is one index per range. That way we can search a set of
indexes that completes the ring, and search maximally large segments.

Doug

From: Ning L. <nin...@gm...> - 2008-02-28 23:19:58

Subject: Terms

"Node" and "host" are clear now. "Range" and "index" are still a bit
confusing.

On Thu, Feb 28, 2008 at 4:31 PM, Doug Cutting <cu...@ap...> wrote:
> > 1) Shard: The distinct starting positions of all the virtual nodes
> > divide the ring into shards. For example, starting positions
> > (A B C D E) divide the ring into 5 shards: AB, BC, CD, DE, EA.
>
> I prefer to call these "ranges". "Shard" to me sounds like something
> physical, and ranges are not physical.
>
> > 2) Index on a virtual node (suggest a name?): A virtual node
> > serves a number of contiguous shards. For example, with
> > 3-way replication, the indexes on the virtual nodes are:
> > AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.

If AB is called a "range", then "AB-BC-CD" is a "node-range"? Feel free
to give better names. :)

> > Now, should an index on a virtual node be implemented as
> > one Lucene index or N Lucene indexes (one per shard)?
>
> My hunch is one index per range. That way we can search a set of
> indexes that completes the ring, and search maximally large segments.

You mean one "index" per "node-range" and we need filtering, right?
Since later you said "I'm assuming a 1:1 mapping between index and
node."

Ning

From: Doug C. <cu...@ap...> - 2008-02-28 23:42:48

Ning Li wrote:
>> > 2) Index on a virtual node (suggest a name?): A virtual node
>> > serves a number of contiguous shards. For example, with
>> > 3-way replication, the indexes on the virtual nodes are:
>> > AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.
>
> If AB is called a "range", then "AB-BC-CD" is a "node-range"?
> Feel free to give better names. :)

If the rule is to index things in the three clockwise nodes, then the
range of node D is (B-D], since it includes all documents after B and
through D in its index. If node B crashes, D's range would become
(A-D]. The master would tell it to update its index to reflect that new
range, and it would notify the master when that update is complete. It
would probably keep an IndexReader open on the (B-D] version of its
index so that it could keep servicing queries from clients while it
retrieves and indexes the documents in (A-B], and for a time after
that, until all clients have retrieved the new ring map from the
master.

So a node may be serving multiple ranges at a time for search, and its
index may contain multiple ranges. If no documents are added for a
time, then things should stabilize and search ranges should map 1:1 to
nodes and indexes. The node is the process; the index is its data
structure.

Does that hold together?

Doug

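A sketch of the reader swap Doug describes, using the Lucene 2.x-era
IndexReader/IndexWriter API; the surrounding structure (RangeServer,
how the (A-B] documents arrive from a neighbor) is invented for
illustration:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    class RangeServer {
        private final Directory dir;
        private volatile IndexReader reader;  // serves searches, e.g. over (B-D]

        RangeServer(Directory dir) throws IOException {
            this.dir = dir;
            this.reader = IndexReader.open(dir);
        }

        IndexReader searcherReader() { return reader; }  // queries keep the old view

        /** Grow the index to a new range, then swap in a fresh reader. */
        void expandRange(IndexWriter writer, Iterable<Document> docsFromNeighbor)
                throws IOException {
            for (Document d : docsFromNeighbor)
                writer.addDocument(d);      // index the newly assigned (A-B] docs
            writer.close();                 // commit the expanded (A-D] index
            IndexReader old = reader;
            reader = IndexReader.open(dir); // new searches see the new range
            old.close();  // in reality, only after in-flight queries have drained
        }
    }
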
From: Ning L. <nin...@gm...> - 2008-02-28 23:52:54

On Thu, Feb 28, 2008 at 6:42 PM, Doug Cutting <cu...@ap...> wrote:
> Does that hold together?

Yes. I thought you meant that. I'm not good at naming things. :)

Ning

From: Yonik S. <yo...@ap...> - 2008-02-29 13:51:09

On Thu, Feb 28, 2008 at 5:45 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > Are you thinking of storage outside of Lucene stored fields too then?
>
> I think we'll have to maintain logs outside of Lucene. Beyond that, I
> don't see that we'd need to use more than stored fields. Do you?

It depends on the documents, I guess... if they are big, putting them
in the index can be a burden because they get copied on every segment
merge, and loading the other stored fields takes longer.

There are also two levels of "Document"... things like PDF, Word, etc.,
also need to be parsed and have fields extracted to make a Lucene-style
Document. I assume that's out of scope for this project though.

-Yonik

From: Yonik S. <yo...@ap...> - 2008-02-29 16:46:29

On Fri, Feb 29, 2008 at 11:23 AM, Ning Li <nin...@gm...> wrote:
> On Fri, Feb 29, 2008 at 9:08 AM, Yonik Seeley <yo...@ap...> wrote:
> > We'll need to make sure that it handles this case:
> > Say C syncs from B, then B syncs from A and gets some new docs. Now A
> > goes away for a while... it seems like C will need to get the new docs
> > from B next time it syncs. If the docs that B got from A are treated
> > like any other new additions, this should work right?
>
> I think this is how it works with replication=3:
> Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> on range D-E, and syncs with nodes D & E on range E-F.
>
> Now node D goes away. Node C continues to serve range C-F until it gets
> range F-G from node E or node F and starts to serve range C-G. Then,
> node C syncs with nodes A & B on range C-D (same as before), syncs with
> nodes B and E on range D-F, and syncs with nodes E & F on range F-G.
>
> Right?

The issue I'm concerned about is losing documents that are only
partially replicated when there are changes to the configuration. It's
not clear to me if you were answering that or not. I guess that's why
we need a good simulator!

-Yonik

From: Ning L. <nin...@gm...> - 2008-02-29 17:03:58

On Fri, Feb 29, 2008 at 11:46 AM, Yonik Seeley <yo...@ap...> wrote:
> The issue I'm concerned about is losing documents that are only
> partially replicated when there are changes to the configuration. It's
> not clear to me if you were answering that or not. I guess that's why
> we need a good simulator!

The changes on node D that have been propagated to any other node are
not lost. The changes on node D that haven't been propagated to any
other node are lost. Given that we will probably choose #5 (which sends
a change to multiple nodes), the chance of a change getting lost is
low.

Ning

From: Yonik S. <yo...@ap...> - 2008-02-29 17:16:50

On Fri, Feb 29, 2008 at 12:04 PM, Ning Li <nin...@gm...> wrote:
> On Fri, Feb 29, 2008 at 11:46 AM, Yonik Seeley <yo...@ap...> wrote:
> > The issue I'm concerned about is losing documents that are only
> > partially replicated when there are changes to the configuration. It's
> > not clear to me if you were answering that or not. I guess that's why
> > we need a good simulator!
>
> The changes on node D that have been propagated to any other node
> are not lost.

Well, that's the goal; I'm just not clear that it's all figured out yet
so that doesn't happen. Changes to node configuration open up holes
where docs could slip through if we're not careful. But I don't think
there is any inherent flaw in what we've all discussed so far, so we
don't have to discuss this particular point further.

-Yonik

From: Doug C. <cu...@ap...> - 2008-02-29 21:24:43

Ning Li wrote:
> On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
>> A unique updatable index per document would be nice, but I'm not yet
>> entirely convinced it is practical.
>
> Not if short glitches are not acceptable. In BigTable, a tablet is served
> by a single tablet server. I wonder if they find it to be a problem.

BigTable points towards a different architecture, where all
modifications are logged to a shared filesystem, and a single node
handles both updates and searches for that range of ids. Perhaps we
should consider this more seriously.

We want to scale flexibly both in collection size and in search
traffic. If search traffic is low, then indexes might be large, and if
search traffic is high, indexes might be smaller and replication might
be higher. But, with no search node replication, system performance
tops out at the rate that a node can process queries on a tiny index,
which is not infinite. So you'd probably want to add read-only replicas
onto the BigTable model. But then, when you have lots of writes, you
don't fully utilize your cluster, and our writes are much more compute
intensive than BigTable writes. I think configuring a cluster in this
model would be more complicated and less fluid.

Finally, as you observed, there would be hiccups whenever a node fails.
Hiccups affect a small percentage of BigTable clients, only those
touching the tablet on the failed node. But, in distributed search,
every query touches a large portion of the nodes. So, in a 1000-node
cluster, a failure might delay 0.1% of BigTable users, but might delay
33% of distributed search users (assuming 3-way replication). So search
can be much more sensitive to this.

So I'm not convinced that the BigTable model is as appropriate for
distributed full-text search as consistent hashing. Thoughts?

Doug

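For what it's worth, those percentages are just the fraction of nodes a
single request touches, assuming a query must hit every third node to
cover the ring once:

    P(\text{delayed by one failure}) \approx
        \frac{\text{nodes touched per request}}{\text{total nodes}}

    \text{BigTable: } \frac{1}{1000} = 0.1\%, \qquad
    \text{search: } \frac{1000/3}{1000} = \frac{1}{3} \approx 33\%
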
From: Ning L. <nin...@gm...> - 2008-02-29 22:19:36

On Fri, Feb 29, 2008 at 4:24 PM, Doug Cutting <cu...@ap...> wrote:
> Finally, as you observed, there would be hiccups whenever a node fails.
> Hiccups affect a small percentage of BigTable clients, only those
> touching the tablet on the failed node. But, in distributed search,
> every query touches a large portion of the nodes. So, in a 1000-node
> cluster, a failure might delay 0.1% of BigTable users, but might delay
> 33% of distributed search users (assuming 3-way replication). So search
> can be much more sensitive to this.
>
> So I'm not convinced that the BigTable model is as appropriate for
> distributed full-text search as consistent hashing. Thoughts?

I used BigTable as an example of single-write-replica. I didn't intend
to use BigTable's single-read/write-replica model for distributed
full-text search. I was thinking more like a
single-write-multi-read-replica model (using consistent hashing). I
considered your proposal a multi-write/read-replica model.

As I said before, the multi-write/read-replica model is more powerful
than the single-write-multi-read-replica model. I was just worried
about its complexity. But now I think it can be achieved without being
too complicated. :)

Ning

From: Doug C. <cu...@ap...> - 2008-02-29 21:57:30

Yonik Seeley wrote:
> It depends on the documents, I guess... if they are big, putting them
> in the index can be a burden because they get copied on every segment
> merge, and loading the other stored fields takes longer.

Didn't Mike change that? Segments can now point to fields in a separate
file, according to:

http://lucene.apache.org/java/docs/fileformats.html#Segments%20File

I think that's so that they don't have to be copied with every merge.

> There are also two levels of "Document"... things like PDF, Word, etc.,
> also need to be parsed and have fields extracted to make a Lucene-style
> Document. I assume that's out of scope for this project though.

Yes, I think an application could implement that with, e.g., a binary
field for the raw data, another field for the mime type, and a third
for the extracted text to index. The raw data and text might be
compressed.

Doug

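That layout is straightforward with stock Lucene field types; a sketch
using the 2.x-era API (the field names are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class RawDocuments {
        static Document wrap(byte[] raw, String mimeType, String extractedText) {
            Document doc = new Document();
            // Binary stored-only field; Field.Store.COMPRESS was also
            // available in 2.x if compression is wanted.
            doc.add(new Field("raw", raw, Field.Store.YES));
            doc.add(new Field("mime", mimeType, Field.Store.YES,
                              Field.Index.UN_TOKENIZED));
            doc.add(new Field("text", extractedText, Field.Store.YES,
                              Field.Index.TOKENIZED));
            return doc;
        }
    }
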
From: Yonik S. <yo...@ap...> - 2008-02-29 22:46:26

On Fri, Feb 29, 2008 at 4:57 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > It depends on the documents, I guess... if they are big, putting them
> > in the index can be a burden because they get copied on every segment
> > merge, and loading the other stored fields takes longer.
>
> Didn't Mike change that? Segments can now point to fields in a separate
> file, according to:
>
> http://lucene.apache.org/java/docs/fileformats.html#Segments%20File
>
> I think that's so that they don't have to be copied with every merge.

Ah right... they won't be copied at all for a single indexing session.
So full index builds won't be impacted much, just merges due to
incremental adds/changes.

-Yonik

From: Doug C. <cu...@ap...> - 2008-02-29 23:54:43

Ning Li wrote:
>> You've switched to counter-clockwise replication, but I think that's
>
> Oops, I always used clockwise replication. :)

Just to be clear: when a new document D arrives, the question is what
nodes on the ring we send it to: the N clockwise from (greater than) or
counter-clockwise from (less than) D? The most recent messages assume
counter-clockwise (less than). A document between C and D would be in
the indexes on nodes A, B and C. The Dynamo paper says "clockwise", but
I think that direction makes things more confusing and that we should
instead use counter-clockwise. Note that the index at node C contains
documents in range C-F, three nodes clockwise, but the Dynamo paper
uses the term clockwise when talking about assigning instances to
nodes, not the range that each node contains. Are we in agreement about
this?

Doug

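The counter-clockwise rule, sketched over a sorted map of node
positions: a document's replicas are the N nodes at or counter-clockwise
from its point, so each node's index ends up covering the range from
itself to N nodes clockwise. The class and names below are invented for
illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    class ReplicaChooser {
        // node position on the ring -> node name; assumes at least n nodes
        private final NavigableMap<Integer, String> nodes =
            new TreeMap<Integer, String>();

        void addNode(int position, String name) { nodes.put(position, name); }

        /** The n nodes counter-clockwise from (at or below) the doc's point. */
        List<String> replicasFor(int docPoint, int n) {
            List<String> out = new ArrayList<String>();
            Integer p = nodes.floorKey(docPoint);
            while (out.size() < n) {
                if (p == null) p = nodes.lastKey();  // wrap around the ring
                out.add(nodes.get(p));
                p = nodes.lowerKey(p);
            }
            return out;
        }
    }

For nodes A, B, C, D at positions 10, 20, 30, 40, replicasFor(35, 3)
returns C, B, A: a document between C and D lands on nodes A, B and C,
matching Doug's example above.
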
From: Ning L. <nin...@gm...> - 2008-03-01 00:21:05

On Fri, Feb 29, 2008 at 6:54 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> >> You've switched to counter-clockwise replication, but I think that's
> >
> > Oops, I always used clockwise replication. :)
>
> Just to be clear: when a new document D arrives, the question is what
> nodes on the ring we send it to: the N clockwise from (greater than) or
> counter-clockwise from (less than) D? The most recent messages assume
> counter-clockwise (less than). A document between C and D would be in
> the indexes on nodes A, B and C. The Dynamo paper says "clockwise", but
> I think that direction makes things more confusing and that we should
> instead use counter-clockwise. Note that the index at node C contains
> documents in range C-F, three nodes clockwise, but the Dynamo paper
> uses the term clockwise when talking about assigning instances to
> nodes, not the range that each node contains. Are we in agreement about
> this?

I'm confused about which is clockwise and which is counter-clockwise
now. :( I thought node C serves D-E and E-F in addition to C-D, so it's
clockwise. You are saying that because range C-D is on nodes A, B and
C, it should be called counter-clockwise. :) Sounds reasonable.

In any case, I always thought, and I think we agree, that node B serves
range B-E, node C serves range C-F, etc. So we had the same assumption
in the most recent discussions.

Ning

From: Yonik S. <yo...@ap...> - 2008-03-01 01:50:00

On Fri, Feb 29, 2008 at 7:21 PM, Ning Li <nin...@gm...> wrote:
> I'm confused about which is clockwise and which is counter-clockwise now. :(

I noticed the switch, too. It just depends on whether your reference is
the node or the key. A doc that falls on node A will be replicated to B
and C (that's clockwise, like Dynamo). If you are looking for a replica
of C, it will be at A and B (counter-clockwise).

I don't think it really matters what the direction ends up being
though. A node will be replicated to N unique hosts in a consistent
(but arbitrary) direction. Personally, it makes more sense to think of
it as HostA containing NodeA, NodeB, NodeC (or IndexA, IndexB, IndexC).
It's replicas of nodes rather than overlapping ranges.

-Yonik