From: Yonik S. <yo...@ap...> - 2008-02-29 17:16:50
|
On Fri, Feb 29, 2008 at 12:04 PM, Ning Li <nin...@gm...> wrote:
> On Fri, Feb 29, 2008 at 11:46 AM, Yonik Seeley <yo...@ap...> wrote:
> > The issue I'm concerned about is losing documents that are only
> > partially replicated when there are changes to the configuration. It's
> > not clear to me if you were answering that or not. I guess that's why
> > we need a good simulator!
>
> The changes on node D that have been propagated to any other node
> are not lost.
Well, that's the goal; I'm just not yet clear that it's all figured out
so that doc loss can't happen. Changes to node configuration open up
holes where docs could slip through if we're not careful. But I don't
think there is any inherent flaw in what we've all discussed so far, so
we don't have to discuss this particular point further.
-Yonik
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 17:03:58
|
On Fri, Feb 29, 2008 at 11:46 AM, Yonik Seeley <yo...@ap...> wrote:
> The issue I'm concerned about is losing documents that are only
> partially replicated when there are changes to the configuration. It's
> not clear to me if you were answering that or not. I guess that's why
> we need a good simulator!
The changes on node D that have been propagated to any other node
are not lost. The changes on node D that haven't been propagated to
any other node are lost. Given that we'll probably choose #5 (which
sends a change to multiple nodes), the chance of a change getting
lost is low.
Ning
|
|
From: Yonik S. <yo...@ap...> - 2008-02-29 16:46:29
|
On Fri, Feb 29, 2008 at 11:23 AM, Ning Li <nin...@gm...> wrote:
> On Fri, Feb 29, 2008 at 9:08 AM, Yonik Seeley <yo...@ap...> wrote:
> > We'll need to make sure that it handles this case:
> > Say C syncs from B, then B syncs from A and gets some new docs. Now A
> > goes away for a while... it seems like C will need to get the new docs
> > from B next time it syncs. If the docs that B got from A are treated
> > like any other new additions, this should work, right?
>
> I think this is how it works with replication=3:
> Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> on range D-E, and syncs with nodes D & E on range E-F.
>
> Now node D goes away. Node C continues to serve range C-F until it gets
> range F-G from node E or node F and starts to serve range C-G. Then,
> node C syncs with nodes A & B on range C-D (same as before), syncs with
> nodes B and E on range D-F, and syncs with nodes E & F on range F-G.
>
> Right?
The issue I'm concerned about is losing documents that are only
partially replicated when there are changes to the configuration. It's
not clear to me if you were answering that or not. I guess that's why
we need a good simulator!
-Yonik
|
|
From: Yonik S. <yo...@ap...> - 2008-02-29 16:27:51
|
On Thu, Feb 14, 2008 at 7:00 PM, Doug Cutting <cu...@ap...> wrote:
> I've attached what I hope is a minimally sufficient user-level API to
> start implementing. The idea is to include only what's required, like
> facet counts, but otherwise keep it simple.
For this phase it doesn't even seem like facets are required. Wouldn't
simple queries exercise the read side of things enough for testing and
developing the distributed strategy?
-Yonik
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 16:23:42
|
On Fri, Feb 29, 2008 at 9:08 AM, Yonik Seeley <yo...@ap...> wrote:
> On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> > I'd imagined each node periodically querying its neighbors for changes
> > in the range they share. We shouldn't rely on clock synchronization, so
> > each node would keep the last revision of each neighbor that it has
> > sync'd with. So, the first time they connect, they pass revision zero
> > and receive all updates for their overlap. The next time they only need
> > to retrieve updates since the last.
>
> Sounds good!
>
> We'll need to make sure that it handles this case:
> Say C syncs from B, then B syncs from A and gets some new docs. Now A
> goes away for a while... it seems like C will need to get the new docs
> from B next time it syncs. If the docs that B got from A are treated
> like any other new additions, this should work, right?
I think this is how it works with replication=3:
Node A serves range A-D, node B serves range B-E, node C serves range C-F...
Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
on range D-E, and syncs with nodes D & E on range E-F.
Now node D goes away. Node C continues to serve range C-F until it gets
range F-G from node E or node F and starts to serve range C-G. Then,
node C syncs with nodes A & B on range C-D (same as before), syncs with
nodes B and E on range D-F, and syncs with nodes E & F on range F-G.
Right?
Ning
|
|
From: Yonik S. <yo...@ap...> - 2008-02-29 14:08:38
|
On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> I'd imagined each node periodically querying its neighbors for changes
> in the range they share. We shouldn't rely on clock synchronization, so
> each node would keep the last revision of each neighbor that it has
> sync'd with. So, the first time they connect, they pass revision zero
> and receive all updates for their overlap. The next time they only need
> to retrieve updates since the last.
Sounds good!
We'll need to make sure that it handles this case:
Say C syncs from B, then B syncs from A and gets some new docs. Now A
goes away for a while... it seems like C will need to get the new docs
from B next time it syncs. If the docs that B got from A are treated
like any other new additions, this should work, right?
-Yonik
|
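A minimal sketch of the pull-based sync loop described above, with hypothetical names (Neighbor, Update, LocalIndex are illustrations, not APIs from this project): each node keeps the last revision pulled from each neighbor, so no clock synchronization is needed, and updates B pulled from A flow on to C like any other additions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the pull-based sync loop described above. Each
// node remembers the last revision it has pulled from each neighbor, so
// no clock synchronization is required.
class NeighborSync {

    record Update(long revision, String docId, byte[] doc) {}

    interface Neighbor {
        String id();
        /** Updates in the shared range with revision > sinceRevision. */
        List<Update> updatesSince(long sinceRevision);
    }

    interface LocalIndex {
        void apply(Update u);
    }

    /** Last revision successfully pulled from each neighbor. */
    private final Map<String, Long> lastRevision = new HashMap<>();

    void syncWith(Neighbor neighbor, LocalIndex index) {
        // First contact: revision zero means "send all updates for our overlap".
        long since = lastRevision.getOrDefault(neighbor.id(), 0L);
        long max = since;
        for (Update u : neighbor.updatesSince(since)) {
            // Docs that B pulled from A are applied like any other addition,
            // so C will pick them up from B on C's next sync even if A is gone.
            index.apply(u);
            max = Math.max(max, u.revision());
        }
        lastRevision.put(neighbor.id(), max);
    }
}
```
|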
|
From: Yonik S. <yo...@ap...> - 2008-02-29 13:51:09
|
On Thu, Feb 28, 2008 at 5:45 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > Are you thinking of storage outside of Lucene stored fields too then?
>
> I think we'll have to maintain logs outside of Lucene. Beyond that, I
> don't see that we'd need to use more than stored fields. Do you?
It depends on the documents, I guess... if they are big, putting them in
the index can be a burden because they get copied on every segment merge,
and loading the other stored fields takes longer.
There are also two levels of "Document"... things like PDF, Word, etc.,
also need to be parsed and have fields extracted to make a Lucene-style
Document. I assume that's out of scope for this project though.
-Yonik
|
|
From: Yonik S. <yo...@ap...> - 2008-02-29 13:27:43
|
On Thu, Feb 28, 2008 at 7:08 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > I'm still missing what we gain by having them coupled.
>
> I don't understand the alternative. We want to divide things evenly,
> but also be able to gracefully incorporate new hosts and dead hosts
> without re-allocating everything. We must arrange things so that the
> ranges served by a node remain disjoint for this reason too, so that a
> host failure doesn't increase more than around 25% the load of any other
> single host.
Here is an example of how to separate key partitioning from replication
(single node per host to make things easier for now):
Key partitioning: when hostX is added to the cluster, add 100 nodes
(vnodeX1, vnodeX2, vnodeX3...) to the hash ring for partitioning keys.
This steals a little bit from every other node. There is no replication
concept in this ring.
Replication: add a single nodeX to the replication hash ring. The two
clockwise nodes replicate this node, and the two nodes in the other
direction should be replicated by this node.
This still works if each node serves M indexes (4 is what you suggested,
I think). Add 4 nodes to the replication ring, and for each of those
nodes, determine its key space with the key ring.
With these concerns separated, either could be changed or adjusted
independently. I'm not advocating that specific method of key
partitioning, though... while it is more fair (a new node steals
documents from all other nodes), that could also be a drawback depending
on exactly how the system (rebalancing) works.
> > This isn't consistent hashing, in my understanding of it though, since
> > placement of some nodes depends on others.
>
> Okay, it's not consistent hashing, whatever.
>
> I don't see how to do this without a master coordinating things (e.g.,
> telling clients what to search). The master shouldn't be involved in
> individual searches or document additions, nor should it ideally have
> much persistent state, but beyond that, I don't see a reason not to let
> it arrange things optimally.
OK, that changes things... I wasn't clear on exactly why consistent
hashing was used. In this case it sounds like more of an implementation
detail than part of the interface (which it is in some other systems).
> > With just a list of nodes, it doesn't seem like a different host would
> > be able to reconstruct the ring.
>
> Right, the master would need to give clients the list of nodes, each
> with the range of documents that it's currently able to search, and
> (perhaps separately) the range of documents it's currently able to index.
Would this info be obtainable from any node? It seems like this should
be possible.
> I guess, if we wanted to diverge more from consistent hashing, we
> could have each node serve a set of ranges, to give the master more
> freedom to shuffle things when a node fails. That'd be okay with me.
When a new master comes up, it seems like nodes would need to report
their actual ranges anyway.
> If we don't use consistent hashing then we have to develop some other
> scheme for allocation and deallocation of ids to nodes that has all of
> the good properties we need from consistent hashing and additional
> improvements. So it seems reasonable to start with consistent hashing
> and diverge only as needed.
That's fine... it sounds from everything above like this won't be
exposed to anyone but the master anyway.
-Yonik
|
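A hedged sketch of the two-ring separation described above, with hypothetical names throughout: one ring gets many virtual nodes per host for fair key partitioning, while the replication ring holds a single point per host so replica relationships stay simple, and either policy can be changed independently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: key partitioning and replication kept on separate
// hash rings, so either policy can be adjusted independently.
class TwoRings {
    // Many virtual nodes per host -> fair key distribution; no replication here.
    private final TreeMap<Long, String> keyRing = new TreeMap<>();
    // One point per host -> simple clockwise replica relationships.
    private final TreeMap<Long, String> replicaRing = new TreeMap<>();

    void addHost(String host, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            // Steals a little key space from every other host.
            keyRing.put(hash(host + "#" + i), host);
        }
        replicaRing.put(hash(host), host);
    }

    /** Host owning a document id: first key-ring entry clockwise of its hash. */
    String ownerOf(String docId) {
        return clockwise(keyRing, hash(docId));
    }

    /** First replica of a host: the next host clockwise on the replication ring. */
    String replicaOf(String host) {
        return clockwise(replicaRing, hash(host) + 1);
    }

    private static String clockwise(TreeMap<Long, String> ring, long point) {
        Map.Entry<Long, String> e = ring.ceilingEntry(point);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e);  // MD5 is always present
        }
    }
}
```
|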
|
From: Yonik S. <yo...@ap...> - 2008-02-29 11:24:22
|
On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> Perhaps we'd want two formats for updates sent between nodes: outline
> and full, where outline just contains a sequence of <{ADD|DEL}, id,
> revision>. Then the retrieving node can process these and determine
> which revisions of which ids it needs, then retrieve those as a second step.
Yes, I think outline mode is definitely needed.
One way to do full mode would be to perhaps reuse the normal update
interface (rather than request certain docs, request that update be
invoked with certain docs... triggering logging, but not replication).
More of an implementation detail I guess.
-Yonik
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 01:41:53
|
On Thu, Feb 28, 2008 at 7:56 PM, Ning Li <nin...@gm...> wrote:
> It seems good. I need to figure out completely the protocol when nodes
> join and leave the system.
Doug, the approach works, and is not too complicated. I like it. :)
Ning
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 00:59:14
|
On Thu, Feb 28, 2008 at 7:56 PM, Ning Li <nin...@gm...> wrote:
> Perhaps just the full mode until we see a demand for the outline mode.
I see the need for the outline mode!
Ning
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 00:55:58
|
On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> A unique updateable index per document would be nice, but I'm not yet
> entirely convinced it is practical.
It's not practical if short glitches are not acceptable. In BigTable, a
tablet is served by a single tablet server. I wonder if they find it to
be a problem.
> I'd imagined each node periodically querying its neighbors for changes
> in the range they share. We shouldn't rely on clock synchronization, so
> each node would keep the last revision of each neighbor that it has
> sync'd with. So, the first time they connect, they pass revision zero
> and receive all updates for their overlap. The next time they only need
> to retrieve updates since the last.
This works...
> Documents could also have an application-specified revision. This would
> greatly simplify reconciliation, since we could use these to resolve all
> disputes in a predictable way: higher revision wins.
I like this.
> Perhaps we'd want two formats for updates sent between nodes: outline
> and full, where outline just contains a sequence of <{ADD|DEL}, id,
> revision>. Then the retrieving node can process these and determine
> which revisions of which ids it needs, then retrieve those as a second step.
Perhaps just the full mode until we see a demand for the outline mode.
> This approach is tolerant of network partitioning, and not too
> complicated. What do you think?
It seems good. I need to figure out completely the protocol when nodes
join and leave the system.
Ning
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 00:28:49
|
On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as primary storage.
A copy of a document will be stored in a stored field, right? I think
positioning this against document databases is nice, but here are a
couple of things worth noting:
First, keeping both a doc and its inverted form in an index means
storing the doc and indexing the doc are done in the same "transaction".
A traditional document database often stores a doc first and then
indexes it later (hopefully soon).
Second, a traditional document database often supports updating a doc's
"metadata", such as author or date. Do we not support this, or do we say
a document is name-value pairs, reconstruct it from stored fields, and
support such updates?
> > I'm tempted to lean toward #3 since logs are needed to sync up nodes
> > (back to question #1).
>
> It would be a nice feature if we could arrange so that, in most cases,
> the client that adds a document sees it in search results immediately.
> We cannot guarantee that all other clients will see it. Some sort of
> immediate indexing of the document is required to support this feature,
> but in-memory is sufficient. We may not implement this feature right
> off, but we should keep it in mind.
>
> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.
If we have a copy of each doc in a stored field, then as you said later,
we can just log the operation id and revision, then retrieve the doc as
necessary from the index.
> As for disk versus memory: if we only send updates to a single node in a
> document's range, then we should sync them to disk. If we instead send
> updates to multiple nodes in the range, then it's probably okay not to
> sync, since we already assume that not all nodes in a range will fail at
> once. The downside of this is that documents could be lost in the case
> of a datacenter-wide power failure, but I think that's acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.
#5 with non-synced, lightweight logging? Should work.
Ning
|
|
From: Doug C. <cu...@ap...> - 2008-02-29 00:08:36
|
Yonik Seeley wrote:
>> I'm assuming a 1:1 mapping between index and node. So if a host has 100
>> nodes on it, then it would have 100 indexes. I'm also imagining more
>> like 4 nodes/indexes per host, not 100. Does that address your concern?
>
> The issue then would be that there are not enough nodes per host
> to evenly partition the keys.
We'll see... I'm hoping we can get the master to help out with
allocations to keep things fairly even. The master could even
periodically reassign nodes that are underloaded to overloaded areas of
the ring.
> I'm still missing what we gain by having them coupled.
I don't understand the alternative. We want to divide things evenly,
but also be able to gracefully incorporate new hosts and dead hosts
without re-allocating everything. We must arrange things so that the
ranges served by a node remain disjoint for this reason too, so that a
host failure doesn't increase more than around 25% the load of any other
single host.
Are you arguing that non-random range boundaries would work better than
random ones? (Maybe I should go re-read your message...) I don't think
that's at odds with anything I've proposed. I think the master needs to
be in charge of placing nodes on the ring, determining how many nodes a
host serves, etc.
> This isn't consistent hashing, in my understanding of it though, since
> placement of some nodes depends on others.
Okay, it's not consistent hashing, whatever.
I don't see how to do this without a master coordinating things (e.g.,
telling clients what to search). The master shouldn't be involved in
individual searches or document additions, nor should it ideally have
much persistent state, but beyond that, I don't see a reason not to let
it arrange things optimally.
> With just a list of nodes, it doesn't seem like a different host would
> be able to reconstruct the ring.
Right, the master would need to give clients the list of nodes, each
with the range of documents that it's currently able to search, and
(perhaps separately) the range of documents it's currently able to index.
I guess, if we wanted to diverge more from consistent hashing, we could
have each node serve a set of ranges, to give the master more freedom to
shuffle things when a node fails. That'd be okay with me.
If we don't use consistent hashing then we have to develop some other
scheme for allocation and deallocation of ids to nodes that has all of
the good properties we need from consistent hashing and additional
improvements. So it seems reasonable to start with consistent hashing
and diverge only as needed.
Doug
|
|
From: Ning L. <nin...@gm...> - 2008-02-28 23:52:54
|
On Thu, Feb 28, 2008 at 6:42 PM, Doug Cutting <cu...@ap...> wrote:
> Does that hold together?
Yes. I thought you meant that. I'm not good at naming things. :)
Ning
|
|
From: Doug C. <cu...@ap...> - 2008-02-28 23:42:48
|
Ning Li wrote:
>> > 2) Index on a virtual node (suggest a name?): A virtual node
>> > serves a number of continuous shards. For example, with
>> > 3-way replication, the indexes on the virtual nodes are:
>> > AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.
>
> If AB is called a "range", then "AB-BC-CD" is a "node-range"?
> Feel free to give better names. :)
If the rule is to index things in the three clockwise nodes, then the
range of node D is (B-D], since it includes all documents after B and
through D in its index. If node B crashes, D's range would become (A-D].
The master would tell it to update its index to reflect that new range,
and it would notify the master when that update is complete. It would
probably keep an IndexReader open on the (B-D] version of its index so
that it could keep servicing queries from clients while it retrieves and
indexes the documents in (A-B], and for a time after that, until all
clients have retrieved the new ring map from the master.
So a node may be serving multiple ranges at a time for search, and its
index may contain multiple ranges. If no documents are added for a time,
then things should stabilize and search ranges should map 1:1 to nodes
and indexes. The node is the process; the index is its data structure.
Does that hold together?
Doug
|
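A small sketch of the half-open, possibly wrapping range arithmetic behind the (B-D] notation above; the Range class is a hypothetical illustration. When node B crashes, D's new range is simply `new Range(hashOfA, hashOfD)`.

```java
// Hypothetical sketch of a half-open, possibly wrapping range (B-D] on
// the hash ring: does a document's hash fall in node D's range?
class Range {
    final long afterExclusive;   // B: everything after this point...
    final long throughInclusive; // D: ...up to and including this one

    Range(long afterExclusive, long throughInclusive) {
        this.afterExclusive = afterExclusive;
        this.throughInclusive = throughInclusive;
    }

    boolean contains(long point) {
        if (afterExclusive < throughInclusive) {  // normal interval
            return point > afterExclusive && point <= throughInclusive;
        }
        // wrapped interval, e.g. (E-A] crossing the top of the ring
        return point > afterExclusive || point <= throughInclusive;
    }
}
```
|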
|
From: Yonik S. <yo...@ap...> - 2008-02-28 23:34:16
|
On Thu, Feb 28, 2008 at 6:00 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > Now if we added 100 nodes (virtual nodes) per host in order to even
> > out key distribution, only a single host will have all of the keys for
> > a node (defined as all of its virtual nodes).
> >
> > HostA: indexA, big mix, big mix
> > HostB: indexB, big mix, big mix
> > HostC: indexC, big mix, big mix
> > HostD: indexD, big mix, big mix
>
> I'm assuming a 1:1 mapping between index and node. So if a host has 100
> nodes on it, then it would have 100 indexes. I'm also imagining more
> like 4 nodes/indexes per host, not 100. Does that address your concern?
The issue then would be that there are not enough nodes per host to
evenly partition the keys. That's why I suggested separate key
partitioning. AFAIK, they ended up separating out key partitioning in
Dynamo too (but for different reasons). I'm still missing what we gain
by having them coupled.
> A related issue is that we don't want to place nodes with overlapping
> ranges on the same host, since that would defeat the purpose of
> replication. The master must assign new nodes to the ring with this
> restriction. It should generally split the largest interval that's also
> distant from other intervals assigned to that host. On a tiny cluster
> it might refuse to add ranges if that would result in overlapping ranges
> on the same host, however this still might occur if hosts fail.
This isn't consistent hashing, in my understanding of it though, since
placement of some nodes depends on others. With just a list of nodes, it
doesn't seem like a different host would be able to reconstruct the ring.
-Yonik
|
|
From: Ning L. <nin...@gm...> - 2008-02-28 23:19:58
|
Subject: Terms
"Node" and "host" are clear now. "Range" and "index" are still a bit
confusing.
On Thu, Feb 28, 2008 at 4:31 PM, Doug Cutting <cu...@ap...> wrote:
> > 1) Shard: The distinct starting positions of all the virtual nodes
> > divide the ring into shards. For example, starting positions
> > (A B C D E) divide the ring into 5 shards: AB, BC, CD, DE, EA.
>
> I prefer to call these "ranges". "Shard" to me sounds like something
> physical, and ranges are not physical.
>
> > 2) Index on a virtual node (suggest a name?): A virtual node
> > serves a number of continuous shards. For example, with
> > 3-way replication, the indexes on the virtual nodes are:
> > AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.
If AB is called a "range", then "AB-BC-CD" is a "node-range"? Feel free
to give better names. :)
> > Now, should an index on a virtual node be implemented as
> > one Lucene index or N Lucene indexes (one per shard)?
>
> My hunch is one index per range. That way we can search a set of
> indexes that completes the ring, and search maximally large segments.
You mean one "index" per "node-range", and we need filtering, right?
Since later you said, "I'm assuming a 1:1 mapping between index and node."
Ning
|
|
From: Doug C. <cu...@ap...> - 2008-02-28 23:00:32
|
Yonik Seeley wrote:
> Now if we added 100 nodes (virtual nodes) per host in order to even
> out key distribution, only a single host will have all of the keys for
> a node (defined as all of its virtual nodes).
>
> HostA: indexA, big mix, big mix
> HostB: indexB, big mix, big mix
> HostC: indexC, big mix, big mix
> HostD: indexD, big mix, big mix
I'm assuming a 1:1 mapping between index and node. So if a host has 100
nodes on it, then it would have 100 indexes. I'm also imagining more
like 4 nodes/indexes per host, not 100. Does that address your concern?
A related issue is that we don't want to place nodes with overlapping
ranges on the same host, since that would defeat the purpose of
replication. The master must assign new nodes to the ring with this
restriction. It should generally split the largest interval that's also
distant from other intervals assigned to that host. On a tiny cluster it
might refuse to add ranges if that would result in overlapping ranges on
the same host; however, this still might occur if hosts fail.
Doug
|
|
From: Doug C. <cu...@ap...> - 2008-02-28 22:45:12
|
Yonik Seeley wrote:
> Are you thinking of storage outside of Lucene stored fields too then?
I think we'll have to maintain logs outside of Lucene. Beyond that, I
don't see that we'd need to use more than stored fields. Do you?
I can see two approaches: log the entire document, or just log the
operation id and revision, then extract the document data as needed from
the index. There may be cases where a neighbor asks for a document
that's no longer in the index, since it may have been deleted since the
neighbor asked for updates. If we can either make sure this doesn't
happen, or make things tolerant of it, then we could get away with only
storing documents in the index, which would be nice.
Doug
|
|
From: Doug C. <cu...@ap...> - 2008-02-28 22:37:31
|
Ning Li wrote:
> In the case of one updateable replica, if the updateable replica is down,
> we need to make another replica updateable. The downside is that
> the shard is not updateable during the switch. But it should be short
> and thus be fine for some applications. The complexity involved with
> all updateable replicas is conflict resolution...
A unique updateable index per document would be nice, but I'm not yet
entirely convinced it is practical.
Say your client fetches the ring of nodes from the master, then starts
adding documents. In the meantime, the master updates the list of
nodes, and a second client gets the new, modified ring. In this case,
two clients could update different nodes for the same document.
Another scenario is that a switch is congested, and some clients are
unable to connect to an index to update it, while other clients can
connect to it. The clients that cannot connect update a different node
for the same documents.
Perhaps we could guard against these by having clients "lease" the ring
from the master. Then the master could make sure that it doesn't issue
a new version of the ring until all leases to the old version have
expired. The master could choose to republish the ring every five
minutes. A node that leases the ring four minutes into a lease window
would only get a one minute lease. But all nodes would then hammer the
master for a new version at the same time, every five minutes. Can you
think of a better way?
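A minimal sketch of the windowed lease just described, with hypothetical names: every lease expires at the same window boundary, which is exactly why all clients would come back to the master at once.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the windowed lease: every lease expires at the
// end of the current five-minute publishing window, so the master can
// safely publish a new ring map at each window boundary.
class RingLease {
    static final Duration WINDOW = Duration.ofMinutes(5);

    final Object ringMap;     // the ring version this lease covers
    final Instant expiresAt;  // end of the current window

    RingLease(Object ringMap, Instant now) {
        this.ringMap = ringMap;
        long windowMillis = WINDOW.toMillis();
        // Round up to the next window boundary: a client arriving four
        // minutes into a window gets only a one-minute lease -- and every
        // client comes back at the same boundary, hence the stampede.
        long boundary = (now.toEpochMilli() / windowMillis + 1) * windowMillis;
        this.expiresAt = Instant.ofEpochMilli(boundary);
    }

    boolean valid(Instant now) {
        return now.isBefore(expiresAt);
    }
}
```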
Unless we can come up with a foolproof mechanism or somesuch, I think
we'll have to handle the case where multiple indexes are updated in
their overlap and must reconcile these updates.
I'd imagined each node periodically querying its neighbors for changes
in the range they share. We shouldn't rely on clock synchronization, so
each node would keep the last revision of each neighbor that it has
sync'd with. So, the first time they connect, they pass revision zero
and receive all updates for their overlap. The next time they only need
to retrieve updates since the last.
Documents could also have an application-specified revision. This would
greatly simplify reconciliation, since we could use these to resolve all
disputes in a predictable way: higher revision wins.
Perhaps we'd want two formats for updates sent between nodes: outline
and full, where outline just contains a sequence of <{ADD|DEL}, id,
revision>. Then the retrieving node can process these and determine
which revisions of which ids it needs, then retrieve those as a second step.
This approach is tolerant of network partitioning, and not too
complicated. What do you think?
Doug
|
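A hedged sketch of the "outline" format and the higher-revision-wins rule proposed above; the names are hypothetical. The outline pass lets a node decide which ids it needs before fetching any document bodies as a second step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "outline" update format -- a sequence of
// <{ADD|DEL}, id, revision> -- and the higher-revision-wins rule.
class OutlineSync {
    enum Op { ADD, DEL }

    record Entry(Op op, String id, long revision) {}

    /**
     * First step: scan a neighbor's outline and keep only the entries
     * that beat our local revision. ADDs among these are then fetched in
     * full as a second step; DELs can be applied directly.
     */
    static List<Entry> reconcile(List<Entry> outline, Map<String, Long> localRevisions) {
        List<Entry> toApply = new ArrayList<>();
        for (Entry e : outline) {
            long local = localRevisions.getOrDefault(e.id(), -1L);
            if (e.revision() > local) {  // higher revision wins, predictably
                toApply.add(e);
            }
        }
        return toApply;
    }
}
```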
|
From: Yonik S. <yo...@ap...> - 2008-02-28 22:24:46
|
On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as primary storage.
Are you thinking of storage outside of Lucene stored fields too then?
> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.
Unless all fields are stored, it would be a lengthy process trying to
extract a single document that had been added to an index. Using log
replay would seem to be more general purpose, as it could more easily
accommodate other side effects in a system (any changes made outside of
a Lucene index).
> As for disk versus memory: if we only send updates to a single node in a
> document's range, then we should sync them to disk. If we instead send
> updates to multiple nodes in the range, then it's probably okay not to
> sync, since we already assume that not all nodes in a range will fail at
> once. The downside of this is that documents could be lost in the case
> of a datacenter-wide power failure, but I think that's acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.
Sounds good. I recall from the Google File System paper how they sent to
multiple nodes in a chain (client->A->B->C) rather than having the
client send in parallel (client->(A,B,C)), which made a lot of sense at
the time (maximizing single-NIC bandwidth, etc.). Perhaps too much
detail right now, but it's worth keeping in mind.
-Yonik
|
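A toy sketch of that chained replication idea, with hypothetical names: the client sends one copy to the head of the chain and each replica forwards downstream, so no single sender's NIC carries N copies of the data.

```java
import java.util.List;

// Hypothetical sketch of chained replication as in the GFS paper: the
// client sends one copy to the head of the chain, and each replica
// forwards downstream, so no single NIC carries N copies of the data.
class ChainedAdd {

    interface Node {
        void add(byte[] doc, List<Node> rest);
    }

    static class ReplicaNode implements Node {
        @Override
        public void add(byte[] doc, List<Node> rest) {
            indexLocally(doc);  // buffer in memory; no sync per add
            if (!rest.isEmpty()) {
                // client -> A -> B -> C rather than client -> (A, B, C)
                rest.get(0).add(doc, rest.subList(1, rest.size()));
            }
        }

        private void indexLocally(byte[] doc) { /* omitted in this sketch */ }
    }
}
```
|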
|
From: Yonik S. <yo...@ap...> - 2008-02-28 22:06:56
|
On Thu, Feb 28, 2008 at 4:39 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > I think one problem we are running into is due to using the same hash
> > ring for both partitioning the key space and selecting placement of
> > those keys. Putting many virtual nodes on the hash ring to make the
> > hash function fair, in conjunction with using that hash ring to do
> > replication, causes all the indices to be mixed together.
>
> That's a feature, not a bug! Seriously, I'm not sure what your concern
> here is. If documents are indexed in the N nodes clockwise from their
> point, then we do get documents indexed together, but we only need to
> search every Nth index (in most cases). So I don't see the problem.
> What am I missing?
Given that you and Ning don't see it, it's highly likely I'm the one
with the disconnect. Without virtual nodes, everything makes sense to
me... I can see how we can query every third host to cover the entire
index once.
I'm using "virtual nodes" in the sense used here:
http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
To be explicit, with no virtual nodes and N=3, I think this is what we
would end up with if we added a single node per host:
HostA: indexA, indexC, indexD
HostB: indexB, indexD, indexA
HostC: indexC, indexA, indexB
HostD: indexD, indexB, indexC
So we can query two nodes and get complete coverage (with appropriate
filtering). Now if we added 100 nodes (virtual nodes) per host in order
to even out key distribution, only a single host will have all of the
keys for a node (defined as all of its virtual nodes).
HostA: indexA, big mix, big mix
HostB: indexB, big mix, big mix
HostC: indexC, big mix, big mix
HostD: indexD, big mix, big mix
We could no longer say that nodeC is available from both hostA and
hostB, because there are 100 nodeCs. So now it seems like we must query
all hosts to get complete coverage, and if HostA goes offline, we must
query all other hosts to get coverage of indexA.
-Yonik
|
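The coverage argument can be checked with a toy simulation (hypothetical code; it assumes random virtual-node placement and that a document range is replicated on the hosts owning the next N=3 ring positions clockwise): with one node per host, querying every other host always covers the ring, since any three consecutive positions hold three distinct hosts; with 100 virtual nodes per host, some run of three consecutive positions will almost surely belong entirely to the unqueried hosts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical simulation of the coverage argument above. A ring
// segment is assumed replicated on the hosts owning the next N=3
// positions clockwise. Can querying only every other host (HostA and
// HostC) cover the whole ring?
class CoverageSim {
    public static void main(String[] args) {
        System.out.println("V=1:   covered by {A,C}? " + covered(1));    // true
        System.out.println("V=100: covered by {A,C}? " + covered(100));  // almost surely false
    }

    static boolean covered(int vnodesPerHost) {
        TreeMap<Long, String> ring = new TreeMap<>();
        Random rnd = new Random(42);  // random vnode placement, fixed seed
        for (String h : List.of("HostA", "HostB", "HostC", "HostD"))
            for (int i = 0; i < vnodesPerHost; i++)
                ring.put(rnd.nextLong(), h);
        Set<String> queried = Set.of("HostA", "HostC");
        List<Long> points = new ArrayList<>(ring.keySet());
        int n = points.size();
        for (int i = 0; i < n; i++) {    // the segment ending at points[i]...
            boolean hit = false;
            for (int r = 0; r < 3; r++)  // ...lives on these three hosts
                hit |= queried.contains(ring.get(points.get((i + r) % n)));
            if (!hit) return false;      // a segment only HostB/HostD can serve
        }
        return true;
    }
}
```
|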
|
From: Doug C. <cu...@ap...> - 2008-02-28 22:01:17
|
Yonik Seeley wrote:
> On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
>> At the same time, should we also discuss what the update
>> model should be:
>> 1 One updatable replica vs. all updatable replicas. The former
>> is simple. The latter is powerful. Is there sufficient need for
>> the latter to justify its complexity?
>
> We should always be able to update (so if the "updateable" replica is
> down, we need to be able to update another replica).
I agree. To support network partitions, all nodes must accept updates
for documents in their range.
>> 2 The atomicity of an insert/delete/update operation. When
>> an insert/delete/update operation is done, does it mean:
>> 1) the new doc is indexed in the memory of the node
>> 2) the new doc is indexed on the local disk of the node
>> 3) the new doc is logged on the local disk of the node
>> 4) the new doc is logged in some fault-tolerant shared FS
>> (e.g. HDFS)
>> 5) the new doc is indexed in the memory of at least X
>> nodes
>> The probability of the operation getting lost is from high
>> to low: 1), then 2) and 3), then 4) and 5).
>
> [ ... ] I think the decision partially depends on how much of
> a document storage system this is, vs. just an index that can be
> rebuilt.
I'd like to position this against document databases, so I'm hoping it
can be used as primary storage.
> I'm tempted to lean toward #3 since logs are needed to sync up nodes
> (back to question #1).
It would be a nice feature if we could arrange so that, in most cases,
the client that adds a document sees it in search results immediately.
We cannot guarantee that all other clients will see it. Some sort of
immediate indexing of the document is required to support this feature,
but in-memory is sufficient. We may not implement this feature right
off, but we should keep it in mind.
Logging is attractive, since it permits easy replaying of logs when
shipping updates between nodes. Perhaps we can instead use queries to
enumerate changes, but that requires more thought.
As for disk versus memory: if we only send updates to a single node in a
document's range, then we should sync them to disk. If we instead send
updates to multiple nodes in the range, then it's probably okay not to
sync, since we already assume that not all nodes in a range will fail at
once. The downside of this is that documents could be lost in the case
of a datacenter-wide power failure, but I think that's acceptable.
Performance will suffer considerably if we have to sync on each add. So
my inclination is to attempt to add documents to several nodes in the
range and not require a sync per add, buffering things in memory as
required for good performance. Replication in memory provides
fault-tolerance.
Doug
|
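A rough sketch of what #5 plus lightweight, non-synced logging could look like (a hypothetical class; the real design would also buffer adds in memory and replicate to several nodes in the range): the log supports replay between nodes but is never fsync'd, trading datacenter-wide power-failure safety for per-add performance, exactly the trade-off described above.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch of option #5 plus lightweight, non-synced logging:
// append each add to a local log for replay between nodes, but never
// fsync -- in-memory replicas on the other nodes in the range are the
// real safety net, and only a datacenter-wide power failure loses data.
class NonSyncedLog implements AutoCloseable {
    private final DataOutputStream out;

    NonSyncedLog(String path) throws IOException {
        out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path, true)));
    }

    void append(String docId, long revision, byte[] doc) throws IOException {
        out.writeUTF(docId);
        out.writeLong(revision);
        out.writeInt(doc.length);
        out.write(doc);
        // Deliberately no flush()/sync() here: syncing on every add is
        // exactly the per-add cost the thread wants to avoid.
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```
|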
|
From: Doug C. <cu...@ap...> - 2008-02-28 21:39:41
|
Yonik Seeley wrote:
> I think one problem we are running into is due to using the same hash
> ring for both partitioning the key space and selecting placement of
> those keys. Putting many virtual nodes on the hash ring to make the
> hash function fair, in conjunction with using that hash ring to do
> replication, causes all the indices to be mixed together.
That's a feature, not a bug! Seriously, I'm not sure what your concern
here is. If documents are indexed in the N nodes clockwise from their
point, then we do get documents indexed together, but we only need to
search every Nth index (in most cases). So I don't see the problem.
What am I missing?
When a new node is introduced to the ring, then it will need to ask its
neighbors for the documents assigned to it, and ranges for search should
not be re-assigned until it has completed retrieving and indexing these.
Similarly, when a node is removed or declared dead, then others will be
assigned new ranges to construct from neighbors.
Doug
|