|
From: Doug C. <cu...@ap...> - 2008-02-08 23:37:37
|
It would be good to have a system that supports online updates, so that we can support the widest variety of users. We don't want to force Solr to reinvent things to become scalable.

That said, there will be some compromises. The compromise I'm proposing is that we permit conflicting changes in some situations (e.g., network partitioning) and that we silently resolve those conflicts in a predictable way. This isn't perfect, but I don't see another way to both meet the scalability and reliability goals and to support online updates. We can hopefully structure things so that conflicts are rare, but when they occur, some kinds of updates may be lost.

Do others think this is a reasonable starting point for a design?

I added some initial thoughts about sharding to the wiki at:

http://bailey.wiki.sourceforge.net/
http://bailey.wiki.sourceforge.net/HowToShard

Some things I'm still not clear on:

- Can we get away with a master that has minimal persistent state?

- How do we split and merge shards? A shard is identified by a range of docids. As we split and merge, at some point some nodes will have fragmentary shards (sub-ranges of a shard) and/or super-shards (ranges spanning multiple shards). The master needs to somehow make sense of this. If the system is restarted when some nodes have split a shard and some have not, then some ranges may be spanned by either a single node or multiple nodes. The system cannot start until there's a continuous path through the lattice, spanning all docids.

So the master is the manager of a lattice: asking nodes to make changes, reflecting those changes in the lattice once they're complete, and presenting the lattice to search clients. Points in the lattice are positions on the docid circle; arcs are cluster nodes that maintain an index for that range of docids. Each node must know about the other nodes whose ranges intersect its own, so that it can propagate changes. Does this make sense?

How do we keep track of which changes have been propagated? If there were never splits or merges, we might get away with using node identity: if node X has sent its changes A through C to node Y, then the next time it sends changes to Y, it could start with change D. But once things start splitting, merging and otherwise migrating, this gets harder. Any ideas?

Am I barking up the wrong tree altogether?

Thanks!

Doug
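Doug's node-identity bookkeeping ("if node X has sent changes A through C to node Y, start next time with D") could be sketched as a per-peer high-water mark. This is an illustrative sketch only, with hypothetical names; as noted above, the scheme breaks down once shards split, merge, or migrate, because the peer responsible for a given docid range is no longer stable.

```python
# Sketch of per-peer change tracking (all names hypothetical). Each node
# keeps an ordered log of its changes and remembers, per peer, how far
# into that log it has already pushed; the next push resumes from there.
class ChangePropagator:
    def __init__(self):
        self.changes = []       # ordered change ids, e.g. ["A", "B", "C"]
        self.sent_up_to = {}    # peer -> index of the next change to send

    def record(self, change_id):
        self.changes.append(change_id)

    def pending_for(self, peer):
        # everything recorded since the last successful push to this peer
        start = self.sent_up_to.get(peer, 0)
        return self.changes[start:]

    def mark_sent(self, peer):
        self.sent_up_to[peer] = len(self.changes)
```

Usage: X records A through C, pushes them to Y and calls mark_sent("Y"); after recording D, pending_for("Y") is just ["D"].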
|
From: Yonik S. <yo...@ap...> - 2008-02-08 23:47:52
|
On Feb 8, 2008 6:37 PM, Doug Cutting <cu...@ap...> wrote:
> It would be good to have a system that supports online updates, so that
> we can support the widest variety of users. We don't want to force Solr
> to reinvent things to become scalable.
>
> That said, there will be some compromises. The compromise I'm proposing
> is that we permit conflicting changes in some situations (i.e., network
> partitioning) and that we silently resolve those conflicts in a
> predictable way. This isn't perfect, but I don't see another way to
> both meet the scalability and reliability goals and to support online
> updates. We can hopefully structure things so that conflicts are rare,
> but when they occur, some kinds of updates may be lost.
>
> Do others think this is a reasonable starting point for a design?

+1

-Yonik
|
From: Ning L. <nin...@gm...> - 2008-02-10 19:07:15
|
On Feb 8, 2008 3:37 PM, Doug Cutting <cu...@ap...> wrote:
> So the master is the manager of a lattice

While I'm trying to figure out completely how it would work, would consistent hashing, and making the master the manager of a ring, work and be simpler? (Reference: consistent hashing and Dynamo's partitioning algorithm.)

As in consistent hashing, the entire range of docids (or their hashes) is treated as a ring. There is no clear concept of a shard here. A node is assigned a value (starting position) on the ring, and it hosts an index for a range of docids starting from that position. The system cannot start until every range of docids is covered by at least one node. In the normal state, any range of docids is covered by multiple nodes, depending on the level of replication.

So the master becomes the manager of a ring in this case. Given the ring, a search client sends a query to a (hopefully minimal) set of nodes whose ranges cover the ring. When sending the query to a node in the set, the search client also specifies the sub-range of docids on which the query should be executed. This is to make sure that every range of docids is queried, and queried only once.

The above requires that a Lucene index can efficiently support a query on a sub-range of docids - application/system docids, not Lucene docids. That can be achieved by a small modification to Lucene so that, within a segment, Lucene docids are assigned in the same order as their corresponding application/system docids during build/merge. A nice side effect of this is that it becomes efficient to delete a sub-range of docids from an index as well.

Thoughts?

Regards,
Ning
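Ning's ring could be sketched as a sorted list of node starting positions, where a docid belongs to the node whose starting position most closely precedes it, wrapping around at the end of the ring. A minimal illustrative sketch (names hypothetical; a real system would hash docids and place multiple replicas per range):

```python
import bisect

# Sketch of the docid ring (illustrative only): each node is assigned a
# starting position; a docid is owned by the node whose starting position
# most closely precedes it, wrapping around past the largest position.
class Ring:
    def __init__(self, ring_size=2**32):
        self.ring_size = ring_size
        self.starts = []   # sorted starting positions
        self.nodes = {}    # starting position -> node name

    def add_node(self, name, start):
        bisect.insort(self.starts, start)
        self.nodes[start] = name

    def node_for(self, docid):
        pos = docid % self.ring_size
        i = bisect.bisect_right(self.starts, pos) - 1  # predecessor position
        return self.nodes[self.starts[i]]              # i == -1 wraps to the last start
```

Usage: with nodes starting at 0, 40 and 70 on a ring of size 100, docid 10 lands on the first node, docid 99 on the third.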
|
From: Doug C. <cu...@ap...> - 2008-02-14 22:23:42
|
Ning Li wrote:
> So the master becomes the manager of a ring in this case. Given
> the ring, a search client sends a query to a (hopefully
> minimal) set of nodes whose ranges cover the ring. When sending
> the query to a node in the set, the search client also
> specifies the sub-range of docids on which the query should be
> executed. This is to make sure that any range of docids is
> queried and only queried once.

I'd been thinking of something that was like consistent hashing, but that tried to keep indexes mostly disjoint. But perhaps consistent hashing itself will work well here.

If each document is replicated in, say, the two clockwise indexes from the index serving its range, then one need only query every third index to achieve complete coverage, right? Things get tricky when indexes are added or removed from the ring, when the number of nodes in the ring isn't divisible by three, etc. Some range filtering will be required in these cases, but not in most. Hopefully we could find a way so that any range filtering that's required is spread around the ring, to avoid hot-spots. Having nodes serve multiple indexes, at different points on the ring, will help some with hot-spots too.

With N-way replication, and only querying every Nth node, when a node fails, 2 other indexes, one on each side, must be queried to cover the missing range, since no other single index covers that range. But there are N-1 different pairs of indexes that do cover the range. So the load on neighbors will increase by 1/(N-1) on average: if N=3, neighbor nodes would get 50% more queries; if N=4, neighbors would get 33% more queries; etc. If each node serves M indexes, then this impact would be diminished. So if N=3 and M=4 (each node serves four indexes), then a neighbor node's load would increase by just 12.5%, which is pretty manageable.

Another approach might be to query overlapping ranges and filter in the client. With N-way replication you'd query every N/2th index. Search results would include facet counts for sub-ranges, so they could be correctly merged. If N=4, querying every other index, then, when a node fails, no other indexes need be re-queried; similarly for N=6, querying every third index. Here you take a big hit up front, always searching twice as big an index as you need to, but avoid the latencies of re-querying.

Does any of this make sense?

> The above requires that a Lucene index can efficiently support
> query on a sub-range of docids - application/system docids,
> not Lucene docids.

If simply implemented with a filter based on a FieldCache, this is fast, but the expense is still that of searching the entire index.

> That can be achieved by a bit modification
> to Lucene so that within a segment, Lucene docids are assigned
> in the same order as their corresponding application/system
> docids during build/merge. A nice side-effect of this is that
> it becomes efficient to delete a sub-range of docids from an
> index as well.

I don't see how that's easy. Lucene assumes that newly indexed ids are always greater than previously indexed ids, and that assumption is fairly deep. Segments could be re-sorted, I guess, and postings merged rather than appended, but that'd be a substantial change to Lucene. Is that what you had in mind?

Doug
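Doug's every-Nth-index query plan could be sketched as follows, under the assumption that each index holds its own range plus the ranges of the N-1 indexes counter-clockwise of it, so that any N consecutive indexes cover each range exactly once (illustrative only; the document's replication scheme is "two clockwise indexes" for N=3):

```python
import math

# Sketch of selecting which indexes to query for full coverage with N-way
# replication: query positions start, start+N, start+2N, ... around the
# ring. When len(ring) is not divisible by N, the last selected index
# overlaps the first, and that overlap would need range filtering.
def query_set(ring, n, start=0):
    k = math.ceil(len(ring) / n)   # number of indexes needed for coverage
    return [ring[(start + j * n) % len(ring)] for j in range(k)]
```

With a 9-node ring and N=3 you query 3 indexes; with a 10-node ring and N=3 you query 4, and the fourth overlaps the first - which is also why (as Yonik suggests below in the thread) a random starting offset spreads the filtering load around the ring.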
|
From: Yonik S. <yo...@ap...> - 2008-02-15 02:38:58
|
On Thu, Feb 14, 2008 at 5:23 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > So the master becomes the manager of a ring in this case. Given
> > the ring, a search client sends a query to a (hopefully
> > minimal) set of nodes whose ranges cover the ring. When sending
> > the query to a node in the set, the search client also
> > specifies the sub-range of docids on which the query should be
> > executed. This is to make sure that any range of docids is
> > queried and only queried once.
>
> I'd been thinking of something that was like consistent hashing, but
> that tried to keep indexes mostly disjoint. But perhaps consistent
> hashing itself will work well here.
>
> If each document is replicated in, say, the two clockwise indexes from
> the index serving its range, then one need only query every third index
> to achieve complete coverage, right? Things get tricky when indexes are
> added or removed from the ring, when the number of nodes in the ring
> isn't divisible by three, etc. Some range filtering will be required in
> these cases, but not in most. Hopefully we could find a way so that any
> range-filtering that's required is spread around the ring, to avoid
> hot-spots.

Starting at a random node and querying every Nth node, and then filtering the last overlapping parts, seems like it should spread that evenly.

> Having nodes serve multiple indexes, at different points on
> the ring will help some with hot-spots too.
>
> With N-way replication, and only querying every Nth node, when a node
> fails, 2 other indexes, one on each side, must be queried to cover the
> missing range, since no other single index covers that range. But,
> there are N-1 different pairs of indexes that do cover the range. So
> the load on neighbors will increase by 1/(N-1) on average. If N=3,
> neighbor nodes would get 50% more queries. If N=4, neighbors would get
> 33% more queries, etc.

OK, this makes sense so far... for a 9-node system with N=3, you query 3 nodes with indices each 1/3 of the size of the complete index.

> If each node serves M indexes, then this impact
> would be diminished. So if N=3 and M=4 (each node serves four indexes)
> then a neighbor node's load would increase by just 12.5%, which is
> pretty manageable.

I got lost at "M" a little... What's an index in this context? Does it mean additional virtual nodes on the hash circle that map to the same physical node (an index being the arc between two adjacent points)? That's normally needed to make consistent hashing fair anyway, right?

> > The above requires that a Lucene index can efficiently support
> > query on a sub-range of docids - application/system docids,
> > not Lucene docids.
>
> If simply implemented with a filter based on a FieldCache, this is fast,
> but the expense is still that of searching the entire index.

Just thinking about the static case, one could keep separate docs in separate indices. A multireader over 3 indices wouldn't be too bad. But then if you start adding and removing nodes, it gets messy. If the load increase is low enough (12.5% isn't too bad), straightforward filtering is probably the best. It seems like we need filtering capabilities anyway for the cases when nodes are being rebalanced because the number of nodes changed.

-Yonik
|
From: Ning L. <nin...@gm...> - 2008-02-15 04:51:30
|
On Thu, Feb 14, 2008 at 5:23 PM, Doug Cutting <cu...@ap...> wrote:
> If each document is replicated in, say, the two clockwise indexes from
> the index serving its range, then one need only query every third index
> to achieve complete coverage, right? Things get tricky when indexes are
> added or removed from the ring, when the number of nodes in the ring
> isn't divisible by three, etc. Some range filtering will be required in
> these cases, but not in most. Hopefully we could find a way so that any
> range-filtering that's required is spread around the ring, to avoid
> hot-spots. Having nodes serve multiple indexes, at different points on
> the ring will help some with hot-spots too.

Agree.

> If N=4, neighbors would get
> 33% more queries, etc. If each node serves M indexes, then this impact
> would be diminished. So if N=3 and M=4 (each node serves four indexes)
> then a neighbor node's load would increase by just 12.5%, which is
> pretty manageable.

I think Yonik's interpretation is correct, right?

> Another approach might be to query overlapping ranges and filter in the
> client. With N-way replication you'd query every N/2th index. Search
> results would include facet counts for sub-ranges, so they could be
> correctly merged. If N=4, querying every other index, then, when a node
> fails, no other indexes need be re-queried. Similarly for N=6 querying
> every third index. Here you take a big hit up front, always searching
> twice as big of an index as you need to, but avoid the latencies of
> re-querying.

Whether this is worthwhile depends on the expected percentage of queries that would need to be re-run on some indexes...

> > The above requires that a Lucene index can efficiently support
> > query on a sub-range of docids - application/system docids,
> > not Lucene docids.
>
> If simply implemented with a filter based on a FieldCache, this is fast,
> but the expense is still that of searching the entire index.

Yes, this should be good enough.

> > ... so that within a segment, Lucene docids are assigned
> > in the same order as their corresponding application/system
> > docids during build/merge...
>
> I don't see how that's easy. Lucene assumes that newly indexed ids are
> always greater than previously indexed ids, and that assumption is
> fairly deep. Segments could be re-sorted I guess, and postings merged
> rather than appended. But that'd be a substantial change to Lucene. Is
> that what you had in mind?

I was thinking of keeping Lucene docids in the same order as their application/system docids within a segment, not across segments. The merge algorithm will be different, but segments don't have to be re-sorted during merge.

To take advantage of this, however, we'll have to query each segment individually and then merge results (query on SegmentReader and merge results, instead of querying on MultiSegmentReader). We'll need the same result-merge algorithm on clients. But until we implement that algorithm, let's use filtering.

Regards,
Ning
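Ning's query-each-segment-and-merge idea could be sketched like this: if, within every segment, hits come back already ordered by application docid, a single streaming merge combines them, and the same merge would run on clients across nodes. An illustrative sketch (hit tuples and names are hypothetical; real result merging would typically order by score instead, with the same pattern and a different key):

```python
import heapq

# Sketch of merging per-segment results (illustrative). Each segment
# returns hits as (app_docid, score) pairs sorted by application docid;
# heapq.merge combines the sorted streams in one pass.
def merge_segment_hits(per_segment_hits):
    return list(heapq.merge(*per_segment_hits, key=lambda hit: hit[0]))
```

Usage: two segments returning docids {1, 7} and {3, 8} merge into the globally ordered stream 1, 3, 7, 8.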
|
From: Doug C. <cu...@ap...> - 2008-02-28 22:45:12
|
Yonik Seeley wrote:
> Are you thinking of storage outside of Lucene stored fields too then?

I think we'll have to maintain logs outside of Lucene. Beyond that, I don't see that we'd need to use more than stored fields. Do you?

I can see two approaches: log the entire document, or log just the operation, id, and revision, then extract the document data as needed from the index. There may be cases where a neighbor asks for a document that's no longer in the index, since it may have been deleted since the neighbor last asked for updates. If we can either make sure this doesn't happen, or make things tolerant of it, then we could get away with only storing documents in the index, which would be nice.

Doug
|
From: Doug C. <cu...@ap...> - 2008-02-29 00:08:36
|
Yonik Seeley wrote:
>> I'm assuming a 1:1 mapping between index and node. So if a host has 100
>> nodes on it, then it would have 100 indexes. I'm also imagining more
>> like 4 nodes/indexes per host, not 100. Does that address your concern?
>
> Then the issue would be that there are not enough nodes per host
> to evenly partition the keys.

We'll see... I'm hoping we can get the master to help out with allocations to keep things fairly even. The master could even periodically reassign underloaded nodes to overloaded areas of the ring.

> I'm still missing what we gain by having them coupled.

I don't understand the alternative. We want to divide things evenly, but also be able to gracefully incorporate new hosts and dead hosts without re-allocating everything. We must arrange things so that the ranges served by a node remain disjoint for this reason too, so that a host failure doesn't increase the load on any other single host by more than around 25%.

Are you arguing that non-random range boundaries would work better than random ones? (Maybe I should go re-read your message...) I don't think that's at odds with anything I've proposed. I think the master needs to be in charge of placing nodes on the ring, determining how many nodes a host serves, etc.

> This isn't consistent hashing, in my understanding of it though, since
> placement of some nodes depends on others.

Okay, it's not consistent hashing, whatever. I don't see how to do this without a master coordinating things (e.g., telling clients what to search). The master shouldn't be involved in individual searches or document additions, nor should it ideally have much persistent state, but beyond that, I don't see a reason not to let it arrange things optimally.

> With just a list of nodes, it doesn't seem like a different host would
> be able to reconstruct the ring.

Right, the master would need to give clients the list of nodes, each with the range of documents that it's currently able to search, and (perhaps separately) the range of documents it's currently able to index.

I guess, if we wanted to diverge more from consistent hashing, we could have each node serve a set of ranges, to give the master more freedom to shuffle things when a node fails. That'd be okay with me.

If we don't use consistent hashing, then we have to develop some other scheme for allocating and deallocating ids to nodes that has all of the good properties we need from consistent hashing, plus any additional improvements. So it seems reasonable to start with consistent hashing and diverge only as needed.

Doug
|
From: Yonik S. <yo...@ap...> - 2008-02-29 13:27:43
|
On Thu, Feb 28, 2008 at 7:08 PM, Doug Cutting <cu...@ap...> wrote:
> Yonik Seeley wrote:
> > I'm still missing what we gain by having them coupled.
>
> I don't understand the alternative. We want to divide things evenly,
> but also be able to gracefully incorporate new hosts and dead hosts
> without re-allocating everything. We must arrange things so that the
> ranges served by a node remain disjoint for this reason too, so that a
> host failure doesn't increase more than around 25% the load of any other
> single host.

Here is an example of how to separate key partitioning from replication (single node per host, to keep things easier for now):

Key partitioning: when hostX is added to the cluster, add 100 nodes (vnodeX1, vnodeX2, vnodeX3...) to the hash ring for partitioning keys. This steals a little bit from every other node. There is no replication concept in this ring.

Replication: add a single nodeX to the replication hash ring. The two clockwise nodes replicate this node, and the two nodes in the other direction should be replicated by this node.

This still works if each node serves M indexes (4 is what you suggested, I think): add 4 nodes to the replication ring, and for each of those nodes, determine its key space with the key ring.

With these concerns separated, either could be changed or adjusted independently. I'm not advocating that specific method of key partitioning, though... while it is more fair (a new node steals documents from all other nodes), that could also be a drawback, depending on exactly how the system (rebalancing) works.

> > This isn't consistent hashing, in my understanding of it though, since
> > placement of some nodes depends on others.
>
> Okay, it's not consistent hashing, whatever.
>
> I don't see how to do this without a master coordinating things (e.g.,
> telling clients what to search). The master shouldn't be involved in
> individual searches or document additions, nor should it ideally have
> much persistent state, but beyond that, I don't see a reason not to let
> it arrange things optimally.

OK, that changes things... I wasn't clear on exactly why consistent hashing was used. In this case it sounds like more of an implementation detail, rather than part of the interface (which it is in some other systems).

> > With just a list of nodes, it doesn't seem like a different host would
> > be able to reconstruct the ring.
>
> Right, the master would need to give clients the list of nodes, each
> with the range of documents that it's currently able to search, and
> (perhaps separately) the range of documents it's currently able to index.

Would this info be obtainable from any node? It seems like it should be.

> I guess, if we wanted to diverge more from consistent hashing, we
> could have each node serve a set of ranges, to give the master more
> freedom to shuffle things when a node fails. That'd be okay with me.

When a new master comes up, it seems like nodes would need to report their actual ranges anyway.

> If we don't use consistent hashing then we have to develop some other
> scheme for allocation and deallocation of ids to nodes that has all of
> the good properties we need from consistent hashing and additional
> improvements. So it seems reasonable to start with consistent hashing
> and diverge only as needed.

That's fine... it sounds like, from everything above, that this won't be exposed to anyone but the master anyway.

-Yonik
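Yonik's separation of the key-partitioning ring from the replication ring could be sketched as follows. This is purely illustrative: the hash choice, vnode count, and all names are assumptions, not anything the thread settled on.

```python
import bisect
import hashlib

# Sketch of two separate rings (all names hypothetical).
# Ring 1 (key partitioning): many virtual nodes per host, for fairness.
# Ring 2 (replication): one entry per host; each host is replicated by
# the next n-1 hosts clockwise, so every-Nth-host querying still works.

def _pos(label):
    # stable position on a 32-bit ring
    return int(hashlib.md5(label.encode()).hexdigest()[:8], 16)

def build_key_ring(hosts, vnodes=100):
    # (position, host) pairs sorted by position
    return sorted((_pos("%s-vnode%d" % (h, i)), h)
                  for h in hosts for i in range(vnodes))

def owner(key_ring, key):
    positions = [p for p, _ in key_ring]
    i = bisect.bisect_right(positions, _pos(key)) - 1
    return key_ring[i][1]          # i == -1 wraps around to the last vnode

def replicas(hosts, host, n=3):
    ring = sorted(hosts, key=_pos)  # replication ring: one entry per host
    i = ring.index(host)
    return [ring[(i + j) % len(ring)] for j in range(n)]  # host + next n-1
```

With the two concerns separated, vnode count (partitioning fairness) and N (replication level) can be tuned independently, which is exactly the point being made above.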
|
From: Yonik S. <yo...@ap...> - 2008-02-10 20:14:47
|
On Feb 10, 2008 2:07 PM, Ning Li <nin...@gm...> wrote:
> As in consistent hashing, the entire range of docids (or their hashes)
> is treated as a ring. There is no clear concept of a shard here.

It seems to me like consistent hashing could happily live with the concept of a shard: all docids that belong to a single physical node form a shard, and all replicas of a shard could be the same. I've not thought through all the implications though... I'm still reading the Dynamo paper myself (and after that will be looking into ZooKeeper, I suppose).

-Yonik
|
From: Ning L. <nin...@gm...> - 2008-02-11 02:33:36
|
We can have a concept of a shard. I meant to say it'd be different from the shard in my postings on Lucene general. Here, we can say the distinct starting positions of all the nodes divide the ring into shards. A node serves one or multiple contiguous shards. So a node serves a range of docids.

Regards,
Ning

On Feb 10, 2008 12:14 PM, Yonik Seeley <yo...@ap...> wrote:
> On Feb 10, 2008 2:07 PM, Ning Li <nin...@gm...> wrote:
> > As in consistent hashing, the entire range of docids (or their hashes)
> > is treated as a ring. There is no clear concept of a shard here.
>
> It seems to me like consistent hashing could happily live with the
> concept of a shard.
> All docids that belong to a single physical node form a shard.
> All replicas of a shard could be the same.
> I've not thought through all the implications though... I'm still
> reading the Dynamo paper myself (and after that will be looking into
> zookeeper I suppose).
>
> -Yonik
|
From: Ning L. <nin...@gm...> - 2008-02-15 21:35:21
|
First, let's sync a couple of index-related terms:

1) Shard: The distinct starting positions of all the virtual nodes divide the ring into shards. For example, starting positions (A B C D E) divide the ring into 5 shards: AB, BC, CD, DE, EA.

2) Index on a virtual node (suggest a name?): A virtual node serves a number of contiguous shards. For example, with 3-way replication, the indexes on the virtual nodes are: AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.

Now, should an index on a virtual node be implemented as one Lucene index or N Lucene indexes (one per shard)?

Using N Lucene indexes has its up side. It's less expensive when a virtual node is added or removed - often a shard is shipped or deleted, instead of shipping a whole virtual-node index or splitting and shipping... And often we don't have to filter when querying a sub-range of a virtual-node index. The down side is that a virtual node has to manage multiple Lucene indexes...

What do you think?

Regards,
Ning
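Ning's terms could be sketched directly (illustrative only; positions named A through E as in the example above):

```python
# Sketch of Ning's terminology: starting positions divide the ring into
# shards, and with N-way replication the index on the virtual node at
# position i consists of the N consecutive shards beginning there.
def shards(starts):
    # each shard runs from one starting position to the next, wrapping
    return [starts[i] + starts[(i + 1) % len(starts)]
            for i in range(len(starts))]

def vnode_index(starts, i, n=3):
    s = shards(starts)
    return [s[(i + j) % len(s)] for j in range(n)]
```

With positions (A B C D E) this reproduces the shards AB, BC, CD, DE, EA and, for 3-way replication, the virtual-node indexes AB-BC-CD through EA-AB-BC.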
|
From: Yonik S. <yo...@ap...> - 2008-02-26 18:21:49
|
I think one problem we are running into is due to using the same hash
ring for both partitioning the key space and selecting placement of
those keys. Putting many virtual nodes on the hash ring to make the
hash function fair, in conjunction with using that hash ring to do
replication, causes all the indices to be mixed together.
We may need to separate partitioning of the key space into shards, and
the decision of where to put those shards.
One example using consistent hashing:
When a new system ("newhost") is added to the cluster, one
consistent hash ring could be used to define a new shard (many virtual
nodes to make all shards about the same size... newhost_shard.1
newhost_shard.2 newhost_shard.3...), and a different consistent hash
ring used to select what other shards to replicate in addition to its
own (add "newhost_shard" to the hash ring, and if N=3 then newhost
will also mirror the previous two shards on the ring.)
I'm no longer sure of the benefit of using consistent hashing though
(at least for both parts).
A simpler way to split up the key space would be to just divide it up
into Q equal sized pieces, and just keep Q sufficiently larger than S
(the number of systems). If one didn't want to pick a maximum Q
beforehand, then Q could always be doubled to keep it sufficiently
large.
A shard could always be uniquely identified by a single integer... the
top 5 bits being how many times the key space was divided by 2, and
the bottom 27 bits being the slice number.
So, to calculate what shard a key belongs in:
sliceNumber = key.hashCode() >>> (32 - nSplits)
shardId = (nSplits << 27) | sliceNumber
No global state necessary to define what a shard is, doubling the
number of shards is easy, anyone can easily determine if a key is in a
shard or not, etc.
That still leaves the problem of allocating shards to nodes unsolved
for the time being... but I'm just brainstorming out loud at this
point.
-Yonik
|
|
From: Ning L. <nin...@gm...> - 2008-02-27 20:58:14
|
On Tue, Feb 26, 2008 at 1:21 PM, Yonik Seeley <yo...@ap...> wrote:
> I think one problem we are running into is due to using the same hash
> ring for both partitioning the key space and selecting placement of
> those keys. Putting many virtual nodes on the hash ring to make the
> hash function fair, in conjunction with using that hash ring to do
> replication, causes all the indices to be mixed together.
What we have discussed addressed partitioning and replication:
1 We used consistent hashing to partition keys into ranges.
We may call a range "shard". The important thing is, it is
a one-to-one mapping between ranges and virtual nodes.
2 We described one replication scheme on how a virtual node
replicates its range/shard on N-1 virtual nodes with replication
level set to N.
You described how to partition keys into ranges/shards; how
ranges/shards map to virtual nodes is still to be figured out.
I think what we have discussed is a reasonable scheme. But
I'm always open to new ideas. :)
At the same time, should we also discuss what the update
model should be:
1 One updatable replica vs. all updatable replicas. The former
is simple. The latter is powerful. Is there sufficient need for
the latter to justify its complexity?
2 The atomicity of an insert/delete/update operation. When
an insert/delete/update operation is done, does it mean:
1) the new doc is indexed in the memory of the node
2) the new doc is indexed on the local disk of the node
3) the new doc is logged on the local disk of the node
4) the new doc is logged in some fault-tolerant shared FS
(e.g. HDFS)
5) the new doc is indexed in the memory of at least X
nodes
The probability of the operation getting lost is from high
to low: 1), then 2) and 3), then 4) and 5).
Regards,
Ning
|
|
From: Yonik S. <yo...@ap...> - 2008-02-27 22:03:33
|
On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
> On Tue, Feb 26, 2008 at 1:21 PM, Yonik Seeley <yo...@ap...> wrote:
> > I think one problem we are running into is due to using the same hash
> > ring for both partitioning the key space and selecting placement of
> > those keys. Putting many virtual nodes on the hash ring to make the
> > hash function fair, in conjunction with using that hash ring to do
> > replication, causes all the indices to be mixed together.
>
> What we have discussed addressed partitioning and replication:
> 1 We used consistent hashing to partition keys into ranges.
>   We may call a range "shard". The important thing is, it is
>   a one-to-one mapping between ranges and virtual nodes.
> 2 We described one replication scheme on how a virtual node
>   replicates its range/shard on N-1 virtual nodes with replication
>   level set to N.

As I read the previous messages, it seems like both #1 and #2 are using the same hash circle. It seems like using virtual nodes (as opposed to real nodes) to determine replication means that you can't query every N real nodes and cover the entire index, right? It seems like the replicas for real node A are distributed over all other nodes, regardless of what N is.

-Yonik
|
From: Yonik S. <yo...@ap...> - 2008-02-27 23:04:48
|
On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
> At the same time, should we also discuss what the update
> model should be:
> 1 One updatable replica vs. all updatable replicas. The former
>   is simple. The latter is powerful. Is there sufficient need for
>   the latter to justify its complexity?

We should always be able to update (so if the "updateable" replica is down, we need to be able to update another replica). Given that, what is the extra complexity of those two choices?

> 2 The atomicity of an insert/delete/update operation. When
>   an insert/delete/update operation is done, does it mean:
>   1) the new doc is indexed in the memory of the node
>   2) the new doc is indexed on the local disk of the node
>   3) the new doc is logged on the local disk of the node
>   4) the new doc is logged in some fault-tolerant shared FS
>      (e.g. HDFS)
>   5) the new doc is indexed in the memory of at least X nodes
>   The probability of the operation getting lost is from high
>   to low: 1), then 2) and 3), then 4) and 5).

Very good questions... and the answers are probably tied into how we solve question #1.

#5 would seem to be higher performance than #3, but is it a strong enough guarantee? The power could still go out to a whole rack of systems at once. I think the decision partially depends on how much of a document storage system this is, vs. just an index that can be rebuilt. Of course, the bigger an index gets, the more infeasible it becomes to do a complete rebuild.

I'm tempted to lean toward #3, since logs are needed to sync up nodes (back to question #1).

-Yonik
|
From: Ning L. <nin...@gm...> - 2008-02-28 20:11:45
|
On Wed, Feb 27, 2008 at 6:04 PM, Yonik Seeley <yo...@ap...> wrote:
> On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
> > At the same time, should we also discuss what the update
> > model should be:
> > 1 One updatable replica vs. all updatable replicas. The former
> > is simple. The latter is powerful. Is there sufficient need for
> > the latter to justify its complexity?
>
> We should always be able to update (so if the "updateable" replica is
> down, we need to be able to update another replica). Given that, what
> is the extra complexity of those two choices?

In the case of one updateable replica, if the updateable replica is down,
we need to make another replica updateable. The down side is that
the shard is not updateable during the switch. But it should be short
and thus be fine for some applications. The complexity involved with
all updateable replicas is conflict resolution...

> > 2 The atomicity of an insert/delete/update operation. When
> > an insert/delete/update operation is done, does it mean:
> > 1) the new doc is indexed in the memory of the node
> > 2) the new doc is indexed on the local disk of the node
> > 3) the new doc is logged on the local disk of the node
> > 4) the new doc is logged in some fault-tolerant shared FS
> > (e.g. HDFS)
> > 5) the new doc is indexed in the memory of at least X
> > nodes
> > The probability of the operation getting lost is from high
> > to low: 1), then 2) and 3), then 4) and 5).
>
> Very good questions... and the answers are probably tied into how we
> solve question #1.
> #5 would seem to be higher performance than #3, but is it a strong
> enough guarantee?

I think you implied, and I agree, that #4 is a stronger guarantee than #5.
But do you think #3 is a stronger guarantee than #5? What if one replica
in #5 must be from a different rack? In such a system, if a node is down,
the log on its local disk is gone with it - we don't know when the node
will be back up and the log available again.

If we choose to support all updateable replicas, the problem with simply
indexing and no logging (#1, #2 and #5) is what changes to propagate?
In the case of one updateable replica, we simply propagate delta index files.

> I think the decision partially depends on how much of
> a document storage system this is, vs just an index that can be
> rebuilt.

This is a very good point and I have been thinking about it. If we choose
to support all updateable replicas and provide relatively strong atomicity,
this can easily become a data storage system.

Ning
|
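The five acknowledgment levels Ning enumerates could be made explicit as a policy in the update API. A rough Python sketch follows; the names and the durability predicate are illustrative (not from any proposed API), and the predicate takes Ning's side of the #3-vs-#5 question by assuming replica #5 placement on a different rack:

```python
from enum import Enum

class AckLevel(Enum):
    """When is an insert/delete/update acknowledged to the client?"""
    INDEXED_IN_MEMORY = 1   # 1) indexed in the node's memory
    INDEXED_ON_DISK = 2     # 2) indexed on the node's local disk
    LOGGED_ON_DISK = 3      # 3) logged on the node's local disk
    LOGGED_SHARED_FS = 4    # 4) logged in a fault-tolerant shared FS (e.g. HDFS)
    REPLICATED_MEMORY = 5   # 5) indexed in the memory of at least X nodes

def is_durable_past_node_loss(level: AckLevel) -> bool:
    """Levels 4 and 5 survive the permanent loss of a single node; levels
    1-3 do not, since a local log dies with its node's disk."""
    return level in (AckLevel.LOGGED_SHARED_FS, AckLevel.REPLICATED_MEMORY)
```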
|
From: Doug C. <cu...@ap...> - 2008-02-28 22:37:31
|
Ning Li wrote:
> In the case of one updateable replica, if the updateable replica is down,
> we need to make another replica updateable. The down side is that
> the shard is not updateable during the switch. But it should be short
> and thus be fine for some applications. The complexity involved with
> all updateable replicas is conflict resolution...
A unique updatable index per document would be nice, but I'm not yet
entirely convinced it is practical.
Say your client fetches the ring of nodes from the master, then starts
adding documents. In the meantime, the master updates the list of
nodes, and a second client gets the new, modified ring. In this case,
two clients could update different nodes for the same document.
Another scenario is that a switch is congested, and some clients are
unable to connect to an index to update it, while other clients can
connect to it. The clients that cannot connect update a different node
for the same documents.
Perhaps we could guard against these by having clients "lease" the ring
from the master. Then the master could make sure that it doesn't issue
a new version of the ring until all leases to the old version have
expired. The master could choose to republish the ring every five
minutes. A node that leases the ring four minutes into a lease window
would only get a one minute lease. But all nodes would then hammer the
master for a new version at the same time, every five minutes. Can you
think of a better way?
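The lease scheme above can be sketched with a fixed republication window: all leases expire at the end of the current window, and a queued ring change only takes effect once the window rolls over. Everything below is hypothetical (no such master exists yet), including the injectable clock used to make the sketch testable:

```python
import time

RING_WINDOW = 300.0  # republish the ring at most every five minutes

class RingMaster:
    """Hypothetical master that leases the node ring to clients. All
    leases expire at the end of the current window, so a client asking
    four minutes in only gets a one-minute lease."""

    def __init__(self, ring, now=time.time):
        self.ring = ring
        self.now = now
        self.window_start = now()
        self.pending_ring = None

    def lease(self):
        t = self.now()
        if t - self.window_start >= RING_WINDOW:
            # every old lease has expired; safe to publish a new ring
            if self.pending_ring is not None:
                self.ring, self.pending_ring = self.pending_ring, None
            self.window_start = t
        # lease is only valid until the end of the current window
        expires = self.window_start + RING_WINDOW
        return self.ring, expires

    def update_ring(self, new_ring):
        # queue the change; it takes effect once current leases expire
        self.pending_ring = new_ring
```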
Unless we can come up with a foolproof mechanism or somesuch, I think
we'll have to handle the case where multiple indexes are updated in
their overlap and must reconcile these updates.
I'd imagined each node periodically querying its neighbors for changes
in the range they share. We shouldn't rely on clock synchronization, so
each node would keep the last revision of each neighbor that it has
sync'd with. So, the first time they connect, they pass revision zero
and receive all updates for their overlap. The next time they only need
to retrieve updates since the last.
Documents could also have an application-specified revision. This would
greatly simplify reconciliation, since we could use these to resolve all
disputes in a predictable way: higher revision wins.
Perhaps we'd want two formats for updates sent between nodes: outline
and full, where outline just contains a sequence of <{ADD|DEL}, id,
revision>. Then the retrieving node can process these and determine
which revisions of which ids it needs, then retrieve those as a second step.
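The outline/full split can be sketched as a two-step pull: process the neighbor's outline, resolve disputes by revision, and collect only the ids whose full content is still needed. A minimal Python sketch, under the assumption that revisions are totally ordered per document (all names illustrative):

```python
# Outline entries have the form (op, doc_id, revision).
ADD, DEL = "ADD", "DEL"

def entries_to_fetch(outline, local_revisions):
    """Process a neighbor's outline for the shared range against our
    doc_id -> revision map and return the (doc_id, revision) pairs whose
    full content we still need. Disputes resolve predictably: higher
    revision wins."""
    needed = []
    for op, doc_id, revision in outline:
        if revision <= local_revisions.get(doc_id, -1):
            continue  # we already hold this revision or a newer one
        local_revisions[doc_id] = revision  # record the winning revision
        if op == ADD:
            needed.append((doc_id, revision))  # fetch full doc in step two
        # a DEL needs no second fetch; its revision acts as a tombstone
    return needed
```

Keeping the revision of a delete around means a stale ADD that arrives later loses the dispute, as the higher-revision-wins rule requires.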
This approach is tolerant of network partitioning, and not too
complicated. What do you think?
Doug
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 00:55:58
|
On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> A unique updatable index per document would be nice, but I'm not yet
> entirely convinced it is practical.
It is not practical if even short glitches are unacceptable. In BigTable,
a tablet is served by a single tablet server. I wonder if they find that
to be a problem.
> I'd imagined each node periodically querying its neighbors for changes
> in the range they share. We shouldn't rely on clock synchronization, so
> each node would keep the last revision of each neighbor that it has
> sync'd with. So, the first time they connect, they pass revision zero
> and receive all updates for their overlap. The next time they only need
> to retrieve updates since the last.
This works...
> Documents could also have an application-specified revision. This would
> greatly simplify reconciliation, since we could use these to resolve all
> disputes in a predictable way: higher revision wins.
I like this.
> Perhaps we'd want two formats for updates sent between nodes: outline
> and full, where outline just contains a sequence of <{ADD|DEL}, id,
> revision>. Then the retrieving node can process these and determine
> which revisions of which ids it needs, then retrieve those as a second step.
Perhaps just the full mode until we see a demand for the outline mode.
> This approach is tolerant of network partitioning, and not too
> complicated. What do you think?
It seems good. I need to figure out completely the protocol when nodes
join and leave the system.
Ning
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 00:59:14
|
On Thu, Feb 28, 2008 at 7:56 PM, Ning Li <nin...@gm...> wrote:
> Perhaps just the full mode until we see a demand for the outline mode.

I see the need for the outline mode!

Ning
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 01:41:53
|
On Thu, Feb 28, 2008 at 7:56 PM, Ning Li <nin...@gm...> wrote:
> It seems good. I need to figure out completely the protocol when nodes
> join and leave the system.

Doug, the approach works, and is not too complicated. I like it. :)

Ning
|
|
From: Yonik S. <yo...@ap...> - 2008-02-29 11:24:22
|
On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> Perhaps we'd want two formats for updates sent between nodes: outline
> and full, where outline just contains a sequence of <{ADD|DEL}, id,
> revision>. Then the retrieving node can process these and determine
> which revisions of which ids it needs, then retrieve those as a second step.
Yes, I think outline mode is definitely needed.
One way to do full mode would be to perhaps reuse the normal update
interface (rather than request certain docs, request that update be
invoked with certain docs... triggering logging, but not replication).
More of an implementation detail I guess.
-Yonik
|
|
From: Yonik S. <yo...@ap...> - 2008-02-29 14:08:38
|
On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> I'd imagined each node periodically querying its neighbors for changes
> in the range they share. We shouldn't rely on clock synchronization, so
> each node would keep the last revision of each neighbor that it has
> sync'd with. So, the first time they connect, they pass revision zero
> and receive all updates for their overlap. The next time they only need
> to retrieve updates since the last.

Sounds good!

We'll need to make sure that it handles this case:
Say C syncs from B, then B syncs from A and gets some new docs. Now A
goes away for a while... it seems like C will need to get the new docs
from B next time it syncs. If the docs that B got from A are treated
like any other new additions, this should work, right?

-Yonik
|
|
From: Ning L. <nin...@gm...> - 2008-02-29 16:23:42
|
On Fri, Feb 29, 2008 at 9:08 AM, Yonik Seeley <yo...@ap...> wrote:
> On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> > I'd imagined each node periodically querying its neighbors for changes
> > in the range they share. We shouldn't rely on clock synchronization, so
> > each node would keep the last revision of each neighbor that it has
> > sync'd with. So, the first time they connect, they pass revision zero
> > and receive all updates for their overlap. The next time they only need
> > to retrieve updates since the last.
>
> Sounds good!
>
> We'll need to make sure that it handles this case:
> Say C syncs from B, then B syncs from A and get some new docs. Now A
> goes away for a while... it seems like C will need to get the new docs
> from B next time it syncs. If the docs that B got from A are treated
> like any other new additions, this should work right?

I think this is how it works with replication=3:
Node A serves range A-D, node B serves range B-E, node C serves range C-F...
Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
on range D-E, and syncs with nodes D & E on range E-F.

Now node D goes away. Node C continues to serve range C-F until it gets
range F-G from node E or node F and starts to serve range C-G. Then,
node C syncs with nodes A & B on range C-D (same as before), syncs with
nodes B and E on range D-F, and syncs with nodes E & F on range F-G.

Right?

Ning
|
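Ning's replication=3 example generalizes: on the docid circle, the arc between adjacent points i and i+1 is served by the nodes at points i, i-1 and i-2, and every server of an arc must sync that arc with the other replication-1 servers. A small sketch of that rule (illustrative names, single-letter points standing in for docid boundaries):

```python
def sync_partners(points, replication=3):
    """Nodes sit at points on the docid circle; the node at point i
    serves the next `replication` arcs. Return {arc: [serving nodes]},
    where each node must sync the arc with the others in its list."""
    n = len(points)
    arcs = {}
    for i in range(n):
        arc = (points[i], points[(i + 1) % n])
        # arc [p_i, p_i+1) is served by the nodes at points i, i-1, i-2
        arcs[arc] = [points[(i - k) % n] for k in range(replication)]
    return arcs
```

Running it on points A through G reproduces the example: range C-D is served by C, B and A; range D-E by D, C and B; range E-F by E, D and C.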
|
From: Ning L. <nin...@gm...> - 2008-02-29 19:17:15
|
On Fri, Feb 29, 2008 at 11:23 AM, Ning Li <nin...@gm...> wrote:
> On Fri, Feb 29, 2008 at 9:08 AM, Yonik Seeley <yo...@ap...> wrote:
> > On Thu, Feb 28, 2008 at 5:37 PM, Doug Cutting <cu...@ap...> wrote:
> > > I'd imagined each node periodically querying its neighbors for changes
> > > in the range they share. We shouldn't rely on clock synchronization, so
> > > each node would keep the last revision of each neighbor that it has
> > > sync'd with. So, the first time they connect, they pass revision zero
> > > and receive all updates for their overlap. The next time they only need
> > > to retrieve updates since the last.
> >
> > Sounds good!
> >
> > We'll need to make sure that it handles this case:
> > Say C syncs from B, then B syncs from A and get some new docs. Now A
> > goes away for a while... it seems like C will need to get the new docs
> > from B next time it syncs. If the docs that B got from A are treated
> > like any other new additions, this should work right?
>
> I think this is how it works with replication=3:
> Node A serves range A-D, node B serves range B-E, node C serves range C-F...
> Node C syncs with nodes A & B on range C-D, syncs with nodes B and D
> on range D-E, and syncs with nodes D & E on range E-F.
>
> Now node D goes away. Node C continues to serve range C-F until it gets
> range F-G from node E or node F and starts to serve range C-G. Then,
> node C syncs with nodes A & B on range C-D (same as before), syncs with
> nodes B and E on range D-F, and syncs with nodes E & F on range F-G.

I see the following as a possible implementation:

Each node logs its changes in a light-weight log. A log is a sequence of
log entries, with the entry number monotonically increasing.

Let's say node X periodically syncs with node Y. A sync works as follows:
node X starts with entry p in node Y's log, where node X synced with node Y
last time, and reads forward to node Y's current last entry, entry q. After
node X finishes syncing (node X puts the actual changes into its own log),
it sets its sync entry with node Y to entry q.

When node D goes away, node C copies range F-G from node E or node F,
let's say node E. At the time of the copy, node E's own log is at entry s
and node E is synced with node F at node F's entry t. Node C does the copy
but does not log it. After node C finishes the copy, and before it switches
from serving range C-F to serving range C-G, it sets its sync entry with
node E to min(its current sync entry with node E, entry s) and sets its
sync entry with node F to entry t.

Is this efficient enough?

Ning
|
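The sync-entry bookkeeping described in this message can be sketched directly. `Node`, `pull_from` and `copy_range_from` are hypothetical names, and entry numbers are simply log indices here:

```python
class Node:
    """Per-neighbor sync-entry bookkeeping: log is a list of change
    entries; sync_entry[peer] is the position in peer's log up to which
    we have already synced."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.sync_entry = {}

    def pull_from(self, peer):
        """Regular sync: take peer's entries since our last sync point
        (entry p), log them as our own changes, then advance the sync
        entry to q, peer's current last entry."""
        p = self.sync_entry.get(peer.name, 0)
        q = len(peer.log)
        self.log.extend(peer.log[p:q])  # actual changes go into our log
        self.sync_entry[peer.name] = q

    def copy_range_from(self, peer, other):
        """Takeover copy (node D went away): copy peer's data for the new
        range WITHOUT logging it, then set sync entries so future regular
        syncs re-cover anything concurrent with the copy."""
        s = len(peer.log)                       # peer's own log position
        t = peer.sync_entry.get(other.name, 0)  # peer's sync entry w/ other
        cur = self.sync_entry.get(peer.name, s)
        self.sync_entry[peer.name] = min(cur, s)
        self.sync_entry[other.name] = t
```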