From: Doug C. <cu...@ap...> - 2008-02-28 21:31:18
Getting back to this at long last! Sorry for the delays...

Ning Li wrote:
> First, let's sync a couple of index-related terms:

Good idea. I think "node" and "shard" are confusing, and perhaps can be avoided in this context.

> 1) Shard: The distinct starting positions of all the virtual nodes
> divide the ring into shards. For example, starting positions
> (A B C D E) divide the ring into 5 shards: AB, BC, CD, DE, DA.

I prefer to call these "ranges". "Shard" to me sounds like something physical, and ranges are not physical.

> 2) Index on a virtual node (suggest a name?): A virtual node
> serves a number of continuous shards. For example, with
> 3-way replication, the indexes on the virtual nodes are:
> AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.

I was just using "index" for this. An index is a set of files on a host that corresponds to a range of ids. Let's also use "node" for "virtual node" and "host" for a set of virtual nodes running on the same host. A node corresponds to a point on the ring and may be assigned a range and maintain an index for that range. The range assigned to a node may change over time and it will have to adjust its index accordingly.

> Now, should an index on a virtual node be implemented as
> one Lucene index or N Lucene indexes (one per shard)?

My hunch is one index per range. That way we can search a set of indexes that completes the ring, and search maximally large segments.

Doug
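The ring geometry Doug and Ning settle on above can be sketched as follows. This is an editor's illustration, not code from the thread; the sorted `positions` array and the method names are invented. With N-way replication, each node's single index covers the contiguous arc from its own starting position to the position N steps clockwise, matching Ning's AB-BC-CD example for positions (A B C D E) with 3-way replication.

```java
import java.util.Arrays;

// Editor's sketch (assumptions noted above): nodes sit at sorted positions
// on a hash ring; with replication factor n, the node at index i keeps one
// index covering the half-open arc [positions[i], positions[(i+n) % count]),
// i.e. the n consecutive ranges starting at its own point on the ring.
public class RingIndex {
    public static int[] indexArc(int[] positions, int nodeIdx, int n) {
        int count = positions.length;
        int start = positions[nodeIdx];
        int end = positions[(nodeIdx + n) % count];   // may wrap past zero
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] ring = { 0, 20, 40, 60, 80 };           // five nodes, A..E
        System.out.println(Arrays.toString(indexArc(ring, 0, 3))); // [0, 60]
        System.out.println(Arrays.toString(indexArc(ring, 3, 3))); // [60, 20] (wraps)
    }
}
```

One contiguous arc per node is what makes "one index per range" workable: a covering set of such arcs tiles the whole ring.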
From: Ning L. <nin...@gm...> - 2008-02-28 20:11:45
On Wed, Feb 27, 2008 at 6:04 PM, Yonik Seeley <yo...@ap...> wrote:
> On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
> > At the same time, should we also discuss what the update
> > model should be:
> > 1 One updatable replica vs. all updatable replicas. The former
> > is simple. The latter is powerful. Is there sufficient need for
> > the latter to justify its complexity?
>
> We should always be able to update (so if the "updateable" replica is
> down, we need to be able to update another replica). Given that, what
> is the extra complexity of those two choices?

In the case of one updateable replica, if the updateable replica is down, we need to make another replica updateable. The down side is that the shard is not updateable during the switch. But it should be short and thus be fine for some applications. The complexity involved with all updateable replicas is conflict resolution...

> > 2 The atomicity of an insert/delete/update operation. When
> > an insert/delete/update operation is done, does it mean:
> > 1) the new doc is indexed in the memory of the node
> > 2) the new doc is indexed on the local disk of the node
> > 3) the new doc is logged on the local disk of the node
> > 4) the new doc is logged in some fault-tolerant shared FS
> > (e.g. HDFS)
> > 5) the new doc is indexed in the memory of at least X
> > nodes
> > The probability of the operation getting lost is from high
> > to low: 1), then 2) and 3), then 4) and 5).
>
> Very good questions...and the answers are probably tied into how we
> solve question #1.
> #5 would seem to be higher performance than #3, but is it a strong
> enough guarantee?

I think you implied and I agree, that #4 is a stronger guarantee than #5. But you think #3 is a stronger guarantee than #5? What if one replica in #5 must be from a different rack? In such a system, if a node is down, the log on the local disk is gone with it - we don't know when the node will be up and the log be available again.

If we choose to support all updateable replicas, the problem with simply indexing and no logging (#1, #2 and #5) is what changes to propagate? In case of one updateable replica, we simply propagate delta index files.

> I think the decision partially depends on how much of
> a document storage system this is, vs just an index that can be
> rebuilt.

This is a very good point and I have been thinking about this. If we choose to support all updateable replicas and provide a relatively strong atomicity support, this can easily be a data storage system.

Ning
From: Mike K. <mik...@gm...> - 2008-02-28 00:24:05
Update: Yonik answered my questions via private email, just in case anyone else was dying to jump in there. Sounds exciting,

-Mike

On 27-Feb-08, at 3:01 PM, Mike Klaas wrote:
> This list is still small enough that I feel that I should
> introduce myself. I'm a Solr committer and CTO of a small internet
> search startup that is still in stealth mode. Yonik pointed out the
> project to me. It sounds quite interesting, not least of which
> because our company has built a large Solr-based search cluster
> running on EC2 with at least some of the properties you are aiming
> for.
>
> The one thing we don't have is dynamic automatic failover and
> replication. One reason is that it's hard, and the second is that
> for us, it has always been better to use more boxes for a larger
> corpus rather than replicating a smaller one. Instead, we store the
> indices in S3 and restore from backup when a machine fails (which is
> not rarely).
>
> I do have a couple questions about the project:
> - it isn't clear to me how the goals are substantially different
> from those of nutch. Is it mostly in the relaxation of its
> application to web search?
> - why sourceforge rather than an apache project?
> - it sounds like the intent is to build upon Solr, which I think is
> a great idea, but isn't mentioned in the top-level goals section.
> Is Solr an optional component?
>
> cheers,
> -Mike
From: Yonik S. <yo...@ap...> - 2008-02-27 23:04:48
On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
> At the same time, should we also discuss what the update
> model should be:
> 1 One updatable replica vs. all updatable replicas. The former
> is simple. The latter is powerful. Is there sufficient need for
> the latter to justify its complexity?

We should always be able to update (so if the "updateable" replica is down, we need to be able to update another replica). Given that, what is the extra complexity of those two choices?

> 2 The atomicity of an insert/delete/update operation. When
> an insert/delete/update operation is done, does it mean:
> 1) the new doc is indexed in the memory of the node
> 2) the new doc is indexed on the local disk of the node
> 3) the new doc is logged on the local disk of the node
> 4) the new doc is logged in some fault-tolerant shared FS
> (e.g. HDFS)
> 5) the new doc is indexed in the memory of at least X
> nodes
> The probability of the operation getting lost is from high
> to low: 1), then 2) and 3), then 4) and 5).

Very good questions...and the answers are probably tied into how we solve question #1. #5 would seem to be higher performance than #3, but is it a strong enough guarantee? The power could still go out to a whole rack of systems at once. I think the decision partially depends on how much of a document storage system this is, vs just an index that can be rebuilt. Of course, the bigger an index gets, the more infeasible it becomes to do a complete rebuild. I'm tempted to lean toward #3 since logs are needed to sync up nodes (back to question #1).

-Yonik
From: Mike K. <mik...@gm...> - 2008-02-27 23:01:48
This list is still small enough that I feel that I should introduce myself. I'm a Solr committer and CTO of a small internet search startup that is still in stealth mode. Yonik pointed out the project to me. It sounds quite interesting, not least of which because our company has built a large Solr-based search cluster running on EC2 with at least some of the properties you are aiming for.

The one thing we don't have is dynamic automatic failover and replication. One reason is that it's hard, and the second is that for us, it has always been better to use more boxes for a larger corpus rather than replicating a smaller one. Instead, we store the indices in S3 and restore from backup when a machine fails (which is not rarely).

I do have a couple questions about the project:
- it isn't clear to me how the goals are substantially different from those of nutch. Is it mostly in the relaxation of its application to web search?
- why sourceforge rather than an apache project?
- it sounds like the intent is to build upon Solr, which I think is a great idea, but isn't mentioned in the top-level goals section. Is Solr an optional component?

cheers,
-Mike
From: Yonik S. <yo...@ap...> - 2008-02-27 22:03:33
On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
> On Tue, Feb 26, 2008 at 1:21 PM, Yonik Seeley <yo...@ap...> wrote:
> > I think one problem we are running into is due to using the same hash
> > ring for both partitioning the key space and selecting placement of
> > those keys. Putting many virtual nodes on the hash ring to make the
> > hash function fair, in conjunction with using that hash ring to do
> > replication, causes all the indices to be mixed together.
>
> What we have discussed addressed partitioning and replication:
> 1 We used consistent hashing to partition keys into ranges.
> We may call a range "shard". The important thing is, it is
> a one-to-one mapping between ranges and virtual nodes.
> 2 We described one replication scheme on how a virtual node
> replicates its range/shard on N-1 virtual nodes with replication
> level set to N.

As I read the previous messages, it seems like both #1 and #2 are using the same hash circle. It seems like using virtual nodes (as opposed to real nodes) to determine replication means that you can't query every N real nodes and cover the entire index, right? It seems like the replica for real node A is distributed over all other nodes, regardless of what N is.

-Yonik
From: Ning L. <nin...@gm...> - 2008-02-27 20:58:14
On Tue, Feb 26, 2008 at 1:21 PM, Yonik Seeley <yo...@ap...> wrote:
> I think one problem we are running into is due to using the same hash
> ring for both partitioning the key space and selecting placement of
> those keys. Putting many virtual nodes on the hash ring to make the
> hash function fair, in conjunction with using that hash ring to do
> replication, causes all the indices to be mixed together.
What we have discussed addressed partitioning and replication:
1 We used consistent hashing to partition keys into ranges.
We may call a range "shard". The important thing is, it is
a one-to-one mapping between ranges and virtual nodes.
2 We described one replication scheme on how a virtual node
replicates its range/shard on N-1 virtual nodes with replication
level set to N.
You described how to partition keys into ranges/shards. And
how ranges/shards map to virtual nodes is to be figured out.
I think what we have discussed is a reasonable scheme. But
I'm always open to new ideas. :)
At the same time, should we also discuss what the update
model should be:
1 One updatable replica vs. all updatable replicas. The former
is simple. The latter is powerful. Is there sufficient need for
the latter to justify its complexity?
2 The atomicity of an insert/delete/update operation. When
an insert/delete/update operation is done, does it mean:
1) the new doc is indexed in the memory of the node
2) the new doc is indexed on the local disk of the node
3) the new doc is logged on the local disk of the node
4) the new doc is logged in some fault-tolerant shared FS
(e.g. HDFS)
5) the new doc is indexed in the memory of at least X
nodes
The probability of the operation getting lost is from high
to low: 1), then 2) and 3), then 4) and 5).
Regards,
Ning
From: Yonik S. <yo...@ap...> - 2008-02-26 18:21:49
I think one problem we are running into is due to using the same hash
ring for both partitioning the key space and selecting placement of
those keys. Putting many virtual nodes on the hash ring to make the
hash function fair, in conjunction with using that hash ring to do
replication, causes all the indices to be mixed together.
We may need to separate partitioning of the key space into shards, and
the decision of where to put those shards.
One example using consistent hashing:
When a new system ("newhost") is added to the cluster, one
consistent hash ring could be used to define a new shard (many virtual
nodes to make all shards about the same size... newhost_shard.1
newhost_shard.2 newhost_shard.3...), and a different consistent hash
ring used to select what other shards to replicate in addition to its
own (add "newhost_shard" to the hash ring, and if N=3 then newhost
will also mirror the previous two shards on the ring.)
I'm no longer sure of the benefit of using consistent hashing though
(at least for both parts).
A simpler way to split up the key space would be to just divide it up
into Q equal sized pieces, and just keep Q sufficiently larger than S
(the number of systems). If one didn't want to pick a maximum Q
beforehand, then Q could always be doubled to keep it sufficiently
large.
A shard could always be uniquely identified by a single integer... the
top 5 bits being how many times the key space was divided by 2, and
the bottom 27 bits being the slice number.
So, to calculate what shard a key belongs in:
sliceNumber = key.hashCode >>> (32 - nSplits)
shardId = (nSplits << 27) | sliceNumber
No global state necessary to define what a shard is, doubling the
number of shards is easy, anyone can easily determine if a key is in a
shard or not, etc.
That still leaves the problem of allocating shards to nodes unsolved
for the time being... but I'm just brainstorming out loud at this
point.
-Yonik
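Yonik's shard-numbering idea above can be sketched as follows. This is the editor's reading of the scheme, not code from the thread: the shift is written as `32 - nSplits` so the top `nSplits` bits of the hash pick the slice, which is exactly what lets the number of shards be doubled without any global coordination.

```java
// Editor's sketch of the shard-id encoding described above. Assumptions:
// nSplits (at most 27, stored in 5 bits) is how many times the key space
// was halved, and the slice number comes from the top bits of the hash so
// that going from nSplits to nSplits+1 splits each slice exactly in two.
public class ShardId {
    public static int encode(int nSplits, int slice) {
        return (nSplits << 27) | slice;
    }

    // which slice a key's hash falls in, given the current number of splits
    public static int sliceOf(int keyHash, int nSplits) {
        return nSplits == 0 ? 0 : keyHash >>> (32 - nSplits);
    }

    // anyone can test shard membership locally, with no global state
    public static boolean contains(int shardId, int keyHash) {
        int nSplits = shardId >>> 27;
        int slice = shardId & ((1 << 27) - 1);
        return sliceOf(keyHash, nSplits) == slice;
    }
}
```

Doubling the shard count is then just re-encoding: a key in slice s at nSplits lands in slice 2s or 2s+1 at nSplits+1.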
From: Ning L. <nin...@gm...> - 2008-02-15 23:15:33
On Thu, Feb 14, 2008 at 7:00 PM, Doug Cutting <cu...@ap...> wrote:
> I've attached up what I hope is a minimally sufficient user-level API to
> start implementing. The idea is to include only what's required, like
> facet counts, but otherwise keep it simple.

Client implements Indexer and Searcher interfaces, right? But who calls those methods of Master?

Regards,
Ning
From: Ning L. <nin...@gm...> - 2008-02-15 21:35:21
First, let's sync a couple of index-related terms:

1) Shard: The distinct starting positions of all the virtual nodes divide the ring into shards. For example, starting positions (A B C D E) divide the ring into 5 shards: AB, BC, CD, DE, DA.

2) Index on a virtual node (suggest a name?): A virtual node serves a number of continuous shards. For example, with 3-way replication, the indexes on the virtual nodes are: AB-BC-CD, BC-CD-DE, CD-DE-EA, DE-EA-AB, EA-AB-BC.

Now, should an index on a virtual node be implemented as one Lucene index or N Lucene indexes (one per shard)? Using N Lucene indexes has its up side. It's less expensive when a virtual node is added or removed - often a shard is shipped or deleted instead of shipping a virtual node index or splitting and shipping... And often we don't have to filter when querying a sub-range of a virtual node index. The down side is that a virtual node has to manage multiple Lucene indexes... What do you think?

Regards,
Ning
From: Ning L. <nin...@gm...> - 2008-02-15 04:51:30
On Thu, Feb 14, 2008 at 5:23 PM, Doug Cutting <cu...@ap...> wrote:
> If each document is replicated in, say, the two clockwise indexes from
> the index serving its range, then one need only query every third index
> to achieve complete coverage, right? Things get tricky when indexes are
> added or removed from the ring, when the number of nodes in the ring
> isn't divisible by three, etc. Some range filtering will be required in
> these cases, but not in most. Hopefully we could find a way so that any
> range-filtering that's required is spread around the ring, to avoid
> hot-spots. Having nodes serve multiple indexes, at different points on
> the ring will help some with hot-spots too.

Agree.

> If N=4, neighbors would get
> 33% more queries, etc. If each node serves M indexes, then this impact
> would be diminished. So if N=3 and M=4 (each node serves four indexes)
> then a neighbor node's load would increase by just 12.5%, which is
> pretty manageable.

I think Yonik's interpretation is correct, right?

> Another approach might be to query overlapping ranges and filter in the
> client. With N-way replication you'd query every N/2th index. Search
> results would include facet counts for sub-ranges, so they could be
> correctly merged. If N=4, querying every other index, then, when a node
> fails, no other indexes need be re-queried. Similarly for N=6 querying
> every third index. Here you take a big hit up front, always searching
> twice as big of an index as you need to, but avoid the latencies of
> re-querying.

Whether this is worthwhile depends on the expected percentage of queries that need to be re-queried on some indexes...

> > The above requires that a Lucene index can efficiently support
> > query on a sub-range of docids - application/system docids,
> > not Lucene docids.
>
> If simply implemented with a filter based on a FieldCache, this is fast,
> but the expense is still that of searching the entire index.

Yes, this should be good enough.

> > ... so that within a segment, Lucene docids are assigned
> > in the same order as their corresponding application/system
> > docids during build/merge...
>
> I don't see how that's easy. Lucene assumes that newly indexed ids are
> always greater than previously indexed ids, and that assumption is
> fairly deep. Segments could be re-sorted I guess, and postings merged
> rather than appended. But that'd be a substantial change to Lucene. Is
> that what you had in mind?

I was thinking of keeping Lucene docids in the same order as their application/system docids within a segment, not across segments. The merge algorithm will be different but segments don't have to be re-sorted during merge. To take advantage of this, however, we'll have to query on each segment individually and then merge results. (Query on SegmentReader and merge results instead of querying on MultiSegmentReader.) We'll need the same result merge algorithm on clients. But until we implement that algorithm, let's use filtering.

Regards,
Ning
From: Yonik S. <yo...@ap...> - 2008-02-15 02:38:58
On Thu, Feb 14, 2008 at 5:23 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > So the master becomes the manager of a ring in this case. Given
> > the ring, a search client sends a query to a (hopefully
> > minimal) set of nodes whose ranges cover the ring. When sending
> > the query to a node in the set, the search client also
> > specifies the sub-range of docids on which the query should be
> > executed. This is to make sure that any range of docids is
> > queried and only queried once.
>
> I'd been thinking of something that was like consistent hashing, but
> that tried to keep indexes mostly disjoint. But perhaps consistent
> hashing itself will work well here.
>
> If each document is replicated in, say, the two clockwise indexes from
> the index serving its range, then one need only query every third index
> to achieve complete coverage, right? Things get tricky when indexes are
> added or removed from the ring, when the number of nodes in the ring
> isn't divisible by three, etc. Some range filtering will be required in
> these cases, but not in most. Hopefully we could find a way so that any
> range-filtering that's required is spread around the ring, to avoid
> hot-spots.

Starting at a random node and querying every Nth node, and then filtering the last overlapping parts seems like it should spread that evenly.

> Having nodes serve multiple indexes, at different points on
> the ring will help some with hot-spots too.
>
> With N-way replication, and only querying every Nth node, when a node
> fails, 2 other indexes, one on each side, must be queried to cover the
> missing range, since no other single index covers that range. But,
> there are N-1 different pairs of indexes that do cover the range. So
> the load on neighbors will increase by 1/(N-1) on average. If N=3,
> neighbor nodes would get 50% more queries. If N=4, neighbors would get
> 33% more queries, etc.

OK this makes sense so far... for a 9 node system with N=3, you query 3 nodes with indices each 1/3 of the size of the complete index.

> If each node serves M indexes, then this impact
> would be diminished. So if N=3 and M=4 (each node serves four indexes)
> then a neighbor node's load would increase by just 12.5%, which is
> pretty manageable.

I got lost at "M" a little... What's an index in this context? Does it mean additional virtual nodes on the hash circle that map to the same physical node (an index being the arc between two adjacent points)? That's normally needed to make consistent hashing fair anyway, right?

> > The above requires that a Lucene index can efficiently support
> > query on a sub-range of docids - application/system docids,
> > not Lucene docids.
>
> If simply implemented with a filter based on a FieldCache, this is fast,
> but the expense is still that of searching the entire index.

Just thinking about the static case, one could keep separate docs in separate indices. A multireader over 3 indices wouldn't be too bad. But then if you start adding and removing nodes, it gets messy. If the load increase is low enough (12.5% isn't too bad) straightforward filtering is probably the best. It seems like we need filtering capabilities anyway in the cases when nodes are being rebalanced due to the number of nodes changing.

-Yonik
From: Doug C. <cu...@ap...> - 2008-02-15 00:12:47
Doug Cutting wrote:
> First I'll implement a completely synchronous version, then write the
> simulator/tester. Once that's working, we can start designing a
> threaded, pseudo-distributed version [ ... ]

The synchronous implementation will also be useful as the "model" when testing: each change and query can be made to both the static model and to the "distributed" implementation, comparing the results. The model may sometimes return slightly different results, e.g., when updates are not yet fully synchronized, so testing will log differences from the model rather than fail entirely.

Doug
From: Doug C. <cu...@ap...> - 2008-02-15 00:06:37
Ning Li wrote:
> Doug Cutting wrote:
> > Does this sound like a good strategy?
>
> Sounds good!

I've attached up what I hope is a minimally sufficient user-level API to start implementing. The idea is to include only what's required, like facet counts, but otherwise keep it simple.

First I'll implement a completely synchronous version, then write the simulator/tester. Once that's working, we can start designing a threaded, pseudo-distributed version, that implements the interfaces in two layers: a "client-side" layer that transparently retries, etc., and a "server-side" layer that might be flaky, throwing exceptions, hanging, etc. and where servers chat amongst themselves.

Doug
From: Ning L. <nin...@gm...> - 2008-02-14 23:31:14
Doug Cutting wrote:
> Does this sound like a good strategy?

Sounds good!

> I should add: sorry I've been AWOL most of this week. I've been down
> with a bad cold. I really am eager to get started on this project!

Hope you'll recover soon!
From: Doug C. <cu...@ap...> - 2008-02-14 22:56:41
Doug Cutting wrote:
> I'm on the road tomorrow, but I can start proposing APIs early next
> week [ ...]

I should add: sorry I've been AWOL most of this week. I've been down with a bad cold. I really am eager to get started on this project!

Doug
From: Doug C. <cu...@ap...> - 2008-02-14 22:49:18
We might get started by using threads and method calls instead of RPC. This should help us get our design straight before we invest in a "real" implementation. So we might:

1. Write a simple client API: addDoc(), removeDoc(), updateDoc(), query().
2. Write a multi-threaded test program that uses this, simulating a large, active index. It randomly adds, removes and updates randomly created documents, and periodically queries and checks that results are correct.
3. Implement the "distributed" system using threads per node.
4. Randomly kill nodes during the simulation.

We might even avoid using Lucene at this point, but simply use Java Collections for indexes. Documents would have a few fields with atomic values (no full text). Does this sound like a good strategy?

I'm on the road tomorrow, but I can start proposing APIs early next week, and then maybe get some help coding the simulation. Once we have things working, then we can start replacing method calls with RPCs and putting Lucene or Solr underneath and see if it still holds up, incrementally moving towards a complete system, always with a test suite.

Doug
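The client API in step 1 above could start as small as this. An editor's sketch: only the four method names come from the message; the id parameter, the field maps, and the in-memory backing are assumptions, using plain Java Collections instead of Lucene as suggested.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Editor's sketch of step 1: the four client calls backed by Java
// Collections, no Lucene. Documents are id -> field map; query() does a
// linear scan, which is fine for a correctness-only simulation.
public class InMemoryIndex {
    private final Map<String, Map<String, String>> docs = new HashMap<>();

    public synchronized void addDoc(String id, Map<String, String> fields) {
        docs.put(id, new HashMap<>(fields));
    }

    public synchronized void removeDoc(String id) {
        docs.remove(id);
    }

    public synchronized void updateDoc(String id, Map<String, String> fields) {
        docs.put(id, new HashMap<>(fields));   // last-writer-wins
    }

    public synchronized List<String> query(String field, String value) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : docs.entrySet()) {
            if (value.equals(e.getValue().get(field))) {
                hits.add(e.getKey());
            }
        }
        Collections.sort(hits);   // deterministic order for result checking
        return hits;
    }
}
```

A multi-threaded tester can then hammer one instance per simulated node and compare query results against a single "model" instance.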
From: Doug C. <cu...@ap...> - 2008-02-14 22:23:42
Ning Li wrote:
> So the master becomes the manager of a ring in this case. Given
> the ring, a search client sends a query to a (hopefully
> minimal) set of nodes whose ranges cover the ring. When sending
> the query to a node in the set, the search client also
> specifies the sub-range of docids on which the query should be
> executed. This is to make sure that any range of docids is
> queried and only queried once.

I'd been thinking of something that was like consistent hashing, but that tried to keep indexes mostly disjoint. But perhaps consistent hashing itself will work well here.

If each document is replicated in, say, the two clockwise indexes from the index serving its range, then one need only query every third index to achieve complete coverage, right? Things get tricky when indexes are added or removed from the ring, when the number of nodes in the ring isn't divisible by three, etc. Some range filtering will be required in these cases, but not in most. Hopefully we could find a way so that any range-filtering that's required is spread around the ring, to avoid hot-spots. Having nodes serve multiple indexes, at different points on the ring will help some with hot-spots too.

With N-way replication, and only querying every Nth node, when a node fails, 2 other indexes, one on each side, must be queried to cover the missing range, since no other single index covers that range. But, there are N-1 different pairs of indexes that do cover the range. So the load on neighbors will increase by 1/(N-1) on average. If N=3, neighbor nodes would get 50% more queries. If N=4, neighbors would get 33% more queries, etc. If each node serves M indexes, then this impact would be diminished. So if N=3 and M=4 (each node serves four indexes) then a neighbor node's load would increase by just 12.5%, which is pretty manageable.

Another approach might be to query overlapping ranges and filter in the client. With N-way replication you'd query every N/2th index. Search results would include facet counts for sub-ranges, so they could be correctly merged. If N=4, querying every other index, then, when a node fails, no other indexes need be re-queried. Similarly for N=6 querying every third index. Here you take a big hit up front, always searching twice as big of an index as you need to, but avoid the latencies of re-querying. Does any of this make sense?

> The above requires that a Lucene index can efficiently support
> query on a sub-range of docids - application/system docids,
> not Lucene docids.

If simply implemented with a filter based on a FieldCache, this is fast, but the expense is still that of searching the entire index.

> That can be achieved by a bit modification
> to Lucene so that within a segment, Lucene docids are assigned
> in the same order as their corresponding application/system
> docids during build/merge. A nice side-effect of this is that
> it becomes efficient to delete a sub-range of docids from an
> index as well.

I don't see how that's easy. Lucene assumes that newly indexed ids are always greater than previously indexed ids, and that assumption is fairly deep. Segments could be re-sorted I guess, and postings merged rather than appended. But that'd be a substantial change to Lucene. Is that what you had in mind?

Doug
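The failover arithmetic above reduces to a one-liner. Editor's sketch; the formula is inferred from the worked examples in the message (50% for N=3, 33% for N=4, 12.5% for N=3 with M=4) rather than stated there in closed form.

```java
// Editor's sketch of the load arithmetic discussed above: with N-way
// replication the extra queries for a failed node's range spread over the
// N-1 covering pairs of neighbor indexes, and serving M indexes per node
// dilutes the per-node impact by another factor of M.
public class FailoverLoad {
    public static double neighborIncrease(int n, int m) {
        return 1.0 / ((n - 1) * m);
    }

    public static void main(String[] args) {
        System.out.println(neighborIncrease(3, 1)); // 0.5, i.e. 50% more queries
        System.out.println(neighborIncrease(4, 1)); // about 0.33, 33% more
        System.out.println(neighborIncrease(3, 4)); // 0.125, 12.5% more
    }
}
```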
From: Ning L. <nin...@gm...> - 2008-02-11 02:33:36
We can have a concept of a shard. I meant to say it'd be different from the shard in my postings on Lucene general. Here, we can say the distinct starting positions of all the nodes divide the ring into shards. A node serves one or multiple continuous shards. So, a node serves a range of docids.

Regards,
Ning

On Feb 10, 2008 12:14 PM, Yonik Seeley <yo...@ap...> wrote:
> On Feb 10, 2008 2:07 PM, Ning Li <nin...@gm...> wrote:
> > As in consistent hashing, the entire range of docids (or their hashes) is
> > treated as a ring. There is no clear concept of a shard here.
>
> It seems to me like consistent hashing could happily live with the
> concept of a shard.
> All docids that belong to a single physical node form a shard.
> All replicas of a shard could be the same.
> I've not thought through all the implications though... I'm still
> reading the Dynamo paper myself (and after that will be looking into
> zookeeper I suppose).
>
> -Yonik
From: Yonik S. <yo...@ap...> - 2008-02-10 20:14:47
On Feb 10, 2008 2:07 PM, Ning Li <nin...@gm...> wrote:
> As in consistent hashing, the entire range of docids (or their hashes) is
> treated as a ring. There is no clear concept of a shard here.

It seems to me like consistent hashing could happily live with the concept of a shard. All docids that belong to a single physical node form a shard. All replicas of a shard could be the same. I've not thought through all the implications though... I'm still reading the Dynamo paper myself (and after that will be looking into zookeeper I suppose).

-Yonik
From: Ning L. <nin...@gm...> - 2008-02-10 19:07:15
On Feb 8, 2008 3:37 PM, Doug Cutting <cu...@ap...> wrote:
> So the master is the manager of a lattice

While I'm trying to figure out completely how it would work, would consistent hashing and making the master the manager of a ring work and be simpler? (Reference consistent hashing and Dynamo's partitioning algorithm.)

As in consistent hashing, the entire range of docids (or their hashes) is treated as a ring. There is no clear concept of a shard here. A node is assigned a value (starting position) on the ring and it hosts an index for a range of docids starting from the starting position. The system cannot start until any range of docids is covered by at least one node. During normal state, any range of docids is covered by multiple nodes, depending on the level of replication.

So the master becomes the manager of a ring in this case. Given the ring, a search client sends a query to a (hopefully minimal) set of nodes whose ranges cover the ring. When sending the query to a node in the set, the search client also specifies the sub-range of docids on which the query should be executed. This is to make sure that any range of docids is queried and only queried once.

The above requires that a Lucene index can efficiently support query on a sub-range of docids - application/system docids, not Lucene docids. That can be achieved by a bit modification to Lucene so that within a segment, Lucene docids are assigned in the same order as their corresponding application/system docids during build/merge. A nice side-effect of this is that it becomes efficient to delete a sub-range of docids from an index as well.

Thoughts?

Regards,
Ning
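The covering-set selection the search client needs can be sketched like this. Editor's illustration; the start node, node count, and modular stepping are assumptions not spelled out in the thread: with N-way replication, querying every Nth node around the ring covers it, and when the node count isn't divisible by N the last node is queried with a sub-range filter.

```java
import java.util.ArrayList;
import java.util.List;

// Editor's sketch of picking a (hopefully minimal) covering set: starting
// anywhere and stepping N nodes at a time tiles the ring. ceil(count/n)
// nodes are needed; any wrap-around overlap on the last node would be
// trimmed by the sub-range filtering described in the message above.
public class CoveringSet {
    public static List<Integer> chooseNodes(int nodeCount, int n, int start) {
        int needed = (nodeCount + n - 1) / n;   // ceil(nodeCount / n)
        List<Integer> chosen = new ArrayList<>();
        int i = start % nodeCount;
        while (chosen.size() < needed) {
            chosen.add(i);
            i = (i + n) % nodeCount;
        }
        return chosen;
    }
}
```

For 9 nodes with N=3 this picks 3 nodes and no filtering is needed; for 10 nodes it picks 4, and the fourth overlaps the first by two ranges, which the sub-range filter removes so no docid range is queried twice.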
From: Yonik S. <yo...@ap...> - 2008-02-08 23:47:52
On Feb 8, 2008 6:37 PM, Doug Cutting <cu...@ap...> wrote:
> It would be good to have a system that supports online updates, so that
> we can support the widest variety of users. We don't want to force Solr
> to reinvent things to become scalable.
>
> That said, there will be some compromises. The compromise I'm proposing
> is that we permit conflicting changes in some situations (i.e., network
> partitioning) and that we silently resolve those conflicts in a
> predictable way. This isn't perfect, but I don't see another way to
> both meet the scalability and reliability goals and to support online
> updates. We can hopefully structure things so that conflicts are rare,
> but when they occur, some kinds of updates may be lost.
>
> Do others think this is a reasonable starting point for a design?

+1

-Yonik
From: Doug C. <cu...@ap...> - 2008-02-08 23:37:37
It would be good to have a system that supports online updates, so that we can support the widest variety of users. We don't want to force Solr to reinvent things to become scalable.

That said, there will be some compromises. The compromise I'm proposing is that we permit conflicting changes in some situations (i.e., network partitioning) and that we silently resolve those conflicts in a predictable way. This isn't perfect, but I don't see another way to both meet the scalability and reliability goals and to support online updates. We can hopefully structure things so that conflicts are rare, but when they occur, some kinds of updates may be lost.

Do others think this is a reasonable starting point for a design?

I added some initial thoughts about sharding to the wiki at:

http://bailey.wiki.sourceforge.net/
http://bailey.wiki.sourceforge.net/HowToShard

Some things I'm still not clear on:

- Can we get away with a master that has minimal persistent state?
- How do we split and merge shards? A shard is identified by a range of docids. As we split and merge, at some point, some nodes will have fragmentary shards (sub-ranges of a shard) and/or super-shards (ranges spanning multiple shards). The master needs to somehow make sense of this. If the system is restarted when some nodes have split a shard and some have not, then some ranges may be spanned by either a single node or multiple nodes. The system cannot start until there's a continuous path through the lattice, spanning all docids.

So the master is the manager of a lattice, asking nodes to make changes to the lattice, reflecting those changes in the lattice once they're complete, and presenting the lattice to search clients. Nodes in the lattice are points on the docid circle. Arcs are nodes that maintain an index for that range of document ids. Each node must know about other nodes whose ranges intersect its ranges, so that it can propagate changes. Does this make sense?

How do we keep track of which changes have been propagated? If there were never splits or merges, we might get away with using node identity: if node X has sent its changes A through C to node Y, then the next time it sends changes to Y, it could start with change D. But once things start splitting, merging and otherwise migrating, this gets harder. Any ideas? Am I barking up the wrong tree altogether?

Thanks!

Doug
From: Ning L. <nin...@gm...> - 2008-02-08 23:01:00
On Feb 7, 2008 3:16 PM, Doug Cutting <cu...@ap...> wrote:
> What do you think? Should we just direct interested folks on
> ge...@lu... to this site?

+1
From: Doug C. <cu...@ap...> - 2008-02-07 23:17:39
Thanks for adding yourselves to this list! I'm now wondering whether it's worth trying to keep under the radar. We don't want to exclude folks who might be interested, but I also don't want to get too much attention too soon. On balance however, I suspect more openness is better than less.

What do you think? Should we just direct interested folks on ge...@lu... to this site?

Cheers,
Doug