From: Doug C. <cu...@ap...> - 2008-02-14 22:23:42
Ning Li wrote:
> So the master becomes the manager of a ring in this case. Given
> the ring, a search client sends a query to a (hopefully
> minimal) set of nodes whose ranges cover the ring. When sending
> the query to a node in the set, the search client also
> specifies the sub-range of docids on which the query should be
> executed. This is to make sure that any range of docids is
> queried and only queried once.

I'd been thinking of something that was like consistent hashing, but that tried to keep indexes mostly disjoint. But perhaps consistent hashing itself will work well here. If each document is replicated in, say, the two clockwise indexes from the index serving its range, then one need only query every third index to achieve complete coverage, right?

Things get tricky when indexes are added or removed from the ring, when the number of nodes in the ring isn't divisible by three, etc. Some range filtering will be required in these cases, but not in most. Hopefully we could find a way so that any range filtering that's required is spread around the ring, to avoid hot-spots. Having nodes serve multiple indexes, at different points on the ring, will help some with hot-spots too.

With N-way replication, and only querying every Nth node, when a node fails, two other indexes, one on each side, must be queried to cover the missing range, since no other single index covers that range. But there are N-1 different pairs of indexes that do cover the range, so the load on each neighbor will increase by 1/(N-1) on average. If N=3, neighbor nodes would get 50% more queries; if N=4, neighbors would get 33% more queries; etc. If each node serves M indexes, then this impact would be diminished. So if N=3 and M=4 (each node serves four indexes) then a neighbor node's load would increase by just 12.5%, which is pretty manageable.

Another approach might be to query overlapping ranges and filter in the client. With N-way replication you'd query every (N/2)th index.
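To make the coverage claim concrete, here's a minimal sketch (plain Java, not Lucene code; the ring size and all names are hypothetical): with N-way replication, each node holds its own range plus the N-1 counter-clockwise ones, so querying every Nth node covers the whole ring when N divides the ring size.

```java
import java.util.*;

// Sketch only: a ring of R index nodes, where each range is also
// replicated on its N-1 clockwise successors, so node i holds
// ranges i, i-1, ..., i-(N-1) (mod R).  Querying every Nth node
// then covers every range exactly once when N divides R.
public class RingCoverage {
  // Nodes to query for full coverage: every Nth node around the ring.
  static List<Integer> querySet(int ringSize, int n) {
    List<Integer> nodes = new ArrayList<>();
    for (int i = 0; i < ringSize; i += n) nodes.add(i);
    return nodes;
  }

  // Ranges a node holds: its own plus the N-1 counter-clockwise ones.
  static Set<Integer> rangesHeld(int node, int ringSize, int n) {
    Set<Integer> ranges = new TreeSet<>();
    for (int k = 0; k < n; k++)
      ranges.add(((node - k) % ringSize + ringSize) % ringSize);
    return ranges;
  }

  public static void main(String[] args) {
    int ringSize = 9, n = 3;                 // 9 nodes, 3-way replication
    List<Integer> queried = querySet(ringSize, n);
    Set<Integer> covered = new TreeSet<>();
    for (int node : queried) covered.addAll(rangesHeld(node, ringSize, n));
    System.out.println("query " + queried + " covers "
        + covered.size() + "/" + ringSize + " ranges");
    // prints: query [0, 3, 6] covers 9/9 ranges
  }
}
```

With ringSize=9 and N=3, querying nodes 0, 3, and 6 covers ranges {0,7,8}, {1,2,3}, and {4,5,6} respectively, i.e. the whole ring from a third of the nodes.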
Search results would include facet counts for sub-ranges, so they could be correctly merged. If N=4, querying every other index, then, when a node fails, no other indexes need be re-queried. Similarly for N=6, querying every third index. Here you take a big hit up front, always searching twice as big an index as you need to, but you avoid the latencies of re-querying.

Does any of this make sense?

> The above requires that a Lucene index can efficiently support
> query on a sub-range of docids - application/system docids,
> not Lucene docids.

If simply implemented with a filter based on a FieldCache, this is fast, but the expense is still that of searching the entire index.

> That can be achieved by a bit modification
> to Lucene so that within a segment, Lucene docids are assigned
> in the same order as their corresponding application/system
> docids during build/merge. A nice side-effect of this is that
> it becomes efficient to delete a sub-range of docids from an
> index as well.

I don't see how that's easy. Lucene assumes that newly indexed ids are always greater than previously indexed ids, and that assumption is fairly deep. Segments could be re-sorted, I guess, and postings merged rather than appended. But that'd be a substantial change to Lucene. Is that what you had in mind?

Doug
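P.S. A rough sketch of the FieldCache-style filter mentioned above (plain Java, no Lucene APIs; `appIds` stands in for a cached int[] of application docids indexed by Lucene docid): mark every document whose application id falls in the requested sub-range. The per-document test is cheap, but it still visits every document, i.e. the cost of the whole index.

```java
import java.util.BitSet;

// Sketch only: filter an index to a sub-range [lo, hi) of
// application docids, given a FieldCache-style array mapping
// Lucene docid -> application docid.  O(1) per document, but
// O(index size) overall, as noted above.
public class DocidRangeFilter {
  static BitSet bits(int[] appIds, int lo, int hi) {
    BitSet bits = new BitSet(appIds.length);
    for (int doc = 0; doc < appIds.length; doc++) {
      if (appIds[doc] >= lo && appIds[doc] < hi) bits.set(doc);
    }
    return bits;
  }

  public static void main(String[] args) {
    int[] appIds = {105, 42, 88, 7, 63};      // hypothetical cached ids
    System.out.println(bits(appIds, 40, 90)); // prints: {1, 2, 4}
  }
}
```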