From: Doug C. <cu...@ap...> - 2008-02-14 22:23:42
Ning Li wrote:
> So the master becomes the manager of a ring in this case. Given
> the ring, a search client sends a query to a (hopefully
> minimal) set of nodes whose ranges cover the ring. When sending
> the query to a node in the set, the search client also
> specifies the sub-range of docids on which the query should be
> executed. This is to make sure that any range of docids is
> queried and only queried once.

I'd been thinking of something that was like consistent hashing, but that tried to keep indexes mostly disjoint. But perhaps consistent hashing itself will work well here. If each document is replicated in, say, the two clockwise indexes from the index serving its range, then one need only query every third index to achieve complete coverage, right?

Things get tricky when indexes are added or removed from the ring, when the number of nodes in the ring isn't divisible by three, etc. Some range filtering will be required in these cases, but not in most. Hopefully we could find a way so that any range filtering that's required is spread around the ring, to avoid hot-spots. Having nodes serve multiple indexes, at different points on the ring, will help some with hot-spots too.

With N-way replication, and only querying every Nth node, when a node fails, two other indexes, one on each side, must be queried to cover the missing range, since no other single index covers that range. But there are N-1 different pairs of indexes that do cover the range, so the load on each neighbor will increase by 1/(N-1) on average. If N=3, neighbor nodes would get 50% more queries; if N=4, neighbors would get 33% more queries; etc. If each node serves M indexes, then this impact would be diminished. So if N=3 and M=4 (each node serves four indexes) then a neighbor node's load would increase by just 12.5%, which is pretty manageable.

Another approach might be to query overlapping ranges and filter in the client. With N-way replication you'd query every (N/2)th index.
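To make the coverage claim concrete, here's a minimal sketch (plain Java, not Lucene code; the ring size and all names are hypothetical): with N-way replication, each node holds its own range plus the N-1 counter-clockwise ones, so querying every Nth node covers the whole ring when N divides the ring size.

```java
import java.util.*;

// Sketch only: a ring of R index nodes, where each range is also
// replicated on its N-1 clockwise successors, so node i holds
// ranges i, i-1, ..., i-(N-1) (mod R).  Querying every Nth node
// then covers every range exactly once when N divides R.
public class RingCoverage {
  // Nodes to query for full coverage: every Nth node around the ring.
  static List<Integer> querySet(int ringSize, int n) {
    List<Integer> nodes = new ArrayList<>();
    for (int i = 0; i < ringSize; i += n) nodes.add(i);
    return nodes;
  }

  // Ranges a node holds: its own plus the N-1 counter-clockwise ones.
  static Set<Integer> rangesHeld(int node, int ringSize, int n) {
    Set<Integer> ranges = new TreeSet<>();
    for (int k = 0; k < n; k++)
      ranges.add(((node - k) % ringSize + ringSize) % ringSize);
    return ranges;
  }

  public static void main(String[] args) {
    int ringSize = 9, n = 3;                 // 9 nodes, 3-way replication
    List<Integer> queried = querySet(ringSize, n);
    Set<Integer> covered = new TreeSet<>();
    for (int node : queried) covered.addAll(rangesHeld(node, ringSize, n));
    System.out.println("query " + queried + " covers "
        + covered.size() + "/" + ringSize + " ranges");
    // prints: query [0, 3, 6] covers 9/9 ranges
  }
}
```

With ringSize=9 and N=3, querying nodes 0, 3, and 6 covers ranges {0,7,8}, {1,2,3}, and {4,5,6} respectively, i.e. the whole ring from a third of the nodes.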
Search results would include facet counts for sub-ranges, so they could be correctly merged. If N=4, querying every other index, then, when a node fails, no other indexes need be re-queried. Similarly for N=6, querying every third index. Here you take a big hit up front, always searching twice as big an index as you need to, but you avoid the latencies of re-querying.

Does any of this make sense?

> The above requires that a Lucene index can efficiently support
> query on a sub-range of docids - application/system docids,
> not Lucene docids.

If simply implemented with a filter based on a FieldCache, this is fast, but the expense is still that of searching the entire index.

> That can be achieved by a bit modification
> to Lucene so that within a segment, Lucene docids are assigned
> in the same order as their corresponding application/system
> docids during build/merge. A nice side-effect of this is that
> it becomes efficient to delete a sub-range of docids from an
> index as well.

I don't see how that's easy. Lucene assumes that newly indexed ids are always greater than previously indexed ids, and that assumption is fairly deep. Segments could be re-sorted, I guess, and postings merged rather than appended. But that'd be a substantial change to Lucene. Is that what you had in mind?

Doug
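P.S. A rough sketch of the FieldCache-style filter mentioned above (plain Java, no Lucene APIs; `appIds` stands in for a cached int[] of application docids indexed by Lucene docid): mark every document whose application id falls in the requested sub-range. The per-document test is cheap, but it still visits every document, i.e. the cost of the whole index.

```java
import java.util.BitSet;

// Sketch only: filter an index to a sub-range [lo, hi) of
// application docids, given a FieldCache-style array mapping
// Lucene docid -> application docid.  O(1) per document, but
// O(index size) overall, as noted above.
public class DocidRangeFilter {
  static BitSet bits(int[] appIds, int lo, int hi) {
    BitSet bits = new BitSet(appIds.length);
    for (int doc = 0; doc < appIds.length; doc++) {
      if (appIds[doc] >= lo && appIds[doc] < hi) bits.set(doc);
    }
    return bits;
  }

  public static void main(String[] args) {
    int[] appIds = {105, 42, 88, 7, 63};      // hypothetical cached ids
    System.out.println(bits(appIds, 40, 90)); // prints: {1, 2, 4}
  }
}
```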