Re: [bailey-developers] lattice master


On Thu, Feb 14, 2008 at 5:23 PM,
Doug Cutting <cu...@ap...> wrote:
> If each document is replicated in, say, the two clockwise indexes from
> the index serving its range, then one need only query every third index
> to achieve complete coverage, right?  Things get tricky when indexes are
> added or removed from the ring, when the number of nodes in the ring
> isn't divisible by three, etc.  Some range filtering will be required in
> these cases, but not in most.  Hopefully we could find a way so that any
> range-filtering that's required is spread around the ring, to avoid
> hot-spots.  Having nodes serve multiple indexes at different points on
> the ring will help some with hot-spots too.
Agree.
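
To make the coverage argument concrete, here is a minimal sketch. It assumes indexes are numbered 0..n-1 clockwise, that index i serves range i, and that each document is also stored in the next two clockwise indexes. The name plan_queries and the greedy assignment of ranges are illustrative assumptions, not anything defined in this thread.

```python
def plan_queries(num_indexes, copies=3):
    """Pick every `copies`-th index and list the ranges each should answer for.

    Index q holds ranges q, q-1, ..., q-copies+1 (mod num_indexes), because a
    document is stored in its home index and the next copies-1 clockwise ones.
    An index that is assigned fewer than `copies` ranges needs a range filter.
    """
    assigned = set()
    plan = {}
    for q in range(0, num_indexes, copies):
        held = {(q - k) % num_indexes for k in range(copies)}
        # Accept only ranges not already answered by an earlier queried index.
        plan[q] = sorted(held - assigned)
        assigned |= held
    return plan
```

With nine indexes every queried index answers for all three ranges it holds, so no filtering is needed. With seven, the sketch puts all of the filtering on the last queried index, which is exactly the hot-spot concern raised above; a real scheme would want to rotate the starting point.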

> If N=4, neighbors would get
> 33% more queries, etc.  If each node serves M indexes, then this impact
> would be diminished.  So if N=3 and M=4 (each node serves four indexes)
> then a neighbor node's load would increase by just 12.5%, which is
> pretty manageable.

I think Yonik's interpretation is correct, right?
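
For reference, one reading that reproduces the quoted figures: when an index fails, its queries are spread evenly over the N-1 surviving replicas, and a node serving M indexes sees that increase on only one of its M indexes. This is an assumption on my part and may not be the interpretation under discussion; extra_load is a hypothetical helper.

```python
def extra_load(replicas, indexes_per_node=1):
    """Fractional load increase on a neighbor node after one index fails.

    Assumes the failed index's queries are shared equally by the remaining
    replicas - 1 copies, and that each node's load is split evenly across
    its indexes_per_node indexes.
    """
    return 1.0 / ((replicas - 1) * indexes_per_node)
```

Under these assumptions, extra_load(4) gives one third and extra_load(3, 4) gives 0.125, matching the 33% and 12.5% quoted above.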

> Another approach might be to query overlapping ranges and filter in the
> client.  With N-way replication you'd query every N/2th index.  Search
> results would include facet counts for sub-ranges, so they could be
> correctly merged.  If N=4, querying every other index, then, when a node
> fails, no other indexes need be re-queried.  Similarly for N=6 querying
> every third index.  Here you take a big hit up front, always searching
> an index twice as big as you need, but avoid the latencies of
> re-querying.

Whether this is worthwhile depends on the expected percentage of queries
that need to be re-queried on some indexes...
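
A small sketch of why overlapping queries survive a single failure, using the same ring model as before: index q holds ranges q, q-1, ..., q-N+1. The function name and parameters are hypothetical.

```python
def covered_ranges(num_indexes, copies, step, failed=()):
    """Return the set of ranges still covered when every `step`-th index is
    queried and the indexes in `failed` do not answer."""
    covered = set()
    for q in range(0, num_indexes, step):
        if q in failed:
            continue
        covered |= {(q - k) % num_indexes for k in range(copies)}
    return covered
```

For a ring of eight indexes with N=4, querying every other index (step 2) still covers all eight ranges if any one queried index fails, whereas querying every fourth index loses four ranges and forces a re-query.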

> > The above requires that a Lucene index can efficiently support
> > queries on a sub-range of docids - application/system docids,
> > not Lucene docids.
>
> If simply implemented with a filter based on a FieldCache, this is fast,
> but the expense is still that of searching the entire index.

Yes, this should be good enough.
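
As a rough illustration of the filtering approach (plain Python, not the Lucene API): app_ids below stands in for a per-document cache mapping internal docid to application docid, and both function names are made up for this sketch.

```python
def build_range_filter(app_ids, lo, hi):
    """One boolean per internal docid: True if its application docid is in
    the half-open range [lo, hi). Building it touches every document."""
    return [lo <= app_id < hi for app_id in app_ids]


def search_with_filter(app_ids, matches, lo, hi):
    """Keep only the matching internal docids whose application docid falls
    in the requested range. `matches` is the result of searching the whole
    index, so the search cost is not reduced by the filter."""
    allowed = build_range_filter(app_ids, lo, hi)
    return [doc for doc in matches if allowed[doc]]
```

The lookup per hit is cheap, but as noted above the query itself still runs over the entire index.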

> > ... so that within a segment, Lucene docids are assigned
> > in the same order as their corresponding application/system
> > docids during build/merge...
>
> I don't see how that's easy.  Lucene assumes that newly indexed ids are
> always greater than previously indexed ids, and that assumption is
> fairly deep.  Segments could be re-sorted I guess, and postings merged
> rather than appended.  But that'd be a substantial change to Lucene.  Is
> that what you had in mind?

I was thinking of keeping Lucene docids in the same order as
their application/system docids within a segment, not across
segments. The merge algorithm will be different but segments
don't have to be re-sorted during merge. To take advantage of this,
however, we'll have to query each segment individually and then
merge the results. (Query each SegmentReader and merge the results
instead of querying the MultiSegmentReader.) We'll need the same
result merge algorithm on clients. But until we implement that
algorithm, let's use filtering.
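
A minimal sketch of the per-segment idea, again in plain Python rather than Lucene code. It assumes each segment is a list of application docids in ascending order (position in the list plays the role of the Lucene docid within the segment) and that score is a caller-supplied function; all names here are hypothetical.

```python
import bisect
import heapq


def search_segment(segment, lo, hi, score):
    """Score only the documents whose application docid is in [lo, hi).

    Because the segment is ordered by application docid, the sub-range is a
    contiguous slice found by binary search, with no per-document filter.
    """
    start = bisect.bisect_left(segment, lo)
    end = bisect.bisect_left(segment, hi)
    return [(score(app_id), app_id) for app_id in segment[start:end]]


def search_range(segments, lo, hi, score, k):
    """Query each segment separately, then merge the per-segment top-k hits."""
    per_segment = [heapq.nlargest(k, search_segment(s, lo, hi, score))
                   for s in segments]
    return heapq.nlargest(k, (hit for hits in per_segment for hit in hits))
```

The final merge step has the same shape a client would need when combining results from several indexes.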

Regards,
Ning