From: Yonik S. <yo...@ap...> - 2008-02-15 02:38:58
On Thu, Feb 14, 2008 at 5:23 PM, Doug Cutting <cu...@ap...> wrote:
> Ning Li wrote:
> > So the master becomes the manager of a ring in this case.  Given
> > the ring, a search client sends a query to a (hopefully
> > minimal) set of nodes whose ranges cover the ring.  When sending
> > the query to a node in the set, the search client also
> > specifies the sub-range of docids on which the query should be
> > executed.  This is to make sure that any range of docids is
> > queried and only queried once.
>
> I'd been thinking of something that was like consistent hashing, but
> that tried to keep indexes mostly disjoint.  But perhaps consistent
> hashing itself will work well here.
>
> If each document is replicated in, say, the two clockwise indexes from
> the index serving its range, then one need only query every third index
> to achieve complete coverage, right?  Things get tricky when indexes are
> added or removed from the ring, when the number of nodes in the ring
> isn't divisible by three, etc.  Some range filtering will be required in
> these cases, but not in most.  Hopefully we could find a way so that any
> range-filtering that's required is spread around the ring, to avoid
> hot-spots.

Starting at a random node and querying every Nth node, and then
filtering the last overlapping parts, seems like it should spread that
evenly.

> Having nodes serve multiple indexes, at different points on
> the ring, will help some with hot-spots too.
>
> With N-way replication, and only querying every Nth node, when a node
> fails, 2 other indexes, one on each side, must be queried to cover the
> missing range, since no other single index covers that range.  But
> there are N-1 different pairs of indexes that do cover the range.  So
> the load on neighbors will increase by 1/(N-1) on average.  If N=3,
> neighbor nodes would get 50% more queries.  If N=4, neighbors would get
> 33% more queries, etc.

OK, this makes sense so far...
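The every-Nth-node coverage claim is easy to sanity-check. Here's a
minimal sketch (my own toy model, not anything from an actual patch):
a ring of equal arcs where node i serves arc i and each arc is also
replicated on the next N-1 clockwise nodes.

```python
# Toy model of the ring: `nodes` equal arcs, node i serves arc i, and
# each arc is replicated on the next N-1 clockwise nodes, so node j
# ends up holding arcs j, j-1, ..., j-N+1 (mod nodes).

def arcs_held(node: int, nodes: int, n: int) -> set[int]:
    """All arcs stored on `node`: its own plus the N-1 it replicates."""
    return {(node - k) % nodes for k in range(n)}

def query_plan(start: int, nodes: int, n: int) -> list[int]:
    """Start at `start` and query every Nth node around the ring."""
    return [(start + i * n) % nodes for i in range(nodes // n)]

NODES, N = 9, 3
for start in range(NODES):
    plan = query_plan(start, NODES, N)
    held = [arcs_held(p, NODES, N) for p in plan]
    # Complete coverage: the queried nodes together hold every arc...
    assert set().union(*held) == set(range(NODES))
    # ...and, since 9 is divisible by 3, each arc exactly once, so no
    # range filtering is needed regardless of the random start.
    assert sum(len(h) for h in held) == NODES
```

With 9 nodes and N=3 the plan always queries exactly 3 nodes and covers
each arc once, whatever the random starting node, which is the
"spread evenly" property above.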
For a 9-node system with N=3, you query 3 nodes, with indices each 1/3
the size of the complete index.

> If each node serves M indexes, then this impact
> would be diminished.  So if N=3 and M=4 (each node serves four indexes),
> then a neighbor node's load would increase by just 12.5%, which is
> pretty manageable.

I got lost at "M" a little...  What's an index in this context?  Does it
mean additional virtual nodes on the hash circle that map to the same
physical node (an index being the arc between two adjacent points)?
That's normally needed to make consistent hashing fair anyway, right?

> > The above requires that a Lucene index can efficiently support
> > query on a sub-range of docids - application/system docids,
> > not Lucene docids.
>
> If simply implemented with a filter based on a FieldCache, this is fast,
> but the expense is still that of searching the entire index.

Just thinking about the static case, one could keep separate docs in
separate indices.  A MultiReader over 3 indices wouldn't be too bad.
But then if you start adding and removing nodes, it gets messy.  If the
load increase is low enough (12.5% isn't too bad), straightforward
filtering is probably the best.  It seems like we need filtering
capabilities anyway in the cases when nodes are being rebalanced due to
the number of nodes changing.

-Yonik
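One reading that makes all three quoted failover numbers (50% for N=3,
33% for N=4, 12.5% for N=3 and M=4) come out consistently is that a
neighbor's average extra load is 1/((N-1) * M). That formula is my
inference, not something stated explicitly in the thread, but it's a
quick check:

```python
# Assumed reading of the failover arithmetic: a failed index's range is
# re-covered by one of (N-1) neighbor pairs, and serving M indexes per
# physical node spreads that extra work M ways.

def neighbor_load_increase(n: int, m: int = 1) -> float:
    """Average fraction of extra queries a neighbor absorbs on failure."""
    return 1 / ((n - 1) * m)

print(f"{neighbor_load_increase(3):.1%}")       # N=3, M=1 -> 50.0%
print(f"{neighbor_load_increase(4):.1%}")       # N=4, M=1 -> 33.3%
print(f"{neighbor_load_increase(3, m=4):.1%}")  # N=3, M=4 -> 12.5%
```

If that reading is right, M is exactly the virtual-node trick: more,
smaller arcs per physical node, which both evens out the hash circle
and dilutes the failover burden.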