[bailey-developers] lattice master

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

It would be good to have a system that supports online updates, so that 
we can support the widest variety of users.  We don't want to force Solr 
to reinvent things to become scalable.

That said, there will be some compromises.  The compromise I'm proposing 
is that we permit conflicting changes in some situations (i.e., network 
partitioning) and that we silently resolve those conflicts in a 
predictable way.  This isn't perfect, but I don't see another way to 
both meet the scalability and reliability goals and to support online 
updates.  We can hopefully structure things so that conflicts are rare, 
but when they occur, some kinds of updates may be lost.

Do others think this is a reasonable starting point for a design?

I added some initial thoughts about sharding to the wiki at:

http://bailey.wiki.sourceforge.net/
http://bailey.wiki.sourceforge.net/HowToShard

Some things I'm still not clear on:

- Can we get away with a master that has minimal persistent state?
- How do we split and merge shards?

A shard is identified by a range of docids.  As we split and merge, at 
some point, some nodes will have fragmentary shards (sub-ranges of a 
shard) and/or super-shards (ranges spanning multiple shards).  The 
master needs to somehow make sense of this.  If the system is restarted 
when some nodes have split a shard and some have not, then some ranges 
may be spanned by either a single node or multiple nodes.  The system 
cannot start until there's a continuous path through the lattice, 
spanning all docids.

So the master is the manager of a lattice, asking nodes to make changes 
to the lattice, reflecting those changes in the lattice once they're 
complete, and presenting the lattice to search clients.

Nodes in the lattice are points on the docid circle.  Arcs are nodes 
that maintain an index for that range of document ids.  Each node must 
know about other nodes whose ranges intersect its ranges, so that it can 
propagate changes.

Does this make sense?

How do we keep track of which changes have been propagated?  If there 
were never splits or merges, we might get away with using node identity: 
if node X has sent its changes A through C to node Y, then the next time 
it sends changes to Y, it could start with change D.  But once things 
start splitting, merging and otherwise migrating, this gets harder.

Any ideas?  Am I barking up the wrong tree altogether?

Thanks!

Doug