|
From: Doug C. <cu...@ap...> - 2008-02-08 23:37:37
|
It would be good to have a system that supports online updates, so that we can support the widest variety of users. We don't want to force Solr to reinvent things to become scalable. That said, there will be some compromises. The compromise I'm proposing is that we permit conflicting changes in some situations (i.e., network partitioning) and that we silently resolve those conflicts in a predictable way. This isn't perfect, but I don't see another way to both meet the scalability and reliability goals and to support online updates. We can hopefully structure things so that conflicts are rare, but when they occur, some kinds of updates may be lost. Do others think this is a reasonable starting point for a design? I added some initial thoughts about sharding to the wiki at: http://bailey.wiki.sourceforge.net/ http://bailey.wiki.sourceforge.net/HowToShard Some things I'm still not clear on: - Can we get away with a master that has minimal persistent state? - How do we split and merge shards? A shard is identified by a range of docids. As we split and merge, at some point, some nodes will have fragmentary shards (sub-ranges of a shard) and/or super-shards (ranges spanning multiple shards). The master needs to somehow make sense of this. If the system is restarted when some nodes have split a shard and some have not, then some ranges may be spanned by either a single node or multiple nodes. The system cannot start until there's a continuous path through the lattice, spanning all docids. So the master is the manager of a lattice, asking nodes to make changes to the lattice, reflecting those changes in the lattice once they're complete, and presenting the lattice to search clients. Nodes in the lattice are points on the docid circle. Arcs are nodes that maintain an index for that range of document ids. Each node must know about other nodes whose ranges intersect its ranges, so that it can propagate changes. Does this make sense? How do we keep track of which changes have been propagated? If there were never splits or merges, we might get away with using node identity: if node X has sent its changes A through C to node Y, then the next time it sends changes to Y, it could start with change D. But once things start splitting, merging and otherwise migrating, this gets harder. Any ideas? Am I barking up the wrong tree altogether? Thanks! Doug |