From: Doug C. <cu...@ap...> - 2008-02-28 22:01:17
Yonik Seeley wrote:
> On Wed, Feb 27, 2008 at 3:58 PM, Ning Li <nin...@gm...> wrote:
>> At the same time, should we also discuss what the update
>> model should be:
>> 1 One updatable replica vs. all updatable replicas. The former
>> is simple. The latter is powerful. Is there sufficient need for
>> the latter to justify its complexity?
>
> We should always be able to update (so if the "updateable" replica is
> down, we need to be able to update another replica).

I agree.  To support network partitions, all nodes must accept updates
for documents in their range.

>> 2 The atomicity of an insert/delete/update operation. When
>> an insert/delete/update operation is done, does it mean:
>> 1) the new doc is indexed in the memory of the node
>> 2) the new doc is indexed on the local disk of the node
>> 3) the new doc is logged on the local disk of the node
>> 4) the new doc is logged in some fault-tolerant shared FS
>> (e.g. HDFS)
>> 5) the new doc is indexed in the memory of at least X nodes
>> The probability of the operation getting lost is from high
>> to low: 1), then 2) and 3), then 4) and 5).
>
> [ ... ] I think the decision partially depends on how much of
> a document storage system this is, vs just an index that can be
> rebuilt.

I'd like to position this against document databases, so I'm hoping it
can be used as primary storage.

> I'm tempted to lean toward #3 since logs are needed to sync up nodes
> (back to question #1).

It would be a nice feature if we could arrange so that, in most cases,
the client that adds a document sees it in search results immediately.
We cannot guarantee that all other clients will see it.  Some sort of
immediate indexing of the document is required to support this feature,
but in-memory indexing is sufficient.  We may not implement this
feature right off, but we should keep it in mind.

Logging is attractive, since it permits easy replaying of logs when
shipping updates between nodes.
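To make option #3 concrete, here is a minimal sketch of an append-only
operation log whose tail can be replayed to bring a lagging node up to
date.  All names and the record format are hypothetical, not part of
any existing API:

```python
import json


class OpLog:
    """Append-only log of index operations (durability option #3).

    Every insert/delete is written to the local log before it is
    acknowledged; replay(since_seq) ships the missing suffix of the
    log to a node that has fallen behind.
    """

    def __init__(self, path):
        self.path = path
        self.next_seq = 0

    def append(self, op, doc):
        """Record one operation and return its sequence number."""
        record = {"seq": self.next_seq, "op": op, "doc": doc}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            # Note: no f.flush()/os.fsync() here -- whether to force
            # the record to disk on every add is exactly the sync
            # tradeoff discussed in this thread.
        self.next_seq += 1
        return record["seq"]

    def replay(self, since_seq=0):
        """Yield all operations with seq >= since_seq, in order."""
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                if record["seq"] >= since_seq:
                    yield record
```

A node that was down could then report its last applied sequence
number and receive only the operations it missed, rather than a full
index copy.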
Perhaps we can instead use queries to enumerate changes, but that
requires more thought.

As for disk versus memory: if we only send updates to a single node in
a document's range, then we should sync them to disk.  If we instead
send updates to multiple nodes in the range, then it's probably okay
not to sync, since we already assume that not all nodes in a range
will fail at once.  The downside of this is that documents could be
lost in the case of a datacenter-wide power failure, but I think
that's acceptable.  Performance will suffer considerably if we have to
sync on each add.

So my inclination is to attempt to add documents to several nodes in
the range and not require a sync per add, buffering things in memory
as required for good performance.  Replication in memory provides
fault-tolerance.

Doug
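A sketch of that inclination -- add to every reachable node in the
range, buffer in memory, and treat the add as successful once enough
copies exist.  The class and the min_copies parameter are illustrative
assumptions, not a proposed API:

```python
class InMemoryReplica:
    """A node that buffers newly added documents in RAM (no fsync)."""

    def __init__(self, name):
        self.name = name
        self.buffered = []   # docs indexed in memory, not yet flushed
        self.alive = True

    def add(self, doc):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        self.buffered.append(doc)


def replicated_add(doc, replicas, min_copies=2):
    """Add doc to all reachable replicas in the document's range.

    The add succeeds once at least min_copies nodes hold the document
    in memory: fault tolerance comes from having multiple copies, not
    from a disk sync on each add.  A down node does not block the
    update, matching the "always able to update" requirement above.
    """
    copies = 0
    for replica in replicas:
        try:
            replica.add(doc)
            copies += 1
        except ConnectionError:
            continue  # skip unreachable nodes, keep trying the rest
    if copies < min_copies:
        raise RuntimeError("too few live replicas; add not durable")
    return copies
```

Under this scheme a single node failure loses nothing, and only a
simultaneous failure of all buffering nodes (e.g. a datacenter-wide
power failure) can lose unflushed documents.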