From: Ning L. <nin...@gm...> - 2008-02-29 00:28:49
On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as a primary storage.

A copy of a document will be stored in a stored field, right?

I think positioning this against document databases is nice, but here are a
couple of things worth noting:

First, keeping both a doc and its inverted form in an index means storing
the doc and indexing the doc are done in the same "transaction". A
traditional document database often stores a doc first and then indexes it
later (hopefully soon).

Second, a traditional document database often supports updating a doc's
"metadata", such as author or date. Do we not support this, or do we say a
document is a set of name-value pairs, reconstruct it from stored fields,
and support such updates?

> > I'm tempted to lean toward #3 since logs are needed to sync up nodes
> > (back to question #1).
>
> It would be a nice feature if we could arrange so that, in most cases,
> the client that adds a document sees it in search results immediately.
> We cannot guarantee that all other clients will see it. Some sort of
> immediate indexing of the document is required to support this feature,
> but in-memory is sufficient. We may not implement this feature right
> off, but we should keep it in mind.
>
> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.

If we have a copy of each doc in the stored field, then as you said later,
we can just log the operation id and revision, then retrieve the doc as
necessary from the index.

> As for disk versus memory: if we only send updates to a single node in a
> document's range, then we should sync them to disk. If we instead send
> updates to multiple nodes in the range, then it's probably okay not to
> sync, since we already assume that not all nodes in a range will fail at
> once.
> The downside of this is that documents could be lost in the case
> of a datacenter-wide power failure, but I think that's acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.

#5 with non-sync light-weight logging? Should work.

Ning
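
The light-weight logging idea above (log only an operation id and revision,
and retrieve the full doc from the index's stored fields when replaying on
another node) could be sketched as follows. This is a minimal Python toy,
not Lucene API; all names (Index, OpLog, replay) are hypothetical:

```python
# Sketch: log entries carry no document body, only (op, doc_id, revision).
# When syncing a peer, the body is fetched from the index's stored copy.
# Hypothetical names throughout; not any real Lucene API.

class Index:
    """Toy index keeping a full stored copy of each doc, by revision."""

    def __init__(self):
        self.stored = {}  # (doc_id, revision) -> dict of fields

    def add(self, doc_id, revision, fields):
        # Storing and "indexing" happen in the same operation.
        self.stored[(doc_id, revision)] = dict(fields)

    def get(self, doc_id, revision):
        return self.stored[(doc_id, revision)]


class OpLog:
    """Light-weight log: ids and revisions only, no doc bodies."""

    def __init__(self):
        self.entries = []

    def record(self, op, doc_id, revision):
        self.entries.append((op, doc_id, revision))


def replay(log, source_index, target_index):
    # Sync a peer by replaying the log; doc bodies come from the source
    # index's stored fields, so the log itself stays small.
    for op, doc_id, revision in log.entries:
        if op == "add":
            target_index.add(doc_id, revision, source_index.get(doc_id, revision))


primary, replica, log = Index(), Index(), OpLog()
primary.add("d1", 1, {"author": "ning", "body": "..."})
log.record("add", "d1", 1)
replay(log, primary, replica)
print(replica.get("d1", 1)["author"])  # -> ning
```

The point of the sketch is that the log stays cheap to write and ship,
which only works because a full copy of each doc lives in stored fields.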
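
Doug's no-sync-per-add design (option #5: send each add to several nodes in
the range, buffer in memory, and amortize the disk sync) could look roughly
like this. Again a Python toy with hypothetical names (Replica, flush,
replicated_add), not a real implementation:

```python
# Sketch: each add goes to every replica's in-memory buffer (no fsync per
# add). Durability before a flush comes from the extra in-memory copies on
# other nodes, which is why a datacenter-wide power failure can lose docs.

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = []   # in-memory only; lost on power failure
        self.on_disk = []  # durable after a flush

    def add(self, doc):
        self.buffer.append(doc)  # cheap: no sync here

    def flush(self):
        # One sync covers many buffered adds, amortizing the cost.
        self.on_disk.extend(self.buffer)
        self.buffer.clear()


def replicated_add(replicas, doc):
    # Fault tolerance from replication in memory, not a per-add sync.
    for r in replicas:
        r.add(doc)


replicas = [Replica("n1"), Replica("n2"), Replica("n3")]
replicated_add(replicas, {"id": "d1"})
replicas[0].flush()
print(len(replicas[0].on_disk), len(replicas[1].buffer))  # -> 1 1
```

The trade-off matches the thread: a per-add sync would make every add pay
a disk write, while buffering loses only the window since the last flush,
and only if all replicas in the range fail at once.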