From: Ning L. <nin...@gm...> - 2008-02-29 00:28:49
On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as a primary storage.

A copy of a document will be stored in a stored field, right?

I think positioning this against document databases is nice, but here are a
couple of things worth noting:

First, keeping both a doc and its inverted form in an index means storing
the doc and indexing the doc are done in the same "transaction". A
traditional document database often stores a doc first and then indexes it
later (hopefully soon).

Second, a traditional document database often supports updating a doc's
"metadata", such as author or date. Do we not support this, or do we say a
document is a set of name-value pairs, reconstruct it from stored fields,
and support such updates?

> > I'm tempted to lean toward #3 since logs are needed to sync up nodes
> > (back to question #1).
>
> It would be a nice feature if we could arrange so that, in most cases,
> the client that adds a document sees it in search results immediately.
> We cannot guarantee that all other clients will see it. Some sort of
> immediate indexing of the document is required to support this feature,
> but in-memory is sufficient. We may not implement this feature right
> off, but we should keep it in mind.
>
> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.

If we have a copy of each doc in the stored field, then as you said later,
we can just log the operation id and revision, then retrieve the doc as
necessary from the index.

> As for disk versus memory: if we only send updates to a single node in a
> document's range, then we should sync them to disk. If we instead send
> updates to multiple nodes in the range, then it's probably okay not to
> sync, since we already assume that not all nodes in a range will fail at
> once.
> The downside of this is that documents could be lost in the case
> of a datacenter-wide power failure, but I think that's acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.

#5 with non-sync light-weight logging? Should work.

Ning
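
The light-weight logging idea above (log only an operation id and revision,
and retrieve the full doc from the index's stored fields when replaying on
another node) could be sketched as follows. This is a minimal Python toy,
not Lucene API; all names (Index, OpLog, replay) are hypothetical:

```python
# Sketch: log entries carry no document body, only (op, doc_id, revision).
# When syncing a peer, the body is fetched from the index's stored copy.
# Hypothetical names throughout; not any real Lucene API.

class Index:
    """Toy index keeping a full stored copy of each doc, by revision."""

    def __init__(self):
        self.stored = {}  # (doc_id, revision) -> dict of fields

    def add(self, doc_id, revision, fields):
        # Storing and "indexing" happen in the same operation.
        self.stored[(doc_id, revision)] = dict(fields)

    def get(self, doc_id, revision):
        return self.stored[(doc_id, revision)]


class OpLog:
    """Light-weight log: ids and revisions only, no doc bodies."""

    def __init__(self):
        self.entries = []

    def record(self, op, doc_id, revision):
        self.entries.append((op, doc_id, revision))


def replay(log, source_index, target_index):
    # Sync a peer by replaying the log; doc bodies come from the source
    # index's stored fields, so the log itself stays small.
    for op, doc_id, revision in log.entries:
        if op == "add":
            target_index.add(doc_id, revision, source_index.get(doc_id, revision))


primary, replica, log = Index(), Index(), OpLog()
primary.add("d1", 1, {"author": "ning", "body": "..."})
log.record("add", "d1", 1)
replay(log, primary, replica)
print(replica.get("d1", 1)["author"])  # -> ning
```

The point of the sketch is that the log stays cheap to write and ship,
which only works because a full copy of each doc lives in stored fields.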
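
Doug's no-sync-per-add design (option #5: send each add to several nodes in
the range, buffer in memory, and amortize the disk sync) could look roughly
like this. Again a Python toy with hypothetical names (Replica, flush,
replicated_add), not a real implementation:

```python
# Sketch: each add goes to every replica's in-memory buffer (no fsync per
# add). Durability before a flush comes from the extra in-memory copies on
# other nodes, which is why a datacenter-wide power failure can lose docs.

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = []   # in-memory only; lost on power failure
        self.on_disk = []  # durable after a flush

    def add(self, doc):
        self.buffer.append(doc)  # cheap: no sync here

    def flush(self):
        # One sync covers many buffered adds, amortizing the cost.
        self.on_disk.extend(self.buffer)
        self.buffer.clear()


def replicated_add(replicas, doc):
    # Fault tolerance from replication in memory, not a per-add sync.
    for r in replicas:
        r.add(doc)


replicas = [Replica("n1"), Replica("n2"), Replica("n3")]
replicated_add(replicas, {"id": "d1"})
replicas[0].flush()
print(len(replicas[0].on_disk), len(replicas[1].buffer))  # -> 1 1
```

The trade-off matches the thread: a per-add sync would make every add pay
a disk write, while buffering loses only the window since the last flush,
and only if all replicas in the range fail at once.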