From: Yonik S. <yo...@ap...> - 2008-02-28 22:24:46
On Thu, Feb 28, 2008 at 5:01 PM, Doug Cutting <cu...@ap...> wrote:
> I'd like to position this against document databases, so I'm hoping it
> can be used as a primary storage.

Are you thinking of storage outside of Lucene stored fields too, then?

> Logging is attractive, since it permits easy replaying of logs when
> shipping updates between nodes. Perhaps we can instead use queries to
> enumerate changes, but that requires more thought.

Unless all fields are stored, it would be a lengthy process to extract a
single document that had been added to an index. Log replay seems more
general-purpose, since it can more easily accommodate other side effects
in a system (any changes made outside of a Lucene index).

> As for disk versus memory: if we only send updates to a single node in a
> document's range, then we should sync them to disk. If we instead send
> updates to multiple nodes in the range, then it's probably okay not to
> sync, since we already assume that not all nodes in a range will fail at
> once. The downside of this is that documents could be lost in the case
> of a datacenter-wide power failure, but I think that's acceptable.
>
> Performance will suffer considerably if we have to sync on each add. So
> my inclination is to attempt to add documents to several nodes in the
> range and not require a sync per add, buffering things in memory as
> required for good performance. Replication in memory provides
> fault-tolerance.

Sounds good. I recall from the Google File System paper how they sent
writes to multiple nodes in a chain (client->A->B->C) rather than having
the client send in parallel (client->(A,B,C)), which made a lot of sense
at the time (maximizing single-NIC bandwidth, etc.). Perhaps too much
detail right now, but it's worth keeping in mind.

-Yonik
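[Editor's note: the log-replay idea above can be sketched as a minimal append-only operation log that a replica replays to catch up, instead of trying to extract documents back out of an index. This is an illustrative sketch only; OpLog, Op, and replayInto are hypothetical names, not part of any Lucene API.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical append-only log of add/delete operations. A follower
// replays the log to reach the same state as the leader, which also
// captures side effects that never touch the index itself.
class OpLog {
    enum Kind { ADD, DELETE }

    static final class Op {
        final Kind kind;
        final String docId;
        final String body;   // null for deletes
        Op(Kind kind, String docId, String body) {
            this.kind = kind;
            this.docId = docId;
            this.body = body;
        }
    }

    private final List<Op> ops = new ArrayList<>();

    void logAdd(String docId, String body) {
        ops.add(new Op(Kind.ADD, docId, body));
    }

    void logDelete(String docId) {
        ops.add(new Op(Kind.DELETE, docId, null));
    }

    // Replay every logged operation, in order, into a replica's store.
    void replayInto(Map<String, String> replica) {
        for (Op op : ops) {
            if (op.kind == Kind.ADD) {
                replica.put(op.docId, op.body);
            } else {
                replica.remove(op.docId);
            }
        }
    }
}
```

Replaying the whole log in order reproduces adds and deletes exactly, with no need for all fields to be stored in the index.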
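[Editor's note: the chained-replication pattern Yonik mentions, where the client sends one copy and each node forwards to the next (client->A->B->C), might look like this in-memory sketch. ChainNode and its fields are illustrative assumptions, not from GFS or Lucene.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical chained replication: each node buffers the document in
// memory (no per-add sync) and forwards it to the next node, so the
// client's NIC transmits only a single copy.
class ChainNode {
    final String name;
    final ChainNode next;                           // null at the tail
    final List<String> buffer = new ArrayList<>();  // in-memory, unsynced

    ChainNode(String name, ChainNode next) {
        this.name = name;
        this.next = next;
    }

    // Accept a document, then pass it down the chain.
    void add(String doc) {
        buffer.add(doc);
        if (next != null) {
            next.add(doc);
        }
    }
}
```

With nodes wired tail-first (C, then B forwarding to C, then A forwarding to B), a single client call to A leaves the document buffered on all three replicas.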