From: Ning L. <nin...@gm...> - 2008-03-07 23:29:54
DESIGN
1 System composition and functionality
The system consists of a "master", "nodes", and "clients".
We view the docid hash value space as a "ring". Each "node"
maintains an "index" on a "range" of documents on the ring.
The "master" keeps track of the mapping between the nodes
and the ranges they index.
The "node" here refers to a virtual/logical node. A physical
"host" hosts a number of nodes.
The system supports the following operations:
- addDoc(doc), deleteDoc(doc), updateDoc(doc)
- search(query)
An "application" conducts the operations through a "client".
A client communicates with the master to obtain the mapping
between the nodes and the ranges, and communicates with the
nodes directly to add/delete/update a document or conduct
a search.
2 Partitioning
We choose to use consistent hashing for load balancing.
A node X is assigned a "position" on the ring. The "next"
node/position on the ring is X+1. The range of node X is
[X, X+1).
This is similar to range partitioning.
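The ring lookup above can be sketched in a few lines of Python. This is a minimal sketch, not part of the design: the 32-bit hash width, the MD5-based hash, and all class/function names are my own assumptions for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed hash value space for the ring

def ring_hash(docid):
    # Map a docid onto the ring with a uniform hash
    # (first 4 bytes of MD5, an arbitrary choice here).
    return int.from_bytes(hashlib.md5(docid.encode()).digest()[:4], "big")

class Ring:
    def __init__(self):
        self.positions = []  # sorted node positions on the ring
        self.nodes = {}      # position -> node name

    def add_node(self, position, name):
        bisect.insort(self.positions, position)
        self.nodes[position] = name

    def node_for(self, docid):
        # Node X serves [X, X+1): find the greatest position <= hash.
        # When the hash precedes the first position, index -1 wraps
        # around to the last node, which is exactly the ring behavior.
        h = ring_hash(docid)
        i = bisect.bisect_right(self.positions, h) - 1
        return self.nodes[self.positions[i]]
```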
3 Replication
We replicate to achieve fault tolerance. N-way replication
means that data is stored/indexed on N nodes. Assuming N=3,
the range of node X is now [X, X+3). The range [X, X+1) is
now also served by nodes X-2 and X-1 in addition to node X.
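Concretely, the replica set for a document is its owner node plus the N-1 nodes preceding it on the ring. A minimal sketch, assuming node positions are given as a sorted list (the function name and signature are hypothetical):

```python
import bisect

def replica_nodes(positions, doc_hash, n_replicas=3):
    # The "owner" is the greatest position <= doc_hash (wrapping);
    # the document is also served by the n_replicas-1 preceding
    # nodes, since node X's range widens to [X, X+N).
    i = bisect.bisect_right(positions, doc_hash) - 1
    return [positions[(i - k) % len(positions)] for k in range(n_replicas)]
```

For example, with nodes at positions 0..60 (step 10), a document hashing to 25 is owned by the node at 20 and replicated on the nodes at 10 and 0.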
4 search(query)
Based on the mapping between the nodes and the ranges,
a client sends a search request to a number of nodes. The
client specifies a search range for each node the request
is sent to, so that 1) the search range for a node is
covered by the node's range, 2) the search ranges together
cover the ring exactly once, and 3) the number of nodes the
request is sent to is minimized.
For example, if we have 7 nodes and 3-way replication,
a search can be sent to node X with search range [X, X+3),
node X+3 with search range [X+3, X+6), and node X+6 with
search range [X+6, X+7), where X+7 == X.
An alternative is to cover the ring R times (e.g. R=2)
to avoid latencies of re-querying.
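The cover in the example can be computed by stepping N positions at a time around the ring, which contacts ceil(M/N) of the M nodes. A sketch under those assumptions (names are hypothetical; ranges are returned as (start, end) pairs, with the final end wrapping to the first position):

```python
def search_cover(positions, n_replicas=3):
    # positions: sorted node positions on the ring.
    # Each chosen node searches from its own position up to the
    # position n_replicas steps ahead, so consecutive picks tile
    # the ring exactly once.
    m = len(positions)
    cover = []
    i = 0
    while i < m:
        hi = min(i + n_replicas, m)
        cover.append((positions[i], (positions[i], positions[hi % m])))
        i = hi
    return cover
```

With 7 nodes and N=3 this yields the three requests from the example: [X, X+3), [X+3, X+6), and [X+6, X+7) with X+7 == X.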
5 addDoc(doc)/deleteDoc(doc)/updateDoc(doc)
We assume each document addition, deletion or update comes
with an application-specified revision number for the
document. Without this assumption, we'd have to do
something much more complicated such as vector clocks.
Each node maintains a light-weight log which is flushed
to disk only periodically. A log is a sequence of log
entries. Each log entry is <{ADD|DEL}, docid, revision>.
A write request is sent to multiple nodes, all of whose
ranges must include the document.
When a node receives a write request for a document with
revision R, it checks that revision R is newer than the
current revision stored/indexed on the node. Then the node
stores and indexes the new revision in memory and puts a
log entry in the log.
A W-way write means at least W nodes have to successfully
serve the write request for the write to be considered
"complete". 1 < W <= N. W > 1 provides fault tolerance
for write operations.
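The node-side revision check and the W-way completion rule can be sketched as follows. This is a simplification under my own assumptions (in-memory dicts, synchronous fan-out, hypothetical names); the real log would be flushed to disk periodically as described above.

```python
class Node:
    def __init__(self):
        self.revisions = {}  # docid -> latest revision accepted
        self.log = []        # sequence of (op, docid, revision) entries

    def write(self, op, docid, revision):
        # Reject stale writes: the revision must be strictly newer
        # than the one currently stored/indexed on this node.
        if revision <= self.revisions.get(docid, -1):
            return False
        self.revisions[docid] = revision
        self.log.append((op, docid, revision))  # flushed periodically
        return True

def w_way_write(nodes, op, docid, revision, w=2):
    # The write is "complete" once at least W replicas accept it.
    acks = sum(node.write(op, docid, revision) for node in nodes)
    return acks >= w
```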
6 Consistency
Each node syncs with the nodes whose ranges overlap its own
- a.k.a. log propagation. But since in most cases W+R <= N,
the system provides eventual consistency.
Since a node checks that a revision for a write request
is newer than the current revision stored/indexed on the
node, monotonic write consistency is supported.
The client implementation should support session
consistency. And it'd be nice to support read-your-writes
and monotonic read consistency.
Now back to log propagation. Node X serves range [X, X+3)
and periodically syncs:
range [X, X+1) with node X-2
range [X, X+2) with node X-1
range [X+1, X+3) with node X+1
range [X+2, X+3) with node X+2
Here is how changes to range [X, X+1) on node X are
propagated to node X-2. Assume that during the last sync,
changes up to log entry E on node X were propagated to
node X-2, and the current log entry on node X is F.
- Now log entries (E, F] are scanned sequentially, and
those in the range [X, X+1) are sent to node X-2.
- Node X-2 checks the log entries received from node X.
For a log entry <{ADD|DEL}, docid, revision>, if the
current revision on node X-2 is older, node X-2 puts
the docid in a request list. Then node X-2 sends the
request list of docids to node X.
- Node X ships to node X-2 the stored copy of the
documents on the request list.
- Upon receiving the documents from node X, node X-2
processes them as normal write requests, which means
node X-2 will check the revisions again to ensure
monotonic write consistency, and log them.
7 Add node / remove node
A new position is assigned when a new node is added. Assume
a new node X' is assigned a position between X and X+1.
The neighbouring nodes are affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
Node X' prepares range [X', X+3) by copying index files
from one or more neighbouring nodes. Then node X' starts
to sync with the nodes with/on overlapping ranges.
A flavor of two-phase commit is used to guarantee that
no writes or write propagations are lost in the process.
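The range shifts above follow mechanically from the rule that, with N=3, each node serves from its own position up to the position three steps ahead on the ring. A sketch (the function name is hypothetical), which reproduces the before/after table when a node is inserted:

```python
def ranges(positions, n_replicas=3):
    # Each node serves [p, q) where q is the position n_replicas
    # steps ahead of p on the ring (wrapping around).
    ps = sorted(positions)
    m = len(ps)
    return {p: (p, ps[(i + n_replicas) % m]) for i, p in enumerate(ps)}
```

For example, with nodes at 0..60 (step 10), the node at 30 serves [30, 60); inserting a node at 35 shrinks that to [30, 50) and gives the new node [35, 60), matching the before/after lists above.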
Well, does it look all right? Feel free to add more details!
Cheers,
Ning
On Thu, Mar 6, 2008 at 5:47 PM, Ning Li <nin...@gm...> wrote:
> Here are the main features discussed so far:
> 1 Load balancing by using consistent hashing.
> 2 Fault tolerance by replications.
> 3 Online update by making all replicas updateable.
> 4 Eventual consistency.
> - An update is sent to W replicas before it completes.
> - Assume an application specifies doc version number.
> - Using the terms in [1], we should support session
> consistency and monotonic write consistency, and
> maybe read-your-writes and monotonic read consistency.
> 5 A document database?
> - We store documents anyway.
> - We don't support sub-document updates.
> - Do we support document versioning? Other features?
>
> Here are a few comments on the features:
> 1 Consistent hashing uses hash values because hash values
> distribute uniformly on the ring. Can we support
> application-specified keys for the ring? The difference
> is that the distribution may not be uniform so we need
> to rebalance sometimes (remove a virtual node and insert
> it somewhere else).
> 1 On the assumption that an application specifies document
> version number. It greatly simplifies things, but is
> it practical?
> 2 How much of a document database do we want it to be? I'm not
> sure if CouchDB is a typical document database...
>
>
> Design for the features is to come...
>
> Ning
>