From: Ning L. <nin...@gm...> - 2008-03-07 23:29:54
DESIGN
1 System composition and functionality
The system consists of a "master", "nodes", and "clients".
We view the docid hash value space as a "ring". Each "node"
maintains an "index" on a "range" of documents on the ring.
The "master" keeps track of the mapping between the nodes
and the ranges they index.
The "node" here refers to a virtual/logical node. A physical
"host" hosts a number of nodes.
The system supports the following operations:
- addDoc(doc), deleteDoc(doc), updateDoc(doc)
- search(query)
An "application" conducts the operations through a "client".
A client communicates with the master to obtain the mapping
between the nodes and the ranges, and communicates with the
nodes directly to add/delete/update a document or conduct
a search.
2 Partitioning
We choose to use consistent hashing for load balancing.
A node X is assigned a "position" on the ring. The "next"
node/position on the ring is X+1. The range of node X is
[X, X+1).
This is similar to range partitioning.
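The ring lookup above can be sketched in a few lines of Python. This is a minimal sketch, not part of the design: the 32-bit hash width, the MD5-based hash, and all class/function names are my own assumptions for illustration.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed hash value space for the ring

def ring_hash(docid):
    # Map a docid onto the ring with a uniform hash
    # (first 4 bytes of MD5, an arbitrary choice here).
    return int.from_bytes(hashlib.md5(docid.encode()).digest()[:4], "big")

class Ring:
    def __init__(self):
        self.positions = []  # sorted node positions on the ring
        self.nodes = {}      # position -> node name

    def add_node(self, position, name):
        bisect.insort(self.positions, position)
        self.nodes[position] = name

    def node_for(self, docid):
        # Node X serves [X, X+1): find the greatest position <= hash.
        # When the hash precedes the first position, index -1 wraps
        # around to the last node, which is exactly the ring behavior.
        h = ring_hash(docid)
        i = bisect.bisect_right(self.positions, h) - 1
        return self.nodes[self.positions[i]]
```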
3 Replication
We replicate to achieve fault tolerance. N-way replication
means that data is stored/indexed on N nodes. Assuming N=3,
the range of node X is now [X, X+3). The range [X, X+1) is
now also served by nodes X-2 and X-1 in addition to node X.
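Concretely, the replica set for a document is its owner node plus the N-1 nodes preceding it on the ring. A minimal sketch, assuming node positions are given as a sorted list (the function name and signature are hypothetical):

```python
import bisect

def replica_nodes(positions, doc_hash, n_replicas=3):
    # The "owner" is the greatest position <= doc_hash (wrapping);
    # the document is also served by the n_replicas-1 preceding
    # nodes, since node X's range widens to [X, X+N).
    i = bisect.bisect_right(positions, doc_hash) - 1
    return [positions[(i - k) % len(positions)] for k in range(n_replicas)]
```

For example, with nodes at positions 0..60 (step 10), a document hashing to 25 is owned by the node at 20 and replicated on the nodes at 10 and 0.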
4 search(query)
Based on the mapping between the nodes and the ranges,
a client sends a search request to a number of nodes. The
client specifies a search range for each node the request
is sent to, so that 1) the search range for a node is
covered by the node's range, 2) the search ranges together
cover the ring exactly once, and 3) the number of nodes the
request is sent to is minimized.
For example, if we have 7 nodes and 3-way replication,
a search can be sent to node X with search range [X, X+3),
node X+3 with search range [X+3, X+6), and node X+6 with
search range [X+6, X+7), where X+7 == X.
An alternative is to cover the ring R times (e.g. R=2)
to avoid latencies of re-querying.
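The cover in the example can be computed by stepping N positions at a time around the ring, which contacts ceil(M/N) of the M nodes. A sketch under those assumptions (names are hypothetical; ranges are returned as (start, end) pairs, with the final end wrapping to the first position):

```python
def search_cover(positions, n_replicas=3):
    # positions: sorted node positions on the ring.
    # Each chosen node searches from its own position up to the
    # position n_replicas steps ahead, so consecutive picks tile
    # the ring exactly once.
    m = len(positions)
    cover = []
    i = 0
    while i < m:
        hi = min(i + n_replicas, m)
        cover.append((positions[i], (positions[i], positions[hi % m])))
        i = hi
    return cover
```

With 7 nodes and N=3 this yields the three requests from the example: [X, X+3), [X+3, X+6), and [X+6, X+7) with X+7 == X.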
5 addDoc(doc)/deleteDoc(doc)/updateDoc(doc)
We assume each document addition, deletion or update comes
with an application-specified revision number for the
document. Without this assumption, we'd have to do
something much more complicated such as vector clocks.
Each node maintains a light-weight log which is flushed
to disk only periodically. A log is a sequence of log
entries. Each log entry is <{ADD|DEL}, docid, revision>.
A write request is sent to multiple nodes, all of whose
ranges must include the document.
When a node receives a write request for a document with
revision R, it checks that revision R is newer than the
current revision stored/indexed on the node. Then the node
stores and indexes the new revision in memory and puts a
log entry in the log.
A W-way write means at least W nodes have to successfully
serve the write request for the write to be considered
"complete". 1 < W <= N. W > 1 provides fault tolerance
for write operations.
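The node-side revision check and the W-way completion rule can be sketched as follows. This is a simplification under my own assumptions (in-memory dicts, synchronous fan-out, hypothetical names); the real log would be flushed to disk periodically as described above.

```python
class Node:
    def __init__(self):
        self.revisions = {}  # docid -> latest revision accepted
        self.log = []        # sequence of (op, docid, revision) entries

    def write(self, op, docid, revision):
        # Reject stale writes: the revision must be strictly newer
        # than the one currently stored/indexed on this node.
        if revision <= self.revisions.get(docid, -1):
            return False
        self.revisions[docid] = revision
        self.log.append((op, docid, revision))  # flushed periodically
        return True

def w_way_write(nodes, op, docid, revision, w=2):
    # The write is "complete" once at least W replicas accept it.
    acks = sum(node.write(op, docid, revision) for node in nodes)
    return acks >= w
```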
6 Consistency
Each node syncs with the nodes whose ranges overlap its own
- a.k.a. log propagation. But since in most cases W+R <= N,
the system provides eventual consistency.
Since a node checks that a revision for a write request
is newer than the current revision stored/indexed on the
node, monotonic write consistency is supported.
The client implementation should support session
consistency. And it'd be nice to support read-your-writes
and monotonic read consistency.
Now back to log propagation. Node X serves range [X, X+3)
and periodically syncs:
range [X, X+1) with node X-2
range [X, X+2) with node X-1
range [X+1, X+3) with node X+1
range [X+2, X+3) with node X+2
Here is how changes to range [X, X+1) on node X are
propagated to node X-2. Assume that during the last sync,
changes up to log entry E on node X were propagated to
node X-2, and the current log entry on node X is F.
- Now log entries (E, F] are scanned sequentially, and
those in the range [X, X+1) are sent to node X-2.
- Node X-2 checks the log entries received from node X.
For a log entry <{ADD|DEL}, docid, revision>, if the
current revision on node X-2 is older, node X-2 puts
the docid in a request list. Then node X-2 sends the
request list of docids to node X.
- Node X ships to node X-2 the stored copy of the
documents on the request list.
- Upon receiving the documents from node X, node X-2
processes them as normal write requests, which means
node X-2 will check the revisions again to ensure
monotonic write consistency, and log them.
7 Add node / remove node
A new position is assigned when a new node is added. Assume
a new node X' is assigned a position between X and X+1.
The neighbouring nodes are affected and switch from:
node X-2 serves range [X-2, X+1)
node X-1 serves range [X-1, X+2)
node X serves range [X, X+3)
to:
node X-2 serves range [X-2, X')
node X-1 serves range [X-1, X+1)
node X serves range [X, X+2)
node X' serves range [X', X+3)
Node X' prepares range [X', X+3) by copying index files
from one or more neighbouring nodes. Then node X' starts
to sync with the nodes with/on overlapping ranges.
A flavor of two-phase commit is used to guarantee that
no writes or write propagations are lost in the process.
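The range shifts above follow mechanically from the rule that, with N=3, each node serves from its own position up to the position three steps ahead on the ring. A sketch (the function name is hypothetical), which reproduces the before/after table when a node is inserted:

```python
def ranges(positions, n_replicas=3):
    # Each node serves [p, q) where q is the position n_replicas
    # steps ahead of p on the ring (wrapping around).
    ps = sorted(positions)
    m = len(ps)
    return {p: (p, ps[(i + n_replicas) % m]) for i, p in enumerate(ps)}
```

For example, with nodes at 0..60 (step 10), the node at 30 serves [30, 60); inserting a node at 35 shrinks that to [30, 50) and gives the new node [35, 60), matching the before/after lists above.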
Well, does it look all right? Feel free to add more details!
Cheers,
Ning
On Thu, Mar 6, 2008 at 5:47 PM, Ning Li <nin...@gm...> wrote:
> Here are the main features discussed so far:
> 1 Load balancing by using consistent hashing.
> 2 Fault tolerance by replications.
> 3 Online update by making all replicas updateable.
> 4 Eventual consistency.
> - An update is sent to W replicas before it completes.
> - Assume an application specifies doc version number.
> - Using the terms in [1], we should support session
> consistency and monotonic write consistency, and
> maybe read-your-writes and monotonic read consistency.
> 5 A document database?
> - We store documents anyway.
> - We don't support sub-document updates.
> - Do we support document versioning? Other features?
>
> Here are a few comments on the features:
> 1 Consistent hashing uses hash values because hash values
> distribute uniformly on the ring. Can we support
> application-specified keys for the ring? The difference
> is that the distribution may not be uniform so we need
> to rebalance sometimes (remove a virtual node and insert
> it somewhere else).
> 1 On the assumption that an application specifies document
> version number. It greatly simplifies things, but is
> it practical?
> 2 How much of a document database do we want it to be? I'm not
> sure if CouchDB is a typical document database...
>
>
> Design for the features is to come...
>
> Ning
>