Introduction

Takeru INOUE

What's Kai?

Kai is a distributed key-value datastore, which is mainly inspired by Amazon's Dynamo. It brings high scalability and availability to your Web sites. You can manage variety of contents with Kai, as if Amazon's Dynamo stores shopping carts, catalogs, session states, and so forth. Currently, Kai is deployed in a commercial service, goo home, which is a Japanese social networking service of more than 10 million users.

Kai is an open-source project started on May 2008, and hosted on sourceforge.net, which can be found at http://kai.sf.net/ . Three active developers are currently involved. The specification and problems are discussed on the lists, some of which are in Japanese.

The name of Kai is derived from a hometown of one of the developers. In the age of provincial wars, more than 500 years ago, Kai had the strongest cavalry in Japan. The key of this strength was well-organized cluster of samurais (actually, the developer is a descendant of them). Similarly, the key of Kai (as a key-value store) is well-organized cluster of computers.

Unfortunately, it's not easy to find pages related to Kai because of the un-identifiability of the name. Search Kai with a second keyword, like erlang, sourceforge, or dynamo.

Features

Kai is a datastore consisting of a group of processes (nodes), namely, a computer cluster. Each value (data) is associated with a key and stored in the cluster with the configured replication level. The cluster processes requests from clients in a distributed manner, and returns corresponding values.

Kai cluster
Fig.1: An example of a Kai cluster with six nodes and two replicated key-value pairs. The ring shows a logical relation between the nodes, but does not represent the physical topology (this is a hash ring of the Consistent Hashing as denoted later).

Kai has the following features:
key-value store with memcache API,
high scalability and elasticity for low operation cost,
load balancing and reliability
high availability with low latency and eventual consistency, and
* actor model programming for parallel processing.

Key-value store with memcache API

Kai is a key-value datastore. While key-value stores are recently gathering much attention, they are not well defined. Here, we define them as servers providing a hash table; any value (bit-string) can be stored with a key, which is used to retrieve the value.

Unlike a traditional relational model, data model of key-value stores is very simple and easy to understand it. Developers can store any data as-is, including a JSON and an XML document as well as text and an image. The data is retrieved and de-serialized (if needed) before using it. Data is identified by the key, which can be any character string not including white spaces nor line breaks.

In Kai, clients manipulate their data with memcache API. As you know, memcached is another key-value store, and is used in many Web sites. There are a number of client implementations: Perl, Ruby, Python, PHP, Java, C#, C/C++, and Erlang. Moreover, the API is easily accessed even by telnet, since it is a text-based protocol.

% telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
set foo 0 0 3
bar
STORED
get foo
VALUE foo 0 3
bar

Kai supports basic operations of memcache API, including get(s), set, delete, quit, stats, and version. However, an expiration time cannot be specified (must be zero) for the set operation, since Kai is a persistent storage, not a cache.

For details of memcach API, see the following URL,
http://code.sixapart.com/svn/memcached/trunk/server/doc/protocol.txt .

High scalability and elasticity for low operation cost

Kai is a distributed datastore consisting of a cluster of several processes called nodes. It is based on peer-to-peer architecture, and so does not have a single master node, which would be a single point of failure. The storage capacity of the cluster is roughly determined by the number of cluster nodes. The nodes are easily added to or removed from the cluster without interrupting operation. This elasticity of the capacity obviates a prior capacity planning and eliminates up-front capital expense.

Kai's cluster scales well. We conjecture that Kai scales up to hundreds of nodes, though our experiments have been done with tens of nodes because of equipment limitations. To organize cluster nodes, membership information (e.g. a list of cluster nodes) is shared among all nodes by using the epidemic (gossip) protocol, which provides rapid and robust information spreading mechanism; a node periodically exchanges information to a randomly chosen node. This simple mechanism shows exponential growth of the number of nodes sharing the information. Moreover, this information spreading is not stopped even if some nodes are going down, as epidemic diseases spread unlimitedly.

Epidemic protocol
Fig. 2: An example of the epidemic protocol; The failure of node2 is detected and shared by spreading the information randomly.

While the epidemic protocol works well despite its simplicity, we plan to implement more sophisticated mechanism like Chord for more scalability in the future version of Kai.

Load balancing and reliability

In Kai, all data (key-value pairs) are partitioned and replicated among the cluster nodes. The partition and replication rules follow the consistent hashing algorithm.

The consistent hashing provides a hash table functionality that maps keys to nodes. In traditional hash tables (e.g. modulo operation), a change in the membership of cluster nodes causes remapping of nearly all keys. This means that almost all data is reallocated whenever adding or removing nodes. This reallocation process possibly imposes large load on the nodes. In contrast, only 1/n keys are remapped in the consistent hashing, where n is the number of cluster nodes.

Reallocation ratio
Fig. 3: Reallocation ratio in the modulo operation and the consistent hashing; in contrast to the modulo, reallocation ratio is improving (decreasing) with increasing the number of nodes in the consistent hashing.

In the consistent hashing, keys and nodes are placed on the hash ring according to their hash values. A key is mapped to N following nodes on the ring clockwise; a value corresponding to the key is replicated in the N nodes. If a node of the key is going down, the next node on the ring is informed by the epidemic protocol and gets the corresponding value from the other available nodes. In this way, the number of replications is kept consistent. The number of replications can be configured for each cluster.

Consistent hashing
Fig. 4: An example of the consistent hashing; the key is stored in the following N nodes, node2, node5, and node4.

Consistent hashing, cont'd
Fig. 5: An example of the consistent hashing; when node2 is going down, the key is stored in the next node, namely, node6.

High availability with low latency and eventual consistency

Kai achieves better latency; it is a couple of milliseconds even in the worst case (of course, it actually depends on data size and configurations). This low latency is achieved by the quorum protocol as well as the parallel processing as described later. The quorum protocol also provides strong consistency model among replicas.

In the quorum protocol, a value must be successfully replicated to W nodes at least. For retrieving the value, R of N replicas are read and compared their versions to determine the latest one. The value of R and W are subject to the following two constraints,

R + W > N,
W > N/2.

If the first constraint is satisfied, the write set and the read set always overlap. If the second constraint is satisfied, there is no possibility of conflicting writes when the write sets do not overlap.

Here, assuming N:R:W = 3:2:2 (a typical configuration of Kai), it is guaranteed that two (W) replicas are the latest. If any two (R) replicas are read, at least one replica is the latest and so you can get the latest version (version comparison is automatically done in the cluster). In this way, the quorum protocol provides strong consistency. Obsoleted replicas will be updated by the epidemic protocol. Moreover, even if one of the three replicas is overloaded or temporally unavailable, it is still possible to read and write it without latency degradation.

Quorum protocol
Fig. 6: An example of the quorum protocol; a request from the client is distributed to N=3 nodes, and a reply is returned as soon as R/W=2 replis are given.

There are rare cases in which strong consistency is not guaranteed. This is because membership information shared by the epidemic protocol does not follow strong consistency model; in adding or removing nodes, it takes some time to update membership information at all nodes. In this case, nodes with old information may try to replicate a value to wrong N nodes, and old replicas remain. However, the old information and replicas will be eventually updated by the epidemic protocol. This consistency model is called eventual consistency.

Actor model programming for parallel processing

Kai is developed in Erlang, a programming language for parallel processing. This feature is suitable for Kai, since requests of Kai should be distributed to N nodes in parallel for low latency.

Standard programming style in Erlang is called actor model; many actors (processes) collaborate with each other by message passing. Actor model solves long-standing problems in parallel programming, since it does not rely on the shared memory, which often causes painful bugs. Moreover, Erlang's processes are light-weighted, and implementation of message passing is well optimized. With Erlang, parallel programming is becoming easy and safe with low overhead.

Formulation of Kai

Finally, we formulate the features of Kai,

Kai = (Amazon's Dynamo + memcache API) / Erlang.

Kai is mainly inspired by Amazon's Dynamo, which provides elasticity, reliability, and availability. Clients manipulate their data by using memcache clients written in your favorite programming languages. Kai is developed in Erlang, which is good for parallel processing.

This article was originally published in Japanese at gihyo.jp June 2009.


Related

Wiki: Configuration
Wiki: Getting Started
Wiki: Home

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks