Wednesday, March 30, 2016

Dynamo: Amazon's Highly Available Key-value Store

Before we start, note that this paper is about Dynamo and not DynamoDB. DynamoDB builds upon the principles of Dynamo, but just to be clear, these are not the same.

With over 3,000 citations, this paper is hugely responsible for the barrage of NoSQL databases (Riak, Voldemort, Cassandra, and the like) that followed.

From the title itself, the point is driven home that Dynamo's focus is availability. More specifically, Dynamo is always available for writes (assuming a bound on failing servers, of course). To achieve high availability, Dynamo sacrifices consistency: different nodes may return different values for the same key on a read.

Why a key-value store (NoSQL) and not your next-door traditional SQL database?

  1. Many applications can make do with just a primary-key interface. An RDBMS with its full-scale SQL capability is overkill for such applications.
  2. The added capabilities of an RDBMS require expensive hardware and carry an operational cost in the form of trained RDBMS admins.
  3. Many RDBMSs favor consistency over availability, which is not appropriate for applications that want to be always on and can tolerate some laxity in consistency.
  4. It's difficult to scale out relational databases. (The paper's claim, not mine!)
Dynamo favors availability over consistency and can be scaled out easily by adding more Dynamo instances to handle bigger data sets or an increased read/write load.

It should also be noted that Dynamo targets applications that need to store only small values (typically < 1 MB).

DESIGN DECISIONS:


  1. Since availability and consistency cannot be achieved simultaneously (see the CAP theorem), sacrifice consistency in favor of availability. Dynamo is designed to be an eventually consistent store, in which updates reach all the replicas eventually.
  2. Since the data store is not always consistent, there may exist different versions of the data. Such differing (possibly conflicting) versions need to be reconciled either during reading or writing. Because Dynamo's target is being always available for writes, such conflicts are resolved during reads.
  3. Dynamo has no master-slave relationships. No special responsibilities for any node. It's all peer-to-peer.

DYNAMO INTERFACE:


  1. get(key): Returns a list of object replicas (with conflicting versions), or a single object if no conflicts exist in the system. In addition, a context object is returned that includes metadata such as the version of the object.
  2. put(key, context, value): A hash of the key is used to decide the Dynamo instances where the object replicas will be stored. Note that a single key-value pair is stored at multiple Dynamo instances for high durability.
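
As a rough sketch (the class and type names here are mine, not the paper's), the client-facing interface boils down to something like this:

    # Hypothetical sketch of Dynamo's client interface. The paper specifies
    # only get(key) and put(key, context, value); everything else is assumed.
    from typing import Any, List, Tuple

    class DynamoClient:
        def get(self, key: bytes) -> Tuple[List[bytes], Any]:
            """Return (versions, context). `versions` holds a single object
            if no conflict exists, or all conflicting replicas otherwise;
            `context` carries opaque metadata such as the object's version."""
            raise NotImplementedError

        def put(self, key: bytes, context: Any, value: bytes) -> None:
            """Store `value` under `key`. A hash of the key picks the Dynamo
            instances that hold the replicas; `context` should come from a
            preceding get, so the store knows which version is superseded."""
            raise NotImplementedError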


SYSTEM ARCHITECTURE:



1. PARTITIONING

Dynamo uses consistent hashing to distribute keys among the Dynamo instances. In particular, each node corresponds to multiple virtual nodes on the consistent-hash ring, which allows uniform distribution of keys (and uniform redistribution when nodes die or resurrect). Heterogeneity among servers is handled by assigning each node a number of virtual nodes proportional to its capacity.
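
A minimal sketch of the partitioning scheme, assuming MD5 as the ring hash (which is what the paper uses for keys) and a made-up vnode count:

    import bisect
    import hashlib

    def ring_hash(s: str) -> int:
        # position on the ring: MD5 digest interpreted as an integer
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes_per_node=100):
            # each physical node appears as many virtual points on the ring;
            # beefier servers would simply get more virtual nodes
            self.points = sorted(
                (ring_hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes_per_node))

        def successor(self, key: str) -> str:
            # the first virtual node clockwise from the key's position
            h = ring_hash(key)
            i = bisect.bisect(self.points, (h,))
            return self.points[i % len(self.points)][1]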

2. REPLICATION

High durability is achieved in Dynamo by ensuring that each key-value pair is replicated on N Dynamo servers. The hash of the key is used to select a coordinator node, which ensures N-way replication.
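
Reusing the Ring sketch above, the N replica holders for a key can be picked by walking clockwise from the key's position and skipping virtual nodes that map to already-chosen physical servers (my reading of the paper's scheme):

    import bisect

    def preference_list(ring: Ring, key: str, n: int):
        # walk clockwise collecting n distinct physical nodes;
        # the first one acts as the coordinator for this key
        start = bisect.bisect(ring.points, (ring_hash(key),))
        chosen = []
        for j in range(len(ring.points)):
            node = ring.points[(start + j) % len(ring.points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == n:
                break
        return chosen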

3. DATA VERSIONING

Since Dynamo is an eventually consistent system, a put may return to the caller before the update has been applied to all the replicas. Therefore, multiple versions of an object (key-value pair) can co-exist in the system. Such conflicting versions are reconciled on a read, so that the system is always write-available.

4. EXECUTING GET & PUT

A get/put operation in Dynamo happens via a so-called coordinator node. There are four parameters involved here:

  • T: Total number of nodes in the system.
  • N: Total number of object replicas (number of distinct copies of a key-value pair).
  • R: Number of nodes that must participate in a read/get op for it to be successful.
  • W: Number of nodes that must participate in a write/put op for it to be successful.
Note that typically, T > N.

Each node in the Dynamo system maintains a preference list - a list of nodes that decides where each key-value pair should be stored (the top node being the coordinator node). On receiving a get request, the coordinator node (usually the first node in the preference list) sends the request to the top N healthy nodes from the preference list (the list contains more than N nodes precisely to tolerate nodes dying away), and waits for at least R - 1 responses before returning to the caller. Some of these R - 1 responses (or even the coordinator's local copy) may be stale or in conflict; such conflicts are either resolved by Dynamo itself (using a simple policy like last-write-wins) or left to the client to reconcile, by returning all the conflicting versions.

Similarly, for a put operation, the coordinator updates the version of the key-value pair and sends it to the top N healthy nodes from the preference list. The write succeeds once the coordinator receives responses from W - 1 nodes (-1 because the coordinator node itself saves the new version as well).
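
A toy simulation of the two quorum paths (entirely mine; each replica is just a dict, and None stands for a crashed node):

    N, R, W = 3, 2, 2                       # the typical Amazon configuration

    def quorum_put(replicas, key, version, value):
        acks = 0
        for r in replicas[:N]:              # the N chosen replicas
            if r is not None:               # node is up and acks the write
                r[key] = (version, value)
                acks += 1
        return acks >= W                    # success iff at least W acks

    def quorum_get(replicas, key):
        replies = [r[key] for r in replicas[:N]
                   if r is not None and key in r]
        if len(replies) < R:
            return None                     # read quorum not met
        return max(replies)                 # crude last-write-wins reconcile

    nodes = [{}, {}, None]                  # third replica is down
    assert quorum_put(nodes, "k", 1, "v")   # 2 acks >= W: write succeeds
    print(quorum_get(nodes, "k"))           # -> (1, 'v')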

Note that because the write succeeds after responses from only W nodes, a following read may return a stale value (assuming W < N, of course).

5. SLOPPY QUORUM (?)

For the case where N = T, R + W > N guarantees full consistency, i.e., any read reflects the latest write. This is because with R + W > N, any set of R nodes must intersect any set of W nodes (two disjoint subsets of the N nodes can together contain at most N members), so any successful read includes at least one node that committed the previous write.

But in the Dynamo case, N < T (to be able to handle failures), and even if we have R + W > N, this does not really provide any consistency guarantee, because the top N healthy nodes (and not a fixed set of N) out of the T nodes are selected for each read/write operation. Since the subset of N nodes is not guaranteed to be the same for a write and a following read, there is no guarantee that the R nodes of the read intersect even one of the W nodes of the previous write.

At the most basic level, it is the transient nature of the N nodes (the top N healthy nodes out of T) that is responsible for the sloppiness in Dynamo's consistency guarantees. That's why Dynamo is only eventually consistent, and not fully consistent (even with R + W > N).
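
A toy illustration of that transience (made-up node names; T = 5, N = 3, R = W = 2, so R + W > N holds):

    prefs = ["A", "B", "C", "D", "E"]       # preference-list order, T = 5

    def top_n_healthy(down, n=3):
        # the N-node set shifts depending on which nodes are reachable
        return [x for x in prefs if x not in down][:n]

    write_set = top_n_healthy(down={"A"})[:2]        # W acks from ["B", "C"]
    read_set = top_n_healthy(down={"B", "C"})[:2]    # R replies from ["A", "D"]
    assert not set(write_set) & set(read_set)        # disjoint: stale read!

The write lands on B and C while A is unreachable; a later read, issued while B and C happen to be down, gathers its R replies from A and D, neither of which has seen the write.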

6. REPLICA SYNCHRONIZATION

Dynamo uses Merkle trees for faster synchronization of the object replicas and to reduce the amount of data transferred. The leaves of the Merkle tree correspond to hashes of values of individual keys. The parents higher up contain the hash of their children. Each Dynamo server maintains a separate Merkle tree for the key-range corresponding to a virtual node (each Dynamo node corresponds to multiple virtual nodes on the consistent-hash ring, so each node has multiple Merkle trees). To detect differences, two nodes first compare the roots of their Merkle trees. If they are the same, the key-ranges are in sync. If not, one moves progressively down the trees until all the differing keys are identified. Merkle trees thus reduce the data transferred and speed up the detection of differences.
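
A compact sketch of this comparison (my construction, not the paper's; it assumes both replicas hold the same keys for the range, with possibly differing values):

    import hashlib

    def h(*parts: bytes) -> bytes:
        return hashlib.sha256(b"|".join(parts)).digest()

    def build(items):
        # items: a sorted list of (key, value) byte pairs for one key-range
        if len(items) == 1:
            key, value = items[0]
            return {"hash": h(key, value), "key": key}     # a leaf
        mid = len(items) // 2
        left, right = build(items[:mid]), build(items[mid:])
        return {"hash": h(left["hash"], right["hash"]),    # an inner node
                "left": left, "right": right}

    def differing_keys(a, b):
        if a["hash"] == b["hash"]:
            return []                       # identical subtrees: prune here
        if "key" in a:                      # reached a differing leaf
            return [a["key"]]
        return (differing_keys(a["left"], b["left"]) +
                differing_keys(a["right"], b["right"]))

    r1 = build([(b"k1", b"v"), (b"k2", b"v"), (b"k3", b"old")])
    r2 = build([(b"k1", b"v"), (b"k2", b"v"), (b"k3", b"new")])
    print(differing_keys(r1, r2))           # -> [b'k3']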

It's not clear though from the paper how frequently Merkle trees are used for such synchronization.

7. LOCAL PERSISTENCE ENGINE

One of the interesting things about Dynamo is the ability to plug in different persistence engines. For example, to store values of a few tens of KB one could use Berkeley DB, while for larger objects one could go for MySQL.
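
The pluggability amounts to coding against a small node-local storage interface; a hypothetical sketch (the engine names are real, the interface is my guess):

    from abc import ABC, abstractmethod

    class PersistenceEngine(ABC):
        # the minimal contract the rest of a Dynamo node would need;
        # Berkeley DB or MySQL would sit behind the same two methods
        @abstractmethod
        def get(self, key: bytes) -> bytes: ...

        @abstractmethod
        def put(self, key: bytes, value: bytes) -> None: ...

    class InMemoryEngine(PersistenceEngine):
        # stand-in engine, handy for testing the sketches above
        def __init__(self):
            self.data = {}

        def get(self, key: bytes) -> bytes:
            return self.data[key]

        def put(self, key: bytes, value: bytes) -> None:
            self.data[key] = value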

8. TUNING DYNAMO

By varying N, R and W, one can affect the durability, availability and consistency characteristics of Dynamo. For example, by setting N = T (total nodes) and W = N, one achieves full consistency, as the write won't complete until it has been committed on all the nodes.

By setting R = 1 along with W = N = T, on the other hand, one obtains a high-performance read engine (with full consistency).

By setting W = 1, one gets high availability (writes don't fail as long as there is a single node in the system) at the cost of durability (if that one node fails, the write is lost forever).

Typical values of (R = 2, W = 2, N = 3) are used in Amazon's applications.

9. OPTIMIZING WRITES/READS

An interesting optimization in Dynamo is the use of so-called buffered writes. To improve the latency of reads/writes (at the cost of durability), Dynamo allows in-memory writes (instead of writing to the persistent store). Similarly, for reads, the key is first checked for in the in-memory buffer. A writer thread in the background periodically flushes updates to the persistent store. Not having to touch the persistent store on every read/write was found to improve the 99.9th percentile write latency by a factor of 5.
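
A minimal sketch of the buffered-write idea, assuming a plain dict as the buffer and a daemon thread as the background writer (the paper gives no code):

    import threading
    import time

    class BufferedStore:
        def __init__(self, engine, flush_every=1.0):
            self.engine = engine            # the persistent store underneath
            self.buffer = {}                # in-memory writes awaiting flush
            self.lock = threading.Lock()
            threading.Thread(target=self._flush_loop, args=(flush_every,),
                             daemon=True).start()

        def put(self, key, value):
            with self.lock:                 # fast path: memory only
                self.buffer[key] = value

        def get(self, key):
            with self.lock:                 # check the buffer first
                if key in self.buffer:
                    return self.buffer[key]
            return self.engine.get(key)     # fall back to the persistent store

        def _flush_loop(self, interval):
            while True:                     # background writer thread
                time.sleep(interval)
                with self.lock:
                    pending, self.buffer = self.buffer, {}
                for k, v in pending.items():
                    self.engine.put(k, v)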

To limit the durability risk, the coordinator ensures that at least one of the W nodes selected by the coordinator performs the write on the persistent store (and not just in the buffer). Of course, this still doesn't guarantee absolute durability: an update may be lost forever if, god forbid, the node performing the durable write goes down. The loss of a single node in the system of N nodes can thus lose an update (that's pretty dicey, I think!).

10. DIVERGENT VERSIONS

With its sloppy quorum, one expects multiple versions of an object to exist in the system. In practice (at Amazon), however, 99.94% of requests saw exactly one version of an object, 0.00057% saw 2 versions, 0.00047% saw 3, and 0.00009% saw 4 versions - which is a bit surprising ;)

11. LOAD BALANCING

I would prefer a separate (albeit short) post covering the various variants of consistent hashing. Watch out for it.

KEY TAKEAWAYS


  1. A key takeaway here is that designing a highly available system requires sacrifices in terms of consistency. To achieve a key-value store that is always available for writes, we have to tolerate the existence of multiple versions of an object in the system (which then have to be reconciled somehow).
  2. Being sloppy (quorum-wise) can help achieve high availability.
  3. R, W, N, T can be varied as per one's needs. One can get a fully consistent system in Dynamo as well by setting W = N = T. 
  4. Consistent hashing is good. Use it to partition load.
  5. Merkle trees are good as well. Use them to detect differences between two sets of key-value pairs while transferring minimal data.
  6. One can construct highly available and scalable systems on top of things like MySQL (Dynamo uses it as a local persistence engine).

REFERENCE

[1] http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf