Sunday, April 10, 2016

Zookeeper: A Wait-Free Coordination Kernel

WHAT IS ZOOKEEPER ANYWAY?

A service for coordinating processes of a distributed application.

Some common forms of coordination required in a distributed application include leader election, name registry, configuration management, group membership etc. Zookeeper provides a set of primitives which can be used to implement solutions to such common coordination problems. 

A common way of using Zookeeper is through the Apache Curator framework, which is a high-level API over Zookeeper and also provides recipes: ready-to-use, out-of-the-box solutions for things like leader election, shared locks, etc. See http://curator.apache.org/curator-recipes/index.html

TERMINOLOGY

Client: A user of the Zookeeper service

Server: A process providing Zookeeper service. Usually, we have an ensemble of such processes for replication and high availability.

Znode: An in-memory node to hold Zookeeper data.

Data Tree: The Znodes are arranged in a hierarchical name space much like the traditional file system. The complete set of Znodes is referred to as the Data Tree. 

SERVICE OVERVIEW

Zookeeper provides an abstraction of data nodes (called Znodes) in a hierarchical namespace. A Znode is referred to (for a read/write operation) by its full path, much like in a UNIX file system. For example, a Znode p_3 created under the node app1 is referred to using the path /app1/p_3, where / is the path of the root Znode and /app1 is the path of the node app1.


Clients connect to Zookeeper via sessions and perform read/write operations on Znodes using their full paths. 

There are 2 types of Znodes: 
  1. Regular: Clients manipulate regular znodes by creating and deleting them explicitly. 
  2. Ephemeral: Clients delete such Znodes explicitly, or let the system remove them automatically when the session that created them terminates (deliberately or due to failure).
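
As an illustration, here is a minimal sketch using the vanilla Zookeeper Java client. The connection string, paths and data are made up for illustration; a real client would handle connection events and exceptions more carefully.

  import java.util.concurrent.CountDownLatch;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class ZnodeTypesDemo {
      public static void main(String[] args) throws Exception {
          CountDownLatch connected = new CountDownLatch(1);
          // Open a session to a (hypothetical) local server; wait until it is established.
          ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {
              if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                  connected.countDown();
              }
          });
          connected.await();

          // Regular znode: lives until some client deletes it explicitly.
          zk.create("/app1", "app data".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

          // Ephemeral znode: removed automatically when this session terminates.
          zk.create("/app1/p_3", "process data".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

          zk.close(); // /app1/p_3 disappears with the session; /app1 stays
      }
  }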

WATCHES

Zookeeper implements watches on Znodes so that clients can be notified of changes to a Znode's value without polling. When a client reads the value of a node, it can optionally set a watch on it, so that whenever that value changes, the client is notified. Note that watches are one-shot affairs and have to be set again once a notification for a changed value is received. Also, watches do NOT persist across sessions.
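
To make the one-shot nature concrete, here is a hedged sketch of a watcher that re-reads (and thereby re-registers itself) on every change notification. The path is whatever you pass in, and the ZooKeeper handle is assumed to be already connected.

  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooKeeper;
  import org.apache.zookeeper.data.Stat;

  public class ReRegisteringWatcher implements Watcher {
      private final ZooKeeper zk;

      public ReRegisteringWatcher(ZooKeeper zk) { this.zk = zk; }

      // Read the znode and register this watcher in the same call.
      public byte[] readAndWatch(String path) throws Exception {
          return zk.getData(path, this, new Stat());
      }

      @Override
      public void process(WatchedEvent event) {
          if (event.getType() == Event.EventType.NodeDataChanged) {
              try {
                  // The notification does not carry the new value; re-read to get it.
                  // Re-reading with this watcher also arms the (one-shot) watch again.
                  byte[] latest = readAndWatch(event.getPath());
                  System.out.println("changed: " + new String(latest));
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }
      }
  }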

DATA MODEL

Although similar in spirit to a traditional file system, an important difference in Zookeeper is that Znodes can be read or written only in full; partial reads and writes are not possible. 
Also, Znodes are not designed for general-purpose data storage. By default, a Znode can store at most 1 MB of data.

Znodes are also associated with timestamps (to record when a change was made) and versions (to allow for conditional updates).
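
For example, a version-conditioned update (a compare-and-set of sorts) could look like the following sketch. The counter path is made up, the znode is assumed to hold a numeric string, and zk is assumed to be an already-connected handle.

  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooKeeper;
  import org.apache.zookeeper.data.Stat;

  public class ConditionalUpdate {
      static void bumpCounter(ZooKeeper zk) throws Exception {
          Stat stat = new Stat();
          byte[] raw = zk.getData("/app1/counter", false, stat); // value + current version
          long next = Long.parseLong(new String(raw)) + 1;
          try {
              // Succeeds only if nobody else has written since our read.
              zk.setData("/app1/counter", Long.toString(next).getBytes(), stat.getVersion());
          } catch (KeeperException.BadVersionException e) {
              // Lost the race: re-read and retry if needed.
          }
      }
  }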

CLIENT API

Following is the API exposed by the Zookeeper service (Apache Curator builds on top of this vanilla API):
  1. create(path, data, flags): Create a Znode with the ‘data’. Flags indicate whether the node is to be regular/ephemeral etc.
  2. delete(path, version): Delete a Znode if its version matches the specified version.
  3. exists(path, watch): Check for the existence of a Znode and set a watch if it does (to notify of its deletion).
  4. getData(path, watch): The read operation on a Znode at the specified path. Optionally sets a watch on the Znode so the client is notified if its value changes. Note that the notification does not contain the changed value, only the fact that the value has changed; the client needs to do another read to get the updated value.
  5. setData(path, data, version): The proverbial write on a Znode. Conditional upon version matching.
  6. getChildren(path, watch): Get the children of the Znode at the specified path, optionally setting a watch.
  7. sync(path): Covered later.
An important point to note here is that there are no open/close calls in this API, since all methods use the full path name of the Znode. This not only simplifies the API, but also spares the servers from storing unnecessary state (such as open handles). Win win all around!

All the above calls have an async as well as a sync version. The async version allows a client to have multiple calls outstanding on Zookeeper whilst keeping latency low for each of them. 
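
To illustrate, here is a rough sketch of the same read in its blocking and asynchronous flavors. The paths are illustrative and zk is assumed to be an already-connected handle.

  import org.apache.zookeeper.AsyncCallback;
  import org.apache.zookeeper.ZooKeeper;
  import org.apache.zookeeper.data.Stat;

  public class SyncVsAsyncRead {
      static void demo(ZooKeeper zk) throws Exception {
          // Synchronous flavor: blocks until the server responds.
          byte[] value = zk.getData("/app1/p_1", false, new Stat());
          System.out.println("sync read: " + new String(value));

          // Asynchronous flavor: fire several reads back to back; callbacks arrive later,
          // so many operations can be outstanding at once (and complete in FIFO order).
          AsyncCallback.DataCallback cb = (rc, path, ctx, data, stat) ->
                  System.out.println(path + " -> rc=" + rc);
          for (int i = 1; i <= 3; i++) {
              zk.getData("/app1/p_" + i, false, cb, null);
          }
      }
  }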

ZOOKEEPER GUARANTEES

  1. Linearizable writes: All writes appear to take effect in some single order, and that perceived order of the writes is the same across all clients. That is, if a client observes (by doing reads) that a write W1 was performed followed by W2, every other client in the system observes either W1 followed by W2, or just W2, but never W2 followed by W1.

    Watch notifications are also processed along with writes. What this means is that if a client receives notifications in the order: N1 (for a write W1) followed by N2 (for a write W2), all other clients in the system receive notifications in the same order (or just the N1 notification, as it is sufficient to suggest that the value with the client has gone stale).
  2. FIFO order of client operations: Operations executed by a client are performed in FIFO order. Note that the FIFO order is imposed upon the ops performed by a client, not upon the ops performed across clients.

PRIMITIVES USING THE ZOOKEEPER

1. NAME REGISTRY

A name registry is a registry (at a well-defined location) that publicizes the availability of services at particular IP address/port combinations. Implementing such a registry is very simple with Zookeeper. All one needs is a well-defined Znode under which each service creates a new Znode with its own IP address/port when it becomes available. Any client that has done a getChildren on the well-defined Znode with a watch gets to know about newly added/deleted Znodes as services become available/go away. Clients can also set watches on the individual service Znodes to be notified of any changes in a service's address/port.
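
A hedged sketch of this, with a made-up well-defined parent path of /services. Using ephemeral znodes here is one possible choice, so that entries for dead services drop out automatically.

  import java.util.List;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class NameRegistry {
      // A service instance advertises its address under the well-defined znode.
      static void register(ZooKeeper zk, String serviceName, String hostPort) throws Exception {
          zk.create("/services/" + serviceName, hostPort.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      }

      // A client lists the available services and leaves a watch for changes.
      static List<String> discover(ZooKeeper zk) throws Exception {
          return zk.getChildren("/services", event ->
                  // NodeChildrenChanged: a service appeared or went away; call
                  // getChildren again to refresh the list and re-arm the watch.
                  System.out.println("registry changed: " + event.getType()));
      }
  }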

2. GROUP MEMBERSHIP

In this situation, a process wants to know which other processes are members of the same group as itself. This is again as simple as each process creating an ephemeral node under a well-defined Znode, with, say, its own pid as the name of the node. Any process that wants to know the group members just reads all the children of the well-defined Znode. Since each process creates an ephemeral node, the node disappears from the tree (and from the group as well) if its associated session dies due to a timeout/failure etc., and thus only active members of the group are present as Znodes under the well-defined Znode.
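
A minimal sketch, assuming a well-defined parent znode at /group (an illustrative path):

  import java.util.List;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class GroupMembership {
      // Each process joins by creating an ephemeral znode named after its pid.
      static void join(ZooKeeper zk, long pid) throws Exception {
          zk.create("/group/" + pid, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          // No explicit leave is needed: if the session dies, the znode goes with it.
      }

      // Current members are simply the children; 'true' also sets a watch for changes.
      static List<String> members(ZooKeeper zk) throws Exception {
          return zk.getChildren("/group", true);
      }
  }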

3. CONFIGURATION MANAGEMENT

Again, this is as simple as writing the config information to a well-defined Znode, say Zc. Any client that reads the Znode Zc with a watch set gets notified of subsequent configuration changes.
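
The writer side is just a plain write to that znode. A sketch, assuming the config lives at an illustrative path /config:

  import org.apache.zookeeper.ZooKeeper;

  public class ConfigPublisher {
      // Publish a new configuration; version -1 means an unconditional write.
      // Clients that read /config with a watch set are notified and re-read.
      static void publish(ZooKeeper zk, byte[] newConfigBytes) throws Exception {
          zk.setData("/config", newConfigBytes, -1);
      }
  }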

4. DISTRIBUTED LOCKS

Omitted here for the sake of brevity (although this post is not "brief" by any yardstick). See the full paper linked under Reference below.

5. LEADER ELECTION

Hint: Similar to Group Membership 
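
One common recipe (a sketch, not the only way): each candidate creates an ephemeral sequential znode under a well-known parent such as /election (an illustrative path, assumed to exist), and the candidate owning the lowest sequence number is the leader. If the leader's session dies, its znode vanishes and the election can be re-run.

  import java.util.Collections;
  import java.util.List;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class LeaderElection {
      static boolean runForLeader(ZooKeeper zk, String myId) throws Exception {
          // Ephemeral + sequential: the server appends a monotonically increasing suffix.
          String myZnode = zk.create("/election/candidate-", myId.getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
          List<String> candidates = zk.getChildren("/election", false);
          Collections.sort(candidates);
          // Leader iff our znode has the smallest sequence number among the candidates.
          return myZnode.endsWith(candidates.get(0));
      }
  }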

ZOOKEEPER IMPLEMENTATION

1. WRITES

In Zookeeper, a write involves first transforming the write operation into a transaction, then atomically broadcasting the transaction to all Zookeeper servers, followed by committing the transaction to the in-memory database at each replica. Atomic broadcast of the transaction ensures that each non-faulty server eventually gets the broadcast message (which it can subsequently commit). Note that the atomic broadcast succeeds only if the broadcast transaction reaches at least a quorum (majority) of servers. This ensures that a write operation persists across any number of failures as long as there eventually exists a quorum of non-faulty servers. 

For durability purposes, every committed transaction is recorded in a replay log, which is appended to before any changes are made to the in-memory database. 

To be more specific, a write operation is forwarded to the Zookeeper cluster leader, which then performs the atomic broadcast of the transaction. A write operation for a client completes once the local server (to which the client is connected) reflects the write in its in-memory database. This means that when a write returns, other replicas in the Zookeeper cluster may not yet reflect the same value for the affected Znode. Although Zookeeper replicas never diverge, they may lag each other in terms of transactions applied. 

Since writes involve disk writes and coordination through atomic broadcast, they are relatively slow and costly.

2. READS

Reads in Zookeeper are served from the in-memory database of the locally connected server. They do not involve any disk access or complex coordination, and hence are very fast (compared to writes). 

Note that since reads refer only to the local copy of the in-memory database, a read may return a value that lags behind more recent writes that have already been committed elsewhere. As such, Zookeeper is only an eventually consistent system in which reads may return stale values for some time. 

3. SLOW READS

Some applications may not be fine with getting stale values on a read. For such applications, a sync op is provided that ensures that all writes pending at the start of the sync operation are applied before the result of a following read is returned. This (sync --> read) combination is referred to as a slow read. However, note that even this may not return the most recent value, in that it may miss writes that happened after the sync was issued and before the read happened. So, that's a gotcha.
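
A sketch of the sync-then-read pattern with the Java client, where sync is asynchronous, so the read is issued from its callback. The path /config is illustrative and zk is assumed to be connected.

  import org.apache.zookeeper.ZooKeeper;
  import org.apache.zookeeper.data.Stat;

  public class SlowRead {
      static void slowRead(ZooKeeper zk) {
          // sync() asks the local server to catch up with pending writes before we read.
          zk.sync("/config", (rc, path, ctx) -> {
              try {
                  byte[] fresh = zk.getData("/config", false, new Stat());
                  System.out.println("value after sync: " + new String(fresh));
              } catch (Exception e) {
                  e.printStackTrace();
              }
          }, null);
      }
  }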

KEY TAKEAWAYS

  1. Don't reinvent the wheel. Use Zookeeper as a template for solutions to things like leader election, config management, distributed locks, group membership, etc. See the Curator recipes page linked above for a more detailed list.
  2. Zookeeper reads are fast. Writes not so much.
  3. Zookeeper is an eventually consistent system. Reads may return stale values.
  4. Zookeeper is sort of a key-value store, where values are no bigger than 1 MB (by default).
  5. Watches are good. Really good. Much better than polling. 

REFERENCE

The original paper: http://static.cs.brown.edu/courses/cs227/archives/2012/papers/replication/hunt.pdf
