Consensus is a fundamental problem in distributed computing and in this post we will see exactly why that is the case. In the spirit of modularity, we investigate the essence of distributed computation and seek fundamental building blocks with which we can compose larger systems. A consensus object is one such building block forming the core of a universal construction which provides a linearizable wait-free implementation of any other object given its sequential specification. In what follows, we shall define the notions of wait-freedom, linearizability and consensus. Furthermore, we discuss sequential specifications and what it means for one object to implement another. That consensus is at the heart of a universal construction is a reasonably natural notion. After all, the difficulty of distributed computation comes from the absence of common knowledge about the system and consensus gives us just that — common knowledge. To make our intuitions more precise, we begin by establishing a system model on which we base our assertions.
The system consists of objects accessed concurrently by processes, and both can be modeled by IO automata. In essence, an IO automaton is a state machine, consisting of a set of states and a set of events depicting transitions between states. A process is a sequential thread of control and its events model interactions with objects — an invoke event corresponds to the invocation of an operation on an object and a receive event corresponds to the receipt of a response from an object. An object is a data structure shared by processes and its events model invocations of operations by processes — an invoke event corresponds to the invocation of an operation and a respond event corresponds to the response. An operation is therefore delimited by two events — the invocation and the response. The invoke event on a process is an output event — an outgoing communication, while the invoke event on an object is an input event — an incoming communication. Similarly, the response event on a process in an input event, and the response event on an object is an output event. Notice how an invoke event on a process and object are output and input events respectively, forming a symmetric matching pair when they refer to the same process and object. Automata can be composed by matching their respective events. The states of the resulting composite automaton are tuples of states of the constituent automata, and the set of events is the union of events of the constituent automata. For example, a queue is an object with two operations: enqueue which given an object, adds it to the queue, and dequeue which returns the object at the head of the queue, or a null value if the queue is empty. A queue may be accessed concurrently by multiple processes which together form a concurrent system.
A collection of events resulting from execution of processes and objects in a system is called a history. Events at individual automata are totally ordered, which means that given any pair of events we can determine which came first. We can extend this total order of events to a partial order on operations. An operation O comes before operation P if the response event to O is received before the invocation event of P. The reason this ordering is partial is because some operation may overlap. That is to say, operations may be concurrent. Recall that we referred to a process as a sequential thread of control. We can formalize this notion using histories. Given a history H, a sub-history belonging to a specific automaton A is denoted H|A. A history of events for a process H|P consists of alternating, matching invocation-response pairs. In other words, a process consists of a totally ordered set of operations. A totally ordered history of operations is called a sequential history. The difficulty in implementing concurrent objects is that in general, histories are not sequential and operations from different processes can overlap, resulting in a possibility of ill-defined states. For example, the dequeue operation on a queue might first check if the queue is empty and if it isn’t return the item at the head of the queue, otherwise return null. However, if a process invokes the dequeue operation while a dequeue operation from another process is still in progress, they could be dequeue the same item, violating the invariant of a queue.
A sequential specification is a specification of an object assuming a sequential history. Under these circumstances, we can regard an object as a state machine with a transition function δ, such that δ(s,op(args)) — given a state and an invocation of an operation — returns a pair consisting of a new state s’ and a value res to return to the calling process. The sequential specification can then be defined as a set of pre-conditions and post-conditions on object states before and after the execution of operations. For example, we might say that the state of a queue object doesn’t contain an item before an enqueue operation, and does contain the enqueued item after an enqueue operation. Sequential specifications serve as a recipe for implementing an object in a universal construction.
A simple way to make a concurrent implementation of an object is through the use of locks. Locks or mutexes enclose critical sections wherein only a single process can execute at a given time. This allows the programmer to use sequential reasoning to assert the correctness of an algorithm. The use of locks however isn’t ideal — a faulty process can stall or arbitrarily delay the execution of other processes. The absence of locks in an implementation is called lock-freedom. An implementation is lock free if at least one thread is guaranteed to make progress. Use of a lock makes this guarantee impossible — if two threads are contending for a resource, and the one who acquires it first stalls, the second thread will be deadlocked. Wait-freedom is a stronger progress condition which guarantees that each process can make progress in a finite number of steps regardless of the behavior of other processes. Thus any wait-free implementation is also lock-free, but not necessarily so the other way around. In what follows, we will consider wait-free implementations of an object based on another object. Progress conditions are a particularly important consideration in a distributed system, where independent failures are more common and the scheduler may consist of multiple disparate components.
Wait-freedom is also an important notion for theoretical reasons because it applies to all processes and it is independent of the process scheduler — so long as processes are scheduled, wait-freedom guarantees progress for all processes. By contrast, lock-freedom only guarantees progress for some process. And a lock-based procedure relies on the scheduler to provide starvation-freedom, by ensuring that processes eventually leave the critical section. Wait-freedom is therefore starvation-freedom in presence of failures.
In order to be able to transfer the correctness of a sequential system to a concurrent system, we must define a correctness condition for concurrent objects. Linearizabilty is such a correctness condition and it states that a concurrent object is linearizable if it all possible histories involving that object and any number of processes can be linearized. A linearization of a concurrent history H is a valid sequential history S wherein the partial order on operations imposed by H is respected by S. Therefore, S completes the partial order defined by H, but agrees with H on orderings that H does define. Linearizabilty is useful because it allows us to transfer the reasoning about a sequential system to reasoning about a concurrent system, significantly simplifying analysis. Moreover, linearizabilty is a local property in that if a system consists of linearizable objects, the system is itself linearizable. In other words, linearizabilty composes.
Figure 1 bellow depicts two processes interacting with an object. The solid disks represent events, histories are horizontal lines, operations are delimited by invoke and respond events and form matching pairs on process and object histories. The interaction involves two operations, and the operations overlap. Even though we may impose a total order on events across all three histories, the two operations are concurrent. There are therefore two linearizations of this history — one where operation 1 happens first and another where operation 2 happens first.
A consensus object is a concurrent object with a single operation propose which behaves in accordance with the following sequential specification:
let propose value =
if state = null then
state <- value
A consensus object must adhere to the following conditions:
- Consistency: all processes invoking the consensus object receive the same value.
- Integrity: the value returned by the consensus object is a value proposed by some process.
- Termination: the operation is wait-free.
The first two conditions are safety properties — they assert that consensus does what one would expect it to do. The last condition is a liveness property — it asserts that the operation completes and tolerates failures in participating processes. The transition function δ in the sequential specification of a consensus object would take the state as the first argument, and the proposed value as the second argument and execute in accordance with the code above.
To show that there is a wait-free implementation of an object using another, we can map the objects to their consensus number and compare the numbers. A consensus number is the number of processes for which an object can solve consensus. If one object X can solve consensus for an equal to or greater than number of processes than object Y, then X can be used to implement Y. For example, a shared register cannot be used to solve consensus for even two processes. An test-and-set object can be used to solve consensus for at most two processes. A compare-and-exchange object on the other can be used to solve consensus for any number of processes — its consensus number is ∞. In this way, a hierarchy is formed enumerating objects by their synchronization power:
| Object | Consensus Number |
| MRSW register | 1 |
| test-and-set | 2 |
| cas | ∞ |
| queue w/ peek | ∞ |
Consensus numbers were introduced by Maurice Herlihy in Wait-Free Synchronization. It is a natural progression from his earlier work on Linearizability. A central result proved in the paper is the following theorem:
If [object] X has consensus number n, and Y has consensus number m < n, then there exists no wait-free implementation of X by Y in a system of more than m processes.
The proof proceeds by contradiction — it assumes the existence of an implementation, then represents the implementation as a composite IO automaton and finally shows the contradiction. This theorem is significant because it mathematically defines the implementable-by relation. Herein, it allows us to establish that a universal construction can wait-free implement any other object.
An object is universal if it can implement any other object given a sequential specification, and this implementation is both wait-free and linearizable. Such implementations are called universal constructions. One such object is the consensus object defined above. An implementation may also make use of any number of shared read/write registers.
The following is a universal construction due to Michel Raynal and it is based on the state-machine replication paradigm. Each process maintains a copy of the constructed object, and uses consensus object to keep the copies consistent. The construction consists of two parts— the operation and a background helper process. The operation assigns the proposal and waits for the background process to assign a result. The background process runs in a loop, checking if a proposal has been assigned. If it has, then it sends that proposal to a consensus object which returns the first received proposal for that round. The proposal itself consists of the operation and its arguments, and the proposing process. The background process then executes the state machine transition function on the latest state and the proposed operation. Each process keeps a local copy of the state and the consensus protocol ensures that operations are applied to local states in the same order. As such, all processes have a common view of the object. If the proposal that was returned by the consensus object is from the calling process, we assign the result so that it operation can return it to the caller.
The operation is defined as follows:
1. let perform args =
2. result[i] <- null
3. proposal[i] <- ("op(args)",i)
4. wait (result[i] != null)
The local variables result[i] and proposal[i] refer respectively to the result and proposal of process i. On line 3 process i stores its proposal which is proposed by the background process bellow.
The operation on line 3 is quoted to indicate that we’re referring to a representation of an operation rather than an invocation. In programming language, this can be implemented in a few different ways, such as using reflection, quotations as in F#, or using a free monad.
The background process:
1. while true do
2. if proposal[i] != null then
3. k[i] <- k[i] + 1
4. exec[i] <- CONS[k[i]].propose(proposal[i])
5. (s[i],res) <- δ(s[i],exec[i].op)
6. if (i = exec[i].proc) then
7. proposal[i] <- null
8. result[i] <- res
The local variable k[i] refers to the execution round for process i, CONS refers to the shared consensus object indexed by the round. The local variable exec stores the result of the proposal. The pair s and res store the state and the result of the state machine transition function δ applied to the proposed operation and the current state.
The problem with the construction above is that it is not wait-free. Depending on the scheduler, the background processing loop may never terminate. We can fathom an adverse schedule wherein a given process never “wins” consensus, thereby creating an unbounded delay. In order to make the construction wait-free we have to make a few adjustments. To account for the possibility of failures of other processes, each process can keep an array of sequence numbers representing the last operation applied by every other process. Only if sequence numbers of other processes are increasing — as communicated by a shared register — do they participate in proposals. Therefore, if a process fails, its proposals no longer count and can’t prevent other processes from making progress.
The operation is defined much in the same way as before, however each process stores the last operation it applied in a shared register LAST_OP and keeps track of the sequence number of the operations applied across other processes using the local variable last_sn.
1. let perform args =
2. result[i] <- null
3. LAST_OP[i] <- ("op(param)",last_sn[i][i] + 1)
4. wait (result[i] != null)
The background process constructs a proposal by collecting operations executed by processes since the last round. The local variable last_sn keeps track of the operations applied across all other processes. Then, for each pair of operation and process in the decided proposal, the background task executes the transition function δ and increments last_sn for the corresponding process. If the process is the calling process we assign the result as before.
1. while true do
2. prop[i] = 
3. for j in [1..n] do
4. if (LAST_OP[j].sn > last_sn[i][j]) then
5. prop[i] += (LAST_OP[j].op,j)
6. if (prop[i] != ) then
7. k[i] <- k[i] + 1
8. exec[i] <- CONS[k[i]].propose(prop[i])
9. for r = 1 to |exec[i]| do
10. (s[i],res) <- δ(s[i],exec[r].op)
12. let j = exec[i].proc
13. last_sn[i][j] <- last_sn[j] + 1
14. if (i = j) then
15. result[i] <- res
To prove that this construction is wait-free we must ascertain that for a non-failing process i, line 4 in the definition of the operation eventually completes, which means that line 15 in the background task must eventually execute. We claim that there is a decided proposal containing process i. If the claim is true, then at some point, the test on line 14 will be true and the desired outcome is achieved. We can prove the claim by considering the loop on line 3. It traverses the entire list of processes and includes the operation of each process in the proposal of the calling process. This means that at some point, the proposals of all processes contain process i, proving our claim.
- The Glitch Phenomenon — Leslie Lamport regards consensus to be an intrinsic puzzle of the universe and gave this idea a topological presentation, which of course the author is quite fond of. The idea is perhaps even more fundamental than the consensus object considered herein. Consider an arbiter — an electronic device which must decide which input signal from a processor arrived first in order to determine whom to grant access to a shared resource. If the requests arrive within a short duration of each-other, the arbiter may have a hard time deciding. While an engineering solution was found at some point the 70s, the underlying problem endures. Lamport provides a formalism wherein the device is represented as a continuous function between inputs and outputs, where inputs and outputs are themselves time-dependent function spaces (behaviors for those familiar with FRP). The continuity of this function representing the device is what ultimately allows for the possibility of an unstable state. It can be shown that the the space of outputs for which a timely decision is made is not connected — only becoming connected if time is allowed to tend to infinity. If we assume that the input space is connected, a continuous mapping between the input space and the output space requires use of a zero-function — which corresponds to an inability to decide.
- Event-Sourcing —Among other nice properties, event-sourcing enables state-machine replication. Recall that in the consensus hierarchy described above, a queue with a peek operation has an infinite consensus number. Such a queue is what we get from the log at the heart of an event-sourcing architecture. This log is not just a simple queue, because rather than providing a dequeue operation, it allows consuming processes to traverse the queue in order without mutating the queue itself. As a result, each process can reach a common state without interfering with other processes. With event-sourcing however, it is more typical for this common state to be reached asynchronously, thus foregoing linearizability with respect to the log, but it still maintains order and therefore sequential consistency.
We’ve defined a universal construction — a mechanism for providing a linearizable wait-free implementation of any object given its sequential specification. This result exhibits a great degree of generality and demonstrates the fundamental nature of a consensus object at the core of the construction. Intuitively, it makes sense — consensus is fundamental because it is fundamentally lacking in a distributed system where each participant has their own, independent world view. Somewhat ironically, while consensus is universal it is also impossible as we have seen earlier. So what are we to conclude from the impossibility and universality of consensus? It is important to regard consensus as a limiting notion — we can have consensus but with weaker termination properties, or stronger termination properties, but with a weaker notion of consensus. But once we have a consensus object, we can use it to implement any other object, and this implementation will be wait-free outside of any restrictions on wait-freedom imposted by the consensus object itself. Moreover, as we have seen in the aside on event-sourcing, consensus is useful with a weaker consistency condition.