Notes on Paxos


This is the place were we start discussing real Paxos :-) It starts with a “ballot voting” algorithm. This algorithm, just like the ones we’ve already seen, does not define any messages. Rather, message passing is an implementation detail, so we’ll get to it later.

Recall that rigged voting requires all acceptors to vote for a single values. It is immune to split voting, but is susceptible to getting stuck when the leader goes offline. The idea behind ballot voting is to have many voting rounds, ballots. In each ballot, acceptors can vote only for a single value, so each ballot individually can get stuck. However, as we are running many ballots, some ballots will make progress. The value is chosen in a ballot if it is chosen by some quorum of acceptors. The value is chosen in an overall algorithm if it is chosen in some ballot.

The Turing award question is: how do we make sure that no two ballots choose different values? Note that it is OK if two ballots choose the same value.

Let’s just brute force this question, really. First, assume that the ballots are ordered (for example, by numbering them with natural numbers). And let’s say we want to pick some value v to vote for in ballot b. When v is safe? Well, when no other value v1 can be chosen by any other ballot. Let’s tighten this up a bit.

Value v is safe at ballot b if any smaller ballot b1 (b1 < b) did not choose and will not choose any value other than v.

So yeah, easy-peasy, we just need to predict which values will be chosen in the future, and we are done! We’ll deal with it in a moment, but let’s first convince ourselves that, if we only select safe values for voting, we won’t violate consensus spec.

So, when we select a safe value v to vote for in a particular ballot, it might get chosen in this ballot. We need to check that it won’t conflict with any other value. For smaller ballots that’s easy — it’s the definition of safety condition. What if we conflict with some value v1 chosen in a future ballot? Well, that value is also safe, so whoever chose v1, was sure that it won’t conflict with v.

How do we tackle the precognition problem? We’ll ask acceptors to commit to not voting in certain ballots. For example, if you are looking for a safe value for ballot b and know that there’s a quorum q such that each quorum member never voted in smaller ballots, and promised to never vote in smaller ballots, you can be sure that any value is safe. Indeed, any quorum in smaller ballots will have at least one member which would refuse to vote for any value.

Ok, but what if there’s some quorum member which has already voted for some v1 in some ballot b1 < b? (Take a deep breath, the next sentence is the kernel of the core idea of Paxos). Well, that means that v1 was safe at b1, so, if there will be no votes between b1 and b, v1 is also safe at b! (Exhale). In other words, to pick a safe value at b we:

  1. Take some quorum q.

  2. Make everyone in q promise to never vote in ballots earlier than b.

  3. Among all of the votes already cast by the quorum members we pick the one with the highest ballot number.

  4. If such vote exists, its value is a safe value.

  5. Otherwise, any value is safe.

To implement the “never vote” promise, each acceptor will maintain maxBal value. It will never vote in ballots smaller or equal to maxBal.

Let’s stop hand-waving and put this algorithm in math. Again, we are not thinking about messages yet, and just assume that each acceptor can observe the state of the whole system.

Ballot Vote

 1 2 3 4 5 6 7 8 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
Sets: 𝔹 -- Numbered set of ballots (for example, ℕ) 𝕍 -- Arbitrary set of values 𝔸 -- Finite set of acceptors ℚ ∈ 2^𝔸 -- Set of quorums Assume: ∀ q1, q1 ∈ ℚ: q1 ∩ q2 ≠ {} Vars: -- Set of (acceptor, ballot, value) triples votes ∈ 2^(𝔸×𝔹×𝕍) -- Function that maps acceptors to ballot numbers or -1. -- maxBal :: 𝔸 -> 𝔹 ∪ {-1} maxBal ∈ (𝔹 ∪ {-1})^𝔸 Voted(a, b) ≡ ∃ v ∈ 𝕍: (a, b, v) ∈ votes Safe(v, b) ≡ ∃ q ∈ ℚ: ∀ a ∈ q: maxBal(a) ≥ b - 1 ∧ ∃ b1 ∈ 𝔹 ∪ {-1}: ∀ b2 ∈ [b1+1; b-1], a ∈ q: ¬Voted(a, b2) ∧ b1 = -1 ∨ ∃ a ∈ q: (a, b1, v) ∈ votes AdvanceMaxBal(a, b) ≡ maxBal(a) < b ∧ votes' = votes ∧ maxBal' = λ a1 ∈ 𝔸: if a1 = a then b else maxBal(a1) Vote(a, b, v) ≡ maxBal(a) < b ∧ ∀ (a1, b1, v1) ∈ votes: b = b1 ⇒ v = v1 ∧ Safe(v, b) ∧ votes' = votes ∪ (a, b, v) ∧ maxBal' = λ a1 ∈ 𝔸: if a1 = a then b else maxBal(a1) Init ≡ votes = {} ∧ maxBal = λ a ∈ 𝔸: -1 Next ≡ ∃ a ∈ 𝔸, b ∈ 𝔹: AdvanceMaxBal(a, b) ∨ ∃ v ∈ 𝕍: Vote(a, b, v) chosen ≡ {v ∈ V: ∃ q ∈ ℚ, b ∈ 𝔹: AllVotedFor(q, b, v)} AllVotedFor(q, b, v) ≡ ∀ a ∈ q: (a, b, v) ∈ votes -- Notation
-- [b1;b2]: inclusive interval of ballots
-- Y^X: set of function from X to Y (f: X -> Y)
-- λ x ∈ X: y: function that maps x to y
-- ¬: "not", negation
--
-- f' = λ x1 ∈ X: if x1 = x then y else f(a):
-- A tedious way to write that f' is the same function as f,
-- except on x, where it returns y instead.
--
-- I am sorry! In my defense, TLA+ notation for this
-- is also horrible :-)

Let’s unwrap this top-down. First, the chosen condition says that it is enough for some quorum to cast votes in some ballot for a value to be accepted. It’s trivial to see that, if we fix the ballot, then any two quorums would vote for the same value — quorums intersect. Showing that quorums vote for the same value in different ballots is the tricky bit.

The Init condition is simple — no votes, any acceptor can vote in any ballot (= any ballot with number larger than -1).

The Next consists of two cases. On each step of the protocol, some acceptor either votes for some value in some ballot ∃ v ∈ 𝕍: Vote(a, b, v), or declares that it won’t cast additional vote in small ballots AdvanceMaxBal(a, b). Advancing ballot just sets maxBal for this acceptor (but takes care not to rewind older decisions). Casting a vote is more complicated and is predicated on three conditions:

  • We haven’t forfeited our right to vote in this ballot.

  • If there’s some vote in this ballot already, we are voting for the same value.

  • If there are no votes, then the value should be safe.

Note that the last two checks overlap a bit: if the set of votes cast in a ballot is not empty, we immediately know that the value is safe: somebody has proven this before. But it doesn’t harm to check for safety again: a safe value can not become unsafe.

Finally, the safety check. It is done in relation to some quorum — if q proves that v is safe, than members of this quorum would prevent any other value to be accepted in early ballots. To be able to do this, we first need to make sure that q indeed finalized their votes for ballots less than b (maxBall is at least b - 1). Then, we need to find the latest vote of q. There are two cases

  • No one in q ever voted (b1 = -1). In this case, there are no additional conditions on v, any value would work.

  • Someone in q voted, and b1 is the last ballot when someone voted. Then v must be the value voted for in b1. This implies Safe(v, b1).

If all of these conditions are fulfilled, we cast our vote and advance maxBall.

This is the hardest part of the article. Take time to fully understand Ballot Vote.

Rigorously proving that Ballot Voting satisfies Consensus would be tedious — the specification is large, and the proof would necessary use every single piece of the spec! But let’s add some hand-waving. Again, we want to provide homomorphism from Ballot Voting to Consensus. Cases where the image of a step is a stuttering step (the set of chosen values is the same) are obvious. It’s also obvious that the set of chosen values never decreases (we never remove votes, so a value can not become unchosen). It also increases by at most one value with each step.

The complex case is to prove that, if currently only v1 is chosen, no other v2 can be chosen as a result of the current step. Suppose the contrary, let v2 be the newly chosen value, and v1 be a different value chosen some time ago. v1 and v2 can’t belong to the same ballot, because every ballot contains votes only for a single value (this needs proof!). Lets say they belong to b1 and b2, and that b1 < b2. Note that v2 might belong to b1 — nothing prevents smaller ballot from finishing later. When we chose v2 for b2, it was safe. This means that some quorum either promised not to vote in b1 (but then v1 couldn’t have been chosen in b1), or someone from the quorum voted for v2 in b1 (but then v1 = v2 (proving this might require repeated application of safety condition)).

Ok, but is this better than Majority Voting? Can Ballot Voting get stuck? No — if at least one quorum of machines is online, they can bump their maxBall to a ballot bigger than any existing one. After they do this, there necessary will be a safe value relative to this quorum, which they can then vote on.

However, Ballot Voting is prone to a live lock — if acceptors continue to bump maxBal instead of voting, they’ll never select any value. In fact, in the current formulation one needs to be pretty lucky to not get stuck. To finish voting, there needs to be a quorum which can vote in ballot b, but not in any smaller ballot, and in the above spec this can only happen by luck.

It is impossible to completely eliminate live locks without assumptions about real time. However, when we implement Ballot Voting with real message passing, we try to reduce the probability of a live lock.