Marc's Blog

About Me

My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in machining, welding, cooking and skiing.

I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I work on databases, serverless, and serverless databases. Before that, I worked on EC2 and EBS.
All opinions are my own.

Links

My Publications and Videos
@marcbrooker on Mastodon @MarcJBrooker on Twitter

The Bug in Paxos Made Simple

There's not really a bug in Paxos, but clickbait is fun.

Over the last few weeks, I’ve been picking up the excellent P programming language, a language for modelling and specifying distributed systems. One of the first things I did in P was implement Paxos - an algorithm I know well, has a lot of subtle failure modes, and is easy to get wrong. Perfect for practicing specification. To test out P’s model checker, I intentionally implemented a subtly buggy version of Paxos, following the description in Paxos Made Simple. The model checker found, as expected, implemented the way I read Paxos Made Simple, that Paxos is broken.

I mentioned this to a colleague who said they had never heard of this bug. I think it deserves to be more well known, so I thought I’d write a bit about it.

The problem lies not in the Paxos algorithm itself, but in the description in the paper. Michael Deardeuff pointed out this bug to me, and also wrote it up in what may be the best Stack Overflow exchange of all time1 (or, at least, the one with the best value-to-upvotes ratio). In the Stack Overflow question, user lambda describes the following sequence of events:

Consider that there are totally 3 acceptors ABC. We will use X(n:v,m) to denote the status of acceptor X: proposal n:v is the largest numbered proposal accepted by X where n is the proposal number and v is the value of the proposal, and m is the number of the largest numbered prepare request that X has ever responded.

The following can play out:

1. P1 sends 'prepare 1' to AB
2.  Both AB respond P1 with a promise to not to accept any request numbered smaller than 1.\
    Now the status is: A(-:-,1) B(-:-,1) C(-:-,-)
3.  P1 receives the responses, then gets stuck and runs very slowly
4.  P2 sends 'prepare 100' to AB
5.  Both AB respond P2 with a promise to not to accept any request numbered smaller than 100.
    Now the status is: A(-:-,100) B(-:-,100) C(-:-,-)
6.  P2 receives the responses, chooses a value b and sends 'accept 100:b' to BC   
7.  BC receive and accept the accept request, the status is: A(-:-,100) B(100:b,100) C(100:b,-).
    Note that proposal 100:b has been chosen.
8.  P1 resumes, chooses value a and sends 'accept 1:a' to BC
9.  B doesn't accept it, but C accepts it because C has never promise anything.
    Status is: A(-:-,100) B(100:b,100) C(1:a,-).

This seems to be a major problem, because now the system could forget the decided value and decide on another one, violating the most basic safety property of Paxos. As Micheal points out in his answer, it turns out that this happens because of two ambiguities in the text of Paxos Made Simple. First, on the selection of acceptors for the accept (second) phase (from Paxos Made Simple):

If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.

If you follow the letter of this statement, and send the accept messages to the acceptors who responded to your first phase messages, then the problem can’t happen. Unfortunately, this also makes the algorithm somewhat less robust in practice. Fortunately, there’s another possible fix. Again, from Michael’s answer:

By accepting a value the node is also promising to not accept earlier values.

Lamport doesn’t say this in Paxos Made Simple. Instead, he says:

If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

So if you don’t quite follow the letter of the text about acceptor selection, and then do follow the text about how acceptors handle accept messages, then you end up with the bug described in the Stack Overflow question. That seems like a narrow case, but I’ll admit that I’ve implemented Paxos incorrectly in this way multiple times. It’s a very easy mistake to make.

Leslie Lamport is one of my technical writing heroes. I re-read some of his papers, like What Good is Temporal Logic? from time to time just because I like the way they are written. Pointing out this ambiguity isn’t criticizing his writing, but rather reminding you about how hard it is to write crisp descriptions of even relatively simple distributed protocols in text. As Lamport himself says:

Prose is not the way to precisely describe algorithms.

That’s a big part of why I like languages like P and TLA+ so much. Not only are they great ways to specify, check, and model algorithms, but they are also great ways to communicate them. If you work with distributed algorithms, I strongly advise picking up one of these languages.

Updates

Thanks for posting this, it’s a super interesting observation! I actually think about this issue a bit differently. Acceptors in Paxos store the “highest proposal accepted” not the “last proposal accepted”.

Footnotes

  1. I was reminded about this excellent exchange by Mahesh Balakrishnan’s recent post Paxos made Abstract., which is well worth reading to give a different perspective on how Paxos works, and one way to think about it from a systems perspective.
  2. There are a lot more correct variants of jury selection for different phases that meet different design goals, as Heidi Howard’s work clearly points out.