Mars Code, in February’s CACM, is a very interesting look from the outside at some of the software engineering practices that helped make the Mars Science Laboratory mission successful. The authors cover a lot of ground in the article, from code reviews to coding standards to model extraction and model checking. While writing about model checking, they tell the story of Snark, a non-blocking deque algorithm.
The Snark paper looks great. The algorithm is presented clearly, including both clear text descriptions and blocks of C-like pseudocode. It’s published in a well-respected journal, LNCS. It dedicates four and a half of its fifteen pages to a well-structured and clearly written sketch proof of the correctness of the algorithm, with clear diagrams explaining some of the tricky cases. It’s got Guy Steele on the author list. It’s from Sun. As far as my is this paper likely to be trustworthy? heuristics go, this one doesn’t raise many suspicions.
Unfortunately, Snark is broken. Not subtly broken on an obscure liveness measure, but fundamentally broken in that it’s unsafe. In DCAS is not a Silver Bullet for Nonblocking Algorithm Design, Doherty et al lay out two bugs in the Snark algorithm. It’s worth noting, though, that et al in this case includes most of the authors of the original paper. The paper explains the bugs well, and it’s interesting how subtly 30 lines of code can be broken. It’s well worth reading, especially if you are interested in non-blocking algorithms. Later, Leslie Lamport used the same bugs as a test of the PlusCal language. In Checking a Multithreaded Algorithm with +CAL he explains how he model checked the original and fixed Snark algorithms using TLC.
Snark is not an isolated example either. Another great example is Chord. The hits are similar: top authors, top journal, good institution, and a detailed sketch proof. In Chord’s case, it’s also widely implemented and well respected. It even won the 2011 SIGCOMM Test of Time Paper Award. Like Snark, Chord appears to be subtly broken. Pamela Zave, in Using Lightweight Modeling To Understand Chord, found a number of places where either Chord is broken, or the version presented in the paper (and proven in the proofs!) isn’t safe. More details can be found in the slides for a talk titled How to make Chord Correct.
The purpose of pointing out these examples isn’t self-superiority or even schadenfreude, but to demonstrate that the heuristics we use when deciding whether to trust a published algorithm don’t always work. Even talented teams of really smart people get this stuff wrong all the time. Not only are multi-threaded and distributed algorithms tricky to reason about, but techniques for proving their correctness are also complex and often not approachable. Most practitioners lack the training to verify the correctness of the proofs that do exist. Often, it’s even quite challenging to write down the requirements sufficiently clearly, though techniques like Refinement Mappings can often help.
Modeling, specification and model checking play an important role. Lamport demonstrated issues in Snark using TLA+, Doherty et al used Spin, and Zave used Alloy. Each of these tools has areas of strength, but each allows computer-aided checking of specifications or models. Lamport puts it this way:
a hand proof is no substitute for model checking. As the two-sided queue example shows, it is easy to make a mistake in an informal proof.
Tools like TLA+ have a big advantages for practitioners: they tend to be much more approachable than proof techniques, allow exploration of algorithm variants without starting from scratch, and allow easy exploration of the effect of a model on the correctness of a specification.The also look like code, and the tools frequently look and behave like the editors and compilers most are familiar with. If you can pick up Haskell or Scheme, and have a passing familiarity with basic set theory, you can learn TLA+, PlusCal or Spin. As another benefit, formal specifications of algorithms like Raft and Cheap Paxos are often (though not always) more complete and precise than either the text or pseudocode descriptions given in papers. Theses specifications allow automated model checking, and provide evidence that such model checking has been done. A specification provides some evidence that the author has thought through all the edge cases of an implementation, at least within the boundaries of the model.
Of course, model checking isn’t a panacea. It’s just as easy to make mistakes in either the specification or invariants as it is in a proof. Lamport explains some other limitations of model checking:
Model checking is no substitute for proof. Most algorithms can be checked only on instances of an algorithm that are too small to give us complete confidence in their correctness. Moreover, a model checker does not explain why the algorithm works.
Traditional software engineering practices, such as unit tests, are another promising approach. Approaches for testing distributed and multi-threaded algorithm implementations, with good coverage of ordering and failure cases, are still in their infancy, but there is a lot of very encouraging progress. More broadly, and more in industry than academia, we need a cultural shift. A developer who developed their own block cipher and dismissed concerns with it’s easy to reason that it is secure would get laughed at by the security community. Somebody who proposed a new sorting algorithm and refused to demonstrate that it actually sorts things would be the target of derision. People who propose their own distributed algorithms, both in the research press and in practice, are too frequently allowed to get away with trust me and it’s obvious as demonstrations of correctness. I’m not accusing the authors of the Snark or Chord papers of this, but it is very common. Depending on what you are doing, that’s not good enough. You don’t need to read much more than the comments on Lamport’s Why We Should Build Software Like We Build Houses to see evidence of a strong anti-design and anti-specification bent in software engineering practitioners. While these formal techniques and methods aren’t needed for much of the work programmers do, there are many types of systems for which they are extremely helpful (and should more often be required).
If you’re looking for a distributed algorithm, and your business, your life or your reputation are on the line, don’t accept I copied it from a paper. Don’t accept it’s obviously right. Don’t accept correctness based on reputation. Think twice before trusting only an informal proof.