Many of the most interesting results in distributed systems have come from looking at problems known to be impossible under one set of constraints, and finding how little those constraints can be relaxed before the problem becomes possible. One great example is how adding a random Oracle to the asynchronous system model used by FLP makes consensus possible. That result is very interesting, but not as practically important as the idea of failure detectors.
The theoretical importance of detecting failures in the asynchronous model dates back to work in the 1980s from Dolev, Dwork and Stockmeyer and Dwork, Lynch and Stockmeyer. The latter of these papers is very interesting, because it describes what can be argued is the first practical consensus algorithm before the publication of Viewstamped Replication and Paxos. More on that another time. A great, detailed, description and characterization of failure detectors can be found in Unreliable Failure Detectors for Reliable Distributed Systems by Chandra and Toueg. They also introduced the concept of unreliable failure detectors:
In this paper, we propose an alternative approach to circumvent such impossibility results, and to broaden the applicability of the asynchronous model of computation. Since impossibility results for asynchronous systems stem from the inherent difficulty of determining whether a process has actually crashed or is only "very slow," we propose to augment the asynchronous model of computation with a model of an external failure detection mechanism that can make mistakes. In particular, we model the concept of unreliable failure detectors for systems with crash failures.
The failure detectors that Chandra and Toueg describe are distributed, rather than global, failure detectors. Each process uses local state to keep a list of other processes that it suspects have failed, and adds and removes processes from this list based on communication with other processes. A local failure detector like that can make two kinds of mistakes: putting processes that haven't failed onto the list, and not putting processes that have failed onto the list. Remember, that like in all real distributed systems, there is no central oracle that can tell a node whether its list is correct. These two kinds of mistakes lead to the definition of two properties. From Chandra and Toueg:
completeness requires that a failure detector eventually suspects every process that actually crashes, while accuracy restricts the mistakes that a failure detector can make.⋄
Let's start with the best combination of these properties, a failure detector where every correct process eventually permanently suspects every crashed process of failing (strong completeness), and never suspects a non-crashed process (strong accuracy). This failure detector, P, can be seen as the ideal asynchronous failure detector. It doesn't make mistakes, and it does the best it can at detection while remaining asynchronous. At the other end of the scale is ◇W. With this failure detector, every crashed process is eventually permanently suspected by some crashed process, and eventually some correct process is not suspected by any correct process. ◇W, unlike P, can make lots and lots of mistakes, for arbitrarily long amounts of time.
Before going further, it's worth introducing one piece of notation. Even informal writing about failure detectors tends to make heavy use of the ◇ operator from temporal logic. Don't be put off by the notation, ◇F simply means F is eventually true. There is some state in the future where F is true. To better understand that, let's compare the failure detectors ◇W and W. Both of these meet the weak completeness condition:
Weak Completeness. Eventually every process that crashes is permanently suspected by some correct process.
W meets the weak accuracy condition:
Weak Accuracy. Some correct process is never suspected.
While ◇W only meets the strictly weaker eventual weak accuracy condition.
Eventual Weak Accuracy. There is a time after which some correct process is never suspected by any correct process.
Comparing those two makes the difference more obvious. ◇W is allowed to make mistakes early on (before a time) what W isn't allowed to make.
The existence of these classes of failure detectors allows meaningful comparisons to be made about the difficulty of different distributed problems, much like complexity classes allow us to compare the difficulty of computational problems. For example, it is known that consensus can be solved using ◇W if only a minority of processes fail. The problem known as non-blocking atomic commit (NB-AC), on the other hand, cannot be solved with ◇W if there is a single failure. In a very meaningful sense, NB-AC is harder than consensus. When I first learned about that result, I found it surprising: my assumption had been that uniform consensus was equivalent to the hardest problems in distributed systems.
First, let's define the NB-AC and consensus problems. They have a lot in common, both being non-blocking agreement problems. Both consensus and NB-AC attempt to get a multiple processes to agree on a single value without blocking in the presence of failures. Two-phase commit is, like NB-AC and consensus, an agreement protocol, but it is a blocking one. The presence of a single failure will cause 2PC to block forever.
Guerraoui defines consensus with three conditions:
Agreement: No two correct participants decide different values
Uniform-Validity: If a participant decides v, then v must have been proposed by some participant
Termination: Every correct process eventually decides
Uniform consensus expands the agreement condition to a stronger one, called uniform agreement:
Uniform-Agreement: No two participants (correct or not) decide different values.
Consensus is, therefore, about deciding. NB-AC, on the other hand, is about accepting or voting on whether to commit a transaction. Guerraoui defines it with four conditions:
Uniform-Agreement: No two participants AC-decide different values.
Uniform-Validity: If a participant AC-decides commit, then all participants voted ''yes''.
Termination: Every correct process eventually AC-decides
NonTriviality: If all participants vote yes, and there is no failure, then every correct participant eventually AC-decides commit.
Notice how similar this appears to be to the uniform consensus problem. Guerraoui describes how it is the last one of these conditions, NonTriviality, which has the effect of requiring that a solution to NB-AC has precise knowledge about failures. To meet the Termination condition, eventually each process needs to commit or abort. Eventual strong accuracy doesn't provide the knowledge required to make that decision, because it admits a time t where a process is simply delayed but it's vote is ignored (violating the uniform validity or NonTriviality conditions depending on the vote). Weak accuracy doesn't provide the right knowledge either, because it allows an incorrect abort (and hence violation of NonTriviality) based on incomplete knowledge of the failed set.
If you only have unreliable failure detectors, uniform consensus is no harder than consensus, though reliable failure detectors (like P) make consensus easier than uniform consensus. Therefore, the addition of the uniform agreement requirement doesn't explain why consensus can be solved with ◇W and NB-AC can't. Instead, it's that seemingly harmless NonTriviality condition that makes NB-AC harder. That's a great example of how intuition is often a poor guide in distributed systems problems: seemingly similar problems, with very similar definitions, may end up with completely different difficulties.