Mirrored RAID[1] is a classic way of increasing storage durability. It’s also a classic example of a system that’s robust against independent failures, but fragile against dependent failures. Patterson et al.’s 1988 paper, which popularized mirroring, even covered the problem:
As mentioned above we make the same assumptions that disk manufacturers make – that the failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.)
A 2-way mirrored RAID can be in three possible states: a state with no failures, a state with one failure, or a state with two failures. The system moves between the first and second states, and the second and third states, when a failure happens. It can return from the second state to the first by repair. In the third state, data is lost, and returning becomes an exercise in disaster recovery (like restoring a backup). The classic Markov model looks like this, with the failure rate λ and repair rate μ:
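In text form, the standard version of that chain (a sketch of the textbook model, where the 2λ rate out of the healthy state is exactly the independence assumption discussed next) solves to the familiar mean time to data loss:

```latex
% Three states: both disks healthy, one disk failed (degraded), data lost.
% Rates, assuming independent failures:
%   healthy  -> degraded : 2\lambda   (either disk fails)
%   degraded -> healthy  : \mu        (repair completes)
%   degraded -> lost     : \lambda    (second disk fails before repair)
\begin{align*}
  T_{\mathrm{healthy}}  &= \tfrac{1}{2\lambda} + T_{\mathrm{degraded}}\\
  T_{\mathrm{degraded}} &= \tfrac{1}{\lambda+\mu}
                           + \tfrac{\mu}{\lambda+\mu}\,T_{\mathrm{healthy}}\\
  \mathrm{MTTDL} = T_{\mathrm{healthy}}
                        &= \frac{3\lambda+\mu}{2\lambda^{2}}
                         \;\approx\; \frac{\mu}{2\lambda^{2}}
                         \quad\text{when } \mu \gg \lambda
\end{align*}
```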
This model clearly displays its naive thinking: it assumes that the failure rate of 2 disks is double the failure rate of a single disk[2]. All experienced system operators know that’s not true in practice. A second disk failure seems more likely to happen soon after a first. This happens for three reasons: failures with a shared cause (a power surge, flood, or a bad batch of disks); failures triggered by the first failure (like the extra load and wear a rebuild puts on the surviving disk); and latent failures, already present but undetected until the redundancy was needed.
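To put rough numbers on that, here’s a minimal sketch (mine, not from the paper) that reuses the chain above but lets the surviving disk fail at k times the base rate once the array is degraded. The rates below are illustrative assumptions, not measurements.

```python
# Mean time to data loss for a 2-disk mirror, from the standard 3-state
# Markov chain, but allowing the second (post-degradation) failure rate
# to differ from the first. k = 1 is the independence assumption.
#
# States: healthy -> degraded at 2*lam, degraded -> healthy at mu (repair),
#         degraded -> data loss at k*lam.

def mttdl(lam: float, mu: float, k: float = 1.0) -> float:
    """Mean time to data loss, in the same time unit as the rates."""
    lam2 = k * lam  # failure rate of the surviving disk while degraded
    return (2 * lam + lam2 + mu) / (2 * lam * lam2)

if __name__ == "__main__":
    HOURS_PER_YEAR = 24 * 365
    lam = 1 / 100_000   # illustrative: ~11-year mean disk lifetime, per hour
    mu = 1 / 24         # illustrative: one-day mean repair time, per hour
    for k in (1, 2, 10, 100):
        years = mttdl(lam, mu, k) / HOURS_PER_YEAR
        print(f"k = {k:>3}: MTTDL ≈ {years:,.0f} years")
```

Even a modest correlation between the first and second failures wipes out most of the durability the model promises.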
The third case, latent failures, may be the most interesting to system designers. They are a great example of the fact that systems[4] often don’t know how far they are from failure. In the simple RAID case, a storage system with a latent failure believes that it’s in the first state, but actually is in the second state. This problem isn’t, by any means, isolated to RAID.
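One way to see the cost of that misplaced belief is a small Monte Carlo sketch (illustrative, with assumed rates and check intervals) where the first failure is latent: repair can only start once a periodic check notices it.

```python
# Monte Carlo sketch: time to data loss for a 2-disk mirror when the first
# failure is latent, i.e. only noticed at the next periodic check.
# All rates and intervals are illustrative assumptions.
import random

def sample_ttdl(lam: float, mu: float, check_interval: float) -> float:
    """One sampled time to data loss, in hours."""
    t = 0.0
    while True:
        # Wait for either of the two healthy disks to fail.
        t += random.expovariate(2 * lam)
        # The failure sits undetected until the next periodic check...
        detect_delay = check_interval - (t % check_interval)
        # ...and only then does the repair (rebuild) start.
        repair_done = detect_delay + random.expovariate(mu)
        # Meanwhile the surviving disk may also fail, losing data.
        second_failure = random.expovariate(lam)
        if second_failure < repair_done:
            return t + second_failure
        t += repair_done  # repaired in time; back to two healthy disks

def mean_ttdl(lam: float, mu: float, check_interval: float,
              trials: int = 500) -> float:
    return sum(sample_ttdl(lam, mu, check_interval) for _ in range(trials)) / trials

if __name__ == "__main__":
    lam, mu = 1 / 100_000, 1 / 24      # same illustrative rates as above
    for check in (1, 24, 24 * 30):      # hourly, daily, monthly checks
        years = mean_ttdl(lam, mu, check) / (24 * 365)
        print(f"checked every {check:>4} h: MTTDL ≈ {years:,.0f} years")
```

The longer a failure can stay latent, the longer the system spends in the degraded state without knowing it, and the worse the real durability gets.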
Another good example of the same problem is a system with a load balancer and some webservers behind it. The load balancer runs health checks on the servers, and only sends load to the servers it believes are healthy. This system, like mirrored RAID, is susceptible to outages from failures with a shared cause (flood, earthquake, etc.), failures triggered by the first failure (overload), and latent failures. The last two are vastly more common than the first: the servers fail one by one over time, and the system stays up until it either dies of overload or the last server fails.
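A toy version of that failure mode, with made-up numbers: from the outside the fleet looks healthy right up until remaining capacity drops below offered load, and then it fails all at once.

```python
# Sketch with illustrative numbers: servers behind a load balancer failing
# one by one. The externally visible state stays "UP" until the surviving
# servers can no longer carry the offered load.
FLEET_SIZE = 10            # servers behind the load balancer
PER_SERVER_CAPACITY = 100  # requests/second each healthy server can absorb
OFFERED_LOAD = 550         # requests/second arriving at the load balancer

for failed in range(FLEET_SIZE + 1):
    healthy = FLEET_SIZE - failed
    capacity = healthy * PER_SERVER_CAPACITY
    state = "UP" if capacity >= OFFERED_LOAD else "DOWN (overloaded)"
    print(f"{failed:>2} failed, {healthy:>2} healthy, "
          f"margin {capacity - OFFERED_LOAD:>5} req/s -> {state}")
```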
In both the load-balancer and RAID cases, black-box monitoring of the system is not sufficient. Black-box monitoring, including external monitors, canaries, and so on, only tells you which side of an externally visible failure boundary the system is on. Many kinds of systems, including nearly every kind that includes some redundancy, can move towards that boundary through multiple failures without crossing it. Black-box monitoring misses these internal state transitions. Catching them can significantly improve the actual, real-world durability and availability of a system.
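Concretely, the difference might look something like this (a sketch; the names, thresholds, and alert levels are my assumptions, not any particular monitoring system’s API): the black-box check can only say “serving or not”, while the white-box check reports how much redundancy is left before the boundary.

```python
# Sketch: black-box vs. white-box views of the same fleet. Names and
# thresholds are illustrative assumptions, not a real monitoring API.
from dataclasses import dataclass

@dataclass
class FleetState:
    healthy_servers: int
    required_servers: int  # minimum needed to carry the current load

def black_box_check(fleet: FleetState) -> list[str]:
    # Only sees the externally visible boundary: serving or not serving.
    if fleet.healthy_servers < fleet.required_servers:
        return ["PAGE: site is down"]
    return []

def white_box_check(fleet: FleetState) -> list[str]:
    # Also watches the internal state: how much redundancy remains.
    margin = fleet.healthy_servers - fleet.required_servers
    if margin < 0:
        return ["PAGE: site is down"]
    if margin == 0:
        return ["PAGE: no redundancy left, the next failure is an outage"]
    if margin <= 2:
        return [f"TICKET: redundancy margin down to {margin} server(s)"]
    return []

if __name__ == "__main__":
    for healthy in range(10, 4, -1):
        fleet = FleetState(healthy_servers=healthy, required_servers=6)
        print(f"{healthy} healthy: black box {black_box_check(fleet)!r:>24} "
              f"white box {white_box_check(fleet)!r}")
```

The white-box check pages well before the black-box one notices anything, which is the whole point: it sees the steps toward the boundary, not just the crossing.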
Presented that way, it seems obvious. However, I think there’s something worth paying real attention to here: complex systems, the kind we tend to build when we want to build failure-tolerant systems, have a property that simple systems don’t. Simple systems, like a teacup, are either working or they aren’t. There is no reason to invest in maintenance (beyond the occasional cleaning) until a failure happens. Complex systems are different. They need to be constantly maintained to achieve their optimum safety characteristics.
This requires a deep understanding of the behavior of the system, and involves complexities that are often missed in planning and management activities. If planning for, and allocating resources to, maintenance activities is done without this knowledge (or, worse, considering only external failure rates) then it’s bound to under-allocate resources to the real problems.
That doesn’t mean that all maintenance must, or should, be done by humans. It’s possible, and necessary at scale, to automate many of the tasks needed to keep systems far from the failure boundary. You’ve just got to realize that your automation is now part of the system, and the same conclusions apply.
Footnotes: