Just after I joined the EBS team at AWS in 2011, the service suffered a major disruption that took more than two days to fully recover from. Recently, on Twitter, Andrew Certain said:
We were super dependent on having a highly available network to make the replication work, so having two NICs and a second network fabric seemed to be a way to improve availability. But the lesson of this event is that only some forms of redundancy improve availability.

So I think it reinforces two lessons:

1/ Don't be weird
2/ Modality is bad

— Andrew Certain (@tacertain) July 20, 2019
I’ve been thinking about the second part of that a lot recently, as my team starts building a new replicated system. When does redundancy actually help availability? I’ve been breaking that down into four rules:

1. The complexity the redundancy adds can’t cost more availability than the redundancy provides.
2. The redundant components must actually be able to take the load when something fails.
3. The system must be able to reliably detect which components are healthy, and fail over to them.
4. The system must be able to return to a fully redundant state after a failure.

This might seem obvious, even tautological, but each rule serves as a trigger for deeper thinking and conversation.
Andrew (or Kerry Lee, I’m not sure which) introduced the first rule to the EBS team as “don’t be weird.”
This isn’t a comment on people (who are more than welcome to be weird), but on systems. Weirdness and complexity add risk, both risk that we don’t understand the system that we’re building, and risk that we don’t understand the system that we are operating. When adding redundancy to a system, it’s easy to fall into the mistake of adding too much complexity, and underestimating the ways in which that complexity adds risk.
Once you’ve failed over to the redundant component, are you sure it’s going to be able to take the load? Even in one of the simplest cases, active-passive database failover, this is a complex question. You’re going from warm caches and full buffers to cold caches and empty buffers. Performance can differ significantly.
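To get a feel for how big that difference can be, here’s a back-of-the-envelope sketch. The hit rates and latencies are purely illustrative assumptions, not measurements from any real database:

```python
# Back-of-the-envelope: expected read latency with warm vs. cold caches.
# All numbers here are illustrative assumptions, not measurements.

def avg_read_latency_us(hit_rate, cache_us=100.0, disk_us=10_000.0):
    """Expected latency when a fraction hit_rate of reads is served from cache."""
    return hit_rate * cache_us + (1.0 - hit_rate) * disk_us

warm = avg_read_latency_us(0.99)  # primary with warm caches: ~199us
cold = avg_read_latency_us(0.10)  # freshly promoted standby: ~9010us
print(f"warm {warm:.0f}us, cold {cold:.0f}us, roughly {cold / warm:.0f}x slower")
```

A standby that’s tens of times slower right after failover may shed load or blow through timeouts at exactly the moment it’s supposed to be saving you.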
As systems get larger and more complex, the problem gets more difficult. What components do you expect to fail? How many at a time? How much traffic can each component handle? How do we stop our cost reduction and efficiency efforts from taking away the capacity needed to handle failures? How do we continuously test that the failover works? What mechanism do we have to make sure there’s enough failover capacity? There’s typically at least as much investment in answering these questions as in building the redundant system in the first place.
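One way to keep those answers honest is to write the capacity rule down and check it continuously. Here’s a minimal sketch, assuming a fleet of identical hosts and a single per-host capacity number (real fleets are rarely that tidy); all the figures are hypothetical:

```python
# Can the survivors carry the load if `failed` hosts are lost at once?
# Host counts, capacities, and load are hypothetical round numbers.

def survives(hosts, per_host_capacity, offered_load, failed):
    remaining = hosts - failed
    return remaining > 0 and offered_load <= remaining * per_host_capacity

# 10 hosts at 100 units each, carrying 850 units of offered load:
print(survives(10, 100, 850, failed=1))  # True: 9 hosts can carry 900
print(survives(10, 100, 850, failed=2))  # False: 8 hosts can only carry 800
```

The arithmetic isn’t the interesting part. The interesting part is deciding which values of `failed` you design for, and wiring a check like this into your efficiency work so that cost savings can’t silently eat the failover headroom.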
Chaos testing, gamedays, and other similar approaches are very useful here, but typically can’t test the biggest failure cases in a continuous way.
When systems suffer partial failure, it’s often hard to tell what’s healthy and what’s unhealthy. In fact, different systems in different parts of the network often completely disagree on health. If your system sees partial failure and fails over towards the truly unhealthy side, you’re in trouble. The complexity here comes from the distributed systems fog of war: telling the difference between bad networks, bad software, bad disks, and bad NICs can be surprisingly hard. Often, systems flap a bit before falling over.
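Hysteresis is one common way to damp the flapping, though it does nothing to resolve disagreement between observers. Here’s a minimal sketch of a single-observer detector; the thresholds are arbitrary illustrative choices, and real systems typically also combine verdicts from several vantage points:

```python
class HealthDetector:
    """Declare a component unhealthy only after `fail_after` consecutive bad
    probes, and healthy again only after `recover_after` consecutive good ones.
    Thresholds here are arbitrary, illustrative choices."""

    def __init__(self, fail_after=3, recover_after=5):
        self.fail_after = fail_after
        self.recover_after = recover_after
        self.healthy = True
        self.streak = 0  # consecutive probes arguing for a state change

    def observe(self, probe_ok):
        wants_change = (self.healthy and not probe_ok) or (not self.healthy and probe_ok)
        self.streak = self.streak + 1 if wants_change else 0
        needed = self.fail_after if self.healthy else self.recover_after
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy
```

This damps noise from a single vantage point; it doesn’t answer the harder question of what to do when different observers reach different verdicts.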
If your redundancy is a single shot, it’s not going to add much availability in the long term. So you need to make sure the system can safely get back from one to two, or from N to N+1, or from N to 2N. This is relatively easy in some kinds of systems, but non-zero RPO, asynchronous replication, or periodic backups can make it extremely difficult. In small systems, human judgement can help. In larger systems, you need an automated plan. Most likely, you’re going to make a better automated plan during daylight in the middle of the week during your design phase than at 3AM on a Saturday while trying to fix the outage.
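One way to make that plan concrete ahead of time is to write the repair decision down as a small reconciliation step. This is only a sketch: the `source_complete` flag stands in for the genuinely hard part, which is knowing whether a surviving copy is complete enough to re-replicate from when replication is asynchronous or the RPO is non-zero:

```python
# A sketch of the "return to redundancy" decision as a reconciliation step.
# The fields and rules are illustrative, not any particular system's design.

from dataclasses import dataclass

@dataclass
class ReplicaSet:
    target: int            # desired replica count, e.g. 2 or 3
    healthy: int           # replicas currently believed healthy
    source_complete: bool  # is a surviving replica known to be up to date?

def plan_repair(rs):
    if rs.healthy >= rs.target:
        return "fully redundant; nothing to do"
    if rs.healthy == 0:
        return "no healthy replicas; page a human"
    if not rs.source_complete:
        return "hold: surviving copy may be stale; re-replicating could lose writes"
    return f"re-replicate: build {rs.target - rs.healthy} new replica(s)"

print(plan_repair(ReplicaSet(target=2, healthy=1, source_complete=True)))
```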