Cloudflare’s deep postmortem for their November 18 outage triggered a ton of online chatter about error handling, caused by a single line in the postmortem:
.unwrap()
If you’re not familiar with Rust, you need to know about Result, an enum type that contains either a successful result or an error. unwrap says basically “return the successful result if there is one, otherwise crash the program”[1]. You can think of it like an assert.
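To make that concrete, here’s a minimal sketch of the two styles. The parse_port helper and its two callers are made up for illustration (they’re not from the postmortem); only Result, match, and unwrap are real Rust.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse a port number from text.
/// Returns `Ok(port)` on success, or `Err(..)` describing why parsing failed.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    text.trim().parse::<u16>()
}

/// Handle the error explicitly: fall back to a default and keep running.
fn port_or_default(text: &str, default: u16) -> u16 {
    match parse_port(text) {
        Ok(port) => port,
        Err(_) => default,
    }
}

/// "Succeed or die": `unwrap` returns the value inside `Ok`,
/// and panics if the result is an `Err`.
fn port_or_panic(text: &str) -> u16 {
    parse_port(text).unwrap()
}
```

Here port_or_default("not a port", 80) returns 80 and carries on, while port_or_panic("not a port") panics. Neither is right or wrong in isolation, which is the point of the rest of this post.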
There’s a ton of debate about whether asserts are good in production[2], but most of it misses the point. Quite simply, this isn’t a question about a single program. It’s not a local property. Whether asserts are appropriate for a given component is a global property of the system, and the way it handles data.
Let’s play a little error handling game. Click the ✅ if you think crashing the process or server is appropriate, and the ❌ if you don’t. Then you’ll see my vote and justification.
There are three unifying principles behind my answers here.
Are failures correlated? If the decision is a local one that’s highly likely to be uncorrelated between machines, then crashing is the cleanest thing to do. Crashing has the advantage of reducing the complexity of the system, by removing the “working in degraded mode” state. On the other hand, if failures can be correlated (including by adversarial user behavior), it’s best to design the system to reject the cause of the errors and continue.
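A configuration file pushed to every machine is a classic correlated cause: if it’s bad, it’s bad everywhere at once. Here’s a rough sketch of “reject the cause and continue” for that case. Everything in it (Config, MAX_FEATURES, ConfigHolder, the one-name-per-line format) is hypothetical and only there to show the shape of the idea, not how any real system does it.

```rust
/// Hypothetical configuration: a list of feature names, one per line.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    features: Vec<String>,
}

/// Illustrative hard limit; a real system would pick its own.
const MAX_FEATURES: usize = 4;

#[derive(Debug, PartialEq)]
enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
    EmptyFeatureName,
}

/// Validate an update fully before anything is allowed to use it.
fn parse_config(text: &str) -> Result<Config, ConfigError> {
    let features: Vec<String> = text.lines().map(|l| l.trim().to_string()).collect();
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures { got: features.len(), max: MAX_FEATURES });
    }
    if features.iter().any(|f| f.is_empty()) {
        return Err(ConfigError::EmptyFeatureName);
    }
    Ok(Config { features })
}

/// Holds the configuration currently in use.
struct ConfigHolder {
    current: Config,
    rejected_updates: u64, // something to alarm on
}

impl ConfigHolder {
    fn new(initial: Config) -> Self {
        ConfigHolder { current: initial, rejected_updates: 0 }
    }

    /// Swap in the new config only if it validates. On failure, keep serving
    /// with the previous config and report the error to the caller.
    fn apply_update(&mut self, text: &str) -> Result<(), ConfigError> {
        match parse_config(text) {
            Ok(config) => {
                self.current = config;
                Ok(())
            }
            Err(err) => {
                self.rejected_updates += 1;
                Err(err)
            }
        }
    }
}
```

The trade-off is visible in the rejected_updates counter: the process stays up, but it’s now running in a mode (“old config, new one refused”) that somebody has to notice and operate.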
Can they be handled at a higher layer? This is where you need to understand your architecture. Traditional web service architectures can handle low rates of errors at a higher layer (e.g. by replacing instances or containers as they fail load balancer health checks using AWS Autoscaling), but can’t handle high rates of crashes (because they are limited in how quickly instances or containers can be replaced). Fine-grained architectures, starting with Lambda-style serverless all the way to Erlang’s approach, are designed to handle higher rates of errors, and crashing rather than continuing is appropriate in more cases.
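As a toy version of “let it crash, and let the layer above clean up”, here’s a sketch that uses a thread as the unit of failure. The supervise function and flaky_worker are invented for this example; thread::spawn and JoinHandle::join are the real standard library calls. It assumes the default unwinding panic strategy (with panic = "abort" the whole process dies instead).

```rust
use std::thread;

/// A made-up worker that crashes on its first two attempts, then succeeds.
fn flaky_worker(attempt: u32) -> u32 {
    if attempt < 2 {
        panic!("simulated crash on attempt {}", attempt);
    }
    attempt * 10
}

/// Run `job` in its own thread. If it panics, replace it with a fresh attempt,
/// up to `max_restarts` replacements. Returns the result together with the
/// attempt number that produced it, or `None` if the restart budget ran out.
fn supervise<T: Send + 'static>(job: fn(u32) -> T, max_restarts: u32) -> Option<(T, u32)> {
    for attempt in 0..=max_restarts {
        // A panic unwinds only the spawned thread; `join` surfaces it
        // to the supervisor as an `Err`.
        let handle = thread::spawn(move || job(attempt));
        if let Ok(value) = handle.join() {
            return Some((value, attempt));
        }
        // The worker crashed: fall through and start a replacement.
    }
    None
}
```

With these definitions, supervise(flaky_worker, 3) returns Some((20, 2)), and supervise(flaky_worker, 1) returns None. The restart budget plays the same role as the replacement rate of instances or containers: below it, crashing is cheap; above it, the higher layer can’t keep up.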
Is it possible to meaningfully continue? This is where you need to understand your business logic. In most cases with configuration, and some cases with data, it’s possible to continue with the last-known good version. This adds complexity, by introducing the behavior mode of running with that version, but that complexity may be worth the additional resilience. On the other hand, in a database that handles updates via operations (e.g. x = x + 1) or conditional operations (if x == 1 then y = y + x), continuing after skipping some records could cause arbitrary state corruption. In the latter case, the system must be designed (including its operational practices) to ensure the invariant that replicas only get records they understand. These kinds of invariants make the system less resilient, but are needed to avoid state divergence.
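The database case is easier to see with the two example operations written out. This is a deliberately tiny sketch: Record, format_version, and the two replay functions are hypothetical, with format_version standing in for “a record this replica’s software doesn’t understand”.

```rust
/// The two operations from the text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    IncrementX,      // x = x + 1
    AddXToYIfXIsOne, // if x == 1 then y = y + x
}

/// A replicated log record. A replica only understands records whose
/// format version is at most the version it supports.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Record {
    format_version: u32,
    op: Op,
}

#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct State {
    x: i64,
    y: i64,
}

#[derive(Debug, PartialEq)]
struct UnsupportedRecord {
    position: usize,
    format_version: u32,
}

fn apply(state: &mut State, op: Op) {
    match op {
        Op::IncrementX => state.x += 1,
        Op::AddXToYIfXIsOne => {
            if state.x == 1 {
                state.y += state.x;
            }
        }
    }
}

/// Stop at the first record the replica can't understand.
fn replay_strict(log: &[Record], max_version: u32) -> Result<State, UnsupportedRecord> {
    let mut state = State::default();
    for (position, record) in log.iter().enumerate() {
        if record.format_version > max_version {
            return Err(UnsupportedRecord { position, format_version: record.format_version });
        }
        apply(&mut state, record.op);
    }
    Ok(state)
}

/// "Resilient" replay: silently skip records the replica can't understand.
fn replay_skipping(log: &[Record], max_version: u32) -> State {
    let mut state = State::default();
    for record in log.iter().filter(|r| r.format_version <= max_version) {
        apply(&mut state, record.op);
    }
    state
}

/// Example log: increment, increment (written in a newer format), conditional add.
fn example_log() -> Vec<Record> {
    vec![
        Record { format_version: 1, op: Op::IncrementX },
        Record { format_version: 2, op: Op::IncrementX },
        Record { format_version: 1, op: Op::AddXToYIfXIsOne },
    ]
}
```

A replica that understands everything ends at x = 2, y = 0. A version-1 replica that skips the record it can’t read ends at x = 1, y = 1: not just one increment behind, but holding a y value that never existed on the primary. The strict replay refuses to continue at position 1 instead, which is less available and is exactly why the “replicas only get records they understand” invariant has to be enforced by the rest of the system.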
The bottom line is that error handling in systems isn’t a local property. The right way to handle errors is a global property of the system, and error handling needs to be built into the system from the beginning.
Getting this right is hard, and that’s where blast radius reduction techniques like cell-based architectures, independent regions, and shuffle sharding come in. Blast radius reduction means that if you do the wrong thing you affect less than all your traffic - ideally a small percentage of traffic. Blast radius reduction is humility in the face of complexity.
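For a feel of how one of these techniques limits the damage, here’s a rough sketch of shuffle sharding: each customer is deterministically assigned a small subset of workers, so a customer who triggers a crash can only take down their own subset. The function names and the hash-ranking scheme are my own illustration, not any particular production implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick `shard_size` distinct workers (by index) out of `worker_count` for a
/// customer. Each worker is ranked by a hash of (customer, worker) and the
/// lowest-ranked ones are kept, so the same customer always gets the same set.
/// Note: `DefaultHasher`'s algorithm is not guaranteed to be stable across
/// Rust releases; a real system would pin a specific hash function.
fn shuffle_shard(customer_id: &str, worker_count: usize, shard_size: usize) -> Vec<usize> {
    assert!(shard_size <= worker_count, "shard cannot be larger than the fleet");
    let mut workers: Vec<usize> = (0..worker_count).collect();
    workers.sort_by_key(|&worker| {
        let mut hasher = DefaultHasher::new();
        (customer_id, worker).hash(&mut hasher);
        hasher.finish()
    });
    workers.truncate(shard_size);
    workers.sort_unstable();
    workers
}

/// Number of distinct shards available: "worker_count choose shard_size".
/// Fine for small fleets; the intermediate products can overflow for large ones.
fn shard_combinations(worker_count: u64, shard_size: u64) -> u64 {
    if shard_size > worker_count {
        return 0;
    }
    let k = shard_size.min(worker_count - shard_size);
    // After i steps the accumulator holds C(worker_count, i), so each division is exact.
    (0..k).fold(1, |acc, i| acc * (worker_count - i) / (i + 1))
}
```

With 8 workers and shards of 2 there are shard_combinations(8, 2) = 28 possible shards, so a customer whose requests crash both of their workers fully overlaps with only the customers who happen to share that exact pair.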
Footnotes
1. panic isn’t necessarily a crash, but it’s close enough for our purposes here. If you’d like to explain the difference to me, feel free.
2. […] unwrap case explicit in the code (the programmer can see that this line has “succeed or die” behavior, entirely locally on this one line of code), and prevents action-at-a-distance behavior (which silently continuing with a NULL pointer could cause). What Rust doesn’t do perfectly here is make this explicit enough. Some suggested that unwrap should be called or_panic, which I like. Others suggested lints like clippy should be more explicit about requiring unwrap to come with some justification, which may be helpful in some code bases. Overall, I’d rather be writing Rust than C here.