Marc's Blog

About Me

My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in machining, welding, cooking and skiing.

I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I work on databases, serverless, and serverless databases. Before that, I worked on EC2 and EBS.
All opinions are my own.

Links

My Publications and Videos
@marcbrooker on Mastodon @MarcJBrooker on Twitter

What Now? Handling Errors in Large Systems

More options means more choices.

Cloudflare’s deep postmortem for their November 18 outage triggered a ton of online chatter about error handling, caused by a single line in the postmortem:

.unwrap()

If you’re not familiar with Rust, you need to know about Result, an enum that contains either a successful result or an error. unwrap says basically “return the successful result if there is one, otherwise crash the program”¹. You can think of it like an assert.
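For readers who want to see it concretely, here is a minimal sketch of the two styles. The function names (`parse_port`, `port_or_panic`, `port_or_default`) are made up for illustration and have nothing to do with Cloudflare's code.

```rust
use std::num::ParseIntError;

/// Parses a port number from text. The failure case is part of the return type.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    text.trim().parse::<u16>()
}

/// "Succeed or die": panics if `text` is not a valid port.
fn port_or_panic(text: &str) -> u16 {
    parse_port(text).unwrap()
}

/// Handles the error explicitly by falling back to a caller-supplied default.
fn port_or_default(text: &str, default: u16) -> u16 {
    match parse_port(text) {
        Ok(port) => port,
        Err(_) => default,
    }
}
```

Both versions are a single visible decision at the call site. Which one is right depends on what the rest of the system does when this process goes away.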

There’s a ton of debate about whether asserts are good in production², but most are missing the point. Quite simply, this isn’t a question about a single program. It’s not a local property. Whether asserts are appropriate for a given component is a global property of the system, and the way it handles data.

Let’s play a little error handling game. For each scenario, vote ✅ if you think crashing the process or server is appropriate, and ❌ if you don’t. Then compare with my vote and justification.

One of ten web servers behind a load balancer encounters uncorrectable memory errors, and takes itself out of service.

My vote: ✅

Uncorrectable memory errors are independent, and do not depend on user-provided content. In the presence of bad memory, it's impossible for a program to proceed safely. Taking the machine out of service is the safest course of action.

One of ten multi-threaded application servers behind a load balancer encounters a null pointer in business logic while processing a customer request.

My vote: ❌

Customer requests triggering bugs in business logic isn't a good reason to bring the whole server down. Instead, fail that particular request (returning an HTTP 5xx error), and continue with other user requests. In approaches like Erlang, or even Lambda, it may be the right approach to crash the whole application in response to a bad request, because this crash is handled at a higher layer in the architecture. This is also why I prefer languages like Rust and Java to languages like C and C++ for services: the ability to continue after a `NullPointerException` or getting an `Option::None` is much better than the ability to continue after (say) a segfault. It is possible to write C that's safe in all the same cases, but the explicit handling of errors in Rust makes it much easier.
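A rough sketch of "fail the request, not the server" in Rust. Everything here (`RequestError`, `price_for`, `serve`, the price table) is hypothetical, and a real service would sit behind an HTTP framework; the point is only that an unexpected `None` becomes a 500 for one request.

```rust
use std::collections::HashMap;

/// A hypothetical error for bugs hit while serving a single request.
#[derive(Debug, PartialEq)]
enum RequestError {
    Internal(String),
}

/// Hypothetical business logic: the code "knows" every SKU has a price,
/// but a missing entry surfaces as `Err` instead of a panic.
fn price_for(prices: &HashMap<String, u32>, sku: &str) -> Result<u32, RequestError> {
    prices
        .get(sku)
        .copied()
        .ok_or_else(|| RequestError::Internal(format!("no price for {}", sku)))
}

/// Maps each request's outcome to an HTTP-style status and keeps serving.
fn serve(prices: &HashMap<String, u32>, skus: &[&str]) -> Vec<u16> {
    skus.iter()
        .map(|sku| match price_for(prices, sku) {
            Ok(_) => 200,
            Err(_) => 500, // fail this request only; later requests still run
        })
        .collect()
}
```

With a price table that only contains "widget", serving the requests widget, gadget, widget yields 200, 500, 200 rather than taking down the process at the second request.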

One database replica receives a logical replication record from the primary that it doesn't know how to process.

My vote: ✅

In general, replicas in this position can't continue, because applying future updates can cause arbitrary state corruption, and return arbitrarily wrong results to clients. For a successful system, ensuring the primary doesn't send bad records to replicas must be a system invariant.

One web server receives a global configuration file from the control plane that appears malformed.

My vote: ❌

The right answer here will vary based on the needs of the system, but in most systems the best design would be for the server to continue with the last known good version of configuration, while alerting an operator that the latest version can't be processed. This is subtly different from the previous case: configuration doesn't tend to have the same consistency requirements as state, and tends to be entirely replaced with each new version, and so treating configuration currency as an invariant of the system reduces resilience unnecessarily.
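One way the last-known-good pattern could look, as a sketch under stated assumptions. The `Config` struct, the one-line `max_connections=<number>` format, and the `alerts` list standing in for paging an operator are all invented for illustration.

```rust
/// Hypothetical configuration with a single tunable.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    max_connections: u32,
}

/// Parses the made-up format `max_connections=<number>`.
fn parse_config(text: &str) -> Result<Config, String> {
    let value = text
        .trim()
        .strip_prefix("max_connections=")
        .ok_or_else(|| format!("unrecognized config: {:?}", text))?;
    let max_connections = value
        .parse::<u32>()
        .map_err(|e| format!("bad max_connections: {}", e))?;
    Ok(Config { max_connections })
}

/// Holds the active configuration plus alerts for an operator.
struct ConfigHolder {
    active: Config,
    alerts: Vec<String>, // stand-in for a real alerting system
}

impl ConfigHolder {
    fn new(initial: Config) -> Self {
        ConfigHolder { active: initial, alerts: Vec::new() }
    }

    /// Switches to the new version if it parses. Otherwise keeps the
    /// last known good version and records an alert. Returns whether
    /// the new version was applied.
    fn update(&mut self, text: &str) -> bool {
        match parse_config(text) {
            Ok(config) => {
                self.active = config;
                true
            }
            Err(reason) => {
                self.alerts.push(reason);
                false
            }
        }
    }
}
```

The cost is visible in the struct: the server now has a "running on stale configuration" mode that somebody has to monitor and test.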

One web server fails to write its log file because of a full disk.

My vote: ❌

It may seem like this is an uncorrelated condition, and it could be. The local log rotation agent could have crashed, for example. But it also could be because of a global condition, like a prior deployment of a bad log rotation configuration, or an ongoing load spike. Unless there are specific requirements (e.g. legal requirements) for log retention, it's likely best to continue and inform an operator.

There are three unifying principles behind my answers here.

Are failures correlated? If the decision is a local one that’s highly likely to be uncorrelated between machines, then crashing is the cleanest thing to do. Crashing has the advantage of reducing the complexity of the system, by removing the “working in degraded mode” state. On the other hand, if failures can be correlated (including by adversarial user behavior), it’s best to design the system to reject the cause of the errors and continue.

Can they be handled at a higher layer? This is where you need to understand your architecture. Traditional web service architectures can handle low rates of errors at a higher layer (e.g. by replacing instances or containers as they fail load balancer health checks using AWS Autoscaling), but can’t handle high rates of crashes (because they are limited in how quickly instances or containers can be replaced). Fine-grained architectures, starting with Lambda-style serverless all the way to Erlang’s approach, are designed to handle higher rates of errors, and crashing rather than continuing is appropriate in more cases.

Is it possible to meaningfully continue? This is where you need to understand your business logic. In most cases with configuration, and some cases with data, it’s possible to continue with the last-known good version. This adds complexity, by introducing the behavior mode of running with that version, but that complexity may be worth the additional resilience. On the other hand, in a database that handles updates via operations (e.g. x = x + 1) or conditional operations (if x == 1 then y = y + x), continuing after skipping some records could cause arbitrary state corruption. In the latter case, the system must be designed (including its operational practices) to ensure the invariant that replicas only get records they understand. These kinds of invariants make the system less resilient, but are needed to avoid state divergence.
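To make the divergence concrete, here is a toy model, not any real database's replication protocol. The record types, the `State` struct, and the idea that `DoubleX` is a record only newer software understands are all assumptions for illustration.

```rust
/// Hypothetical replication records over a two-variable state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Record {
    /// x = x + amount
    AddToX(i64),
    /// if x == expected then y = y + x
    AddXToYIfXEquals(i64),
    /// x = x * 2 (assumed to be understood only by newer software)
    DoubleX,
}

#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct State {
    x: i64,
    y: i64,
}

/// Applies a record the way the primary (current software) would.
fn apply_current(state: &mut State, record: Record) {
    match record {
        Record::AddToX(amount) => state.x += amount,
        Record::AddXToYIfXEquals(expected) => {
            if state.x == expected {
                state.y += state.x;
            }
        }
        Record::DoubleX => state.x *= 2,
    }
}

/// Applies a record on an old replica, which cannot process `DoubleX`.
fn apply_old(state: &mut State, record: Record) -> Result<(), String> {
    match record {
        Record::DoubleX => Err(format!("unknown record: {:?}", record)),
        known => {
            apply_current(state, known);
            Ok(())
        }
    }
}

/// The primary's view after applying the whole log.
fn replay_primary(log: &[Record]) -> State {
    let mut state = State::default();
    for record in log {
        apply_current(&mut state, *record);
    }
    state
}

/// Old replica that stops at the first record it does not understand.
fn replay_halting(log: &[Record]) -> Result<State, String> {
    let mut state = State::default();
    for record in log {
        apply_old(&mut state, *record)?;
    }
    Ok(state)
}

/// Old replica that skips unknown records and carries on.
fn replay_skipping(log: &[Record]) -> State {
    let mut state = State::default();
    for record in log {
        let _ = apply_old(&mut state, *record); // error deliberately ignored
    }
    state
}
```

For the log AddToX(1), DoubleX, AddXToYIfXEquals(2), the primary ends at x = 2, y = 2. The skipping replica ends at x = 1, y = 0 and will happily serve that to clients. The halting replica refuses at the second record, which is the behavior I voted for above.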

The bottom line is that error handling in systems isn’t a local property. The right way to handle errors is a global property of the system, and error handling needs to be built into the system from the beginning.

Getting this right is hard, and that’s where blast radius reduction techniques like cell-based architectures, independent regions, and shuffle sharding come in. Blast radius reduction means that if you do the wrong thing you affect less than all your traffic, ideally a small percentage of traffic. Blast radius reduction is humility in the face of complexity.
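As a very rough illustration of the cell idea, here is a sketch that pins each customer to one cell by hashing a customer ID. The routing rule and the names `cell_for` and `affected_fraction` are assumptions for the example; real cell routers are more involved, and this is not shuffle sharding. It assumes `cell_count` is greater than zero and the customer list is non-empty.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a customer to one of `cell_count` cells (hypothetical routing rule).
/// Panics if `cell_count` is zero.
fn cell_for(customer_id: &str, cell_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    customer_id.hash(&mut hasher);
    hasher.finish() % cell_count
}

/// Fraction of the given customers routed to `failed_cell`.
fn affected_fraction(customers: &[&str], cell_count: u64, failed_cell: u64) -> f64 {
    let affected = customers
        .iter()
        .filter(|&&customer| cell_for(customer, cell_count) == failed_cell)
        .count();
    affected as f64 / customers.len() as f64
}
```

If one cell makes the wrong error handling choice and falls over, only the customers mapped to that cell see it. With customers spread evenly over N cells, that is roughly 1/N of traffic.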

Footnotes

¹ Yes, I know a panic isn’t necessarily a crash, but it’s close enough for our purposes here. If you’d like to explain the difference to me, feel free.
² And a ton of debate about whether Rust helped here. I think Rust does two things very well in this case: it makes the unwrap case explicit in the code (the programmer can see that this line has “succeed or die” behavior, entirely locally on this one line of code), and prevents action-at-a-distance behavior (which silently continuing with a NULL pointer could cause). What Rust doesn’t do perfectly here is make this explicit enough. Some suggested that unwrap should be called or_panic, which I like. Others suggested lints like clippy should be more explicit about requiring unwrap to come with some justification, which may be helpful in some code bases. Overall, I’d rather be writing Rust than C here.
