Marc's Blog

About Me

My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in machining, welding, cooking and skiing.

I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I work on databases, serverless, and serverless databases. Before that, I worked on EC2 and EBS.
All opinions are my own.

Links

My Publications and Videos
@marcbrooker on Mastodon
@MarcJBrooker on Twitter

The Fundamental Mechanism of Scaling

It's not Paxos, unfortunately.

A common misconception among people picking up distributed systems is that replication and consensus protocols—Paxos, Raft, and friends—are the tools used to build the largest and most scalable systems. It’s obviously true that these protocols are important building blocks. They’re used to build systems that offer more availability, better durability, and stronger integrity than a single machine. At the most basic level, though, they don’t make systems scale.

Instead, the fundamental approach used to scale distributed systems is avoiding coordination: finding ways to make progress on work that doesn't require messages to pass between machines, between clusters of machines, between datacenters, and so on. The fundamental tool of cloud scaling is coordination avoidance.
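To make that concrete, here's a minimal sketch of the simplest form of coordination avoidance (my own toy example, not a description of any particular system): route each key to exactly one node, so handling a request never requires nodes to talk to each other. The `PartitionedStore` class and its node count are illustrative assumptions.

```python
import hashlib

class PartitionedStore:
    """Toy key-value store that avoids coordination by routing each key
    to exactly one owning node, so nodes never exchange messages."""

    def __init__(self, num_nodes):
        # One independent dict per "node"; in a real system these would
        # be separate machines.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _owner(self, key):
        # Hash the key to pick its owning node deterministically.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        # Exactly one node handles the write: no cross-node messages.
        self.nodes[self._owner(key)][key] = value

    def get(self, key):
        # Reads go to the same single owner.
        return self.nodes[self._owner(key)].get(key)

store = PartitionedStore(num_nodes=4)
store.put("user:42", "Ada")
print(store.get("user:42"))  # "Ada", served by one node out of four
```

Adding nodes to a design like this adds capacity without adding messages per request, which is the whole point.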

A Spectrum of Systems

With this in mind, we can build a kind of spectrum of the amount of coordination required in different system designs:

Coordinated These are the kinds of systems that use Paxos, Raft, chain replication, or some other protocol to make a group of nodes work closely together. The amount of work done by the system generally scales with both the offered work (W) and the number of nodes (N), something like O(N * W) (or, potentially, worse under some kinds of failures).

Data-dependent Coordination These systems break their workload up into uncoordinated pieces (like shards), but offer ways to coordinate across shards where needed. Probably the most common example in this category is the sharded database, which breaks data up into independent pieces, but then uses some kind of coordination protocol (such as two-phase commit) to offer cross-shard transactions or queries. Work done can vary between O(W) and O(N * W) depending on access patterns, customer behavior, and so on.

Leveraged Coordination These systems take a coordinated system and build a layer on top of it that can do many requests per unit of coordination. Generally, coordination is only needed to handle failures, scale up, redistribute data, or perform other similar management tasks. In the happy case, work done in these kinds of systems is O(W). In the bad case, where something about the work or environment forces coordination, they can change to O(N * W) (see Some risks of coordinating only sometimes for more). Despite this risk, this is a rightfully popular pattern for building scalable systems.

Uncoordinated These are the kinds of systems where work items can be handled independently, without any need for coordination. You might think of them as embarrassingly parallel, sharded, partitioned, geo-partitioned, or one of many other ways of breaking up work. Uncoordinated systems scale the best. Work is always O(W). The sketch below contrasts this end of the spectrum with the coordinated one.
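To put rough numbers on the two ends of that spectrum, here's a toy message-counting model (my own simplification, not a description of any real protocol): a fully replicated design touches every node on every request, while a sharded design touches one.

```python
def coordinated_messages(num_nodes, num_requests):
    """Fully replicated design: every request is broadcast to all nodes
    and acknowledged, so total work grows like O(N * W)."""
    per_request = 2 * num_nodes       # broadcast + acknowledgements
    return per_request * num_requests

def uncoordinated_messages(num_nodes, num_requests):
    """Sharded design: each request touches exactly one node,
    so total work grows like O(W), independent of N."""
    return num_requests

W = 1_000_000
for n in (3, 9, 27):
    print(f"N={n:>2}  coordinated={coordinated_messages(n, W):>11,}  "
          f"uncoordinated={uncoordinated_messages(n, W):>9,}")
```

The data-dependent and leveraged cases land between these two lines, depending on how often a request crosses shards or forces the coordination layer onto the hot path.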

This is only one cut through a complex space, and some systems don't quite fit[1]. I think it's still useful, though, because by building a hierarchy of coordination we can think clearly about the places in our systems that scale the best and worst. The closer a system is to the uncoordinated end, the better it will scale, in general.

Other useful tools

There are many other ways to approach this question of when coordination is necessary, and how that influences scale.

The CAP theorem[2], along with a rich tradition of other impossibility results[3], places limits on the kinds of things systems can do (and, most importantly, the kinds of things they can offer to their clients) without needing coordination. If you want to get into the details there, the breakdown in Figure 2 of Highly Available Transactions: Virtues and Limitations is pretty clear. I like it because it shows us both what is possible, and what isn't.

The CALM theorem is very useful, because it provides a clear logical framework for deciding whether particular programs can be run without coordination, and something of a path for constructing programs that are coordination-free. If you're going to read just one distributed systems paper this year, you could do a lot worse than Keeping CALM.
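As a tiny illustration of the monotone/coordination-free connection CALM draws (my example, not one from the paper): a grow-only set only ever adds to its output, so replicas can accept writes independently and merge later, in any order.

```python
class GSet:
    """Grow-only set: a monotone structure in the CALM sense.
    Replicas accept adds independently; merging is just set union,
    which is commutative, associative, and idempotent."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)        # the output only ever grows

    def merge(self, other):
        self.items |= other.items   # order of merges doesn't matter

# Two replicas take writes without talking to each other, then merge.
a, b = GSet(), GSet()
a.add("x"); a.add("z")
b.add("y")
a.merge(b)
b.merge(a)
assert a.items == b.items == {"x", "y", "z"}

# A non-monotone question like "is 'w' definitely absent?" can flip its
# answer as more inputs arrive; that's where coordination creeps back in.
```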

Harvest and Yield is another way to approach the problem, by thinking about when systems can return partial results[4]. This is obviously a subtle topic, because the real question is when your clients and customers can accept partial results, and how confused they will be when they get them. At the extreme end, you start expecting clients to write code that can handle any subset of the full result set. Sometimes that's OK, sometimes it sends them down the same rabbit hole that CALM takes you down. Probably the hardest part for me is that partial-result systems are hard to test and operate, because there's a kind of mode switch between partial and complete results, and modes make life difficult. There's also the minor issue that there are 2^N subsets of results, and testing them all is often infeasible. In other words, this is a useful tool, but it's probably best not to expose your clients to the full madness it leads to.
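A sketch of the harvest idea (the Shard class and its search method are made up for illustration): fan out to every shard, tolerate failures, and report what fraction of the full answer the caller actually got.

```python
class Shard:
    """Toy shard holding a slice of the data; search() may fail."""
    def __init__(self, docs, healthy=True):
        self.docs, self.healthy = docs, healthy

    def search(self, term):
        if not self.healthy:
            raise ConnectionError("shard unavailable")
        return [d for d in self.docs if term in d]

def fan_out_query(shards, term):
    """Query every shard, skip the ones that fail, and report 'harvest':
    the fraction of shards that contributed to the answer."""
    results, answered = [], 0
    for shard in shards:
        try:
            results.extend(shard.search(term))
            answered += 1
        except ConnectionError:
            continue                  # degrade to a partial result
    return results, answered / len(shards)

shards = [Shard(["cat", "catalog"]), Shard(["category"], healthy=False), Shard(["dog"])]
results, harvest = fan_out_query(shards, "cat")
print(results, harvest)   # ['cat', 'catalog'] 0.66... -- a partial answer
```

Every caller now has to decide what to do when harvest is less than 1.0, which is exactly the mode switch (and the 2^N testing problem) described above.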

Finally, we can think about the work that each node needs to do. In a coordinated system, there are generally one or more nodes that do O(W) work. In an uncoordinated system, the ideal node does O(W/N) work, which turns into O(1) work because N is proportional to W: each node is provisioned for a fixed amount of work, so the fleet grows in step with the load.
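The arithmetic behind that last step, under the assumption that each node is sized for a fixed amount of work (the capacity number here is made up):

```python
PER_NODE_CAPACITY = 10_000   # assumed fixed work each node is sized for

def nodes_needed(total_work):
    # Grow the fleet in proportion to offered work: N ~ W / capacity.
    return max(1, -(-total_work // PER_NODE_CAPACITY))   # ceiling division

for w in (10_000, 1_000_000, 100_000_000):
    n = nodes_needed(w)
    print(f"W={w:>11,}  N={n:>6,}  per-node work={w / n:>9,.0f}")

# Per-node work stays pinned at the capacity we chose: O(W/N) = O(1).
```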

Footnotes

  1. Like systems that coordinate heavily on writes but mostly avoid coordination on reads. CRAQ is one such system, and a paper that helped me fall in love with distributed systems. So clever, and so simple once you understand it.
  2. Best described by Brewer and Lynch.
  3. See, for example, Nancy Lynch’s 1989 paper A Hundred Impossibility Proofs for Distributed Computing. If there were a hundred of these in 1989, you can imagine how many there are now, 32 years later. Wow, 1989 was 32 years ago. Huh.
  4. I wrote a post about it back in 2014.