A common misconception among people picking up distributed systems is that replication and consensus protocols—Paxos, Raft, and friends—are the tools used to build the largest and most scalable systems. It’s obviously true that these protocols are important building blocks. They’re used to build systems that offer more availability, better durability, and stronger integrity than a single machine. At the most basic level, though, they don’t make systems scale.
Instead, the fundamental approach used to scale distributed systems is avoiding coordination: finding ways to make progress on work that doesn’t require messages to pass between machines, between clusters of machines, between datacenters, and so on. The fundamental tool of cloud scaling is coordination avoidance.
A Spectrum of Systems
With this in mind, we can build a kind of spectrum of the amount of coordination required in different system designs:
Coordinated These are the kinds of systems that use Paxos, Raft, chain replication, or some other protocol to make a group of nodes work closely together. The amount of work done by the system generally scales with the offered work (W) and the number of nodes (N), something like O(N * W) (or, potentially, worse under some kinds of failures).
Data-dependent Coordination These systems break their workload up into uncoordinated pieces (like shards), but offer ways to coordinate across shards where needed. Probably the most common type of system in this category is sharded databases, which break data up into independent pieces, but then use some kind of coordination protocol (such as two-phase commit) to offer cross-shard transactions or queries (see the sketch after this list). Work done can vary between O(W) and O(N * W) depending on access patterns, customer behavior and so on.
Leveraged Coordination These systems take a coordinated system and build a layer on top of it that can do many requests per unit of coordination. Generally, coordination is only needed to handle failures, scale up, redistribute data, or perform other similar management tasks. In the happy case, work done in these kinds of systems is O(W). In the bad case, where something about the work or environment forces coordination, they can change to O(N * W) (see Some risks of coordinating only sometimes for more). Despite this risk, this is a rightfully popular pattern for building scalable systems.
Uncoordinated These are the kinds of systems where work items can be handled independently, without any need for coordination. You might think of them as embarrassingly parallel, sharded, partitioned, geo-partitioned, or one of many other ways of breaking up work. Uncoordinated systems scale the best. Work is always O(W).
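To make the data-dependent middle of this spectrum a little more concrete, here’s a minimal sketch of how a sharded store might route writes: operations that touch a single shard proceed without any coordination, while operations that span shards fall back to a two-phase commit across the shards they touch. The shard and store types here are purely illustrative, not a real database’s API.

```python
# Sketch of data-dependent coordination in a sharded store.
# Shard, routing, and two-phase-commit details are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Shard:
    data: dict = field(default_factory=dict)

    def apply(self, writes):
        self.data.update(writes)

    def prepare(self, writes):
        # A real system would validate and durably log a prepare record here.
        return True

    def commit(self, writes):
        self.data.update(writes)

    def abort(self, writes):
        pass

class ShardedStore:
    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def shard_for(self, key):
        return hash(key) % len(self.shards)

    def write(self, writes):  # writes: {key: value}
        by_shard = {}
        for key, value in writes.items():
            by_shard.setdefault(self.shard_for(key), {})[key] = value

        if len(by_shard) == 1:
            # Uncoordinated path: one shard, no cross-node protocol needed.
            (idx, shard_writes), = by_shard.items()
            self.shards[idx].apply(shard_writes)
            return True

        # Coordinated path: two-phase commit across every shard touched.
        prepared = all(self.shards[i].prepare(w) for i, w in by_shard.items())
        for i, w in by_shard.items():
            (self.shards[i].commit if prepared else self.shards[i].abort)(w)
        return prepared
```

The more of the workload that stays on the single-shard path, the closer the system stays to the uncoordinated end of this spectrum; workloads that routinely span shards push it back toward O(N * W).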
This is only one cut through a complex space, and some systems don’t quite fit1. I think it’s still useful, though, because by building a hierarchy of coordination we can think clearly about the places in our systems that scale the best and worst. The closer a system is to the uncoordinated end the better it will scale, in general.
Other useful tools
There are many other ways to approach this question of when coordination is necessary, and how that influences scale.
The CAP theorem2, along with a rich tradition of other impossibility results3, places limits on the kinds of things systems can do (and, most importantly, the kinds of things they can offer to their clients) without needing coordination. If you want to get into the details there, the breakdown in Figure 2 of Highly Available Transactions: Virtues and Limitations is pretty clear. I like it because it shows us both what is possible, and what isn’t.
The CALM theorem is very useful, because it provides a clear logical framework for whether particular programs can be run without coordination, and something of a path for constructing programs that are coordination free. If you’re going to read just one distributed systems paper this year, you could do a lot worse than Keeping CALM.
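As a toy illustration of the CALM idea (monotone programs can be computed without coordination), here’s a sketch of a grow-only set: replicas accept writes locally and merge state in any order, and because membership only ever goes from false to true, no replica needs to coordinate before answering. The replica class here is my own illustration, not something from the paper.

```python
# Sketch: a grow-only set, a monotone program in the CALM sense.
# Replicas accept writes locally and merge lazily; the outcome is the same
# regardless of message order or duplication, so no coordination is needed.

class GSetReplica:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)        # local write, no coordination

    def merge(self, other):
        self.items |= other.items   # union is commutative, associative, idempotent

    def contains(self, item):
        # Monotone query: once true, always true, so a stale answer is only
        # "not yet", never something that later has to be retracted.
        return item in self.items

# Two replicas diverge and then converge by merging in either order.
a, b = GSetReplica(), GSetReplica()
a.add("x"); b.add("y")
a.merge(b); b.merge(a)
assert a.items == b.items == {"x", "y"}
```

A non-monotone question like “is x the only element?” is exactly the kind that forces coordination, because a later add on some other replica can invalidate the answer.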
Harvest and Yield is another way to approach the problem, by thinking about when systems can return partial results4. This is obviously a subtle topic, because the real question is when your clients and customers can accept partial results, and how confused they will be when they get them. At the extreme end, you start expecting clients to write code that can handle any subset of the full result set. Sometimes that’s OK, sometimes it sends them down the same rabbit hole that CALM takes you down. Probably the hardest part for me is that partial-result systems are hard to test and operate, because there’s a kind of mode switch between partial and complete results and modes make life difficult. There’s also the minor issue that there are 2^N subsets of results, and testing them all is often infeasible. In other words, this is a useful tool, but it’s probably best not to expose your clients to the full madness it leads to.
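To make the partial-results trade-off concrete, here’s a sketch of a scatter-gather query that reports its harvest (the fraction of shards that answered) alongside the results. The query_shard function is a stand-in for a real per-shard RPC; the interesting design question is what clients are supposed to do when harvest comes back less than 1.

```python
# Sketch: scatter-gather with partial results ("harvest", per Fox and Brewer).
# query_shard is a placeholder for a real per-shard call.

def scatter_gather(shards, query_shard):
    results, answered = [], 0
    for shard in shards:
        try:
            results.extend(query_shard(shard))
            answered += 1
        except TimeoutError:
            # Skip the slow or failed shard instead of failing the whole query.
            continue
    harvest = answered / len(shards) if shards else 1.0
    return results, harvest

# Callers have to decide what a harvest of, say, 0.75 means for them:
# probably fine for a search results page, probably not for a billing report.
```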
Finally, we can think about the work that each node needs to do. In a coordinated system, there are generally one or more nodes that do O(W) work. In an uncoordinated system, the ideal node does O(W/N) work, which turns into O(1) work because N grows in proportion to W: we add nodes as the offered work grows.
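A quick back-of-the-envelope version of that argument, with made-up numbers:

```python
# Per-node work when nodes are added in proportion to offered work.
# The numbers are invented; the point is that W/N stays constant.

for W in [1_000, 10_000, 100_000]:    # offered work, say requests per second
    N = W // 1_000                    # e.g. provision one node per 1,000 req/s
    print(f"W={W:>7} N={N:>3} per-node={W // N}")
# Per-node work is 1000 in every case: O(W/N) = O(1) when N scales with W.
```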
Footnotes