Avoiding coordination is the one fundamental thing that allows us to build distributed systems that out-scale the performance of a single machine1. When we build systems that avoid coordinating, we end up building components that make assumptions about what other components are doing. This, too, is fundamental. If two components can’t check in with each other after every single step, they need to make assumptions about the ongoing behavior of the other component.
One way to classify these assumptions is into optimistic and pessimistic assumptions. I find it very useful, when thinking through the design of a distributed system, to be explicit about each assumption each component is making, whether that assumption is optimistic or pessimistic, and what exactly happens if the assumption is wrong. The choice between pessimistic and optimistic assumptions can make a huge difference to the scalability and performance of systems.
I generally think of optimistic assumptions as ones that avoid or delay coordination, and pessimistic assumptions as ones that require or seek coordination. The optimistic assumption assumes it’ll get away with its plans. The pessimistic assumption takes the bull by the horns and makes sure it will.
To make this concrete, let’s consider some examples.
Example 1: Caches
Distributed caches almost always make assumptions about whether the data they are holding is changed or not. Unlike with CPUs2, distributed caches typically aren’t coherent, but we still want them to be eventually consistent. By eventually consistent we mean that if the write stream stops, the caches eventually all converge on containing the same data. In other words, inconsistencies are relatively short-lived.
Possibly the most common way of ensuring this property—that inconsistencies are short-lived—is with a time to live (TTL). This simply means that the cache only keeps items around for a certain fixed period of time. The TTL provides a strong3 upper bound on how stale an item can be. This is a simple, strong, and highly popular mechanism. It’s also a pessimistic one: the cache is doing extra work assuming that the item has changed. In systems with a low per-item write rate, that pessimistic assumption can be wrong much more often than it’s right.
One downside of the pessimistic approach TTL takes is that it means the cache empties when it can’t talk to the authority. This is unavoidable: caches simply can’t provide strongly bounded staleness (or any other strong recency guarantee) if they can’t reach the authority4. Thus the pessimistic TTL approach has a strong availability disadvantage: if a network partition or authority downtime lasts longer than the TTL, the cache hit rate will drop to zero.
Two more optimistic patterns are quite commonly used to address this situation (especially in DNS and networking systems). One approach is to synchronously try fetch the new item, but then optimistically continue to use the old one if that’s possible (optimistic because it’s making the optimistic assumption that the item hasn’t change). A subtly different approach is to asynchronously try fetch the new item, and use the old one until that can complete. These protocol seem very similar to TTL, but are deeply fundamentally different. They don’t offer strong recency or staleness guarantees, but can tolerate indefinite network partitions5.
Example 2: OCC
Optimistic concurrency control and its tradeoffs with pessimistic locking-based approaches is a classic topic (maybe the most classic topic) in distributed databases. I won’t try advance that debate here. Instead, to summarize: optimistic concurrency control is a way of implementing isolated (as in ACID I) transactions that assumes that other concurrent transactions don’t conflict, and detecting at the last moment if that assumption is wrong. Pessimistic approaches like the classic two-phase locking, on the other hand, do a whole lot of coordination based on the assumption that other transactions do conflict, and it’s worth detecting that early while there’s still time to avoid duplicate work and make smart scheduling decisions.
OCC systems, in general, coordinate less than pessimistic systems when their optimistic assumption is right, and more than pessimistic systems when the optimistic assumption is wrong.
Comparing these two is approaches is a hard enough first-order problem, but to complicate things further the choice between optimism and pessimism leads to a number of second-order problems too. For example, the number of contending transactions depends on the number of concurrent transactions, and the number of concurrent transactions depends on lock wait times in pessimistic systems and retry rates in optimistic systems. In both kinds of systems, this leads to a direct feedback loop between past contention and future contention.
Example 3: Leases
Leases are a kind of time-based lock widely used in distributed systems. In most systems, a lease is replacing a number of coordination steps. One component takes a lease, and then uses that lease as a license to multiple things without worrying that other components are doing conflicting things, or may disagree, or whatever. Freed from the worry about conflicts, the lease-holding component can avoid coordinating and go ahead at full speed.
Leases are an interesting blend of pessimism (I’m assuming other things are going to conflict with my work, so I’m going to stop them in their tracks) and optimism (I’m assuming I can go ahead without coordination for the next bounded period of time). If the pessimism is wrong, all the heartbeating and updating and storing of leases is wasted work. As is the time other components could have spent doing work which they wasted while waiting for the lease.
Conclusion
One way I like to reason about the behavior of systems is by writing sentences of the form “this component is assuming that…”
For our TTL example, we could write statements like:
These statements are a tool to help structure our thinking about the behavior of the system. The third one—the availability-staleness tradeoff—is especially powerful because its often a hidden assumption people make when choosing a strict TTL.
By coloring each assumption as pessimistic (coordination-requiring) or optimistic (coordination-avoiding), we can also structure our thinking about the best time to coordinate, and make sure we’re being consistent in our choices about when and why coordination is needed.
Footnotes