Is your system having scaling trouble? A bit too slow? Sending too much traffic to the database? Add a caching layer! After all, caches are a best practice and a standard way to build systems. What trouble could following a best practice cause?
Lots of trouble, as it turns out. In the context of distributed systems, caches are a powerful and useful tool. Unfortunately, applied incorrectly, caching can introduce some highly undesirable system behaviors: it can make your system unstable. Or worse, metastable. To understand why that is, we need to understand a bit about how systems scale.
Let’s start with the basics. Your system (hopefully) has some customers who send requests to it. Most often, you have lots of customers, and each one sends requests fairly infrequently. Those requests coming in from your customers are the offered load, generally measured in something like requests per second. Then, your system does some work on those requests, and eventually gives the results to some happy customers. The rate it does that is the goodput.
The number of requests inside your system, the concurrency, is related to the offered load and goodput. When offered load and goodput are equal, the concurrency varies a small amount, but is relatively stable. The amount of concurrency in your system depends on the offered load and the time it takes to handle each request (latency). So far, so good.
But there’s some bad news. The bad news is that latency isn’t really a constant. In most systems, and maybe all systems, it increases with concurrency. And concurrency increases with latency. Maybe you can see where this is going.
Most real systems like this have a congestive collapse mode, where they can’t get rid of requests as fast as they arrive, concurrency builds up, and the goodput drops, making the issue worse. You can use tools like Little’s law to think about those situations.
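To make that concrete, here is a toy calculation of my own (made-up numbers, not anything from a real system) that applies Little’s law, concurrency = offered load × latency, to a latency model where latency grows with concurrency. Below a tipping point the concurrency settles; above it, each trip around the loop makes things worse.

```python
# Toy model: latency grows with concurrency, and concurrency (by Little's law)
# is offered load times latency. All numbers are made up for illustration.

def latency_ms(concurrency, base_ms=10.0, per_request_ms=0.5):
    """Each in-flight request adds a little latency (queueing, contention, and so on)."""
    return base_ms + per_request_ms * concurrency

def settle(offered_load_per_ms, steps=200):
    """Iterate concurrency = load * latency(concurrency) and see where it ends up."""
    concurrency = 0.0
    for _ in range(steps):
        concurrency = offered_load_per_ms * latency_ms(concurrency)
    return concurrency

print(settle(0.5))   # light load: settles at about 7 requests in flight
print(settle(1.9))   # heavy load: still settles, but at about 380 in flight
print(settle(2.1))   # past the tipping point: concurrency grows without bound
```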
What does this have to do with caches?
The most common use of caches in distributed systems is to reduce load on a data store, like a database. When data is needed, you check the cache, if it’s not there, you go to the database and get the data, and stash it into the cache. That’s mostly good, because it reduces load on the database, and reduces latency.
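In code, that pattern (often called cache-aside) looks roughly like the sketch below. It’s only a sketch: `cache` and `database` stand in for whatever clients your system actually uses, and the TTL is arbitrary.

```python
# Cache-aside, roughly. `cache` and `database` are stand-ins for real clients.
def get(key, cache, database, ttl_seconds=300):
    value = cache.get(key)                   # 1. check the cache
    if value is not None:
        return value                         # hit: fast, and no load on the database
    value = database.read(key)               # 2. miss: go to the database
    cache.set(key, value, ttl=ttl_seconds)   # 3. stash it for next time
    return value
```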
What happens when the cache is empty? Well, latency is higher, and load on the backend database is higher. When latency is higher, concurrency is higher, and goodput may be lower. When load on the backend database is higher, its concurrency is higher, and goodput may be lower. In fact, the latter is very likely. After all, you put that cache in place to protect the backend database from all that load it can’t handle.
So our system has two stable loops. One’s a happy loop where the cache is full:
The other is a sad loop, where the cache is empty, and stays empty:
What’s interesting and important here is that these are both stable loops. Unless something changes, the system can run in either one of these modes forever. That’s good in the case of the good loop, but bad in the case of the bad loop. It’s a classic example - probably the most common one of all - of a metastable distributed system.
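You can see both loops in a toy model. This one is a deliberately crude sketch of my own, with made-up numbers: requests arrive at a fixed rate, misses go to a database with limited capacity, completed misses fill the cache, and cached entries expire after a TTL. The one important modeling choice is that when the database is pushed past its capacity, queues grow, requests time out, and nothing useful comes back to fill the cache.

```python
# A deliberately crude model of the two loops. All numbers are made up.
OFFERED_LOAD = 1000   # requests per second
KEYS = 10_000         # distinct keys, accessed uniformly
DB_CAPACITY = 100     # reads per second the database can complete before timing out
TTL = 600             # seconds a cached entry lives

def step(cached_fraction):
    """Advance one second; return the new fraction of keys in the cache."""
    misses = OFFERED_LOAD * (1 - cached_fraction)
    if misses <= DB_CAPACITY:
        filled = misses   # healthy: every miss completes and repopulates the cache
    else:
        filled = 0        # overloaded: requests time out, nothing gets cached
    expired = cached_fraction * KEYS / TTL
    return min(1.0, cached_fraction + (filled - expired) / KEYS)

for start in (1.0, 0.0):          # warm start, then cold start
    f = start
    for _ in range(3600):         # simulate an hour
        f = step(f)
    print(f"start {start:.0%} cached -> steady state around {f:.0%} cached")
```

Start it warm and it stays warm (settling near a 98% hit rate); start it cold and it stays cold, even though the offered load never changed.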
It gets worse
This problem is bad, and especially insidious for a couple of reasons that may not be obvious on the surface.
Load testing typically isn’t enough to kick a system in the good loop into the bad loop, and so may not show that the bad loop exists. This is for a couple of reasons. One is that caches love load, and typically behave better under high, predictable, well-behaved load than under normal circumstances. The other is that load tests typically test lots of load, instead of testing the bad pattern for caches, which is load with a different (and heavier-tailed) key frequency distribution from the typical one.
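If you do want a load test that exercises this failure mode, generating the right key mix matters more than generating sheer volume. Here’s one way to sketch it (my example, with arbitrary parameters): sample keys from a Zipf distribution, and compare a strongly skewed, cache-friendly mix against a flatter, heavier-tailed one.

```python
# Two load-test key mixes with different tail weights (arbitrary parameters).
import numpy as np

rng = np.random.default_rng(42)

# Cache-friendly traffic: heavily skewed toward a few hot keys.
typical = rng.zipf(a=2.0, size=1_000_000)
# Cache-hostile traffic: a flatter, heavier-tailed spread over many keys.
bad_day = rng.zipf(a=1.2, size=1_000_000)

for name, keys in (("typical", typical), ("bad day", bad_day)):
    hot = np.isin(keys, np.arange(1, 11)).mean()
    print(f"{name}: {hot:.0%} of requests go to the 10 hottest keys")
```

A fixed-size cache that easily absorbs the first mix can miss constantly on the second, at exactly the same request rate.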
Caches extract cacheability. What I mean by that is that the load that misses the cache is less cacheable than the load that hits the cache. So typically, systems end up with either a hierarchy of cache sizes (like a CPU), or with one big cache. But when that cache is empty, a lot of requests for the same key will go to the systems behind it. A cache could have helped there, but there isn’t one because it wouldn’t have helped in the happy case.
Caches are based on assumptions. Fundamentally, a cache assumes that there’s either some amount of temporal or spatial locality of access (i.e. if Alice is sending work now, she’ll probably be sending more work soon, so it’s efficient to keep Alice’s stuff on deck), or that the key distribution isn’t uniform (i.e. Bob sends work every second, Alice sends work every day, so it’s efficient to keep Bob’s stuff on deck and fetch Alice’s when we need it). These assumptions don’t tend to be rigorous, or enforced in any way. They may change in ways that are invisible to most approaches to monitoring.
But aren’t CPU caches good?
Yes, CPU caches are good. Our computers would be way slower without them.
Thinking about why CPU caches are good and (generally) immune to this problem is very instructive. It’s because of offered load. When you’re clicking away on your laptop, say designing a robot in CAD or surfing the web, you react to slowness by asking for less work. That means that slowness caused by empty caches reduces goodput, but also reduces offered load. The unbounded increase in concurrency doesn’t happen.
Good caches have feedback loops. Like back pressure, and limited concurrency. Bad caches are typically open-loop. This starts to give us a hint about how we may use caches safely, and points to some of the safe patterns for distributed systems caching. More on that later.
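As a taste of what that looks like, here’s a sketch of my own (one possible closed-loop pattern, not a prescription): put a hard bound on how many cache misses may be in flight against the database at once, and shed or degrade the rest instead of passing unbounded concurrency downstream. As before, `cache` and `database` are stand-in clients, and the limit of 50 is arbitrary.

```python
# Bounding concurrency on the miss path. `cache` and `database` are stand-ins,
# and the permit count is arbitrary.
import threading

class Overloaded(Exception):
    """Raised when the miss path is already at its concurrency limit."""

db_permits = threading.BoundedSemaphore(50)   # at most 50 concurrent database reads

def get(key, cache, database):
    value = cache.get(key)
    if value is not None:
        return value
    if not db_permits.acquire(blocking=False):
        # Backpressure: fail fast (or serve stale data) rather than adding
        # one more request to an already-struggling database.
        raise Overloaded(f"miss for {key!r} rejected: database at its limit")
    try:
        value = database.read(key)
        cache.set(key, value)
        return value
    finally:
        db_permits.release()
```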