I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our other serverless products. Before that, I worked on EC2 and EBS.

All opinions are my own.

"The mean is useless" is a commonly-repeated statement in the systems observation and monitoring world. As people correctly point out, the mean (or average^{1}) tends to hide information about outliers, tends to be optimistic for many metrics, and can even be wildly misleading in the presence of large outliers. That doesn't mean that the average is useless, just that you need to be careful about how you interpret it, and what you use it for.

All descriptive statistics are misleading, and potentially dangerous. The most prosaic reason for that is that they are summaries: by their nature they don't capture the entire reality of the data they are summarizing. There is no way for a single number to capture everything you need to know about a large set of numbers. Anscombe's quartet is the most famous illustration of this problem: four data sets that have very different graphs, but the same mean and variance in 𝑥 and 𝑦, and the same linear trend. Thanks to Alberto Cairo and Autodesk, there's an even more fun example: the datasaurus dozen^{2}.

There are other, more subtle, reasons that descriptive statistics are misleading too. One is that statistics in real-world computer systems change with time, and you can get very different results depending on how those changes in time align with when you sample and how long you average for. Point-in-time sampling can lead to completely missing some detail, especially when the sampling time is aligned to wall-clock time with no jitter.

In this example, we've got a machine that runs a periodic job (like a cron job) every minute, and it uses all the CPU on the box for a second. If we sample periodically, aligned to the minute boundary, we'll think the box has 100% CPU usage. If we sample periodically aligned to any other second, we'll think it's completely idle. If, instead, we sample every second and emit a per-minute summary, we'll get a mean of 1.7% usage, a median of 0%, and a 99th percentile of 100%. None of those really tell us what's going on. Graphing the time series helps in this case, punting the problem to our brain's ability to summarize graphs, but that's hard to do at scale. The darling of the monitoring crowd, histograms, don't really help here either^{3}. That's obviously a fantastically contrived example. Check out this presentation from Azul Systems for some real-world ones.
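A quick sketch of this effect (my own, with a made-up trace, not code from the post):

```python
# One week of per-second CPU samples from a hypothetical box: a cron job
# pegs the CPU for the first second of every minute, idle otherwise.
samples = [100.0 if s % 60 == 0 else 0.0 for s in range(7 * 24 * 3600)]

# Sample once a minute, aligned to the minute boundary: looks 100% busy.
aligned_mean = sum(samples[::60]) / len(samples[::60])

# Sample once a minute at second :30 instead: looks completely idle.
offset_mean = sum(samples[30::60]) / len(samples[30::60])

# Sample every second and summarize over the whole period.
ordered = sorted(samples)
mean = sum(samples) / len(samples)        # ~1.7%
median = ordered[len(samples) // 2]       # 0%
p99 = ordered[int(0.99 * len(samples))]   # 100%

print(aligned_mean, offset_mean, mean, median, p99)
```

Same data, five summaries, and none of them alone tells you "this box is pegged for one second out of every sixty".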

Ok, so descriptive statistics suck. For many operational tasks, medians and percentiles suck less than the mean. But that shouldn't be taken to imply that averages are useless.

The throughput of a serial system, how many items of work it can do in a period of time, is one over the mean latency. That changes when the system can do multiple work items at once, either due to pipelining (like CPUs, networks and storage) or due to true parallelism (again, like CPUs, networks, and storage), but mean latency remains the denominator.

Consider a serial system that processes requests in 1ms 99% of the time, and 1s 1% of the time. The mean latency is 10.99ms, and the throughput is 91 requests per second. What the monitoring people are saying when they talk about the mean being bad is that neither of those figures (10.99ms or 91 rps) tells you that 1% of requests are having a really bad time. That's all true. But both of those numbers are still very useful for capacity planning.

The mean throughput number, 91 requests per second in our example, allows us to compare our expected rate of traffic with the capacity of the system. If we're expecting 10 requests per second at peak this holiday season, we're good. If we're expecting 100, then we're in trouble. Once we know we're in capacity trouble we can react by adding a second processor (doubling throughput in theory), or by trying to reduce latency (probably starting with that 1000ms outlier). Just looking at our latency graphs doesn't tell us that.
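Working the arithmetic from this example through:

```python
# Serial system from the example: 99% of requests take 1 ms, 1% take 1 s.
mean_latency = 0.99 * 0.001 + 0.01 * 1.0   # 0.01099 s, i.e. 10.99 ms
throughput = 1.0 / mean_latency            # ~91 requests per second

# Capacity planning: compare expected peak traffic against throughput.
for expected_rps in (10, 100):
    ok = expected_rps <= throughput
    print(expected_rps, "rps:", "fine" if ok else "in trouble")
```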

Another place the mean is really useful is in the context of Little's Law: 𝐋=𝛌𝐖.

> the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system

There are a lot of reasons that this law is interesting and useful, but the biggest one for system operators is *concurrency* and how it relates to scale. In almost all computer systems concurrency is a limited resource. In thread-per-request (and process-per-request) systems the limit is often the OS or language scheduler. In non-blocking, evented, and green-thread systems limits include memory, open connections, the language scheduler, and backend limitations like database connections. In modern serverless systems like AWS Lambda, you can provision concurrency directly.

Like throughput, Little's law gives us a way to reason about the long-term capacity of a system, and how close we are to it. In large-scale distributed systems many of the limited resources can be difficult to measure directly, so these capacity measures are also useful in understanding the load on resources we can't observe.
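As a sketch of that reasoning (the traffic numbers and the pool size here are hypothetical, not from the post):

```python
# Little's law: L = lambda * W. A hypothetical service handling
# 1,000 requests/second at a mean latency of 50 ms:
lam = 1000.0   # mean arrival rate, requests per second
W = 0.050      # mean time in system, seconds
L = lam * W    # mean concurrency: 50 requests in flight on average

# If, say, a database connection pool caps concurrency at 64, this tells
# us our headroom even if in-flight requests are hard to measure directly.
headroom = 64 - L
print(L, headroom)
```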

It's very useful to build an intuition around Little's law, because it provides a handle onto some of the dynamic behaviors of computer systems. In real-world systems (often due to contention), latency (𝐖) tends to increase along with concurrency (𝐋), meaning that the actual reaction of 𝐋 to increasing arrival rate (𝛌) can be seriously non-linear. Similarly, timeout-and-retry leads the arrival rate to increase as latency increases, again leading to non-linear effects.
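One standard way to see that non-linearity (an illustration of my own, not from the post) is the textbook M/M/1 queue, where mean time in system is 𝐖 = 1/(μ − 𝛌):

```python
# M/M/1 queue: W = 1 / (mu - lam), so by Little's law
# L = lam * W = lam / (mu - lam). Concurrency grows non-linearly
# as the arrival rate approaches the service rate.
mu = 100.0                         # service rate, requests per second
for lam in (50.0, 80.0, 90.0, 99.0):
    W = 1.0 / (mu - lam)           # mean time in system, seconds
    L = lam * W                    # mean concurrency
    print(f"lam={lam:5.1f} rps  W={1000 * W:7.1f} ms  L={L:5.1f}")
```

Doubling the arrival rate from 50 to 99 rps takes mean concurrency from 1 to 99: a small change in load near capacity produces an enormous change in 𝐋 and 𝐖.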

Little's law isn't true of the other descriptive statistics. You can't plug in a percentile of 𝛌 or 𝐖 and expect to get a correct percentile of 𝐋. Except in special circumstances, it only works on the mean.
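A small simulation sketch (my own, reusing the hypothetical 1ms/1s latency mix from earlier) makes this concrete: the time-averaged concurrency matches 𝛌 times the *mean* of 𝐖, while 𝛌 times the 99th percentile of 𝐖 doesn't correspond to anything real:

```python
# Deterministic trace: one arrival every 10 ms (lambda = 100/s); every
# 100th request stays in the system for 1 s, the rest for 1 ms.
LAM = 100.0
N = 200_000

events, durations = [], []
for i in range(N):
    start = i / LAM
    dur = 1.0 if i % 100 == 0 else 0.001
    durations.append(dur)
    events.append((start, +1))          # request enters the system
    events.append((start + dur, -1))    # request leaves the system
events.sort()

# Time-weighted average number of requests in flight (L).
in_flight, last_t, area = 0, 0.0, 0.0
for t, delta in events:
    area += in_flight * (t - last_t)
    in_flight += delta
    last_t = t
L = area / last_t

mean_w = sum(durations) / N                # 10.99 ms
p99_w = sorted(durations)[int(0.99 * N)]   # 1 s
print(f"L = {L:.3f}, lam * mean(W) = {LAM * mean_w:.3f}")  # both ~1.099
print(f"lam * p99(W) = {LAM * p99_w:.0f}")  # 100: not a percentile of L
```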

Mean request size (or packet size, or response size, etc) is another extremely useful mean. It's useful precisely because of the way the mean is skewed by outliers. Remember that the mean is defined as the sum divided by the count: if you multiply it back by the count you can extract the sum. When it comes to storage, or even things like network traffic, that total sum is a very useful capacity measure. Percentiles, by their nature, are robust to outliers, but the measure you're actually interested in ("how much storage am I using?") may be driven by outliers.
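As a sketch (with hypothetical object sizes): the 99th percentile never sees the outliers, but the mean times the count recovers exactly the total you're paying to store.

```python
# 1,000 hypothetical stored objects: mostly 1 KiB, five 1 GiB outliers.
sizes = [1024] * 995 + [2**30] * 5

count = len(sizes)
mean = sum(sizes) / count
p99 = sorted(sizes)[int(0.99 * count)]

total = mean * count           # multiply the mean back by the count
print(p99)                     # 1024: robust to (i.e. blind to) outliers
print(total == sum(sizes))     # True: the sum the outliers dominate
```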

Graphs, percentiles, medians, maximums, and moments are all extremely useful tools if you're interested in monitoring systems. But I feel that, in their fervor to promote these tools, people have over-stated the case against the mean. In some quarters there even seems to be a religious fervor against the average, and immediate judgments of incompetence against anybody who uses it. That's unfortunate, because the average is a tool that's well-suited to some important tasks. Like all statistics, it needs to be used with care, but don't believe the anti-mean zealots (and, importantly, don't be mean).

- In this post I use *mean* and *average* more-or-less interchangeably, even though that isn't technically correct. You know what I mean.
- The paper on how the datasaurus dozen were made is worth reading.
- Histograms do, obviously, help in other cases, as do other tools like Box plots. Sometimes you have to graph and summarize data in multiple ways before finding the one that answers the question you need to answer. John Tukey's Exploratory Data Analysis is the classic book in that field.