
Histogram vs eCDF

Accumulation is a fun word.

Histograms are a rightfully popular way to present data like latency, throughput, object size, and so on. Histograms avoid some of the difficulties of picking a summary statistic, or group of statistics, which is hard to do right. I think, though, that there’s nearly always a better choice than histograms: the empirical cumulative distribution function (eCDF). To understand why, let’s look at an example, starting with the histogram [1].

This latency distribution is very strongly bimodal. It’s the kind of thing you might expect from a two-tiered cache: a local tier with very low latency, and a remote tier with latency in the 2 to 3ms range. Super common in systems and databases. The histogram illustrates that bimodality very well. It’s easy to see that the second mode is somewhere around 2.5ms. The next two questions on my mind would be: how much does each of these two spikes contribute? And where, exactly, is the first spike? In histogram form, it’s hard to answer these questions. The first we’d need to answer by doing some mental area-under-the-curve estimation, and the second is obscured by bucketing.
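
If you want to play along at home, here’s a rough sketch (Python, numpy, matplotlib) that fakes up a latency sample shaped like the one described above and plots its histogram. The mixture weights and latencies are made-up numbers chosen to match the description, not the real data behind these graphs.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Made-up stand-in for the measured data: ~70% fast local-tier hits,
# ~30% remote-tier hits in the 2-3ms range.
local = rng.lognormal(mean=np.log(0.1), sigma=0.3, size=70_000)   # ms
remote = rng.normal(loc=2.5, scale=0.2, size=30_000)              # ms
latencies_ms = np.concatenate([local, remote])

plt.hist(latencies_ms, bins=100)
plt.xlabel("latency (ms)")
plt.ylabel("count")
plt.show()
```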

Here’s the same data in eCDF form. You can think of it as the histogram summed up, or integrated, or accumulated [2]. The first thing you may notice is how easy it has become to see the relative contribution of our first and second mode. The first mode contributes around 70% of measurements. If this is a cache system, we immediately know that our cache hit rate is around 70%. We also know that the 65th percentile is very low, and the 75th is very high. In fact, we can read these percentile [3] values right off the graph by finding the 0.65 and 0.75 points on the Y axis, moving right until we hit the curve, and reading their value off the X axis. Magic!
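
In code, the eCDF is about as simple as it gets: sort the samples, then plot each one against its rank divided by the total count. A minimal sketch, reusing the synthetic latencies_ms array from the sketch above (any 1-D array of measurements would do):

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(samples):
    """X and Y points of the empirical CDF: y[i] is the fraction of
    samples less than or equal to x[i]."""
    x = np.sort(samples)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf(latencies_ms)
plt.step(x, y, where="post")
plt.xlabel("latency (ms)")
plt.ylabel("fraction of samples")

# Reading percentiles off the curve: find where Y first reaches the target
# level, then look up the corresponding X (latency) value.
p65 = x[np.searchsorted(y, 0.65)]
p75 = x[np.searchsorted(y, 0.75)]
```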

The second question, about the location of the first spike, can be answered by zooming in. That works because the eCDF, unlike the histogram, doesn’t require bucketing, so we can zoom around in X and Y as much as we like without changing the shape of the curve. Say, for example, we wanted to look at the tail in more detail. Let’s zoom in on the top right.

Again, we can easily read high percentiles right off the graph, and don’t have to worry about how changing bucket widths would change what we see.
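
With the eCDF from the sketch above, that zoom is nothing more than restricting the axis limits; the zoom window here is a made-up choice.

```python
import matplotlib.pyplot as plt

# Reusing x, y from the ecdf() sketch above: zoom in on the top-right
# corner of the curve to inspect the tail. No re-bucketing needed.
plt.step(x, y, where="post")
plt.xlim(2.0, x.max())   # illustrative zoom window
plt.ylim(0.95, 1.0)
plt.xlabel("latency (ms)")
plt.ylabel("fraction of samples")
plt.show()
```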

I believe that for nearly all purposes in systems design and operations, eCDFs are a better choice than histograms for presenting data to humans.

What other cool stuff can eCDFs do? Another cool thing eCDFs make easy is generating random numbers from a measured distribution. Remember how we could go up the Y axis to find the value of different percentiles? Computers can do that too: generate a random number between 0 and 1, and then “read off” the X axis value. I’ll leave the exercise of doing that efficiently and accurately to the reader.
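
Here’s one way that could look, under the same assumptions as the earlier sketches: draw uniform values on the Y axis and map each one to the sample at that quantile. Indexing straight into the sorted samples, as below, is equivalent to resampling with replacement; interpolating between neighbouring samples would give smoother output.

```python
import numpy as np

def sample_from_measurements(samples, n, rng=None):
    """Generate n values distributed like the measured samples, by picking
    uniform random points on the eCDF's Y axis and reading off the X value."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.sort(samples)
    u = rng.uniform(0.0, 1.0, size=n)
    idx = np.minimum((u * len(x)).astype(int), len(x) - 1)
    return x[idx]

# e.g. feed a load generator with realistic latencies, using the
# synthetic latencies_ms from the earlier sketch:
fake_latencies = sample_from_measurements(latencies_ms, n=1000)
```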

Just as easy as finding the value of a percentile is finding the percentile of a value. This is less frequently useful in systems, but occasionally it is nice to be able to ask “how much of an outlier is this value?”. For example, say you’re building a filesystem that can only store files smaller than 1MiB. Take the eCDF of file sizes in the world, find 1MiB on the X axis, and the Y value will be the fraction of files your system will be able to store.
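
The lookup in that direction is a one-liner too. A sketch with purely illustrative file sizes:

```python
import numpy as np

def ecdf_at(samples, value):
    """The eCDF evaluated at `value`: the fraction of samples <= value."""
    x = np.sort(samples)
    return np.searchsorted(x, value, side="right") / len(x)

# Illustrative file sizes in bytes; real data would come from measurement.
file_sizes = np.array([12_000, 80_000, 300_000, 900_000, 2_500_000, 7_000_000])
storable = ecdf_at(file_sizes, 1024 * 1024)   # fraction of files <= 1MiB
```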

It’s trivial to transform an eCDF into a histogram, by bucketing up first differences (f_{n+1} - f_n). You can’t go from the histogram to the eCDF so easily in general, because bucketing loses data.
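
Concretely: evaluate the eCDF at a set of bucket edges, take first differences, and you get exactly the per-bucket fractions a histogram over those buckets would report. A small sketch:

```python
import numpy as np

samples = np.array([0.1, 0.1, 0.2, 2.4, 2.5, 2.6])
edges = np.array([0.0, 1.0, 2.0, 3.0])

# eCDF evaluated at each bucket edge: the fraction of samples <= edge.
F = np.searchsorted(np.sort(samples), edges, side="right") / len(samples)

# First differences of the accumulated curve are the per-bucket fractions,
# and they match a histogram computed directly over the same buckets.
counts, _ = np.histogram(samples, bins=edges)
assert np.allclose(np.diff(F) * len(samples), counts)
```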

Why do you say eCDF and not just CDF? At least in my head, the CDF is the true cumulative distribution function of the underlying distribution, and the eCDF is an empirical estimate of it (related by the fundamental theorem of statistics). Whether you feel that’s a useful distinction depends on whether you think there is an underlying distribution that exists separately from our measurements of it (and whether you still think that in the face of the non-stationarity of nearly all systems).

Isn’t a histogram a data structure and not a graph? Some folks use histogram to mean the graph (as I do above), and some folks use it to mean a data structure designed for summarizing a stream of values. There are many flavors of these, but most have a set of buckets (exponentially spaced is common) and count incoming values into those buckets. The “real” eCDF is calculated directly from the samples themselves without bucketing, but can also be estimated from these histograms-as-data-structures. If you’re summarizing your data using one of these data structures, it’s nice to store as many buckets as feasible (and indeed many more than you’d show a human).
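
As a sketch of that estimate: accumulate the bucket counts and normalize, and you have the eCDF’s approximate value at each bucket boundary (where samples fall within a bucket is lost to the bucketing). The bucket layout and counts below are made up.

```python
import numpy as np

# Assumed histogram-as-data-structure: bucket edges plus a count per bucket.
edges = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0])   # ms
counts = np.array([6800, 200, 100, 2700, 200])

# Accumulate and normalize: F[i] estimates the fraction of samples <= edges[i].
F = np.concatenate([[0.0], np.cumsum(counts)]) / counts.sum()

# Linear interpolation between edges gives a usable approximate eCDF, e.g.:
approx_p95 = np.interp(0.95, F, edges)   # estimated 95th-percentile latency
```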

Footnotes

  1. I use histogram to also cover frequency polygon here, because most people don’t recognize the distinction. I don’t think it’s a particularly useful distinction anyway. You might also say ePDF.
  2. It is, after all, the cumulative distribution function. It’s nice when things say what they are.
  3. I say percentile here, but this is true for all quantiles. You can read the quartiles, deciles, quintiles, heptiles, etc. right off the graph in the same way.