<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Marc Brooker's Blog</title>
    <link>http://brooker.co.za/blog/</link>
    <atom:link href="http://brooker.co.za/blog/rss.xml" rel="self" type="application/rss+xml" />
    <description>Marc Brooker's Blog</description>
    <language>en-us</language>
    <pubDate>Tue, 13 Feb 2024 17:43:27 +0000</pubDate>
    <lastBuildDate>Tue, 13 Feb 2024 17:43:27 +0000</lastBuildDate>

    
    <item>
      <title>Better Benchmarks Through Graphs</title>
      <link>http://brooker.co.za/blog/2024/02/12/parameters.html</link>
      <pubDate>Mon, 12 Feb 2024 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2024/02/12/parameters</guid>
      <description>&lt;h1 id=&quot;better-benchmarks-through-graphs&quot;&gt;Better Benchmarks Through Graphs&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Isn&apos;t the ambiguity in the word &lt;em&gt;graphs&lt;/em&gt; fun?&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;em&gt;This is a blog post version of a talk I gave at the Northwest Database Society meeting last week. The &lt;a href=&quot;https://brooker.co.za/blog/resources/nwds_mbrooker_feb_2024.pdf&quot;&gt;slides are here&lt;/a&gt;, but I don’t believe the talk was recorded.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I believe that one of the things that’s holding back databases as an engineering discipline (and why so much remains stubbornly opinion-based) is a lack of good benchmarks, especially ones available at the design stage. The gold standard is designing for and benchmarking against real application workloads, but there are some significant challenges in achieving this ideal. One challenge&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; is that, as in any system with concurrency, &lt;em&gt;traces&lt;/em&gt; capture the behavior of the application running on another system, and it might have issued different operations in a different order running on this one (for example, think about how in most traces it’s hard to tell the difference between &lt;em&gt;application thinking&lt;/em&gt; and &lt;em&gt;application waiting for data&lt;/em&gt;, which could heavily influence results if we’re trying to understand the effect of speeding up the &lt;em&gt;waiting for data&lt;/em&gt; portion). Running real applications is better, but is costly and raises questions of access (not all customers, rightfully, are comfortable handing their applications over to their DB vendor).&lt;/p&gt;

&lt;p&gt;Industry-standard benchmarks like TPC-C, TPC-E, and YCSB exist. They’re widely used, because they’re easy to run, repeatable, and form a common vocabulary for comparing the performance of systems. On the other hand, these benchmarks are known to be poorly representative of real-world workloads. For the purposes of this post, mostly that’s because they’re &lt;em&gt;too easy&lt;/em&gt;. We’ll get to what that means later. First, here’s why it matters.&lt;/p&gt;

&lt;p&gt;Designing, optimizing, or improving a database system requires a lot of choices and trade-offs. Some of these are big (optimistic vs pessimistic, distributed vs single machine, multi-writer vs single-writer, optimizing for reads or writes, etc), but there are also thousands of small ones (“&lt;em&gt;how much time should I spend optimizing this critical section?&lt;/em&gt;”). We want benchmarks that will shine light on these decisions, allowing us to make them in a quantitative way.&lt;/p&gt;

&lt;p&gt;Let’s focus on just a few of the decisions the database system engineer makes: how to implement &lt;em&gt;atomicity&lt;/em&gt;, &lt;em&gt;isolation&lt;/em&gt;, and &lt;em&gt;durability&lt;/em&gt; in a distributed database. Three of the factors that matter there are transaction size (&lt;em&gt;how many rows?&lt;/em&gt;), locality (&lt;em&gt;is the same data accessed together all the time?&lt;/em&gt;), and coordination (&lt;em&gt;how many machines need to make a decision together?&lt;/em&gt;). Just across these three factors, the design that’s &lt;em&gt;best&lt;/em&gt; can vary widely.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/wsz_axes.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can think of these three factors as ones that define a space&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. At each point in this space, keeping other concerns constant, some design is &lt;em&gt;best&lt;/em&gt;. Our next challenge is generating synthetic workloads—fake applications—for each point of the space. Standard approaches to benchmarking sample this space sparsely, and the industry-standard ones do it extremely poorly. In the search for a better way, we can turn, as computer scientists so often do, to graphs.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/wsz_graph.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this graph, each row (or other object) in our database is a node, and the edges mean &lt;em&gt;transacted with&lt;/em&gt;. So two nodes are connected by a (potentially weighted) edge if they appear together in a transaction. We can then generate example transactions by taking a random walk through this graph of whatever length we need to get transactions of the right size.&lt;/p&gt;
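A minimal sketch of that walk (the adjacency-list representation and all names here are illustrative assumptions, not code from the talk):

```javascript
// Sketch: generate one synthetic transaction of up to `size` rows by
// taking a random walk over a "transacted with" graph, represented here
// (hypothetically) as an object mapping each row id to its neighbors.
function randomTransaction(adjacency, size, maxSteps = 1000) {
    const nodes = Object.keys(adjacency);
    let current = nodes[Math.floor(Math.random() * nodes.length)];
    const visited = new Set([current]); // rows touched by this transaction
    for (let step = 0; visited.size < size && step < maxSteps; step++) {
        const neighbors = adjacency[current];
        if (neighbors.length === 0) break; // dead end: finish early
        current = neighbors[Math.floor(Math.random() * neighbors.length)];
        visited.add(current); // re-visits are ignored by the Set
    }
    return Array.from(visited);
}

// A tiny example graph of five rows.
const graph = {
    a: ["b", "c"], b: ["a", "c"], c: ["a", "b", "d"],
    d: ["c", "e"], e: ["d"],
};
console.log(randomTransaction(graph, 3));
```

Weighted edges would just bias the neighbor choice; the step cap keeps the walk from spinning when a component is smaller than the requested transaction size.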

&lt;p&gt;The graph model seems abstract, but is immediately useful in allowing us to think about why some of the standard benchmarks are so easy. Here’s the graph of write-write edges for TPC-C &lt;em&gt;neworder&lt;/em&gt; (with one warehouse), for example.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/wsz_tpcc.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice how it has 10 disjoint islands. One thing this lets us see is that we could immediately partition this workload into 10 shards, without ever having to execute a distributed protocol for &lt;em&gt;atomicity&lt;/em&gt; or &lt;em&gt;isolation&lt;/em&gt;. That’s going to look flattering to a distributed database architecture. This graph-based view is generally a great way of thinking about the partitionability of workloads: partitioning is trying to draw a line through the graph that cuts as few edges as possible&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
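To make the disjoint-islands observation concrete, here is a sketch (the Map-of-Sets representation is an assumption, matching the demo code later in the post) that counts connected components; each component is a shard that never needs a cross-shard commit protocol:

```javascript
// Count the connected components of an undirected graph, given as a Map
// from node to a Set of neighbors. Each component could be placed on its
// own shard with no cross-shard transactions at all.
function connectedComponents(edges) {
    const seen = new Set();
    let components = 0;
    for (const start of edges.keys()) {
        if (seen.has(start)) continue;
        components += 1;
        const stack = [start]; // iterative depth-first search
        while (stack.length > 0) {
            const node = stack.pop();
            if (seen.has(node)) continue;
            seen.add(node);
            for (const neighbor of edges.get(node)) stack.push(neighbor);
        }
    }
    return components;
}

// Two islands: {0, 1, 2} and {3, 4}.
const g = new Map([
    [0, new Set([1, 2])], [1, new Set([0])], [2, new Set([0])],
    [3, new Set([4])], [4, new Set([3])],
]);
console.log(connectedComponents(g)); // → 2
```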

&lt;p&gt;If we’re comfortable that graphs are a good way of modelling this problem, and random walks over those graphs&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; are a good way to generate workloads with a particular shape, we can ask the next question: how do we generate graphs with the properties we want? Generating graphs with particular shapes is a classic problem, but one approach I’ve found particularly useful is based on &lt;a href=&quot;http://worrydream.com/refs/Watts-CollectiveDynamicsOfSmallWorldNetworks.pdf&quot;&gt;the small-world networks model&lt;/a&gt; from Watts and Strogatz&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. This model gives us a parameter $p$ which allows us to vary between &lt;em&gt;ring lattices&lt;/em&gt; (the simplest graph with a particular constant degree), and completely random graphs. Over the range of $p$, long-range connections form across broad areas of the graph, which seem to correlate very well with the &lt;em&gt;contention&lt;/em&gt; patterns we’re interested in exploring.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/wsz_ws.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That gives us two of the parameters we’re interested in: transaction size is set by the length of the random walks we do, and coordination by adjusting $p$. We haven’t yet solved &lt;em&gt;locality&lt;/em&gt;. In our experiments, locality is closely related to &lt;em&gt;degree distribution&lt;/em&gt;, which the Watts-Strogatz model doesn’t control very well&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. We can easily control the central tendency of that distribution (by setting the initial degree of the ring lattice we started from), but can’t really simulate the outliers in the distribution that model things like &lt;em&gt;hot keys&lt;/em&gt;.&lt;/p&gt;
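A degree histogram makes this visible. A quick sketch (using the same node-to-Set-of-neighbors shape as the demo code later in the post):

```javascript
// Histogram of node degrees: a Map from degree to the number of nodes
// with that degree. A pure ring lattice produces a single spike;
// rewiring (and Zipf-biased rewiring especially) spreads it out.
function degreeHistogram(edges) {
    const hist = new Map();
    for (const neighbors of edges.values()) {
        const d = neighbors.size;
        hist.set(d, (hist.get(d) || 0) + 1);
    }
    return hist;
}

// A 4-node ring lattice: every node wired to its two ring neighbors.
const ring = new Map([
    [0, new Set([1, 3])], [1, new Set([0, 2])],
    [2, new Set([1, 3])], [3, new Set([2, 0])],
]);
console.log(degreeHistogram(ring)); // a single spike: all four nodes have degree 2
```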

&lt;p&gt;In the procedure for creating these Watts-Strogatz graphs, the targets of the &lt;em&gt;rewirings&lt;/em&gt; from the ring lattice are chosen uniformly. We can make the degree distribution more extreme by choosing non-uniformly, such as with a Zipf distribution (even though Zipf &lt;a href=&quot;https://brooker.co.za/blog/2023/02/07/hot-keys.html&quot;&gt;seems to be a poor match for real-world distributions in many cases&lt;/a&gt;). This lets us create a Watts-Strogatz-Zipf model.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/wsz_wsz.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice how we have introduced a hot key (near the bottom right). Even if we start our random walk uniformly, we’re quite likely to end up there. This kind of internal hot key is fairly common in relational transactional workloads (for example, secondary indexes with low cardinality, or dense auto-increment keys).&lt;/p&gt;
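One reason hot keys emerge naturally from this scheme: on an undirected graph, the stationary distribution of a simple random walk is proportional to node degree, so a well-connected node absorbs a correspondingly large share of generated transactions. A quick empirical sketch (the graph here is a hypothetical example, not from the post):

```javascript
// Empirical check that a long random walk visits each node roughly in
// proportion to its degree. "hub" holds 4 of the 10 edge-endpoints, so
// it should soak up about 40% of the visits.
function walkCounts(adjacency, steps) {
    const nodes = Object.keys(adjacency);
    const counts = {};
    nodes.forEach(n => { counts[n] = 0; });
    let current = nodes[0];
    for (let i = 0; i < steps; i++) {
        const neighbors = adjacency[current];
        current = neighbors[Math.floor(Math.random() * neighbors.length)];
        counts[current] += 1;
    }
    return counts;
}

// Degrees: hub 4, a 2, b 2, c 1, d 1 (total 10 edge-endpoints).
const adj = {
    hub: ["a", "b", "c", "d"],
    a: ["hub", "b"], b: ["hub", "a"],
    c: ["hub"], d: ["hub"],
};
const counts = walkCounts(adj, 20000);
console.log(counts.hub / 20000); // ≈ 0.4
```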

&lt;p&gt;This approach to generating benchmark loads has turned out to be very useful. I like how flexible it is, how we can generate workloads with nearly any characteristics, and how well it maps to other graph-based ways of thinking about databases. I don’t love how the relationship between the parameters and the output characteristics is non-linear in a potentially surprising way. Overall, this post and talk were just scratching the surface of a deep topic, and there’s a lot more we could talk about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Play With the Watts-Strogatz-Zipf Model&lt;/strong&gt;&lt;a name=&quot;sim&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;!-- Generated by GPT-4 with the prompt: &quot;write an html5/js file that does the following:

large square canvas
draw a 20 node graph, follows the &quot;small world networks&quot; model
add a slider that allows the user to change the value of the p parameter&quot;

then a bit of manual tweaking, and another prompt to add the Zipf distribution.

 --&gt;

&lt;div&gt;
&lt;canvas id=&quot;graphCanvas&quot; width=&quot;600&quot; height=&quot;600&quot;&gt;&lt;/canvas&gt;&lt;br /&gt;
$p$ parameter: &lt;input type=&quot;range&quot; id=&quot;pSlider&quot; min=&quot;0&quot; max=&quot;0.6&quot; step=&quot;0.01&quot; value=&quot;0&quot; /&gt;&lt;br /&gt;
degree: &lt;input type=&quot;range&quot; id=&quot;degSlider&quot; min=&quot;1&quot; max=&quot;5&quot; step=&quot;1&quot; value=&quot;1&quot; /&gt;&lt;br /&gt;
Zipf exponent: &lt;input type=&quot;range&quot; id=&quot;zipfSlider&quot; min=&quot;0.1&quot; max=&quot;2.0&quot; step=&quot;0.01&quot; value=&quot;0.1&quot; /&gt;&lt;br /&gt;

&lt;script&gt;
const canvas = document.getElementById(&apos;graphCanvas&apos;);
const ctx = canvas.getContext(&apos;2d&apos;);
const slider = document.getElementById(&apos;pSlider&apos;);
const degSlider = document.getElementById(&apos;degSlider&apos;);
const zipfSlider = document.getElementById(&apos;zipfSlider&apos;);
const nodeCount = 20;
const radius = 250; // Radius for nodes layout in a circle
const centerX = canvas.width / 2;
const centerY = canvas.height / 2;

// An inefficient way to generate Zipf-distributed numbers: the CDF is rebuilt
//  on every call and then sampled with a linear scan, so each sample is O(N).
//  That's fine at this scale.
function generateZipf(s, N) {
    // Calculate Zipfian constants for normalization
    let c = 0;
    for (let i = 1; i &lt;= N; i++) {
        c += 1.0 / (i ** s);
    }
    c = 1 / c;

    // Generate CDF (cumulative distribution function)
    let cdf = [0]; // CDF starts with 0
    for (let i = 1; i &lt;= N; i++) {
        cdf[i] = cdf[i - 1] + c / (i ** s);
    }

    // Use random number to find corresponding value
    const random = Math.random();
    for (let i = 1; i &lt;= N; i++) {
        if (random &lt;= cdf[i]) {
            return i - 1; // map to the 0 to N-1 range used for node indices
        }
    }
    return N - 1; // In case of rounding errors, return the last element
}

function generateGraph(p, degree, z_exp) {
    let nodes = [];
    let edges = new Map();

    // Initialize nodes and place them in a circle
    for (let i = 0; i &lt; nodeCount; i++) {
        let angle = (i / nodeCount) * 2 * Math.PI;
        nodes.push({
            x: centerX + radius * Math.cos(angle),
            y: centerY + radius * Math.sin(angle),
        });
    }

    // Create a ring lattice with k neighbors
    let k = degree; // Connect each node to its k nearest neighbors on each side (total degree 2k)
    for (let i = 0; i &lt; nodeCount; i++) {
        for (let j = 1; j &lt;= k; j++) {
            let neighbor = (i + j) % nodeCount;
            if (!edges.has(i)) edges.set(i, new Set());
            if (!edges.has(neighbor)) edges.set(neighbor, new Set());
            edges.get(i).add(neighbor);
            edges.get(neighbor).add(i);
        }
    }

    // Rewire edges with probability p. Iterate over a snapshot of each
    //  neighbor set, because rewiring mutates the sets as we go.
    edges.forEach((value, key) =&gt; {
        Array.from(value).forEach(neighbor =&gt; {
            if (Math.random() &lt; p) {
                let oldNeighbor = neighbor;
                let newNeighbor;
                do {
                    newNeighbor = generateZipf(z_exp, nodeCount);
                } while (newNeighbor === key || edges.get(key).has(newNeighbor));
                // Remove the old neighbor
                edges.get(key).delete(oldNeighbor);
                edges.get(oldNeighbor).delete(key);
                // Wire to the new neighbor
                edges.get(key).add(newNeighbor);
                edges.get(newNeighbor).add(key);

            }
        });
    });

    return { nodes, edges };
}

function drawGraph(graph) {
    ctx.clearRect(0, 0, canvas.width, canvas.height); // Clear the canvas

    // Draw edges
    graph.edges.forEach((value, key) =&gt; {
        value.forEach(neighbor =&gt; {
            ctx.beginPath();
            ctx.moveTo(graph.nodes[key].x, graph.nodes[key].y);
            ctx.lineTo(graph.nodes[neighbor].x, graph.nodes[neighbor].y);
            ctx.stroke();
        });
    });

    // Draw nodes
    graph.nodes.forEach(node =&gt; {
        ctx.beginPath();
        ctx.arc(node.x, node.y, 5, 0, 2 * Math.PI);
        ctx.fill();
    });
}

function updateGraph() {
    const p = parseFloat(slider.value);
    const degree = parseInt(degSlider.value);
    const z_exp = parseFloat(zipfSlider.value);
    const graph = generateGraph(p, degree, z_exp);
    drawGraph(graph);
}

slider.addEventListener(&apos;input&apos;, updateGraph);
degSlider.addEventListener(&apos;input&apos;, updateGraph);
zipfSlider.addEventListener(&apos;input&apos;, updateGraph);

// Initial drawing
updateGraph();
&lt;/script&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; There’s an excellent discussion of more problems with traces in Traeger et al’s &lt;a href=&quot;https://www.fsl.cs.sunysb.edu/docs/fsbench/fsbench-tr.html#sec:traces&quot;&gt;A Nine Year Study of File System and Storage Benchmarking&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; I’ve drawn them here as orthogonal, which they aren’t in reality. Let’s hand-wave our way past that.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; This general way of thinking dates back to at least 1992’s &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/130283.130308&quot;&gt;On the performance of object clustering techniques&lt;/a&gt; by Tsangaris et al (this paper’s &lt;em&gt;Expansion Factor&lt;/em&gt;, from section 2.1, is a nice way of thinking about distributed databases scalability in general). Thanks to Joe Hellerstein for pointing this paper out to me. More recently, papers like &lt;a href=&quot;https://dl.acm.org/doi/10.14778/1920841.1920853&quot;&gt;Schism&lt;/a&gt; and &lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/3471485.3471490&quot;&gt;Chiller&lt;/a&gt; have made use of it.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; There’s a lot to be said about the relationship between the shape of graphs and the properties of random walks over those graphs. Most of it would need to be said by somebody more competent in this area of mathematics than I am.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; The degree distribution of these small-world networks is a whole deep topic of its own. Roughly, there’s a big spike at the degree of the original ring lattice, and the distribution decays exponentially away from that (with the exponent related to $p$).&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; Google Scholar lists nearly 54000 citations for this paper, so it’s not exactly obscure.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>How Do You Spend Your Time?</title>
      <link>http://brooker.co.za/blog/2024/02/06/time.html</link>
      <pubDate>Tue, 06 Feb 2024 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2024/02/06/time</guid>
      <description>&lt;h1 id=&quot;how-do-you-spend-your-time&quot;&gt;How Do You Spend Your Time?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Career advice, or something like it.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;em&gt;Some people who ask me for advice at work get very long responses. Sometimes, those responses aren’t specific to my particular workplace, and so I share them here. In the past, I’ve written about &lt;a href=&quot;https://brooker.co.za/blog/2022/11/08/writing.html&quot;&gt;writing&lt;/a&gt;, &lt;a href=&quot;https://brooker.co.za/blog/2023/09/21/audience.html&quot;&gt;writing for an audience&lt;/a&gt;, &lt;a href=&quot;https://brooker.co.za/blog/2022/12/15/thumb.html&quot;&gt;heuristics&lt;/a&gt;, and &lt;a href=&quot;https://brooker.co.za/blog/2020/10/19/big-changes.html&quot;&gt;getting big things done&lt;/a&gt;. This is another of those emails.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we spoke, you mentioned that you weren’t happy with the things you were getting done. You thought you were productive, and getting a lot done, but they weren’t the things you (or your manager) thought were most valuable for your project and team. You’re busy, you’re productive, but it doesn’t feel right. It’s a problem I’ve faced before, which I think I’ve mostly solved for myself. Here’s some thoughts on what worked for me.&lt;/p&gt;

&lt;p&gt;I set myself a time budget. Five or six &lt;em&gt;themes&lt;/em&gt;, and an explicit goal for how I should divide my time between these themes. Then, over the long-term (weeks and months), I hold myself accountable to approximately spending my time in the way that I planned. I also, twice a year or so, calibrate with my manager that they agree with this time budget, and adjust accordingly.&lt;/p&gt;

&lt;p&gt;The exercise of setting the budget itself is the most valuable thing. It requires me to really think about what I need to do to make my teams and projects successful, the way I balance between the short- and long-term, and what I want to get done over time. This seems to be the most common mistake I hear from folks: they’re not happy with how they spend their time, but haven’t thought at all about what success looks like, about what would make them happy. Even if you don’t act on the results at all, I recommend you spend some time thinking about your &lt;em&gt;themes&lt;/em&gt; and how you divide your time between them, then talk to your manager about it in your next one-on-one.&lt;/p&gt;

&lt;p&gt;The next step is to follow your budget. Here, I think it’s useful to be flexible in the short term (stuff comes up, the cycle of your business and projects demand different things, people need help, etc), but rather stubborn over the long term. If you aren’t following your plan over the long term, why not?&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; Is it the wrong plan? If it’s the wrong plan, change the plan. If it’s the right plan, deeply understand why you can’t stick to it.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I always remember the observation of a very successful soldier who said, “Peace-time plans are of no particular value, but peace-time planning is indispensable.”&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sticking to your budget requires saying no. You’re a capable person, and a lot of people know that, so lots of folks are going to ask for your help with stuff. Sometimes, you’re going to need to guide them elsewhere. Or just say &lt;em&gt;no&lt;/em&gt; outright. That doesn’t feel good, but if you always say &lt;em&gt;yes&lt;/em&gt; to stuff that isn’t that important you can’t be surprised when you don’t get important stuff done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Caveats&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are some ways that I see folks taking this kind of thing too far. One of them is setting &lt;em&gt;important&lt;/em&gt; in the wrong way: focusing on visibility, or trend chasing, or executive face time, or whatever. I haven’t found focusing on those things valuable.&lt;/p&gt;

&lt;p&gt;Then, there’s the dirty work. The messy stuff that’s always urgent, and only sometimes important. Some folks get this wrong by always taking it on. &lt;em&gt;Why didn’t you get this important task done? Because I was on this ticket, and that customer issue, and those on-call tasks, and so on.&lt;/em&gt; It’s super easy, in an operationally-heavy business like ours, to get into nothing &lt;em&gt;but&lt;/em&gt; the details. That’s a trap. Going too far the other way is a trap too. As a leader, you need to be deeply aware of these tasks. You need to be hands-on with the most important ones. How can you expect to make successful changes to a system &lt;a href=&quot;https://brooker.co.za/blog/2019/06/17/chernobyl.html&quot;&gt;you don’t understand&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;The other failure mode is losing control of your time. I spoke to a senior SDE the other day who was in 20 hours of planning-related meetings a week, every week, across multiple teams. What was weird about that is that he didn’t think he was adding much value in these meetings, his manager didn’t think he was adding much value, and his manager’s manager didn’t either. They had all just drifted into that situation without explicitly talking about it, and nobody fixed it because everybody assumed that somebody else was doing it for a good reason. Sometimes, the only way out of these situations is to have a hard conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Themes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I do this exercise, my themes tend to look something like this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;IC (individual contributor) work&lt;/strong&gt; This includes writing code, reading code, reviewing code, debugging, testing, standing around a whiteboard talking code and design, writing design docs, reviewing design docs, and so on. The core stuff that is the practice of software engineering.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mentoring and Teaching&lt;/strong&gt; This includes ad-hoc mentoring, standing one-on-ones&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, and simply having time open on my calendar for the “do you have a few minutes to chat about my career?” conversations with folks near me. I also tend to put things like tech talks into this bucket.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Strategic Stuff&lt;/strong&gt; What are we doing next year? What do the next five years look like? Where are the industry trends going? What are the new things our customers are thinking about that seems like it could be big? What skills am I going to need? What skills are the folks in my organization going to need?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rhythm of Business&lt;/strong&gt; This is the day-to-day. The way it looks has varied a lot over my career (more &lt;em&gt;business reviews&lt;/em&gt;, less &lt;em&gt;sprint planning&lt;/em&gt;), but includes everything involved in getting hands-on with the business. This includes the technical side (operations reviews, security meetings, looking into tickets and metrics, that kind of thing), money side (business reviews, etc), and people side (talent reviews, interviewing, and so on).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Learning&lt;/strong&gt; I put aside time during my work day to learn things, including reading papers, implementing algorithms I think are potentially important, reading books, and similar activities. This often feels hard to justify, but isn’t - over time I’ve gathered a good set of success stories of business value of me spending my time this way&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Customers&lt;/strong&gt; I like talking to customers, and some of them like talking to me. Customers are the most important thing to stay connected to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These buckets aren’t perfect, and not everything quite naturally fits, but most things do. I tend to tweak them over time.&lt;/p&gt;

&lt;p&gt;The buckets aren’t important, and the edge cases are very much not important. What’s important is doing the exercise, and being explicit and thoughtful about how you spend your time. I find it valuable, anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Beyond Myself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I tend to apply this pattern of thinking across the organizations I work with. The first step is encouraging people (both ICs and managers) to be thoughtful about how they spend their time, and encouraging managers (and other leaders) to be thoughtful and explicit about how they spend their team’s time. One conversation that I find particularly useful is just zeroing in on the &lt;em&gt;IC work&lt;/em&gt; bucket. How big should that bucket be for the software engineers of each level in your team? How big is it today? Why are those numbers different? If the person manages managers, the conversation is doubly useful. Every time I have that conversation I learn something important, and often surprising. I highly recommend it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; &lt;a href=&quot;https://quoteinvestigator.com/2017/11/18/planning/&quot;&gt;Dwight Eisenhower&lt;/a&gt;. The first time I read that, it was attributed to Richard Nixon, which would be pretty funny.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Although I don’t do a lot of those. Over the years, I’ve found ad-hoc “can I grab some time on your calendar to talk about X?” mentoring much more effective, both as a mentor and a mentee, than the “let’s meet every two weeks at this time” style.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; To pick an example, a couple months ago I needed to review some work of a team building vector storage features, and didn’t know if I could ask smart questions. So I read a handful of papers, implemented some of the key algorithms and data structures (PQ, HNSW, an LSH variant) and felt like I was much more able to ask the right questions.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; There’s this idea of &lt;a href=&quot;https://en.wikipedia.org/wiki/Revealed_preference&quot;&gt;revealed preference&lt;/a&gt; in economics, which is the theory that looking at what a consumer buys can reveal their real utility function (rather than the one they say they have, or even think they have). I don’t know if it’s good economics or not, but it’s a useful lens on how you spend your time. You say you love a salad, but always end up ordering the burger.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Pat's Big Deal, and Transaction Coordination</title>
      <link>http://brooker.co.za/blog/2024/01/23/big-deal.html</link>
      <pubDate>Tue, 23 Jan 2024 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2024/01/23/big-deal</guid>
      <description>&lt;h1 id=&quot;pats-big-deal-and-transaction-coordination&quot;&gt;Pat’s Big Deal, and Transaction Coordination&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Working together towards a common goal.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;I have a lot of opinions about Pat Helland’s CIDR’24 paper &lt;a href=&quot;https://www.cidrdb.org/cidr2024/papers/p63-helland.pdf&quot;&gt;Scalable OLTP in the Cloud: What’s the BIG DEAL?&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Most importantly, I like the BIG DEAL that he proposes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scalable apps don’t concurrently update the same key.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scalable DBs don’t coordinate across disjoint TXs updating different keys.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In exchange for fulfilling their sides of this big deal&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, the application gets a database that can scale&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, and the database gets a task to do that allows it to scale. The cost of this scalability, for the application developer, is having to deal with a weird concurrency phenomenon called &lt;em&gt;write skew&lt;/em&gt;. In this post, we’ll look at &lt;em&gt;write skew&lt;/em&gt;, why not preventing it helps databases scale, and how we can alter Pat’s big deal to prevent &lt;em&gt;write skew&lt;/em&gt; and get &lt;em&gt;serializability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In particular, we’ll try to answer two big questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Is Pat’s big deal the best deal available?&lt;/li&gt;
  &lt;li&gt;Would a serializable big deal be better for application programmers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Snapshot Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The big deal experiences &lt;em&gt;write skew&lt;/em&gt; because of the database isolation level that Pat chose: snapshot isolation. Other than this particular weird behavior, the application can pretend that concurrent transactions run in a serial total order, which makes application developers’ lives easy. Nobody likes reasoning about concurrency, and reasoning about concurrency while correctly implementing business logic is pretty hard, so that’s comforting.&lt;/p&gt;

&lt;p&gt;There are a lot of ways to talk about these concurrency anomalies in the database literature. Tens, at least. The one I think is most accessible to developers is Martin Kleppmann’s &lt;a href=&quot;https://github.com/ept/hermitage/tree/master&quot;&gt;Hermitage&lt;/a&gt;, a set of minimal tests that illustrate each of the weird things that database users can experience.&lt;/p&gt;

&lt;p&gt;The test setup is super simple:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;create table test (id int primary key, value int);
insert into test (id, value) values (1, 10), (2, 20);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, we make two connections to the database, which we’ll call &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt;, and run a transaction through each connection. For each row of the table, imagine running that statement, waiting for it to complete, and then going on to the next row.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/write_skew.png&quot; alt=&quot;SQL for Martin Kleppmann&apos;s G2-item example from Hermitage&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In a &lt;em&gt;snapshot isolated&lt;/em&gt; database, both &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt; commit. In a &lt;em&gt;serializable&lt;/em&gt; database, one of them needs to fail: there’s no way to order these two transactions into a serial order that makes sense (either &lt;em&gt;A&lt;/em&gt; needs to see &lt;em&gt;B&lt;/em&gt;’s writes, or &lt;em&gt;B&lt;/em&gt; needs to see &lt;em&gt;A&lt;/em&gt;’s)&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This isn’t some strange academic edge case. For example, consider an application that reads from the &lt;em&gt;stuff in warehouse&lt;/em&gt; table, then writes to an &lt;em&gt;order&lt;/em&gt; table (without updating the warehouse, because the stuff is still there). The snapshot isolated version will sell some things too many times.&lt;/p&gt;
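&lt;p&gt;As a toy illustration of that oversell (plain Python standing in for the database, with hypothetical table and item names): both transactions read the same snapshot, see the chair still in stock, and write only to disjoint order rows, so a snapshot isolated database happily commits both.&lt;/p&gt;

```python
# Simulated tables: one chair in the warehouse, no orders yet.
warehouse = {"chair": 1}
orders = []

def buy(item, snapshot):
    """One snapshot isolated transaction: read from the snapshot taken
    at its start, then write a new order row. The two transactions'
    write sets (two different order rows) are disjoint, so a snapshot
    isolated database lets both commit."""
    if snapshot[item] > 0:     # read: the stuff is still in the warehouse
        orders.append(item)    # write: to the orders table only

snapshot = dict(warehouse)     # both transactions start at the same time
buy("chair", snapshot)
buy("chair", snapshot)
print(len(orders))             # 2: one chair, sold twice
```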

&lt;p&gt;The big deal then comes with a real trade-off, and forces the application programmer to take some care to ensure correct results (but, importantly, less so than at lower levels like &lt;em&gt;read committed&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is SI Ideal for the Big Deal?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To understand whether SI is ideal for the big deal, we need to look at another angle on isolation: what coordination the database needs to do to achieve that level&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Let’s say we have a transaction &lt;em&gt;A&lt;/em&gt;, and that transaction starts at some time $\tau^A_s$ and commits at some time $\tau^A_c$. To offer transaction &lt;em&gt;A&lt;/em&gt; snapshot isolation, we need to offer it two properties:&lt;/p&gt;

&lt;p&gt;Promise $1$: &lt;em&gt;A’s&lt;/em&gt; reads see all the effects of transactions that committed before &lt;em&gt;A&lt;/em&gt; started (i.e. before $\tau^A_s$), and none of the effects of transactions that committed after.&lt;/p&gt;

&lt;p&gt;Promise $2_{si}$: &lt;em&gt;A&lt;/em&gt; can only commit if none of the keys it &lt;em&gt;writes&lt;/em&gt; have been written by other transactions between when &lt;em&gt;A&lt;/em&gt; starts and when it commits (i.e. between $\tau^A_s$ and $\tau^A_c$).&lt;/p&gt;

&lt;p&gt;There are many ways to implement these guarantees, but the implementation decisions aren’t particularly important here. What’s important is the coordination needed. There doesn’t appear to be any inherent reason that promise 1 (the read guarantee) requires any coordination at all. For example, &lt;em&gt;A&lt;/em&gt; could be given its own read replica which is completely disconnected from the stream of updates for the duration. It’s the second promise where coordination is required: either to block the other writers (write locks), prevent others from committing, or detect their writes at the time &lt;em&gt;A&lt;/em&gt; comes to commit. All of those require coordinating with other writers, either continuously through the transaction or at commit time.&lt;/p&gt;

&lt;p&gt;Now, let’s consider what the second promise would look like if we wanted to offer &lt;em&gt;serializability&lt;/em&gt; to &lt;em&gt;A&lt;/em&gt; (and therefore prevent that write skew anomaly we talked about earlier).&lt;/p&gt;

&lt;p&gt;Promise $2_{ser}$: &lt;em&gt;A&lt;/em&gt; can only commit if none of the keys it &lt;em&gt;read&lt;/em&gt; have been written by other transactions between when &lt;em&gt;A&lt;/em&gt; starts and when it commits (i.e. between $\tau^A_s$ and $\tau^A_c$).&lt;/p&gt;

&lt;p&gt;We’ve changed one word in the definition, but entered something of a rabbit hole. The snapshot version of Promise 2 only needs to coordinate on writes, and find write-write conflicts between transactions. It only needs to keep track of keys that are written, and talk to the machines that are responsible for detecting conflicts on those keys.&lt;/p&gt;

&lt;p&gt;The serializable version, on the other hand, needs to track all the keys &lt;em&gt;A&lt;/em&gt; read (and the keys &lt;em&gt;A&lt;/em&gt;’s predicates could have read but didn’t see), and then look for writes from other transactions to those keys. This doesn’t seem that different, really, but is a practical problem because it’s very easy (and common) for applications to make those read sets very big. For example, imagine &lt;em&gt;A&lt;/em&gt; does:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT id FROM stock WHERE type = &apos;chair&apos; ORDER BY num_in_stock DESC LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the serializable version of the promise, now &lt;em&gt;every chair&lt;/em&gt; is in &lt;em&gt;A&lt;/em&gt;’s read set. &lt;em&gt;A&lt;/em&gt; will then need to conflict with any other transaction that writes to any chair row, even if it’s not the one that &lt;em&gt;A&lt;/em&gt; picked&lt;sup&gt;&lt;a href=&quot;#foot7&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. If we sold just one chair of any type during the time &lt;em&gt;A&lt;/em&gt; is running, the serializable version of &lt;em&gt;A&lt;/em&gt; couldn’t commit. What’s worse, from the Big Deal’s perspective of thinking about scalability, is that &lt;em&gt;A&lt;/em&gt; would need to coordinate with the machines that own &lt;em&gt;all&lt;/em&gt; chairs. In a distributed database, that’s a lot of coordination!&lt;/p&gt;

&lt;p&gt;In the snapshot version, &lt;em&gt;A&lt;/em&gt; would only need to coordinate with the machines that own any chairs it touched. Like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;UPDATE stock SET num_in_stock = num_in_stock - 1 WHERE id = &apos;the cool chair the customer chose&apos;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The snapshot version of &lt;em&gt;A&lt;/em&gt; would only need to coordinate with that one machine that owns that one critical chair. Changing that one word between Promise $2_{si}$ and Promise $2_{ser}$ significantly changed the required coordination patterns.&lt;/p&gt;
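&lt;p&gt;To make that one-word change concrete, here’s a minimal Python sketch of commit-time validation under the two promises (an illustration of optimistic conflict detection, not how any particular database implements it). On the G2-item history from earlier, the write sets are disjoint, so the snapshot check passes both transactions, but each transaction’s read set overlaps the other’s write set, so the serializable check must abort one.&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Tx:
    start: int      # logical start time, tau_s
    commit: int     # logical commit time, tau_c
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def overlaps(tx, other):
    """True if other committed while tx was running."""
    return tx.commit >= other.commit > tx.start

def conflicts_si(tx, committed):
    """Promise 2_si: abort tx if a concurrent transaction wrote
    any key in tx's *write* set."""
    return any(tx.writes.intersection(o.writes)
               for o in committed if overlaps(tx, o))

def conflicts_ser(tx, committed):
    """Promise 2_ser: the same check, but against tx's *read* set."""
    return any(tx.reads.intersection(o.writes)
               for o in committed if overlaps(tx, o))

# The G2-item history: A and B both read rows 1 and 2; A updates
# row 1, B updates row 2; A commits first, then B tries to.
a = Tx(start=0, commit=3, reads={1, 2}, writes={1})
b = Tx(start=0, commit=4, reads={1, 2}, writes={2})

print(conflicts_si(b, [a]))   # False: write sets are disjoint, B commits
print(conflicts_ser(b, [a]))  # True: B read row 1, which A wrote
```

&lt;p&gt;Note that the snapshot check only needs to know what the transaction wrote, while the serializable check has to know about everything it read.&lt;/p&gt;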

&lt;p&gt;But does that change the asymptotic scalability of the database? It does in this example (because of $O(\textrm{chairs})$ coordination for serializability and $O(1)$ for snapshot). But does it in general? Only if we believe that applications’ read behavior, in general, is asymptotically different from their write behavior (otherwise we’re just moving constants around&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;). Specifically, that the number of &lt;em&gt;read-write&lt;/em&gt; edges is asymptotically different from the number of &lt;em&gt;write-write&lt;/em&gt; edges in the data access graph.&lt;/p&gt;

&lt;p&gt;This is the sense in which we can say that snapshot isolation is better for Pat’s Big Deal: it assumes that applications read data in an asymptotically different way from how they write it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I Am Altering the Deal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wow, ok, that’s a lot of writing. But now I think we can propose, in the tradition of Roosevelt, a New Deal. A serializable deal. Without write skew. First, let’s remind ourselves about Pat’s Big Deal:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scalable apps don’t concurrently update the same key.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scalable DBs don’t coordinate across disjoint TXs updating different keys.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And now, our serializable New Big Deal.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scalable apps don’t update keys that are frequently read by other concurrent processes, and try not to read keys that are frequently written.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As the kids say: Oof.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scalable DBs don’t coordinate across disjoint TXs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, we’ve simplified that one, but have made the definition of &lt;em&gt;disjoint&lt;/em&gt; much more complex by defining it in terms of both reads and writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pray I Don’t Alter it Any Further&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The serializable version of the Big Deal is simpler for the application programmer from a correctness perspective. In fact, they can basically assume that concurrency doesn’t exist, which is very nice indeed. But it’s harder on the application programmer from a scalability and performance perspective, in that they have to be much more careful about the reads they do to get good scalability. It’s clear that’s not an easy win, but might be a net win in some circumstances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Also check out Murat Demirbas’s &lt;a href=&quot;http://muratbuffalo.blogspot.com/2024/01/scalable-oltp-in-cloud-whats-big-deal.html&quot;&gt;analysis of the paper&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; A big deal and a big deal. A major agreement, and very important.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; &lt;a href=&quot;https://github.com/ept/hermitage/tree/master&quot;&gt;Hermitage&lt;/a&gt; separates this phenomenon into &lt;em&gt;write skew&lt;/em&gt; (G2-item) and &lt;em&gt;anti-dependency cycles&lt;/em&gt; (G2). They’re closely related, with the latter focusing on predicate reads.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; We’re skimming over some deep water here to make a point - isolation implementation is the topic of decades of database research, and decades of attempts to formalize (e.g. &lt;a href=&quot;https://pmg.csail.mit.edu/papers/adya-phd.pdf&quot;&gt;Adya&lt;/a&gt;, &lt;a href=&quot;https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf&quot;&gt;Crooks, et al&lt;/a&gt;), implement (e.g. &lt;a href=&quot;https://www.amazon.com/Transaction-Processing-Concepts-Techniques-Management/dp/1558601902/&quot;&gt;Gray and Reuter&lt;/a&gt;, &lt;a href=&quot;https://www.eecs.harvard.edu/~htk/publication/1981-tods-kung-robinson.pdf&quot;&gt;Kung and Robinson&lt;/a&gt;), and explain isolation levels. Please forgive me some simplification.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; But let’s be clear - in the actual practical world moving constants around is super important.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; In this post, I’m using the word &lt;em&gt;scale&lt;/em&gt; (and related words like &lt;em&gt;scalable&lt;/em&gt; and &lt;em&gt;scalability&lt;/em&gt;) in the asymptotic sense Pat uses in his paper. Note that this is different from the sense that most folks use them in.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot7&quot;&gt;&lt;/a&gt; Assuming &lt;em&gt;A&lt;/em&gt; does any writes. If &lt;em&gt;A&lt;/em&gt; is read-only, it can “commit” at $\tau^A_c = \tau^A_s$ (which, because of our Promise 1, is always a valid and serializable thing to do).&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>What is Scalability Anyway?</title>
      <link>http://brooker.co.za/blog/2024/01/18/scalability.html</link>
      <pubDate>Thu, 18 Jan 2024 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2024/01/18/scalability</guid>
      <description>&lt;h1 id=&quot;what-is-scalability-anyway&quot;&gt;What is Scalability Anyway?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Do words mean things? Why?&lt;/p&gt;

&lt;p&gt;What does &lt;em&gt;scalable&lt;/em&gt; mean?&lt;/p&gt;

&lt;p&gt;As systems designers, builders, and researchers, we use that word a lot. We kind of all use it to mean the same thing, but not super consistently. Some include scaling both up and down, some just up, and some just down. Some include both scaling on a box (&lt;em&gt;vertical&lt;/em&gt;) and across boxes (&lt;em&gt;horizontal&lt;/em&gt;), some just across boxes. Some include big rack-level systems, some don’t.&lt;/p&gt;

&lt;p&gt;Here’s my definition:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A system is &lt;em&gt;scalable&lt;/em&gt; in the range where the cost of adding incremental work is &lt;em&gt;approximately constant&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I like this definition, in terms of incremental or marginal costs, because it seems to clear up a lot of the confusion by making scalability a customer/business outcome.&lt;/p&gt;

&lt;p&gt;Let’s look at some examples, starting with a single-machine system. This could be a single-box database, an application that only runs on one server, or even a client-side application. There’s an initial spike in marginal cost (when you have to buy the box, start the instance, or launch the container), then a wide range where the marginal cost of doing more work is near-zero. It’s near-zero because the money has been spent already. Finally, there’s a huge spike in costs when the load exceeds what a single box can do - often requiring a complete rethinking of the architecture of the system.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/scalability_one_box.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The place people run into trouble with these single-box systems is either overestimating or underestimating the effect of that big spike on the right. If you’re too worried about it, you can end up spending a bunch of system complexity and cost avoiding hitting something that you’ll never actually hit. On the other hand, if you do hit it and didn’t plan for it, you’re almost universally going to have a bad time. &lt;em&gt;We can’t grow the business until we rearchitect&lt;/em&gt; is a really bad place to end up.&lt;/p&gt;

&lt;p&gt;Our second example is a classic multi-machine architecture, which could be a sharded database, or a load-balanced application. As with a single box, we have an initial spike where we have to buy the first box/container/etc. Then there are periods where the marginal cost is low, with periodic spikes related to adding another fixed-size unit. Depending on the kind of application, the size of that initial spike may be the same size as the single-box case (some apps are trivial to load-balance), or it could be much higher (because you need to figure out how to shard).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/scalability_sharded.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This diagram is being very optimistic for sharded databases, essentially assuming the workload requires no cross-shard coordination. If it does, then the marginal costs once we pass a single machine are no longer constant, and there’s a significant stairstep as the need for coordination crosses more machines. &lt;a href=&quot;https://brooker.co.za/blog/2022/10/04/commitment.html&quot;&gt;I’ve written about this effect before&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our last example is a &lt;em&gt;serverless&lt;/em&gt; system, like Lambda, or S3, or DynamoDB. In these models, the marginal cost of additional work (or additional storage) is nearly constant across the entire range. Work is billed per-unit, and if there are stair steps they’re usually downwards (as with S3’s tiered pricing).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/scalability_serverless.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This linearity of marginal cost is super important, and the key customer benefit of serverless pricing models. Scalability works both &lt;em&gt;down&lt;/em&gt; (by not having an initial spike), and &lt;em&gt;up&lt;/em&gt; (by not having spikes at particular loads). The downside is that the &lt;em&gt;floor&lt;/em&gt; is a constant rather than being near-zero, which requires a fundamentally different approach to thinking about unit economics.&lt;/p&gt;
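&lt;p&gt;These three shapes can be sketched as toy cost functions (all the numbers here are made up, purely to show where the marginal cost is and isn’t approximately constant):&lt;/p&gt;

```python
def cost_single_box(units, box_capacity=100, box_cost=500.0):
    """Single box: pay once up front, then a (made-up) rearchitecture
    cliff once load exceeds what one box can do."""
    if units == 0:
        return 0.0
    if units > box_capacity:
        return box_cost + 10_000.0   # hypothetical rearchitecture spike
    return box_cost

def cost_sharded(units, shard_capacity=100, shard_cost=500.0):
    """Sharded: a stairstep each time another fixed-size shard is added."""
    shards = -(-units // shard_capacity)   # ceiling division
    return shards * shard_cost

def cost_serverless(units, unit_price=6.0):
    """Serverless: billed per unit, so total cost is a straight line."""
    return units * unit_price

def marginal(cost_fn, units):
    """The marginal cost of the next unit of work."""
    return cost_fn(units + 1) - cost_fn(units)

print(marginal(cost_serverless, 10))    # 6.0, at any load
print(marginal(cost_sharded, 100))      # 500.0 at the stairstep, 0.0 between
```

&lt;p&gt;The scalable range, under the definition above, is wherever the marginal cost stays approximately constant: the whole line for the serverless model, and only the flat stretches between spikes for the other two.&lt;/p&gt;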

&lt;p&gt;For this to work, you need to have a rather holistic view of &lt;em&gt;cost&lt;/em&gt;. For example, you could achieve the serverless cost model, looking only at price, by using an existing serverless offering or by building your own. However, with a more realistic view of cost, building your own would come with a significant initial spike. All the cost to design, build, and debug the first version really adds up. The same is true of sharded or load-balanced systems - their initial spikes tend to be larger. This is a key reason I prefer serverless as a base for systems building, where I can get it.&lt;/p&gt;

&lt;p&gt;These examples aren’t anywhere near exhaustive. The definition, and this form of graphing it out, is a useful tool that I reach for all the time when thinking about system designs.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Why Aren't We SIEVE-ing?</title>
      <link>http://brooker.co.za/blog/2023/12/15/sieve.html</link>
      <pubDate>Fri, 15 Dec 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/12/15/sieve</guid>
      <description>&lt;h1 id=&quot;why-arent-we-sieve-ing&quot;&gt;Why Aren’t We SIEVE-ing?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Captain, we are being scanned!&lt;/p&gt;

&lt;p&gt;Long-time readers of this blog will know that I have mixed feelings about caches. On one hand, caching is critical to the performance of systems at every layer, from CPUs to storage to whole distributed architectures. On the other hand, caching being this critical means that designers need to carefully consider what happens when the cache is emptied, and they don’t always do that well&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Because of how important caches are, I follow the literature in the area fairly closely. Even to a casual observer, it’s obvious that there’s one group of researchers who’ve been on a bit of a tear recently, including Juncheng Yang, Yazhuo Zhang, K. V. Rashmi, and Yao Yue in various combinations. Their recent papers include &lt;a href=&quot;https://www.usenix.org/system/files/osdi20-yang.pdf&quot;&gt;a real-world analysis of cache systems at Twitter&lt;/a&gt;, &lt;a href=&quot;https://jasony.me/publication/hotos23-qdlp.pdf&quot;&gt;an analysis of the dynamics of cache eviction&lt;/a&gt;, and &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3600006.3613147&quot;&gt;a novel FIFO-based cache design with some interesting properties&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting one to me, which I expect anybody who enjoys a good algorithm will get a kick out of, is the eviction algorithm &lt;a href=&quot;https://junchengyang.com/publication/nsdi24-SIEVE.pdf&quot;&gt;SIEVE&lt;/a&gt; (their paper is coming up at NSDI’24). SIEVE is an &lt;em&gt;eviction algorithm&lt;/em&gt;, a way of deciding which cached item to toss out when a new one needs to be put in. There are hundreds of these in the literature. At least. Classics include throwing out the least recently inserted thing (FIFO), the least recently accessed thing (LRU), the thing that’s been accessed least often (LFU), and even just a random thing. Eviction is interesting because it’s a tradeoff between accuracy, speed (how much work is needed on each eviction and each access), and metadata size. The slower the algorithm, the less latency and efficiency benefit from caching. The larger the metadata, the less space there is to store actual data.&lt;/p&gt;

&lt;p&gt;SIEVE performs well. In their words:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Moreover, SIEVE has a lower miss ratio than 9 state-of-the-art algorithms on more than 45% of the 1559 traces, while the next best algorithm only has a lower miss ratio on 15%.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What’s super interesting about SIEVE is that it’s both very effective, and an extremely simple change on top of a basic FIFO queue. Here’s Figure 1 from &lt;a href=&quot;https://junchengyang.com/publication/nsdi24-SIEVE.pdf&quot;&gt;their paper&lt;/a&gt; with the pseudocode:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/sieve_figure_1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The only other change is to set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;obj.visited&lt;/code&gt; on access. Like the classic &lt;a href=&quot;https://www.multicians.org/paging-experiment.pdf&quot;&gt;CLOCK&lt;/a&gt; (from the 1960s!), and unlike the classic implementation of LRU, SIEVE doesn’t require changing the queue order on access, which reduces the synchronization needed in a multi-tenant setting. All it needs on access is to set a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bool&lt;/code&gt;, which is a simple atomic operation on most processors. That’s something of a big deal, for an algorithm that performs so well.&lt;/p&gt;
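&lt;p&gt;Here’s a compact Python sketch of that pseudocode (an illustration only, not the authors’ implementation; a real cache would use a linked list rather than a Python list so removal from the middle is cheap):&lt;/p&gt;

```python
class SieveCache:
    """Toy SIEVE cache. queue[0] is the newest object (the head),
    queue[-1] the oldest (the tail). The hand sweeps from the tail
    toward the head, wrapping back to the tail."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []      # keys, newest first
        self.data = {}       # key -> value
        self.visited = {}    # key -> bool
        self.hand = None     # key the hand points at, if any

    def get(self, key):
        if key in self.data:
            self.visited[key] = True   # the only work done on a hit
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            self.visited[key] = True
            return
        if len(self.queue) >= self.capacity:
            self._evict()
        self.queue.insert(0, key)      # new objects go to the head
        self.data[key] = value
        self.visited[key] = False

    def _evict(self):
        # Resume from where the hand stopped, or start at the tail.
        i = self.queue.index(self.hand) if self.hand in self.data else len(self.queue) - 1
        # Skip over visited objects, clearing their bit as we pass.
        while self.visited[self.queue[i]]:
            self.visited[self.queue[i]] = False
            i = i - 1 if i > 0 else len(self.queue) - 1
        victim = self.queue.pop(i)
        del self.data[victim], self.visited[victim]
        # Leave the hand at the evicted object's neighbor toward the head.
        self.hand = self.queue[i - 1] if i > 0 else (self.queue[-1] if self.queue else None)
```

&lt;p&gt;For example, with a capacity of three, inserting &lt;em&gt;a&lt;/em&gt;, &lt;em&gt;b&lt;/em&gt;, and &lt;em&gt;c&lt;/em&gt;, reading &lt;em&gt;a&lt;/em&gt;, and then inserting &lt;em&gt;d&lt;/em&gt; evicts &lt;em&gt;b&lt;/em&gt;: the hand skips past the visited &lt;em&gt;a&lt;/em&gt; and takes the first unvisited object. The hit path never reorders the queue, which is the property that keeps synchronization cheap.&lt;/p&gt;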

&lt;h2 id=&quot;why-arent-we-all-sieve-ing&quot;&gt;Why aren’t we all SIEVE-ing?&lt;/h2&gt;

&lt;p&gt;SIEVE raises an interesting question - if it’s so effective, and so simple, and so closely related to an algorithm that’s been around forever, why has nobody discovered it already? It’s possible they have, but I haven’t seen it before, and the authors say they haven’t either. Their hypothesis is an interesting one:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In block cache workloads, which frequently feature scans, popular objects often intermingle with objects from scans. Consequently, both types of objects are rapidly evicted after insertion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We conjecture that not being scan-resistant is probably the reason why SIEVE remained undiscovered over the decades of caching research, which has been mostly focused on page and block accesses.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s believable. Scan resistance is important, and has been the focus of a lot of caching improvements over the decades&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Still, it’s hard to believe that folks kept finding this, and kept going &lt;em&gt;nah, not scan resistant&lt;/em&gt; and tossing it out. Fascinating how these things are discovered.&lt;/p&gt;

&lt;p&gt;Scan-resistance is important for block and file workloads because these workloads tend to be a mix of random access (&lt;em&gt;update that database page&lt;/em&gt;, &lt;em&gt;move that file&lt;/em&gt;) and large sequential access (&lt;em&gt;backup the whole database&lt;/em&gt;, &lt;em&gt;do that unindexed query&lt;/em&gt;). We don’t want the hot set of the cache that makes the random accesses fast evicted to make room for the sequential&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; pages that likely will never be accessed again&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;a-scan-resistant-sieve&quot;&gt;A Scan-Resistant SIEVE?&lt;/h2&gt;

&lt;p&gt;This little historical mystery raises the question of whether there are similarly simple, but more scan-resistant, approaches to cache eviction. One such algorithm, which I’ll call SIEVE-k, involves making a small change to SIEVE.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Each item is given a small counter rather than a single access bit,&lt;/li&gt;
  &lt;li&gt;On access the small counter is incremented rather than set, saturating at the value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt;,&lt;/li&gt;
  &lt;li&gt;When the eviction &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hand&lt;/code&gt; goes past, the counter is decremented (saturating at 0), rather than reset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My claim here is that the eviction counter will go up for the most popular objects, causing them to be skipped in the round of evictions kicked off by the scan. This approach has some downsides. One is that eviction goes from worst-case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;O(N)&lt;/code&gt; to worst-case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;O(kN)&lt;/code&gt;, and the average case eviction also seems to go up by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt; (although I don’t love my analysis there). The other is that this could delay eviction of things that need to be evicted. Balancing these things, the most interesting variant of SIEVE-k is probably SIEVE-2 (along with SIEVE-1, which is the same as Zhang et al’s original algorithm).&lt;/p&gt;
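&lt;p&gt;The modified hand sweep can be sketched on its own (a toy illustration of the three changes above, operating on a newest-first list of key and counter pairs, with counters incremented on access elsewhere and saturating at k):&lt;/p&gt;

```python
def sieve_k_evict(entries, hand=None):
    """One SIEVE-k eviction pass. entries: newest-first list of
    [key, counter] pairs; hand: index to resume from, or None to
    start at the tail. Decrements counters (saturating at 0) rather
    than clearing a bit, and evicts the first entry found with a
    zero counter. Returns (victim_key, new_hand_index)."""
    i = hand if hand is not None else len(entries) - 1
    while entries[i][1] > 0:
        entries[i][1] -= 1
        i = i - 1 if i > 0 else len(entries) - 1   # toward head, wrap
    victim = entries.pop(i)[0]
    new_hand = i - 1 if i > 0 else (len(entries) - 1 if entries else None)
    return victim, new_hand

# 'a' has been accessed twice (k = 2); a scan's one-hit wonders go first.
entries = [["c", 0], ["b", 0], ["a", 2]]
print(sieve_k_evict(entries))   # ('b', 0): 'a' is skipped and survives
```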

&lt;h2 id=&quot;does-it-work&quot;&gt;Does It Work?&lt;/h2&gt;

&lt;p&gt;Yeah. Sort of. First, let’s consider a really trivial case of a Zipf-distributed &lt;em&gt;base&lt;/em&gt; workload, and a periodic linear scan workload that turns on and off. In this simple setting SIEVE-2 out-performs SIEVE-1 across the board (lower miss rates are better).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/sieve_k_results.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Clearly, with the 16MiB cache size here, SIEVE-2 and SIEVE-3 are doing a better job than SIEVE of keeping the scan from emptying out the cache. Beyond this magic size, it performs pretty much identically to SIEVE-1.&lt;/p&gt;

&lt;p&gt;But the real-world is more complicated than that. Using the excellent open source &lt;a href=&quot;https://github.com/cacheMon/libCacheSim&quot;&gt;libCacheSim&lt;/a&gt; I tried SIEVE-2 against SIEVE on a range of real-world traces. It was worse than SIEVE across the board on web-cache style KV workloads, as expected. Performance on block workloads&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; was a real mixed bag, with some wins and some losses. So it seems like SIEVE-k is potentially interesting, but isn’t a win over SIEVE more generally.&lt;/p&gt;

&lt;p&gt;If you’d like to experiment some more, I’ve implemented SIEVE-k in &lt;a href=&quot;https://github.com/mbrooker/libCacheSim&quot;&gt;a fork of libCacheSim&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;updates&quot;&gt;Updates&lt;/h2&gt;

&lt;p&gt;&lt;a name=&quot;updates&quot;&gt;&lt;/a&gt;The inimitable Keegan Carruthers-Smith writes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I believe there is an improvement on your worst case for SIEVE-k eviction from O(kN) to O(N):
When going through the list, keep track of the minimum counter seen.  Then if you do not evict on the first pass, decrement by that minimum value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which is, indeed, correct and equivalent to what my goofy k-pass approach was doing (only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k/2&lt;/code&gt; times more efficient). He also pointed out that other optimizations are possible, but probably not that interesting for small &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And, on the fediverse, Tobin Baker pointed out something important about SIEVE compared to FIFO and CLOCK: removing items from the middle of the list (rather than the head or tail only) means that the simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;circular array&lt;/code&gt; approach doesn’t work. The upshot is needing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;O(log N)&lt;/code&gt; additional state&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; to keep a linked list. Potentially an interesting line of investigation for implementations that are very memory overhead sensitive or CPU cache locality sensitive (and scanning through entries in a random order rather than sequentially). Tobin then &lt;a href=&quot;https://fediscience.org/@tobinbaker@discuss.systems/111660149084030363&quot;&gt;pointed out an interesting potential fix&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A simple fix to the SIEVE algorithm to accommodate circular arrays would be to move the current tail entry into the evicted entry’s slot (much like CLOCK copies a new entry into the evicted entry’s slot). This is really not very different from the FIFO-reinsertion algorithm, except that its promotion method (moving promoted entries to evicted slots) preserves the SIEVE invariant of keeping new entries to the right of the “hand” and old entries to the left.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This one is interesting, and I don’t have a good intuition for how it would affect performance (or whether the analogy to FIFO-reinsertion is correct). Implementing it in libCacheSim would likely sort that out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Partially because it’s hard to do. &lt;a href=&quot;https://brooker.co.za/blog/2022/06/02/formal.html&quot;&gt;We need better tools&lt;/a&gt; for reasoning about system behavior.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Including Betty O’Neil’s &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/170036.170081&quot;&gt;The LRU-K Page Replacement Algorithm For Database Disk Buffer&lt;/a&gt;, a classic approach to scan resistance from the 90s database literature.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; It’s worth mentioning that some caches solve this by hoping that clients will let them know when data is only going to be accessed once (like with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POSIX_FADV_NOREUSE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POSIX_FADV_DONTNEED&lt;/code&gt;). This can be super effective with the right discipline, but storage systems &lt;em&gt;in general&lt;/em&gt; can’t make these kinds of assumptions (and often don’t have these kinds of interfaces at all).&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; I say &lt;em&gt;sequential&lt;/em&gt; here, but it’s really not sequential access that matters. What matters is that scans tend to happen at a high rate, and that they introduce a lot of &lt;em&gt;one hit wonders&lt;/em&gt; (pages that are read once and never again, and therefore are not worth caching). Neither of those criteria need sequential access, but it happens to be true that they come along most often during sequential accesses.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Block traces are interesting, because they tend to represent a kind of residue of accesses after the &lt;em&gt;easy&lt;/em&gt; caching has already been done (by the database engine or OS page cache), and so represent a pretty tough case for cache algorithms in general.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; Which can be halved by &lt;a href=&quot;https://en.wikipedia.org/wiki/XOR_linked_list&quot;&gt;committing unspeakable evil&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>It's About Time!</title>
      <link>http://brooker.co.za/blog/2023/11/27/about-time.html</link>
      <pubDate>Mon, 27 Nov 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/11/27/about-time</guid>
      <description>&lt;h1 id=&quot;its-about-time&quot;&gt;It’s About Time!&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;What&apos;s the time? Time to get a watch.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;My friend Al Vermeulen used to say &lt;em&gt;time is for the amusement of humans&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Al’s sentiment is still the common one among distributed systems builders: real wall-clock physical time is great for human consumption (like log timestamps and UI presentation), but shouldn’t be relied on by computers for things that actually affect the operation of the system. This remains a solid starting point, the right default position, but the picture has always been more subtle. Recently, the availability of ever-better time synchronization has made it even more subtle. This post will attempt to unravel some of that subtlety.&lt;/p&gt;

&lt;p&gt;Today is a good day to talk about time, because last week &lt;a href=&quot;https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/&quot;&gt;AWS announced&lt;/a&gt; (&lt;a href=&quot;https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/&quot;&gt;more details here&lt;/a&gt;) microsecond-accurate time synchronization in EC2, improving on what was &lt;a href=&quot;https://aws.amazon.com/blogs/mt/manage-amazon-ec2-instance-clock-accuracy-using-amazon-time-sync-service-and-amazon-cloudwatch-part-1/&quot;&gt;already very good&lt;/a&gt;. All this means is that if you have an EC2 instance&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; you can expect its clock to be accurate to within microseconds of the &lt;em&gt;physical time&lt;/em&gt;. It turns out that having microsecond-level time accuracy makes some &lt;em&gt;distributed systems stuff&lt;/em&gt; much easier than it was in the past.&lt;/p&gt;

&lt;p&gt;In hopes of understanding the controversy over using real time in systems, let’s descend level-by-level into how we might entangle physical time more deeply in our system designs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 0: Observability, and the Amusement of Humans&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;… reality, the name we give to the common experience&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we try to understand how a system works, or why it’s not working, the first task is to establish &lt;em&gt;causality&lt;/em&gt;. Thing A caused thing B. Here in our weird little universe, we need thing A to have &lt;em&gt;happened before&lt;/em&gt; thing B for A to have caused B. Time is useful for this.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prosecutor&lt;/em&gt;: Why, Mr Load Balancer, did you stop sending traffic to Mrs Server?&lt;br /&gt;
&lt;em&gt;Mr LB&lt;/em&gt;: Simply, sir, because she stopped processing my traffic!&lt;br /&gt;
&lt;em&gt;Mrs Server, from the gallery&lt;/em&gt;: Liar! Liar! I only stopped processing because you stopped sending!&lt;/p&gt;

&lt;p&gt;If we can’t trust the order of our logs (or other events), finding causality is difficult. If our logs are accurately timestamped, the task becomes much easier. If we can expect our logs to be timestamped so accurately that A having a lower timestamp than B implies that A happened before B, then our ordering task becomes trivial.&lt;/p&gt;

&lt;p&gt;We’ll get back to talking about clock error later, but for now the important point is that sufficiently accurate clocks make observing systems significantly easier, because they make establishing causality significantly easier. This is a big deal. If we get nothing else out of good clocks, just the observability benefits are probably worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: A Little Smarter about Wasted Work&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;He’s worth no more;&lt;br /&gt;
They say he parted well, and paid his score,&lt;br /&gt;
And so, God be with him! Here comes newer comfort.&lt;sup&gt;&lt;a href=&quot;#foot7&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Have you ever worked on something, then once you got it done you were told it wasn’t needed anymore? Distributed systems feel like that all the time. Clients give us work, then time out, or wander off, and the work still gets done. One solution to this problem is to give each piece of work a Time To Live (TTL), where each item of work is marked with an expiry time. “If you’re still working on this after twelve thirty, don’t bother finishing it because I won’t be waiting anymore”. TTLs have traditionally been implemented using relative time (&lt;em&gt;in 10 seconds&lt;/em&gt;, or in steps as with &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc791&quot;&gt;IP&lt;/a&gt;) rather than absolute time (&lt;em&gt;until 09:54:10 UTC&lt;/em&gt;) because comparing absolute times across machines is risky. The downside of the relative approach is that everybody needs to measure the time taken and remember to decrease the TTL, which adds complexity. High quality clocks fix the drift problem, and allow us to use absolute time TTLs.&lt;/p&gt;
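As a sketch, an absolute-time TTL takes only a few lines of Python. The names here (`make_work_item`, `expires_at`) are illustrative rather than from any particular system, and the sketch assumes the machines comparing timestamps have well-synchronized clocks:

```python
import time

def make_work_item(payload, ttl_seconds):
    # Tag the work with an absolute expiry time. Any machine that
    # handles the item later just compares against its own clock;
    # nobody has to measure elapsed time and decrement a counter.
    return {"payload": payload, "expires_at": time.time() + ttl_seconds}

def process(item, worker):
    # Skip expired work: the client has stopped waiting for the result.
    if time.time() >= item["expires_at"]:
        return None
    return worker(item["payload"])
```

The relative-TTL version of `process` would instead need every hop to subtract its own processing and queueing time from the remaining TTL before passing the item along.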

&lt;p&gt;Cache TTLs can also be based on absolute time, and the ability to accurately compare absolute time across machines allows caches to more easily implement patterns like &lt;em&gt;bounded staleness&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here on Level 1, clock quality matters more than on Level 0, because the operational properties of the system (and therefore its availability and cost) depend on clock correctness. So we’re starting to step away from the amusement of humans to make assumptions about clocks that actually affect the client-observable running of the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Rates and Leases&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Gambling’s wrong and so is cheating, so is forging phony I.O.U.s.&lt;br /&gt;
Let’s let Lady Luck decide what type of torture’s justified,&lt;br /&gt;
I’m pit boss here on level two!&lt;sup&gt;&lt;a href=&quot;#foot8&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.acm.org/doi/10.1145/74851.74870&quot;&gt;Leases&lt;/a&gt; are a nearly ubiquitous, go-to, mutual exclusion mechanism in distributed systems. The core idea is simple: have a client &lt;em&gt;lease&lt;/em&gt; the right to exclude other clients for a period of time, and allow them to periodically renew their lease to keep excluding others. Leases, unlike more naive locks, allow the system to recover if a client fails while holding onto exclusivity: the lease isn’t renewed, it times out, and other clients are allowed to play. It’s this fault tolerance property that makes leases so popular.&lt;/p&gt;

&lt;p&gt;Did you notice those words &lt;em&gt;a period of time&lt;/em&gt;? Leases make a very specific assumption: that the lease provider’s clock moves at about the same speed as the lease holder’s clock. They don’t have to have the same absolute value, but they do need to mostly agree on how long a second is. If the lease holder’s clock is running fast, that’s mostly OK because they’ll just renew too often. If the lease provider’s clock is moving fast, they might allow another client to take the lease while the first one still thinks they’re holding it. That’s less OK.&lt;/p&gt;

&lt;p&gt;Robust lease implementations fix this problem with a &lt;em&gt;safety time&lt;/em&gt; ($\Delta_{safety}$). Instead of allowing the lease provider to immediately give the lease to somebody else when it expires ($T \langle expiry \rangle$), they need to wait an extra amount of time (until $T \langle expiry \rangle + \Delta_{safety}$) before handing out the lease to somebody else, while the lease holder tries to ensure that they renew comfortably before $T \langle expiry \rangle$.&lt;/p&gt;

&lt;p&gt;Robust lease implementations also need to ensure that lease holders don’t keep assuming they hold the lease beyond $T \langle expiry \rangle$. This sounds trivial, but in a world of pauses from GC and IO and multithreading and whatnot it’s harder than it looks. Being able to reason about the expiry time with absolute time may make this simpler.&lt;/p&gt;
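A minimal sketch of the provider side of such a lease, in Python, with `SAFETY` standing in for $\Delta_{safety}$. The names and numbers are illustrative, and the sketch leans on `time.time()` being roughly right, which is exactly the clock-rate assumption under discussion:

```python
import time

SAFETY = 2.0  # stand-in for the safety time, in seconds

class LeaseProvider:
    def __init__(self):
        self.holder = None
        self.expiry = 0.0

    def acquire(self, client, duration):
        now = time.time()
        # Only re-grant once the expiry plus the safety time has passed.
        if self.holder is not None and now < self.expiry + SAFETY:
            return False
        self.holder, self.expiry = client, now + duration
        return True

    def renew(self, client, duration):
        # Only the current holder may extend, and only before expiry.
        now = time.time()
        if self.holder == client and now < self.expiry:
            self.expiry = now + duration
            return True
        return False
```

The holder, symmetrically, must stop treating the lease as held strictly before its own view of the expiry time, despite GC and scheduling pauses.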

&lt;p&gt;Whatever the implementation, leases fundamentally make assumptions about clock rate. Historically, clock rates have been more reliable than clock absolute values, but still aren’t entirely foolproof. Better clocks make leases more reliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Getting Real about Time&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I am the very model of a modern Major-General,&lt;br /&gt;
I’ve information vegetable, animal, and mineral,&lt;br /&gt;
I know the kings of England, and I quote the fights historical&lt;br /&gt;
From Marathon to Waterloo, in order categorical;&lt;sup&gt;&lt;a href=&quot;#foot9&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When a client asks a database for &lt;em&gt;consistent&lt;/em&gt; data, they’re typically asking something very specific: make sure the answer reflects all the facts that were known &lt;em&gt;before I started this request&lt;/em&gt; (or, even more specifically, as of some point between when this request was started and when it completed). They might also be asking for an &lt;em&gt;isolated snapshot&lt;/em&gt; of the facts, but they can’t ask for facts that haven’t come along yet. Just the facts so far, please.&lt;/p&gt;

&lt;p&gt;In other words, they’re asking the database to pick a time $T \langle now \rangle$ such that $T \langle request start \rangle \leq T \langle now \rangle \leq T \langle request end \rangle$ and all facts that were committed before $T \langle now \rangle$ are visible. They might also be asking that facts committed after $T \langle now \rangle$ are not visible, but that’s more a matter of isolation than of consistency.&lt;/p&gt;

&lt;p&gt;In a single-system database, this is trivial. In a sharded database, the isolation part is a little tricky but the per-key consistency part is easy. Replication, when we have multiple copies of any individual fact in the database, is when things get tricky. What we want is for a client to be able to go to any replica independently, and not require any coordination between replicas when these reads occur, because this allows us to scale reads horizontally.&lt;/p&gt;

&lt;p&gt;There are many, many, variants on solutions to this problem. High-quality absolute time gives us a rather simple one: the client picks its $T \langle request start \rangle$, then goes to a replica and says “wait until you’re sure you’ve seen all the writes before $T \langle request start \rangle$, then do this read for me”. This complicates writes somewhat (writes need to be totally ordered in an order consistent with physical time), but makes consistent reads easy.&lt;/p&gt;
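A toy version of that read-side barrier might look like the following Python sketch; `applied_up_to` is an assumed watermark of the latest write time the replica has applied, not any real system’s API:

```python
import threading

class Replica:
    def __init__(self):
        self.applied_up_to = 0.0  # latest write time applied here
        self.data = {}
        self.cv = threading.Condition()

    def apply(self, key, value, write_ts):
        # Writes arrive totally ordered, consistent with physical time.
        with self.cv:
            self.data[key] = value
            self.applied_up_to = max(self.applied_up_to, write_ts)
            self.cv.notify_all()

    def read(self, key, read_ts, timeout=5.0):
        # Block until every write at or before read_ts is visible here.
        with self.cv:
            self.cv.wait_for(lambda: self.applied_up_to >= read_ts, timeout)
            return self.data.get(key)
```

Any replica can answer any read this way, with no cross-replica coordination on the read path.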

&lt;p&gt;We’re starting to form a picture of a tradeoff now. Relying on physical time allows distributed systems to avoid coordination in some cases where it would have otherwise been necessary. However, if that time is wrong, the result will also likely be wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Consistent Snapshots&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Life is not about significant details, illuminated in a flash, fixed forever.&lt;sup&gt;&lt;a href=&quot;#foot10&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just like we can use absolute time to get consistent reads, we can use absolute time to take consistent snapshots. Classic algorithms like &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/distributed-snapshots-determining-global-states-distributed-system/&quot;&gt;Chandy-Lamport&lt;/a&gt; have to deal with the fact that distributed systems can’t easily tell everybody to do something at the same time (e.g. “write down everything you know and send it to me”). With high-quality absolute time we can. “At 12:00:00 exactly, write down everything you know and send it to me”. With a perfect clock, this is trivial.&lt;/p&gt;

&lt;p&gt;Even excellent clocks, however, aren’t perfect. Even with only tens of microseconds of time error, things can change during the uncertainty interval and make the snapshot inconsistent. This is where having a bound on clock error (such as what you can get with &lt;a href=&quot;https://github.com/aws/clock-bound&quot;&gt;clock-bound&lt;/a&gt;) becomes useful: it provides a bounded window of time when a snapshot can be captured along with a window of changes that are relatively easy to fix with a full view of the system. The smaller the window, the less post-repair work is needed.&lt;/p&gt;
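One way to picture the repair idea: snapshot everything outside the uncertainty window, and hand back the small window of changes for the coordinator to reconcile. `ToyReplica` and `snapshot_at` below are illustrative names, not any real system’s interface:

```python
class ToyReplica:
    def __init__(self, log):
        self.log = sorted(log)  # entries are (timestamp, key, value)

    def state_before(self, t):
        state = {}
        for ts, key, value in self.log:
            if ts < t:
                state[key] = value
        return state

    def changes_between(self, lo, hi):
        return [e for e in self.log if lo <= e[0] <= hi]

def snapshot_at(replicas, t, epsilon):
    # Each replica reports its state strictly before t - epsilon, plus
    # the changes inside [t - epsilon, t + epsilon]. The smaller epsilon
    # is, the less post-repair work the coordinator has to do.
    states, repairs = {}, []
    for name, replica in replicas.items():
        states[name] = replica.state_before(t - epsilon)
        repairs.extend(replica.changes_between(t - epsilon, t + epsilon))
    return states, sorted(repairs)
```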

&lt;p&gt;&lt;strong&gt;Level 5: Ordering Updates&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Effective leadership is putting first things first.&lt;sup&gt;&lt;a href=&quot;#foot11&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Last Writer Wins (LWW) is a very popular, and effective, way to avoid coordination in a multi-writer distributed database. It provides a simple rule for dealing with conflicts: the one with the higher timestamp overwrites the one with the lower timestamp. LWW has two big advantages. First, it doesn’t require coordination, and therefore allows for low latency, high availability, and high scalability. The second is that it’s really super simple. CRDTs (and other generalizations of monotonicity) have the same first advantage, but not typically the second&lt;sup&gt;&lt;a href=&quot;#foot14&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;. They are seldom &lt;em&gt;super simple&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;LWW also has two disadvantages. First, the semantics of “clobber this write with that one” aren’t great, making it difficult to make internally consistent changes to complex databases (ACID’s &lt;em&gt;C&lt;/em&gt;) or data structures. Second, the definition of &lt;em&gt;last&lt;/em&gt; may not always match what the clients expect. In fact, they may do write &lt;em&gt;A&lt;/em&gt; then write &lt;em&gt;B&lt;/em&gt; and see &lt;em&gt;A&lt;/em&gt; take precedence over &lt;em&gt;B&lt;/em&gt; just because it landed on a server with a slightly faster clock. High quality clocks help us solve this second problem. For example, if the clock error is less than the client round-trip time, then the client can never observe this kind of anomaly. They can still happen, but the client can never prove they happened.&lt;/p&gt;
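The merge rule itself really is super simple. A sketch, representing each version as a `(timestamp, node_id, value)` tuple; breaking exact timestamp ties on the node id is a common convention assumed here so that all replicas converge on the same winner:

```python
def lww_merge(a, b):
    # Higher timestamp wins; on a tie, the higher node id wins, so
    # every replica picks the same winner regardless of arrival order.
    return max(a, b)
```

The disadvantage discussed above is visible right here: `max` happily clobbers a logically newer write whose server merely had a slower clock.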

&lt;p&gt;Using physical clocks to order writes is, for good reasons, controversial. In fact, most experienced distributed system builders would consider it a sin. But high quality clocks allow us to avoid one of the major downsides of LWW, and make its attractive properties even more attractive in the right applications. However, it’s important to note that many of the commonly-cited downsides of using physical clocks to order writes don’t have much to do with clocks at all, and instead have to do with coordination avoidance (especially accepting an unbounded amount of change on both sides of a partition). Great clocks don’t fix those problems, because they aren’t fundamentally caused by time. Kyle Kingsbury’s &lt;a href=&quot;https://aphyr.com/posts/285-call-me-maybe-riak&quot;&gt;work on Riak data loss&lt;/a&gt; from a decade ago is a perfect illustration of the problem (and a problem that dates back to Riak’s roots in &lt;a href=&quot;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;If you’re thinking about ordering writes or doing consistent snapshots using physical time, it’s worth checking out hybrid approaches (like &lt;a href=&quot;http://muratbuffalo.blogspot.com/2014/07/hybrid-logical-clocks.html&quot;&gt;Hybrid Logical Clocks&lt;/a&gt; or &lt;a href=&quot;https://people.csail.mit.edu/devadas/pubs/tardis.pdf&quot;&gt;physiological time order&lt;/a&gt;) that offer properties that degrade more gracefully in the face of time error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Things Go Wrong&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;They’re funny things, Accidents. You never have them till you’re having them.&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So far, I’ve been talking about time as though programs can know what the current time is. This is obviously impossible.&lt;/p&gt;

&lt;p&gt;First, even assuming access to a perfect clock, they can only know what the current time &lt;em&gt;was&lt;/em&gt;. The moment we execute the next instruction, that time is outdated. Variable CPU clocks, cache misses, OS schedulers, runtime schedulers, GC pauses, bus contention, interrupts, and all sorts of other things conspire against us to make it difficult to know how long ago &lt;em&gt;was&lt;/em&gt; was. The best we can generally do on general-purpose computers is to use any measure of time as a sort of lower bound of the current time&lt;sup&gt;&lt;a href=&quot;#foot12&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;But clocks aren’t perfect. Every oscillator has some amount of jitter and some amount of drift (or, rather, a complex spectrum of error). We can correct much, but not all, of this error. Thus our current time might also be a time from the future, even by the time we get to use it. In EC2 this error is very low, but it still exists.&lt;/p&gt;

&lt;p&gt;To avoid getting too confused, and riffing off Lamport&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, we can establish some notation. Let’s say $T \langle A \rangle$ is the time that event $A$ happens. But $T \langle A \rangle$ is a secret to us: instead we can only know that it lies somewhere between $T \langle A \rangle_{low}$ and $T \langle A \rangle_{high}$ (the open source project &lt;a href=&quot;https://github.com/aws/clock-bound&quot;&gt;clockbound&lt;/a&gt; provides just this API). Alternatively, we can say that we can know $T \langle A \rangle + \epsilon$ where $\epsilon$ is chosen from some asymmetrical error distribution. Improving clock quality is both about driving $E[\epsilon]$ to zero, and about putting tight bounds on the range of $\epsilon$.&lt;/p&gt;

&lt;p&gt;If our bounds are accurate enough we can say that $T \langle A \rangle_{high} &amp;lt; T \langle B \rangle_{low}$ implies that $A$ &lt;em&gt;happens before&lt;/em&gt; $B$. We can write this as $A \rightarrow B$. The full statement is then $T \langle A \rangle_{high} &amp;lt; T \langle B \rangle_{low} \Rightarrow A \rightarrow B$ &lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, and $A \rightarrow B \Rightarrow T \langle A \rangle_{low} &amp;lt; T \langle B \rangle_{high}$ (here, we’re using $\Rightarrow$ to mean &lt;em&gt;implies&lt;/em&gt;).&lt;/p&gt;
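An interval API like clockbound’s naturally yields a three-way comparison. The sketch below is illustrative (not clockbound’s actual API): each event is a `(low, high)` pair bracketing its true time, and overlapping intervals are honestly reported as uncertain:

```python
def compare(a, b):
    # a and b are (low, high) bounds bracketing each event's true time.
    a_low, a_high = a
    b_low, b_high = b
    if a_high < b_low:
        return "before"    # a's interval lies entirely earlier
    if b_high < a_low:
        return "after"
    return "uncertain"     # intervals overlap: order is unknowable
```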

&lt;p&gt;There’s something qualitative and important that happens when the error on $T \langle A \rangle$ (aka $\epsilon$) is smaller than the amount of time it would take event $A$ to cause anything to happen (e.g. smaller than one network latency): that means that we can be sure that events that are timestamped before $T \langle A \rangle$ &lt;em&gt;cannot have been caused by A&lt;/em&gt;. This is a rather magical property.&lt;/p&gt;

&lt;p&gt;I’m suspicious of any distributed system design that uses time without talking about the range of errors on the time estimate (i.e. any design that assumes $\epsilon = 0$ or even $\epsilon \geq 0$).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paradiso&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;But already my desire and my will&lt;br /&gt;
were being turned like a wheel, all at one speed&lt;sup&gt;&lt;a href=&quot;#foot13&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’re still with me, brave and intrepid to have made it this far, I’d like to offer a tool for thinking about how to use physical time in your distributed systems: start by thinking about what can go wrong.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What if the clock I use for my log timestamps is wrong?&lt;/em&gt; Operators and customers will likely be confused. This is unlikely to have any first-order effects on the operations of your system, but could make it more difficult to operate and increase downtime in that way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What if the clock I use to do reads is wrong?&lt;/em&gt; Perhaps your design, like &lt;a href=&quot;https://www.usenix.org/system/files/atc23-idziorek.pdf&quot;&gt;DynamoDB’s transaction design&lt;/a&gt;, would retain serializability but lose linearizability and see a lower transaction rate. Keeping some properties in the face of clock error is where approaches like &lt;a href=&quot;http://muratbuffalo.blogspot.com/2014/07/hybrid-logical-clocks.html&quot;&gt;Hybrid Logical Clocks&lt;/a&gt; come in super handy.&lt;/p&gt;

&lt;p&gt;And so on. If you can come up with a good explanation for what will happen when time is wrong, and you’re OK with that happening with some probability, then you should feel OK using physical time. If arbitrarily bad things happen when time is wrong, you’re probably going to have a bad time. If you don’t consider it all, then you may consider yourself lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; I’m sure he still does, but likely not as often now he’s retired.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Of the right type, in the right region (for now), with all the configuration set up right (for now).&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; From Tom Stoppard’s &lt;em&gt;Rosencrantz and Guildenstern are Dead&lt;/em&gt;. Endlessly quotable.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; In &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/&quot;&gt;Time, Clocks and the Ordering of Events in a Distributed System&lt;/a&gt;. You should read this paper, today. In fact, stop here and read it now. Yes, I know you read it before and know the key points, but there’s a lot of smart stuff going on here that you may not remember.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Compare this to Lamport’s &lt;em&gt;clock condition&lt;/em&gt; on page 2 of Time, Clocks.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; A. A. Milne, of course.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot7&quot;&gt;&lt;/a&gt; Shakespeare, from Macbeth. This line is followed with the greatest stage direction of all “Enter Macduff, with Macbeth’s head.”&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot8&quot;&gt;&lt;/a&gt; From the delightful Futurama episode “Hell is Other Robots”, credited to Ken Keeler and Eric Kaplan.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot9&quot;&gt;&lt;/a&gt; &lt;em&gt;For my military knowledge, though I’m plucky and adventury, Has only been brought down to the beginning of the century.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot10&quot;&gt;&lt;/a&gt; From Sontag’s &lt;em&gt;On Photography&lt;/em&gt;. “One can’t possess reality, one can possess images” is nearly as fitting.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot11&quot;&gt;&lt;/a&gt; From Stephen Covey, I think from the book of the same name. You thought you’d make it this far without Self Help, but alas.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot12&quot;&gt;&lt;/a&gt; Dedicated hardware can do much better. Back in graduate school I shared my office with Stephan Sandenbergh, who was building extremely high-quality clocks aimed at building coherent radar systems, many orders of magnitude better than what I’m talking about here. No doubt the state-of-the-art has continued to advance since then.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot13&quot;&gt;&lt;/a&gt; How did I get this far into a level-by-level descent without Dante? But, of course, preferring the spheres of heaven to the circles of hell.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot14&quot;&gt;&lt;/a&gt; Some folks pointed out to me that LWW &lt;em&gt;is&lt;/em&gt; technically a CRDT, which I guess, is fair but not particularly useful.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Optimism vs Pessimism in Distributed Systems</title>
      <link>http://brooker.co.za/blog/2023/10/18/optimism.html</link>
      <pubDate>Wed, 18 Oct 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/10/18/optimism</guid>
      <description>&lt;h1 id=&quot;optimism-vs-pessimism-in-distributed-systems&quot;&gt;Optimism vs Pessimism in Distributed Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;What&amp;mdash;Me Worry?&lt;/p&gt;

&lt;p&gt;Avoiding coordination is the &lt;a href=&quot;https://brooker.co.za/blog/2021/01/22/cloud-scale.html&quot;&gt;one fundamental thing&lt;/a&gt; that allows us to build distributed systems that out-scale the performance of a single machine&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. When we build systems that avoid coordinating, we end up building components that make assumptions about what other components are doing. This, too, is fundamental. If two components can’t check in with each other after every single step, they need to make assumptions about the ongoing behavior of the other component.&lt;/p&gt;

&lt;p&gt;One way to classify these assumptions is into &lt;em&gt;optimistic&lt;/em&gt; and &lt;em&gt;pessimistic&lt;/em&gt; assumptions. I find it very useful, when thinking through the design of a distributed system, to be explicit about each assumption each component is making, whether that assumption is &lt;em&gt;optimistic&lt;/em&gt; or &lt;em&gt;pessimistic&lt;/em&gt;, and what exactly happens if the assumption is wrong. The choice between pessimistic and optimistic assumptions can make a huge difference to the scalability and performance of systems.&lt;/p&gt;

&lt;p&gt;I generally think of optimistic assumptions as ones that avoid or delay coordination, and pessimistic assumptions as ones that require or seek coordination. The optimistic assumption assumes it’ll get away with its plans. The pessimistic assumption takes the bull by the horns and makes sure it will.&lt;/p&gt;

&lt;p&gt;To make this concrete, let’s consider some examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 1: Caches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Distributed caches almost always make assumptions about whether the data they are holding has changed or not. Unlike with CPUs&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, distributed caches typically aren’t &lt;em&gt;coherent&lt;/em&gt;, but we still want them to be &lt;em&gt;eventually consistent&lt;/em&gt;. By &lt;em&gt;eventually consistent&lt;/em&gt; we mean that if the write stream stops, the caches eventually all converge on containing the same data. In other words, inconsistencies are relatively short-lived.&lt;/p&gt;

&lt;p&gt;Possibly the most common way of ensuring this property—that inconsistencies are short-lived—is with a time to live (TTL). This simply means that the cache only keeps items around for a certain fixed period of time. The TTL provides a strong&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; upper bound on how stale an item can be. This is a simple, strong, and highly popular mechanism. It’s also a &lt;em&gt;pessimistic&lt;/em&gt; one: the cache is doing extra work assuming that the item has changed. In systems with a low per-item write rate, that pessimistic assumption can be wrong much more often than it’s right.&lt;/p&gt;

&lt;p&gt;One downside of the pessimistic approach TTL takes is that it means the cache empties when it can’t talk to the authority. This is unavoidable: caches simply can’t provide strongly bounded staleness (or any other strong recency guarantee) if they can’t reach the authority&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Thus the pessimistic TTL approach has a strong availability disadvantage: if a network partition or authority downtime lasts longer than the TTL, the cache hit rate will drop to zero.&lt;/p&gt;

&lt;p&gt;Two more optimistic patterns are quite commonly used to address this situation (especially in DNS and networking systems). One approach is to synchronously try to fetch the new item, but then &lt;em&gt;optimistically&lt;/em&gt; continue to use the old one if the fetch fails (optimistic because it’s making the optimistic assumption that the item hasn’t changed). A subtly different approach is to asynchronously try to fetch the new item, and use the old one until that fetch completes. These protocols seem very similar to TTL, but are fundamentally different. They don’t offer strong recency or staleness guarantees, but can tolerate indefinite network partitions&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
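The synchronous variant can be sketched like this in Python (names are illustrative; a real cache would also bound how long it is willing to keep serving a stale entry):

```python
import time

class OptimisticCache:
    def __init__(self, fetch, ttl):
        self.fetch, self.ttl, self.store = fetch, ttl, {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None:
            value, fetched_at = entry
            if time.time() - fetched_at < self.ttl:
                return value
            # Expired: try to refresh, but keep serving the old
            # value if the authority is unreachable.
            try:
                value = self.fetch(key)
                self.store[key] = (value, time.time())
            except Exception:
                pass
            return value
        value = self.fetch(key)
        self.store[key] = (value, time.time())
        return value
```

Unlike a strict TTL cache, the hit rate here survives an authority outage, at the cost of the staleness bound.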

&lt;p&gt;&lt;strong&gt;Example 2: OCC&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Optimistic concurrency control and its tradeoffs with pessimistic locking-based approaches is a classic topic (maybe the most classic topic) in distributed databases. I won’t try advance that debate here. Instead, to summarize: &lt;em&gt;optimistic concurrency control&lt;/em&gt; is a way of implementing isolated (as in ACID I) transactions that assumes that other concurrent transactions don’t conflict, and detecting at the last moment if that assumption is wrong. &lt;em&gt;Pessimistic&lt;/em&gt; approaches like the classic two-phase locking, on the other hand, do a whole lot of coordination based on the assumption that other transactions do conflict, and it’s worth detecting that early while there’s still time to avoid duplicate work and make smart scheduling decisions.&lt;/p&gt;

&lt;p&gt;OCC systems, in general, coordinate less than pessimistic systems when their optimistic assumption is right, and more than pessimistic systems when the optimistic assumption is wrong.&lt;/p&gt;
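A stripped-down sketch of the optimistic side: read versioned values without taking locks, then validate at commit time that nothing read has changed. `OCCStore` and its methods are illustrative names, not a real engine’s API:

```python
class OCCStore:
    def __init__(self):
        self.values = {}  # key -> (version, value)

    def read(self, key):
        return self.values.get(key, (0, None))

    def commit(self, read_set, writes):
        # Validate: every version we read must still be current.
        for key, version in read_set.items():
            if self.values.get(key, (0, None))[0] != version:
                return False  # conflict detected at the last moment
        for key, value in writes.items():
            old_version = self.values.get(key, (0, None))[0]
            self.values[key] = (old_version + 1, value)
        return True
```

A `False` return is the optimistic assumption being wrong: the caller retries the whole transaction, paying more than a pessimistic system would have for the same conflict.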

&lt;p&gt;Comparing these two approaches is a hard enough first-order problem, but to complicate things further the choice between optimism and pessimism leads to a number of second-order problems too. For example, the number of contending transactions depends on the number of concurrent transactions, and the number of concurrent transactions depends on lock wait times in pessimistic systems and retry rates in optimistic systems. In both kinds of systems, this leads to a direct feedback loop between past contention and future contention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3: Leases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.acm.org/doi/10.1145/74851.74870&quot;&gt;Leases&lt;/a&gt; are a kind of time-based lock widely used in distributed systems. In most systems, a lease is replacing a number of coordination steps. One component takes a lease, and then uses that lease as a license to do multiple things without worrying that other components are doing conflicting things, or may disagree, or whatever. Freed from the worry about conflicts, the lease-holding component can avoid coordinating and go ahead at full speed.&lt;/p&gt;

&lt;p&gt;Leases are an interesting blend of pessimism (&lt;em&gt;I’m assuming other things are going to conflict with my work, so I’m going to stop them in their tracks&lt;/em&gt;) and optimism (&lt;em&gt;I’m assuming I can go ahead without coordination for the next bounded period of time&lt;/em&gt;). If the pessimism is wrong, all the heartbeating and updating and storing of leases is wasted work. As is the time other components could have spent doing work which they wasted while waiting for the lease.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One way I like to reason about the behavior of systems is by writing sentences of the form “this component is assuming that…”&lt;/p&gt;

&lt;p&gt;For our TTL example, we could write statements like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;This component is assuming that clients are OK with seeing stale data as long as the staleness is bounded&lt;/em&gt;, and&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;This cache is assuming that the items it holds have changed, and should be checked after every TTL expiry&lt;/em&gt;, and&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;This cache is assuming that clients would rather experience unavailability or higher latency than see items that are more stale than the TTL bound&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These statements are a tool to help structure our thinking about the behavior of the system. The third one—the availability-staleness tradeoff—is especially powerful because it’s often a hidden assumption people make when choosing a strict TTL.&lt;/p&gt;
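&lt;p&gt;To make that third assumption concrete, here’s a sketch (names hypothetical) of a strict-TTL read-through cache that chooses unavailability over staleness: once an entry is older than the TTL it must re-fetch, even if the backing store is slow or down:&lt;/p&gt;

```python
import time

# Sketch of a strict-TTL read-through cache. Rather than serve an entry
# older than the TTL, it re-fetches: trading availability and latency
# for bounded staleness.
class StrictTTLCache:
    def __init__(self, fetch, ttl_s, clock=time.monotonic):
        self.fetch = fetch    # key -> value; may be slow, or raise when down
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}     # key -> (value, fetch_started_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None:
            value, fetched_at = entry
            if self.ttl_s > self.clock() - fetched_at:
                return value
        # Entry missing or stale: we must fetch. If the backing store
        # is down, this raises: unavailability instead of staleness.
        started = self.clock()
        value = self.fetch(key)
        self.entries[key] = (value, started)
        return value
```

&lt;p&gt;Note that, per the footnote on TTL clocks, the sketch starts the TTL clock when the fetch starts rather than when it completes, which is what makes the staleness bound strict.&lt;/p&gt;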

&lt;p&gt;By coloring each assumption as &lt;em&gt;pessimistic&lt;/em&gt; (coordination-requiring) or &lt;em&gt;optimistic&lt;/em&gt; (coordination-avoiding), we can also structure our thinking about the best time to coordinate, and make sure we’re being consistent in our choices about when and why coordination is needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; And, in a lot of ways, the fundamental thing that allows us to build machines that out-scale the performance of a single in-order core.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Or some CPUs, at least. Most of the CPUs we’re familiar with keep their caches &lt;em&gt;coherent&lt;/em&gt; using protocols like &lt;a href=&quot;https://en.wikipedia.org/wiki/MESI_protocol&quot;&gt;MESI&lt;/a&gt;. These protocols are interesting, because they allow coordination avoidance for unmodified items, at the cost of tracking state and ownership and assuming that the coherency protocol is correctly executed by all participants.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Only strong if the TTL clock starts ticking at the time the item fetch started. Most implementations don’t do this, and instead start the clock at the time the item fetch ended, leading to potentially unbounded staleness.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Following a similar argument to the one Bailis et al make in &lt;a href=&quot;https://arxiv.org/pdf/1302.0309.pdf&quot;&gt;Section 5.2 of Highly Available Transactions&lt;/a&gt;, for which they cite &lt;a href=&quot;https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf&quot;&gt;Gilbert and Lynch&lt;/a&gt; somewhat hand-wavingly. I will continue the hand-waving here.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; If you’re a CAP theorem kinda person, you might call TTL a CP system and these variant AP systems. But that would mostly serve to highlight the limitations of CAP thinking, because none of these variants are &lt;em&gt;C&lt;/em&gt;. If you’re a &lt;a href=&quot;https://brooker.co.za/blog/2014/07/16/pacelc.html&quot;&gt;PACELC&lt;/a&gt; kinda person, you might call the strict TTL variant PCEL, and the less-strict variants PAEL.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; But if you’re interested in learning more about it, check out this &lt;a href=&quot;https://www.youtube.com/watch?v=MM0J0_LX8cg&quot;&gt;Andy Pavlo lecture&lt;/a&gt;, or Harding et al’s excellent 2017 paper &lt;a href=&quot;https://www.cs.cmu.edu/~pavlo/papers/p553-harding.pdf&quot;&gt;An Evaluation of Distributed Concurrency Control&lt;/a&gt;, or Kung and Papadimitriou’s classic 1979 paper &lt;a href=&quot;http://www.eecs.harvard.edu/~htk/publication/1979-sigmod-kung-papadimitriou.pdf&quot;&gt;An Optimality Theory of Concurrency Control for Databases&lt;/a&gt;, or Agrawal et al’s 1987 classic &lt;a href=&quot;https://web.eecs.umich.edu/~jag/eecs584/papers/acl.pdf&quot;&gt;Concurrency Control Performance Modeling: Alternatives and Implications&lt;/a&gt; (thanks Peter Alvaro for reminding me about this one), or the OCC OG &lt;a href=&quot;https://www.eecs.harvard.edu/~htk/publication/1981-tods-kung-robinson.pdf&quot;&gt;On Optimistic Methods for Concurrency Control&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Writing For Somebody</title>
      <link>http://brooker.co.za/blog/2023/09/21/audience.html</link>
      <pubDate>Thu, 21 Sep 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/09/21/audience</guid>
      <description>&lt;h1 id=&quot;writing-for-somebody&quot;&gt;Writing For Somebody&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Who&apos;s there?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sometimes I write long emails to people at work. Sometimes those emails are generally interesting, and not work-specific at all. Sometimes I share those emails here on my blog. This may be one of those times.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Always write for somebody.&lt;/p&gt;

&lt;p&gt;Always have an idea in your head, as you’re writing, who your writing is intended to communicate with. Sometimes, that’s a particular person. Your boss. A mentee, or mentor. Bob from legal. Sometimes it’s a group of people, or a kind of person. Sometimes it’s your future self, or past self&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;I find that having a particular audience in mind allows me to focus my writing better, to communicate better, and make myself more likely to achieve my writing goals. To have empathy for them as a reader, as I would have empathy for somebody I was trying to communicate with face-to-face.&lt;/p&gt;

&lt;p&gt;It’s hard, and perhaps impossible, to distill empathy into a structured approach. But thinking about a particular audience does make for a somewhat useful, if incomplete, checklist:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Who am I writing for?&lt;/li&gt;
  &lt;li&gt;What are they afraid of?&lt;/li&gt;
  &lt;li&gt;What do they already know?&lt;/li&gt;
  &lt;li&gt;What misconceptions do they have?&lt;/li&gt;
  &lt;li&gt;What do I want them to know or understand?&lt;/li&gt;
  &lt;li&gt;What do I want them to do with this knowledge or understanding?&lt;/li&gt;
  &lt;li&gt;What do they want to get out of the time they’re spending reading my writing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Answering each of these questions provides me with a clear lens through which to evaluate whether my writing is going to be successful.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What do they already know?&lt;/em&gt; and &lt;em&gt;What misconceptions do they have?&lt;/em&gt; help focus the use of space and time in a document. Where to spend detail and explanation to ensure clarity, and where information can be elided for brevity. They also help ensure information is presented in the right order, and repeated the right number of times.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What do I want them to know or understand?&lt;/em&gt; helps make sure the right information is present in the document. If there is something that’s not covered by &lt;em&gt;What do they already know?&lt;/em&gt; but I do want them to know at the end, then I need to ensure that information is in the document.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What do I want them to do with this knowledge or understanding?&lt;/em&gt; is the call to action. Sometimes, like with a lot of the writing I do at work, that call to action is explicit. I want somebody to make this decision. Or move in this direction. Or make this investment. Sometimes, like with this blog and academic papers&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; the call to action is more subtle. I want people to understand things, and perhaps to use them in their future work.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What do they want to get out of the time they’re spending reading my writing?&lt;/em&gt; is the quid pro quo. The value for the reader. In some cases, the value is explicit. In others, you’re asking for somebody’s time and attention (deep reading takes significant time and attention) in order to ask them a favor. Is my writing respectful of their time? Or am I wasting their time while asking for more of their time? That doesn’t seem like an experience people would be keen to repeat.&lt;/p&gt;

&lt;p&gt;The odd one out is &lt;em&gt;What are they afraid of?&lt;/em&gt; It’s emotional, and personal, and not about rational decision making at all. That’s the point. Humans aren’t purely rational decision makers. We approach each decision with a lot of context, a lot of prior experience, and sometimes with our own scars and blind spots. Frequently, I see folks who are unable to drive a particular decision because they haven’t been explicit enough about the readers’ prior experiences. Last time we tried this, the bread got stale before we could buy the cheese. Are you thinking about the cheese far enough ahead this time?&lt;/p&gt;

&lt;p&gt;This doesn’t capture everything important about writing, of course. But nearly every piece of ineffective writing I see doesn’t have a clear answer to one or more of these questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Tonally, I think of this blog as being addressed to my past self as an audience. He’s somebody I know rather well.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; I read a lot of research papers. Reliably, the ones I enjoy the least are the ones that were written &lt;em&gt;for&lt;/em&gt; the peer reviewers, with the primary call to action being &lt;em&gt;accept this paper into your conference.&lt;/em&gt; This misalignment of incentives between writers and reviewers and consumers of research, a kind of &lt;a href=&quot;https://en.wikipedia.org/wiki/Principal%E2%80%93agent_problem&quot;&gt;principal-agent problem&lt;/a&gt;, is a serious downside to research peer review.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Exponential Value at Linear Cost</title>
      <link>http://brooker.co.za/blog/2023/09/08/exponential.html</link>
      <pubDate>Fri, 08 Sep 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/09/08/exponential</guid>
      <description>&lt;h1 id=&quot;exponential-value-at-linear-cost&quot;&gt;Exponential Value at Linear Cost&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;What a deal!&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;https://cdn.jsdelivr.net/npm/vega@5&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;https://cdn.jsdelivr.net/npm/vega-lite@4&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;https://cdn.jsdelivr.net/npm/vega-embed@6&quot;&gt;&lt;/script&gt;

&lt;p&gt;Binary search is kind of a magical thing. With each additional search step, the size of the haystack we can search doubles. In other words, the value of a search is &lt;em&gt;exponential&lt;/em&gt; in the amount of effort. That’s a great deal. There are a few similar deals like that in computing, but not many. How often, in life, do you get exponential value at linear cost?&lt;/p&gt;
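&lt;p&gt;To make that concrete, here’s a sketch of binary search that counts its comparisons: doubling the haystack adds roughly one step.&lt;/p&gt;

```python
def binary_search(haystack, needle):
    """Return (index, steps) for a sorted haystack; (None, steps) if absent."""
    lo, hi, steps = 0, len(haystack), 0
    while hi > lo:
        steps += 1
        mid = (lo + hi) // 2
        if haystack[mid] == needle:
            return mid, steps
        if needle > haystack[mid]:
            lo = mid + 1
        else:
            hi = mid
    # Each iteration halves the remaining range, so steps grows only
    # logarithmically in len(haystack): exponential value, linear cost.
    return None, steps
```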

&lt;p&gt;Here’s another important one: redundancy.&lt;/p&gt;

&lt;p&gt;If we have $N$ hosts, each with availability $A$, any one of which can handle the full load of the system, the availability of the total system is:&lt;/p&gt;

&lt;p&gt;$A_{system} = 1 - (1 - A)^N$&lt;/p&gt;

&lt;p&gt;It’s hard to overstate how powerful this mechanism is, and how important it has been to the last couple decades of computer systems design. From RAID to cloud services, this is the core idea that makes them work. It’s also a little hard to think about, because our puny human brains just can’t comprehend the awesome power of exponents (mine can’t at least).&lt;/p&gt;
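&lt;p&gt;A quick sketch of what the formula buys: in this simple independent-failure model, three 99% hosts already get you to six nines.&lt;/p&gt;

```python
def system_availability(host_avail, n):
    # A_system = 1 - (1 - A)^N: the system is only down when all N
    # hosts are down at once, assuming the hosts fail independently.
    return 1 - (1 - host_avail) ** n
```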

&lt;p&gt;If you want to try some numbers, give this a whirl:&lt;/p&gt;

&lt;div id=&quot;vis&quot;&gt;&lt;/div&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
  function make_data(n, host_avail, dc_avail) {
      let data = [];
      for (let i = 0; i &lt; n; i++) {
        data.push({
          &quot;x&quot;: i,
          &quot;y&quot;: dc_avail * (1 - (1 - host_avail)**i),
        });
      }
      return data;
  }

  function updateView(view) {
    let new_data = make_data(view.signal(&apos;Hosts&apos;), view.signal(&apos;HostAvail&apos;), 1.0);
    view.change(&apos;points&apos;, vega.changeset().remove(vega.truthy).insert(new_data)).runAsync();
  }

  var spec = &quot;https://brooker.co.za/blog/resources/redundancy_vega_lite_spec.json&quot;;
  vegaEmbed(&apos;#vis&apos;, spec).then(function(result) {
    updateView(result.view);
    result.view.addSignalListener(&apos;HostAvail&apos;, function(name, value) {
      updateView(result.view);
    });
    result.view.addSignalListener(&apos;Hosts&apos;, function(name, value) {
      updateView(result.view);
    });
  }).catch(console.error);
&lt;/script&gt;

&lt;p&gt;What you’ll realize pretty quickly is that this effect is very hard to compete with. No matter how high you make the availability for a single host, even a very poor cluster quickly outperforms it in this simple model. Exponentiation is extremely powerful.&lt;/p&gt;

&lt;p&gt;Unfortunately it’s not all good news. This exponentially powerful effect only works when all these hosts fail independently. Let’s extend the model just a little bit to include the effect of them being in the same datacenter, and that datacenter having availability $D$. We can easily show that the availability of the total system then becomes:&lt;/p&gt;

&lt;p&gt;$A_{system} = D * (1 - (1 - A)^N)$&lt;/p&gt;

&lt;p&gt;Which doesn’t look nearly as good.&lt;/p&gt;
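&lt;p&gt;A quick sketch of the effect: with three 99% hosts behind a 99.9% datacenter, the datacenter term dominates, and no amount of host redundancy can lift the system above the datacenter’s availability.&lt;/p&gt;

```python
def system_availability_with_dc(host_avail, n, dc_avail):
    # A_system = D * (1 - (1 - A)^N): the datacenter is a shared
    # dependency, so D is a hard ceiling regardless of N.
    return dc_avail * (1 - (1 - host_avail) ** n)
```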

&lt;div id=&quot;vis2&quot;&gt;&lt;/div&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
  function updateView2(view) {
    let new_data = make_data(view.signal(&apos;Hosts&apos;), view.signal(&apos;HostAvail&apos;), view.signal(&apos;DCAvail&apos;));
    view.change(&apos;points&apos;, vega.changeset().remove(vega.truthy).insert(new_data)).runAsync();
  }

  var spec = &quot;https://brooker.co.za/blog/resources/redundancy_2_vega_lite_spec.json&quot;;
  vegaEmbed(&apos;#vis2&apos;, spec).then(function(result) {
    updateView2(result.view);
    result.view.addSignalListener(&apos;HostAvail&apos;, function(name, value) {
      updateView2(result.view);
    });
    result.view.addSignalListener(&apos;Hosts&apos;, function(name, value) {
      updateView2(result.view);
    });
    result.view.addSignalListener(&apos;DCAvail&apos;, function(name, value) {
      updateView2(result.view);
    });
  }).catch(console.error);
&lt;/script&gt;

&lt;p&gt;Which goes to show how quickly things go wrong when there’s some correlation between the failures of redundant components. System designers must pay careful attention to this effect, almost beyond all others, when designing distributed systems. Exponential goodness is our most powerful ally. Correlated failures are its kryptonite.&lt;/p&gt;

&lt;p&gt;This observation is obviously fairly basic. It’s also critically important.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>On The Acoustics of Cocktail Parties</title>
      <link>http://brooker.co.za/blog/2023/08/25/party-time.html</link>
      <pubDate>Fri, 25 Aug 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/08/25/party-time</guid>
      <description>&lt;h1 id=&quot;on-the-acoustics-of-cocktail-parties&quot;&gt;On The Acoustics of Cocktail Parties&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Only parties of well-mannered guests will be considered.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;If you, like me, tend to practice punctual arrival at parties, you’ve likely noticed that most parties start out quiet. Folks are talking in small groups, using their normal voices, and having productive conversations. As more people arrive, the background noise increases. First a little, allowing guests to continue to use a conventional volume. Then, at some point, the background noise will exceed a normal speaking voice, and speakers will increase their volume. This doesn’t solve the problem: instead leading to a further increase in background noise and further volume increases.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/asa_cover.jpg&quot; alt=&quot;Cover of the January 1959 issue of The Journal of the Acoustical Society of America&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the January 1959 issue of the Journal of the Acoustical Society of America&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, William R. MacLean&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; modeled the root cause of this problem in a fun (and rather tongue-in-cheek) paper called &lt;em&gt;On the Acoustics of Cocktail Parties&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;, &lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;In the article, MacLean models a party consisting of $N$ guests, clustered in groups of $K$. These guests, being well-mannered, only allow one speaker per group, for $\frac{N}{K}$ speakers.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In presence of a sufficiently weak background of noise, including other conversations, a well-mannered guest will talk with a small acoustic output $P_m$ … and, if necessary, will adjust his talking distance to a minimum conventional distance $d_0$.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This goes on until the background noise gets too loud:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In the presence of a gradually increasing background of noise however, the average guest $A$ will increase this talking power to a much larger value without being consciously aware of any strain or even the existence of the background, but at a certain maximum acoustic output $P_m$ the strain will become apparent to $A$ who, rather than overtax himself, will reduce his talking distance $d$ to a distance less than the conventional minimum $d_0$ until conversation again becomes possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mathematical model that MacLean builds is where this becomes interesting (and, I dare say, topical for this blog). First, he defines the critical distance $D$ at which the sound energy from each group’s speaker is equal to the background noise.&lt;/p&gt;

&lt;p&gt;$D = \sqrt{\frac{\alpha V}{4 \pi h}}$&lt;/p&gt;

&lt;p&gt;where $V$ is the volume of the room, $\alpha$ is the average sound absorption coefficient ($a &amp;lt; 1$), and $h$ mean free path of a &lt;em&gt;ray of sound&lt;/em&gt; through the room&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Then he works out the signal-to-noise ratio (SNR) that each listener observes:&lt;/p&gt;

&lt;p&gt;$S^2 = \frac{ ( \frac{D}{d_0} )^2 + 1}{\frac{N}{K} - 1}$&lt;/p&gt;

&lt;p&gt;Finally, introducing the minimum comfortable listener SNR $S_m$, we can calculate the critical number of guests $N_0$ where the party transitions from a quiet one (comfortable speaking in loose groups) to a loud one (shouting in uncomfortably tight groups).&lt;/p&gt;

&lt;p&gt;$N &amp;lt; N_0 = K ( 1 + \frac{D^2 + d_0^2}{d_0^2 S_m^2} )$&lt;/p&gt;

&lt;p&gt;MacLean goes on to show&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; that even if the speakers are interrupted by silence (a speech from the host, perhaps), the party will become loud again in a finite time so long as $N \geq N_0$.&lt;/p&gt;
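&lt;p&gt;As a worked example of the model (with made-up numbers, not MacLean’s: a 200 cubic metre room, absorption coefficient 0.3, a 3 m mean free path, groups of 4, a conventional distance of 1 m, and a minimum comfortable SNR of 1):&lt;/p&gt;

```python
import math

# Worked example of MacLean's cocktail-party model. The parameter
# values below are illustrative assumptions, not from the paper.
def critical_distance(alpha, volume, h):
    # D = sqrt(alpha * V / (4 * pi * h)): where a speaker's sound
    # energy equals the background noise.
    return math.sqrt(alpha * volume / (4 * math.pi * h))

def critical_guests(K, D, d0, Sm):
    # N_0 = K * (1 + (D^2 + d0^2) / (d0^2 * Sm^2)): the guest count
    # where the party tips from quiet to loud.
    return K * (1 + (D**2 + d0**2) / (d0**2 * Sm**2))
```

&lt;p&gt;With these assumed numbers the critical distance comes out around 1.3 m, and the party tips from quiet to loud at around 14 guests.&lt;/p&gt;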

&lt;p&gt;&lt;strong&gt;Why is this interesting?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These kinds of threshold effects are well known in all sorts of systems. In &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc896&quot;&gt;RFC 896&lt;/a&gt; from 1984 John Nagle observes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In heavily  loaded  pure datagram  networks  with  end to end retransmission, as switching nodes become congested, the  round  trip  time  through  the  net increases  and  the  count of datagrams in transit within the net also increases.  This is normal behavior under load.  As long  as there is only one copy of each datagram in transit, congestion is under  control.   Once  retransmission  of  datagrams   not   yet delivered begins, there is potential for serious trouble.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This condition is stable.  Once the  saturation  point  has  been reached,  if the algorithm for selecting packets to be dropped is fair, the network will continue to operate in a  degraded  condition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike MacLean’s cocktail party guests, RFC 896’s TCP/IP endpoints can’t stand closer together (in the short term - they can in the longer term by reconfiguring network topology), and instead need to be asked by the network itself to reduce the volume they are speaking at.&lt;/p&gt;

&lt;p&gt;And, in &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/3458336.3465286&quot;&gt;Metastable Failures in Distributed Systems&lt;/a&gt;, Bronson et al say:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MacLean’s cocktail parties exhibit this same phenomenon. When the number of guests exceeds $N_0$ and the party becomes loud, it is not sufficient for the number of guests to merely drop below $N_0$ for it to become quiet again. The background noise has become &lt;em&gt;stuck&lt;/em&gt; in a high state, and only tapping a glass or a significant reduction in party goers is sufficient for the party to become quiet again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are Cocktail Parties Metastable?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It seems so, at least if we expand MacLean’s model slightly. First, consider that each group can improve their SNR $S^2$ (to keep $S^2 &amp;gt; S_m^2$) in two ways: increasing their speaking power $P$ or reducing their group diameter $d$. Our guests, being aware of each other’s personal space, will first respond to reduced noise (caused by lower $N$) by increasing $d$ (towards the minimal comfortable distance $d_0$) even if it means increasing their power $P$ (up to their maximum $P_m$). In this case, reducing $N$ below $N_0$ is not sufficient for the party to become quiet again once it has become loud (at least until $N$ is reduced far enough for $d_0$ to be reached and the social awkwardness of tight quarters to pass).&lt;/p&gt;

&lt;p&gt;This is a &lt;em&gt;bit&lt;/em&gt; of a stretch, but shows how relatively small details of these &lt;em&gt;tipping point&lt;/em&gt; models can lead to behaviors that &lt;em&gt;stick&lt;/em&gt; beyond the tipping point even when the stimulus is removed. I may take another look at this model using simulation in a later post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; You can always rely on this blog to bring you the latest, most topical, research.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; I think I first learned about this paper on Dave Arnold’s &lt;a href=&quot;https://www.patreon.com/cookingissues&quot;&gt;Cooking Issues&lt;/a&gt; podcast, probably sometime around 2015.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; The internet would have us believe that cocktail parties were invented in 1917 (&lt;a href=&quot;https://vinepair.com/articles/the-history-of-the-cocktail-party/&quot;&gt;for&lt;/a&gt;, &lt;a href=&quot;https://www.tastingtable.com/1215416/the-feminist-history-of-cocktail-parties/&quot;&gt;some&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Cocktail_party&quot;&gt;examples&lt;/a&gt;), but it seems likely that humans have enjoyed standing around in groups talking and drinking intoxicating fruity drinks for millennia. After all, wine dates back around 8000 years, and its hard to believe that everybody had the patience to wait for complete fermentation until the 20th century.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Not being an acoustics expert, I’ll have to take MacLean’s word for this, but the calculation is very similar to how it would work out for radar, so I have no need for suspicion.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; MacLean’s method includes “Making the approximation of using differentials for finite differences”, the precise opposite of my usual approach.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; MacLean sounds like a fun person to know. According to &lt;a href=&quot;https://www.nytimes.com/1964/12/22/archives/william-mlean-educator-dead-professor-of-engineering-at-brooklyn.html&quot;&gt;his New York Times obituary&lt;/a&gt;, he was an &lt;a href=&quot;https://pubs.aip.org/asa/jasa/article-abstract/27/2/297/746013/A-Method-of-Transducing-an-Ultrasonic-Shadowgraph?redirectedFrom=fulltext&quot;&gt;early&lt;/a&gt; &lt;a href=&quot;https://pubs.aip.org/asa/jasa/article-abstract/28/3/502/617111/Outlining-Effect-in-Ultrasonic-Images?redirectedFrom=PDF&quot;&gt;pioneer&lt;/a&gt; of ultrasound imaging, and &lt;em&gt;did research in electronics, acoustics, electromagnetism, microwaves, satellite solar cells, the hazards of swimming‐pool lighting, radar, the defects of electric heating pads, electrical capacitators, optics, sound waves and the magnetic inspection of inaccessible pipes.&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Invariants: A Better Debugger?</title>
      <link>http://brooker.co.za/blog/2023/07/28/ds-testing.html</link>
      <pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/07/28/ds-testing</guid>
      <description>&lt;h1 id=&quot;invariants-a-better-debugger&quot;&gt;Invariants: A Better Debugger?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;🎵Some things never change🎵&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Like many of my blog posts, this started out as a long email to a colleague. I expanded it here because I thought folks might find it interesting.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I don’t tend to use debuggers. I’m not against them. I’ve seen folks do amazing things with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gdb&lt;/code&gt;, and envy their skills. I just don’t tend to reach for a debugger very often.&lt;/p&gt;

&lt;p&gt;I’m also not a huge fan of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;printf&lt;/code&gt; debugging. It can be useful, it’s easy to implement, and works well in both one-box and distributed systems. But very quickly the reams of output become overwhelming, and slow me down rather than helping me reason about things.&lt;/p&gt;

&lt;p&gt;My go-to approach when faced with bugs is testing. More specifically, testing &lt;em&gt;invariants&lt;/em&gt;. Most specifically, writing unit tests that assert those invariants after a system or algorithm takes each step.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Invariants&lt;/em&gt;, like &lt;em&gt;assertions&lt;/em&gt;, are things that must be true during or after the execution of a piece of code. “This array is sorted”, “the first item is the smallest”, “items in the &lt;em&gt;deleted&lt;/em&gt; state must have a &lt;em&gt;deleted time&lt;/em&gt;”, that kind of thing. Invariants are broader than assertions: they can assert properties of a piece of data, properties of an entire data structure, or even properties of collections of data structures spread across multiple machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some History&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Way back in undergrad at UCT, I was trying to implement &lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/282918.282923&quot;&gt;Guibas and Stolfi’s algorithm for Delaunay triangulation&lt;/a&gt; for a class project&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. My implementation very nearly worked, but there was one example where it just gave the wrong answer. I spent days banging my head on the problem with printfs and debuggers and just wasn’t making progress. The TA was no help. I asked a CS graduate student I knew that lived nearby, and his approach just blew my mind.&lt;/p&gt;

&lt;p&gt;He sat me down with a piece of paper, and went through the algorithm step-by-step, asking &lt;em&gt;what is true about the data structure at this step?&lt;/em&gt; after each one. Together, we came up with a set of step-by-step invariants, and some global invariants that must hold true after every run of the algorithm. Within minutes of getting back to my desk and writing the tests for the invariants, I had found my bug: a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt; which should have been &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;=&lt;/code&gt;.&lt;/p&gt;
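&lt;p&gt;The same approach translates directly into code. Here’s a sketch, using a much simpler algorithm than Delaunay triangulation, of asserting invariants after every step rather than printing state and eyeballing it:&lt;/p&gt;

```python
# A tiny invariant-checking harness in the spirit described above:
# assert properties of the whole data structure after each step the
# algorithm takes, instead of printf-ing state and eyeballing it.
def check_sorted_invariants(xs):
    # "This array is sorted"
    assert all(xs[i + 1] >= xs[i] for i in range(len(xs) - 1))
    # "The first item is the smallest"
    if xs:
        assert xs[0] == min(xs)

def insertion_sort_checked(items):
    xs = []
    for item in items:
        # Insert item into its sorted position in xs.
        i = len(xs)
        xs.append(item)
        while i > 0 and xs[i - 1] > xs[i]:
            xs[i - 1], xs[i] = xs[i], xs[i - 1]
            i -= 1
        # The invariants must hold after every insertion; an off-by-one
        # comparison bug (like > vs >=) trips these immediately.
        check_sorted_invariants(xs)
    return xs
```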

&lt;p&gt;Over the years, I’ve kept coming back to this approach. It’s turned out to be useful when writing dense algorithmic code, when capturing business logic, and when implementing distributed systems. It’s also one of the things that &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-services-uses-formal-methods/fulltext&quot;&gt;attracted me strongly to TLA+ a decade ago&lt;/a&gt;. The way TLA+ thinks about correctness is all based on global invariants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Paxos&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Paxos is famously difficult to reason about. I’m not going to pretend that I’ve ever found it easy, but I believe people struggle more than needed because they don’t pay enough attention to Section 2.2 of &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt;. In it, Lamport goes step-by-step through the development of a set of invariants for implementing consensus. Starting with some incorrect ones:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;P1. An acceptor must accept the first proposal that it receives.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and layering on requirements before settling on the right invariant. Notably, during this development, the invariants switch from something that can be easily asserted on a single node, to larger properties of the whole system that could only really be asserted by some omniscient higher power.&lt;/p&gt;

&lt;p&gt;In a real system, or even in an integration test, it’s very hard to implement this kind of omniscience. In a model checker, like TLA+’s TLC, it’s trivial, but that doesn’t help all that much for the real implementations. Omnisciently asserting global invariants is one of the most powerful abilities granted by deterministic simulation testing (such as with &lt;a href=&quot;https://github.com/tokio-rs/turmoil&quot;&gt;turmoil&lt;/a&gt;). In the simulator, you can stop time, check invariants, then tick forward deterministically. And by &lt;em&gt;you&lt;/em&gt;, I mean &lt;em&gt;a test&lt;/em&gt;. Tests, unlike debuggers, are easily automatable and repeatable. They’re also able to check things that humans could never keep straight in their heads.&lt;/p&gt;

&lt;p&gt;Like Paxos’s key invariant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;P2c. For any v and n, if a proposal with value v and number n is issued,
then there is a set S consisting of a majority of acceptors such that
either (a) no acceptor in S has accepted any proposal numbered less
than n, or (b) v is the value of the highest-numbered proposal among
all proposals numbered less than n accepted by the acceptors in S.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Multipaxos, Raft, and pretty much every other distributed protocol have invariants like these. Reasoning about them and testing them automatically is, in my mind, an under-appreciated superpower.&lt;/p&gt;
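As a sketch of what "omnisciently asserting" P2c might look like in a test, the checker below evaluates the invariant against a global snapshot of acceptor state. The representation (`accepted` mapping each acceptor to its set of `(number, value)` pairs) and the brute-force iteration over majorities are my own simplifications for illustration, not from any real Paxos implementation; the exponential search over majority sets is only tolerable at test scale.

```python
from itertools import combinations

# Hedged sketch: check Paxos's P2c invariant over a global snapshot.
# `accepted` maps acceptor name -> set of (number, value) proposals it
# has accepted. Returns True if issuing proposal (n, v) satisfies P2c.

def p2c_holds(accepted, n, v):
    acceptors = list(accepted)
    majority = len(acceptors) // 2 + 1
    for s in combinations(acceptors, majority):
        below_n = [(num, val)
                   for a in s
                   for (num, val) in accepted[a]
                   if num < n]
        if not below_n:
            return True  # (a): no acceptor in S accepted anything < n
        if max(below_n)[1] == v:
            return True  # (b): v is the highest-numbered proposal < n in S
    return False

# Three acceptors; a majority has accepted (1, "x"), so a new proposal
# numbered 2 must carry value "x" to satisfy P2c.
state = {"a0": {(1, "x")}, "a1": {(1, "x")}, "a2": set()}
assert p2c_holds(state, 2, "x")
assert not p2c_holds(state, 2, "y")
```

In a deterministic simulation, a check like this can run after every delivered message, something no debugger-driven human could sustain.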

&lt;p&gt;&lt;strong&gt;Example: HNSW&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1603.09320&quot;&gt;Hierarchical Navigable Small World Graphs&lt;/a&gt; are a popular data structure for performing approximate K Nearest Neighbor searches on large sets of high-dimensionality vectors. HNSW isn’t, at its core, too conceptually difficult (&lt;a href=&quot;https://www.pinecone.io/learn/series/faiss/hnsw/&quot;&gt;this is a good introduction&lt;/a&gt;), but is also way harder to reason about than most of the algorithms we come across day to day. The large size of the vectors, the volume of data, and the complexity of graph connectivity make it difficult to reason about HNSW in a debugger.&lt;/p&gt;

&lt;p&gt;But many implementation bugs can be shaken out by thinking about the invariants. What are the things that must be true about the data structure after inserting a new element? What are the things that must be true of the set of entry points passed down into each layer of the search?&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;em&gt;entry point&lt;/em&gt; must be present in the highest populated layer.&lt;/li&gt;
  &lt;li&gt;Each layer is a subset of the layer below it&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Nodes don’t just disappear as you go down the layers.&lt;/li&gt;
  &lt;li&gt;Every node is accessible from the highest appearance of the entry point.&lt;/li&gt;
  &lt;li&gt;All nodes appear in layer 0.&lt;/li&gt;
  &lt;li&gt;Each layer is approximately &lt;em&gt;e&lt;/em&gt; times larger than the one above it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these invariants are rather expensive to test for, such as the last one, which requires &lt;em&gt;O(N log N)&lt;/em&gt; work. They aren’t always practical to assert on each step in a production implementation, but are very practical to check in a set of unit tests. My experience has been that reasoning about, listing, and then testing for invariants like these is a much better way to test such data structures than testing only through their interfaces.&lt;/p&gt;
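A unit-test-style checker for some of these invariants might look like the sketch below. The representation is invented for illustration: `layers` is a list where `layers[0]` is the bottom layer and each layer maps a node id to its neighbor set, and `entry` is the entry point's id. Reachability is checked with a breadth-first search in layer 0, a simplification that relies on the subset invariant.

```python
from collections import deque

# Hedged sketch of HNSW invariant checks over a hypothetical index.
# layers[0] is the bottom layer; each layer: dict node -> set(neighbors).

def check_hnsw_invariants(layers, entry, all_nodes):
    top = max(i for i, layer in enumerate(layers) if layer)
    # The entry point must be present in the highest populated layer.
    assert entry in layers[top]
    # Each layer is a subset of the layer below it.
    for lower, upper in zip(layers[:-1], layers[1:]):
        assert set(upper) <= set(lower)
    # All nodes appear in layer 0.
    assert set(layers[0]) == set(all_nodes)
    # Every node is reachable from the entry point (BFS in layer 0;
    # O(V + E), too slow per insert in production, fine in a unit test).
    seen, queue = {entry}, deque([entry])
    while queue:
        node = queue.popleft()
        for nbr in layers[0][node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    assert seen == set(all_nodes)

# A tiny two-layer index that satisfies all the invariants above.
layers = [
    {0: {1}, 1: {0, 2}, 2: {1}},  # layer 0: all nodes
    {1: set()},                   # layer 1: subset of layer 0
]
check_hnsw_invariants(layers, entry=1, all_nodes=[0, 1, 2])
```

The layer-size-ratio invariant is omitted here; being statistical, it needs a tolerance band rather than an exact assert.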

&lt;p&gt;&lt;strong&gt;Example: Systems Design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working on AWS Lambda’s container loading system (which I’ve &lt;a href=&quot;https://brooker.co.za/blog/2023/05/23/snapshot-loading.html&quot;&gt;blogged about before&lt;/a&gt;, and we describe in our paper &lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/brooker&quot;&gt;On-demand Container Loading in AWS Lambda&lt;/a&gt;), we needed to make sure that chunks weren’t lost during garbage collection. Highly-concurrent large-scale distributed systems of this kind can be extremely difficult to reason about, and so we needed to start our thinking with a set of invariants. As we say in the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Past experience with distributed garbage collection has taught us that the problem is both complex (because the tree of chunk references is changing dynamically) and uniquely risky (because it is the one place in our system where we delete customer data).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Despite this complexity, the system invariants turn out to be relatively simple:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;All new chunks are written into a root in the &lt;em&gt;active&lt;/em&gt; state.&lt;/li&gt;
  &lt;li&gt;All read chunks are under roots in either the &lt;em&gt;active&lt;/em&gt; or &lt;em&gt;retired&lt;/em&gt; state.&lt;/li&gt;
  &lt;li&gt;Roots move monotonically through the &lt;em&gt;active&lt;/em&gt;, &lt;em&gt;retired&lt;/em&gt;, &lt;em&gt;expired&lt;/em&gt;, and &lt;em&gt;deleted&lt;/em&gt; states. They never move backwards through this state chain.&lt;/li&gt;
  &lt;li&gt;Chunks can only be deleted if they are referenced only by &lt;em&gt;expired&lt;/em&gt; roots.&lt;/li&gt;
  &lt;li&gt;A root can only move to &lt;em&gt;deleted&lt;/em&gt; once all its chunks have been deleted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Provided the system preserves these invariants, no data can be lost, even in the face of arbitrary concurrency and scale. This handful of invariants is much easier to reason about than the full system implementation, and writing it down allowed us to come up with a clear formal argument for why these invariants are sufficient.&lt;/p&gt;
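Two of these invariants, monotonic root-state transitions and the delete precondition on chunks, are cheap to express as runtime assertions. The sketch below is a minimal illustration with invented names (`Root`, `can_delete_chunk`), not the actual Lambda garbage collector.

```python
# Hedged sketch: assert two of the listed invariants. Roots move only
# forward through the state chain, and a chunk may be deleted only when
# every root referencing it has reached the expired state.

STATES = ["active", "retired", "expired", "deleted"]

class Root:
    def __init__(self):
        self.state = "active"   # all new roots start active
        self.chunks = set()

    def advance(self, new_state):
        # Invariant: states move monotonically, never backwards.
        assert STATES.index(new_state) > STATES.index(self.state)
        self.state = new_state

def can_delete_chunk(chunk, roots):
    referencing = [r for r in roots if chunk in r.chunks]
    return all(r.state == "expired" for r in referencing)

r1, r2 = Root(), Root()
r1.chunks = {"c1", "c2"}
r2.chunks = {"c2"}
r1.advance("retired")
r1.advance("expired")
assert can_delete_chunk("c1", [r1, r2])      # only r1 (expired) refs c1
assert not can_delete_chunk("c2", [r1, r2])  # r2 is still active
```

Checks like these can run inline in the real system on every transition, while the full cross-root scan stays in tests.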

&lt;p&gt;&lt;strong&gt;The Lesson&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Invariants are a powerful tool for reasoning about algorithms, data structures, and distributed systems. It’s worth thinking through a set of invariants for any complex system or algorithm you design or implement. It’s also worth building your implementation in such a way that even global invariants can be easily tested in a deterministic and repeatable way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; This is a very cool algorithm, and actually not too complicated. But you can tell by the fact that the abstract contains the phrase &lt;em&gt;separation of the geometrical and topological aspects of the problem&lt;/em&gt; that it’s also not the most straightforward thing to reason about.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Advanced implementations may choose to reduce their memory or storage footprint by relaxing these invariants.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>My Favorite Bits of OSDI/ATC'23</title>
      <link>http://brooker.co.za/blog/2023/07/13/osdi.html</link>
      <pubDate>Thu, 13 Jul 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/07/13/osdi</guid>
      <description>&lt;h1 id=&quot;my-favorite-bits-of-osdiatc23&quot;&gt;My Favorite Bits of OSDI/ATC’23&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Talking to 3D people is cool again.&lt;/p&gt;

&lt;p&gt;This week brought &lt;a href=&quot;https://www.usenix.org/conference/atc23/technical-sessions&quot;&gt;USENIX ATC’23&lt;/a&gt; and &lt;a href=&quot;https://www.usenix.org/conference/osdi23/technical-sessions&quot;&gt;OSDI’23&lt;/a&gt; together in Boston. While I’ve followed OSDI and ATC papers for years, it’s the first time I’ve been to either of them (I have been to NSDI a couple of times). It was a really good time. In this post I’ll cover a couple of my favorite papers&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, and trends I noticed.&lt;/p&gt;

&lt;p&gt;Overall, it was great to meet a bunch of folks in person who I’ve only interacted with online, and nice to be back to in-person conferences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thoughts and Trends&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;When we presented the &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/agache&quot;&gt;Firecracker paper&lt;/a&gt; at NSDI’20, several people said to me that they were worried about the fact we had chosen Rust, because it raised the risk that Firecracker wouldn’t be useful once Rust was no longer in vogue. This year at OSDI, pretty much everybody I talked to was building in Rust. Obvious exceptions are folks doing AI/ML work (Python still seems big there), and folks looking to get into the mainline Linux kernel. I couldn’t be more happy to see memory safety start to become the default practice in systems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Loads of folks were talking about emergent system properties like &lt;a href=&quot;https://brooker.co.za/blog/2021/05/24/metastable.html&quot;&gt;metastability&lt;/a&gt;. Unfortunately, not a lot of folks seem to be writing papers about it, or getting grants to work on it. I did talk to a couple folks with upcoming papers, and I really hope the hallway interest turns into more publications. &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3458336.3465286&quot;&gt;Metastable failures in distributed systems&lt;/a&gt; and &lt;a href=&quot;https://www.usenix.org/conference/osdi22/presentation/huang-lexiang&quot;&gt;Metastable Failures in the Wild&lt;/a&gt; are some of the most important systems work of the last few years, in my opinion. There’s a lot more to do here.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I got a rough feeling that more papers were paying attention to security issues than in years past, especially subtle issues like timing side-channels. Another trend I like to see. Security and systems have always been linked, so this isn’t new, but there does seem to be a reduction in completely security-naive work.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Some of the Papers I Enjoyed The Most&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.usenix.org/conference/osdi23/presentation/cheng&quot;&gt;Take Out the Trache&lt;/a&gt; by Audrey Cheng et al&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. This paper makes an astute observation about how caches help with latency the most when everything a transaction needs is cached, and so traditional cache eviction strategies don’t make the right decisions. They then present new metrics, and a nice design for improving things. Worth reading if you’re building any kind of database or distributed cache.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/ginzburg&quot;&gt;VectorVisor&lt;/a&gt; by Samuel Ginzburg et al. What if we compiled normal applications to WASM, then ran them on GPUs? And it actually worked? This is the kind of academic systems work I love the most: bold, innovative, and solving a problem that doesn’t really exist yet but definitely could in the future.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/jin&quot;&gt;EPF: Evil Packet Filter&lt;/a&gt; by Di Jin et al. Operating system kernels like Linux use various internal mechanisms that make it harder to go from kernel bug to working exploit. This paper looks at how useful the current BPF implementation can be for thwarting these mechanisms.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.usenix.org/conference/osdi23/presentation/berger&quot;&gt;Triangulating Python Performance Issues with SCALENE&lt;/a&gt; by Emery Berger et al. A selection of cool approaches for profiling CPU, GPU, and memory in Python programs. Emery finished his talk with a tantalizing demo: generating performance patches automatically by combining LLMs with profiler results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many papers I haven’t read yet, but have heard good things about. I want to look at &lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/tollner&quot;&gt;MELF&lt;/a&gt;, &lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/yasukata&quot;&gt;zpoline&lt;/a&gt;, &lt;a href=&quot;https://www.usenix.org/conference/osdi23/presentation/sadok&quot;&gt;Ensō&lt;/a&gt;, and &lt;a href=&quot;https://www.usenix.org/conference/osdi23/presentation/chang&quot;&gt;vMVCC&lt;/a&gt; in more detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon’s Papers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We presented two papers at ATC this year:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/brooker&quot;&gt;On-demand container loading in AWS Lambda&lt;/a&gt; by me, Mike Danilov, Chris Greenwood, and Phil Piwonka. I wrote a post about this paper &lt;a href=&quot;https://brooker.co.za/blog/2023/05/23/snapshot-loading.html&quot;&gt;back in May&lt;/a&gt;. We won a best paper award for this work!&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc23/presentation/idziorek&quot;&gt;Distributed Transactions at Scale in Amazon DynamoDB&lt;/a&gt; by a great group of folks from the DynamoDB team, looks at DynamoDB’s serializable atomic transaction scheme based on Timestamp Ordering (TO) and 2PC. This paper is a perfect antidote to the widespread idea that transactions can’t or don’t scale. Combined with the team’s &lt;a href=&quot;https://www.usenix.org/conference/atc22/presentation/elhemali&quot;&gt;ATC’22 paper&lt;/a&gt;, this is an excellent deep dive into how a massive scale (&lt;a href=&quot;https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-for-the-win/&quot;&gt;105.2 million TPS for one workload&lt;/a&gt;) database works under the covers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cloning and Snapshot Safety&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A number of papers in the program implemented VM or process cloning, typically for accelerating serverless workloads. This thread of work, related to our &lt;a href=&quot;https://brooker.co.za/blog/2022/11/29/snapstart.html&quot;&gt;own work on Lambda Snapstart&lt;/a&gt;, is bound to have a lot of influence over how systems are built in the coming decades. But I was disappointed to see most of these papers not paying attention to some of the &lt;em&gt;uniqueness&lt;/em&gt; risks of cloning. As we describe in &lt;a href=&quot;https://arxiv.org/pdf/2102.12892.pdf&quot;&gt;Restoring Uniqueness in MicroVM Snapshots&lt;/a&gt;, naively cloning VMs leads to situations where UUIDs, cryptographic keys, or IVs can be duplicated between clones. I’d love to see folks working on cloning insist on solving this problem in their solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Soapbox&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two things came up that I found extremely disappointing. First, there were a lot of folks who should have been there (especially paper authors) who couldn’t get visas to come to the US. It’s unacceptable and counterproductive to have a visa policy where folks who are doing cutting-edge research in economically-critical areas can’t trivially travel to the USA.&lt;/p&gt;

&lt;p&gt;Second, a group of folks presented the results of the &lt;em&gt;CS Conference Climate &amp;amp; Harassment Survey&lt;/em&gt;. I’d recommend reading &lt;a href=&quot;https://fediscience.org/@dan@discuss.systems/110697210451922952&quot;&gt;Dan Ports’ post&lt;/a&gt; for a summary of the results. In short, 40% of the community have experienced harassment at conferences (not necessarily this conference, or a USENIX conference), and 30% of non-male attendees don’t feel welcome. This is unacceptable, and we need to do better&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; These are some of my favorites of the ones I’ve read, or saw talks for. If you presented a paper and it’s not on this list, you can safely assume I haven’t had time to check out your excellent work yet.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Great DB work from the folks at UC Berkeley? Hard to believe.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; I unfortunately missed the dedicated session on this topic, and look forward to attending similar sessions at future conferences.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Bélády's Anomaly Doesn't Happen Often</title>
      <link>http://brooker.co.za/blog/2023/06/23/belady.html</link>
      <pubDate>Fri, 23 Jun 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/06/23/belady</guid>
      <description>&lt;h1 id=&quot;béládys-anomaly-doesnt-happen-often&quot;&gt;Bélády’s Anomaly Doesn’t Happen Often&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Anomaly is a really fun word. Try saying it ten times.&lt;/p&gt;

&lt;p&gt;It was 1969. The Summer of Love wasn’t raging&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, Hendrix was playing the anthem, and Forrest Gump was running rampant. In New York, IBM researchers Bélády, Nelson, and Schedler were hot on the trail of something strange. They had a &lt;em&gt;paging machine&lt;/em&gt;, a computer which kept its memory in &lt;em&gt;pages&lt;/em&gt;, and sometimes moved those pages to storage. Weird&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. It wasn’t only the machine that was weird, it was their performance results. Sometimes, giving the machine more memory made it slower. Without modern spook-hunting conveniences like Scooby Doo and Bill Murray, they had to hunt the ghost themselves.&lt;/p&gt;

&lt;p&gt;What Bélády and team found is something now called &lt;a href=&quot;https://en.wikipedia.org/wiki/B%C3%A9l%C3%A1dy%27s_anomaly&quot;&gt;Bélády’s anomaly&lt;/a&gt;. In &lt;a href=&quot;https://dl.acm.org/doi/10.1145/363011.363155&quot;&gt;their 1969 paper&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, they describe it like this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Running on a paging machine and using the FIFO replacement algorithm, there are instances when the program runs faster if one reduces the storage space allotted to it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More generally, this can happen with any FIFO cache: growing the cache can lead to worse results. This could, in theory, be a big problem for any tuning system or process which makes the assumption that growing the cache leads to better performance&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. This doesn’t happen with LRU. Just with FIFO, and with algorithms like LFU. Some point to Bélády’s anomaly as a good reason for avoiding FIFO caches, even in systems where the reduced read-time coordination and space overhead would be a big win.&lt;/p&gt;

&lt;p&gt;But how frequent is Bélády’s anomaly really? Do we, as system builders, really need to avoid FIFO caches because of it?&lt;/p&gt;

&lt;p&gt;One way to answer that question is how often we’re likely to come across Bélády’s anomaly purely by chance. It turns out that it doesn’t happen very often at all. Starting with access patterns selected randomly with a uniform distribution of keys:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/belady_freq_unif.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;and with a Zipf distribution of keys:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/belady_freq_zipf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In each simulation here, we’re comparing caches of size &lt;em&gt;N&lt;/em&gt; and &lt;em&gt;N+1&lt;/em&gt;, and counting the cases where the smaller cache has a superior hit rate. There are &lt;em&gt;N+2&lt;/em&gt; unique pages in the system, the number that seems to maximize the frequency of the anomaly.&lt;/p&gt;
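A simulation in this spirit can be sketched in a few lines. This is an illustrative reconstruction, not the code behind the plots above; the trial count and sequence length are my own arbitrary choices. It also reproduces Bélády's classic twelve-access example, where a three-slot FIFO cache beats a four-slot one.

```python
import random

# Hedged sketch: compare FIFO caches of size n and n+1 over uniform
# random access patterns with n+2 unique pages, counting how often the
# smaller cache gets strictly more hits (Bélády's anomaly).

def fifo_hits(accesses, size):
    cache, hits = [], 0
    for page in accesses:
        if page in cache:
            hits += 1
        else:
            cache.append(page)
            if len(cache) > size:
                cache.pop(0)  # FIFO: evict the oldest insertion
    return hits

def anomaly_rate(n, trials=2000, length=50, seed=0):
    rng = random.Random(seed)
    anomalies = 0
    for _ in range(trials):
        accesses = [rng.randrange(n + 2) for _ in range(length)]
        if fifo_hits(accesses, n) > fifo_hits(accesses, n + 1):
            anomalies += 1
    return anomalies / trials

# Bélády's classic example: the smaller cache gets more hits.
belady = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
assert fifo_hits(belady, 3) > fifo_hits(belady, 4)
```

Swapping `fifo_hits` for an LRU variant (evicting the least recently *used* page instead of the oldest *inserted* one) makes the anomaly disappear entirely, since LRU cache contents form a strict inclusion hierarchy across sizes.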

&lt;p&gt;These really don’t happen that frequently, with fewer than 0.175% of uniform access patterns showing the anomaly. Also, while the badness of the anomaly is &lt;a href=&quot;https://arxiv.org/abs/1003.1336&quot;&gt;unbounded&lt;/a&gt;, none of these randomly-found strings show more than a small number of additional hits.&lt;/p&gt;

&lt;p&gt;We’re making the assumption that these random access patterns are representative of real-world access patterns. That’s likely to be close to true in multi-tenant and multi-workload systems with large numbers of users or workloads, but may not be true in a single-tenant, single-workload database. There’s also some risk that adversaries could construct strings which take advantage of this anomaly, but if that’s possible it also seems possible for adversaries to create uncachable workloads more generally.&lt;/p&gt;

&lt;p&gt;As a systems builder, Bélády’s anomaly isn’t a big concern to me. As somebody who appreciates a good edge case, I just love it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correction: Bélády’s Anomaly in LFU caches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jasony.me/&quot;&gt;Juncheng Yang&lt;/a&gt; pointed out that I made a rather critical mistake in the first version of this post: I said that Bélády’s anomaly only occurs in FIFO, and doesn’t occur for other algorithms like LFU. I was mistaken: Bélády’s anomaly indeed &lt;em&gt;can&lt;/em&gt; show up in LFU caches, although initial simulation results seem to suggest that this happens rather less often with LFU than FIFO&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; (although with shorter sequences, an effect that I don’t yet really understand).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/belady_freq_unif_LFU.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Juncheng &lt;a href=&quot;https://twitter.com/1a1a11a/status/1677846869433614336&quot;&gt;also said&lt;/a&gt; that &lt;a href=&quot;https://www.usenix.org/conference/fast-03/arc-self-tuning-low-overhead-replacement-cache&quot;&gt;ARC&lt;/a&gt; has a significantly higher rate of anomalies than even FIFO. I haven’t had a chance to test that, but it makes intuitive sense. This change doesn’t alter my conclusion (Bélády’s anomaly probably isn’t worth worrying about), but it’s important to get these things right.&lt;/p&gt;

&lt;p&gt;The next question on my mind is about algorithms like &lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/170036.170081&quot;&gt;LRU-k&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, which exist on some continuum between LRU and LFU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Weird enough that, these days, paging is just &lt;em&gt;how computers work&lt;/em&gt; for the vast majority of machines.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; As Cristina Abad &lt;a href=&quot;https://twitter.com/cabad3/status/1672071784328314880&quot;&gt;pointed out on Twitter&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Like a lot of 1960s systems work, this paper is a delight. It’s easy to read, just five pages, and full of interesting analysis. Even the title, &lt;em&gt;An anomaly in space-time characteristics of certain programs running in a paging machine&lt;/em&gt;, is delightful. Why can’t systems papers like this get published anymore?&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; That was 1967.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Initial results because I’m traveling, and so have to run shorter simulations because I don’t have access to the beefy machine I used for the ones above. And because I could easily have made another silly error here.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; Have I ever shared the story of how I made a fool of myself the first time I met the legendary Betty O’Neil (co-inventor of LRU-k, snapshot isolation, and many other things)?&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>What is a container?</title>
      <link>http://brooker.co.za/blog/2023/06/19/container.html</link>
      <pubDate>Mon, 19 Jun 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/06/19/container</guid>
      <description>&lt;h1 id=&quot;what-is-a-container&quot;&gt;What is a container?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;What are words, anyway?&lt;/p&gt;

&lt;p&gt;A common cause of confusion and miscommunication I see is different people using different definitions of words. Sometimes the definitions are subtly different (as with &lt;a href=&quot;https://brooker.co.za/blog/2018/02/25/availability-liveness.html&quot;&gt;availability&lt;/a&gt;). Sometimes they’re completely different, and we’re just talking about different things entirely. A common example is the word &lt;em&gt;container&lt;/em&gt;, a popular term for a popular technology that means at least four different things.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;An approach to packaging an application along with its dependencies (sometimes a whole operating system user space), that can then run on a minimal runtime environment with a clear contract&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;A set of development, deployment, architectural, and operational approaches built around applications packaged this way.&lt;/li&gt;
  &lt;li&gt;A set of operational, security, and performance isolation tools that allow multiple applications to share an operating system without interfering with each other. On Linux, these tools include &lt;em&gt;chroot&lt;/em&gt;, &lt;em&gt;cgroups&lt;/em&gt;, &lt;em&gt;namespaces&lt;/em&gt;, &lt;em&gt;&lt;a href=&quot;https://man7.org/linux/man-pages/man2/seccomp.2.html&quot;&gt;seccomp&lt;/a&gt;&lt;/em&gt;, and others.&lt;/li&gt;
  &lt;li&gt;A set of implementations of these practices (the proper nouns, Docker, Kubernetes, ECS, etc).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These four definitions are surprisingly independent. The idea of packaging applications this way predates the other three, and will likely be around after they are gone. The practices and approaches are enabled by the tools, but don’t really require them. The Linux kernel-level interfaces, and the semantics and security they provide, are a basis for many of the implementations today, but most of the semantics are available in different ways&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;To pick an example, when we talk about &lt;a href=&quot;https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/&quot;&gt;container image support in AWS Lambda&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; we mostly mean the first one: enabling customers to get the advantages of packaging their code that way, with a small overlap with practices (some become easier to use with this support available), and the fourth (some of these tools can be used to create the images in ways that fit into a broader ecosystem).&lt;/p&gt;

&lt;p&gt;Or, to pick another example, when people say &lt;em&gt;containers are not a security boundary&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, they are mostly talking about the third category (with some overlap into the fourth). It barely touches on the first and second category, which are generally a big win for security. That full conversation is subtle, so I won’t go into it here.&lt;/p&gt;

&lt;p&gt;When you use the word &lt;em&gt;container&lt;/em&gt;, consider whether your audience is using the same definition as you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; For example, with MicroVMs like &lt;a href=&quot;https://github.com/firecracker-microvm/firecracker&quot;&gt;Firecracker&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Those people include me.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; If you’d like to dive into the details, check out &lt;a href=&quot;https://arxiv.org/abs/2305.13162&quot;&gt;our paper about adding container support to AWS Lambda&lt;/a&gt; or my &lt;a href=&quot;https://brooker.co.za/blog/2023/05/23/snapshot-loading.html&quot;&gt;blog post summary of it&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; This question of reducing the size of the contract between the container and the runtime is an interesting one. In most typical container implementations, this contract still includes hundreds of APIs, and other complex interaction points like filesystems. Only on the more extreme end, like MicroVMs’ &lt;em&gt;virtio&lt;/em&gt; interfaces (see the &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/agache&quot;&gt;Firecracker paper&lt;/a&gt; for our approach there) and things like &lt;em&gt;SECCOMP_SET_MODE_STRICT&lt;/em&gt;, do these APIs become truly small. However, across the whole container spectrum they’re smaller and simpler than those presented by &lt;em&gt;libc&lt;/em&gt; and &lt;em&gt;openssl&lt;/em&gt; and the other thousands of libraries you’ll commonly find in a default Linux user space.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Container Loading in AWS Lambda</title>
      <link>http://brooker.co.za/blog/2023/05/23/snapshot-loading.html</link>
      <pubDate>Tue, 23 May 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/05/23/snapshot-loading</guid>
      <description>&lt;h1 id=&quot;container-loading-in-aws-lambda&quot;&gt;Container Loading in AWS Lambda&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Slap shot?&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Back in 2019, we started thinking about how to allow Lambda customers to use container images to deploy their Lambda functions. In theory this is easy enough: a container image is an image of a filesystem, just like the &lt;em&gt;zip&lt;/em&gt; files we already supported. The difficulty, as usual with big systems, was performance. Specifically latency. More specifically &lt;em&gt;cold start&lt;/em&gt; latency. For eight years &lt;em&gt;cold start&lt;/em&gt; latency has been one of our biggest investment areas in Lambda, and we wanted to support container images without increasing latency.&lt;/p&gt;

&lt;p&gt;But how do you take the biggest contributor to latency (downloading the image), increase the work it needs to do by 40x (up to 10GiB from 256MiB), and still not increase latency? The answer to that question is in our new paper &lt;a href=&quot;https://arxiv.org/pdf/2305.13162.pdf&quot;&gt;On-demand Container Loading in AWS Lambda&lt;/a&gt;, which will be appearing at Usenix ATC’23.&lt;/p&gt;

&lt;p&gt;In this post, I’ll pull out some highlights from the paper that I think folks might find particularly interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest win in container loading comes from &lt;em&gt;deduplication&lt;/em&gt;: avoiding moving around the same piece of data multiple times. Almost all container images are created from a relatively small set of very popular base images, and by avoiding copying these base images around multiple times and caching them near where they are used, we can make things move much faster. Our data shows that something like 75% of container images contain less than 5% unique bytes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/dedupe_cdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This isn’t a new observation, and several other container loading systems already take advantage of it. Most of the existing systems&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; do this at the &lt;em&gt;layer&lt;/em&gt; or &lt;em&gt;file&lt;/em&gt; level, but we chose to do it at the &lt;em&gt;block&lt;/em&gt; level. We unpack a snapshot (deterministically, which turns out to be tricky) into a single flat filesystem, then break that filesystem up into 512KiB chunks. We can then hash the chunks to identify unique contents, and avoid having too many copies of the same data in the cache layers.&lt;/p&gt;
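The core of block-level deduplication can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the deterministic flattening step is skipped, SHA-256 stands in for whatever hash the real system uses, and a plain dict stands in for the cache tiers.

```python
import hashlib

# Hedged sketch of block-level dedup: split a flat filesystem image
# into fixed-size chunks, key each chunk by a hash of its contents,
# and store only one copy of each unique chunk.

CHUNK_SIZE = 512 * 1024  # 512 KiB, as in the paper

def dedupe(image: bytes, store: dict) -> list:
    """Return the image's manifest: an ordered list of chunk hashes."""
    manifest = []
    for off in range(0, len(image), CHUNK_SIZE):
        chunk = image[off:off + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)  # keep only the first copy
        manifest.append(key)
    return manifest

# Two images built on the same base share all of the base's chunks.
store = {}
base = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
m1 = dedupe(base + b"app-one".ljust(CHUNK_SIZE, b"\0"), store)
m2 = dedupe(base + b"app-two".ljust(CHUNK_SIZE, b"\0"), store)
assert m1[:4] == m2[:4]  # the four base chunks deduplicate
assert len(store) == 6   # 4 shared base chunks + 2 unique app chunks
```

Content-addressing the chunks is what makes the dedup automatic: identical base-image bytes hash to identical keys regardless of which customer image they came from.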

&lt;p&gt;&lt;strong&gt;Lazy Loading&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of the data in container images isn’t unique, and even less of it is actually used by the processes running in the container (in general). &lt;a href=&quot;https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter&quot;&gt;Slacker&lt;/a&gt; by Harter et al was one of the first papers to provide great data on this. Here’s figure 5 from their paper:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/slacker_fig_5.png&quot; alt=&quot;Figure 5 from Harter et al&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice the gap between &lt;em&gt;reads&lt;/em&gt; and &lt;em&gt;repo size&lt;/em&gt;? That gap is the savings available from loading container data only when it is actually read, rather than downloading the entire image. Harter et al found that only 6.5% of container data is loaded on average. This was the second big win we were going for: the ~15x acceleration available from avoiding downloading whole images.&lt;/p&gt;

&lt;p&gt;In Lambda, we did this by taking advantage of the layer of abstraction that &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/agache&quot;&gt;Firecracker&lt;/a&gt; provides us. Linux has a useful feature called &lt;a href=&quot;https://www.kernel.org/doc/html/next/filesystems/fuse.html&quot;&gt;FUSE&lt;/a&gt;, which provides an interface that allows writing filesystems in userspace (instead of kernel space, which is harder to work in). We used FUSE to build a filesystem that knows about our chunked container format, and responds to reads by fetching just the chunks of the container it needs, when they are needed.&lt;/p&gt;
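&lt;p&gt;As a rough sketch of the on-demand idea (the names and numbers here are invented, and the real implementation sits behind FUSE rather than a Python class), a read only pulls the chunks it actually overlaps:&lt;/p&gt;

```python
CHUNK_SIZE = 4096  # illustration only; the real system uses 512KiB chunks

class LazyChunkFile:
    """Serves reads from a chunked image, fetching chunks only when a
    read actually touches them. fetch_chunk stands in for a call out
    to the cache tiers."""

    def __init__(self, fetch_chunk):
        self.fetch_chunk = fetch_chunk  # callback: chunk index to bytes
        self.local = {}                 # chunks fetched so far

    def read(self, offset, size):
        first = offset // CHUNK_SIZE
        last = (offset + size - 1) // CHUNK_SIZE
        data = b""
        for index in range(first, last + 1):
            if index not in self.local:
                self.local[index] = self.fetch_chunk(index)
            data += self.local[index]
        skip = offset - first * CHUNK_SIZE
        return data[skip:skip + size]
```

&lt;p&gt;Chunks that no process ever reads are never downloaded at all, which is where the ~15x saving comes from.&lt;/p&gt;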

&lt;p&gt;&lt;img src=&quot;/blog/images/lambda_fuse_arch.png&quot; alt=&quot;Lambda&apos;s FUSE-based lazy chunk loading architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The chunks are kept in a tiered cache, with local in-memory copies of very recently-used chunks, local on-disk copies of less recent chunks, and per-availability zone caches with nearly all recent chunks. The whole set of chunks is stored in S3, meaning the cache doesn’t need to provide durability, just low latency.&lt;/p&gt;
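&lt;p&gt;A hypothetical read-through version of that lookup order (with plain dictionaries standing in for memory and disk, and the tiers collapsed for brevity) looks something like this:&lt;/p&gt;

```python
class TieredChunkCache:
    """Read-through cache sketch: in-memory L1, on-disk L2 (both dicts
    here for illustration), with S3 as the durable source of truth."""

    def __init__(self, s3_get):
        self.l1 = {}          # very recently used chunks, in memory
        self.l2 = {}          # less recent chunks, on local disk
        self.s3_get = s3_get  # stand-in for the authoritative store
        self.hits = {"l1": 0, "l2": 0, "s3": 0}

    def get(self, name):
        if name in self.l1:
            self.hits["l1"] += 1
            return self.l1[name]
        if name in self.l2:
            self.hits["l2"] += 1
        else:
            self.hits["s3"] += 1
            self.l2[name] = self.s3_get(name)
        self.l1[name] = self.l2[name]  # promote on access
        return self.l1[name]
```

&lt;p&gt;Because S3 always holds every chunk, a cache node can lose its contents at any time without losing data, only latency.&lt;/p&gt;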

&lt;p&gt;&lt;strong&gt;Tiered Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next big chunk of our architecture is that tiered cache: local, AZ-level, and authority in S3. As with nearly all data accesses in nearly all computer systems, some chunks are accessed much more frequently than others. Despite our local on-worker (L1) cache being several orders of magnitude smaller than the AZ-level cache (L2) and that being much smaller than the full data set in S3 (L3), we still get 67% of chunks from the local cache, 32% from the AZ level, and less than 0.1% from S3.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/lambda_hit_rate.png&quot; alt=&quot;Graph showing hit rates on various cache tiers&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It’s not a surprise to anybody who builds computer systems that caches are effective, but the extreme effectiveness of this one surprised us somewhat. The per-AZ cache is extremely effective (perhaps too effective, which I’ll talk about in a future post).&lt;/p&gt;

&lt;p&gt;Another interesting property of our cache is that we’re careful not to keep exactly one copy of the most common keys in the cache. We mix a little time-varying data, a &lt;em&gt;salt&lt;/em&gt;, into the function that chooses the content-based names for chunks. This means that we cache a little more data than we need to, and lose a little bit of hit rate, but in exchange we reduce the &lt;em&gt;blast radius&lt;/em&gt; of bad chunks. If we kept exactly one copy of the most popular chunks, corruption of that copy could affect nearly all functions. With &lt;em&gt;salt&lt;/em&gt;, the worst case of chunk loss touches only a small percentage of functions.&lt;/p&gt;
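&lt;p&gt;The salting idea itself is tiny (this is an illustrative sketch, not the production naming scheme): mix the salt into the content hash, so the same bytes live under several distinct names:&lt;/p&gt;

```python
import hashlib

def chunk_name(chunk, salt):
    """Content-derived chunk name, mixed with a time-varying salt so
    popular chunks end up stored under several distinct names."""
    return hashlib.sha256(salt + chunk).hexdigest()
```

&lt;p&gt;A chunk written under two salts gets two independent cached copies, so a corrupted copy under one name can’t poison every function that depends on that chunk.&lt;/p&gt;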

&lt;p&gt;&lt;strong&gt;Erasure Coding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The architecture of our shared AZ-level cache is a fairly common one: a fleet of cache machines, a variant of &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashing&lt;/a&gt; to map chunk names onto caches, and an HTTP interface&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. One thing that’s fairly unusual is that we’re using erasure coding to bring down tail latency and reduce the impact of cache node failures. I covered the tail latency angle in my post on &lt;a href=&quot;https://brooker.co.za/blog/2023/01/06/erasure.html&quot;&gt;Erasure Coding versus Tail Latency&lt;/a&gt;, but the operational angle is also important.&lt;/p&gt;
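&lt;p&gt;For a minimal flavor of that mapping (a sketch with invented node names, using virtual nodes so load spreads evenly; the production variant differs), the ring looks like this:&lt;/p&gt;

```python
import bisect
import hashlib

def ring_position(key):
    """Place a string on the hash ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps chunk names onto cache nodes, with virtual nodes so each
    physical node owns many small arcs of the ring."""

    def __init__(self, nodes, vnodes=100):
        points = []
        for node in nodes:
            for i in range(vnodes):
                points.append((ring_position(node + ":" + str(i)), node))
        self.ring = sorted(points)
        self.positions = [p for p, _ in self.ring]

    def node_for(self, chunk_name):
        """First node clockwise from the chunk's position on the ring."""
        i = bisect.bisect(self.positions, ring_position(chunk_name)) % len(self.ring)
        return self.ring[i][1]
```

&lt;p&gt;The useful property is that adding or removing a cache node only moves the keys adjacent to that node’s positions on the ring, rather than reshuffling everything.&lt;/p&gt;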

&lt;p&gt;&lt;img src=&quot;/blog/images/ec_latency.png&quot; alt=&quot;Graph showing latency impact of Erasure Coding&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Think about what happens in a classic consistent hashed cache with 20 nodes when a node failure happens. Five percent of the data is lost. The hit rate drops to a maximum of 95%, which is a more than 5x increase in misses given that our normal hit rate is over 99%. At large scale machines fail all the time, and we don’t want big changes in behavior when that happens. So we use a technique called erasure coding to completely avoid the impact. In erasure coding, we break each chunk up into $M$ parts in a way that allows it to be recreated from any $k$ of them. As long as $M - k \geq 1$ we can survive the failure of any node with zero hit rate impact (because at least $k$ parts survive on the remaining nodes).&lt;/p&gt;
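&lt;p&gt;The simplest instance of this, sketched below, is a single XOR parity part (so $M = k + 1$): any one part can be lost and the chunk still reassembles. (More general codes allow more parity parts; this sketch is just the $M - k = 1$ case, and not the production code.)&lt;/p&gt;

```python
def encode(data, k):
    """Split data into k equal parts plus one XOR parity part."""
    assert len(data) % k == 0
    size = len(data) // k
    parts = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(size)
    for part in parts:
        parity = bytes(a ^ b for a, b in zip(parity, part))
    return parts + [parity]

def decode(parts, k):
    """Reassemble the data; any single entry of parts may be None."""
    missing = [i for i, p in enumerate(parts) if p is None]
    if missing and missing[0] != k:
        # XOR of all surviving parts (including parity) recreates the
        # lost data part
        size = len(parts[k])
        recovered = bytes(size)
        for i, part in enumerate(parts):
            if i != missing[0]:
                recovered = bytes(a ^ b for a, b in zip(recovered, part))
        parts = parts[:missing[0]] + [recovered] + parts[missing[0] + 1:]
    return b"".join(parts[:k])
```

&lt;p&gt;The operational payoff: losing a cache node removes one part of each chunk, never a whole chunk, so the hit rate doesn’t move.&lt;/p&gt;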

&lt;p&gt;That makes software deployments easier too. We can just deploy to the fleet a box at a time, without carefully making sure that data has moved to new machines before we touch them. It’s a little bit of code complexity on the client side, in exchange for a lot of operational simplicity and fault tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The overall architecture of our container loading system consists of approximately 6 blocks. Three of those (&lt;a href=&quot;https://aws.amazon.com/ecr/&quot;&gt;ECR&lt;/a&gt;, &lt;a href=&quot;https://aws.amazon.com/kms/&quot;&gt;KMS&lt;/a&gt;, S3) are existing systems with internal architectures of their own, and three (the flattening system, the AZ-level cache, and the on-worker loading components) are things that we designed and built for this particular project.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/lambda_container_arch.png&quot; alt=&quot;Overall architecture of Lambda&apos;s container loading path&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Each of those components has different scale needs, different performance needs, different deployment mechanisms, and different security needs. So we designed them as different components and they get their own block in the block diagram. These blocks, in turn, interact with the other blocks that make up Lambda, including the control plane that tracks metadata, the invoke plane that sends work to workers, and the isolation provided by Firecracker and related components.&lt;/p&gt;

&lt;p&gt;All large systems are built this way, as compositions of components with different goals and needs. And, crucially, different groups of people responsible for building, operating, and improving them over time. Choosing where to put component boundaries is somewhat science (look for places where needs are different), somewhat art (what are the &lt;em&gt;right&lt;/em&gt; APIs?), and somewhat fortune telling (how will we want to evolve the system in future?). I’m happy with what we did there, but also confident that in the long term we’ll want to adapt and change it. That’s the nature of system architecture.&lt;/p&gt;

&lt;p&gt;As &lt;a href=&quot;https://www.allthingsdistributed.com/2023/05/monoliths-are-not-dinosaurs.html&quot;&gt;Werner Vogels says&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;However, I want to reiterate, that there is not one architectural pattern to rule them all. How you choose to develop, deploy, and manage services will always be driven by the product you’re designing, the skillset of the team building it, and the experience you want to deliver to customers (and of course things like cost, speed, and resiliency).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I loved writing this paper (with my co-authors) because it’s a perfect illustration of what excites me about the work I do. We identified a real problem for our customers, thought through solutions, and applied a mix of architecture, algorithms, and existing tools to solve the problem. Building systems like this, and watching them run, is immensely rewarding. Building, operating, and improving something like this is a real team effort, and this paper reflects deep work from across the Lambda team and our partner teams.&lt;/p&gt;

&lt;p&gt;This system gets performance by doing as little work as possible (deduplication, caching, lazy loading), and then gets resilience by doing slightly more work than needed (erasure coding, salted deduplication, etc). This is a tension worth paying attention to in all system designs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; There’s a fairly complete literature review in the paper. I’m not going to repeat it here, so if you’re interested in how similar systems do it, go check it out there.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; This was something of a surprise. We built the prototype with HTTP (using &lt;a href=&quot;https://github.com/seanmonstar/reqwest&quot;&gt;reqwest&lt;/a&gt; and &lt;a href=&quot;https://hyper.rs/&quot;&gt;hyper&lt;/a&gt;), fully expecting to swap it out for a binary protocol. What we found was that the cache machines could easily saturate their 50Gb/s NICs without breaking a sweat, so we’re keeping HTTP for now.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Open and Closed, Omission and Collapse</title>
      <link>http://brooker.co.za/blog/2023/05/10/open-closed.html</link>
      <pubDate>Wed, 10 May 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/05/10/open-closed</guid>
      <description>&lt;h1 id=&quot;open-and-closed-omission-and-collapse&quot;&gt;Open and Closed, Omission and Collapse&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Were you born in a cave?&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;This, from &lt;a href=&quot;http://www.cs.toronto.edu/~bianca/papers/nsdi_camera.pdf&quot;&gt;Open Versus Closed: A Cautionary Tale&lt;/a&gt; by Schroeder et al&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; is one of the most important concepts in systems performance:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Workload generators may be classified as based on a closed system model, where new job arrivals are only triggered by job completions (followed by think time), or an open system model, where new jobs arrive independently of job completions. In general, system designers pay little attention to whether a workload generator is closed or open.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or, if you’d prefer it as an image, from the same paper:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/open_closed.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While the paper does a good job explaining why this distinction is so important, I don’t think it fully conveys how big a difference the open and closed modes of operation make to measurement, benchmarking, and system stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Examples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s consider a very simple system, along the lines of the one in the image above: a single server, an unbounded queue, and either &lt;em&gt;open&lt;/em&gt; or &lt;em&gt;closed&lt;/em&gt; customer arrival processes. First, we’ll consider an easy case, where the server latency is exponentially distributed with a mean of 0.1ms. What does the client-observed latency look like for a single-client closed system, closed system with 10 clients, or an open system with a Poisson arrival process?&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;To answer that question, we need to pick a value for the server utilization ($\rho$), the proportion of the time the server is busy. Here, we consider the case where the server is busy 80% of the time ($\rho = 0.8$)&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/oc_exp_ecdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This illustrates one of the principles from the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Principle (i): For a given load, mean response times are significantly lower in closed systems than in open systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;but it also shows us something else: the tail is much longer for the open case than the closed one. We can see more of that if we zoom in on just the tail of the latency distribution (percentiles higher than the 90th):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/oc_exp_ecdf_zoomed.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This difference is even more stark if we consider a server-side distribution with worse behavior. Say, for example, one with a response time of 0.1ms 99.9% of the time, and 10ms 0.1% of the time. That’s a bit of an extreme example, but not one that is impossible on an oversubscribed or garbage-collected system, or even one that needs to touch a slow storage device for some requests. What does the response time look like?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/oc_bimod_ecdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;and, zooming in on the tail:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/oc_bimod_ecdf_zoomed.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As we would expect, the 99th percentile for the closed system is much lower. For the single-client closed system, where our hypothesis is that the client-observed latency is equal to the server latency because no queue can form, the 99th percentile is below 1ms. For the open system, it’s over 25ms. This is happening simply because a queue can form. Without a limit on client-side concurrency, the queue grows during the pause, and the server must then pay down that queue when the long request is over. Sometimes another long response time will happen before the queue drains, stacking latency up further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarks and Coordinated Omission&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From the results above, you should be able to see that a &lt;em&gt;closed&lt;/em&gt; benchmark running against a system which will be &lt;em&gt;open&lt;/em&gt; in production could significantly underestimate the tail latency observed by real-world clients (and vice versa). This is a phenomenon that Gil Tene popularized under the name &lt;em&gt;coordinated omission&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, which has brought some (but not enough) awareness of it. This isn’t a small or academic point. While our bimodal example is a little extreme, it is not out of the realm of what we see in the real world, and shows that a closed benchmark could underestimate 99th percentile latency by a factor of at least 25.&lt;/p&gt;

&lt;p&gt;That mistake is a really easy one to make, because the simplest, easiest-to-write benchmark loop falls into exactly this trap. Here’s a closed-loop benchmark:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;while True:
  send request
  wait for response
  write down response time
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;compare it to an open loop implementation:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;while True:
  sleep 1ms
  send request
  (asynchronously write down the response time on completion in a different thread)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The closed loop one is the one I’d probably write if I wasn’t thinking about it. It’s the easiest one to write (at least in a language without nice first-class asynchrony), and the obvious way to approach the problem.&lt;/p&gt;
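&lt;p&gt;To make that difference runnable (a toy single-server simulation along the lines of the bimodal example above, with deterministic arrivals standing in for Poisson, and the parameters invented for illustration):&lt;/p&gt;

```python
import random

def service_time(rng):
    """The bimodal server from above: 0.1ms usually, 10ms 0.1% of the time."""
    return rng.choices([0.1, 10.0], weights=[999, 1])[0]

def open_loop_p99(n, interarrival, seed=0):
    """Open system: arrivals ignore completions, so backlog carries over
    from request to request (the Lindley recursion for waiting time)."""
    rng = random.Random(seed)
    wait, latencies = 0.0, []
    for _ in range(n):
        s = service_time(rng)
        latencies.append(wait + s)
        wait = max(0.0, wait + s - interarrival)
    return sorted(latencies)[int(n * 0.99)]

def closed_loop_p99(n, seed=0):
    """Single closed client: a new request only starts after the previous
    one finishes, so no queue can ever form."""
    rng = random.Random(seed)
    latencies = sorted(service_time(rng) for _ in range(n))
    return latencies[int(n * 0.99)]
```

&lt;p&gt;With a mean service time of about 0.11ms, an interarrival of 0.137ms puts the open system near $\rho = 0.8$, and its 99th percentile latency comes out orders of magnitude above the closed loop’s.&lt;/p&gt;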

&lt;p&gt;&lt;strong&gt;Open Loops, Timeouts, and Congestive Collapse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Coordinated omission and misleading benchmark results aren’t even the most important thing about open loop systems. In my mind, the most important thing to understand is &lt;em&gt;congestive collapse&lt;/em&gt;. Probably the simplest version to understand has to do with client behavior, specifically timeouts and retries. The open loop model is optimistic. In the real world of timeouts and retries, it’s optimistic to believe that jobs arrive independently of job completions. Indeed, even if the underlying arrival rate is Poisson, there is also some additional rate of traffic that arrives due to timeouts.&lt;/p&gt;

&lt;p&gt;Let’s go back to our bimodal example from earlier, and look at the queue length over the simulation time for the open and closed cases. As expected, the closed cases drive shorter queues, and the open case’s latency is driven by the queue growing and shrinking as long server-side latency drives short periods where requests are coming in faster than they can be served.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/oc_bimod_qlen.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the long-term, though, because our utilization is only 80% ($\rho = 0.8$), the server always eventually catches up and the queue drains. One way that often happens in production is because of client behavior, specifically retries&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; after timeouts. What if we take the bimodal system, and make the seemingly very small change of retrying if the response takes longer than 15ms? That seems safe, because it’s still more than 15x the server’s 99th percentile latency. Here’s what our queue length looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/oc_bimod_timeout_qlen.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Oh no! Our queue is growing without bound! We have introduced a catastrophe: as soon as the response time exceeds 15ms for a request, we add another request, which slightly increases the arrival rate (increasing $\rho$), which slightly increases latency, which slightly increases the probability that future requests will take more than 15ms, which slightly increases the rate of future retries, and so on to destruction.&lt;/p&gt;
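&lt;p&gt;That feedback loop can be written down directly. Here’s a sketch with a made-up model of how timeout probability grows with load (the function shapes below are invented for illustration, not measured):&lt;/p&gt;

```python
def settled_utilization(base_rho, p_timeout, steps=1000):
    """Iterate the retry feedback loop: retried requests add load, and
    p_timeout (a model of timeout probability at a given load) turns
    that load into yet more retries. Returns the utilization the loop
    settles at, or 1.0 if arrivals outpace service (collapse)."""
    rho = base_rho
    for _ in range(steps):
        p = p_timeout(rho)
        if p >= 1.0 or base_rho / (1.0 - p) >= 1.0:
            return 1.0  # every retry begets more retries: collapse
        rho = base_rho / (1.0 - p)
    return rho
```

&lt;p&gt;With a timeout probability that rises steeply once load passes some knee, a base load of 0.8 settles quietly while 0.9 spirals to saturation: the same cliff that the queue-length plot shows.&lt;/p&gt;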

&lt;p&gt;Closed systems don’t suffer from this kind of catastrophe, unless they have client behaviors that mistakenly turn them into open systems due to retries. Almost all stable production systems aren’t really &lt;em&gt;open&lt;/em&gt;, and instead approximate closed behavior by limiting either concurrency or arrival rate&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Despite this being a blog post about a queue theory paper, I’ve tried not to use a lot of queue theory results here, and instead used the results of numerical simulations. I’ve done that for two reasons. One is that I want to make a point about tail latency, and queue theory results around the tail latency of arbitrary distributions aren’t particularly accessible (or, in some cases, don’t exist). Second, a lot of people engage with numerical examples more than they do with theoretical results.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; For more on the importance of $\rho$ see &lt;a href=&quot;https://brooker.co.za/blog/2021/08/05/utilization.html&quot;&gt;Latency Sneaks Up on You&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; &lt;a href=&quot;https://www.scylladb.com/2021/04/22/on-coordinated-omission/&quot;&gt;On Coordinated Omission&lt;/a&gt; by Ivan Prisyazhynyy is a good introduction.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; For a more general look at retry-related problems, and one solution, check out &lt;a href=&quot;https://brooker.co.za/blog/2022/02/28/retries.html&quot;&gt;Fixing retries with token buckets and circuit breakers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; One of the best decisions we made early on in building AWS Lambda was to make &lt;em&gt;concurrency&lt;/em&gt; the unit of scaling rather than &lt;em&gt;arrival rate&lt;/em&gt;. This makes it significantly easier both for the provider and the customer to avoid congestive collapse behaviors in their systems. The way Lambda uses concurrency is &lt;a href=&quot;https://docs.aws.amazon.com/lambda/latest/dg/lambda-concurrency.html&quot;&gt;described in the documentation&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; These results were generated with a &lt;a href=&quot;https://brooker.co.za/blog/2022/04/11/simulation.html&quot;&gt;simple simulation&lt;/a&gt;. If you would like to check my work, the simulator code that generated these results is &lt;a href=&quot;https://github.com/mbrooker/simulator_example/tree/main/omission&quot;&gt;available on Github&lt;/a&gt;. The code is less than 200 lines of Python, and should be accessible without any knowledge of queue theory.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>The Four Hobbies, and Apparent Expertise</title>
      <link>http://brooker.co.za/blog/2023/04/20/hobbies.html</link>
      <pubDate>Thu, 20 Apr 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/04/20/hobbies</guid>
      <description>&lt;h1 id=&quot;the-four-hobbies-and-apparent-expertise&quot;&gt;The Four Hobbies, and Apparent Expertise&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Around the end of high school, I started to get really into photography. My friend (let’s call him T) was also into it, which should have been great fun. But it wasn’t. Going shooting with him was never great, for a reason I didn’t figure out till much later. I wanted to take photos. T mostly enjoyed tinkering with cameras. As I’ve spent more time on different hobbies, it’s become clear that this is a common pattern. Every hobby, pastime&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, or sport is really four hobbies.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/four_hobbies.png&quot; alt=&quot;The four hobbies, arranged as quadrants&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first axis is &lt;em&gt;doing&lt;/em&gt; versus &lt;em&gt;talking&lt;/em&gt;, and the second is &lt;em&gt;the hobby&lt;/em&gt; versus &lt;em&gt;the kit&lt;/em&gt;. In nearly every case I’ve seen, people roughly sort themselves into one of these categories.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Doing the thing.&lt;/em&gt; These are the folks who enjoy doing the actual activity: taking photos, skiing, golfing, hiking, hunting, whatever. You’ll find them out in the forest, on the slopes, or on the course.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Collecting the kit.&lt;/em&gt; These folks enjoy collecting, maintaining, tuning, and fiddling with the kit. They tend to be attracted to kit-heavy hobbies like photography, but it seems like you can find them everywhere.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Talking about the thing.&lt;/em&gt; This group enjoys discussing the activity. In-person, on forums, on Twitter, on Reddit, or anywhere else. They’ll talk technique, or pro competition, or about their day on the course.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Talking about the kit.&lt;/em&gt; Like the previous group, these people enjoy the discussion. Instead of talking about the activity, they’ll talk about kit. Whether it’s if this season’s model is better than last’s, or the optimal iron temperature, they want to talk gear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s some crossover between each of these categories, of course, but most communities and people primarily self-select into one of them. Outsiders unaware of this selection may come from the wrong quadrant, and chafe against the community before adapting or leaving.&lt;/p&gt;

&lt;p&gt;But does it matter? It matters because the hobbies are more enjoyable, and the communities more welcoming and harmonious, if you pick which hobby you want to take part in. Say you want to pick up 3D printing, for example. You may go to a community and ask how to get started. This’ll send you down one of four paths: you could be encouraged to buy a printer you can start using right away, or could be encouraged to build your own from an online design, could be pulled into discussing the best filament, or the finer points of CoreXY vs bed slingers. If you want to make some stuff, three of these paths are likely to turn you off the hobby. Similarly, if you love to tinker, three won’t meet that need. And so on.&lt;/p&gt;

&lt;p&gt;Once you’re established, you may find the hobby you enjoy and be able to tell the difference between communities. Or you may luck into the one you like. Often, though, you’ll feel like you’re in the wrong place and not sure why. When starting a hobby or sport, be sure to pick which of the four you want to take part in.&lt;/p&gt;

&lt;p&gt;Kit heads will often say (or imply) that having the best kit leads to the best performance. That doesn’t seem true. I couldn’t ski the Lhotse Couloir even on the best gear ever made. My uncle would beat me at golf even if I had Tiger’s clubs and he just had my old 7 iron. That doesn’t mean that gear collecting and tweaking aren’t good hobbies, just that we need to be practical about the impact of gear on performance&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Appearance of Expertise&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This breakdown matters for another reason, too, and extends beyond hobbies into our professional lives. It’s got to do with the appearance and visibility of expertise. Let’s assume for a moment that expertise primarily comes from experience. We’d expect the most experienced folks to be found in the top left quadrant: the practitioners. They could be practitioners of programming, of data analysis, or of leadership, but that’s where they would be found.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/four_hobbies_gradient.png&quot; alt=&quot;The four hobbies, arranged as quadrants, showing hypothesized gradients of visibility and expertise&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Which ones are the most visible? It’s the bottom two quadrants. The discussers, forum posters, Hacker News commenters, and serial conference speakers. If each person has a finite amount of time to spend either learning or sharing, we’d expect to find a negative correlation between output and experience. This, in turn, lowers the overall quality of content on any given subject&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. As a second-order effect, communication is also a learned skill, so we might expect the ability to argue and persuade to also be negatively correlated with expertise (in that the arguer has spent their time learning to argue, at the cost of time spent developing expertise).&lt;/p&gt;

&lt;p&gt;Kit and tools are another imperfect signal for expertise. Clearly, in our field, there’s a significant benefit to knowing a set of tools well, and being able to use these tools as an extension of our minds&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. On the other hand, it is very easy to confuse the work of getting &lt;em&gt;vim&lt;/em&gt; just-so with actual productivity, and the &lt;em&gt;emacs&lt;/em&gt; expert as an expert on the larger field&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. Observationally, I would say that there’s little correlation between expertise and kit-optimization in our field, positive or negative.&lt;/p&gt;

&lt;p&gt;The lesson here is to be careful with the signals you use as proxies for competence. “Has the perfect Visual Studio config”, “has spoken at loads of conferences”, and “visible on Hacker News”&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; seem like strong signals, when the reality seems to be that they are weak ones, at best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication and Competence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I believe that one of the most important things an engineer or technical leader can do for their career is to practice, and develop, strong communication skills. It may seem like this belief is at odds with the point of this post. In some sense it is. Spending an entire career in that top-left quadrant is attractive to a lot of people (and fulfilling too). But being able to step out of it is valuable, and gives the opportunity to significantly increase impact and recognition.&lt;/p&gt;

&lt;p&gt;So there’s a tension here, both for those looking to plot their career path, and for those looking to find experts to learn from. I don’t think it’s possible to entirely resolve this tension, but it is important to be thoughtful about it. Long-term, a mix is best. But I would say that, wouldn’t I?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; It even extends into other things that aren’t mere pastimes, like parenting.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Academia is interesting here, because one could expect the same logic to apply. However, the incentives to write and share expertise there are explicit, at least somewhat correcting for this phenomenon.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Too Heidegger?&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Or “has a blog”, like the one you’re reading.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; This even seems to apply in cases where the gear is doing more of the work. You don’t need to look around online long to find assertions about the importance of gear above all else, from &lt;em&gt;no good software has ever been written in Java&lt;/em&gt; to &lt;em&gt;it’s impossible to make useful parts on a Tormach&lt;/em&gt;. Clearly tools have limits, but online discussions seem to overstate those limits more frequently than understating them.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; This applies to programming languages too. There are plenty of online communities (and one particular one I won’t mention) who seem to believe that the choice of programming language or paradigm is the most important thing, not only to the success of projects and companies, but to the very existence of our field. Little evidence seems to back these strong assertions.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Surprising Scalability of Multitenancy</title>
      <link>http://brooker.co.za/blog/2023/03/23/economics.html</link>
      <pubDate>Thu, 23 Mar 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/03/23/economics</guid>
      <description>&lt;h1 id=&quot;surprising-scalability-of-multitenancy&quot;&gt;Surprising Scalability of Multitenancy&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;When most folks talk about the economics of cloud systems, their focus is on automatically scaling for long-term seasonality: changes on the order of days (&lt;em&gt;fewer people buy things at night&lt;/em&gt;), weeks (&lt;em&gt;fewer people visit the resort on weekdays&lt;/em&gt;), seasons, and holidays. Scaling for this kind of seasonality is useful and important, but there’s another factor that can be even more important and is often overlooked: short-term peak-to-average. Roughly speaking, the cost of a system scales with its (short-term&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) peak traffic, but for most applications the value the system generates scales with the (long-term) average traffic.&lt;/p&gt;

&lt;p&gt;The gap between “paying for peak” and “earning on average” is critical to understanding how the economics of large-scale cloud systems differ from traditional single-tenant systems.&lt;/p&gt;

&lt;p&gt;Why is it important?&lt;/p&gt;

&lt;p&gt;It’s important because multi-tenancy (i.e. running a lot of different workloads on the same system) very effectively reduces the peak-to-average ratio that the overall system sees. This is highly beneficial for two reasons. The first-order reason is that it improves the economics of the underlying system, by bringing costs (proportional to &lt;em&gt;peak&lt;/em&gt;) closer to value (proportional to &lt;em&gt;average&lt;/em&gt;). The second-order benefit, and the one that is most directly beneficial to cloud customers, is that it allows individual workloads to have higher peaks without breaking the economics of the system.&lt;/p&gt;

&lt;p&gt;Most people would call that &lt;em&gt;scalability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 1: S3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Earlier this month, Andy Warfield from the S3 team did a &lt;a href=&quot;https://www.youtube.com/watch?v=sc3J4McebHE&amp;amp;t=1282s&quot;&gt;really fun talk at FAST’23&lt;/a&gt; about his experiences working on S3. There’s a lot of gold in his talk, but there’s one point he made that I think is super important, and worth diving deeper into: heat management and multi-tenancy. Here’s the start of the relevant bit on heat&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; management:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/sc3J4McebHE?start=1282&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Andy makes a lot of interesting points here, but the key one has to do with the difference between the &lt;em&gt;per object&lt;/em&gt; heat distribution, the &lt;em&gt;per aggregate&lt;/em&gt; heat distribution, and the &lt;em&gt;system-wide&lt;/em&gt; heat distribution.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Scale allows us to deliver performance for customers that would otherwise be prohibitive to build.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here, Andy is talking about that second-order benefit. By spreading customers’ workloads over large numbers of storage devices, S3 is able to support individual workloads with peak-to-average ratios that would be prohibitively expensive in any other architecture. Importantly, this happens without increasing the peak-to-average of the overall system, and so comes without additional cost to customers or the operator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2: Lambda&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Much like S3, Lambda’s scalability is directly linked to multi-tenancy. In our &lt;a href=&quot;https://www.usenix.org/system/files/nsdi20-paper-agache.pdf&quot;&gt;Firecracker paper&lt;/a&gt;, we explain why:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Each slot can exist in one of three states: initializing, busy, and idle… Slots use different amounts of resources in each state. When they are idle they consume memory, keeping the function state available. When they are initializing and busy, they use memory but also resources like CPU time, caches, network and memory bandwidth and any other resources in the system. Memory makes up roughly 40% of the capital cost of typical modern server designs, so idle slots should cost 40% of the cost of busy slots. Achieving this requires that resources (like CPU) are both soft-allocated and oversubscribed, so can be sold to other slots while one is idle.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This last point is the key one: we can’t dynamically scale how much memory or CPU an EC2 metal instance has, but we can plan to sell memory or CPU that is currently not being used by one customer to another customer. This is, fundamentally, a statistical bet. There is always some non-zero probability that demand will exceed supply on any given instance, and so we manage it very carefully.&lt;/p&gt;
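&lt;p&gt;A minimal sketch of that statistical bet (with made-up numbers, not Lambda’s real parameters): if each slot on a worker is independently busy some fraction of the time, we can estimate how often total demand exceeds an oversubscribed capacity.&lt;/p&gt;

```python
import random

def p_contention(slots, capacity, p_busy, trials=20_000, seed=17):
    """Estimate how often the number of concurrently-busy slots exceeds
    capacity, assuming each slot is independently busy with p_busy."""
    rng = random.Random(seed)
    over = sum(
        sum(rng.random() < p_busy for _ in range(slots)) > capacity
        for _ in range(trials)
    )
    return over / trials

# Hypothetical numbers: sell 100 slots against capacity for 40
# simultaneously-busy ones (2.5x oversubscribed). With each slot busy
# 20% of the time, demand exceeding supply is vanishingly rare.
print(p_contention(slots=100, capacity=40, p_busy=0.2))
```

&lt;p&gt;The bet only works for independent slots: if the slots’ busy periods were correlated, the busy count would no longer concentrate around its mean.&lt;/p&gt;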

&lt;blockquote&gt;
  &lt;p&gt;We set some compliance goal X (e.g., 99.99%), so that functions are able to get all the resources they need with no contention X% of the time. Efficiency is then directly proportional to the ratio between the Xth percentile of resource use, and mean resource use. Intuitively, the mean represents revenue, and the Xth percentile represents cost. Multi-tenancy is a powerful tool for reducing this ratio, which naturally drops approximately with $\sqrt{N}$ when running $N$ uncorrelated workloads on a worker.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And, of course, for this to work we need to be sure that the different workloads don’t all spike up at the same time, which requires that workloads are not correlated. Again, this is not only an economic effect, but also a scalability one. By working this way, Lambda can absorb both long and short spikes in load for any single workload very economically, allowing it to offer scalability that is difficult to match with traditional infrastructure.&lt;/p&gt;
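&lt;p&gt;We can see this effect in a toy simulation (my sketch, not the paper’s analysis, using exponential demand as a stand-in for a bursty workload): aggregate $N$ uncorrelated workloads and watch the ratio between a high percentile of demand and mean demand shrink as $N$ grows.&lt;/p&gt;

```python
import random, statistics

def peak_to_mean(n_workloads, intervals=20_000, seed=7):
    """Simulate per-interval aggregate demand of n uncorrelated workloads,
    and return the ratio of the 99.99th-percentile demand ("cost") to the
    mean demand ("revenue")."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.expovariate(1.0) for _ in range(n_workloads))
        for _ in range(intervals)
    )
    p9999 = totals[int(0.9999 * intervals) - 1]
    return p9999 / statistics.fmean(totals)

# The excess of peak over mean shrinks roughly like 1/sqrt(N).
for n in (1, 4, 16, 64):
    print(n, round(peak_to_mean(n), 2))
```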

&lt;blockquote&gt;
  &lt;p&gt;Keeping these workloads uncorrelated requires that they are unrelated: multiple workloads from the same application, and to an extent from the same customer or industry, behave as a single workload for these purposes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This last point is very important, because it illustrates the difference between our real-world setting and idealized models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poisson Processes and the Real World&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What I said above is true for Poisson processes, but the effect there is not nearly as powerful as what we see in the real world, which is interesting because Poisson processes are widely used to model the economics and scalability of systems. To understand why, we need to think a little bit about the sum of two Poisson processes. Say we have two customers of the system, one being a Poisson process with a mean arrival rate of 1 tps ($\lambda_1 = 1$), and one with a mean arrival rate of 4 tps ($\lambda_2 = 4$). The sum of the two is a Poisson process with an arrival rate of 5 tps ($\lambda_t = \lambda_1 + \lambda_2 = 1 + 4$)&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. This keeps going: no matter how many Poisson customers you add, the aggregate arrival process remains Poisson.&lt;/p&gt;
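&lt;p&gt;This is easy to check numerically (a quick sketch, not a proof): count per-interval arrivals of two independent Poisson processes and look at the merged stream, whose variance should match its mean of $5$.&lt;/p&gt;

```python
import random, statistics

def poisson_counts(rate, n, rng):
    """Arrival counts in n unit intervals of a Poisson process, generated
    by accumulating exponentially-distributed inter-arrival gaps."""
    counts = []
    for _ in range(n):
        t, k = 0.0, 0
        while True:
            t += rng.expovariate(rate)
            if t >= 1.0:
                break
            k += 1
        counts.append(k)
    return counts

rng = random.Random(42)
n = 50_000
merged = [a + b for a, b in zip(poisson_counts(1.0, n, rng),
                                poisson_counts(4.0, n, rng))]
# A Poisson process has variance equal to its mean: the merged stream
# looks like a single Poisson process with rate 1 + 4 = 5.
print(round(statistics.fmean(merged), 2), round(statistics.pvariance(merged), 2))
```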

&lt;p&gt;That’s still good, because as you scale the system to handle the higher-rate process, the $c$ in the &lt;em&gt;M/M/c&lt;/em&gt; system goes up, and &lt;a href=&quot;https://brooker.co.za/blog/2020/08/06/erlang.html&quot;&gt;utilization increases&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But many real-world processes, like the ones that Andy talks about, are not Poisson. They behave much worse, with much higher spikes and much more autocorrelation than Poisson processes would exhibit. This isn’t some kind of numerical anomaly, but rather a simple observation about the world: traffic changes with time and use. I don’t use my computer’s hard drive once every $\frac{1}{\lambda}$ seconds, exponentially distributed. I use it a lot, then not really at all. I don’t use my car that way, or my toaster. And humans don’t use the cloud that way either. One use leads to another, and one user being active is not independent of another being active.&lt;/p&gt;

&lt;p&gt;But, if you can mix a lot of different workloads, with different needs and patterns, you can hide the patterns of each. That’s the fundamental economic effect of multi-tenancy, and the thing that a lot of people overlook when thinking about the economics of the cloud&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Where the definition of &lt;em&gt;short-term&lt;/em&gt; depends on how quickly the system can scale up and down without incurring costs. Running on Lambda, short-term may be seconds. If you’re building datacenters, it may be months or years.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; To be clear, Andy’s talking about logical workload heat here (a &lt;em&gt;hot&lt;/em&gt; workload is one doing a lot of IO at a given moment), not physical &lt;em&gt;temperature&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; &lt;a href=&quot;https://math.stackexchange.com/questions/4446957/prove-sum-of-two-independent-poisson-processes-is-another-poisson-process&quot;&gt;This Math StackExchange question covers it well&lt;/a&gt;, and there’s a more formal treatment in &lt;a href=&quot;https://www.amazon.com/Stochastic-Models-Operations-Research-Vol/dp/0486432599&quot;&gt;Stochastic Models in Operations Research&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; One way to think about this is that by summing over multiple non-stationary, autocorrelated, seasonal, and otherwise poorly behaved workloads we’re restoring the &lt;em&gt;poisson-ness&lt;/em&gt; of the overall workload. That’s not too far from the truth, because the Poisson process is the result of summing a large number of independent arrival processes. It’s not quite true, but directionally OK in my mind. On the other hand, some sensible people don’t support doing math based on vibes, and might object.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>False Sharing versus Perfect Placement</title>
      <link>http://brooker.co.za/blog/2023/03/07/false-sharing.html</link>
      <pubDate>Tue, 07 Mar 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/03/07/false-sharing</guid>
      <description>&lt;h1 id=&quot;false-sharing-versus-perfect-placement&quot;&gt;False Sharing versus Perfect Placement&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;em&gt;This is part 3 of an informal series on database scalability. The previous parts were on &lt;a href=&quot;https://brooker.co.za/blog/2023/01/30/nosql.html&quot;&gt;NoSQL&lt;/a&gt;, and &lt;a href=&quot;https://brooker.co.za/blog/2023/02/07/hot-keys.html&quot;&gt;Hot Keys&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://brooker.co.za/blog/2023/02/07/hot-keys.html&quot;&gt;the last installment&lt;/a&gt;, we looked at hot keys and how they affect the theoretical peak scale a database can achieve. Hidden in that post was an underlying assumption: that we can successfully isolate the hottest key onto a shard of its own. If the key distribution is slow moving (hot keys now will still be hot keys later) then this is achievable. The system can reshard (for example by splitting and merging existing shards) until heat is balanced across shards.&lt;/p&gt;

&lt;p&gt;Unfortunately for us, this nice static distribution of key heat doesn’t seem to happen often. Instead, what we see is that the popularity of keys changes over time. Popular products come and go. Events come and go. The 1966 FIFA World Cup came and went&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. If the distribution of which keys are popular &lt;em&gt;right now&lt;/em&gt; changes often enough, then moving data around to balance shard heat becomes rather difficult and expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even Sharding and False Sharing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the extreme end, where there is no stability in the key heat distribution, we may not be able to shard our database better than evenly (or, somewhat equivalently, randomly). This might work out well, with the hottest key on one shard, the second hottest on another, third hottest on another, and so on. It also might work out poorly, with the hottest and second hottest keys on the same shard. This leads to a kind of &lt;em&gt;false sharing&lt;/em&gt; problem, where shards are hotter than they strictly need to be, just by getting unlucky.&lt;/p&gt;

&lt;p&gt;How likely are we to get unlucky in this way?&lt;/p&gt;

&lt;p&gt;Let’s start with uniformly distributed keys, and think about a database with 10,000 keys, sharded into 2, 5 or 10 different shards. Ideally, for the 2 shard database we’d like to see the hottest shard get 50% of the traffic. For the 10 shard database 10%. This is what the distribution looks like with 50,000 simulation runs (solid lines are simulation results, dotted vertical lines show ‘perfect’ sharding)&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/unif_false_sharing_pdf.png&quot; alt=&quot;Simulation results for false sharing on uniform keys&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Not bad! In fact, with nearly all the simulation runs hitting the ideal bound, we can safely say that we don’t have a major false sharing problem here. Uniform, however, is easy mode. What about something a little more biased, like the Zipf distribution? This is what things look like for Zipf distributed keys:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/zipf_false_sharing_pdf.png&quot; alt=&quot;Simulation results for false sharing on zipf keys&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Ah, that’s much worse. We can see that there are some runs for the two-shard case where the hottest shard is getting nearly 80% of the database traffic! Even for the 10 shard case, the hottest shards are still getting over 40% of database traffic, compared to the ideal 10%. Let’s look at the cumulative version to get a feeling for how common this is.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/zipf_false_sharing.png&quot; alt=&quot;Simulation results for false sharing on zipf keys, cumulative&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, for example, we can see in the 5 shard case that nearly 15% of the time the hottest shard is getting double the traffic we would ideally expect.&lt;/p&gt;
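&lt;p&gt;Here’s a compact sketch of this kind of simulation (my simplification of the setup, not the original R code): assign Zipf-weighted keys to shards at random and measure the hottest shard’s share of total heat.&lt;/p&gt;

```python
import random

def hottest_shard_share(n_keys, n_shards, s, runs=100, seed=1):
    """Randomly ("evenly") shard keys whose heat follows a Zipf(s) law,
    and return the average share of total heat on the hottest shard."""
    rng = random.Random(seed)
    heat = [1.0 / k ** s for k in range(1, n_keys + 1)]
    total = sum(heat)
    shares = []
    for _ in range(runs):
        load = [0.0] * n_shards
        for h in heat:
            load[rng.randrange(n_shards)] += h
        shares.append(max(load) / total)
    return sum(shares) / runs

# Ideal for 5 shards would be 20%. Uniform heat (s=0) gets close;
# Zipf heat lands well above it, purely from unlucky co-location.
print(round(hottest_shard_share(10_000, 5, 0.0), 3))
print(round(hottest_shard_share(10_000, 5, 1.0), 3))
```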

&lt;p&gt;&lt;strong&gt;The Trend&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My instinct, when I started looking at the simulation results, was that the amount of false sharing would decrease significantly as the database size gets larger, because there would be more “dilution” of the hot keys. Defining the amount of false sharing as the mean hot shard load divided by the ideal shard load, this is exactly what we see happen:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/zipf_false_sharing_trend.png&quot; alt=&quot;Simulation results for false sharing on zipf keys, trend&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, this drop is relatively slow, so it doesn’t save us from the underlying problem until database sizes become very big indeed, and even then size never truly solves it. But this is still one of those luxury problems where scale makes things (slightly) easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whether this false sharing effect is important or not depends on other factors in your system architecture. It may, however, be surprising when sharding is not as effective as we may have hoped. For example, if we split the database in half and get an 80:20 split rather than the expected 50:50 split, we might need to split further, into smaller shards than would otherwise have been ideal.&lt;/p&gt;

&lt;p&gt;This doesn’t only affect databases. The same effect will happen with sharded microservices, queues, network paths, compute hardware, or whatever else. In all these cases, this effect is practically important because it makes uniform or random sharding significantly less effective than it might be (and so requires more elaborate sharding approaches), and might make sharding much less effective for heat distributions that are highly variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal and Spatial Locality&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The distributions above assume that the heat is spread out over the key space evenly, and not in a way that is aligned with the sharding scheme.&lt;/p&gt;

&lt;p&gt;For example, consider a database table with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SERIAL&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AUTO_INCREMENT&lt;/code&gt; primary key, and the common pattern that recently-created rows tend to be accessed more often than older rows (customers are more likely to check on the status of recent orders, or add additional settings to new cloud resources, etc). If the sharding scheme is based on key ranges, all this heat may be focused on the shard that owns the range of most recent keys, leading to even worse shard heat distributions than the simulations above. Schemes that shard based on hashes (or other non-range schemes) avoid this problem, but introduce other problems by losing what may be valuable locality.&lt;/p&gt;
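&lt;p&gt;Here’s a small sketch of that failure mode (hypothetical shard counts and skew, not measured data): recency-skewed reads on an auto-increment key pile onto the last range shard, while hash sharding spreads them nearly evenly.&lt;/p&gt;

```python
import hashlib, random

N_KEYS, N_SHARDS = 100_000, 10

def range_shard(key):
    # Contiguous ranges: shard 0 owns ids 0..9999, shard 9 the newest ids.
    return key * N_SHARDS // N_KEYS

def hash_shard(key):
    # Hashing the key scatters consecutive ids across shards.
    return int(hashlib.sha256(str(key).encode()).hexdigest(), 16) % N_SHARDS

rng = random.Random(3)
load_range, load_hash = [0] * N_SHARDS, [0] * N_SHARDS
for _ in range(50_000):
    # Recency skew: 90% of reads hit the newest 5% of auto-increment ids.
    if rng.random() < 0.9:
        key = rng.randrange(int(0.95 * N_KEYS), N_KEYS)
    else:
        key = rng.randrange(N_KEYS)
    load_range[range_shard(key)] += 1
    load_hash[hash_shard(key)] += 1

print(round(max(load_range) / sum(load_range), 2))  # last range shard takes most traffic
print(round(max(load_hash) / sum(load_hash), 2))    # hash spreads it near-evenly
```

&lt;p&gt;The trade-off in the last paragraph shows up directly: the hash scheme wins here, but gives up the range scans that key-range sharding makes cheap.&lt;/p&gt;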

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Don’t tell the English. If they ask you about it, tell them it’s still the most important sporting event in history, then run.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; You can check out the simulation code &lt;a href=&quot;https://github.com/mbrooker/simulator_example/blob/main/false_sharing/false-sharing.R&quot;&gt;on GitHub&lt;/a&gt; if you want to check my work, or try different distributions for yourself.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Hot Keys, Scalability, and the Zipf Distribution</title>
      <link>http://brooker.co.za/blog/2023/02/07/hot-keys.html</link>
      <pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/02/07/hot-keys</guid>
      <description>&lt;h1 id=&quot;hot-keys-scalability-and-the-zipf-distribution&quot;&gt;Hot Keys, Scalability, and the Zipf Distribution&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;&lt;i&gt;the&lt;/i&gt;: so hot right now.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Does your distributed database (or microservices architecture, or queue, or whatever) &lt;em&gt;scale&lt;/em&gt;? It’s a good question, and often a relevant one, but almost impossible to answer. To make it a meaningful question, you also need to specify the workload and the data in the system. &lt;em&gt;Given this workload, over this data, does this database scale?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One common reason systems don’t scale is because of &lt;em&gt;hot keys&lt;/em&gt; or &lt;em&gt;hot items&lt;/em&gt;: things in the system that are accessed way more often than the average thing. To understand why, let’s revisit our database architecture from the previous post:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/db_basic_arch.png&quot; alt=&quot;Abstract Database Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Sharding, the horizontal dimension in this diagram, only works if the workload is well distributed over the key space. If some keys are &lt;em&gt;too popular&lt;/em&gt; or &lt;em&gt;too hot&lt;/em&gt;, then their associated shard will become a bottleneck for the whole system. Adding more capacity will increase throughput for other keys, but the hottest ones will just experience errors or latency. In this post, we’ll look at some examples to understand how much of a bottleneck this actually is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Say hello to Olivia and Liam&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you were born in the USA in 2021, there’s about a &lt;a href=&quot;https://www.ssa.gov/cgi-bin/popularnames.cgi&quot;&gt;1% chance your name is either Olivia or Liam&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, or about a 0.01% chance your name is Blaise or Annabella. Baby names come in waves and fashions, and so some are always much more popular than others&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/baby_names.png&quot; alt=&quot;Chart showing frequency distribution of baby names in the USA in 2021&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, imagine we were using the baby’s first name as a database key. Clearly, that would skew accesses heavily towards Olivia’s partition, affecting throughput for the whole database. But how much of a practical concern is that effect?&lt;/p&gt;

&lt;p&gt;Let’s start by building our intuition. For simplicity we’re going to consider just girls. We’d expect about 1% of babies to be called Olivia, and at least 1% of traffic then going to Olivia’s partition. So, if we’re trying to avoid errors or latency caused by overloading that partition, we’d expect that by the time the database was handling $\approx \frac{1}{0.01} = 100$ times the traffic Olivia’s partition can handle, then adding more shards won’t help.&lt;/p&gt;

&lt;p&gt;Unfortunately, that’s a very optimistic picture. Assuming that each incoming request is independently sampled from the names distribution, we run into a &lt;a href=&quot;https://brooker.co.za/blog/2018/01/01/balls-into-bins.html&quot;&gt;balls into bins&lt;/a&gt; problem&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Sometimes, by chance, the load on Olivia’s partition will exceed the limit earlier than expected, leading to overload earlier than expected. How often will this happen? Because I’m a coward and afraid of trying to reason about this in closed form, we’re going to &lt;a href=&quot;https://brooker.co.za/blog/2022/04/11/simulation.html&quot;&gt;simulate it&lt;/a&gt;, assuming that once there are too many requests in flight on a node then any more incoming will fail.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/hot_keys_babies.png&quot; alt=&quot;Chart showing database errors versus scale for names picked from the distribution of baby names&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What we see is that errors start picking up a &lt;em&gt;lot&lt;/em&gt; earlier than expected. Errors start off at a low level around 50 nodes, and pick up rather quickly after that. The 100 we expected doesn’t even seem achievable.&lt;/p&gt;
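&lt;p&gt;Here’s a much-simplified sketch of that simulation (a single hot key at 1% instead of the full names distribution, and a crude drop-everything-beyond-capacity model), which is enough to reproduce the shape: errors appear well before the naive 100x limit.&lt;/p&gt;

```python
import random

def error_rate(p_hot, capacity, scale, ticks=300, seed=11):
    """Fraction of hot-partition requests dropped, assuming any request
    beyond `capacity` in flight on the partition fails. `scale` is total
    traffic in multiples of one partition's capacity."""
    rng = random.Random(seed)
    total = int(scale * capacity)
    dropped = sent = 0
    for _ in range(ticks):
        hot = sum(rng.random() < p_hot for _ in range(total))
        sent += hot
        dropped += max(0, hot - capacity)
    return dropped / sent

# A key taking 1% of traffic, partitions handling 100 concurrent
# requests: naively we can scale to 100x one partition's capacity, but
# random bursts push the hot partition over well before that.
for scale in (50, 80, 100):
    print(scale, round(error_rate(0.01, 100, scale), 4))
```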

&lt;p&gt;&lt;strong&gt;Zipf Bursts Onto the Scene&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Zipf distribution (and power law distributions more generally, like the &lt;a href=&quot;https://en.wikipedia.org/wiki/Zeta_distribution&quot;&gt;Zeta distribution&lt;/a&gt;) is used as a canonical example of skewed key distributions in database textbooks, papers, and benchmarks. This makes some sense, because some natural things, like text word frequencies, are Zipf distributed. It doesn’t make &lt;em&gt;that much&lt;/em&gt; sense, because it’s not clear how often those Zipf-distributed things are used as database keys anyway. City sizes, maybe.&lt;/p&gt;

&lt;p&gt;That aside, how does the Zipf distribution’s behavior differ from our baby names distribution? Zipf is very aggressive! Wikipedia says that &lt;em&gt;the&lt;/em&gt; is 7% of all words in typical written English. We would expect a Zipf-distributed key space to scale even worse than the baby names. Picking the parameter $s = 0.7$ to &lt;a href=&quot;https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0053227&quot;&gt;represent typical adult English&lt;/a&gt;, and $N = 1000$ to match our database of baby names, we can see how much Zipf distributed data limits our scalability.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/hot_keys_babies_zipf.png&quot; alt=&quot;Chart showing database errors versus scale for names picked from the distribution of baby names and Zipf distributed keys&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The errors take off much earlier here - once we’ve scaled to about 10 nodes - mostly driven by the nodes that own &lt;em&gt;the&lt;/em&gt; and &lt;em&gt;of&lt;/em&gt;. The important point here is that these scalability predictions are very sensitive to the shape of the distribution (especially at the head), and so using Zipf as a canonical skewed access distribution probably won’t reflect the results you’ll get on real data.&lt;/p&gt;

&lt;p&gt;As a systems community, we need to get better at benchmarking with real distributions reflective of real workloads. Parametric approaches like Zipf do have their place, but they are very frequently (one might say &lt;em&gt;too frequently&lt;/em&gt;) used outside that place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zipf at the Limit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clearly, heavily skewed data affects error rates as the hottest nodes overheat. But does it strictly limit scalability? Is there a point where adding more shards will not allow &lt;em&gt;any&lt;/em&gt; more throughput at any cost? There must be a limit when the number of shards exceeds the key cardinality, but what if we assume a Zipf distribution and infinite cardinality? Let’s look at the definition of the Zipf distribution:&lt;/p&gt;

\[f(k; s, N) = \frac{k^{-s}}{\sum_{n=1}^N \frac{1}{n^s}}\]

&lt;p&gt;For $s \leq 1$, the Zipf distribution isn’t well defined with $N = \infty$. But, going back to our approximate analysis above, we can say that scale is roughly proportional to $\frac{1}{f(1; s, N)}$, which means that it is roughly proportional to $\sum_{n=1}^N \frac{1}{n^s}$, which grows rather quickly with $N$. That’s some good news!&lt;/p&gt;

&lt;p&gt;What about the case where $s &amp;gt; 1$? Here, we have a nice, closed-form solution, since:&lt;/p&gt;

\[\sum_{n=1}^\infty \frac{1}{n^s} = \zeta(s)\]

&lt;p&gt;Which is relatively easy to calculate. For example $\zeta(1.1) \approx 10.6$, and $\zeta(2.0) \approx 1.64$. Data that follows these very steep power laws makes very poor keys indeed! Here’s what the limits look like for some larger values of $s$.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/zipf_limit.png&quot; alt=&quot;Chart showing scalability of databases with Zeta distributed keys&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The dashed lines show the ultimate limit: even with an infinitely large key space, if your keys are distributed this way you can’t beat those limits without unbounded error rates.&lt;/p&gt;
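&lt;p&gt;These limits are easy to check numerically. A quick sketch (truncating the infinite sum, which converges quickly for $s$ well above 1 but slowly as $s$ approaches 1):&lt;/p&gt;

```python
def zipf_scale_limit(s, n_keys=1_000_000):
    """Scalability limit ~ 1 / f(1; s, N): the reciprocal of the hottest
    key's traffic share, i.e. the generalized harmonic number H(N, s)."""
    return sum(1.0 / k ** s for k in range(1, n_keys + 1))

# For s > 1 the sum converges to the Riemann zeta function, so even an
# infinite key space can't scale past ~zeta(s) single-shard capacities.
print(round(zipf_scale_limit(2.0), 3))               # close to zeta(2) = pi^2/6
print(round(zipf_scale_limit(0.7, n_keys=1000), 1))  # the N=1000, s=0.7 example
```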

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; And if you are, look out for Sebastian. He’s not who you think he is.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; I feel like this problem has been stalking me my entire career.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; A fair number of sources (including database papers and textbooks) use names as an example of Zipf-distributed (or otherwise powerlaw-distributed) data. Looking at this graph doesn’t seem to support that claim.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>NoSQL: The Baby and the Bathwater</title>
      <link>http://brooker.co.za/blog/2023/01/30/nosql.html</link>
      <pubDate>Mon, 30 Jan 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/01/30/nosql</guid>
      <description>&lt;h1 id=&quot;nosql-the-baby-and-the-bathwater&quot;&gt;NoSQL: The Baby and the Bathwater&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Is this a database?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a bit of an introduction to a long series of posts I’ve been writing about what, fundamentally, it is that makes databases scale. The whole series is going to take me a long time, but hopefully there’s something here folks will enjoy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On March 12 2006, Australia set South Africa the massive target of 434 runs to chase in a one-day international at the Wanderers in Johannesburg. South Africa, in reply, set a record that stands to this day: 438 runs in a successful chase. It’s hard to overstate what an outlier this was. The previous record for a successful run chase was only 332. Despite nearly two decades of &lt;em&gt;bigger is better&lt;/em&gt; scores in cricket, nothing has come close.&lt;/p&gt;

&lt;p&gt;It wasn’t just cricket scores that were getting bigger in the mid 2000s. Databases were too. The growth of the web, especially search and online shopping, was driving systems to higher scales than they had ever seen before. With this trend towards size came a repudiation of the things that had come before. No longer did we want SQL. No, now we wanted NoSQL.&lt;/p&gt;

&lt;p&gt;There are various historical lenses we can apply to the NoSQL movement, from branding (&lt;em&gt;No SQL&lt;/em&gt; or &lt;em&gt;Not Only SQL&lt;/em&gt;), to goals (&lt;em&gt;scalability&lt;/em&gt; vs &lt;em&gt;write availability&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; vs &lt;em&gt;open source&lt;/em&gt;), to operations (should developers or DBAs own the schema? Should DBAs still exist?), but there was clearly a movement&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; with at least some set of common goals. In this blog post I’m going to single-mindedly focus on one aspect of NoSQL: scalability. We’ll look at some of the things the NoSQL movement threw out, and ask ourselves whether those things actually helped achieve better scalability. On the way, we’ll start exploring the laws of scalability physics, and what really matters.&lt;/p&gt;

&lt;p&gt;So what did NoSQL throw out? Again, that varies from database to database, but it was approximately these things:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Explicit schema&lt;/li&gt;
  &lt;li&gt;Transactions&lt;/li&gt;
  &lt;li&gt;Strong consistency&lt;/li&gt;
  &lt;li&gt;Joins, secondary indexes, unique keys, etc.&lt;/li&gt;
  &lt;li&gt;The SQL language itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking through the lens of scalability, let’s consider the effect of these. Which were the dirty bathwater, and which the baby?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Scalability?&lt;/strong&gt;
At the risk of over-simplifying a little bit, scalability has basically two goals:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Increase the throughput of the database across the entire key-space beyond what a single machine can achieve. This is typically done with some variant of &lt;em&gt;sharding&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;Increase the throughput per key beyond what a single machine can achieve. This is typically done with some variant of &lt;em&gt;replication&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/db_basic_arch.png&quot; alt=&quot;Abstract Database Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Replication&lt;/em&gt; simply means that we keep multiple copies of each data item, and do some work at write time to keep those copies up-to-date. In exchange for that work, we get to read from the additional copies, which means we get to trade write-time work for read-time scalability in a useful way.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sharding&lt;/em&gt; simply means that we break the data set up into multiple pieces along some dimension. Unlike with replication, there isn’t a work tradeoff here: for reads and writes that touch just one item, nothing limits the scalability of the database (in theory) until we have as many shards as items.&lt;/p&gt;
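
&lt;p&gt;As a toy illustration of these two tools working together (the names and structure here are mine, not any real database’s design), here’s a sketch in Python: a hash deterministically maps each key to a shard, writes update every replica of that shard, and reads can then be served by any replica:&lt;/p&gt;

```python
import hashlib
import random

class ShardedStore:
    """A toy key-value store combining sharding and replication."""

    def __init__(self, num_shards, replicas_per_shard):
        # Each shard holds several replicas: independent copies of its data.
        self.shards = [
            [dict() for _ in range(replicas_per_shard)]
            for _ in range(num_shards)
        ]

    def _shard_for(self, key):
        # Sharding: a hash deterministically maps each key to one shard.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def write(self, key, value):
        # Replication's write-time work: update every copy in the shard.
        for replica in self._shard_for(key):
            replica[key] = value

    def read(self, key):
        # Replication's read-time payoff: any copy can serve the read.
        return random.choice(self._shard_for(key)).get(key)
```

&lt;p&gt;Real systems replace the plain hash with consistent hashing or range partitioning, and do far more work to keep replicas in sync, but the shape of the tradeoff is the same: extra work on every write buys fan-out on every read.&lt;/p&gt;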

&lt;p&gt;Things that limit scalability are things that restrict our ability to apply these two tools. For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Things that require shards to work together (&lt;em&gt;coordination&lt;/em&gt;) limit the effectiveness of sharding,&lt;/li&gt;
  &lt;li&gt;Things that require readers to coordinate with writers limit the effectiveness of replication,&lt;/li&gt;
  &lt;li&gt;Things that send a lot more work to some shards than others (&lt;em&gt;skew&lt;/em&gt;) limit the effectiveness of sharding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, things that are merely expensive to do (like compression, encryption, schema enforcement, or query parsing) may be very interesting for performance, but not particularly interesting for &lt;em&gt;scalability&lt;/em&gt;. In short, it is coordination that limits scalability. We can use this lens to revisit each of NoSQL’s simplifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit Schema&lt;/strong&gt;
Schema, itself, is mostly a local (&lt;em&gt;item-by-item&lt;/em&gt; or &lt;em&gt;row-by-row&lt;/em&gt;) concern, and therefore discarding it doesn’t do much for scalability. On the other hand, schema brings with it easy access to a set of features (auto-increment, unique keys, etc), and a set of design patterns (normalization and its horsemen), that are extremely relevant to scalability. We’ll get to those when we talk about database features in a little bit. The other common point about schema is an operational scale one: changing schema in the database can be slow and risky, both because of the operation itself&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and because of the complexity of applications that depend on that schema. NoSQL’s movement to application-defined schema was a reaction to this operational reality, largely based on the idea that moving schema into the application would simplify these things. Reports of the success of this approach are mixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transactions&lt;/strong&gt;
Transactions clearly require coordination. You’re asking the database to do &lt;em&gt;this thing&lt;/em&gt; and &lt;em&gt;that thing&lt;/em&gt; at the same time. Atomic commitment, needed for the &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;I&lt;/em&gt; in ACID, is &lt;a href=&quot;https://brooker.co.za/blog/2022/10/04/commitment.html&quot;&gt;particularly difficult to scale&lt;/a&gt;. While there are approaches to reducing the amount of coordination needed&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, dispensing with it entirely is clearly a significant scalability win.&lt;/p&gt;

&lt;p&gt;But, of course, it’s not that simple. First, the cost of transactions depends on the isolation level. For example, serializability requires readers to coordinate with writers&lt;sup&gt;&lt;a href=&quot;#foot10&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; on each key, while snapshot isolation only requires writers to coordinate with writers. Lower levels require even less coordination. Second, whether coordination limits scalability in practice depends a lot on the access patterns. If Alice and Bob are working together, and Barney and Fred are working together, then we may be OK. If sometimes Alice works with Bob, and sometimes with Fred, and Barney works with everybody some of the time, then coordination may be much more expensive.&lt;/p&gt;

&lt;p&gt;As we go through the series we’ll look at this in detail, but for now, it’s true that transactions play a big role in scalability. But the relationship between transactions and scalability is complicated, and it’s not clear that you get a lot of scalability just from throwing out transactions&lt;sup&gt;&lt;a href=&quot;#foot9&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;. Throwing out all transactionality in the name of scalability seems unnecessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong Consistency&lt;/strong&gt;
Like transactions, the scalability of strong consistency is a deep topic that will get its own post in this series. Clearly, relaxing consistency makes some things significantly easier. For example, many systems (like DynamoDB) implement read scale-out fairly simply by allowing readers to read from replicas, without ensuring those replicas are up-to-date. This is clearly a nice simplification, but it’s not clear that it is strictly required to achieve the same level of scalability&lt;sup&gt;&lt;a href=&quot;#foot7&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Strong consistency may be something NoSQL didn’t need to throw out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Joins, Secondary Indexes, Unique Keys, etc&lt;/strong&gt;
This is a bit of a grab bag, with at least three different categories:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Joins and friends, which introduce read skew, and therefore make it more difficult to scale out the read path of the database through sharding. Read skew is related to the laws of physics in an interesting way, because it doesn’t matter much in isolation, but matters a great deal in the presence of writes. Read skew aside, it’s not clear that joins affect scalability much at all.&lt;/li&gt;
  &lt;li&gt;Secondary indexes and friends, which introduce potential write skew (driven by index cardinality). Unlike read skew, write skew is a problem in itself, because we can’t throw replication at it in the same way.&lt;/li&gt;
  &lt;li&gt;Unique keys and friends (including sequences, auto-increment, etc) which are inherently scalability killers, because they require coordination and create write skew. Auto-increment and friends are potentially the worst case here, because they may force the database to serialize all writes through a single sequencer&lt;sup&gt;&lt;a href=&quot;#foot8&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;. Relaxing semantics may help, but that’s always true.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The SQL Language Itself&lt;/strong&gt;
This is the controversial one. There are two ways to look at this. One is that the language has no influence at all on scalability, because it’s just a way of expressing some work for the database to do. The other is that SQL’s semantics (such as ACID) and features (such as secondary indexes) mean that the scalable subset of SQL is small and hard to find, and so throwing out SQL is a win in itself.&lt;/p&gt;

&lt;p&gt;This is an interesting argument. SQL is not just a language, but a set of semantics and features and expectations and historical behaviors, all rolled into a ball. If you throw out SQL, then you can throw out all of those things, and package scalable semantics and features together in a new API. This baby has been in the bath a very long time, and it’s no longer clear where one ends and the other begins.&lt;/p&gt;

&lt;p&gt;The other argument to be made here is that SQL, as this declarative high-level language, makes it very easy for a programmer to ask the database to do expensive things (like coordination) in a way that may not be obvious. Lower-level APIs (like, as I’ve argued before, &lt;a href=&quot;https://brooker.co.za/blog/2022/01/19/predictability.html&quot;&gt;DynamoDB’s&lt;/a&gt;) make it much easier for the programmer to reason about what they are asking the database to do, and therefore reason about scalability and predictability. Alternatively, lower-level APIs force programmers to understand things about the system that may be hard to hide. To avoid falling into the classic &lt;em&gt;are abstractions good?&lt;/em&gt; question, I’ll simply point out that this is a key issue that I expect we’ll be grappling with forever.&lt;/p&gt;

&lt;p&gt;But mostly, SQL is a distraction. It’s the least important thing about NoSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;
NoSQL, as fuzzy as it is, is a perfect example of the pendulum of technical trends. Even if we look at it just through the very limited lens of scalability, it’s clear that the movement identified some very real issues, and then overreacted to them. At least some of this overreaction was for good reasons: the &lt;em&gt;best of both&lt;/em&gt; approaches are complex and difficult to build, and so overreacting helped create a lot of systems that could solve real issues without solving those hard problems. That’s a good thing. On the other hand, at least some of it was because of a misunderstanding of what drives scalability. It’s hard, without being in people’s heads, to know which is which, but we can know better now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; In this post, I separate the NoSQL movement for transactional applications from the movement away from (only) SQL for analytics applications, perhaps most famously MapReduce and friends. My focus here is on transactional applications (OLTP), while being clear that there isn’t a bright line between these things.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Write availability is a key concern in Werner Vogels’ 2009 article &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/1435417.1435432&quot;&gt;Eventually Consistent&lt;/a&gt;, which is a great summary of the state of the argument there. Availability is extremely important, but I’m not focusing there because I think it’s a topic that’s been very well covered already.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Who among us hasn’t enjoyed sweaty minutes or hours waiting for that ALTER TABLE to complete, while praying it doesn’t break replication?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Of course, for both replication and sharding there are a million different ways to (&lt;em&gt;ahem&lt;/em&gt;) feed a cat. I’m not getting into those here, because I don’t think they matter a lot to the underlying dynamics. If you’re interested in the variants, check out Ziegler et al &lt;a href=&quot;https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf&quot;&gt;Is Scalable OLTP in the Cloud a Solved Problem?&lt;/a&gt;, or the &lt;a href=&quot;http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf&quot;&gt;Calvin paper&lt;/a&gt;, or the &lt;a href=&quot;https://irenezhang.net/papers/tapir-sosp15.pdf&quot;&gt;Tapir paper&lt;/a&gt;, or the &lt;a href=&quot;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo paper&lt;/a&gt;, or the &lt;a href=&quot;https://www.usenix.org/conference/atc22/presentation/elhemali&quot;&gt;DynamoDB paper&lt;/a&gt;, or the Spanner paper, etc. etc.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Replication is also important for durability, availability, and other important things.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; For example, see Bailis et al’s &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;work&lt;/a&gt; on &lt;a href=&quot;https://arxiv.org/pdf/1302.0309.pdf&quot;&gt;Highly Available Transactions&lt;/a&gt;, Wu et al’s work on &lt;a href=&quot;https://dsf.berkeley.edu/jmh/papers/anna_ieee18.pdf&quot;&gt;Anna&lt;/a&gt;, and &lt;a href=&quot;http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf&quot;&gt;Calvin&lt;/a&gt; and the extended deterministic database universe, for different looks at the nature of the coordination needed for transactions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot7&quot;&gt;&lt;/a&gt; It is strictly required in the asynchronous model, but we don’t live in the asynchronous model.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot8&quot;&gt;&lt;/a&gt; &lt;em&gt;Coordinate ALL THE THINGS!&lt;/em&gt; That’s a current reference the kids will get, right?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot9&quot;&gt;&lt;/a&gt; I like how Doug Terry makes this point in &lt;a href=&quot;https://www.usenix.org/conference/fast19/presentation/terry&quot;&gt;Transactions and Scalability in Cloud Databases—Can’t We Have Both?&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot10&quot;&gt;&lt;/a&gt; Specifically for read-write transactions, that is. Read-only readers can be serializable with no coordination with writers or other readers, provided a reasonable set of assumptions and constraints.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Erasure Coding versus Tail Latency</title>
      <link>http://brooker.co.za/blog/2023/01/06/erasure.html</link>
      <pubDate>Fri, 06 Jan 2023 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2023/01/06/erasure</guid>
      <description>&lt;h1 id=&quot;erasure-coding-versus-tail-latency&quot;&gt;Erasure Coding versus Tail Latency&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;There are zero FEC puns in this post, against my better judgement.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Jeff Dean and Luiz Barroso’s paper &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/2408776.2408794&quot;&gt;The Tail At Scale&lt;/a&gt; popularized an idea they called &lt;em&gt;hedging&lt;/em&gt;, simply sending the same request to multiple places and using the first one to return. That can be done immediately, or after some delay:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Smart stuff, and it has become something of a standard technique for systems where tail latencies are a high multiple of the 50th or 95th percentile latencies&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. The downside of hedged requests is that they’re all-or-nothing: you either send the request twice, or once. They’re also modal, and don’t do much to help against lower percentile latencies. There’s a more general technique available that has many of the benefits of hedged requests, with improved flexibility and power to reduce even lower percentile latencies: erasure coding.&lt;/p&gt;
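
&lt;p&gt;As a simplified sketch of that deferred hedging idea (the &lt;code&gt;fetch&lt;/code&gt; function and replica list here are stand-ins, and a real implementation would also cap and cancel the extra requests):&lt;/p&gt;

```python
import concurrent.futures

def hedged_get(fetch, replicas, hedge_after_s):
    """Deferred hedging: query one replica, and only if it hasn't answered
    within hedge_after_s (e.g. the p95 latency) query a second replica,
    taking whichever response arrives first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(fetch, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_after_s)
    if not done:
        # The first request is slow: hedge to a second replica.
        futures.append(pool.submit(fetch, replicas[1]))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
    result = done.pop().result()
    pool.shutdown(wait=False)  # don't block on the straggler
    return result
```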

&lt;p&gt;&lt;strong&gt;What is Erasure Coding?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Erasure coding is the idea that we can take a blob of data and break it up into &lt;em&gt;M&lt;/em&gt; parts, in such a way that we can reconstruct it from any &lt;em&gt;k&lt;/em&gt; of those &lt;em&gt;M&lt;/em&gt; parts&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Erasure codes are pretty ubiquitous in storage systems: block storage, object storage, higher RAID levels, and so on. When storage systems think about erasure codes, they’re usually thinking about durability: the ability of the system to tolerate $M - k$ disk or host failures without losing data, while still having only $\frac{M - k}{M}$ storage overhead. The general idea is also widely used in modern communication and radio protocols&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. If you’re touching your phone or the cloud, there are erasure codes at work.&lt;/p&gt;

&lt;p&gt;I won’t go into the mathematics of erasure coding here, but will say that it is somewhat remarkable to me that they exist for any combination of &lt;em&gt;k&lt;/em&gt; and &lt;em&gt;M&lt;/em&gt; (obviously for $k \leq M$). It’s one of those things that’s both profound and obvious, at least to me.&lt;/p&gt;
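
&lt;p&gt;For a feel of how this works, here’s a toy sketch of the simplest MDS code: &lt;em&gt;k&lt;/em&gt; data pieces plus a single XOR parity piece, so $M = k + 1$ and any &lt;em&gt;k&lt;/em&gt; pieces suffice. (Real systems use Reed-Solomon codes for general &lt;em&gt;k&lt;/em&gt; and &lt;em&gt;M&lt;/em&gt;; this is purely an illustration.)&lt;/p&gt;

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal pieces plus one XOR parity piece,
    so M = k + 1 and any k pieces suffice to rebuild the blob."""
    piece_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(piece_len * k, b"\x00")
    pieces = [padded[i * piece_len:(i + 1) * piece_len] for i in range(k)]
    return pieces + [reduce(xor_bytes, pieces)]

def decode(pieces, k, original_len):
    """Rebuild the blob from the pieces; None marks a lost piece."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    assert len(missing) in (0, 1), "XOR parity tolerates a single loss"
    if missing and missing[0] != k:
        # A data piece is lost: it is the XOR of all surviving pieces.
        survivors = [p for p in pieces if p is not None]
        pieces[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(pieces[:k])[:original_len]
```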

&lt;p&gt;&lt;strong&gt;Erasure Coding for Latency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Say I have an in-memory cache of objects. I can keep any object in the cache once, and always go looking for it in that one place (e.g. with consistent hashing). If that place is slow, overloaded, experiencing packet loss, or whatever, I’ll see high latency for all attempts to get that object. With hedging I can avoid that, if I store the object in two places rather than one, at the cost of doubling the size of my cache.&lt;/p&gt;

&lt;p&gt;But what if I wanted to avoid the slowness and not double the size of my cache? Instead of storing everything twice, I could break it into (for example) 5 pieces ($M = 5$) encoded in such a way that I could reassemble it from any four pieces ($k = 4$). Then, when I fetch, I send five get requests, and have the whole object as soon as four have returned. The overhead here on requests is 5x, on bandwidth is worst-case 20%, and on storage is 20%. The effect on tail latency can be considerable.&lt;/p&gt;
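
&lt;p&gt;The fetch side of such a scheme can be sketched like this (illustrative Python, with &lt;code&gt;fetch_piece&lt;/code&gt; standing in for the real get request): issue all &lt;em&gt;M&lt;/em&gt; requests in parallel, and return as soon as any &lt;em&gt;k&lt;/em&gt; responses arrive:&lt;/p&gt;

```python
import concurrent.futures

def fetch_k_of_m(fetch_piece, piece_ids, k):
    """Request all M pieces in parallel, returning as soon as any k
    arrive; the slowest M - k responses never sit on the critical path."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(piece_ids))
    futures = {pool.submit(fetch_piece, p): p for p in piece_ids}
    results = {}
    for future in concurrent.futures.as_completed(futures):
        results[futures[future]] = future.result()
        if len(results) == k:
            break  # enough pieces to reconstruct; ignore stragglers
    pool.shutdown(wait=False)
    return results
```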

&lt;p&gt;The graph below, from an upcoming paper where we describe the storage system we built for Lambda container support and SnapStart, shows the measured latency of a 4-of-5 erasure coding scheme, versus the latency we would have experienced from simply fetching 4-of-4 shards in parallel.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/ec_latency.png&quot; alt=&quot;Graph showing latency impact of Erasure Coding&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The huge effect on the tail here is obvious. What’s also worth paying attention to is that, unlike hedging-style schemes, there’s also a considerable win here at the lower percentiles like the median. In fact, the erasure coding scheme drops the median latency by about 20%. Similar wins are reported by Rashmi et al in &lt;a href=&quot;https://www.usenix.org/system/files/conference/osdi16/osdi16-rashmi.pdf&quot;&gt;EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding&lt;/a&gt;, and many others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Erasure Coding for Availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Erasure coding in this context doesn’t only help with latency. It can also significantly improve availability. Let’s think about that cache again: when that one node is down, overloaded, busy being deployed, and so on, that object is not available. This property can make operating high hit-rate caches and storage systems particularly difficult: any kind of deployment or change can look to clients like a kind of rolling outage. However, with erasure coding, single failures (or indeed any $M - k$ number of failures) have no availability impact.&lt;/p&gt;

&lt;p&gt;How big this effect is depends on &lt;em&gt;M&lt;/em&gt; and &lt;em&gt;k&lt;/em&gt;. In the graph below, we look at the availability impact of various combinations of &lt;em&gt;M&lt;/em&gt; and &lt;em&gt;k&lt;/em&gt;, assuming that each component offers 99.99% (four nines) availability (and assuming independent failures).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/ec_availability.png&quot; alt=&quot;Graph showing availability impact of Erasure Coding&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These are straight lines on a logarithmic axis, meaning (at least with these numbers) we get exponential improvement at linear cost as we increase &lt;em&gt;k&lt;/em&gt;! How often do you get a deal like that?&lt;/p&gt;
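
&lt;p&gt;These curves are easy to reproduce: assuming independent failures, availability is the binomial probability that at least &lt;em&gt;k&lt;/em&gt; of the &lt;em&gt;M&lt;/em&gt; pieces are reachable. A sketch:&lt;/p&gt;

```python
from math import comb

def coded_availability(m, k, node_availability):
    """P(at least k of m pieces reachable), assuming independent failures:
    the sum of the binomial terms for k, k+1, ..., m pieces being up."""
    a = node_availability
    return sum(
        comb(m, up) * a**up * (1 - a)**(m - up)
        for up in range(k, m + 1)
    )
```

&lt;p&gt;With four-nines pieces, &lt;code&gt;coded_availability(5, 4, 0.9999)&lt;/code&gt; comes out around seven nines: single failures no longer matter, so only the much rarer double failures hurt.&lt;/p&gt;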

&lt;p&gt;&lt;strong&gt;Erasure Coding for Load and Spread&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The flexibility of erasure coding also gives us a lot of control in how we spread out load, and avoid overheating particular nodes. For example, imagine we were using a scheme with $k = 2$ and $M = 10$. The write cost is rather high (10 writes), the storage cost is rather high (5x), but we have a huge amount of flexibility about where we send traffic. A request could ask any 2 or more of the 10 nodes and still get the entire answer, which means that a load-balancer could be very effective at spreading out load. Simple replication (aka $k = 1$) has the same effect, but isn’t quite as flexible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Erasure Coding is a ubiquitous technique in storage and physical networking, but something of an under-rated and under-used one in systems more generally. Next time you build a latency-sensitive or availability-sensitive cache or storage system, it’s worth considering. There are many production systems that do this already, but it doesn’t seem to be as widely talked about as it deserves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Although their argument for why it doesn’t add additional load tends to break down when we think about failure cases, and especially when we apply the lens of metastable failures. In the paper, they say:
    &lt;blockquote&gt;
      &lt;p&gt;Although naive implementations of this technique typically add unacceptable additional load, many variations exist that give most of the latency-reduction effects while increasing load only modestly. One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests.&lt;/p&gt;
    &lt;/blockquote&gt;
    &lt;p&gt;This will, in the normal case, limit the number of hedged requests to around 5%. But what if there’s a correlated latency increase across the system (caused by traffic, infrastructure failures, or an empty cache, for example) which raises all request latencies to the &lt;em&gt;95th-percentile expected latency&lt;/em&gt;? At least until you update your expectation, you’ve doubled your traffic, and perhaps added more cancellation traffic. For this reason, this technique should always be combined with an approach like a token bucket that limits the additional requests to what you’d expect (say 5%). The same token bucket &lt;a href=&quot;https://brooker.co.za/blog/2022/02/28/retries.html&quot;&gt;we use for adaptive retries works well here&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; When I say “any &lt;em&gt;k&lt;/em&gt;” I’m referring to one class of codes, the &lt;a href=&quot;https://www.johndcook.com/blog/2020/03/07/mds-codes/&quot;&gt;maximum distance separable codes&lt;/a&gt;. There are whole families of non-MDS code that have other interesting properties, but only allow reconstruction from some combinations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; The ability of Forward Error Correction (FEC) to lift useful data out of the noise floor is truly remarkable.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Under My Thumb: Insight Behind the Rules</title>
      <link>http://brooker.co.za/blog/2022/12/15/thumb.html</link>
      <pubDate>Thu, 15 Dec 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/12/15/thumb</guid>
      <description>&lt;h1 id=&quot;under-my-thumb-insight-behind-the-rules&quot;&gt;Under My Thumb: Insight Behind the Rules&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;My left thumb is exactly 25.4mm wide.&lt;/p&gt;

&lt;p&gt;Starting off in a new field, you hear a lot of &lt;em&gt;rules of thumb&lt;/em&gt;. Rules for estimating things, thinking about things, and (ideally) simplifying tough decisions. When I started in Radar, I heard:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;the transmitter makes up three quarters of the cost of a radar system&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and when I started building computer systems, I heard a lot of things like:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;hardware is free, developers are expensive&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and, the ubiquitous:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;premature optimization is the root of all evil.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;None of these things are true. Some are less true than others&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Mostly, they’re so context dependent that stripping them of their context renders them meaningless. On the other hand, heuristics like this can be exceptionally valuable, saving us time reasoning things through from first principles, and allowing rapid exploration of a design space. Can we make these truisms more true, and more useful, by turning them into frameworks for quantitative thinking?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 5 Minute Rule&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Jim Gray’s &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/38713.38755&quot;&gt;famous 5 minute rule&lt;/a&gt;, from 1987:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Data referenced every five minutes should be memory resident.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Today, thirty five years later, Gray’s five minute rule is just as misleading as the ones above&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. What we’re left with isn’t a rule, but a powerful and durable insight. Gray and Putzolu’s observation was that we can calculate the cost of something (storing a page of data in memory) and the cost of replacing that thing (reading the data from storage), and &lt;em&gt;quantitatively estimate&lt;/em&gt; how long we should keep the thing.&lt;/p&gt;

&lt;p&gt;They did it like this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The derivation of the five minute rule goes as follows: A disc, and half a controller comfortably deliver 15 random accesses per second and are priced at about 15K$ So the price per disc
access per second is about 1K$/a/s. The extra CPU and channel cost for supporting a disc is 1K$/a/s. So one disc access per second costs about 2K$/a/s.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;A megabyte of main memory costs about 5K$, so a kilobyte costs 5$. If making a 1Kb data record main-memory resident saves 1a/s, then it saves about 2K$ worth of disc accesses at a cost of 5$, a good deal. If it saves .1a/s then it saves about 200$, still a good deal. Continuing this, the break-even point is one access every 2000/5 = 400 seconds. So, any 1KB record accessed more frequently than every 400 seconds should live in main memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;$5000 a Megabyte! Wow! But despite the straight-from-the-80s memory pricing, the insight here is durable. We can plug our storage costs, memory costs, and access costs into the story problem and get some real insight into the problems of today.&lt;/p&gt;
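
&lt;p&gt;The story problem is easy to re-run with your own constants. A sketch:&lt;/p&gt;

```python
def break_even_interval_s(dollars_per_access_per_s, dollars_to_cache_item):
    """Gray and Putzolu's break-even point: keep an item in memory if it
    is accessed more often than once per this many seconds."""
    return dollars_per_access_per_s / dollars_to_cache_item

# Gray's 1987 constants: ~$2000 per disc access/second, ~$5 per KB of RAM.
interval = break_even_interval_s(2000, 5)  # 400 seconds
```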

&lt;p&gt;&lt;strong&gt;Hardware is Free?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s go back to&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;hardware is free, developers are expensive.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Can we make that more quantitative?&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm&quot;&gt;Bureau of Labor Statistics says&lt;/a&gt; that the median US software developer earns $52.41 an hour. A Graviton core in EC2, as of today, costs around $0.04 an hour. So it’s worth spending an hour of developer time to save anything more than around 1300 core hours. That’s about two months, so we can write a better rule:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It’s worth spending an hour of developer time to save two core months.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Just as with Gray and Putzolu’s rule, this one is highly sensitive to your constants (developer pay, core cost, overheads, etc). But the quantitative method is durable, as is the idea that we can quickly &lt;em&gt;quantitatively estimate&lt;/em&gt; things like this. That idea is much more powerful than rules of thumb.&lt;/p&gt;
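
&lt;p&gt;For completeness, the arithmetic behind that rule, parameterized so you can plug in your own wage and core prices:&lt;/p&gt;

```python
def break_even_core_hours(dev_dollars_per_hour, core_dollars_per_hour):
    """Core-hours an hour of developer time must save to pay for itself."""
    return dev_dollars_per_hour / core_dollars_per_hour

# The BLS wage and per-core price quoted above (illustrative; they drift).
hours = break_even_core_hours(52.41, 0.04)  # about 1300 core-hours
months = hours / (24 * 30)                  # roughly two core-months
```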

&lt;p&gt;&lt;strong&gt;The Irredeemable?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some rules, on the other hand, are stubbornly difficult to turn into quantitative tools. Take the Jevons Paradox, for example&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;in the long term, an increase in efficiency in resource use will generate an increase in resource consumption rather than a decrease.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’ve spent any time at all online, you’ll have run across folks using the Jevons Paradox as if it were some immutable law of the universe to dismiss any type of conservation or economic effort. If we’re battling with truisms, I prefer &lt;a href=&quot;https://twitter.com/zeynep/status/1478766408691556353?lang=en&quot;&gt;Zeynep’s Law&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Zeynep’s law: Until there is substantial and repeated evidence otherwise, assume counterintuitive findings to be false, and second-order effects to be dwarfed by first-order ones in magnitude.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both of these truisms seem true. They get us nodding our heads, and may even get us thinking. Unfortunately their use is limited by the fact that they don’t provide us with any tools for thinking about when they are valid, and extending them to meet our own context. From a quantitative perspective, they may be irredeemable. Not useless, just limited in power.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Engineering is, and software engineering sometimes stubbornly is not, a quantitative and economic discipline. I think we’d do well to emphasize the quantitative and economic side of our field. In the words of Arthur M. Wellington:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It would be well if engineering were less generally thought of, and even defined, as the art of constructing. In a certain important sense it is rather the art of not constructing; or, to define it rudely but not inaptly, it is the art of doing that well with one dollar, which any bungler can do with two after a fashion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Or maybe it isn’t. In &lt;a href=&quot;https://infoscience.epfl.ch/record/230398/files/adms-talk.pdf&quot;&gt;The five-minute rule thirty years later&lt;/a&gt; from 2017, Appuswamy et al find that for the combination of DRAM and SATA SSD there’s around a 7 minute rule. That’s very close! On the other hand, SSD performance has changed so much since 2017 that the rule is probably broken again.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; And then there are the universally constant ones, like π=3 and g=10, which don’t change, but whether they are right or not is very dependent on context. Except π, which is always 3.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Definition from &lt;a href=&quot;https://www.frontiersin.org/articles/10.3389/fenrg.2018.00026/full&quot;&gt;Unraveling the Complexity of the Jevons Paradox: The Link Between Innovation, Efficiency, and Sustainability&lt;/a&gt;, which is worth reading.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Lambda Snapstart, and snapshots as a tool for system builders</title>
      <link>http://brooker.co.za/blog/2022/11/29/snapstart.html</link>
      <pubDate>Tue, 29 Nov 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/11/29/snapstart</guid>
      <description>&lt;h1 id=&quot;lambda-snapstart-and-snapshots-as-a-tool-for-system-builders&quot;&gt;Lambda Snapstart, and snapshots as a tool for system builders&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Clones.&lt;/p&gt;

&lt;p&gt;Yesterday, AWS announced &lt;a href=&quot;https://aws.amazon.com/blogs/aws/new-accelerate-your-lambda-functions-with-lambda-snapstart/&quot;&gt;Lambda Snapstart&lt;/a&gt;, which uses VM snapshots to reduce cold start times for Lambda functions that need to do a lot of work on start (starting up a language runtime&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, loading classes, running &lt;em&gt;static&lt;/em&gt; code, initializing caches, etc). Here’s a short 1 minute video about it:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/AgxvrZLI1mc&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Or, for a lot more context on Lambda and how we got here&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/R11YgBEZzqE?start=3275&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Snapstart is a super useful capability for Lambda customers. I’m extremely proud of the work the team did to make it a reality. We’ve been talking about this work for &lt;a href=&quot;https://www.youtube.com/watch?v=ADOfX2LiEns&quot;&gt;over two years&lt;/a&gt;, and working on it for longer. The team did some truly excellent engineering work to make Snapstart a reality.&lt;/p&gt;

&lt;p&gt;Beyond Snapstart in Lambda, I’m also excited about the underlying technology (microVM snapshots), and the way they give us (as system builders and researchers) a powerful tool for building new kinds of secure and scalable systems. In this post, I talk about some interesting aspects of Snapstart, and how they point to interesting possible areas for research on systems, storage, and even cryptography.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Snapstart?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without Snapstart, the &lt;em&gt;cold start time&lt;/em&gt; of a Lambda function is a combination of the time taken to download the function (or container), to start the language runtime&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and to run any initialization code inside the function (including any &lt;em&gt;static&lt;/em&gt; code, doing class loading, and even JIT compilation). The cold start time doesn’t include MicroVM boot, because that time is not specialized to the application, and can be done in advance. A cold start looks like this&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/serverless_cold_start.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With Snapstart, a snapshot of the microVM is taken after these initialization steps (including all the memory, device state, and CPU registers). This snapshot can be used multiple times (also called &lt;em&gt;cloned&lt;/em&gt;), to create multiple sandboxes that are initialized in the same way. Cloning like this has three benefits:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The work of initialization is amortized over many sandboxes, rather than being done once for every sandbox. Lambda runs one sandbox for every concurrent function execution, so for a function with N concurrent executions this reduces initialization work from O(N) to O(1).&lt;/li&gt;
  &lt;li&gt;The time taken by initialization no longer accrues to the cold-start time, significantly reducing cold start latency for applications that do a lot of work on startup (which is typically applications that use large language runtimes like the JVM, or large frameworks and libraries).&lt;/li&gt;
  &lt;li&gt;JIT compilation and other &lt;em&gt;warmup&lt;/em&gt; tasks can be done at initialization time, often avoiding the uneven performance that can be introduced by JITing soon after language runtime&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; startup.&lt;/li&gt;
&lt;/ul&gt;
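&lt;p&gt;As a rough sketch (the numbers below are made up for illustration, not Lambda measurements), the amortization works out like this:&lt;/p&gt;

```python
def total_init_cost(n_sandboxes, init_ms, restore_ms, snapstart):
    """Toy model of total initialization work across N sandboxes.

    Without snapshots, every sandbox pays the full initialization
    cost; with them, initialization happens once and each sandbox
    pays only the (much smaller) snapshot-restore cost.
    """
    if snapstart:
        return init_ms + n_sandboxes * restore_ms
    return n_sandboxes * init_ms
```

&lt;p&gt;For a hypothetical function with a 5 second initialization and a 10ms restore, 100 concurrent executions cost 500 seconds of initialization work without snapshots, and 6 seconds with them.&lt;/p&gt;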

&lt;p&gt;In diagram form, the Snapstart startup regime looks like this&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/snapstart.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problems of Uniqueness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Perhaps the biggest challenge with clones is that they’re, well, clones. They contain the same memory, the same device state, and even the same CPU registers. &lt;a href=&quot;https://www.youtube.com/watch?v=nuW0pUG3PrQ&quot;&gt;Being too alike can cause problems&lt;/a&gt;. For example, as we say in &lt;a href=&quot;https://arxiv.org/pdf/2102.12892.pdf&quot;&gt;Restoring Uniqueness in MicroVM Snapshots&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Most modern distributed systems depend on the ability of nodes to generate unique or random values. RFC4122 version 4 UUIDs are widely used as unique database keys, and as request IDs used for log correlation and distributed tracing. …&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Many common distributed systems protocols, including consensus protocols like Paxos, and ordering protocols like vector clocks, rely on the fact that participants can uniquely identify themselves.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Cryptography is the most critical application of unique data. Any predictability in the data used to generate cryptographic keys—whether long-lived keys for applications like storage encryption or ephemeral keys for protocols like TLS—fundamentally compromises the confidentiality and authentication properties offered by cryptography.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, it’s really important for systems to be able to generate random numbers, and MicroVM clones might find that difficult to do. If they rely on hardware features like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rdrand&lt;/code&gt; then there’s no problem, but any software-based PRNGs will simply create the same stream of numbers unless action is taken. To solve this problem, our team has been working with the OpenSSL, Linux, and Java open source communities to make sure that common PRNGs like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java.security.SecureRandom&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/urandom&lt;/code&gt; reseed correctly when snapshots are cloned.&lt;/p&gt;
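&lt;p&gt;A tiny Python sketch (with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;copy.deepcopy&lt;/code&gt; standing in for VM snapshot cloning, so this is an analogy, not how snapshots actually work) shows the failure mode and the reseed-on-restore fix:&lt;/p&gt;

```python
import copy
import os
import random

# A guest that seeded its PRNG during initialization, then got
# snapshotted. deepcopy stands in for cloning the VM snapshot.
parent = random.Random(42)
clone_a = copy.deepcopy(parent)
clone_b = copy.deepcopy(parent)

# Both clones emit the identical "random" stream: bad for UUIDs,
# request IDs, and anything cryptographic.
assert [clone_a.random() for _ in range(5)] == \
       [clone_b.random() for _ in range(5)]

# Reseeding each clone with fresh entropy on restore (analogous to
# what reseed-on-clone support does for real PRNGs) restores
# uniqueness.
clone_a.seed(os.urandom(16))
clone_b.seed(os.urandom(16))
```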

&lt;p&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another challenge when working with clones is connection state in protocols like TCP. There are actually two problems here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Time&lt;/em&gt;. If a connection is established during initialization and the clone is used later, it’s likely that the remote end has given up on the connection.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;State&lt;/em&gt;. Protocols like TCP provide reliable delivery using state at each end of a connection (like a sequence number), with the assumption that there is one client for the lifetime of the connection. If that one client suddenly becomes two clients, the protocol is broken and the connection must be dropped.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simple solution is to reestablish connections after snapshots are restored. As with reseeding PRNGs, this requires time and work, especially for secure protocols like TLS, which somewhat dilutes the cold start benefit. There’s a significant research and development opportunity here, focusing on fast reestablishment of secure protocols, clone-aware protocols, clone-aware proxies, and even deeply protocol-aware session managers (like RDS proxy)&lt;sup&gt;&lt;a href=&quot;#foot7&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
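&lt;p&gt;One illustrative shape for a clone-aware client (a sketch of the idea, not how Lambda or any real proxy implements it): tag cached connections with a restore-generation marker, and drop them all when the marker changes. Here the generation source is injected as a callable, standing in for something like a vmgenid-style device.&lt;/p&gt;

```python
class CloneAwareConnectionPool:
    """Sketch of a pool that refuses to reuse pre-snapshot TCP
    connections. read_generation is assumed to return a value that
    changes on every snapshot restore; connect is the caller's
    function for opening a fresh connection."""

    def __init__(self, read_generation):
        self._read_generation = read_generation
        self._generation = read_generation()
        self._conns = {}

    def get(self, addr, connect):
        current = self._read_generation()
        if current != self._generation:
            # We were restored from a snapshot: our cached connections
            # share TCP state with sibling clones, so drop them all
            # and reconnect from scratch.
            self._conns.clear()
            self._generation = current
        if addr not in self._conns:
            self._conns[addr] = connect(addr)
        return self._conns[addr]
```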

&lt;p&gt;&lt;strong&gt;Moving Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s take another look at this Snapstart diagram:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/snapstart.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If the clones are running on the same machine the snapshot was created on, then this diagram is reasonably accurate. However, if we want to scale snapshot restores out over a system the size of AWS Lambda, then we need to make sure the snapshot data is available where and when it is needed. This hidden work - both in distributing data when it’s needed and in sending restores to the right places to meet the data - was the largest challenge of building Snapstart. Turning memory reads into network storage accesses, as would happen with a naive demand-loading system, would very quickly cancel out the latency benefits of Snapstart.&lt;/p&gt;

&lt;p&gt;We’re going to say more about our particular solution to this problem in the near future, but I believe that there are interesting general challenges here for systems and storage researchers. Loading memory on demand can be done if the data layer offers low-enough latency, close enough to the latency of a local memory read. We can also avoid loading memory on demand by predicting memory access patterns, loading memory contents ahead of when they are needed. This seems hard to do in general, but is significantly simplified by the ability to learn from the behavior of multiple MicroVM clones.&lt;/p&gt;
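&lt;p&gt;The &lt;em&gt;learn from the behavior of multiple clones&lt;/em&gt; idea can be sketched very simply: pages that earlier clones of the same snapshot touched first are good candidates to push to a new clone before it asks for them. (This is a toy model of the general approach, not our solution; a real system would work with traces of millions of pages.)&lt;/p&gt;

```python
from collections import Counter

class AccessPatternPrefetcher:
    """Toy prefetcher: count which pages earlier clones of a snapshot
    touched earliest, and eagerly fetch the most common ones for new
    clones, so their reads hit locally instead of paying a network
    storage round trip on demand."""

    def __init__(self, prefetch_budget):
        self.budget = prefetch_budget
        self.early_counts = Counter()

    def record(self, page_trace):
        # Remember the pages each finished clone touched first.
        for page in page_trace[: self.budget]:
            self.early_counts[page] += 1

    def prefetch_set(self):
        # Pages to push to the next clone before it starts running.
        return {p for p, _ in self.early_counts.most_common(self.budget)}
```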

&lt;p&gt;&lt;strong&gt;Layers Upon Layers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are multiple places where we can snapshot a MicroVM: just after it’s booted, after a language runtime has been started, or after customer code initialization. In fact, we don’t have to choose just one of these. Instead, we can snapshot after boot, restore that snapshot to start the language runtime, and store just the changes. Then restore that snapshot, run customer initialization, and store just those changes. This has the big benefit of creating a tree of snapshots, where the same Linux kernel or runtime chunks don’t need to be downloaded again and again.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/snapstart_tree.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Unlike traditional approaches to memory deduplication (like &lt;a href=&quot;https://docs.kernel.org/admin-guide/mm/ksm.html&quot;&gt;Kernel Samepage Merging&lt;/a&gt;) there’s no memory vs. CPU tradeoff here: identical pages are shared based on their provenance rather than on post-hoc scanning&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. This significantly reduces the scope of the data movement challenge, by requiring widely-used components to be downloaded infrequently, reducing data movement by as much as 90% for common workloads.&lt;/p&gt;
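&lt;p&gt;The layering described above can be sketched as a chain of copy-on-write deltas, where a page read walks up the tree to the nearest layer that wrote that page (a big simplification of a real snapshot format, but it shows why the kernel and runtime layers are shared by every function built on them):&lt;/p&gt;

```python
class LayeredSnapshot:
    """Toy model of incremental snapshots: each layer stores only the
    pages it changed relative to its parent, so common base layers
    are stored and downloaded once, whatever builds on top of them."""

    def __init__(self, parent=None):
        self.parent = parent
        self.delta = {}  # page number to page contents

    def write(self, page, contents):
        self.delta[page] = contents

    def read(self, page):
        # Walk up the tree to the nearest layer that wrote this page.
        layer = self
        while layer is not None:
            if page in layer.delta:
                return layer.delta[page]
            layer = layer.parent
        return b"\x00"  # pages never written read as zero
```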

&lt;p&gt;Incremental snapshotting also allows us to use separate cryptographic keys for different levels of data, including service keys for common components, and per-customer and customer-controlled keys for customer data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hungry, Hungry Hippos&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Linux, &lt;a href=&quot;https://www.linuxatemyram.com/&quot;&gt;rather famously&lt;/a&gt;, loves to eat all the memory it can lay its hands on. In the traditional single-system setting this is the right thing to do: the marginal cost of making an empty memory page full is very nearly zero, so there’s no need to think much about the marginal benefit of keeping around a disk-backed page of memory (whether it’s an mmap mapping, or an IO buffer, or whatever). However, in cloud settings like Lambda Snapstart, this calculus is significantly different: keeping around disk-backed pages that are unlikely to be used again makes snapshots bigger, with little benefit. The same applies to caches at all layers, whether they’re in the language runtime, in application code, or in libraries.&lt;/p&gt;

&lt;p&gt;Tools like &lt;a href=&quot;https://www.kernel.org/doc/html/v5.17/vm/damon/index.html&quot;&gt;DAMON&lt;/a&gt; provide a good ability to monitor and control the kernel’s behavior. I think, though, that there will be a major change in thinking required as systems like Snapstart become more popular. There seems to be an open area of research here, in adapting caching behaviors (perhaps dynamically) to handle changing marginal costs of full and empty memory pages. Linux’s behavior - and the one most programmers build into their applications - is one behavior on a larger spectrum, and is only optimal at the point of zero marginal cost.&lt;/p&gt;
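&lt;p&gt;The changed calculus can be made concrete with a toy marginal-cost rule (illustrative only, with made-up parameters): keep a disk-backed page cached only if its expected refetch saving beats the cost it adds to the snapshot. Setting the snapshot cost to zero recovers the traditional single-machine behavior of keeping everything.&lt;/p&gt;

```python
def should_keep_page(hit_probability, refetch_cost_ms, page_snapshot_cost_ms):
    """Toy rule: cache a disk-backed page only when its expected
    saving (the chance it is read again, times the cost of fetching
    it again) exceeds the cost of carrying it in the snapshot."""
    return hit_probability * refetch_cost_ms > page_snapshot_cost_ms

# On a single machine carrying a page is free, so keep everything
# with any nonzero chance of reuse:
assert should_keep_page(0.001, 10, 0)
```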

&lt;p&gt;&lt;strong&gt;Snapshots Beyond Snapstart&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I can’t say anything here about future plans for using MicroVM snapshots at AWS. But I do believe that they are a powerful tool for system designers and researchers, which I think are currently under-used. Firecracker has the ability to restore a MicroVM snapshot in as little as 4ms (or about 10ms for a full decent-sized Linux system), and it’s no doubt possible to optimize this further. I expect that sub-millisecond restore times are possible, as are restore times with a CPU cost not much higher than a traditional fork (or even a traditional thread start). This reality changes the way we think about what VMs can be used for - making them useful for much smaller, shorter-lived, and transient applications than most would assume.&lt;/p&gt;

&lt;p&gt;Firecracker’s full and incremental snapshot support &lt;a href=&quot;https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md&quot;&gt;is already open source&lt;/a&gt;, and already offers great performance and density&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. But Firecracker is far from the last word in restore-optimized VMMs. I would love to see more research in this area, exploring what is possible from user space and kernel space, and even how hardware virtualization support can be optimized for fast restores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Video Form&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of this post covers material I also covered in this talk, if you’d prefer to consume it in video form.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/ADOfX2LiEns&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Diagram from Brooker et al, &lt;em&gt;&lt;a href=&quot;https://arxiv.org/pdf/2102.12892.pdf&quot;&gt;Restoring Uniqueness in MicroVM Snapshots&lt;/a&gt;&lt;/em&gt;, 2021.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; I particularly enjoyed Laurence Tratt’s recent look at VM startup in &lt;a href=&quot;https://tratt.net/laurie/blog/2022/more_evidence_for_problems_in_vm_warmup.html&quot;&gt;More Evidence for Problems in VM Warmup&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; In this post I’ve used the words &lt;em&gt;language runtime&lt;/em&gt; to refer to language VMs like the JVM, to avoid confusion with virtualization VMs like Firecracker MicroVMs. This isn’t quite the right word, but it seemed worth avoiding the potential for confusion.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; As we covered in detail in &lt;em&gt;&lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/agache&quot;&gt;Firecracker: Lightweight Virtualization for Serverless Applications&lt;/a&gt;&lt;/em&gt; at NSDI’20.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; I really like the &lt;em&gt;compute cache&lt;/em&gt; framing that Peter uses in this keynote (from 1:00:30 onwards). It’s different from the one that I use in this post, but very clearly explains why cold starts exist, and why they matter to customers of systems like Lambda. The discussion of being unwilling to compromise on security is also important, and has been a driving force behind our work in this space for the last seven years.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; Granted, this does miss some merge opportunities for pages that end up becoming identical over time, but in a world of crypto and ASLR that happens infrequently anyway.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot7&quot;&gt;&lt;/a&gt; One interesting approach is the one described by Erika Hunhoff and Eric Rozner in &lt;em&gt;&lt;a href=&quot;http://ericrozner.com/papers/nsdi2020-poster.pdf&quot;&gt;Network Connection Optimization for Serverless Workloads&lt;/a&gt;&lt;/em&gt;. Unfortunately, they don’t seem to have taken this work further.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Amazon's Distributed Computing Manifesto</title>
      <link>http://brooker.co.za/blog/2022/11/22/manifesto.html</link>
      <pubDate>Tue, 22 Nov 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/11/22/manifesto</guid>
      <description>&lt;h1 id=&quot;amazons-distributed-computing-manifesto&quot;&gt;Amazon’s Distributed Computing Manifesto&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Manifesto made manifest.&lt;/p&gt;

&lt;p&gt;In the Johannesburg of 1998, I was rocking a middle parting, my friend group was abuzz about the news that there was water (and therefore monsters) on Europa, and all the cool kids were getting satellite TV at home&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Over in Seattle, the folks at Amazon.com had started to notice that their architecture was in need of rethinking. &lt;a href=&quot;https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/96985bfb-79b1-41e9-b552-fd5ad5af6fd3.pdf&quot;&gt;$147 million&lt;/a&gt; in sales in 1997, and over $600 million in 1998, were proving to be challenging to deal with. In 1998, as &lt;a href=&quot;https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html&quot;&gt;Werner Vogels recently shared&lt;/a&gt; folks at Amazon wrote a &lt;em&gt;distributed computing manifesto&lt;/em&gt; describing the problems they were seeing and the solutions they saw to those problems.&lt;/p&gt;

&lt;p&gt;The document itself, which you can (and should!) &lt;a href=&quot;https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html&quot;&gt;read in full over on Werner’s blog&lt;/a&gt; is both something of a time capsule, and surprisingly relevant to many of the systems architecture debates going on today, and the challenges that nearly all growing architectures inevitably face.&lt;/p&gt;

&lt;p&gt;From the manifesto:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The applications that run the business access the database directly and have knowledge of the data model embedded in them. This means that there is a very tight coupling between the applications and the data model, and data model changes have to be accompanied by application changes even if functionality remains the same.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Despite being called a &lt;em&gt;distributed computing manifesto&lt;/em&gt;, the Amazon of 1997 was already a distributed system by any reasonable measure. The problem was one of interfaces: the data store was serving as the interface between components and concerns, leading to tight coupling between storage and business logic. The architecture was difficult to scale, not in requests per second, but to adapt to new lines of business and the rate of overall change.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This approach does not scale well and makes distributing and segregating processing based on where data is located difficult since the applications are sensitive to the interdependent relationships between data elements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The proposed solution is services. This document predates the term microservices&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, but that’s pretty much what they were talking about. Moving the data behind well-defined interfaces that encapsulate business logic reduces the coupling between different parts of the system.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We propose moving towards a three-tier architecture where presentation (client), business logic and data are separated. This has also been called a service-based architecture. The applications (clients) would no longer be able to access the database directly, but only through a well-defined interface that encapsulates the business logic required to perform the function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Perhaps the most interesting part of the manifesto for me is the description of the cultural change that needs to go along with the change in architecture. Merely drawing a different block diagram wasn’t going to be enough to get the outcome the authors wanted.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;There are several important implications that have to be considered as we move toward a service-based model.
…
A second implication of a service-based approach, which is related to the first, is the significant mindset shift that will be required of all software developers. Our current mindset is data-centric, and when we model a business requirement, we do so using a data-centric approach. Our solutions involve making the database table or column changes to implement the solution and we embed the data model within the accessing application. The service-based approach will require us to break the solution to business requirements into at least two pieces. The first piece is the modeling of the relationship between data elements just as we always have. This includes the data model and the business rules that will be enforced in the service(s) that interact with the data. However, the second piece is something we have never done before, which is designing the interface between the client and the service so that the underlying data model is not exposed to or relied upon by the client.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This mindset shift - from database schema to API - has been fundamental to the rise of SoA and microservices as the default architecture over the last two decades. Now, in 2022, with embedded databases and two-tier architectures coming back in fashion, it’s interesting to see data-centric thinking somewhat converge with API-centric thinking. A broad toolkit is a good thing, but one hopes that the architects of this new generation of two-tier systems are learning from the lessons of the multi-gigabyte monoliths of old&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Another groundbreaking part of the manifesto was thinking about the role of workflows in distributed architectures. Starting with the observation that the existing order flow, despite its tight coupling on the backend, was already a workflow:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We already have an “order pipeline” that is acted upon by various business processes from the time a customer order is placed to the time it is shipped out the door. Much of our processing is already workflow-oriented, albeit the workflow “elements” are static, residing principally in a single database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;the scaling challenges of that architecture:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;…the current database workflow model will not scale well because processing is being performed against a central instance. As the amount of work increases…, the amount of processing against the central instance will increase to a point where it is no longer sustainable. A solution to this is to distribute the workflow processing so that it can be offloaded from the central instance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and the prescription:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Instead of processes coming to the data, the data would travel to the process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I started at Amazon a decade later, I found this way of thinking enlightening. Before I joined Amazon, I spent some time thinking about how to distribute radar simulations, an interestingly compute- and data-heavy workflow problem. Google’s &lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf&quot;&gt;MapReduce&lt;/a&gt; paper had come out in 2004, and had become something of a ubiquitous model for data-centric distributed communication. We made some attempts to apply MapReduce to our problems, without success. I can’t help but wonder if I had seen this writing from Amazon about workflows at the time whether we would have had a lot more success with that model.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;manifesto&lt;/em&gt; is a fascinating piece of history, both of Amazon’s technical approach, and of the effect that web scale was having on the architectures of distributed systems. A lot has changed in the industry since then, and Amazon’s approach has evolved significantly, but there are still fascinating lessons here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Despite my rocking ‘do, I was not one of the cool kids, if you can believe it.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; At least in its current definition.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; As &lt;a href=&quot;http://hpts.ws/papers/2022/JamesHamilton20221010.pdf&quot;&gt;James Hamilton has talked about&lt;/a&gt;, one of Amazon’s monoliths (Obidos) was big enough in the early 2000s that it was becoming impossible to link on a 32 bit machine. In many ways the size, and unreliability, of Obidos informed a lot of the &lt;em&gt;reliable system from unreliable parts&lt;/em&gt; thinking that went into AWS’s early architecture later in the same decade.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Writing Is Magic</title>
      <link>http://brooker.co.za/blog/2022/11/08/writing.html</link>
      <pubDate>Tue, 08 Nov 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/11/08/writing</guid>
      <description>&lt;h1 id=&quot;writing-is-magic&quot;&gt;Writing Is Magic&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Magic can be dangerous.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sometimes when folks ask me for advice at work, I write them very long emails to answer their question. Sometimes, those emails are generally interesting and not work-specific, so I share them here. A couple days ago somebody asked me about how to get better at communicating their ideas and opinions, how to extend their influence, and how to drive consensus. This was my reply.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to be influential. You can form 1:1 relationships with people, have small group meetings, do talks, send out a code review, or argue in Slack. All of those can be valuable at the right time. But there’s one tool that I choose most often: long-form writing. Writing is the closest thing I know to magic.&lt;/p&gt;

&lt;p&gt;Nearly every time I need to drive a difficult, subtle, or contentious decision, I write a document. Sometimes that’s half a page, sometimes it’s six pages. Sometimes much longer, although brevity is valuable. I see a few benefits to this approach that keep me coming back to it again and again.&lt;/p&gt;

&lt;p&gt;First, clarity. I’m sure you know the quote “Writing is nature’s way of letting you know how sloppy your thinking is”&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, and knowing how sloppy your thinking is allows you to sharpen it, test your arguments, and test different explanations. I find, more often than not, that I understand something much less well when I sit down to write about it than when I’m thinking about it in the shower. In fact, I find that I change my own mind on things a lot when I try to write them down. It really is a powerful tool for finding clarity in your own mind. Once you have clarity in your own mind, you’re much more able to explain it to others.&lt;/p&gt;

&lt;p&gt;Second, time. Getting people’s full, focused, attention on your ideas is very hard. Reading, if your team has a strong document culture, is one of the only ways to do that. You give people a couple pages to read, ideally on paper, and they’re likely to be quiet and focus on understanding your ideas for at least a few minutes. You get to be there, in their heads, with nothing else, for a while. You get to lay out an argument, tell a story, present some data, or ask their opinion without interruption, without back-and-forth. Just your voice. There are very few other ways to do that.&lt;/p&gt;

&lt;p&gt;Third, scale. Sometimes documents only live for an hour or so. They’re there to drive a decision, and when the decision is made the document is dead. Much more often, in my experience, they live well beyond the initial reading. I’ve had people ask me questions about documents I wrote more than a decade ago, that they’re still finding useful today. I love reading CS papers from before I was born. Writing scales in time much better than speaking. I also encourage people to share documents. They can go from the initial audience, to a whole team, to other stakeholders, to people working on similar problems. Writing scales in space much better than nearly any other way of communicating. Sometimes your writing may scale more than you want it to, in either time or space. You do need to watch out for that.&lt;/p&gt;

&lt;p&gt;Fourth, authority. For some reason, people tend to believe the things I write more strongly than the things I say. Maybe it’s that the time I took to write suggests it’s worth their time to read. I’ve found many times that I’ve said the same thing over and over and over, and then once I write it down suddenly it’s The Law. You need to be super careful with that effect, if you see it yourself. It can stifle discussion and communication. It can lead to people treating your words as more certain than they really are, of turning your musings into dogma&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Sometimes it doesn’t work at all, and you’ll write something down to no effect, but I don’t see that very often.&lt;/p&gt;

&lt;p&gt;Finally, memory. I don’t know if my memory is uniquely terrible, but I really do tend to forget things. I vividly remember the tuck shop menu from my primary school, 30 years ago. I don’t always remember my justification for decisions I was pushing for last week. Writing is my own record. My own way to go back and see what I was thinking then, and whether it’s still true. Sometimes I find that I was much smarter two weeks ago than I am now. Sometimes it’s the opposite. Either way, the record is something I find valuable.&lt;/p&gt;

&lt;p&gt;Writing takes time. Writing well takes a lot of time. On the other hand, the output of writing is almost always more clarity, and sometimes a clear decision. Over my career, I think I’ve wasted at least ten times more time going around and around in conversations without finding consensus than I have writing documents that didn’t turn out to be valuable. It’s very seldom that I think back over writing something and conclude that it wasn’t a good investment of my time. That can happen, and you have to watch for it, but it doesn’t happen to me a lot.&lt;/p&gt;

&lt;p&gt;My last point is about reading culture. Sitting down to read, really read for understanding, is a learned skill. It’s something I recommend that everybody practice, and model in their organizations. I like to discourage people from arguing in document comments. I especially discourage people from nitpicking. Seeking feedback &lt;em&gt;on the document&lt;/em&gt; rather than &lt;em&gt;on the ideas in the document&lt;/em&gt; is, of course, useful if you want to get better at writing. But it’s way easier for a reader to nitpick grammar than it is to engage with ideas, so I like to time box that kind of feedback or take it to a different channel.&lt;/p&gt;

&lt;p&gt;This isn’t a complete answer to your question, but a partial one. Write more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Lamport &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/2736348&quot;&gt;attributes this&lt;/a&gt; to &lt;a href=&quot;https://en.wikipedia.org/wiki/Dick_Guindon&quot;&gt;Dick Guindon&lt;/a&gt;, but there are other credible attributions out there. No matter who said it first, it seems true.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Lev Grossman’s Magicians series has this ongoing theme about how dangerous magic is to the people who practice it, and a lot of the difficulty isn’t in harnessing power, it’s in having that power not destroy you when you do. I think about that a lot. Influence is something worth becoming great at, and I really admire some of the people who are the best at it, but you need to be really careful.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Give Your Tail a Nudge</title>
      <link>http://brooker.co.za/blog/2022/10/21/nudge.html</link>
      <pubDate>Fri, 21 Oct 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/10/21/nudge</guid>
      <description>&lt;h1 id=&quot;give-your-tail-a-nudge&quot;&gt;Give Your Tail a Nudge&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Tricks are fun.&lt;/p&gt;

&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;We all care about tail latency (also called &lt;em&gt;high percentile&lt;/em&gt; latency, also called &lt;em&gt;those times when your system is weirdly slow&lt;/em&gt;). Simple changes that can bring it down are valuable, especially if they don’t come with difficult tradeoffs. &lt;a href=&quot;https://arxiv.org/pdf/2106.01492.pdf&quot;&gt;Nudge: Stochastically Improving upon FCFS&lt;/a&gt; presents one such trick. The Nudge paper interests itself in tail latency compared to First Come First Served (FCFS)&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, for a good reason:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;While advanced scheduling algorithms are a popular topic in theory papers, it is unequivocal that the most popular scheduling policy used in practice is still First-Come First-Served (FCFS).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is all true. Lots of proposed mechanisms, pretty much everybody still uses FCFS (except for some systems using LIFO&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, and things like CPU and IO schedulers which often use more complex heuristics and priority levels&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;). But this simplicity is good:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;However, there are also theoretical arguments for why one should use FCFS. For one thing, FCFS minimizes the maximum response time across jobs for any finite arrival sequence of jobs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The paper then goes on to question whether, despite this optimality result, we can do better than FCFS. After all, minimizing the maximum doesn’t mean doing better across the whole tail. They suggest a mechanism that does that, which they call Nudge. Starting with some intuition:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The intuition behind the Nudge algorithm is that we’d like to basically stick to FCFS, which we know is great for handling the extreme tail (high 𝑡), while at the same time incorporating a little bit of prioritization of small jobs, which we know can be helpful for the mean and lower 𝑡.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And going on to the algorithm itself:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;However, when a “small” job arrives and finds a “large” job immediately ahead of it in the queue, we swap the positions of the small and large job in the queue. The one caveat is that a job which has already swapped is ineligible for further swaps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wow, that really is a simple little trick! If you prefer to think visually, here’s Figure 1 from the Nudge paper:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/nudge_figure_1.png&quot; alt=&quot;Diagram showing Nudge swapping a small and large task&quot; /&gt;&lt;/p&gt;
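&lt;p&gt;To make the trick concrete, here’s a minimal sketch of a Nudge queue in Python. The size cutoff between “small” and “large” jobs is a hypothetical parameter of the sketch; the paper discusses how to choose it:&lt;/p&gt;

```python
from collections import deque

SMALL_CUTOFF = 1.0  # hypothetical size boundary between "small" and "large"

class NudgeQueue:
    """FCFS queue with the Nudge tweak: an arriving small job swaps with
    a large job immediately ahead of it, and no job swaps twice."""

    def __init__(self):
        self.jobs = deque()  # entries are (size, already_swapped)

    def push(self, size):
        if (self.jobs
                and size <= SMALL_CUTOFF              # arriving job is small
                and self.jobs[-1][0] > SMALL_CUTOFF   # job ahead of it is large
                and not self.jobs[-1][1]):            # and hasn't swapped yet
            large_size, _ = self.jobs.pop()
            # Swap positions: the small job goes ahead, and both jobs
            # become ineligible for further swaps.
            self.jobs.append((size, True))
            self.jobs.append((large_size, True))
        else:
            self.jobs.append((size, False))

    def pop(self):
        return self.jobs.popleft()[0]
```

Everything else is plain FCFS: the only change is that one conditional swap on arrival.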

&lt;p&gt;&lt;strong&gt;But does it work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I have no reason to doubt that Nudge works, based on the analysis in the paper, but that analysis is likely out of the reach of most practitioners. More practically, like all closed-form analysis it asks and answers some very specific questions, which isn’t as useful for exploring the effect that applying Nudge may have on our systems. So, like the coward I am, I turn to simulation.&lt;/p&gt;

&lt;p&gt;The simulator (&lt;a href=&quot;https://github.com/mbrooker/simulator_example/blob/main/nudge/nudge.py&quot;&gt;code here&lt;/a&gt;) follows the &lt;a href=&quot;https://brooker.co.za/blog/2022/04/11/simulation.html&quot;&gt;simple simulation&lt;/a&gt; approach I like to apply. It considers a system with a queue (using either Nudge or FCFS), a single server, Poisson arrivals, and service times picked from three different Weibull distributions with different means and probabilities. You might call that an M/G/1 system, if you like &lt;a href=&quot;https://en.wikipedia.org/wiki/Kendall%27s_notation&quot;&gt;Kendall’s Notation&lt;/a&gt;.&lt;/p&gt;
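&lt;p&gt;The linked simulator has the full details, but the workload generation alone might look something like this sketch. The arrival rate and the Weibull mixture parameters here are made up for illustration, not the values behind the plots below:&lt;/p&gt;

```python
import random

# Illustrative M/G/1-style workload: Poisson arrivals and service times
# drawn from a mixture of three Weibull distributions with different
# scales and probabilities. These parameters are invented for the sketch.
MIXTURE = [  # (probability, weibull shape, weibull scale)
    (0.7, 1.5, 0.5),
    (0.2, 1.5, 2.0),
    (0.1, 1.5, 8.0),
]

def gen_jobs(n, arrival_rate, rng):
    """Return n (arrival_time, service_time) pairs in arrival order."""
    t = 0.0
    jobs = []
    for _ in range(n):
        # Poisson process: exponentially distributed inter-arrival gaps.
        t += rng.expovariate(arrival_rate)
        p = rng.random()
        for prob, shape, scale in MIXTURE:
            if p < prob:
                service = rng.weibullvariate(scale, shape)
                break
            p -= prob
        jobs.append((t, service))
    return jobs
```

Feed those jobs through a queue (FCFS or Nudge) and a single server, record each job's completion time minus its arrival time, and you have the latency distributions compared below.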

&lt;p&gt;What we’re interested in, in this simulation, is the effect across the whole tail, and for different loads on the system. We define load (calling it ⍴ for traditional reasons) in terms of two other numbers: the mean arrival rate λ, and the mean completion rate μ, both in units of jobs/second.&lt;/p&gt;

\[\rho = \frac{\lambda}{\mu}\]

&lt;p&gt;Obviously when $\rho &amp;gt; 1$ the queue is filling faster than it’s draining and &lt;a href=&quot;https://brooker.co.za/blog/2021/08/05/utilization.html&quot;&gt;you’re headed for catastrophe more quickly than you think&lt;/a&gt;. Considering the effect of queue tweaks for different loads seems interesting, because we’d expect them to have very little effect at low load (the queue is almost always empty), and want to make sure they don’t wreck things at high load.&lt;/p&gt;

&lt;p&gt;Here are the results, as a cumulative latency distribution, comparing FCFS with Nudge for three different values of ⍴:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/blog/images/nudge_ecdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That’s super encouraging, and suggests that Nudge works very well across the whole tail in this model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More questions to answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are a lot more interesting questions to explore before putting Nudge into production. The most interesting one seems to be whether it works with our real-world tail latency distributions, which can have somewhat heavy tails. The Nudge paper says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In this paper, we choose to focus on the case of light-tailed job size distributions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;but defends this by saying (correctly) that most real-world systems truncate the tails of their job size distributions (with mechanisms like timeouts and limits):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Finally, while heavy-tailed job size distributions are certainly prevalent in empirical workloads …, in practice, these heavy-tailed workloads are often truncated, which immediately makes them light-tailed. Such truncation can happen because there is a limit imposed on how long jobs are allowed to run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which is almost ubiquitous in practice. It’s very hard indeed to run a stable distributed system where job sizes are allowed to have unbounded cost&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Whether our tails are &lt;em&gt;bounded enough&lt;/em&gt; for Nudge to behave well is a good question, which we can also explore with simulation.&lt;/p&gt;

&lt;p&gt;The other important question, of course, is how it generalizes to larger systems with multiple layers of queues, multiple servers, and more exciting arrival time distributions. Again, we can explore all those questions through simulation (you might be able to explore them in closed-form too, but that’s beyond my current skills).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;
Overall, Nudge is a very cool result. In its effectiveness and simplicity it reminds me of &lt;a href=&quot;https://brooker.co.za/blog/2012/01/17/two-random.html&quot;&gt;the power of two random choices&lt;/a&gt; and &lt;a href=&quot;https://dl.acm.org/doi/10.1145/792538.792546&quot;&gt;Always Go Left&lt;/a&gt;. It may be somewhat difficult to implement, especially if the additional synchronization required to do the compare-and-swap dance is a big issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; You might also call this First In First Out, or FIFO.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; First Come Last Served?&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; IO Schedulers are an especially interesting topic, although one that has become less interesting with the rise of SSDs (and, to an extent, smarter HDD interfaces like NCQ). Old school IO schedulers like &lt;a href=&quot;https://github.com/torvalds/linux/blob/master/block/elevator.c&quot;&gt;Linux’s elevator&lt;/a&gt; could bring amazing performance gains by reducing head movement in hard drives. Most folks these days are just firing their IOs at an SSD (with a &lt;em&gt;noop&lt;/em&gt; scheduler).&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Some systems do, of course, especially data analytics systems which could be counting the needles in a very large haystack. These systems do turn out to be difficult to build (although building them in the cloud and being able to share the same capacity pool between different uncorrelated workloads from different customers helps a lot).&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Atomic Commitment: The Unscalability Protocol</title>
      <link>http://brooker.co.za/blog/2022/10/04/commitment.html</link>
      <pubDate>Tue, 04 Oct 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/10/04/commitment</guid>
      <description>&lt;h1 id=&quot;atomic-commitment-the-unscalability-protocol&quot;&gt;Atomic Commitment: The Unscalability Protocol&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;2PC is my enemy.&lt;/p&gt;
&lt;script src=&quot;https://polyfill.io/v3/polyfill.min.js?features=es6&quot;&gt;&lt;/script&gt;

&lt;script&gt;
  MathJax = {
    tex: {inlineMath: [[&apos;$&apos;, &apos;$&apos;], [&apos;\\(&apos;, &apos;\\)&apos;]]}
  };
&lt;/script&gt;

&lt;script id=&quot;MathJax-script&quot; async=&quot;&quot; src=&quot;https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;Let’s consider a single database system, running on one box, good for 500 requests per second.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌───────────────────┐
│     Database      │
│(good for 500 rps) │
└───────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What if we want to access that data more often than 500 times a second? If by &lt;em&gt;access&lt;/em&gt; we mean &lt;em&gt;read&lt;/em&gt;, we have a lot of options. If by &lt;em&gt;access&lt;/em&gt; we mean &lt;em&gt;write&lt;/em&gt;, or even &lt;em&gt;perform arbitrary transactions on&lt;/em&gt;, we’re in a trickier situation. Tricky problems aside, we forge ahead by splitting our dataset into two shards:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌───────────────────┐  ┌───────────────────┐
│ Database shard 1  │  │ Database shard 2  │
│(good for 500 rps) │  │(good for 500 rps) │
└───────────────────┘  └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we’re just doing single row reads and writes, we’re most of the way there. We just need to add a routing layer that can decide which shard to send each access to, and we’re done&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;              ┌────────────┐                
              │   Router   │                
              └────────────┘                
                     ┬                      
           ┌─────────┴───────────┐          
           ▼                     ▼          
┌───────────────────┐  ┌───────────────────┐
│ Database shard 1  │  │ Database shard 2  │
│(good for 500 rps) │  │(good for 500 rps) │
└───────────────────┘  └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But what if we have transactions? To make the complexity reasonable, and speed us on our journey, let’s define a &lt;em&gt;transaction&lt;/em&gt; as an operation that does writes to multiple rows, based on some condition, atomically. By &lt;em&gt;atomically&lt;/em&gt; we mean that either all the writes happen or none of them do. By &lt;em&gt;based on some condition&lt;/em&gt; we mean the transactions can express ideas like “reduce my bank balance by R10 as long as it’s over R10 already”.&lt;/p&gt;

&lt;p&gt;But how do we ensure atomicity across multiple machines? This is a classic computer science problem called &lt;a href=&quot;https://en.wikipedia.org/wiki/Atomic_commit&quot;&gt;Atomic Commitment&lt;/a&gt;. The classic solution to this classic problem is &lt;a href=&quot;https://en.wikipedia.org/wiki/Two-phase_commit_protocol&quot;&gt;Two-phase commit&lt;/a&gt;, maybe the most famous of all distributed protocols. There’s a &lt;em&gt;lot&lt;/em&gt; we could say about atomic commitment, or even just about two-phase commit. In this post, I’m going to focus on just one aspect: atomic commitment has weird scaling behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Fast is our New Database?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The obvious question after sharding our new database is &lt;em&gt;how fast is it?&lt;/em&gt; How much throughput can we get out of these two machines, each good for 500 transactions a second?&lt;/p&gt;

&lt;p&gt;The optimist’s answer is 500 + 500 = 1000. We doubled capacity, and so can now do more work. But we need to remind the optimist that we’re solving a distributed transaction problem here, and that at least some transactions go to both shards.&lt;/p&gt;

&lt;p&gt;For the next step in our analysis, we want to measure the mean number of shards any given transaction will visit. Let’s call it &lt;em&gt;k&lt;/em&gt;. For &lt;em&gt;k = 1&lt;/em&gt; we get perfect scalability! For &lt;em&gt;k = 2&lt;/em&gt; we get no scalability at all: both shards need to be visited on every transaction, so we only get 500 transactions a second out of the whole thing. The capacity of the database is the sum of the per-node capacities, divided by &lt;em&gt;k&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do we spread the data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We haven’t mentioned, so far, how we decide which data to put onto which shard. This is a whole complex topic and active research area of its own. The problem is a tough one: we want to spread the data out so about the same number of transactions go to each shard (avoiding &lt;em&gt;hot shards&lt;/em&gt;), and we want to minimize the number of shards any given transaction touches (minimize &lt;em&gt;k&lt;/em&gt;). We have to do this in the face of, potentially, very non-uniform access patterns.&lt;/p&gt;

&lt;p&gt;But let’s put that aside for now, and instead model how &lt;em&gt;k&lt;/em&gt; changes with the number of rows in each transaction (&lt;em&gt;N&lt;/em&gt;), and number of shards in the database (&lt;em&gt;s&lt;/em&gt;). Borrowing from &lt;a href=&quot;https://stats.stackexchange.com/a/296053&quot;&gt;this StackExchange answer&lt;/a&gt;, and assuming that each transaction picks uniformly from the key space, we can calculate:&lt;/p&gt;

&lt;p&gt;$k = s \left( 1 - \left( \frac{s-1}{s} \right) ^ N \right)$&lt;/p&gt;

&lt;p&gt;You can picture that in your head, right? If, like me, you probably can’t, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/blog_k_versus_n_s.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;k&lt;/em&gt; is fairly nicely behaved for small &lt;em&gt;N&lt;/em&gt; or small &lt;em&gt;s&lt;/em&gt;, but things start to get ugly when both &lt;em&gt;N&lt;/em&gt; and &lt;em&gt;s&lt;/em&gt; are large. Remember that the absolute maximum throughput we can get out of this database is&lt;/p&gt;

&lt;p&gt;$\mathrm{Max TPS} \propto \frac{s}{k}$&lt;/p&gt;

&lt;p&gt;Let’s consider the example of &lt;em&gt;N=10&lt;/em&gt;. How does the maximum TPS vary with &lt;em&gt;s&lt;/em&gt; as we increase the number of shards from 1 to 10:&lt;/p&gt;

&lt;p&gt;$\mathrm{Max TPS}(s = 1..10, N=10)
    \propto [1.000000, 1.000978, 1.017648, 1.059674, 1.120290, 1.192614, 1.272359, 1.356991, 1.444974, 1.535340]$&lt;/p&gt;

&lt;p&gt;Oof! For &lt;em&gt;N = 10&lt;/em&gt;, adding a second shard only increases our throughput by something like 0.1% for uniformly distributed keys! The classic solution is to hope that your keys aren’t uniformly distributed, and that you can keep &lt;em&gt;k&lt;/em&gt; low without causing hotspots. A nice solution, if you can get it.&lt;/p&gt;
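&lt;p&gt;If you want to play with these numbers yourself, the whole model is a few lines of Python (a sketch; the constant of proportionality is the per-shard capacity, 500 rps in our example):&lt;/p&gt;

```python
def k_shards_touched(s, N):
    """Expected number of distinct shards touched by a transaction
    writing N rows picked uniformly from a keyspace spread over s shards."""
    return s * (1 - ((s - 1) / s) ** N)

def max_tps_factor(s, N):
    """Maximum throughput relative to a single node: s shards of equal
    capacity, each transaction consuming capacity on k of them."""
    return s / k_shards_touched(s, N)

# Reproduce the N=10 sequence above for s = 1..10.
factors = [max_tps_factor(s, 10) for s in range(1, 11)]
```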

&lt;p&gt;&lt;strong&gt;But wait, it gets worse!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where our old friend, concurrency, comes back to haunt us. Let’s think about what happens when we get into the state where each shard can only handle one more transaction&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, and two transactions come in, each wanting to access both shards.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;             ┌────┐    ┌────┐               
             │ T1 │    │ T2 │               
             └────┘    └────┘               
                │         │                 
                │         │                 
          ┌─────┴─────────┴──────┐          
          │                      │          
          ▼                      ▼          
┌───────────────────┐  ┌───────────────────┐
│ Database shard 1  │  │ Database shard 2  │
│ (can only handle  │  │ (can only handle  │
│     one more)     │  │     one more)     │
└───────────────────┘  └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Clearly, only one of T1 and T2 can succeed. They can also, sadly, both fail. If T1 gets to shard 1 first, and T2 gets to shard 2 first, neither will get the capacity it needs from the other shard. Then both fail&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. We can look at this using a simulation, and see how pronounced the effect can be:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/paper_synth_with_limit_unif_goodput.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this simulation, with Poisson arrivals, offered load far in excess of the system capacity, and uniform key distribution, goodput for &lt;em&gt;N = 10&lt;/em&gt; drops significantly as the number of shards increases, and doesn’t recover until &lt;em&gt;s = 6&lt;/em&gt;. This effect is surprising and counter-intuitive. Effects like this make transaction systems somewhat uniquely hard to scale out. For example, splitting a single-node database in half could lead to worse performance than the original system.&lt;/p&gt;
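&lt;p&gt;The two-transaction race on its own is easy to check with a tiny Monte Carlo. This sketch covers only the two-shard, one-slot-each case, and assumes the order in which the two transactions reach each shard is random and independent:&lt;/p&gt;

```python
import random

def race_once(rng):
    """T1 and T2 each need the last capacity slot on both shards.
    Each shard grants its slot to whichever transaction arrives first,
    and arrival order at each shard is independent and random."""
    shard1_winner = rng.choice(("T1", "T2"))
    shard2_winner = rng.choice(("T1", "T2"))
    # A transaction commits only if it won the slot on both shards.
    # If each transaction won one shard, both abort.
    return 1 if shard1_winner == shard2_winner else 0

rng = random.Random(1)
trials = 100_000
commit_rate = sum(race_once(rng) for _ in range(trials)) / trials
```

Under those assumptions, about half the runs end with both transactions failing even though there was capacity for one of them to succeed.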

&lt;p&gt;Fundamentally, this is because scale-out depends on &lt;a href=&quot;https://brooker.co.za/blog/2021/01/22/cloud-scale.html&quot;&gt;avoiding coordination&lt;/a&gt; and atomic commitment is all about coordination. Atomic commitment is the anti-scalability protocol.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Obviously not &lt;em&gt;done done&lt;/em&gt;. Building scale-out databases even for single-row accesses turns out to be super hard in other ways. For a good discussion of that, check out the 2022 &lt;a href=&quot;https://www.usenix.org/conference/atc22/presentation/vig&quot;&gt;DynamoDB paper&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Because of thread limits, or concurrency limits, or connection limits, or anything else that limits the total number of outstanding transactions that the shard can handle. The details matter a whole lot in practice, but matter little in this simple model.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; You might be thinking that we could just queue both of them up. Which we &lt;em&gt;could&lt;/em&gt;, but that would have other bad impacts. In general, long queues are really bad for system stability.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Histogram vs eCDF</title>
      <link>http://brooker.co.za/blog/2022/09/02/ecdf.html</link>
      <pubDate>Fri, 02 Sep 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/09/02/ecdf</guid>
      <description>&lt;h1 id=&quot;histogram-vs-ecdf&quot;&gt;Histogram vs eCDF&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Accumulation is a fun word.&lt;/p&gt;

&lt;p&gt;Histograms are a rightfully popular way to present data like latency, throughput, object size, and so on. Histograms avoid some of the difficulties of picking a summary statistic, or group of statistics, which is &lt;a href=&quot;https://brooker.co.za/blog/2017/12/28/mean.html&quot;&gt;hard to do right&lt;/a&gt;. I think, though, that there’s nearly always a better choice than histograms: the empirical cumulative distribution function (eCDF). To understand why, let’s look at an example, starting with the histogram&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/blog_hist_10bucket.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This latency distribution is very strongly bimodal. It’s the kind of thing you might expect from a two-tiered cache: a local tier with very low latency, and a remote tier with latency in the 2 to 3ms range. Super common in systems and databases. The histogram illustrates that bimodality very well. It’s easy to see that the second mode is somewhere around 2.5ms. The next two questions on my mind would be: &lt;em&gt;how much do these two spikes contribute?&lt;/em&gt; and &lt;em&gt;where is the first spike?&lt;/em&gt; In histogram form, it’s hard to answer these questions. The first we’d need to answer by doing some mental area-under-the-curve estimation, and the second is obscured by bucketing.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/blog_ecdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s the same data in eCDF form. You can think of it as the histogram &lt;em&gt;summed up&lt;/em&gt;, or &lt;em&gt;integrated&lt;/em&gt;, or &lt;em&gt;accumulated&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The first thing you may notice is how easy it has become to see the relative contribution of our first and second mode. The first mode contributes around 70% of measurements. If this is a cache system, we immediately know that our cache hit rate is around 70%. We also know that the 65th percentile is very low, and the 75th is very high. In fact, we can read these percentile&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; values right off the graph by finding the 0.65 and 0.75 points on the Y axis, moving right until we hit the curve, and reading their value off the X axis. Magic!&lt;/p&gt;
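&lt;p&gt;Building an eCDF and reading quantiles off it takes only a couple of lines. Here’s a sketch with NumPy, using a made-up bimodal “cache” workload (70% fast local hits, 30% slow remote misses) rather than the data plotted above:&lt;/p&gt;

```python
import numpy as np

def ecdf(samples):
    """Sorted values x and cumulative fractions y: a fraction y[i] of
    the samples are less than or equal to x[i]."""
    x = np.sort(samples)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

def quantile(x, y, q):
    """Walk up the Y axis to q, then read the value off the X axis."""
    return x[np.searchsorted(y, q)]

# Hypothetical two-tier cache: 70% local hits (fast, gamma-distributed),
# 30% remote misses (centered around 2.5ms).
rng = np.random.default_rng(42)
n = 100_000
hit = rng.random(n) < 0.7
latency_ms = np.where(hit,
                      rng.gamma(2.0, 0.05, n),
                      rng.normal(2.5, 0.25, n))

x, y = ecdf(latency_ms)
p65 = quantile(x, y, 0.65)  # inside the fast mode
p75 = quantile(x, y, 0.75)  # past the ~70% hit rate, so in the slow mode
```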

&lt;p&gt;The second question, about the location of the first spike, can be answered by zooming in. That works because the eCDF, unlike the histogram, doesn’t require bucketing, so we can zoom around in X and Y as much as we like without changing the shape of the curve. Say, for example, we wanted to look at the tail in more detail. Let’s zoom in on the top right.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/blog_ecdf_zoomed.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Again, we can easily read off high percentiles from the graph, and don’t have to worry about how the choice of bucket widths changes the picture.&lt;/p&gt;

&lt;p&gt;I believe that for nearly all purposes in systems design and operations, eCDFs are a better choice than histograms for presenting data to humans.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What other cool stuff can eCDFs do?&lt;/em&gt;
Another cool thing eCDFs make easy is generating random numbers from a measured distribution. Remember how we could go up the Y axis to find the value of different percentiles? Computers can do that too: generate a random number between 0 and 1, and then “read off” the X axis value. I’ll leave the exercise of doing that efficiently and accurately to the reader.&lt;/p&gt;

&lt;p&gt;Just as easy as finding the value of a percentile is finding the percentile of a value. This is less frequently useful in systems, but occasionally it is nice to be able to ask “how much of an outlier is this value?”. For example, say you’re building a filesystem that can only store files less than 1MiB. Take the eCDF of file sizes in the world, find 1MiB on the X axis, and the Y value will be the percentage of files your system will be able to store.&lt;/p&gt;

&lt;p&gt;It’s trivial to transform an eCDF into a histogram, by bucketing up first differences (f&lt;sub&gt;n+1&lt;/sub&gt; - f&lt;sub&gt;n&lt;/sub&gt;). You can’t go from the histogram to the eCDF so easily &lt;em&gt;in general&lt;/em&gt;, because bucketing loses data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why do you say eCDF and not just CDF?&lt;/em&gt;
At least in my head, the CDF is the &lt;em&gt;true&lt;/em&gt; cumulative distribution function of the underlying distribution, and the eCDF is an empirical estimate of it (related by the &lt;a href=&quot;https://en.wikipedia.org/wiki/Glivenko%E2%80%93Cantelli_theorem&quot;&gt;fundamental theorem of statistics&lt;/a&gt;). Whether you feel that’s a useful distinction depends on whether you think there is an underlying distribution that exists separately from our measurements of it (and whether you still think that in the face of the non-stationarity of nearly all systems).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Isn’t a histogram a data structure and not a graph?&lt;/em&gt;
Some folks use &lt;em&gt;histogram&lt;/em&gt; to mean the graph (as I do above), and some folks use it to mean a data structure designed for summarizing a stream of values. There are many flavors of these, but most have a set of buckets (exponentially spaced is common) and sum into the buckets. The “real” eCDF is calculated directly from the samples themselves without bucketing, but can also be estimated from these histograms-as-data-structures. If you’re summarizing your data using one of these data structures, it’s nice to store as many buckets as feasible (and indeed many more than you’d show a human).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; I use &lt;em&gt;histogram&lt;/em&gt; to also cover &lt;em&gt;frequency polygon&lt;/em&gt; here, because most people don’t recognize the distinction. I don’t think it’s a particularly useful distinction anyway. You also might say ePDF.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; It is, after all, the &lt;em&gt;cumulative&lt;/em&gt; distribution function. It’s nice when things say what they are.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; I say &lt;em&gt;percentile&lt;/em&gt; here, but this is true for all quantiles. You can read off the quartiles, deciles, quintiles, heptiles, etc right off the graph in the same way.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>What is Backoff For?</title>
      <link>http://brooker.co.za/blog/2022/08/11/backoff.html</link>
      <pubDate>Thu, 11 Aug 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/08/11/backoff</guid>
      <description>&lt;h1 id=&quot;what-is-backoff-for&quot;&gt;What is Backoff For?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Back off man, I&apos;m a scientist.&lt;/p&gt;

&lt;p&gt;Years ago I wrote a blog post about &lt;a href=&quot;https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/&quot;&gt;exponential backoff and jitter&lt;/a&gt;, which has turned out to be enduringly popular. I like to believe that it’s influenced at least a couple of systems to add jitter, and become more stable. However, I do feel a little guilty about pushing the popularity of jitter without clearly explaining what backoff and jitter do, and do not do.&lt;/p&gt;

&lt;p&gt;Here’s the pithy statement about backoff:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Backoff helps in the short term. It is only valuable in the long term if it reduces total work&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Consider a system that suffers from &lt;em&gt;short spikes&lt;/em&gt;&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; of overload. That could be a flash sale, top-of-hour operational work, recovery after a brief network partition, etc. During the overload, some calls are failing, primarily due to the overload itself. Backing off in this case is extremely helpful: it spreads out the spike, and reduces the amount of chatter between clients and servers. Jitter is especially effective at helping broaden spikes. If you want to see this in action, look at the time series in &lt;a href=&quot;https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/&quot;&gt;the exponential backoff and jitter post&lt;/a&gt;.&lt;/p&gt;
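&lt;p&gt;For concreteness, here’s a minimal sketch of capped exponential backoff with “full jitter” in the style of that earlier post. The function name, base, and cap are illustrative choices, not values from the post:&lt;/p&gt;

```python
import random

def backoff_delay(attempt, base=0.1, cap=20.0):
    # Capped exponential backoff with "full jitter": sleep a uniformly
    # random time between 0 and the capped exponential delay.
    # base and cap (in seconds) are made-up illustrative defaults.
    exp_delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp_delay)
```

&lt;p&gt;Spreading each client’s retry uniformly over the backoff window is what broadens the spike: clients that failed at the same instant stop retrying at the same instant.&lt;/p&gt;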

&lt;p&gt;Now, consider a &lt;em&gt;long spike&lt;/em&gt; of overload. There are two cases here.&lt;/p&gt;

&lt;p&gt;One is where we have a large (maybe effectively unlimited) number of clients, and they’re independently sending work. For example, think about a website with a million customers, each visiting it about once a day. Each client backing off in this case &lt;em&gt;does not help&lt;/em&gt;, because it does not reduce the work being sent to the system. Each client is still going to press F5 the same number of times, so delaying their presses doesn’t help.&lt;/p&gt;

&lt;p&gt;The other is where we have a smaller number of clients, each sending a serial stream of requests. For example, think of a fleet of workers polling a queue. Each client backing off in this case &lt;em&gt;helps a lot&lt;/em&gt; because they are serial. Backing off means they send less work, and less is asked of the service&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;That may sound like a subtle distinction, but the bottom line is this: does backoff actually reduce the work done? In the case of lots of clients, it doesn’t, because each new client entering the system doesn’t know others have backed off.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The only way to deal with long-term overload is to reduce load, deferring load does not work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, on to retries. As I wrote about in &lt;a href=&quot;https://brooker.co.za/blog/2022/02/28/retries.html&quot;&gt;Fixing retries with token buckets and circuit breakers&lt;/a&gt;, retries have an amplifying effect on the work a service is asked to do. During long-term overload, retries may increase the work to be done. The way to fix that is with a good retry policy, such as the token bucket based adaptive strategy described in the post.&lt;/p&gt;
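&lt;p&gt;A minimal sketch of a token-bucket retry limiter, in the spirit of that post: each retry spends a token, each success earns a fraction of one back, and when the bucket runs dry the client fails fast instead of retrying. The class name and constants here are mine, not from the post:&lt;/p&gt;

```python
class RetryTokenBucket:
    # A sketch of a token-bucket retry limiter: retries are only allowed
    # while the bucket has tokens, and tokens are earned back by successes.
    # Capacity and refill rate are illustrative, not recommendations.
    def __init__(self, capacity=10.0, refill_per_success=0.1):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_success

    def on_success(self):
        self.tokens = min(self.capacity, self.tokens + self.refill)

    def may_retry(self):
        # Spend a token if one is available; otherwise fail fast.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

&lt;p&gt;During long-term overload successes dry up, the bucket empties, and the retry amplification factor drops toward one, which is exactly the behavior backoff alone can’t provide.&lt;/p&gt;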

&lt;p&gt;&lt;em&gt;Backoff is not a substitute for a good retry policy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Backoff is not a good retry policy. Or, at least, it is hard to use as one.&lt;/p&gt;

&lt;p&gt;Backoff is only a good retry policy in systems with small numbers of sequential clients, where the introduced delay between retries delays &lt;em&gt;future first tries&lt;/em&gt;. If this property is not true, and the next &lt;em&gt;first try&lt;/em&gt; is going to come along at a time independent of retry backoff, then backing off retries &lt;em&gt;does nothing&lt;/em&gt; to help long-term overload. It just defers work to a future time&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;That doesn’t mean that backing off between retries is a bad idea. It’s a good idea, but only helps for &lt;em&gt;short term&lt;/em&gt; overload (spikes, etc). It does not reduce the total work in the system, or total amplification factor of retries.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A good approach to retries combines backoff, jitter, and a good retry policy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are complementary mechanisms, and none of them solves the whole problem on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First tries and second tries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The other way to think about this is to think about &lt;em&gt;first tries&lt;/em&gt; and &lt;em&gt;second tries&lt;/em&gt;. A first try is the first time a given client tries to do a piece of work. There are two ways systems can get overloaded: &lt;em&gt;too many first tries&lt;/em&gt; and &lt;em&gt;too many second tries&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you have too many first tries, you need to have fewer. With a bounded number of clients, getting each of them to back off is an effective way to do that&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. With an unbounded number of clients, it is not: clients only hear the bad news after their first try, so no amount of backoff will reduce their first-try rate.&lt;/p&gt;

&lt;p&gt;If you’ve got an OK number of first tries, but some error rate is driving up second-try (retry) traffic, then you need to reduce the number of second tries. Backoff is an effective way to reduce the number of second tries &lt;em&gt;now&lt;/em&gt;, by &lt;em&gt;deferring them into the future&lt;/em&gt;. If you think you’ll be able to handle them better in the future, that’s a win. But backoff is not an effective way to reduce the number of second tries &lt;em&gt;in total&lt;/em&gt; for long-running overload. For that, you need something like the adaptive retry approach.&lt;/p&gt;

&lt;p&gt;Unless, of course, your clients are relatively small in number, and their next &lt;em&gt;first try&lt;/em&gt; is only going to get made after this round of &lt;em&gt;second tries&lt;/em&gt; is done. Then backoff will reduce the overall rate of &lt;em&gt;second tries&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This ‘number of clients’ thinking can be a bit confusing, because it’s not really about the number of clients. It’s about the number of parallel things doing work. Code that spawns a thread for each call, or dispatches each call to an event loop, can become an effectively unbounded number of things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulation Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can validate these assertions by looking at some simulation results. First, let’s look at a simulation that compares four strategies: a fixed three-retries policy with and without backoff, and an adaptive retry policy with and without backoff, for a case with a very large number of clients. What we see in these results matches the assertions: in this case, which reflects a long-running overload with an unbounded number of clients, per-request retry backoff has nearly no effect on traffic amplification.&lt;/p&gt;

&lt;p&gt;Again, the unbounded number of clients can mean lots of clients, or just clients that spawn threads or async work for requests. The important property is whether they wait on a request to be done (or fail) before they do the next one.&lt;/p&gt;
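&lt;p&gt;A deliberately crude toy model (not the simulator behind these plots; all names and numbers here are invented) captures the closed-loop versus open-loop distinction:&lt;/p&gt;

```python
def requests_in_window(window_s, backoff_s, serial, arrival_rate=100.0):
    # Toy model: how many requests hit the server during window_s seconds?
    if serial:
        # Closed loop: a serial client waits out its own backoff before
        # its next attempt, so longer backoff means fewer requests sent.
        return int(window_s / backoff_s)
    # Open loop: new clients keep arriving at a fixed rate, regardless
    # of how long earlier clients are backing off.
    return int(window_s * arrival_rate)
```

&lt;p&gt;In the serial (closed-loop) case, doubling the backoff halves the offered load; in the open-loop case, changing the backoff changes nothing about the arrival rate.&lt;/p&gt;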

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/backoff_sim_results.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But the news is not all bad. As soon as we have a limited number of serial clients, we see that backoff is effective at avoiding amplification of retries, &lt;em&gt;and&lt;/em&gt; at reducing the number of first tries. In this case, backoff is very effective at improving the behavior of the system. The results for adaptive retry are similar, and show that backoff is similarly useful.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/limited_client_backoff_results.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; You may be wondering here about famous and super successful systems like TCP and Ethernet CSMA/CD which use backoff approaches like exponential backoff and AIMD very effectively. The same reasoning applies to them: their backoff strategies are only effective because the number of clients is relatively small, and slowing clients down reduces overall work in the system (helping it find a new dynamic set point).&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; It may even delay recovery after overload because those deferred backoffs are a kind of implicit queue of work that needs to be done before the system is fully recovered.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Again, this is what TCP does.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; I keep saying &lt;em&gt;short&lt;/em&gt; and &lt;em&gt;long&lt;/em&gt; without a lot of details. Roughly, &lt;em&gt;short&lt;/em&gt; means on the order of the time clients are willing to wait, and &lt;em&gt;long&lt;/em&gt; means longer than that.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Getting into formal specification, and getting my team into it too</title>
      <link>http://brooker.co.za/blog/2022/07/29/getting-into-tla.html</link>
      <pubDate>Fri, 29 Jul 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/07/29/getting-into-tla</guid>
      <description>&lt;h1 id=&quot;getting-into-formal-specification-and-getting-my-team-into-it-too&quot;&gt;Getting into formal specification, and getting my team into it too&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Getting started is the hard part&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sometimes I write long email replies to people at work asking me questions. Sometimes those emails seem like they could be useful to more than just the recipient. This is one of those emails: a reply to a software engineer asking me how they could adopt formal specification in their team, and how I got into it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Sometime around 2011 I was working on some major changes to the EBS control plane. We had this &lt;em&gt;anti-entropy&lt;/em&gt; system, which had the job of converging the actual system state (e.g. the state of the volumes on the storage fleet, and clients on the EC2 fleet&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) with the intended system state in the control plane (e.g. the customer requested that this volume is deleted). We had a mess of ad-hoc code that took four sources of state (two storage servers, one EC2 client, the control plane), applied a lot of logic, and tried to figure out the steps to take to converge the states. Lots and lots of code. Debugging it was hard, and bugs were frequent.&lt;/p&gt;

&lt;p&gt;The most painful part, I think, wasn’t that the bugs were frequent. It’s that they came in bursts. The code would behave for months, then there would be a network partition, or a change in another system, and loads of weird stuff would happen all at once. Then we’d try to fix something, and it’d just break in another way.&lt;/p&gt;

&lt;p&gt;So we all took a day and drew up a huge state table on this big whiteboard in the hall, and circles and arrows showing the state transitions we wanted. A day well spent: we simplified the code significantly, and whacked a lot of bugs. But I wanted to do better. Specifically, I wanted to be able to know whether this mess of circles and arrows would always converge the state. I went looking for tools, and found and used &lt;a href=&quot;https://alloytools.org/&quot;&gt;Alloy&lt;/a&gt; for a while. Then Marc Levy introduced me to &lt;a href=&quot;https://spinroot.com/spin/whatispin.html&quot;&gt;Spin&lt;/a&gt;, which I used for a while but never became particularly comfortable with.&lt;/p&gt;
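&lt;p&gt;The whiteboard question, can every reachable state still reach the converged state, is easy to state in miniature. Here’s a toy sketch in Python (the function, states, and transitions are invented for illustration; tools like Alloy, Spin, and TLA+ do this kind of state-space exploration properly, at scale):&lt;/p&gt;

```python
def can_always_converge(initial, transitions, goal):
    # Toy convergence check: can every state reachable from `initial`
    # still reach `goal`? `transitions` maps a state to its successors.
    # Forward pass: which states are reachable from the initial state?
    reachable, frontier = set(), [initial]
    while frontier:
        s = frontier.pop()
        if s not in reachable:
            reachable.add(s)
            frontier.extend(transitions.get(s, ()))
    # Backward pass: which states can eventually reach the goal?
    can_reach = {goal}
    changed = True
    while changed:
        changed = False
        for s, nexts in transitions.items():
            if s not in can_reach and can_reach.intersection(nexts):
                can_reach.add(s)
                changed = True
    return reachable.issubset(can_reach)
```

&lt;p&gt;Even this toy shows the shape of the win: a counterexample is a concrete reachable state the mess of circles and arrows can never converge from.&lt;/p&gt;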

&lt;p&gt;The next year we were trying to reason through some changes to replication in EBS, and especially the control plane’s role in ensuring correctness&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. I was struggling to use Alloy to demonstrate the properties I cared about&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. As something of a stroke of luck, I went to a talk by Chris Newcombe and Tim Rath titled “Debugging Designs” about their work applying formal specification to DynamoDB and Aurora. That talk gave me the tool I needed: &lt;a href=&quot;https://lamport.azurewebsites.net/tla/tla.html&quot;&gt;TLA+&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Over the next couple years, I used TLA+ heavily on EBS, and got a couple of like-minded folks into it too. It resonated best with people who saw the same core problem I did: it was too hard to get the kinds of distributed software we were building right, and testing wasn’t solving our problems. I think of this as a kind of mix of hubris (&lt;em&gt;software can be correct&lt;/em&gt;), humility (&lt;em&gt;I can’t write correct software&lt;/em&gt;) and laziness (&lt;em&gt;I don’t want to fix this again&lt;/em&gt;). Some people just didn’t believe that it was a battle that could be won, and some hadn’t yet burned their fingers enough to believe they couldn’t win it without help.&lt;/p&gt;

&lt;p&gt;Somewhere along the line, Chris led us in writing the paper that became &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-services-uses-formal-methods/fulltext&quot;&gt;How Amazon Web Services Uses Formal Methods&lt;/a&gt;, which appeared on Leslie Lamport’s website in 2014 and eventually in CACM in 2015. We spent some time with Leslie Lamport talking about the paper (which was a real thrill), and he wrote &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184705-who-builds-a-house-without-drawing-blueprints/fulltext&quot;&gt;Who Builds a House Without Drawing Blueprints?&lt;/a&gt;, framing our paper. I also tried to convince him that TLA+ would be nicer to write with a Scheme-style s-expression syntax&lt;sup&gt;&lt;a href=&quot;#foot7&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. He didn’t buy it.&lt;/p&gt;

&lt;p&gt;Since then, I’ve used TLA+ to specify core properties of things I care about in every team I’ve been on at AWS. More replication work in EBS, state convergence work in Lambda, better configuration distribution protocols, trying to prevent VM snapshots &lt;a href=&quot;https://arxiv.org/abs/2102.12892&quot;&gt;returning duplicate random numbers&lt;/a&gt;, and now a lot of work in distributed databases. Byron Cook, Neha Rungta&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, Murat Demirbas, and many other people who are actual formal methods experts (unlike me) joined, and have been doing some great work across the company. Overall, I probably reach for TLA+ (or, increasingly, &lt;a href=&quot;https://github.com/p-org/P&quot;&gt;P&lt;/a&gt;) every couple months, but when I do it adds a lot of value. Teams around me are looking at &lt;a href=&quot;https://github.com/awslabs/shuttle&quot;&gt;Shuttle&lt;/a&gt; and &lt;a href=&quot;https://github.com/dafny-lang/dafny&quot;&gt;Dafny&lt;/a&gt;, and some other tools. And, of course, there’s the work S3 continues to do on &lt;a href=&quot;https://www.amazon.science/publications/using-lightweight-formal-methods-to-validate-a-key-value-storage-node-in-amazon-s3&quot;&gt;lightweight formal methods&lt;/a&gt;. I’m also using &lt;a href=&quot;https://brooker.co.za/blog/2022/04/11/simulation.html&quot;&gt;simulation&lt;/a&gt; more and more (or getting back into it, my PhD work was focused on simulation).&lt;/p&gt;

&lt;p&gt;So how do you get into it? First, recognize that it’s going to take some time. P is a little easier to pick up, but TLA+ does take a bit of effort to learn&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. It also requires some math. Not a lot - just logic and basic set theory - but some. For me, spending that effort requires a motivating example. The best ones are where there’s a clear customer benefit to improving the quality of the code, the problem is a tricky distributed protocol or security boundary&lt;sup&gt;&lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; or something else that really really needs to be right, and there’s a will to get it right. Sometimes, you have to create the will. Talk about the risks of failure, and how teams across the company have found it hard to build correct systems without formal specification. Get people on your side. Find the folks in the team with the right level of hubris and humility, and try to get them excited to join you.&lt;/p&gt;

&lt;p&gt;Whether formal specification will be worth it depends a lot on your problems. I’ve mostly used it for distributed and concurrent protocols. Tricky business logic (like the volume state merge I mentioned) can definitely benefit. I’m not very experienced in code verification, but clearly there’s a lot of value in tools that can reason directly about code. I’ve been meaning to get into that when I have some time. But mostly, you need to have an example where correctness really matters to your customers, your business, or your team. Those aren’t hard to find around here, but there may not be many of them near you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; If you’re interested in what these words mean, Marc Olson and Prarthana Karmaker did a talk at ReInvent 2021 titled &lt;a href=&quot;https://www.youtube.com/watch?v=kaWzAEVZ6k8&quot;&gt;Amazon EBS under the hood: A tech deep dive&lt;/a&gt;. Some of the background is also covered in our &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Millions of Tiny Databases&lt;/a&gt; paper.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; This work eventually morphed into Physalia, as we describe in &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Millions of Tiny Databases&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; My choice of Alloy was inspired by reading Pamela Zave’s &lt;a href=&quot;http://www.pamelazave.com/chord.html&quot;&gt;work on Chord&lt;/a&gt;, especially &lt;a href=&quot;http://www.pamelazave.com/chord-ccr.pdf&quot;&gt;Using Lightweight Modeling To Understand Chord&lt;/a&gt;, but it’s never felt like the right tool for that kind of job. It’s really nice for other things, though.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; There’s a nice interview with Neha about her career path &lt;a href=&quot;https://www.amazon.science/working-at-amazon-from-nasa-ames-research-center-to-automated-reasoning-group-aws-neha-rungta&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Although resources like Hillel Wayne’s &lt;a href=&quot;https://www.learntla.com/&quot;&gt;Learn TLA+&lt;/a&gt; have made it a lot more approachable. Lamport’s &lt;a href=&quot;https://smile.amazon.com/Specifying-Systems-Language-Hardware-Engineers/dp/032114306X/&quot;&gt;Specifying Systems&lt;/a&gt; isn’t a hard book, and is well worth picking up, but doesn’t hold your hand.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; See, for example, the work the Kani folks have done on Firecracker in &lt;a href=&quot;https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html&quot;&gt;Using the Kani Rust Verifier on a Firecracker Example&lt;/a&gt;, or this video with &lt;a href=&quot;https://www.youtube.com/watch?v=J9Da3VsLH44&quot;&gt;Byron Cook talking about formal methods and security&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot7&quot;&gt;&lt;/a&gt; I still don’t like the TLA+ syntax. It’s nice to read, but the whitespace rules are weird, and the operators are a bit weird, and I think that makes it less accessible for no particularly good reason. And don’t get me started on the printed documentation using a different character set (e.g. real ∃, ∀, ∈ rather than their escaped variants). It seems like a minor thing, but boy did I find it challenging starting out.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>The DynamoDB paper</title>
      <link>http://brooker.co.za/blog/2022/07/12/dynamodb.html</link>
      <pubDate>Tue, 12 Jul 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/07/12/dynamodb</guid>
      <description>&lt;h1 id=&quot;the-dynamodb-paper&quot;&gt;The DynamoDB paper&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;The other database called Dynamo&lt;/p&gt;

&lt;p&gt;This week at USENIX ATC’22, a group of my colleagues&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; from the AWS DynamoDB team are going to be presenting their paper &lt;a href=&quot;https://www.usenix.org/conference/atc22/presentation/vig&quot;&gt;Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service&lt;/a&gt;. This paper is a rare look at a real-world distributed system that runs at massive scale.&lt;/p&gt;

&lt;p&gt;From the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In 2021, during the 66-hour Amazon Prime Day shopping event, Amazon systems … made trillions of API calls to DynamoDB, peaking at 89.2 million requests per second&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;89 million requests per second is a big database by any standards (and that’s just Amazon’s use of DynamoDB)!&lt;/p&gt;

&lt;p&gt;What’s exciting for me about this paper is that it covers DynamoDB’s journey, and how it has changed over time to meet customers’ needs. There are relatively few papers that cover this kind of change over time. For example:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The uniform distribution of throughput across partitions is based on the assumptions that an application uniformly accesses keys in a table and the splitting a partition for size equally splits the performance. However, we discovered that application workloads frequently have non-uniform access patterns both over time and over key ranges. When the request rate within a table is non-uniform, splitting a partition and dividing performance allocation proportionately can result in the hot portion of the partition having less available performance than it did before the split. Since throughput was allocated statically and enforced at a partition level, these non-uniform workloads occasionally resulted in an application’s reads and writes being rejected, called throttling, even though the total provisioned throughput of the table was sufficient to meet its needs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the kind of assumption in a system design—that splitting makes performance better—that’s really easy to overlook when designing a system, and potentially difficult to fix when you’re in production. A lot of what makes systems like DynamoDB so useful is that they have these lessons baked-in, and the folks who’re using them don’t need to learn the same lesson themselves.&lt;/p&gt;

&lt;p&gt;A key little bit of history&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;These architectural discussions culminated in Amazon DynamoDB, a public service launched in 2012 that shared most of the name of the previous Dynamo system but little of its architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reading the rest of the DynamoDB paper you can see the influence that Dynamo had, but also some major differences in the architecture. Most notable, probably, is that DynamoDB uses multi-Paxos&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; for keeping replicas in sync:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The replicas for a partition form a replication group. The replication group uses Multi-Paxos for leader election and consensus.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and a fairly straightforward leader election model for consistent reads and writes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Only the leader replica can serve write and strongly consistent read requests. Upon receiving a write request, the leader of the replication group for the key being written generates a write-ahead log record and sends it to its peer (replicas). … Any replica of the replication group can serve eventually consistent reads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Like most big systems at AWS, the DynamoDB team is using formal methods (specifically TLA+) to specify and model check core parts of their system:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We use formal methods extensively to ensure the correctness of our replication protocols. The core replication protocol was specified using TLA+.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Caches and Metastability&lt;/strong&gt;&lt;a name=&quot;metastable&quot;&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another great lesson from the paper is a reminder about the risks of caches (see &lt;a href=&quot;https://brooker.co.za/blog/2021/08/27/caches.html&quot;&gt;Caches, Modes, and Unstable Systems&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When a router received a request for a table it had not seen before, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changes, the cache hit rate was approximately 99.75 percent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What’s not to love about a 99.75% cache hit rate? The failure modes!&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The downside is that caching introduces bimodal behavior. In the case of a cold start where request routers have empty caches, every DynamoDB request would result in a metadata lookup, and so the service had to scale to serve requests at the same rate as DynamoDB&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So this metadata table needs to scale from handling 0.25% of requests, to handling 100% of requests. A 400x potential increase in traffic! Designing and maintaining something that can handle rare 400x increases in traffic is super hard. To address this, the DynamoDB team introduced a distributed cache called MemDS.&lt;/p&gt;
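&lt;p&gt;That 400x figure is just the reciprocal of the cache miss rate:&lt;/p&gt;

```python
# With a 99.75% hit rate, the metadata fleet sees 0.25% of requests in
# steady state, but 100% of them on a cold start.
hit_rate = 0.9975
steady_fraction = 1.0 - hit_rate   # fraction of requests reaching metadata
cold_fraction = 1.0                # every request misses on a cold start
amplification = cold_fraction / steady_fraction
print(round(amplification))  # prints 400
```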

&lt;blockquote&gt;
  &lt;p&gt;A new partition map cache was deployed on each request router host to avoid the bi-modality of the original request router caches.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which leads to more background work, but less amplification in the failure cases.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The constant traffic to the MemDS fleet increases the load on the metadata fleet compared to the conventional caches where the traffic to the backend is determined by cache hit ratio, but prevents cascading failures to other parts of the system when the caches become ineffective.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These cascading failures can lead to &lt;a href=&quot;https://brooker.co.za/blog/2021/05/24/metastable.html&quot;&gt;metastable failure modes&lt;/a&gt;, and so preventing them architecturally and getting closer to &lt;a href=&quot;https://aws.amazon.com/builders-library/reliability-and-constant-work/&quot;&gt;constant work&lt;/a&gt; is important. Again, this is the kind of insight that comes from having run big systems for a long time, and a big part of the value that’s baked into DynamoDB.&lt;/p&gt;

&lt;p&gt;Check out the paper. If you’re interested in databases, distributed systems, or the realities of running at-scale systems, it’s well worth your time!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig (as we often do at AWS, this list is in alphabetical order, not the typical academic “first author” order you may be most familiar with).&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Referring here to the system described in De Candia et al, &lt;a href=&quot;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo: Amazon’s Highly Available Key-value Store&lt;/a&gt;. That paper is rightfully quite famous and influential.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Paxos, as usual, appearing as the bottom turtle of a scale-out system.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Formal Methods Only Solve Half My Problems</title>
      <link>http://brooker.co.za/blog/2022/06/02/formal.html</link>
      <pubDate>Thu, 02 Jun 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/06/02/formal</guid>
      <description>&lt;h1 id=&quot;formal-methods-only-solve-half-my-problems&quot;&gt;Formal Methods Only Solve Half My Problems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;At most half my problems. I have a lot of problems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The following is a one-page summary I wrote as a submission to &lt;a href=&quot;http://hpts.ws/&quot;&gt;HPTS’22&lt;/a&gt;. Hopefully it’s of broader interest.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Formal methods, like TLA+ and P, have proven to be extremely valuable to the builders of large scale distributed systems&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, and to researchers working on distributed protocols. In industry, these tools typically aren’t used for full verification. Instead, effort is focused on interactions and protocols that engineers expect to be particularly tricky or error-prone. Formal specifications play multiple roles in this setting, from bug finding in final designs, to accelerating exploration of the design space, to serving as precise documentation of the implemented protocol. Typically, verification or model checking of these specifications is focused on safety and liveness. This makes sense: safety violations cause issues like data corruption and loss which are correctly considered to be among the most serious issues with distributed systems. But safety and liveness are only a small part of a larger overall picture. Many of the questions that designers face can’t be adequately tackled with these methods, because they lie outside the realm of safety, liveness, and related properties.&lt;/p&gt;

&lt;p&gt;What latency can customers expect, on average and in outlier cases? What will it cost us to run this service? How do those costs scale with different usage patterns, and dimensions of load (data size, throughput, transaction rates, etc)? What type of hardware do we need for this service, and how much? How sensitive is the design to network latency or packet loss? How do availability and durability scale with the number of replicas? How will the system behave under overload?&lt;/p&gt;

&lt;p&gt;We address these questions with prototyping, closed-form modelling, and with simulation. Prototyping, and benchmarking those prototypes, is clearly valuable but too expensive and slow to be used at the exploration stage. Developing prototypes is time-consuming, and prototypes tend to conflate core design decisions with less-critical implementation decisions. Closed-form modelling is useful, but becomes difficult when systems become complex. Dealing with that complexity sometimes requires assumptions that reduce the validity of the results. Simulations, generally Monte Carlo and Markov Chain Monte Carlo simulations, are among the most useful tools. Like prototypes, good simulations require a lot of development effort, and there’s a lack of widely-applicable tools for simulating system properties in distributed systems. Simulation results also tend to be sensitive to modelling assumptions, in ways that require additional effort to explore. Despite these challenges, simulations are widely used, and have proven very useful. Systems and database research approaches are similar: prototyping (sometimes with frameworks that make prototyping easier), some symbolic models, and some modelling and simulation work&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;What I want is tools that do both: tools that allow development of formal models in a language like PlusCal or P, model checking of critical properties, and then allow us to ask those models questions about design performance. Ideally, those tools would allow real-world data on network performance, packet loss, and user workloads to be used, alongside parametric models. The ideal tool would focus on sensitivity analyses that show how various system properties vary with changing inputs, and with changing modelling assumptions. These types of analyses are useful both in guiding investments in infrastructure (“how much would halving network latency reduce customer perceived end-to-end latency?”), and in identifying risks of designs (like finding workloads that perform surprisingly poorly).&lt;/p&gt;

&lt;p&gt;This is an opportunity for the formal methods community and systems and database communities to work together. Tools that help us explore the design space of systems and databases, and provide precise quantitative predictions of design performance, would be tremendously useful to both researchers and industry practitioners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Later commentary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This gap is one small part of a larger gap in the way that we, as practitioners, design and build distributed systems. While we have some in-the-small quantitative approaches (e.g. reasoning about device and network speeds and feeds), and some widely-used modelling approaches (e.g. Markov modelling of storage and erasure code durability), most of our engineering approach is based on experience and opinion. Or, worse, à la mode best-practices or “that’s how it was in the 70s” curmudgeonliness. Formal tools have, in the teams around me, turned a lot of the strict correctness arguments into quantitative arguments. Mental models like &lt;a href=&quot;https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf&quot;&gt;CAP&lt;/a&gt;, &lt;a href=&quot;https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf&quot;&gt;PACELC&lt;/a&gt;, and &lt;a href=&quot;https://arxiv.org/pdf/1901.01930.pdf&quot;&gt;CALM&lt;/a&gt; have provided ways for people to reason semi-formally about tradeoffs. But I haven’t seen a similar transition for other properties, like latency and scalability, and it seems overdue.&lt;/p&gt;

&lt;p&gt;Quantitative design has three benefits: it gives us a higher chance of finding designs that work, it forces us to think through requirements very crisply, and it allows us to explore the design space nimbly. We’ve very successfully applied techniques like prototyping and &lt;a href=&quot;https://brooker.co.za/blog/2022/04/11/simulation.html&quot;&gt;ad hoc simulation&lt;/a&gt; to create a partially quantitative design approach, but it seems like it’s time for broadly applicable tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; See, for example &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3477132.3483540&quot;&gt;Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3&lt;/a&gt;, and &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-services-uses-formal-methods/fulltext&quot;&gt;How Amazon Web Services Uses Formal Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; E.g. the classic &lt;a href=&quot;https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf&quot;&gt;Concurrency control performance modeling: alternatives and implications&lt;/a&gt;, from 1987.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>What is a simple system?</title>
      <link>http://brooker.co.za/blog/2022/05/03/simplicity.html</link>
      <pubDate>Tue, 03 May 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/05/03/simplicity</guid>
      <description>&lt;h1 id=&quot;what-is-a-simple-system&quot;&gt;What is a simple system?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Is this pretentious?&lt;/p&gt;

&lt;p&gt;Why do I need cryptography when I could simply hide the contents of my communications by rotating every letter by 13? Why do I need a distributed storage system when I could simply store my files on this one server? Why do I need a database when I could simply use a flat file?&lt;/p&gt;

&lt;p&gt;Do any of those things, and feel joy in a job well done. A simple solution. Perhaps you’re hiding your communications from a child, storing little data with low value, and avoiding concurrency. Simplicity in a goal achieved. But useless in the face of an adult adversary, or a desire for persistence beyond the fallibility of hardware, or even of two people trying to do a job at once.&lt;/p&gt;

&lt;p&gt;This presents us with something of a challenge: we know that simplicity is good, and excess simplicity is useless.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Everything should be as simple as can be,
Says Einstein,
But not simpler.&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Indeed, but that doesn’t get us much further in understanding how simple things can be. It is only possible to evaluate simplicity in the context of the complete closure of the world. The problems a system solves, technical, organizational, educational, and historical. This presents a problem, due to the difficulties of encapsulating the world.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When we try to pick out anything by itself, we find it hitched to everything else in the Universe.&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Brooks turned to Aristotle to attempt to answer this question, and found complexities accidental and essential. The accidental complexities are introduced by our human failings: ignorance, pride, and curiosity. The essential are produced by our environment, as separate from ourselves and our technology. A perfect jewel, dirtied only by our clumsy hands. We are encouraged to avoid excessive curiosity, as if learning may take us further from the light of perfect simplicity. Could a perfect craftsman with perfect tools produce a perfect product?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Successive theories in any mature science will be such that they ‘preserve’ the theoretical relations and the apparent referents of earlier theories (i.e., earlier theories will be ‘limiting cases’ of later theories).&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simplicity exists in a historical context. It refers to the needs handed down by the sages of the past, whether or not we understand their wisdom. C and Unix are simple. Any appearance that they solve problems no longer relevant, or fail to solve the problems of today, is simply due to your ignorance. Windows and Excel are complicated, and any semblance of simplicity is illusion.&lt;/p&gt;

&lt;p&gt;Simplicity may also refer to our ability to shed the demands of the past, and focus only on the transient present and rapidly approaching future. Immediately deny the past, its lessons irrelevant and theories inapplicable. Simplicity is achieved by the new. C is a tool too simple for our adversarial world, and too complex for our abstracted one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Looks pretty much the same, yeah,” Armstrong replied of the lunar module. “You know, there’s an old saying in aviation that ‘if it looks good it flies good.’ And this has to be the exception to the rule. Because it flew very well. But it is probably the ugliest flying machine that was ever designed.”&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simplicity then, exists only in the eye of the beholder. Hopefully more like a contact lens, and less like an eyelash. Despite the correlation of appearance and result, exceptions abound. Ugly and functional. Beautiful but useless. Exceptions, but common enough to challenge any definition based on aesthetics alone.&lt;/p&gt;

&lt;p&gt;Each culture, company, team, and organization has their own aesthetic sense. What’s simpler: Go, Scheme, or assembly?&lt;/p&gt;

&lt;p&gt;Goodhart warns that attempts to measure simplicity, let alone success, will lead to the measurements becoming useless. After all, any attempt to quantify anything about software or systems is doomed to inevitable failure, and we may as well not try. Opinion is better than data, especially mine.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If we postulate, and we just have, that within un-, sub- or supernatural forces the probability is that the law of probability will not operate as a factor&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Defining simplicity as outcome fails when faced with the improbable. Simplicity is easily achieved within reasonable probabilities, and fails when probabilities become unreasonable. The guard on the grinder and the harness on the climber are accidental complexity, or maybe simplicity only in the face of accidents. Complexity that handles the improbable is only complexity until it becomes essential. Good luck makes everything look unnecessarily complex.&lt;/p&gt;

&lt;p&gt;What is a simple system?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; My copy of Zukofsky’s “A” quotes &lt;a href=&quot;https://en.wikipedia.org/wiki/Hugh_Kenner&quot;&gt;Hugh Kenner&lt;/a&gt; as calling it “the most hermetic poem in English”. I haven’t read every poem in English, and so can’t vouch for this superlative.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; From Muir, &lt;a href=&quot;https://vault.sierraclub.org/john_muir_exhibit/writings/misquotes.aspx#1&quot;&gt;apparently accurately&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; From Laudan’s &lt;a href=&quot;https://philosophy.hku.hk/courses/dm/phil2130/AConfutationOfConvergentRealism2_Laudan.pdf&quot;&gt;A Confutation of Convergent Realism&lt;/a&gt;, which made a big dent in my world view in my early 20s.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; From &lt;a href=&quot;https://www.cbsnews.com/news/man-on-the-moon-50th-anniversary-of-the-apollo-11-landing-cbs-news-special/&quot;&gt;“Man on the Moon”: The 50th anniversary of the Apollo 11 landing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Guildenstern, on their way to a sticky end.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Simple Simulations for System Builders</title>
      <link>http://brooker.co.za/blog/2022/04/11/simulation.html</link>
      <pubDate>Mon, 11 Apr 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/04/11/simulation</guid>
      <description>&lt;h1 id=&quot;simple-simulations-for-system-builders&quot;&gt;Simple Simulations for System Builders&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Even the most basic numerical methods can lead to surprising insights.&lt;/p&gt;

&lt;p&gt;It’s no secret that I’m a big fan of formal methods. I use &lt;a href=&quot;https://github.com/p-org/P&quot;&gt;P&lt;/a&gt; and &lt;a href=&quot;https://lamport.azurewebsites.net/tla/tla.html&quot;&gt;TLA+&lt;/a&gt; often. I like these tools because they provide clear ways to communicate about even the trickiest protocols, and allow us to use computers to reason about the systems we’re designing before we build them&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. These tools are typically focused on safety (&lt;em&gt;Nothing bad happens&lt;/em&gt;) and liveness (&lt;em&gt;Something good happens (eventually)&lt;/em&gt;)&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Safety and liveness are crucial properties of systems, but far from being all the properties we care about. As system designers we typically care about many other things that aren’t strictly safety or liveness properties. For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;What latency can customers expect, on average and in outlier cases?&lt;/li&gt;
  &lt;li&gt;What will it cost us to run this service?&lt;/li&gt;
  &lt;li&gt;How do those costs scale with different usage patterns, and dimensions of load (data size, throughput, transaction rates, etc)?&lt;/li&gt;
  &lt;li&gt;What type of hardware do we need for this service, and how much?&lt;/li&gt;
  &lt;li&gt;How sensitive is the design to network latency or packet loss?&lt;/li&gt;
  &lt;li&gt;How do availability and durability scale with the number of replicas?&lt;/li&gt;
  &lt;li&gt;How will the system behave under overload?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formal tools we typically use don’t do a great job of answering these questions. There are many ways to answer them, of course, from closed-form analysis&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; to prototyping. One of my favorite approaches is one I call &lt;em&gt;simple simulation&lt;/em&gt;: writing small simulators that simulate the behavior of simple models, where the code can be easily read, reviewed, and understood by people who aren’t experts on simulation or numerical methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Quick Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you hang around with skiers or snowboarders, you’ll have heard a lot of talk over the last couple of winters about how crowded resorts have become, and how much time they now spend waiting to ride the ski lift&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Resort operators say that visits have been up only quite modestly, but skiers are seeing much longer waits. Is somebody lying? Or could we see significant increases in wait times with only modest increases in traffic?&lt;/p&gt;

&lt;p&gt;To help explore this question, I wrote a small &lt;a href=&quot;https://github.com/mbrooker/simulator_example&quot;&gt;example simulator in Python&lt;/a&gt; which you can check out.&lt;/p&gt;

&lt;p&gt;It starts off by building a model of each skier, who can be in one of three states: skiing down the hill, queuing, or riding up on the lift:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;      +-------------------------------------------+       
      |                                           |       
      v                                           +       
+-------------+      +-------------+      +-------------+
|   Waiting   |-----&amp;gt;| Riding Lift |-----&amp;gt;|   Skiing    |
+-------------+      +-------------+      +-------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then, it models the chair fairly explicitly, pulling folks from the queue and delivering them to the top of the mountain after a delay. Each skier, lift, and slope creates some events, which the simulation simply reacts to in virtual time order. The whole thing comes out to about 170 lines, with loads of comments.&lt;/p&gt;
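&lt;p&gt;A stripped-down version of that event loop fits in a couple of dozen lines. The sketch below follows the same three-state model, but the parameters (chair period, chair size, ride time, mean run time) and all the names are illustrative values of mine, not the ones from the linked simulator:&lt;/p&gt;

```python
import heapq
import random

def ski_fraction(n_skiers, sim_time=4 * 3600, chair_period=6.0, chair_size=4,
                 ride_time=300.0, mean_ski_time=240.0, seed=1):
    """Fraction of total skier-time spent skiing (vs. queuing or riding)."""
    rng = random.Random(seed)
    waiting = 0        # skiers currently in the lift queue
    events = []        # (time, kind) min-heap, popped in virtual-time order
    skiing = 0.0

    def ski_down(start):
        nonlocal skiing
        d = rng.expovariate(1.0 / mean_ski_time)
        d = min(d, max(0.0, sim_time - start))   # clip runs at end of day
        skiing += d
        heapq.heappush(events, (start + d, 'join_queue'))

    for _ in range(n_skiers):                    # everyone starts with a run
        ski_down(0.0)
    heapq.heappush(events, (chair_period, 'chair'))

    while events:
        t, kind = heapq.heappop(events)
        if t >= sim_time:
            break
        if kind == 'join_queue':
            waiting += 1
        elif kind == 'top':                      # reached the top: ski back down
            ski_down(t)
        else:                                    # a chair arrives at the bottom
            for _ in range(min(chair_size, waiting)):
                waiting -= 1
                heapq.heappush(events, (t + ride_time, 'top'))
            heapq.heappush(events, (t + chair_period, 'chair'))

    return skiing / (n_skiers * sim_time)
```

&lt;p&gt;Everything is driven off a single heap of (time, event) pairs, which is all an event-driven simulator of this kind really is. Sweeping &lt;code&gt;n_skiers&lt;/code&gt; over a range and plotting the returned fraction reproduces the shape of the experiment described below.&lt;/p&gt;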

&lt;p&gt;That’s simple enough, but can we learn anything from it?&lt;/p&gt;

&lt;p&gt;It turns out that, despite the extreme simplicity of the model, the results are interesting and run a little bit counter to our intuition. For example, here’s the result showing the percentage of time each skier spends skiing, versus the number of virtual skiers in our simulation:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/ski_percent_time.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I suspect that most people’s intuition would have this as a fairly linear relationship, and the pronounced &lt;em&gt;knee&lt;/em&gt; in the curve would be a surprise. I don’t know what the realities are of ski resort attendance, but these simulations do suggest that it’s plausible that small increases in attendance could lead to long wait times.&lt;/p&gt;

&lt;p&gt;As another example, my &lt;a href=&quot;https://brooker.co.za/blog/2021/10/20/simulation.html&quot;&gt;post on Serial, Parallel and Quorum Latencies&lt;/a&gt; is powered by a simple simulator.&lt;/p&gt;

&lt;p&gt;It’s exactly these kinds of small insights that bring me back to building small simulators over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I get started?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start simply. You can use any programming language you like (I tend to reach for Python first), don’t need to learn any frameworks or libraries (although there are some good ones), and often don’t have to write more than a few tens of lines of code. The coding side, in other words, is relatively easy.&lt;/p&gt;

&lt;p&gt;The hard part is &lt;em&gt;modeling&lt;/em&gt;. Simply, coming up with an abstract model of your system and its actors, and choosing what to include and what to exclude. What’s important and what’s irrelevant. What’s the big picture, and what’s detail. The success of simulations of all sizes depends on making good choices here.&lt;/p&gt;

&lt;p&gt;Think about the ski lift example. I modeled skier speed variations, and lift speed variations, and the periodic arrival of chairs. I didn’t model weather, or fatigue, or lunch time, or any one of many other factors that could change the result. Are those important? Maybe! But to answer our core question (“is it plausible that small increases in visits could lead to large increases in waiting?”) it didn’t seem like we needed to include them.&lt;/p&gt;

&lt;p&gt;Then, when you have the model, convert it to code. I like to do this as literally and straightforwardly as possible. It’s very attractive to build in some abstraction that simplifies the code at the cost of obscuring the model. I avoid that as much as possible: being able to correlate the model and the code seems important to helping other people understand the assumptions. Our goal is to make the model and its assumptions obvious, not obscured.&lt;/p&gt;

&lt;p&gt;Finally, explore and test. Play with the parameters and see what happens. Compare your intuition to the results. Look at the data coming out of the simulation. Try simple cases and check if they match. Validate against real systems if you can. How much effort to spend here depends a lot on how much is riding on the simulation being exact, but at least some validation is always warranted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what about….&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Simple simulations aren’t the last word in computational or numerical methods. You can write simulations that are arbitrarily sophisticated, very carefully validated, and exquisitely crafted. Depending on what you’re trying to do that may be worth the effort. But I’ve seen a lot of people avoid reaching for simulation at all under the assumption that they have to be sophisticated. Often, you don’t. In the majority of cases I’ve seen, the results are robust, validation is fairly simple, and simplicity beats sophistication. Don’t let the depth of the field dissuade you from getting started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; I was also one of the authors on &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-services-uses-formal-methods/fulltext&quot;&gt;How Amazon Web Services Uses Formal Methods&lt;/a&gt; which appeared in CACM back in 2015. Also check out the introduction/framing Leslie Lamport wrote in the same issue: &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184705-who-builds-a-house-without-drawing-blueprints/fulltext&quot;&gt;Who Builds a House Without Drawing Blueprints?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; These succinct descriptions of safety and liveness come from &lt;a href=&quot;https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf&quot;&gt;Defining Liveness&lt;/a&gt; by Alpern and Schneider, which is well worth reading if you’re interested in going deeper on what liveness means.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; For example, modelling the durability of replicated and erasure-coded storage systems can be done fairly easily in closed-form (see, for example &lt;a href=&quot;https://dominoweb.draco.res.ibm.com/reports/rj10391.pdf&quot;&gt;Notes on Reliability Models for Non-MDS Erasure Codes&lt;/a&gt;). The benefit is that the models are nice and clean and can be thrown in a spreadsheet. The downside is that they get complex quickly when you try include things like non-MDS erasure codes, correlated failure, and so on. The messy realities of life complicate modelling.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Often this increase in traffic has been blamed on lower pass or ticket prices, which seems reasonable to believe. On the other hand, the same folks often complain about how expensive skiing has become. Clearly, the sport is both too cheap and too expensive, a real challenge for resort operators!&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Fixing retries with token buckets and circuit breakers</title>
      <link>http://brooker.co.za/blog/2022/02/28/retries.html</link>
      <pubDate>Mon, 28 Feb 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/02/28/retries</guid>
      <description>&lt;h1 id=&quot;fixing-retries-with-token-buckets-and-circuit-breakers&quot;&gt;Fixing retries with token buckets and circuit breakers&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Throttle yourself before you DoS yourself.&lt;/p&gt;

&lt;p&gt;After my last post on &lt;a href=&quot;https://brooker.co.za/blog/2022/02/16/circuit-breakers.html&quot;&gt;circuit breakers&lt;/a&gt;, a couple of people reached out to recommend using circuit breakers only to break retries, and still send normal first try traffic no matter the failure rate. That’s a nice approach. It provides possible solutions to the core problem with client-side circuit breakers (they may make partial outages worse), and to the retry problem (where retries increase load on already-overloaded downstream services). To see how well that works, we can compare it to my favorite &lt;em&gt;better retries&lt;/em&gt; approach: a token bucket.&lt;/p&gt;

&lt;p&gt;First, let’s formally introduce the players:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;No retries&lt;/strong&gt;. When a client wants to make a call, it makes that call as normal. If it fails, the client moves on without retrying.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;N retries&lt;/strong&gt;. When a client wants to make a call, it makes that call as normal. If it fails, the client makes a maximum of N retries of the call.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Adaptive Retries&lt;/strong&gt; (aka the &lt;strong&gt;retry token bucket&lt;/strong&gt;). When a client wants to make a call, it makes that call as normal. If it succeeds, it drops part of a token into a limited-size &lt;a href=&quot;https://en.wikipedia.org/wiki/Token_bucket&quot;&gt;token bucket&lt;/a&gt;. If the call fails, retry up to N times as long as there are (whole) tokens in the bucket. For example, each success could deposit 0.1 tokens, and each retry could consume 1 token.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Retry circuit breaker&lt;/strong&gt;. When a client wants to make a call, it makes that call as normal. On success or failure, it updates statistics which track the (recent) failure rate. If that failure rate is below a threshold, it retries up to N times. If it’s above the threshold, it doesn’t retry at all.&lt;/li&gt;
&lt;/ul&gt;
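&lt;p&gt;The adaptive token bucket is simple enough to sketch directly. The 0.1-token deposit per success and 1-token cost per retry follow the example above; the bucket size, the class name, and the toy workload are illustrative assumptions of mine:&lt;/p&gt;

```python
import random

class RetryTokenBucket:
    """Adaptive retries: each success deposits a fraction of a token,
    and each retry spends a whole one, bounding retry amplification."""

    def __init__(self, max_tokens=10.0, deposit=0.1, retry_cost=1.0, max_retries=3):
        self.tokens = max_tokens            # start with a full bucket
        self.max_tokens = max_tokens
        self.deposit = deposit
        self.retry_cost = retry_cost
        self.max_retries = max_retries

    def call(self, attempt):
        """attempt is a zero-argument callable returning True on success."""
        if attempt():
            self.tokens = min(self.max_tokens, self.tokens + self.deposit)
            return True
        for _ in range(self.max_retries):
            if self.retry_cost > self.tokens:
                return False                # budget exhausted: fail fast
            self.tokens -= self.retry_cost
            if attempt():
                self.tokens = min(self.max_tokens, self.tokens + self.deposit)
                return True
        return False

# Toy workload: at a 50% failure rate the bucket soon empties, so the
# extra load from retries stays small relative to first-try traffic.
rng = random.Random(0)
calls = 0

def flaky():
    global calls
    calls += 1
    return rng.random() > 0.5

bucket = RetryTokenBucket()
ok = sum(bucket.call(flaky) for _ in range(1000))
```

&lt;p&gt;Starting with a full bucket means a short burst of failures can still be retried, but under sustained failure the retry rate is capped at roughly one retry per ten successes, which is the behavior explored below.&lt;/p&gt;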

&lt;p&gt;&lt;strong&gt;Think it through&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let’s try to think through how each of these would perform.&lt;/p&gt;

&lt;p&gt;No retries is the easiest. If the downstream failure rate is x%, the effective failure rate is x%.&lt;/p&gt;

&lt;p&gt;N retries is the next easiest. If the downstream failure rate is x%, the effective failure rate is x&lt;sup&gt;N+1&lt;/sup&gt; (the first try and all N retries must fail), but with significant additional work. At 100% failure rate, the system does 1+N times as much work.&lt;/p&gt;
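&lt;p&gt;Those two claims are easy to check directly, assuming failures are independent (the function names here are mine, for illustration):&lt;/p&gt;

```python
def effective_failure_rate(x, n_retries):
    # A request fails only if the first try and all n_retries retries fail.
    return x ** (1 + n_retries)

def expected_calls_per_request(x, n_retries):
    # Attempt k is made only if the k previous attempts all failed, so the
    # expected call count is the geometric sum 1 + x + x**2 + ... + x**n_retries.
    return sum(x ** k for k in range(1 + n_retries))

# With a 10% failure rate and 3 retries, failures all but vanish (0.01%)
# for about 11% extra work; at 100% failure rate the work is 1 + N = 4x.
```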

&lt;p&gt;The adaptive strategy is a little difficult to think about, but the rough idea is that it behaves like &lt;em&gt;N retries&lt;/em&gt; when failure rates are low, and “some percent retries” when the failure rate is higher. For example, if each successful call puts 10% of a token into the bucket, adaptive behaves like &lt;em&gt;N retries&lt;/em&gt; well below a 10% failure rate, and like “0.1 retries” well above a 10% failure rate.&lt;/p&gt;

&lt;p&gt;The circuit breaker strategy is somewhat similar. At low rates (below the threshold) it behaves like &lt;em&gt;N retries&lt;/em&gt;. Above the threshold it behaves like &lt;em&gt;no retries&lt;/em&gt;. This is a little complicated by the fact that each client doesn’t know the true failure rate, and instead makes its decision based on a local sampling of the failure rate (which may vary substantially from the true rate for small clients).&lt;/p&gt;

&lt;p&gt;Closed-form reasoning about these dynamics is difficult. Instead of trying to reason about it, we can simulate the effects with a small event-driven simulation of a service and clients. I’ll write more in future about this simulation approach, but will start with some results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulating Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s consider a model with a single abstract service, which randomly fails calls at some rate. The service is called by 100 independent clients, each starting new attempts at some rate&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. We’re concerned with two results: the success rate the client sees, and the load the server sees from the clients. In particular, we’re concerned with how those things vary with the failure rate.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/retry_simulation_results.png&quot; alt=&quot;Graph of failure rates and load for four retry strategies&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can immediately see a couple of expected things, and a few interesting things. As expected, &lt;em&gt;no retries&lt;/em&gt; does no extra work, and provides availability that drops linearly with the failure rate. &lt;em&gt;Three retries&lt;/em&gt; does a lot of extra work, and provides the best robustness against errors. The breaker strategy does extra work, and provides extra robustness at low failure rates, but drops down to match &lt;em&gt;no retries&lt;/em&gt; after a threshold.&lt;/p&gt;

&lt;p&gt;Let’s zoom in a bit to the lower rates:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/retry_simulation_results_zoomed.png&quot; alt=&quot;Graph of failure rates and load for four retry strategies&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see the strategies start to diverge. The first interesting observation is that the breaker strategy starts tripping a little early: around half the expected rate. That’s because each client is breaking independently. In this low-failure regime, the &lt;em&gt;adaptive&lt;/em&gt; strategy is very similar to &lt;em&gt;three retries&lt;/em&gt;, but slowly starting to diverge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The effect of client count&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both the &lt;em&gt;adaptive&lt;/em&gt; and &lt;em&gt;circuit breaker&lt;/em&gt; approaches depend on per-client estimates of the failure rate, either expressed explicitly with the circuit breaker failure threshold, or implicitly with the contents of the token bucket. When the number of clients is low, it’s reasonable to expect that these per-client estimates will converge on the true failure rate. With larger numbers of clients sending small volumes of traffic, estimates will vary more widely. This is especially important in serverless and container-based architectures, where clients may be numerous and short-lived, with each doing relatively little work (compared, say, to a multi-threaded monolith where a single client may see the work of very large numbers of threads).&lt;/p&gt;

&lt;p&gt;We can simulate the effects of client count on the performance of our &lt;em&gt;adaptive&lt;/em&gt; and &lt;em&gt;circuit breaker&lt;/em&gt; strategies. Here, we’ve got the same total number of requests divided among 10, 100, and 1000 clients:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/retry_simulation_results_clients.png&quot; alt=&quot;Graph of failure rates and loads for different numbers of clients&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What’s interesting here is that the two approaches have the opposite behavior. The &lt;em&gt;circuit breaker&lt;/em&gt; strategy is tripping too early, and approaching the performance of the &lt;em&gt;no retries&lt;/em&gt; approach. The &lt;em&gt;token bucket&lt;/em&gt; strategy (starting with a full bucket) doesn’t deplete its bucket fast enough, converging on the behavior of &lt;em&gt;n retries&lt;/em&gt;. Clearly, neither does a perfect job of solving the retry problem with limited per-client knowledge. A model with state shared between clients would change these results, but also significantly increase the complexity of the system (because clients would need to discover and talk to each other).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which one is better?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing the right retry strategy depends on what we want to achieve. The ideal is to have a solution with no additional load and 100% success rate no matter the service failure rate. That’s clearly unachievable for a simple reason: clients don’t have any way to know which requests will succeed. The only mechanism they have is trying.&lt;/p&gt;

&lt;p&gt;Short of that ideal, what can we have? What most applications want is to have a high success rate when the server failure rate is low, and not too much additional load. &lt;em&gt;No retries&lt;/em&gt; fails on the first criterion, and &lt;em&gt;N retries&lt;/em&gt; fails on the second. Both the &lt;em&gt;adaptive&lt;/em&gt; and &lt;em&gt;circuit breaker&lt;/em&gt; strategies succeed to different extents. The circuit breaker approach gives no additional load at high failure rates, which is great. But it suffers from some modality (it’s either retrying or not retrying, and might switch back and forth between the two). The &lt;em&gt;adaptive&lt;/em&gt; strategy isn’t modal in the same way, and seems to perform better at lower failure rates, but does give some (tunable) additional load at higher rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; In other words, each client presents independent Poisson-process arrivals, and keeps its own retry state. The Poisson model here isn’t entirely accurate, but doesn’t matter because we’re not (yet) modelling overload or concurrency.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Will circuit breakers solve my problems?</title>
      <link>http://brooker.co.za/blog/2022/02/16/circuit-breakers.html</link>
      <pubDate>Wed, 16 Feb 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/02/16/circuit-breakers</guid>
      <description>&lt;h1 id=&quot;will-circuit-breakers-solve-my-problems&quot;&gt;Will circuit breakers solve my problems?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Maybe, but you need to know what problem you&apos;re trying to solve first.&lt;/p&gt;

&lt;p&gt;A couple of weeks ago, I started a tiny storm on Twitter by posting this image, and claiming that retries (mostly) make things worse in real-world distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/retry_loop.png&quot; alt=&quot;Retry loop, showing how retries make overload conditions worse&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The bottom line is that retries are often triggered by overload conditions, permanent or transient, and tend to make those conditions worse by increasing traffic. Many people replied saying that I’m ignoring the obvious effective solution to this problem: circuit breakers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a circuit breaker?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Way down in your basement, or in a closet, or wherever your local government decrees it to be, there’s a box full of electrical circuit breakers. These circuit breakers have one job&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;: turn off during overload before something else melts, burns, or flashes. They’re pretty great from a “staying alive” perspective. Reasoning by analogy, folks&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; developed the concept of circuit breakers for distributed systems. The goal of circuit breakers is usually defined something like this (from the &lt;a href=&quot;https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker&quot;&gt;Azure docs&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent failures that have occurred, and use this information to decide whether to allow the operation to proceed, or simply return an exception immediately.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or this (from &lt;a href=&quot;https://martinfowler.com/bliki/CircuitBreaker.html&quot;&gt;Martin Fowler&lt;/a&gt;):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.&lt;/p&gt;
&lt;/blockquote&gt;
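&lt;p&gt;That description translates almost directly into code. A minimal sketch (the threshold and reset timeout are illustrative, and production implementations usually model an explicit half-open state):&lt;/p&gt;

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; while open,
    calls fail immediately without touching the dependency. After
    `reset_after` seconds, one call is allowed through as a probe."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at >= self.reset_after:
                self.opened_at = None  # half-open: let this call probe
            else:
                raise CircuitOpenError("failing fast, circuit is open")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```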

&lt;p&gt;So far, so sensible. But why? What is the goal?&lt;/p&gt;

&lt;p&gt;Martin, again:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It’s common for software systems to make remote calls to software running in different processes, probably on different machines across a network. One of the big differences between in-memory calls and remote calls is that remote calls can fail, or hang without a response until some timeout limit is reached. What’s worse if you have many callers on a unresponsive supplier, then you can run out of critical resources leading to cascading failures across multiple systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Circuit breakers aim to do this in a way that’s better than just short timeouts. Microsoft again:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note that setting a shorter timeout might help to resolve this problem, but the timeout shouldn’t be so short that the operation fails most of the time, even if the request to the service would eventually succeed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When people talk about circuit breakers, they’re typically considering two potential benefits. One, as Martin points out, is that failing early can prevent you from wasting work or resources on something that’s doomed. Doing that may allow work that requires the same resources, but isn’t dependent on the same downstream dependency, to continue to succeed. The second benefit is allowing a kind of progressive degradation in service. Maybe you can present your website without some optional feature, if the service backing that optional feature doesn’t work&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Again, sensible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Circuit Breakers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problem with circuit breakers is that they don’t take into account the fundamental properties of real distributed systems. Let’s consider the architecture of a toy distributed NoSQL database:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌────────────────────────────────────────┐
│          Load Balancer/Router          │
└────────────────────────────────────────┘
                     │                    
      ┌──────────────┼──────────────┐     
      │              │              │     
      ▼              ▼              ▼     
┌──────────┐   ┌──────────┐   ┌──────────┐
│          │   │          │   │          │
│ Storage  │   │ Storage  │   │ Storage  │
│  (A-H)   │   │  (I-R)   │   │  (S-Z)   │
│          │   │          │   │          │
└──────────┘   └──────────┘   └──────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There’s a router layer, and some shards of storage. When a request comes in for a key starting with B, it goes to the A-H shard. Requests for keys starting with T go to the S-Z shard, and so on. Real systems tend to be more complex and more sophisticated than this, but the top-level architecture of scale-out databases almost always looks a little bit like this.&lt;/p&gt;
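&lt;p&gt;The router’s job here is just a lookup from the first letter of the key to a shard. A sketch (the shard names are made up):&lt;/p&gt;

```python
SHARDS = [
    ("A", "H", "storage-a-h"),
    ("I", "R", "storage-i-r"),
    ("S", "Z", "storage-s-z"),
]

def route(key):
    """Return the storage shard owning this key, by its first letter."""
    first = key[0].upper()
    for lo, hi, shard in SHARDS:
        if hi >= first >= lo:
            return shard
    raise KeyError(f"no shard owns {key!r}")
```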

&lt;p&gt;How might this system fail? Clearly, the router layer could fail, taking the whole thing down. But that seems less likely because it’s simple, probably stateless, easily horizontally scalable, etc. More likely is that one of the storage shards gets overloaded. Say &lt;em&gt;AaronCon&lt;/em&gt; is in town, and everybody is trying to sign up. The A-H shard will get a lot of load, while the others might get little. Calls for A-H may start failing, while calls for other keys keep working.&lt;/p&gt;

&lt;p&gt;That presents the circuit breaker with a problem. Is this database &lt;em&gt;down&lt;/em&gt;? Have failures reached a threshold?&lt;/p&gt;

&lt;p&gt;If you say &lt;em&gt;yes, it’s down&lt;/em&gt;, then you’ve made service worse for Jane and Tracy. If you say &lt;em&gt;no, it’s not down&lt;/em&gt;, then you may as well not have the breaker at all. Breakers that don’t trip aren’t very useful&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;The same issue is true of cell-based architectures, where a circuit breaker tripping on the failure of one cell may make the whole system look like it’s down, defeating the purpose of cells entirely. Cell-based architectures are similar to sharded architectures, just sharded for availability and blast-radius instead of scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can We Fix Them?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maybe. The problem here is that for circuit breakers to do the right thing in cell-based and sharded systems they need to predict something very specific: is &lt;em&gt;this call for these parameters&lt;/em&gt; likely to work? Inferring that from other calls with other parameters may not be possible. Clients simply don’t know enough (and, mostly, shouldn’t know enough) about the inner workings of the systems they are calling to make that decision. Typically, three solutions to this problem are proposed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Tight coupling. If the client does know how internal data sharding works in the service, it can see which shards of the service are down, and make a good decision. The tradeoff here, obviously, is that this layering violation makes change hard. Nobody wants to be unable to change their service without changing every client. On the other hand, this approach may work well if you can guess well enough, like having circuit breakers per upstream customer.&lt;/li&gt;
  &lt;li&gt;Server information. On overload, the service can say things like “I’m overloaded for requests that start with A”, and the client can flip the corresponding mini circuit breaker. I’ve seen real-world systems that work this way, but the complexity cost may be high.&lt;/li&gt;
  &lt;li&gt;Statistical inference magic/AI magic/ML magic. Could work. Hard to get right. Have fun writing the postmortem when the arriving traffic looks nothing like the training set.&lt;/li&gt;
&lt;/ul&gt;
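&lt;p&gt;If you do take the first route (or a good-enough guess at it, like per-customer or per-key-prefix breakers), the change from a single breaker is small: keep a separate failure counter per partition guess. A sketch, guessing the server’s sharding from the first letter of the key (the partitioning function and threshold are illustrative):&lt;/p&gt;

```python
class ShardedBreaker:
    """One failure counter per partition guess, instead of one global one.

    A hot A-H shard trips only the breakers for keys that route there;
    keys routed elsewhere keep flowing. `partition_of` encodes our guess
    at how the service shards its data."""

    def __init__(self, threshold=3, partition_of=lambda key: key[0].upper()):
        self.threshold = threshold
        self.partition_of = partition_of
        self.failures = {}

    def call(self, key, operation):
        part = self.partition_of(key)
        if self.failures.get(part, 0) >= self.threshold:
            raise RuntimeError(f"circuit open for partition {part!r}")
        try:
            result = operation()
        except Exception:
            self.failures[part] = self.failures.get(part, 0) + 1
            raise
        self.failures[part] = 0
        return result
```

&lt;p&gt;The tradeoff from the first bullet still applies: if the service changes its sharding, this guess quietly becomes wrong.&lt;/p&gt;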

&lt;p&gt;&lt;strong&gt;Bottom Line&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern distributed systems are designed to partially fail, continuing to provide service to some clients even if they can’t please everybody. Circuit breakers are designed to turn partial failures into complete failures. One mechanism will likely defeat the other. Make sure you think that through before deploying circuit breakers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Ok, ok, modern circuit breakers have multiple jobs including detecting ground and arc faults, and industrial circuit breakers can do fancy things like detect high-impedance faults.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Commonly attributed to Michael Nygard in &lt;a href=&quot;https://www.amazon.com/Release-Production-Ready-Software-Pragmatic-Programmers/dp/0978739213&quot;&gt;Release It!&lt;/a&gt;, but it’s not clear that’s the actual origin, and I don’t have my copy of the book to hand to check if he credits somebody else. It’s a good book, worth reading.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Fans of the paper &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Harvest, Yield, and Scalable Tolerant Systems&lt;/a&gt; might call this a reduction in Harvest. That’s a good paper, as long as you skip the confusing section about CAP.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; And &lt;a href=&quot;https://www.nbcbayarea.com/news/local/federal-pacific-circuit-breakers-investigation-finds-decades-of-danger/1930189/&quot;&gt;can even be actively harmful&lt;/a&gt; by catching fire themselves. Useless complexity is bad.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Software Deployment, Speed, and Safety</title>
      <link>http://brooker.co.za/blog/2022/01/31/deployments.html</link>
      <pubDate>Mon, 31 Jan 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/01/31/deployments</guid>
      <description>&lt;h1 id=&quot;software-deployment-speed-and-safety&quot;&gt;Software Deployment, Speed, and Safety&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;There&apos;s one right answer that applies in all situations, as always.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: Sometime around 2015, I wrote AWS’s official internal guidance on balancing deployment speed and safety. This blog post is not that. It’s not official guidance from AWS (nothing on this blog is), and certainly not guidance for AWS. Instead, it’s my own take on deployments and safety, and how I think about the space.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You’ll find a lot of opinions about deployments on the internet. Some folks will say that teams have to be able to deploy from commit to global production in minutes. Others will point out that their industry has multi-year product cycles. It’s a topic that people feel strongly about, and for good reason. As usual with these kinds of topics, most of the disagreement doesn’t come from actual disagreement, but from people with wildly different goals and tradeoffs in mind. Without being explicit about what we’re trying to achieve, our risk tolerance, and our desired reward, it’s impossible to have a productive conversation on this topic. This post is an attempt to disentangle that argument a little bit, and explain my perspective.&lt;/p&gt;

&lt;p&gt;That perspective is clearly focused on the world I work in - offering cloud-based services to large groups of customers. Some of that applies to software more generally, and some applies only to that particular context. I have also used the word &lt;em&gt;deployment&lt;/em&gt; here to stand in for all production changes, including both software and configuration changes, and the actions of operators in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tradeoffs exist&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my experience, software teams are happiest when they’re shipping code. That could mean code to production, or to testing, or to validation, but nothing seems to destroy the morale of a team quite as surely as making changes with no end date in sight. Folks want to see their changes have an impact on the real world. It’s what I like to see too. Shipping often also means shipping smaller, better-understood increments, potentially increasing safety. Speed and agility are also important for reliability and security. Flaws in systems, whether in our software or the software, firmware, and hardware it’s built on, are an unfortunate fact of working on complex systems. Once flaws are found, it’s important to be able to address them quickly. Especially so when it comes to security, where an &lt;em&gt;adversary&lt;/em&gt; may learn about flaws at the same time we do. Businesses also want to get changes in the hands of customers quickly - after all, that’s what most of us are doing here. Customers want new features, improvements, better performance, or whatever else we’ve been working on. And, for the most part, they want it &lt;em&gt;now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;These factors argue that faster is better.&lt;/p&gt;

&lt;p&gt;Balancing the need for speed is risk. Let’s ignore, for the rest of this post, the risk of developing software fast (and, presumably, skipping out on testing, validation, careful deployment practices, etc) and focus only on the act of deployment. Getting changed software out into production. Clearly, at some level, deploying software reduces risk by giving us an opportunity to address known flaws in the system. Despite this opportunity to improve, deployment brings risk, introduced both by the act of deploying and by the fact that new software is going out to meet the world for the first time. That new software is tested and validated, of course, but the real world is more complex and weirder than even the most ambitious testing program, and therefore will contain new flaws.&lt;/p&gt;

&lt;p&gt;I’m going to mostly ignore the risks of the act of deploying. Clare Liguori wrote &lt;a href=&quot;https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/?did=ba_card&amp;amp;trk=ba_card&quot;&gt;a great post for the Amazon Builder’s Library on that topic&lt;/a&gt;, and the state-of-the-art of technological and organizational solutions. I won’t repeat that material here.&lt;/p&gt;

&lt;p&gt;Even in a world where getting software out to production is perfectly safe, new software has risks that old software doesn’t. More crucially, new components of old systems introduce change that may lead to emergent changes in the behavior of the entire system, in ways that prove difficult to predict. New features and components add complexity that wasn’t there before. Even new performance improvements may have unexpected consequences, either by introducing new cases where performance is unexpectedly worse, or by moving the bottlenecks to somewhere they are less visible, or by introducing instability or metastability into the system.&lt;/p&gt;

&lt;p&gt;Deploying software incrementally - to some places, customers, or machines - helps contain this risk. It doesn’t reduce the probability that something goes wrong, but does reduce the blast radius when something does. Deploying incrementally only works &lt;em&gt;over time&lt;/em&gt;. You need to allow enough time (measured in hours, requests, or both) to pass between steps to know if there is trouble. Monitoring, logging, and observability are needed to make that time valuable. If you’re not looking for signs of trouble, you’re wasting your time.&lt;/p&gt;

&lt;p&gt;There’s a tradeoff between speed and safety.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time finds problems, people fix them&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are two hidden assumptions in the section above: problems happen (or become visible) some time after deployment, and that time is short enough that waiting between deployments will catch a significant proportion of problems. The first assumption seems true enough. Trivial problems are often caught in testing, especially integration testing, so what’s left is more subtle things. It can take some time for a subtle signal in error rates or logs to become visible. The second is less obvious. If problems are caused by state or customer behavior that only exists in production but not in testing, then we may expect them to show themselves fairly quickly. More subtle issues may take longer, and longer than we’re willing to wait. For example, Roblox recently had a long outage triggered by a long-latent design issue, &lt;a href=&quot;https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/&quot;&gt;for reasons they describe in an excellent postmortem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These assumptions lead to three of the biggest controversies in this space: how fast to move, how to measure speed, and the Friday question.&lt;/p&gt;

&lt;p&gt;How fast we move, or how long we wait between increments, is a question that needs to be answered in two ways. One is driven by data on how long it takes to detect and react to real in-production software problems. That’s going to depend a great deal on the system itself. The second answer has to come from customer expectations and promises, which I’ll touch on later. The second controversy is how to measure speed. Do we measure in terms of wall-clock time, or requests, or coverage? The problem with wall-clock time is that software problems don’t tend to be triggered just by time, but by the system actually doing work. So if you deploy and nobody is using it, then waiting doesn’t help. Counting work, like requests, seems to be the obvious solution. The problem with that approach is that user patterns tend to be seasonal, and so you need to be quite specific about which requests to count (and doing that requires a level of foresight that may not be possible). Requests and coverage are also a little open-ended, which makes making promises somewhat difficult.&lt;/p&gt;
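&lt;p&gt;One pragmatic compromise is to gate each deployment step on both measures, proceeding only once enough wall-clock time &lt;em&gt;and&lt;/em&gt; enough requests have passed. A sketch (the thresholds are placeholders, not recommendations):&lt;/p&gt;

```python
def ready_for_next_wave(elapsed_seconds, requests_served,
                        min_seconds=3600, min_requests=100_000):
    """Gate a deployment wave on both wall-clock time and observed work.

    Waiting out the clock proves nothing if nobody exercised the new
    code, and a short burst of requests proves little about slow-burn
    problems, so require both before expanding the blast radius."""
    return elapsed_seconds >= min_seconds and requests_served >= min_requests
```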

&lt;p&gt;Then there’s The Friday Question. This is typically framed two ways. One is that not deploying on Fridays and weekends is better because it respects people’s time (because people will need to fix problems that arise), and better because folks being out for the weekend will increase time-to-recovery, and better because waking customers’ oncalls up on Friday night is painful. The other framing is that avoiding deploying on Fridays is merely a symptom of bad practices or bad testing or bad tooling or bad observability leading to too much risk. A lot of the controversy comes from the fact that both of these positions are reasonable, and correct. Good leaders do need to constantly look out for, and sometimes avoid, short-term band-aids over long-term problems. On the other hand, our whole industry could do with being more respectful of people’s lives, and doing the right thing by our customers is always the first goal. The passion on this question seems to be misguided.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are no take-backsies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rollback (taking a troublesome change out of production and replacing it with a previously known-good version) and &lt;em&gt;roll forward&lt;/em&gt; (taking a troublesome change out of production and replacing it with a hopefully fixed new version) are important parts of managing deployment risk. In idealized stateless systems they might be all we need to manage that risk. Detect quickly if something bad is happening, and roll back. Limit the trouble caused by a bad deployment by limiting the time to recovery. Unfortunately, with stateful systems you can’t take back some kinds of mistakes. Once state is corrupted, or lost, or leaked, or whatever, no amount of rolling back is going to fix the problem. That’s obviously true of hard-state systems like databases and soft-state systems like caches, but also true of the softer and more implicit state in things like queued requests, or inflight work, or running workflows or whatever. Rolling back requires fixing state, which may be impossible.&lt;/p&gt;

&lt;p&gt;High quality validation and testing are a critical part of the way forward. Stateful systems need to have a different bar for software quality than many other kinds of systems. They also need to have deployment practices reflect the fact that one of the most powerful tools for risk management—rollback—simply doesn’t work.&lt;/p&gt;

&lt;p&gt;Immutable, append-only, or log-based systems are often cited as a solution to this problem. They may be, but rolling back your log causes your log to turn into a DAG, and that’s a whole problem of its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can’t talk about risk without talking about customers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When we talk about the risk of a bad deployment, the only sensible way to do that is in context of the outcome. When I deploy changes to this blog, I do it live on the one production box. After all, if I mess up my blog may be broken, but nothing here is secret (the content is CC-BY) or particularly valuable.&lt;/p&gt;

&lt;p&gt;But as systems get bigger and more critical the risk changes. A few minutes of downtime, or an unexpected behavior, for a critical cloud system could have a direct impact on millions of people, and an indirect impact on billions. The outcome of mistakes is different, and therefore so is the risk. We can’t rationally talk about deployment practices without acknowledging this fact. Any reasonable set of deployment practices must be based around both the benefits of moving fast, and the risk of quickly pushing out bad changes. This is where most online discussions about this topic fall down, and where people end up talking past each other.&lt;/p&gt;

&lt;p&gt;The only place to start that makes sense is to understand the needs of the customers of the system. They want improvements, and fast. But they also want reliability and availability. How much is enough for them? Obviously there’s no amount of down time that makes people happy, but there is an amount where it stops becoming an impediment to their use of the service. That’s going to depend a great deal on what system we’re talking about, and how customers use it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can’t talk about risk without talking about correlation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Redundancy is the most powerful tool in our toolbox as distributed systems engineers. Have more than one place the data is stored, or requests are processed, or traffic can flow, and you can have a system that is more available and durable than any single component. Assuming, of course, that those components don’t fail at the same time. If they do, then all that carefully designed-in redundancy is for nought.&lt;/p&gt;

&lt;p&gt;What makes components fail at the same time? Could just be bad luck. The good news is that we can spend linear amounts of money to offset exponential amounts of bad luck. Correlated failures could be caused by common infrastructure, like power, cooling, or thin glass tubes. Correlated failures could be caused by data, load, or user behavior. But one of the dominant causes of correlated failure is software deployments. After all, deployments can cut across redundancy boundaries in ways that requests, data, and infrastructure often can’t. Deployments are the one thing that breaks all our architecture assumptions.&lt;/p&gt;

&lt;p&gt;To understand what a big deal this is, we need to understand the exponential effects of redundancy. Say I have one box, which is down for a month a year; then I have a system with around 92% availability. If I have two boxes (either of which can handle the load), and they fail independently, then I have a system with 99.3% availability! On the other hand, if they tend to fail together, then I’m back to 92%. Three independent boxes get me 99.94%. Three boxes failing together get me 92%. And so on. Whether failures happen independently or at the same time matters a lot for availability.&lt;/p&gt;
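&lt;p&gt;The arithmetic is worth making explicit. If each box is down a fraction &lt;em&gt;p&lt;/em&gt; of the time (a month a year is p ≈ 1/12), then &lt;em&gt;n&lt;/em&gt; boxes failing independently are all down with probability p&lt;sup&gt;n&lt;/sup&gt;, while fully correlated boxes are all down with probability p:&lt;/p&gt;

```python
def availability(p_box_down, n_boxes, correlated):
    """System availability for n redundant boxes, any one of which can
    carry the load. Independent failures: down only when all n are down.
    Fully correlated failures: redundancy buys nothing."""
    if correlated:
        return 1 - p_box_down
    return 1 - p_box_down ** n_boxes

p = 1 / 12  # one month of downtime per year
print(f"{availability(p, 1, False):.1%}")  # one box: 91.7%
print(f"{availability(p, 2, False):.1%}")  # two independent boxes: 99.3%
print(f"{availability(p, 3, False):.1%}")  # three independent boxes: 99.9%
print(f"{availability(p, 3, True):.1%}")   # three correlated boxes: 91.7%
```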

&lt;p&gt;Our deployment practices need to be aware of our redundancy assumptions. If we’re betting that failures are uncorrelated, we need to be extremely careful about reintroducing that correlation. Similarly, and even more importantly, our architectures need to be sensitive to the need to deploy safely. This is one of the places that folks working on architecture can have the biggest long-term impact, by designing systems that can be changed safely and frequently with low impact, and isolating critical state-altering logic from other logic which we may want to change more quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can’t talk about risk without talking about correlation on behalf of customers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like we build highly-available systems out of redundant components, customers of cloud services do too. It’s typical practice to build systems which run in multiple availability zones or datacenters. Customers with more extreme availability needs may build architectures which cross regions, or even continents. Those designs only work if the underlying services don’t fail at the same time, for all the same reasons that apply to each system in isolation. Making that true requires careful attention to system design, and careful attention to not re-introducing correlated failure modes during deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a tricky space, because it combines social and organizational concerns with technical concerns with customer concerns, and even things like contractual obligations. Like any similar problem space, it’s hard to come up with clear answers to these questions, because the answers are so dependent on context and details of your business. My advice is to write down those tensions explicitly, and be clear about what you’re trying to balance and where you think the right balance is for your business or technology.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>DynamoDB's Best Feature: Predictability</title>
      <link>http://brooker.co.za/blog/2022/01/19/predictability.html</link>
      <pubDate>Wed, 19 Jan 2022 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2022/01/19/predictability</guid>
      <description>&lt;h1 id=&quot;dynamodbs-best-feature-predictability&quot;&gt;DynamoDB’s Best Feature: Predictability&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Happy birthday!&lt;/p&gt;

&lt;p&gt;It’s &lt;a href=&quot;https://www.amazon.science/latest-news/amazons-dynamodb-10-years-later&quot;&gt;10 years since the launch of DynamoDB&lt;/a&gt;, Amazon’s fast, scalable, NoSQL database. Back when DynamoDB launched, I was leading the team rethinking the control plane of &lt;a href=&quot;https://aws.amazon.com/ebs/&quot;&gt;EBS&lt;/a&gt;. At the time, we had a large number of manually-administered MySQL replication trees, which were giving us a lot of operational pain. Writes went to a single primary, and reads came from replicas, with lots of eventual consistency and weird anomalies in the mix. Our code, based on an in-house framework, was also hard to work with. We weren’t happy with our operational performance, or our ability to deliver features and improvements. Something had to change. We thought a lot about how to use MySQL better, and in the end settled on ditching it entirely. We rebuilt the whole thing, from the ground up, using DynamoDB. At the time my main attraction to DynamoDB was &lt;em&gt;somebody else gets paged for this&lt;/em&gt;, with a side order of &lt;em&gt;it’s fast and consistent&lt;/em&gt;. DynamoDB turned out to be the right choice, but not only for those reasons.&lt;/p&gt;

&lt;p&gt;To understand the real value of DynamoDB, I needed to think more deeply about one of the reasons the existing system was painful. It wasn’t just the busywork of DB operations, and it wasn’t just the eventual consistency. The biggest pain point was behavior under load. A little bit of unexpected traffic and things went downhill fast. Like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/goodput_curve.jpeg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Our system had two stable modes (see my posts &lt;a href=&quot;https://brooker.co.za/blog/2021/05/24/metastable.html&quot;&gt;on metastability&lt;/a&gt; and on &lt;a href=&quot;https://brooker.co.za/blog/2021/08/27/caches.html&quot;&gt;cache behavior&lt;/a&gt;): one where it was ticking along nicely, and one where it had collapsed under load and wasn’t able to make progress. That collapsing under load was primarily driven by the database itself, with buffer/cache thrashing and IO contention the biggest drivers, but that wasn’t the real cause. The real cause was that we couldn’t reject work well enough to avoid entering that mode. Once we knew - based on queue lengths or latency or other output signals - the badness had already started. The unexpectedly expensive work had already been let in, and the queries were already running. Sometimes cancelling queries helped. Sometimes failing over helped. But it was always a pain.&lt;/p&gt;

&lt;p&gt;Moving to DynamoDB fixed this for us in two ways. One is that DynamoDB is great at rejecting work. When a table gets too busy you don’t get weird long latencies or lock contention or IO thrashing, you get a nice clean HTTP response. The net effect of DynamoDB’s ability to reject excess load (based on per-table settings) is that the offered load/goodput graph has a nice flat “top” instead of going up and then sharply down. That’s great, because it gives systems more time to react to excess load before tipping into overload. Rejections are a clear leading signal of excess load.&lt;/p&gt;
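&lt;p&gt;That flat-topped goodput curve comes from simple admission control: each table has a provisioned rate, and requests beyond it get an immediate, cheap rejection instead of queueing up into the thrashing regime. A token-bucket sketch of the idea (an illustration, not DynamoDB’s actual implementation):&lt;/p&gt;

```python
import time

class TableThrottle:
    """Token bucket per table: admit a request if capacity remains right
    now, otherwise reject immediately. The rejection is cheap for the
    server and a clear leading signal of excess load for the client."""

    def __init__(self, capacity_per_second, burst):
        self.rate = capacity_per_second
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def admit(self, cost=1.0):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # clean rejection instead of queueing and thrashing
```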

&lt;p&gt;More useful than that is another property of DynamoDB’s API: each call to the database does a clear, well-defined unit of work. Get these things. Scan these items. Write these things. There’s never anything open-ended about the work that you ask it to do. That’s quite unlike SQL, where a single &lt;em&gt;SELECT&lt;/em&gt; or &lt;em&gt;JOIN&lt;/em&gt; can do a great deal of work, depending on things like index selection, cache occupancy, key distribution, and the skill of the query optimizer. Most crucially, though, the amount of work that a SQL database does in response to a query depends on what data is already in the database. The programmer can’t know how much work a query is going to trigger unless they can also predict what data is going to be there. And, to some extent, what other queries are running at the same time. These properties make it hard for the programmer to build a good mental model of how their code will work in production, especially as products grow and conditions change.&lt;/p&gt;

&lt;p&gt;The same unpredictability has another effect. In typical web services, requests need to be accepted or rejected at the front door. That means that services need to be able to look at a request, and decide whether it should be rejected (for example to prevent overload or because of user quotas) without being able to accurately predict the cost of the database queries it will trigger.&lt;/p&gt;

&lt;p&gt;This all comes together to make it much easier to write stable, well-conditioned systems and services against DynamoDB than against relational databases. SQL and relational databases definitely have their place, including in scalable service architectures, but significant extra effort needs to be spent to make the systems that depend on them stable under unexpected load. That’s work that most developers aren’t deeply familiar with. DynamoDB’s model, on the other hand, forces stability and load to be considered up front, and makes them easier to reason about. In some environments that’s well worth it.&lt;/p&gt;

&lt;p&gt;It took me a while to realize it, but that’s my favorite thing about DynamoDB.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Bug in Paxos Made Simple</title>
      <link>http://brooker.co.za/blog/2021/11/16/paxos.html</link>
      <pubDate>Tue, 16 Nov 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/11/16/paxos</guid>
      <description>&lt;h1 id=&quot;the-bug-in-paxos-made-simple&quot;&gt;The Bug in Paxos Made Simple&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;There&apos;s not really a bug in Paxos, but clickbait is fun.&lt;/p&gt;

&lt;p&gt;Over the last few weeks, I’ve been picking up the excellent &lt;a href=&quot;https://github.com/p-org/P&quot;&gt;P programming language&lt;/a&gt;, a language for modelling and specifying distributed systems. One of the first things I did in P was implement Paxos - an algorithm I know well, which has a lot of subtle failure modes and is easy to get wrong. Perfect for practicing specification. To test out P’s model checker, I intentionally implemented a subtly buggy version of Paxos, following the description in &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt;. As expected, the model checker found that Paxos, implemented the way I read Paxos Made Simple, is broken.&lt;/p&gt;

&lt;p&gt;I mentioned this to a colleague who said they had never heard of this bug. I think it deserves to be better known, so I thought I’d write a bit about it.&lt;/p&gt;

&lt;p&gt;The problem lies not in the Paxos algorithm itself, but in the description in the paper. Michael Deardeuff pointed out this bug to me, and also wrote it up in what may be &lt;a href=&quot;https://stackoverflow.com/questions/29880949/contradiction-in-lamports-paxos-made-simple-paper&quot;&gt;the best Stack Overflow exchange of all time&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; (or, at least, the one with the best value-to-upvotes ratio). In the Stack Overflow question, user &lt;em&gt;lambda&lt;/em&gt; describes the following sequence of events:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Consider that there are totally 3 acceptors ABC. We will use X(n:v,m) to denote the status of acceptor X: proposal n:v is the largest numbered proposal accepted by X where n is the proposal number and v is the value of the proposal, and m is the number of the largest numbered prepare request that X has ever responded.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The following can play out:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1. P1 sends &apos;prepare 1&apos; to AB
2.  Both AB respond P1 with a promise to not to accept any request numbered smaller than 1.\
    Now the status is: A(-:-,1) B(-:-,1) C(-:-,-)
3.  P1 receives the responses, then gets stuck and runs very slowly
4.  P2 sends &apos;prepare 100&apos; to AB
5.  Both AB respond P2 with a promise to not to accept any request numbered smaller than 100.
    Now the status is: A(-:-,100) B(-:-,100) C(-:-,-)
6.  P2 receives the responses, chooses a value b and sends &apos;accept 100:b&apos; to BC   
7.  BC receive and accept the accept request, the status is: A(-:-,100) B(100:b,100) C(100:b,-).
    Note that proposal 100:b has been chosen.
8.  P1 resumes, chooses value a and sends &apos;accept 1:a&apos; to BC
9.  B doesn&apos;t accept it, but C accepts it because C has never promise anything.
    Status is: A(-:-,100) B(100:b,100) C(1:a,-).
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This seems to be a major problem, because now the system could &lt;em&gt;forget&lt;/em&gt; the decided value and decide on another one, violating the most basic safety property of Paxos. As Michael points out in his answer, it turns out that this happens because of two ambiguities in the text of Paxos Made Simple. First, on the selection of acceptors for the &lt;em&gt;accept&lt;/em&gt; (second) phase (from Paxos Made Simple):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of &lt;strong&gt;those acceptors&lt;/strong&gt; for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you follow the letter of this statement, and send the accept messages to the acceptors who responded to your first phase messages, then the problem can’t happen. Unfortunately, this also makes the algorithm somewhat less robust in practice. Fortunately, there’s another possible fix. Again, from Michael’s answer:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;By accepting a value the node is also promising to not accept earlier values.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lamport doesn’t say this in Paxos Made Simple. Instead, he says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So if you don’t quite follow the letter of the text about acceptor selection, and then do follow the text about how acceptors handle accept messages, then you end up with the bug described in the Stack Overflow question. That seems like a narrow case, but I’ll admit that I’ve implemented Paxos incorrectly in this way multiple times. It’s a very easy mistake to make.&lt;/p&gt;
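&lt;p&gt;The failure is mechanical enough to replay in a few lines. Here’s a toy sketch (in Python rather than P, with everything but the trace simplified away) of an acceptor that follows only the letter of the quoted accept rule:&lt;/p&gt;

```python
class Acceptor:
    def __init__(self):
        self.promised = 0     # highest-numbered prepare responded to
        self.accepted = None  # last accepted (n, value); note that accepting
                              # never consults previously accepted proposals

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def accept(self, n, value):
        # The letter of the rule: accept unless we have responded to a
        # prepare with a number greater than n. Nothing else is checked.
        if n >= self.promised:
            self.accepted = (n, value)
            return True
        return False

a, b, c = Acceptor(), Acceptor(), Acceptor()
a.prepare(1); b.prepare(1)              # steps 1-2: P1 prepares 1 at A, B
a.prepare(100); b.prepare(100)          # steps 4-5: P2 prepares 100 at A, B
b.accept(100, "b"); c.accept(100, "b")  # steps 6-7: 100:b chosen (majority B, C)
b.accept(1, "a")                        # steps 8-9: B rejects 1:a...
c.accept(1, "a")                        # ...but C, which never promised, accepts it
assert c.accepted == (1, "a")           # the chosen value b is forgotten at C
```

&lt;p&gt;A later proposer that prepares at A and C with a high enough number is told that the highest-numbered accepted proposal is 1:a, and so can get &lt;em&gt;a&lt;/em&gt; chosen even though &lt;em&gt;b&lt;/em&gt; was already chosen.&lt;/p&gt;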

&lt;p&gt;Leslie Lamport is one of my technical writing heroes. I re-read some of his papers, like &lt;a href=&quot;https://www.microsoft.com/en-us/research/uploads/prod/2016/12/What-Good-Is-Temporal-Logic.pdf&quot;&gt;What Good is Temporal Logic?&lt;/a&gt;, from time to time just because I like the way they are written. Pointing out this ambiguity isn’t criticizing his writing, but rather a reminder of how hard it is to write crisp descriptions of even relatively simple distributed protocols in text. As Lamport himself &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/pubs.html#paxos-simple&quot;&gt;says&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Prose is not the way to precisely describe algorithms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s a big part of why I like languages like P and TLA+ so much. Not only are they great ways to specify, check, and model algorithms, but they are also great ways to communicate them. If you work with distributed algorithms, I strongly advise picking up one of these languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://twitter.com/unmeshjoshi&quot;&gt;Unmesh Joshi&lt;/a&gt; had an interesting &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-17162?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&amp;amp;focusedCommentId=17445881#comment-17445881&quot;&gt;conversation with the Cassandra folks&lt;/a&gt; about why their implementation is correct (which it seems to be, at least in this context).&lt;/li&gt;
  &lt;li&gt;Heidi Howard replied with an interesting thread on Twitter, saying:&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thanks for posting this, it’s a super interesting observation! I actually think about this issue a bit differently. Acceptors in Paxos store the “highest proposal accepted” not the “last proposal accepted”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Thanks for posting this, it&amp;#39;s a super interesting observation! I actually think about this issue a bit differently. Acceptors in Paxos store the &amp;quot;highest proposal accepted&amp;quot; not the &amp;quot;last proposal accepted&amp;quot;.&lt;/p&gt;&amp;mdash; Heidi Howard (@heidiann360) &lt;a href=&quot;https://twitter.com/heidiann360/status/1461464625380270087?ref_src=twsrc%5Etfw&quot;&gt;November 18, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; I was reminded about this excellent exchange by Mahesh Balakrishnan’s recent post &lt;a href=&quot;https://maheshba.bitbucket.io/blog/2021/11/15/Paxos.html&quot;&gt;Paxos Made Abstract&lt;/a&gt;, which is well worth reading for a different perspective on how Paxos works, and one way to think about it from a systems perspective.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; There are many more correct variants of quorum selection for the different phases that meet different design goals, as &lt;a href=&quot;https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf&quot;&gt;Heidi Howard’s work&lt;/a&gt; clearly points out.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Serial, Parallel, and Quorum Latencies</title>
      <link>http://brooker.co.za/blog/2021/10/20/simulation.html</link>
      <pubDate>Wed, 20 Oct 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/10/20/simulation</guid>
      <description>&lt;h1 id=&quot;serial-parallel-and-quorum-latencies&quot;&gt;Serial, Parallel, and Quorum Latencies&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Why are they letting me write Javascript?&lt;/p&gt;

&lt;p&gt;I’ve written &lt;a href=&quot;https://brooker.co.za/blog/2021/04/19/latency.html&quot;&gt;before&lt;/a&gt; about the latency effects of series (do X, then Y), parallel (do X and Y, wait for them both), and quorum (do X, Y and Z, return when two of them are done) systems. The effects of these different approaches to doing multiple things are quite intuitive. What may not be intuitive, though, is the impact of quorums, and how much quorums can reduce tail latency.&lt;/p&gt;

&lt;p&gt;So I put together this little toy simulator.&lt;/p&gt;

&lt;p&gt;The knobs are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;serial&lt;/strong&gt; The number of things to do in a &lt;em&gt;chain&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;parallel&lt;/strong&gt; The number of parallel &lt;em&gt;chains&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;quorum&lt;/strong&gt; The number of chains we wait to complete before being done.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;runs&lt;/strong&gt; How many times to sample.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, for example, a traditional 3-of-5 Paxos system would have serial=1, parallel=5, and quorum=3. A length-3 chain replication system would have serial=3, parallel=1, quorum=1. The per-node service time distribution is (for now) assumed to be exponentially distributed with mean 1.&lt;/p&gt;
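&lt;p&gt;For exponentially-distributed service times there are also simple closed forms to sanity-check the simulator against: a length-n chain has mean latency n, and a k-of-n quorum has the mean of the k-th order statistic of n exponential(1) samples, which works out to a difference of harmonic numbers, H&lt;sub&gt;n&lt;/sub&gt; - H&lt;sub&gt;n-k&lt;/sub&gt;. A quick sketch, assuming independent samples as the simulator does:&lt;/p&gt;

```python
from fractions import Fraction

def harmonic(n):
    # The n-th harmonic number, 1 + 1/2 + ... + 1/n.
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mean_serial(n):
    # A chain of n exponential(1) steps has mean latency n.
    return Fraction(n)

def mean_quorum(k, n):
    # Mean of the k-th order statistic of n iid exponential(1) samples.
    return harmonic(n) - harmonic(n - k)

print(float(mean_serial(3)))     # 3.0
print(float(mean_quorum(3, 5)))  # 47/60, about 0.78: the quorum's big win
```

&lt;p&gt;Means understate the story, of course; the interesting action in the simulator is in the tail.&lt;/p&gt;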

&lt;div id=&quot;vis&quot;&gt;&lt;/div&gt;

&lt;script src=&quot;https://cdn.jsdelivr.net/npm/vega@5&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;https://cdn.jsdelivr.net/npm/vega-lite@4&quot;&gt;&lt;/script&gt;

&lt;script src=&quot;https://cdn.jsdelivr.net/npm/vega-embed@6&quot;&gt;&lt;/script&gt;

&lt;script type=&quot;text/javascript&quot;&gt;
  // Generate `n` samples, exponentially distributed with `lambda = 1.0` (i.e. a mean of 1)
  function samples(n) {
    let data = [];
    for (let i = 0; i &lt; n; i++) {
      data.push(-Math.log(Math.random()));
    }
    return data;
  }

  function makeSerial(n, serial) {
    let totals = [];
    for (let i = 0; i &lt; n; i++) {
      let sample = samples(serial).reduce((a, b) =&gt; a + b, 0);
      totals.push(sample);
    }
    return totals;
  }

  function simulate(n, serial, parallel, quorum) {
    let results = [];
    for (let i = 0; i &lt; n; i++) {
      // Each sample starts with `parallel` serial chains, each of length `serial`
      let serial_samples = makeSerial(parallel, serial);
      // Then we sort them numerically (the default sort is lexicographic), and keep the fastest `quorum`
      let sorted_samples = serial_samples.sort((a, b) =&gt; a - b).slice(0, quorum);
      // And the result is the longest remaining sample
      results.push(sorted_samples[sorted_samples.length - 1]);
    }
    return results;
  }

  function arrayToData(arr) {
    return arr.map(function(v) { return {&quot;u&quot;: v }; });
  }

  function updateView(view) {
    let new_data = simulate(view.signal(&apos;runs&apos;), view.signal(&apos;serial&apos;), view.signal(&apos;parallel&apos;), view.signal(&apos;quorum&apos;));
    view.change(&apos;points&apos;, vega.changeset().remove(vega.truthy).insert(arrayToData(new_data))).runAsync();
  }

  var spec = &quot;https://brooker.co.za/blog/resources/simulation_vega_lite_spec.json&quot;;
  vegaEmbed(&apos;#vis&apos;, spec).then(function(result) {
    updateView(result.view);
    result.view.addSignalListener(&apos;serial&apos;, function(name, value) {
      updateView(result.view);
    });
    result.view.addSignalListener(&apos;parallel&apos;, function(name, value) {
      updateView(result.view);
    });
    result.view.addSignalListener(&apos;quorum&apos;, function(name, value) {
      updateView(result.view);
    });
    result.view.addSignalListener(&apos;runs&apos;, function(name, value) {
      updateView(result.view);
    });
  }).catch(console.error);
&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Examples to Try&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Compare a 3-length chain to a 3-of-5 Paxos system. First, set serial=3, parallel=1, and quorum=1 and see how the 99th percentile latency is somewhere around 8s. Now, try serial=1, parallel=5, quorum=3. Notice how the 99th percentile is now just over 2s. There’s obviously a lot more to chain-vs-quorum in the real world than what is captured here.&lt;/li&gt;
  &lt;li&gt;Compare a 3-of-5 quorum to 4-of-7. The effect isn’t as big here, but the bigger quorum leads to a nice reduction in high-percentile latency.&lt;/li&gt;
  &lt;li&gt;Check out the non-linear effect of longer serial chains. The 99th percentile doesn’t increase by 10x between serial=1 and serial=10. Why?&lt;/li&gt;
&lt;/ul&gt;
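&lt;p&gt;On that last question: the sum of n exponentials concentrates around its mean, so the tail of the sum grows much more slowly than n times the tail of one step. A seeded Monte Carlo sketch of the same model, outside the browser:&lt;/p&gt;

```python
import random

random.seed(42)

def p99_of_chain(serial, runs=200_000):
    # 99th percentile latency of a chain of `serial` exponential(1) steps.
    totals = sorted(sum(random.expovariate(1.0) for _ in range(serial))
                    for _ in range(runs))
    return totals[int(0.99 * runs)]

one = p99_of_chain(1)   # about 4.6 (-ln(0.01) exactly, in the limit)
ten = p99_of_chain(10)  # about 19: well short of ten times worse
print(ten / one)        # roughly 4x, not 10x
```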

&lt;p&gt;Have fun!&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Caches, Modes, and Unstable Systems</title>
      <link>http://brooker.co.za/blog/2021/08/27/caches.html</link>
      <pubDate>Fri, 27 Aug 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/08/27/caches</guid>
      <description>&lt;h1 id=&quot;caches-modes-and-unstable-systems&quot;&gt;Caches, Modes, and Unstable Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Best practices are seldom the best.&lt;/p&gt;

&lt;p&gt;Is your system having scaling trouble? A bit too slow? Sending too much traffic to the database? Add a caching layer! After all, caches are a &lt;em&gt;best practice&lt;/em&gt; and &lt;em&gt;a standard way to build systems&lt;/em&gt;. What trouble could following a best practice cause?&lt;/p&gt;

&lt;p&gt;Lots of trouble, as it turns out. In the context of distributed systems, caches are a powerful and useful tool. Unfortunately, applied incorrectly, caching can introduce some highly undesirable system behaviors. Applied incorrectly, caches can make your system unstable. Or worse, &lt;a href=&quot;https://brooker.co.za/blog/2021/05/24/metastable.html&quot;&gt;metastable&lt;/a&gt;. To understand why that is, we need to understand a bit about how systems scale.&lt;/p&gt;

&lt;p&gt;Let’s start with the basics. Your system (hopefully) has some customers who send requests to it. Most often, you have lots of customers, and each one sends requests fairly infrequently. Those requests coming in from your customers are the &lt;em&gt;offered load&lt;/em&gt;, generally measured in something like &lt;em&gt;requests per second&lt;/em&gt;. Then, your system does some work on those requests, and eventually gives the results to some happy customers. The rate it does that is the &lt;em&gt;goodput&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/architecture.jpeg&quot; alt=&quot;Diagram showing customers offering load, goodput, and concurrency&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The number of requests inside your system, the &lt;em&gt;concurrency&lt;/em&gt;, is related to the offered load and goodput. When they’re the same, the concurrency varies a small amount, but is relatively stable. The amount of concurrency in your system depends on the &lt;em&gt;offered load&lt;/em&gt; and the time it takes to handle each request (&lt;em&gt;latency&lt;/em&gt;). So far, so good.&lt;/p&gt;

&lt;p&gt;But there’s some bad news. The bad news is that &lt;em&gt;latency&lt;/em&gt; isn’t really a constant. In most systems, and maybe all systems, it increases with &lt;em&gt;concurrency&lt;/em&gt;. And &lt;em&gt;concurrency&lt;/em&gt; increases with &lt;em&gt;latency&lt;/em&gt;. Maybe you can see where this is going.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/goodput_curve.jpeg&quot; alt=&quot;Diagram showing goodput curve&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Most real systems like this have a &lt;em&gt;congestive collapse&lt;/em&gt; mode, where they can’t get rid of requests as fast as they arrive, concurrency builds up, and the goodput drops, making the issue worse. You can use tools like &lt;a href=&quot;https://brooker.co.za/blog/2018/06/20/littles-law.html&quot;&gt;Little’s law&lt;/a&gt; to think about those situations.&lt;/p&gt;
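&lt;p&gt;Little’s law makes the loop easy to see numerically. Here’s a toy sketch (the linear latency model and every constant here are invented) that just iterates concurrency = offered load × latency:&lt;/p&gt;

```python
def settle_concurrency(offered_load, base_latency, slowdown, steps=200):
    # Iterate Little's law, concurrency = load * latency, with an invented
    # model where latency grows linearly with concurrency:
    #   latency(c) = base_latency * (1 + slowdown * c)
    # The loop diverges once offered_load * base_latency * slowdown >= 1.
    c = 0.0
    for _ in range(steps):
        c = offered_load * base_latency * (1.0 + slowdown * c)
    return c

# Below the tipping point, the feedback settles at a stable concurrency...
stable = settle_concurrency(offered_load=500, base_latency=0.01, slowdown=0.1)
# ...past it, every round of extra latency feeds the next: congestive collapse.
runaway = settle_concurrency(offered_load=1500, base_latency=0.01, slowdown=0.1)
print(stable, runaway)
```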

&lt;p&gt;&lt;em&gt;What does this have to do with caches?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The most common use of caches in distributed systems is to reduce load on a data store, like a database. When data is needed, you check the cache, if it’s not there, you go to the database and get the data, and stash it into the cache. That’s mostly good, because it reduces load on the database, and reduces latency.&lt;/p&gt;
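&lt;p&gt;In code, that cache-aside pattern is only a few lines. A minimal sketch, where &lt;em&gt;db_get&lt;/em&gt; is a hypothetical stand-in for the real (slow, limited-capacity) database read:&lt;/p&gt;

```python
cache = {}

def db_get(key):
    # Hypothetical stand-in for the real database read.
    return "value-for-" + key

def get(key):
    # Cache-aside: check the cache, fall back to the database on a miss,
    # then stash the result for next time.
    if key in cache:
        return cache[key]
    value = db_get(key)
    cache[key] = value
    return value

print(get("alice"))  # miss: reads the database and populates the cache
print(get("alice"))  # hit: served from the cache
```

&lt;p&gt;Note that nothing in this loop limits how fast misses can hit the database. That’s where the trouble starts.&lt;/p&gt;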

&lt;p&gt;What happens when the cache is empty? Well, latency is higher, and load on the backend database is higher. When latency is higher, concurrency is higher, and goodput may be lower. When load on the backend database is higher, its concurrency is higher, and goodput may be lower. In fact, the latter is very likely. After all, you put that cache in place to protect the backend database from all that load it can’t handle.&lt;/p&gt;

&lt;p&gt;So our system has two stable loops. One’s a happy loop where the cache is full:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/cache_happy_loop.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The other is a sad loop, where the cache is empty, and stays empty:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/cache_sad_loop.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What’s interesting and important here is that these are both stable loops. Unless something changes, the system can run in either one of these modes forever. That’s good in the case of the good loop, but bad in the case of the bad loop. It’s a classic example - probably the most common one of all - of a &lt;a href=&quot;https://brooker.co.za/blog/2021/05/24/metastable.html&quot;&gt;metastable&lt;/a&gt; distributed system.&lt;/p&gt;
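&lt;p&gt;The bistability shows up even in a crude numerical model. In this sketch every number is invented: a working set of 1,000 keys, 1,000 requests per tick, a database that does 100 reads per tick at best (and less when overloaded), and 5% of cached keys expiring each tick. The only difference between the two runs is the starting state of the cache:&lt;/p&gt;

```python
def step(hit_rate, load=1000.0, db_capacity=100.0, working_set=1000.0, expiry=0.05):
    # One tick of a toy cache model. `hit_rate` is the fraction of the
    # working set currently cached. Misses go to the database; past its
    # capacity the database's goodput degrades (a crude collapse model),
    # so fewer fills come back to repopulate the cache.
    misses = load * (1.0 - hit_rate)
    if misses <= db_capacity:
        fills = misses
    else:
        fills = db_capacity * (db_capacity / misses)
    return hit_rate * (1.0 - expiry) + fills / working_set

def settle(hit_rate, ticks=2000):
    for _ in range(ticks):
        hit_rate = step(hit_rate)
    return hit_rate

happy = settle(1.0)  # starts full: settles near a 95% hit rate and stays there
sad = settle(0.0)    # starts empty: pinned below a 30% hit rate, forever
print(happy, sad)
```

&lt;p&gt;Both runs see identical load; nothing external needs to change for the system to stay stuck in the sad loop.&lt;/p&gt;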

&lt;p&gt;&lt;em&gt;It gets worse&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This problem is bad, and especially insidious for a couple of reasons that may not be obvious on the surface.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Load testing typically isn’t enough to kick a system in the &lt;em&gt;good&lt;/em&gt; loop into the &lt;em&gt;bad&lt;/em&gt; loop, and so may not show that the bad loop exists. This is for a couple of reasons. One is that caches love load, and typically behave better under high, predictable, well-behaved load than under normal circumstances. The other is that load tests typically test &lt;em&gt;lots of load&lt;/em&gt;, instead of testing the bad pattern for caches, which is load with a different (and heavier-tailed) key frequency distribution from the typical one.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Caches extract cacheability. What I mean by that is that the load that misses the cache is less cacheable than the load that hits the cache. So typically, systems end up with either a hierarchy of cache sizes (like a CPU), or with one big cache. But when that cache is empty, a lot of requests for the same key will go to the systems behind it. A cache could have helped there, but there isn’t one because it wouldn’t have helped in the happy case.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Caches are based on assumptions. Fundamentally, a cache assumes that there’s either some amount of temporal or spatial locality of access (i.e. if Alice is sending work now, she’ll probably be sending more work soon, so it’s efficient to keep Alice’s stuff on deck), or their key distribution isn’t uniform (i.e. Bob sends work every second, Alice sends work every day, so it’s efficient to keep Bob’s stuff on deck and fetch Alice’s when we need it). These assumptions don’t tend to be rigorous, or enforced in any way. They may change in ways that are invisible to most approaches to monitoring.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;But aren’t CPU caches good?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yes, CPU caches are good. Our computers would be way slower without them.&lt;/p&gt;

&lt;p&gt;Thinking about why CPU caches are good and (generally) immune to this problem is very instructive. It’s because of offered load. When you’re clicking away on your laptop, say designing a robot in CAD or surfing the web, you react to slowness by asking for less work. That means that slowness caused by empty caches reduces goodput, but also reduces offered load. The unbounded increase in concurrency doesn’t happen.&lt;/p&gt;

&lt;p&gt;Good caches have feedback loops. Like back pressure, and limited concurrency. Bad caches are typically open-loop. This starts to give us a hint about how we may use caches safely, and points to some of the safe patterns for distributed systems caching. More on that later.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>My Proposal for Arecibo: Drones</title>
      <link>http://brooker.co.za/blog/2021/08/11/arecibo.html</link>
      <pubDate>Wed, 11 Aug 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/08/11/arecibo</guid>
      <description>&lt;h1 id=&quot;my-proposal-for-arecibo-drones&quot;&gt;My Proposal for Arecibo: Drones&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;With apologies to real radio astronomers&lt;/p&gt;

&lt;p&gt;Last night I finally got around to watching Grady Hillhouse’s &lt;a href=&quot;https://www.youtube.com/watch?v=3oBCtTv6yOw&quot;&gt;excellent video on the collapse of the Arecibo Telescope&lt;/a&gt;. At the end of Grady’s video he says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I hope eventually that they can replace the telescope with an instrument as futuristic and forward-looking as the Arecibo Telescope when first conceived.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I hope so too. While I’ve never worked in radio astronomy, my PhD supervisor and a number of my colleagues were involved in &lt;a href=&quot;https://en.wikipedia.org/wiki/MeerKAT&quot;&gt;MeerKAT&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Hydrogen_Epoch_of_Reionization_Array&quot;&gt;HERA&lt;/a&gt;, so I developed a real interest in the field. I’d love to see radio, and radar, astronomy gain a great new instrument.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.naic.edu/ngat/NGAT_WhitePaper_rv9_05102021.pdf&quot;&gt;Next Generation Arecibo Telescope&lt;/a&gt; white paper is one such proposal, written by a group of scientists who seem to know what they’re talking about. Their concept is for an array of 1,112 closely-packed 9m dishes, either built in the same sinkhole as the original telescope, or elsewhere on the site. Just like with KAT and SKA, their proposal starts with a small array and builds up over time.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Some of the exciting new possibilities with this instrument include searching for pulsars orbiting Sgr A*, observing molecular lines from early Universe, climate change, ISR studies both parallel and perpendicular to the geomagnetic field, space debris characterization, accurate velocity measurements of a larger fraction of near earth objects, space weather forecasts, dark energy and dark matter studies, gravitational wave detection through pulsar timing, etc.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The whole thing seems smart, and sensible, but a little conservative.&lt;/p&gt;

&lt;p&gt;So this is my semi-serious pitch for a sci-fi alternative: drones.&lt;/p&gt;

&lt;p&gt;The expensive and hard part of building a telescope like Arecibo or &lt;a href=&quot;https://en.wikipedia.org/wiki/Five-hundred-meter_Aperture_Spherical_Telescope&quot;&gt;FAST&lt;/a&gt; isn’t the dish itself&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, despite that being the large and iconic part. Instead, it’s the platform and gear that houses the receivers (and, in the case of radar astronomy, the transmitters). That stuff is heavy, needs to be able to move around while being held firmly in position, creates a lot of data, and takes a lot of power. The dish itself is just a bunch of aluminium panels, mostly near the floor. What’s inside the suspended structure is an antenna, or array of antennas, equipment to keep that antenna cool, equipment to receive signal from the antenna, digitize it, denoise and compress it, and get that data back to somewhere more sensible for further processing. It needs to be held quite still, way up in the air. To change where the telescope looks, you need to move the antenna. To focus it, you need to move the antenna. To change what frequency band you’re working in, you need to change the antenna.&lt;/p&gt;

&lt;p&gt;It’s the same with radars: the traditional design is for one very complex and expensive receiver. Array telescopes change this up by having an array of dishes, and one slightly less complex receiver per dish, but the complexity is still quite high. Similarly, multistatic radars change it up by having a lot of receivers. What if we could go even further, and have a much higher number of much cheaper and less complex receivers? We could, in theory, build a radar or radio telescope that’s orders of magnitude cheaper than what we’re building today. Or orders of magnitude more capable for the same price.&lt;/p&gt;

&lt;p&gt;If I have a really big hemispherical-ish dish, like what was at Arecibo, I can turn it into a telescope by having one big receiver in the air space above it, or by having many, many receivers in the same space. The advantage of one is that I get to make that one very good indeed. The advantage of many is that I get to sample the space at a lot of points, which (combined with fancy signal processing) gives me a huge amount of flexibility about where I look, and what energy I pay attention to. Plus, I get redundancy, which is nice.&lt;/p&gt;

&lt;p&gt;People don’t tend to build the “one dish, loads of receivers” kind of radio telescope, partially because the structure needed to keep tens of thousands of antennas floating in the air would be quite complex. All that structure would get in the way of the radio waves, and make the whole thing work much less well.&lt;/p&gt;

&lt;p&gt;Which brings us to drone shows.&lt;/p&gt;

&lt;p&gt;Drone shows, like the &lt;a href=&quot;https://inteldronelightshows.com/&quot;&gt;Intel Drone Light Show&lt;/a&gt; that featured at the opening ceremony of the Tokyo Olympics. Or this record-breaking show:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube-nocookie.com/embed/44KvHwRHb3A&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Drone light shows put thousands of small things into the air, and get them to execute precise and complex coordinated movements. What if we put an antenna on each drone, and flew it above a huge reflective dish? By moving the drones around, we could sample the radio energy at thousands of locations above the dish, and send the data back for processing. With the right kind of processing, we could correct for errors in the drone’s location and movement, and combine all their signals into a single view of what the telescope is seeing.&lt;/p&gt;

&lt;p&gt;It’s obviously hard, but I suspect it’s possible. And likely way cheaper than the $454MM budget expected for NGAT. Cheap enough for an Olympic opening ceremony stunt.&lt;/p&gt;

&lt;p&gt;What might make it impossible? Drones are quite unreliable, at least per-drone. They have limited battery life. Drones are quite electrically noisy. Drones don’t hold station particularly well, certainly not well enough for long-term coherent sampling in the GHz range. It’d probably be hard to put a cryo-cooled antenna on a drone. But all those seem kinda surmountable. If there’s one thing that would kill the idea, it’s probably noise. There’s not a lot of energy in those radio signals from the sky, so they are easy to drown out&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. But maybe we could shove the electrical noise into the bands we’re not interested in.&lt;/p&gt;

&lt;p&gt;So, NSF, how about replacing Arecibo with a drone array?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; MeerKAT is not only a great scientific instrument, but also one of the greatest puns of all time.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; On the other hand, if you’re building a big moving dish like &lt;a href=&quot;https://en.wikipedia.org/wiki/Hartebeesthoek_Radio_Astronomy_Observatory&quot;&gt;DSS51&lt;/a&gt; then the structure is very tricky, both in just getting it to stand up, and getting it stiff enough that it doesn’t go all fuzzy when the wind blows or the sun shines. One interesting relationship is that HartRAO is where it is for the same reasons Arecibo is where it is.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; One of the NGAT proposal’s advantages over Arecibo is “Capable of mitigating radio frequency interference (RFI) through phased nulling”. What they mean here is that they can use the fact that it’s an array rather than a single antenna to better reject nearby sources of RFI, by steering a null in the direction of the noise source. The more antennas you have, the easier that is to do.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Latency Sneaks Up On You</title>
      <link>http://brooker.co.za/blog/2021/08/05/utilization.html</link>
      <pubDate>Thu, 05 Aug 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/08/05/utilization</guid>
      <description>&lt;h1 id=&quot;latency-sneaks-up-on-you&quot;&gt;Latency Sneaks Up On You&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;And is a bad way to measure efficiency.&lt;/p&gt;

&lt;p&gt;As systems get big, people very reasonably start investing more in increasing efficiency and decreasing costs. That’s a good thing, for the business, for the environment, and often for the customer. Much of the time efficient systems have lower and more predictable latencies, and everybody enjoys lower and more predictable latencies.&lt;/p&gt;

&lt;p&gt;Most folks around me think about latency using percentiles or other order statistics. Common practice is to look at the median (50&lt;sup&gt;th&lt;/sup&gt; percentile), and some high percentile like the 99.9&lt;sup&gt;th&lt;/sup&gt; or 99.99&lt;sup&gt;th&lt;/sup&gt;. As they do efficiency work, they often see that not only does their 50&lt;sup&gt;th&lt;/sup&gt; percentile improve a lot, but so do the high percentiles. Then, just a short while later, the high percentiles have crept back up without the code getting slower again. What’s going on?&lt;/p&gt;

&lt;p&gt;A lot of things, as usual, but one of them is the non-linear effect of utilization. To explain that, let’s consider a simple system with one server that can serve one thing at a time, and a queue in front of it&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌────────┐     ┌─────┐    ┌──────┐
│ Client │────▶│Queue│───▶│Server│
└────────┘     └─────┘    └──────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Simple as it comes, really. We can then define the Utilization of the server (calling it ⍴ for traditional reasons) in terms of two other numbers: the mean arrival rate λ, and the mean completion rate μ, both in units of jobs/second.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;⍴ = λ/μ
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Another way to think of ⍴ is that the server is idle (no work to do, empty queue) 1-⍴ of the time. So if ⍴=0.4, then the server is idle 60% of the time. Clearly, if ⍴&amp;gt;1 for a long time then the queue grows without bound, because more stuff is arriving than leaving. Let’s ignore that for now, because infinite queues are silly things.&lt;/p&gt;

&lt;p&gt;To understand what happens here, we need to look at the diagram above, and notice that there’s no feedback loop. The client sends work&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; at random, on its own schedule. Sometimes that’s when the server is idle, and sometimes when it’s busy. When the server is busy, that work is added to the queue. Because ⍴&amp;lt;1 in our world, the work will eventually get done, but may have to wait behind other things in the queue. Waiting in the queue&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; for service is a common cause of outlier latency.&lt;/p&gt;

&lt;p&gt;As we think about it this way, we realize that the closer ⍴ gets to 1, the more likely it is that an incoming item of work will find a busy server, and so will be queued. So increasing ⍴ increases queue depth, which increases latency. By a lot. In fact, it increases latency by an alarming amount as ⍴ goes to 1. One way to think about this is in terms of the number of items of work in the system, including the one being serviced by the server and those in the queue. For tradition’s sake, we’ll call this N and its mean (expectation) E[N].&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;E[N] = ⍴/(1-⍴)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Maybe we need to draw that to show how alarming it is.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/queue_length.png&quot; alt=&quot;Graph of ⍴/(1-⍴)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To give you some sense of this, at ⍴=0.5 (about half utilized), E[N] is 1. At ⍴=0.99, it’s 99.&lt;/p&gt;
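
To check that formula with something more concrete, here is a quick simulation (my sketch, not from the original post) of a single-server FIFO queue with Poisson arrivals and exponential service, the M/M/1 system from footnote 1. By the PASTA property, the number of jobs an arrival finds in the system estimates E[N]:

```python
import random

def simulate_mm1(lam, mu, num_jobs=200_000, seed=1):
    """Simulate a FIFO M/M/1 queue and return the mean number of jobs
    an arrival finds in the system (by PASTA, this estimates E[N])."""
    rng = random.Random(seed)
    clock = 0.0
    departures = []  # departure times of jobs still in the system, FIFO order
    found = 0
    for _ in range(num_jobs):
        clock += rng.expovariate(lam)                      # next Poisson arrival
        departures = [d for d in departures if d > clock]  # finished jobs leave
        found += len(departures)                           # jobs this arrival sees
        start = departures[-1] if departures else clock    # wait behind the queue
        departures.append(start + rng.expovariate(mu))     # exponential service
    return found / num_jobs

# E[N] = ⍴/(1-⍴) predicts ~1 at ⍴ = 0.5 and ~9 at ⍴ = 0.9;
# the simulation lands close to both.
```

The arrival rates, service rates, and job counts here are arbitrary choices for illustration; the shape of the result is what matters.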

&lt;p&gt;When people do efficiency work, they typically increase the rate at which the system can do work μ, without changing the arrival rate λ. Going back to our definition:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;⍴ = λ/μ
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can see that increasing μ drives down ⍴ and pushes us to the left and down on the curve above. Even relatively modest changes in μ can lead to a very big change in E[N], and if queuing dominates our latency, a big win on latency. An especially big win on outlier latency.&lt;/p&gt;
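
To put numbers on “relatively modest changes” (my own arithmetic, just evaluating the E[N] formula above): a 20% throughput improvement at ⍴ = 0.9 cuts the mean number of queued items by a factor of three.

```python
def mean_in_system(lam, mu):
    """M/M/1 mean number of jobs in the system: E[N] = rho/(1 - rho)."""
    rho = lam / mu
    return rho / (1 - rho)

before = mean_in_system(90.0, 100.0)  # rho = 0.90 -> E[N] = 9.0
after = mean_in_system(90.0, 120.0)   # 20% faster: rho = 0.75 -> E[N] = 3.0
```

The specific rates (90 and 100 jobs/second) are made up for the example; only the ratio ⍴ matters.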

&lt;p&gt;Next, the system grows for a while (increasing λ), or we reduce the number of servers (decreasing μ) to realize our efficiency gains. That causes ⍴ to pop back up, and latency to return to where it was. This often leads people to be disappointed about the long-term effects of efficiency work, and sometimes under-invest in it.&lt;/p&gt;

&lt;p&gt;The system we consider above is a gross simplification, both in complexity, and in kinds of systems. Streaming systems will behave differently. Systems with backpressure will behave differently. Systems whose clients &lt;em&gt;busy loop&lt;/em&gt; will behave differently. These kinds of dynamics are common, though, and worth looking out for.&lt;/p&gt;

&lt;p&gt;The bottom line is that high-percentile latency is a bad way to measure efficiency, but a good (leading) indicator of pending overload. If you must use latency to measure efficiency, use &lt;a href=&quot;https://brooker.co.za/blog/2017/12/28/mean.html&quot;&gt;mean (avg) latency&lt;/a&gt;. Yes, average latency&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; I’m intentionally glossing over the details here. The system I’m considering is M/M/1, with a single server, unbounded queue, Poisson arrival process, and exponentially distributed service time. And yes, real systems aren’t like this. I know.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; In this case according to a Poisson process, which isn’t entirely realistic, but isn’t so far off the reality either. I’m fudging something else here: single clients don’t tend to be Poisson processes, but the sum of very large numbers of independent clients do. If you care about that, sub ‘clients’ every time I say ‘client’.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; When I say &lt;em&gt;queue&lt;/em&gt; that may be an explicit actual queue, or could just be a bunch of threads waiting on a lock, or an async task waiting for an IO to complete, or whatever. Implicit queues are everywhere.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Yes, those people on the internet that tell you never to measure average latency are wrong. And don’t get me started on the trimmers and winsorizers.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Metastability and Distributed Systems</title>
      <link>http://brooker.co.za/blog/2021/05/24/metastable.html</link>
      <pubDate>Mon, 24 May 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/05/24/metastable</guid>
      <description>&lt;h1 id=&quot;metastability-and-distributed-systems&quot;&gt;Metastability and Distributed Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;What if computer science had different parents?&lt;/p&gt;

&lt;p&gt;There’s no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again. The reasons that works so well are varied, but one reason is especially important for the developers and operators of distributed systems: metastability.&lt;/p&gt;

&lt;p&gt;I’ll let the authors of &lt;a href=&quot;https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf&quot;&gt;Metastable Failures in Distributed Systems&lt;/a&gt; define what that means:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What they’re identifying here is a kind of &lt;em&gt;stable down&lt;/em&gt; state, where the system is stable but not doing useful work, even though it’s only being offered a load that it successfully handled sometime in the past.&lt;/p&gt;

&lt;p&gt;One classic version of this problem involves queues. A system is ticking along nicely, and something happens. Could be a short failure, a spike of load, a deployment, or one of many other things. This causes queues to back up in the system, causing an increase in latency. That increased latency causes clients to time out before the system responds to them. Clients continue to send work, and the system continues to complete that work. Throughput is great. None of the work is useful, though, because clients aren’t waiting for the results, so goodput is zero. The system is mostly stable in this state, and without an external kick, could continue going along that way indefinitely. Up, but down. Working, but broken.&lt;/p&gt;
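
To get a feel for how sticky this state can be, here’s some back-of-envelope arithmetic (my numbers, not from the paper): even a modest backlog takes a long time to drain when utilization is high, and the whole time every request waits far longer than any client timeout.

```python
mu = 1000.0        # server completions per second
lam = 980.0        # client arrivals per second (rho = 0.98)
backlog = 50_000   # requests queued up by some trigger (spike, deploy, blip)
timeout = 1.0      # client timeout, in seconds

drain_seconds = backlog / (mu - lam)  # queue shrinks at only mu - lam = 20/s
head_wait = backlog / mu              # a new arrival waits behind the backlog

# drain_seconds = 2500 (~42 minutes); head_wait = 50 s against a 1 s timeout.
# Throughput stays at ~1000/s the whole time, but goodput is ~0 until it drains.
```

All four inputs are hypothetical; the point is that the drain rate is the small difference μ−λ, so a well-utilized system recovers very slowly on its own.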

&lt;p&gt;In &lt;a href=&quot;https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf&quot;&gt;Metastable Failures in Distributed Systems&lt;/a&gt;, Bronson et al correctly observe that these types of failure modes are well-known&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; to the builders of large-scale systems:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;By reviewing experiences from a decade of operating hyperscale distributed systems, we identify a class of failures that can disrupt them, even when there are no hardware failures, configuration errors, or software bugs. These metastable failures have caused widespread outages at large internet companies, lasting from minutes to hours. Paradoxically, the root cause of these failures is often features that improve the efficiency or reliability of the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The paper identifies a list of other triggers for these types of metastable failures, including retries, caching, slow error handling paths and emergent properties of load-balancing algorithms. That’s a good list, but it only scratches the surface of the possible causes of these ‘down but up’ states in distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is there a cure?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The disease is a serious one, but perhaps with the right techniques we can build systems that don’t have these metastable states. Bronson et al propose approaching that in several ways:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We consider the root cause of a metastable failure to be the sustaining feedback loop, rather than the trigger. There are many triggers that can lead to the same failure state, so addressing the sustaining effect is much more likely to prevent future outages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn’t a controversial point, but it is an important one: fixing only the triggering causes of issues leaves us unable to prevent similar issues with slightly different triggers in the future.&lt;/p&gt;

&lt;p&gt;The rest of their proposed solutions are more debatable. Changing policy during overload introduces modal behavior that can be hard to reason about (and &lt;a href=&quot;https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/&quot;&gt;modes are bad&lt;/a&gt;). Prioritization and &lt;a href=&quot;https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems/&quot;&gt;fairness&lt;/a&gt; are good if you can get them, but many systems can’t, either because their workloads are complex interdependent graphs without clear priority order, or because the priority order is unpalatable to the business. Fast error paths and outlier hygiene are good, in an &lt;em&gt;eat your broccoli&lt;/em&gt; kind of way.&lt;/p&gt;

&lt;p&gt;The other two they cover that really resonate with me are &lt;em&gt;organizational incentives&lt;/em&gt; and &lt;em&gt;autoscaling&lt;/em&gt;. Autoscaling, again, is a &lt;em&gt;good if you can get it&lt;/em&gt; kind of thing, but most applications can get it by building on top of modern cloud systems. Maybe even get it for free by building on serverless&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. On organizational incentives:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Optimizations that apply only to the common case exacerbate feedback loops because they lead to the system being operated at a larger multiple of the threshold between stable and vulnerable states.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes, precisely. This is a very important dynamic to understand, and design an organization around defeating&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. One great example of this behavior is retries. If you’re only looking at your day-to-day error rate metric, you can be led to believe that adding more retries makes systems better because it makes the error rate go down. However, the same change can make systems more vulnerable, by converting small outages into sudden (and metastable) periods of internal retry storms. Your weekly loop where you look at your metrics and think about how to improve things may be making things worse.&lt;/p&gt;
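
The retry dynamic can be made concrete with a small sketch (mine, not from the paper): with up to three attempts per request, retries look nearly free at normal error rates but almost triple the offered load during an outage, exactly when the system can least afford it.

```python
def expected_attempts(failure_rate, max_attempts):
    """Expected attempts per logical request, if each attempt fails
    independently with probability failure_rate and the client keeps
    retrying up to max_attempts total attempts."""
    return sum(failure_rate ** i for i in range(max_attempts))

normal = expected_attempts(0.001, 3)  # ~1.001x load: invisible on the dashboard
outage = expected_attempts(0.9, 3)    # 1 + 0.9 + 0.81 = 2.71x the offered load
```

The failure rates and retry budget are illustrative assumptions; real clients often add backoff and jitter, which changes the timing but not the amplification.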

&lt;p&gt;&lt;strong&gt;Where do we go?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Knowing this problem exists, and having some tactics to fix certain versions of it, is useful. Even more useful would be to design systems that are fundamentally stable.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Can you predict the next one of these metastable failures, rather than explain the last one?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The paper lays out a couple of strategies here. The most useful one is a &lt;em&gt;characteristic metric&lt;/em&gt; that gives insight into the state of the feedback loop that’s holding the system down. This is the start of a line of thinking that treats large-scale distributed systems as control systems, and allows us to start applying the mathematical techniques of control theory and &lt;a href=&quot;https://en.wikipedia.org/wiki/Dynamical_system&quot;&gt;dynamical systems theory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I believe that many of the difficulties we have in this area come from where computing grew up. Algorithms, data structures, discrete math, finite state machines, and the other core parts of the CS curriculum are only one possible intellectual and theoretical foundation for computing. It’s interesting to think about what would be different in the way we teach CS, and the way we design and build systems, if we had instead chosen the mathematics of control systems and dynamical systems as the foundation. Some things would likely be harder. Others, like avoiding building metastable distributed systems, would likely be significantly easier.&lt;/p&gt;

&lt;p&gt;In lieu of a time-travel-based rearrangement of the fundamentals of computing, I’m excited to see more attention being paid to this problem, and to possible solutions. We’ve made a lot of progress in this space over the last few decades, but there’s a lot more research and work to be done.&lt;/p&gt;

&lt;p&gt;Overall, &lt;a href=&quot;https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf&quot;&gt;Metastable Failures in Distributed Systems&lt;/a&gt; is an important part of a conversation that doesn’t get nearly the attention it deserves in the academic or industrial literature. If I have any criticism, it’s that the paper overstates its case for novelty. These kinds of issues are well known in the world of control systems, in &lt;a href=&quot;https://qualitysafety.bmj.com/content/14/2/130&quot;&gt;health care&lt;/a&gt;, in operations research, and other fields. The organizational insights echo those of folks like Jens Rasmussen&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. But it’s a HotOS paper, and this sure is a hot topic, so I won’t hold the lack of a rigorous investigation of the background against the authors.&lt;/p&gt;

&lt;p&gt;If you build, operate, or research large-scale distributed systems, you should read this paper. There’s also a good summary on &lt;a href=&quot;http://charap.co/metastable-failures-in-distributed-systems/&quot;&gt;Aleksey Charapko’s blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; For example, I wrote about part of this problem in &lt;a href=&quot;https://brooker.co.za/blog/2019/05/01/emergent.html&quot;&gt;Some risks of coordinating only sometimes&lt;/a&gt;, and &lt;a href=&quot;http://www.hpts.ws/papers/2019/brooker.pdf&quot;&gt;talked about it at HPTS’19&lt;/a&gt;, although framed the issue as a bistability rather than metastability. Part of the thinking in that talk came from my own experience, and discussions of the topic in books like &lt;a href=&quot;https://www.amazon.com/Designing-Distributed-Control-Systems-Language/dp/1118694155/&quot;&gt;designing distributed control systems&lt;/a&gt;. It’s a topic we’ve spent a lot of energy on at AWS over the last decade, although typically using different words.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Of course I’m heavily biased, but the big advantage of serverless is that most applications are small relative to the larger serverless systems they run on, and so have a lot of headroom to deal with sudden changes in efficiency. In practice, I think that building on higher-level abstractions is going to be the best way for &lt;em&gt;most&lt;/em&gt; people to avoid problems like those described in the paper, most of the time.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Specifically the discussion of the “error margin” in &lt;a href=&quot;https://lewebpedagogique.com/audevillemain/files/2014/12/maint-Rasmus-1997.pdf&quot;&gt;Risk Management in a Dynamic Society&lt;/a&gt;, and how economic and labor forces push systems closer to the boundary of acceptable performance.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; An organization and an economy. As we saw with supply-side shortages of things like masks early in the Covid pandemic, real-world systems are optimized for little excess capacity too, and optimized for the happy case.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Tail Latency Might Matter More Than You Think</title>
      <link>http://brooker.co.za/blog/2021/04/19/latency.html</link>
      <pubDate>Mon, 19 Apr 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/04/19/latency</guid>
      <description>&lt;h1 id=&quot;tail-latency-might-matter-more-than-you-think&quot;&gt;Tail Latency Might Matter More Than You Think&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;A frustratingly qualitative approach.&lt;/p&gt;

&lt;p&gt;Tail latency, also known as &lt;em&gt;high-percentile&lt;/em&gt; latency, refers to high latencies that clients see fairly infrequently. Things like: “my service mostly responds in around 10ms, but sometimes takes around 100ms”. There are many causes of tail latency in the world, including contention, garbage collection, packet loss, host failure, and weird stuff operating systems do in the background. It’s tempting to look at the 99.9th percentile, and feel that it doesn’t matter. After all, 999 of 1000 calls are seeing lower latency than that.&lt;/p&gt;

&lt;p&gt;Unfortunately, it’s not that simple. One reason is that modern architectures (like microservices and SoA) tend to have a lot of components, so one user interaction can translate into many, many, service calls. A common pattern in these systems is that there’s some &lt;em&gt;frontend&lt;/em&gt;, which could be a service or some Javascript or an app, which calls a number of backend services to do what it needs to do. Those services then call other services, and so on. This forms two kinds of interactions: parallel fan-out, where the service calls many backends in parallel and waits for them all to complete, and serial chains where one service calls another, which calls another, and so on.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/call_graph.png&quot; alt=&quot;Service call graph showing fan-out and serial chains&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These patterns make tail latency more important than you may think.&lt;/p&gt;

&lt;p&gt;To understand why, let’s do a simple numerical experiment. Let’s simplify the world so that all services respond with the same latency, and that latency follows a very simple bimodal distribution: 99% of the time with a mean of 10ms (normally distributed with a standard deviation of 2ms), and 1% of the time with a mean of 100ms (and SD of 10ms). In the real world, service latencies are almost always multi-modal like this, but typically not just a sum of normal distributions (but that doesn’t matter here).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallel Calls&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let’s consider parallel calls. The logic here is simple: we call N services in parallel, and wait for the slowest one. Applying our intuition suggests that as N increases, it becomes more and more likely that we’ll wait for a ~100ms &lt;em&gt;slow&lt;/em&gt; call. With N=1, that’ll happen around 1% of the time. With N=10, around 10% of the time. In this simple model, that basic intuition is right. This is what it looks like:&lt;/p&gt;

&lt;video width=&quot;480&quot; height=&quot;480&quot; autoplay=&quot;&quot; controls=&quot;&quot;&gt;
  &lt;source src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/freq_maxes.mp4&quot; type=&quot;video/mp4&quot; /&gt;
&lt;/video&gt;

&lt;p&gt;The tail mode, which used to be quite rare, starts to dominate as N increases. What was a rare occurrence is now normal. Nearly everybody is having a bad time.&lt;/p&gt;
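
Here’s a sketch of that experiment (my reconstruction, using the distribution parameters described above): the chance that at least one of N parallel backends hits the slow mode is 1 − 0.99&lt;sup&gt;N&lt;/sup&gt;, and the simulation matches.

```python
import random

rng = random.Random(0)

def service_latency():
    """One backend call: ~N(10, 2) ms 99% of the time, ~N(100, 10) ms 1%."""
    if rng.random() < 0.01:
        return rng.gauss(100.0, 10.0)
    return rng.gauss(10.0, 2.0)

def fanout_latency(n):
    """A request that fans out to n backends waits for the slowest one."""
    return max(service_latency() for _ in range(n))

def fraction_slow(n, trials=50_000, threshold_ms=50.0):
    """Fraction of requests whose latency is dominated by the ~100 ms tail mode."""
    return sum(fanout_latency(n) > threshold_ms for _ in range(trials)) / trials

# Analytically 1 - 0.99**n: ~1% at n=1, ~9.6% at n=10, ~63% at n=100.
```

The 50ms threshold is an arbitrary cut that cleanly separates the two modes; any value well between them gives the same picture.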

&lt;p&gt;&lt;strong&gt;Serial Chains&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serial chains are a little bit more interesting. In this model, services call services, down a chain. The final latency is the sum of all of the service latencies down the chain, and so there are a lot more cases to think about: 1 &lt;em&gt;slow&lt;/em&gt; service, 2 slow services, etc. That means that we can expect the overall shape of the distribution to change as N increases. Thanks to the central limit theorem we could work out what that looks like as N gets large, but the journey there is interesting too.&lt;/p&gt;

&lt;p&gt;Here, we’re simulating the effects of chain length on the latency of two different worlds. One &lt;em&gt;Tail&lt;/em&gt; world which has the bimodal distribution we describe above, and one &lt;em&gt;No Tail&lt;/em&gt; world which only has the primary distribution around 10ms.&lt;/p&gt;

&lt;video width=&quot;480&quot; height=&quot;480&quot; autoplay=&quot;&quot; controls=&quot;&quot;&gt;
  &lt;source src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/freq_sums.mp4&quot; type=&quot;video/mp4&quot; /&gt;
&lt;/video&gt;

&lt;p&gt;Again, the tail latency becomes more prominent here. That relatively rare tail increases the variance of the distribution we’re converging on by a factor of 25. That’s a huge difference, caused by something that didn’t seem too important to start with.&lt;/p&gt;
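
And a sketch of the serial-chain half (again my reconstruction, with the same assumed distributions): the 1% tail mode inflates the per-call variance from 4 ms² to roughly 85 ms², a factor of about 20 in this parameterization, and because variances of independent calls add, the gap carries straight through to chains of any length.

```python
import random
import statistics

rng = random.Random(0)

def call(tail):
    """One call: ~N(10, 2) ms, plus (if tail) a 1% ~N(100, 10) ms slow mode."""
    if tail and rng.random() < 0.01:
        return rng.gauss(100.0, 10.0)
    return rng.gauss(10.0, 2.0)

def chain_latency(n, tail):
    """End-to-end latency of a serial chain of n calls: latencies add."""
    return sum(call(tail) for _ in range(n))

def chain_variance(n, tail, trials=20_000):
    return statistics.variance(chain_latency(n, tail) for _ in range(trials))

# A chain of 10 calls: variance ~40 ms^2 without the tail, ~850 ms^2 with it.
```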

&lt;p&gt;&lt;strong&gt;Choosing Summary Statistics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One way that this should influence your thinking is in how you choose which latency statistics to monitor. The truth is that no summary statistic is going to give you the full picture. Looking at histograms is cool, but tends to miss the time component. You could look at some kind of windowed histogram heat map, but probably won’t. Instead, make sure you’re aware of the high percentiles of service latency, and consider monitoring common customer or client use-cases and their end-to-end latency experience.&lt;/p&gt;

&lt;p&gt;Trimmed means, winsorized means, truncated means, interquartile ranges, and other statistics which trim off some of the tail of the distribution seem to be gaining in popularity. There’s a lot to like about the trimmed mean and friends, but cutting off the right tail will cause you to miss effects where that tail is very important, and may become dominant depending on how clients call your service.&lt;/p&gt;

&lt;p&gt;I continue to believe that if you’re going to measure just one thing, make it &lt;a href=&quot;https://brooker.co.za/blog/2017/12/28/mean.html&quot;&gt;the mean&lt;/a&gt;. However, you probably want to measure more than one thing.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Redundant against what?</title>
      <link>http://brooker.co.za/blog/2021/04/14/redundancy.html</link>
      <pubDate>Wed, 14 Apr 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/04/14/redundancy</guid>
      <description>&lt;h1 id=&quot;redundant-against-what&quot;&gt;Redundant against what?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Threat modeling thinking to distributed systems.&lt;/p&gt;

&lt;p&gt;There’s basically one fundamental reason that distributed systems can achieve better availability than single-box systems: redundancy. The software, state, and other things needed to run a system are present in multiple places. When one of those places fails, the others can take over. This applies to replicated databases, load-balanced stateless systems, serverless systems, and almost all other common distributed patterns.&lt;/p&gt;

&lt;p&gt;One problem with redundancy is that it &lt;a href=&quot;https://brooker.co.za/blog/2019/06/20/redundancy.html&quot;&gt;adds complexity&lt;/a&gt;, which may reduce availability. Another problem, and the one that people tend to miss the most, is that redundancy isn’t one thing. Like &lt;em&gt;security&lt;/em&gt;, redundancy is a single word we use to mean that our architectures and systems are resistant to different kinds of failures. That can mean infrastructure failures, where redundancy could mean multiple machines, multiple racks, multiple datacenters or even multiple continents. It can mean software failures, where common techniques like canary deployments help systems to be redundant when one software version fails. It can also mean logical failures, where we recognize that &lt;em&gt;state&lt;/em&gt; can affect the performance or availability of our system, and we try to ensure that the same &lt;em&gt;state&lt;/em&gt; doesn’t go to every host. Sometimes that state is configuration, sometimes it’s stored data or requests and responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, when we talk about system designs, we tend to forget these multiple definitions of redundancy and instead just focus on infrastructure. To show why this matters, let’s explore an example.&lt;/p&gt;

&lt;p&gt;Event logs are rightfully a popular way to build large-scale systems. In these kinds of systems there’s an ordered log which all changes (writes) flow through, and the changes are then applied to some systems that hang off the log. That could be read copies of the data, workflow systems taking action on the changes, and so on. In the simple version of this pattern one thing is true: every host in the log, and every consumer, sees the same changes in the same order.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/bus_arch_0.jpg&quot; alt=&quot;Event bus architecture, with three replicas hanging off the bus&quot; /&gt;&lt;/p&gt;

&lt;p&gt;One advantage of this architecture is that it can offer a lot of redundancy against infrastructure failures. Common event log systems (like Kafka) can easily handle the failure of a single host. Surviving the failure of a single replica is also easy, because the architecture makes it very easy to keep multiple replicas in sync.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/bus_arch_1.jpg&quot; alt=&quot;Event bus architecture, with three replicas hanging off the bus, with host failures&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, consider the case where one of the events that comes down the log is a &lt;em&gt;poison pill&lt;/em&gt;. This simply means that the consumers don’t know how to process it. Maybe it says something that’s illegal (“I can’t decrement this unsigned 0!”), or doesn’t make sense (“what’s this data in column X? I’ve never heard of column X!”). Maybe it says something that only makes sense in a future, or past, version of the software. When faced with a poison pill, replicas have basically two options: ignore it, or stop.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/bus_arch_2.jpg&quot; alt=&quot;Event bus architecture, with three replicas hanging off the bus, with logical failures&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Ignoring it could lead to data loss, and stopping leads to writes being unavailable. Nobody wins. The problem here is a lack of redundancy: running the same (deterministic) software on the same state is going to have the same bad outcome every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More Generally&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This problem doesn’t only apply to event log architectures. Replicated state machines, famously, suffer from the same problem. So does primary/backup replication. It’s not a problem with one architecture, but a problem with distributed systems designs in general. As you design systems, it’s worth asking the question about what you’re getting from your redundancy, and what failures it protects you against. In some sense, this is the same kind of thinking that security folks use when they do &lt;a href=&quot;https://en.wikipedia.org/wiki/Threat_model&quot;&gt;threat modeling&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Threat modeling answers questions like “Where am I most vulnerable to attack?”, “What are the most relevant threats?”, and “What do I need to do to safeguard against these threats?”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few years ago, I experimented with building a &lt;a href=&quot;https://brooker.co.za/blog/2015/06/20/calisto.html&quot;&gt;threat modeling framework for distributed system designs&lt;/a&gt;, called CALISTO, but I never found something I loved. I do love the way of thinking, though. “What failures am I vulnerable to?”, “Which are the most relevant failures?”, “What do I need to do to safeguard against those failures?”&lt;/p&gt;

&lt;p&gt;If your answer to “What failures am I vulnerable to?” doesn’t include software bugs, you’re more optimistic than me.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>What You Can Learn From Old Hard Drive Adverts</title>
      <link>http://brooker.co.za/blog/2021/03/25/latency-bandwidth.html</link>
      <pubDate>Thu, 25 Mar 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/03/25/latency-bandwidth</guid>
      <description>&lt;h1 id=&quot;what-you-can-learn-from-old-hard-drive-adverts&quot;&gt;What You Can Learn From Old Hard Drive Adverts&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;The single most important trend in systems.&lt;/p&gt;

&lt;p&gt;Adverts for old computer hardware, especially hard drives, are a fun staple of computer forums and the nerdier side of the internet&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. For example, a couple days ago, Glenn Lockwood tweeted out this old ad:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;At least this isn’t an ad for a HAMR drive. $10k in today’s dollars. &lt;a href=&quot;https://t.co/2h2g3Gnguw&quot;&gt;pic.twitter.com/2h2g3Gnguw&lt;/a&gt;&lt;/p&gt;&amp;mdash; Glenn K. Lockwood (@glennklockwood) &lt;a href=&quot;https://twitter.com/glennklockwood/status/1374770622748708864?ref_src=twsrc%5Etfw&quot;&gt;March 24, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Apparently from the early ’80s, these drives offered seek times of 70ms, access speeds of about 900kB/s, and capacities up to 10MB. Laughable, right? But these same ads hide a really important trend that’s informed system design more than any other. To understand what’s going on, let’s compare this creaky old 10MB drive to a modern competitor. Most consumers don’t buy magnetic drives anymore, so we’ll throw in an SSD for good measure.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;XCOMP 10MB   &lt;/th&gt;
      &lt;th&gt;Modern HDD  &lt;/th&gt;
      &lt;th&gt;Change&lt;/th&gt;
      &lt;th&gt;Modern SSD &lt;/th&gt;
      &lt;th&gt;Change&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Capacity&lt;/td&gt;
      &lt;td&gt;10MB&lt;/td&gt;
      &lt;td&gt;18TiB&lt;/td&gt;
      &lt;td&gt;1.8 million times  &lt;/td&gt;
      &lt;td&gt;2 TiB&lt;/td&gt;
      &lt;td&gt;200,000x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Latency&lt;/td&gt;
      &lt;td&gt;70ms&lt;/td&gt;
      &lt;td&gt;5ms&lt;/td&gt;
      &lt;td&gt;14x&lt;/td&gt;
      &lt;td&gt;50μs&lt;/td&gt;
      &lt;td&gt;1400x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Throughput&lt;/td&gt;
      &lt;td&gt;900kB/s&lt;/td&gt;
      &lt;td&gt;220MB/s&lt;/td&gt;
      &lt;td&gt;250x&lt;/td&gt;
      &lt;td&gt;3000MB/s&lt;/td&gt;
      &lt;td&gt;3300x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;IOPS/GiB (QD1)&lt;/td&gt;
      &lt;td&gt;1400&lt;/td&gt;
      &lt;td&gt;0.01&lt;/td&gt;
      &lt;td&gt;0.00007x&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;0.007x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Or thereabouts&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Starting with the magnetic disk, we’ve made HUGE gains in storage size, big gains in throughput, modest gains in latency, and seen a massive drop in random IO per unit of storage. What may be surprising to you is that SSDs, despite being much faster in every department, have seen pretty much the same overall trend.&lt;/p&gt;

&lt;p&gt;This is not, by any stretch, a new observation. 15 years ago the great Jim Gray said “Disk is Tape”. David Patterson (you know, Turing award winner, RISC co-inventor, etc) wrote a great paper back in 2004 titled &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.7415&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Latency Lags Bandwidth&lt;/a&gt; that made the same observation. He wrote:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I am struck by a consistent theme across many technologies: bandwidth improves much more quickly than latency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That may not sound like a huge amount, but remember that we’re talking about exponential growth here, and exponential growth is a wicked thing that breaks our minds. Multiplying Patterson’s trend out, by the time bandwidth improves 1000x, latency improves only 6-30x. That’s about what we’re seeing in the table above: a 250x improvement in bandwidth, and a 14x improvement in latency. Latency lags bandwidth. Bandwidth lags capacity.&lt;/p&gt;
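&lt;p&gt;As a back-of-the-envelope check (my arithmetic, not Patterson’s), compounding the per-doubling ratios looks like this:&lt;/p&gt;

```python
import math

# Patterson: each time bandwidth doubles, latency improves by a factor
# of only 1.2 to 1.4. It takes about ten doublings for bandwidth to
# improve 1000x; compound the latency ratio over those doublings.
doublings = math.log2(1000)           # roughly 9.97 doublings

latency_low = 1.2 ** doublings        # pessimistic end: about 6x
latency_high = 1.4 ** doublings       # optimistic end: about 29x

print(f"1000x bandwidth gives {latency_low:.0f}x to {latency_high:.0f}x latency")
```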

&lt;p&gt;One way to look at this is how long it would take to read the whole drive with a serial stream of 4kB random reads. The 1980s drive would take about 3 minutes. The SSD would take around 8 hours. The modern hard drive would take about 10 months. It’s not a surprise to anybody that small random IOs are slow, but it may be a surprise just how slow. It’s a problem that’s getting exponentially worse.&lt;/p&gt;
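&lt;p&gt;Those whole-drive read times fall out of a one-liner, using the rough numbers from the table above:&lt;/p&gt;

```python
# Time to read each whole drive with a serial stream of 4kB random
# reads (queue depth 1), using the rough numbers from the table above.
READ_SIZE = 4 * 1024  # bytes per random read

def serial_read_time_s(capacity_bytes, latency_s):
    """Seconds to read the whole drive, one 4kB random read at a time."""
    return (capacity_bytes / READ_SIZE) * latency_s

xcomp_s = serial_read_time_s(10e6, 70e-3)      # about 3 minutes
ssd_s = serial_read_time_s(2 * 2**40, 50e-6)   # about 7.5 hours
hdd_s = serial_read_time_s(18 * 2**40, 5e-3)   # roughly 9-10 months
```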

&lt;p&gt;&lt;strong&gt;So what?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every stateful system we build brings with it some tradeoff between latency, bandwidth, and storage costs. For example, RAID5-style 4+1 erasure coding allows a system to survive the loss of one disk. 2-replication can do the same thing, with 1.6x the storage cost and 2/5ths the IOPS cost. Log-structured databases, filesystems and file formats all make bets about storage cost, bandwidth cost, and random access cost. The changing ratios between hardware capabilities require that systems be re-designed over time to meet the capabilities of new hardware: yesterday’s software and approaches just aren’t efficient on today’s systems.&lt;/p&gt;
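&lt;p&gt;Here’s a sketch of the accounting behind those 4+1 numbers. This is my simplified model (one IO per disk touched, one large write per replica); real systems are messier:&lt;/p&gt;

```python
# Rough cost accounting for storing an object of 4 data units, comparing
# RAID5-style 4+1 erasure coding against 2x replication. Both schemes
# survive the loss of any one disk.
DATA_UNITS = 4

ec_units = DATA_UNITS + 1        # 4 data + 1 parity = 5 units stored
rep_units = DATA_UNITS * 2       # 2 full copies = 8 units stored

storage_ratio = rep_units / ec_units   # 1.6x: replication costs more space

# Writing the object, counting one IO per disk touched:
ec_write_ios = 5                 # one IO to each of the 5 stripe members
rep_write_ios = 2                # one large sequential write per replica
iops_ratio = rep_write_ios / ec_write_ios  # 0.4: replication needs 2/5 the IOs
```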

&lt;p&gt;The other important thing is parallelism. I pulled a bit of a sleight of hand up there by using QD1. That’s a queue depth of one. Send an IO, wait for it to complete, send the next one. Real storage devices can do better when you give them multiple IOs at a time. Hard drives do better with scheduling trickery, handling “nearby” IOs first. Operating systems have done IO scheduling for this purpose forever, and for the last couple decades drives have been smart enough to do it themselves. SSDs, on the other hand, &lt;a href=&quot;https://brooker.co.za/blog/2014/07/04/iostat-pct.html&quot;&gt;have real internal parallelism&lt;/a&gt; because they aren’t constrained by the bounds of physical heads. Offering lots of IOs to an SSD at once can improve performance by as much as 50x. Back in the ’80s, IO parallelism didn’t matter. It’s a huge deal now.&lt;/p&gt;
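&lt;p&gt;A toy model makes the queue depth effect concrete. Here I pretend the SSD is a set of independent channels, each serving one IO at a time with fixed latency; the channel count and latency are illustrative, not from any real device:&lt;/p&gt;

```python
# Toy model of why queue depth matters for SSDs: treat the drive as a
# number of independent channels, each serving one IO at a time with
# fixed latency. Real devices are messier; numbers are illustrative.
def iops(queue_depth, channels, latency_s):
    """IOPS when up to `channels` IOs can be serviced in parallel."""
    return min(queue_depth, channels) / latency_s

latency = 50e-6                      # 50 microsecond reads
qd1 = iops(1, 32, latency)           # 20,000 IOPS at queue depth 1
qd32 = iops(32, 32, latency)         # 640,000 IOPS with the device kept busy
speedup = qd32 / qd1                 # 32x in this model
```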

&lt;p&gt;There are two conclusions here for the working systems designer. First, pay attention to hardware trends. Stay curious, and update your internal constants from time to time. Exponential growth may mean that your mental model of hardware performance is completely wrong, even if it’s only a couple years out of date. Second, system designs rot. The real-world tradeoffs change, for these reasons as well as many others. The data structures and storage strategies in your favorite textbook likely haven’t stood the test of time. The POSIX IO API definitely hasn’t.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; See, for example, &lt;a href=&quot;https://www.reddit.com/r/interestingasfuck/comments/ay225x/this_xcomp_hard_disk_advertisement_from_1981_how/&quot;&gt;this Reddit thread&lt;/a&gt;, &lt;a href=&quot;https://forums.unraid.net/topic/7377-10-mb-xcomp-hard-drive-339800/&quot;&gt;unraid forums&lt;/a&gt;, &lt;a href=&quot;http://mag.metamythic.com/old-hard-disk-drive-adverts/&quot;&gt;this site&lt;/a&gt; and so on. They’re everywhere.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; I extracted these numbers from my head, but I think they’re more-or-less representative of modern mainstream NVMe and enterprise magnetic drives.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Incident Response Isn't Enough</title>
      <link>http://brooker.co.za/blog/2021/02/22/postmortem.html</link>
      <pubDate>Mon, 22 Feb 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/02/22/postmortem</guid>
      <description>&lt;h1 id=&quot;incident-response-isnt-enough&quot;&gt;Incident Response Isn’t Enough&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Single points of failure become invisible.&lt;/p&gt;

&lt;p&gt;Postmortems, COEs, incident reports. Whatever your organization calls them, when done right they are a popular and effective way of formalizing the process of digging into system failures, and driving change. The success of this approach has led some to believe that postmortems are the &lt;em&gt;best&lt;/em&gt;, or even &lt;em&gt;only&lt;/em&gt;, way to improve the long-term availability of systems. Unfortunately, that isn’t true. A good availability program requires deep insight into the design of the system.&lt;/p&gt;

&lt;p&gt;To understand why, let’s build a house, then a small community.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/avail_slide_1.png&quot; alt=&quot;A house, with four things it needs to be a working home&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Our house has four walls, a roof, and a few things it needs to be a habitable home. We’ve got a well for water, a field of corn for food, a wood pile for heat, and a septic tank. If any one of these things is not working, let’s say that the house is &lt;em&gt;unavailable&lt;/em&gt;. Our goal is to build many houses, and make sure they are unavailable for as little of the time as possible.&lt;/p&gt;

&lt;p&gt;When we want to build a second house, we’re faced with a choice. The simple approach is just to stamp out a second copy of the entire house, with its own field, wood, well, and tank. That approach is great: the failures of the two houses are completely independent, and availability is very easy to reason about.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/avail_slide_2.png&quot; alt=&quot;Two houses, with full redundancy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As we scale this approach up, however, we’re met with the economic pressure to share components. This makes a lot of sense: wells are expensive to drill, and don’t break down often, so sharing one between many houses could save the home owners a lot of money. Not only does sharing a well reduce construction costs but, thanks to the averaging effect of adding the demand of multiple houses together, it also reduces the peak-to-average ratio of water demand. That improves ongoing economics, too.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/avail_slide_3.png&quot; alt=&quot;Five houses, sharing a well&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In exchange for the improved economics, we’ve bought ourselves a potential problem. The failure of the well will cause all the houses in our community to become &lt;em&gt;unavailable&lt;/em&gt;. The well has high &lt;em&gt;blast radius&lt;/em&gt;. Mitigating that is well-trodden technical ground, but there’s a second-order organizational and cultural effect worth paying attention to.&lt;/p&gt;

&lt;p&gt;Every week, our community’s maintenance folks get together and talk about problems that occurred during the week. Dead corn, full tanks, empty woodpiles, etc. They’re great people with good intentions, so for each of these issues they carefully draw up plans to prevent recurrence of the issue, and invest the right amount in following up on those issues. They invest in the most urgent issues, and talk a lot about the most common issues. The community grows, and the number of issues grows. The system of reacting to them scales nicely.&lt;/p&gt;

&lt;p&gt;Everything is great until the well breaks. The community is without water, and everybody is mad at the maintenance staff. They’d hardly done any maintenance on the well all year! It wasn’t being improved! They spent all their attention elsewhere! Why?&lt;/p&gt;

&lt;p&gt;The problem here is simple. With 100 houses in the community, there were 100 fields, 100 tanks, 100 piles, and one well. The well was only responsible for 1 in every 301 issues, just 0.33%. So, naturally, the frequency-based maintenance plan spent just 0.33% of the maintenance effort on it. Over time, with so little maintenance, it got a little creaky, but was still only a tiny part of the overall set of problems.&lt;/p&gt;
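&lt;p&gt;The arithmetic behind that plot, assuming each per-house component generates issues at about the same rate as the well:&lt;/p&gt;

```python
# Share of issues attributable to the shared well, assuming each house
# contributes one field, one tank, and one wood pile, all generating
# issues at roughly the same rate as the single shared well.
def well_share(houses):
    components = 3 * houses + 1      # 3 per-house components + 1 shared well
    return 1 / components

share_100 = well_share(100)          # 1 in 301 issues, about 0.33%
```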

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/avail_slide_4.png&quot; alt=&quot;Plot showing how the percentage of action items related to the well drops with scale&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is one major problem with driving any availability program only from postmortems. It feels like a data-driven approach, but tends to be biased in exactly the ways we don’t want a data-driven approach to be biased. As a start, the frequency measurement needs to be weighted based on impact. That doesn’t solve the problem. The people making decisions are human, and humans are bad at making decisions. One way we’re bad at decisions is called the &lt;a href=&quot;https://en.wikipedia.org/wiki/Availability_heuristic&quot;&gt;Availability Heuristic&lt;/a&gt;: We tend to place more importance on things we can remember easily. Like those empty wood piles we talk about every week, and not the well issue from two years ago. Fixing this requires that an availability program takes &lt;em&gt;risk&lt;/em&gt; into account, not only in how we measure, but also in how often we talk about issues.&lt;/p&gt;

&lt;p&gt;It’s very easy to forget about your single point of failure. After all, there’s just one.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Fundamental Mechanism of Scaling</title>
      <link>http://brooker.co.za/blog/2021/01/22/cloud-scale.html</link>
      <pubDate>Fri, 22 Jan 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/01/22/cloud-scale</guid>
      <description>&lt;h1 id=&quot;the-fundamental-mechanism-of-scaling&quot;&gt;The Fundamental Mechanism of Scaling&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;It&apos;s not Paxos, unfortunately.&lt;/p&gt;

&lt;p&gt;A common misconception among people picking up distributed systems is that replication and consensus protocols—Paxos, Raft, and friends—are the tools used to build the largest and most scalable systems. It’s obviously true that these protocols are important building blocks. They’re used to build systems that offer more availability, better durability, and stronger integrity than a single machine. At the most basic level, though, they don’t make systems scale.&lt;/p&gt;

&lt;p&gt;Instead, the fundamental approach used to scale distributed systems is &lt;em&gt;avoiding&lt;/em&gt; co-ordination. Finding ways to make progress on work that doesn’t require messages to pass between machines, between clusters of machines, between datacenters and so on. The fundamental tool of cloud scaling is coordination avoidance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Spectrum of Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With this in mind, we can build a kind of spectrum of the amount of coordination required in different system designs:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Coordinated&lt;/em&gt; These are the kinds of systems that use Paxos, Raft, chain replication or some other protocol to make a group of nodes work closely together. The amount of work done by the system generally scales with the offered work (&lt;em&gt;W&lt;/em&gt;) and the number of nodes (&lt;em&gt;N&lt;/em&gt;), something like O(&lt;em&gt;N&lt;/em&gt; * &lt;em&gt;W&lt;/em&gt;) (or, potentially, worse under some kinds of failures).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data-dependent Coordination&lt;/em&gt; These systems break their workload up into uncoordinated pieces (like shards), but offer ways to coordinate across shards where needed. Probably the most common type of system in this category is sharded databases, which break data up into independent pieces, but then use some kind of coordination protocol (such as two-phase commit) to offer cross-shard transactions or queries. Work done can vary between O(&lt;em&gt;W&lt;/em&gt;) and O(&lt;em&gt;N&lt;/em&gt; * &lt;em&gt;W&lt;/em&gt;) depending on access patterns, customer behavior and so on.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Leveraged Coordination&lt;/em&gt; These systems take a coordinated system and build a layer on top of it that can do many requests per unit of coordination. Generally, coordination is only needed to handle failures, scale up, redistribute data, or perform other similar management tasks. In the happy case, work done in these kinds of systems is O(&lt;em&gt;W&lt;/em&gt;). In the bad case, where something about the work or environment forces coordination, they can change to O(&lt;em&gt;N&lt;/em&gt; * &lt;em&gt;W&lt;/em&gt;) (see &lt;a href=&quot;http://brooker.co.za/blog/2019/05/01/emergent.html&quot;&gt;Some risks of coordinating only sometimes&lt;/a&gt; for more). Despite this risk, this is a rightfully popular pattern for building scalable systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Uncoordinated&lt;/em&gt; These are the kinds of systems where work items can be handled independently, without any need for coordination. You might think of them as embarrassingly parallel, sharded, partitioned, geo-partitioned, or one of many other ways of breaking up work. Uncoordinated systems scale the best. Work is always O(&lt;em&gt;W&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;This is only one cut through a complex space, and some systems don’t quite fit&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.  I think it’s still useful, though, because by building a hierarchy of coordination we can think clearly about the places in our systems that scale the best and worst. The closer a system is to the uncoordinated end the better it will scale, in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other useful tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are many other ways to approach this question of when coordination is necessary, and how that influences scale.&lt;/p&gt;

&lt;p&gt;The CAP theorem&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, along with a rich tradition of other impossibility results&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, places limits on the kinds of things systems can do (and, most importantly, the kinds of things they can offer to their clients) without needing coordination. If you want to get into the details there, the breakdown in Figure 2 of &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;Highly Available Transactions: Virtues and Limitations&lt;/a&gt; is pretty clear. I like it because it shows us both what is possible, and what isn’t.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://arxiv.org/pdf/1901.01930.pdf&quot;&gt;CALM theorem&lt;/a&gt; is very useful, because it provides a clear logical framework for whether particular programs can be run without coordination, and something of a path for constructing programs that are coordination free. If you’re going to read just one distributed systems paper this year, you could do a lot worse than &lt;a href=&quot;https://arxiv.org/pdf/1901.01930.pdf&quot;&gt;Keeping CALM&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.411&quot;&gt;Harvest and Yield&lt;/a&gt; is another way to approach the problem, by thinking about when systems can return partial results&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. This is obviously a subtle topic, because the real question is when your clients and customers can accept partial results, and how confused they will be when they get them. At the extreme end, you start expecting clients to write code that can handle any subset of the full result set. Sometimes that’s OK, sometimes it sends them down the same rabbit hole that CALM takes you down. Probably the hardest part for me is that partial-result systems are hard to test and operate, because there’s a kind of mode switch between partial and complete results and &lt;a href=&quot;https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/&quot;&gt;modes make life difficult&lt;/a&gt;. There’s also the minor issue that there are 2&lt;sup&gt;N&lt;/sup&gt; subsets of results, and testing them all is often infeasible. In other words, this is a useful tool, but it’s probably best not to expose your clients to the full madness it leads to.&lt;/p&gt;

&lt;p&gt;Finally, we can think about the work that each node needs to do. In a &lt;em&gt;coordinated&lt;/em&gt; system, there is generally one or more nodes that do O(&lt;em&gt;W&lt;/em&gt;) work. In an uncoordinated system, the ideal node does O(&lt;em&gt;W&lt;/em&gt;/&lt;em&gt;N&lt;/em&gt;) work, which turns into O(1) work because &lt;em&gt;N&lt;/em&gt; is proportional to &lt;em&gt;W&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Like systems that coordinate heavily on writes but mostly avoid coordination on reads. &lt;a href=&quot;https://www.usenix.org/legacy/event/usenix09/tech/full_papers/terrace/terrace.pdf&quot;&gt;CRAQ&lt;/a&gt; is one such system, and a paper that helped me fall in love with distributed systems. So clever, and so simple once you understand it.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Best described by &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Brewer and Lynch&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; See, for example, Nancy Lynch’s 1989 paper &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.5022&quot;&gt;A Hundred Impossibility Proofs for Distributed Computing&lt;/a&gt;. If there were a hundred of these in 1989, you can imagine how many there are now, 32 years later. Wow, 1989 was 32 years ago. Huh.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; I wrote &lt;a href=&quot;http://brooker.co.za/blog/2014/10/12/harvest-yield.html&quot;&gt;a post&lt;/a&gt; about it back in 2014.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Quorum Availability</title>
      <link>http://brooker.co.za/blog/2021/01/06/quorum-availability.html</link>
      <pubDate>Wed, 06 Jan 2021 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2021/01/06/quorum-availability</guid>
      <description>&lt;h1 id=&quot;quorum-availability&quot;&gt;Quorum Availability&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;It&apos;s counterintuitive, but is it right?&lt;/p&gt;

&lt;p&gt;In our paper &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Millions of Tiny Databases&lt;/a&gt;, we say this about the availability of quorum systems of various sizes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As illustrated in Figure 4, smaller cells offer lower availability in the face of small numbers of uncorrelated node failures, but better availability when the proportion of node failure exceeds 50%. While such high failure rates are rare, they do happen in practice, and a key design concern for Physalia.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And this is what Figure 4 looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/mtb_fig_4.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The context here is that a &lt;em&gt;cell&lt;/em&gt; is a Paxos cluster, and the system needs a majority quorum for the cluster to be able to process requests&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. A cluster of one box needs one box available, five boxes need three available and so on. The surprising thing here is the claim that having smaller clusters is actually &lt;em&gt;better&lt;/em&gt; if the probability of any given machine failing is very high. The paper doesn’t explain it well, and I’ve gotten a few questions about it. This post attempts to do better.&lt;/p&gt;

&lt;p&gt;Let’s start by thinking about what happens for a cluster of one machine (&lt;em&gt;n=1&lt;/em&gt;), in a datacenter of &lt;em&gt;N&lt;/em&gt; machines (for very large &lt;em&gt;N&lt;/em&gt;). We then fail each machine independently with probability &lt;em&gt;p&lt;/em&gt;. What is the probability that our one machine failed? That’s trivial: it’s &lt;em&gt;p&lt;/em&gt;. Now, let’s take all &lt;em&gt;N&lt;/em&gt; machines and put them into a cluster of &lt;em&gt;n=N&lt;/em&gt;. What’s the probability that a majority of the cluster is available? For large &lt;em&gt;N&lt;/em&gt;, it’s 1 for &lt;em&gt;p &amp;lt; 0.5&lt;/em&gt;, and 0 for &lt;em&gt;p &amp;gt; 0.5&lt;/em&gt;. If less than half the machines fail, less than half have failed. If more than half the machines fail, more than half have failed. Ok?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/quorum_avail_a.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Notice how a cluster size of 1 is worse than N up until &lt;em&gt;p = 0.5&lt;/em&gt; then better after. &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.5629&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Peleg and Wool&lt;/a&gt; say:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;… for &lt;em&gt;0 &amp;lt; p &amp;lt; ½&lt;/em&gt; the most available NDC&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; is shown to be the “democracy” (namely, the minimal majority system), while the “monarchy” (singleton system) is least available. Due to symmetry, the picture reverses for &lt;em&gt;½ &amp;lt; p &amp;lt; 1&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here, the &lt;em&gt;minimal majority system&lt;/em&gt; is the one I’d call a &lt;em&gt;majority quorum&lt;/em&gt;, and is used by Physalia (and, indeed, most Paxos implementations). The &lt;em&gt;monarchy&lt;/em&gt; is where you have one leader node.&lt;/p&gt;

&lt;p&gt;What about real practical cluster sizes like &lt;em&gt;n=3&lt;/em&gt;, 5, and 7? There are three ways we can do this math. In &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.5629&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;The Availability of Quorum Systems&lt;/a&gt;, Peleg and Wool derive closed-form solutions to this problem&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Our second approach is to observe that the failures of the nodes are Bernoulli trials with probability &lt;em&gt;p&lt;/em&gt;, and therefore we can read the answer to “what is the probability that 0 or 1 of 3 fail for probability &lt;em&gt;p&lt;/em&gt;” from the distribution function of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Binomial_distribution&quot;&gt;binomial distribution&lt;/a&gt;. Finally, we can be lazy and do it with Monte Carlo. That’s normally my favorite method, because it’s easier to include correlation and various “what if?” questions as we go.&lt;/p&gt;
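&lt;p&gt;The binomial version takes just a couple of lines: the availability of an &lt;em&gt;n&lt;/em&gt;-node majority quorum is the probability of seeing at most (&lt;em&gt;n&lt;/em&gt;-1)/2 failures, rounded down:&lt;/p&gt;

```python
from math import comb

# Availability of an n-node majority quorum when each node fails
# independently with probability p: the cluster is available when the
# number of failures is at most floor((n-1)/2). This is just the
# binomial distribution function mentioned above.
def quorum_availability(n, p):
    max_failures = (n - 1) // 2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(max_failures + 1))

# For small p, bigger clusters win; above p = 0.5, the order reverses.
low_p = [quorum_availability(n, 0.1) for n in (1, 3, 5)]
high_p = [quorum_availability(n, 0.9) for n in (1, 3, 5)]
```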

&lt;p&gt;Whichever way you calculate it, what do you expect it to look like? For small &lt;em&gt;n&lt;/em&gt; you may expect it to be closer in shape to &lt;em&gt;n=1&lt;/em&gt;, and for large &lt;em&gt;n&lt;/em&gt; you may expect it to approach the shape of &lt;em&gt;n=N&lt;/em&gt;. If that’s what you expect, you’d be right.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/quorum_avail_b.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I’ll admit that I find this result deeply counter-intuitive. I think it’s right, because I’ve approached it multiple ways, but it still kind of bends my mind a little. That may just be me. I’ve discussed it with friends and colleagues over the years, and they seem to think it matches their intuition. It’s counter-intuitive to me because it suggests that smaller &lt;em&gt;n&lt;/em&gt; (smaller clusters, or smaller cells in Physalia’s parlance) is better for high &lt;em&gt;p&lt;/em&gt;! If you think a lot of your boxes are going to fail, you may get better availability (not durability, though) from smaller clusters.&lt;/p&gt;

&lt;p&gt;Weird.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlation to the rescue!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s not often that my statistical intuition is saved by introducing correlation, but in this case it helps. I’d argue that, in practice, you only lose machines in an uncorrelated Bernoulli trial way for small &lt;em&gt;p&lt;/em&gt;. Above a certain &lt;em&gt;p&lt;/em&gt;, it’s likely that the failures have some shared cause (power, network, clumsy people, etc) and so the failures are likely to be correlated in some way. In which case, we’re back into the game we’re playing with Physalia of avoiding those correlated failures by optimizing placement.&lt;/p&gt;

&lt;p&gt;In many other kinds of systems, like ones you deploy across multiple datacenters (we’d call that &lt;em&gt;regional&lt;/em&gt; in AWS, deployed across multiple &lt;em&gt;availability zones&lt;/em&gt;), you end up treating the datacenters as units of failure. In that case, for 3 datacenters you’d pick something like &lt;em&gt;n=9&lt;/em&gt; because you can keep quorum after the failure of an entire datacenter (3 machines) and any one other machine. As soon as there’s correlation, the math above is mostly useless and the correlation’s cause is all that really matters.&lt;/p&gt;

&lt;p&gt;Availability also isn’t the only thing to think about with cluster size for quorum systems. Durability, latency, cost, operations, and contention on leader election also come into play. Those are topics for another post (or section 2.2 of &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Millions of Tiny Databases&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;JP Longmore sent me this intuitive explanation, which makes a lot of sense:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Probability of achieving a quorum will increase when removing 2 nodes from a cluster, each with failure rate p&amp;gt;.5, since on average you’re removing 2 bad nodes instead of 2 good nodes. Other cases with 1 good node &amp;amp; 1 bad node don’t change the outcome (quorum/not). Repeat reasoning till N=1 or all remaining nodes have p&amp;lt;=.5 (if failure rate isn’t uniform).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Physalia uses a very naive Paxos implementation, intentionally optimized for testability and simplicity. The quorum intersection requirements of Paxos (or Paxos-like protocols) are more subtle than this, and work like Heidi Howard et al’s &lt;a href=&quot;https://fpaxos.github.io/&quot;&gt;Flexible Paxos&lt;/a&gt; has been pushing the envelope here recently. &lt;a href=&quot;https://arxiv.org/pdf/1608.06696v1.pdf&quot;&gt;Flexible Paxos:  Quorum intersection revisited&lt;/a&gt; is a good place to start.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Here, an NDC is a &lt;em&gt;non-dominated coterie&lt;/em&gt;, and a &lt;em&gt;coterie&lt;/em&gt; is a set of groups of nodes (like &lt;em&gt;{{a, b}, {b, c}, {a, c}}&lt;/em&gt;). See Definition 2.2 in &lt;a href=&quot;https://www.cs.purdue.edu/homes/bb/cs542-20Spr/readings/reliability/How%20to%20assign%20Votes-JACM-garcia-molina.pdf&quot;&gt;How to Assign Votes in a Distributed System&lt;/a&gt; for the technical definition of domination. What’s important, though, is that for each &lt;em&gt;dominated coterie&lt;/em&gt; there’s a &lt;em&gt;non-dominated coterie&lt;/em&gt; that provides the same mutual exclusion properties, but superior availability under partitions. The details are not particularly important here, but are very interesting if you want to do tricky things with quorum intersection.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Along with a whole lot of other interesting facts about quorums, majority quorums and other things. It’s a very interesting paper. Another good read in this space is Garcia-Molina and Barbara’s &lt;a href=&quot;https://www.cs.purdue.edu/homes/bb/cs542-20Spr/readings/reliability/How%20to%20assign%20Votes-JACM-garcia-molina.pdf&quot;&gt;How to Assign Votes in a Distributed System&lt;/a&gt;, which both does a better job than Peleg and Wool of defining the terms it uses, but also explores the general idea of assigning &lt;em&gt;votes&lt;/em&gt; to machines, rather than simply forming quorums of machines. As you read it, it’s worth remembering that it predates Paxos, and many of the terms might not mean what you expect.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Getting Big Things Done</title>
      <link>http://brooker.co.za/blog/2020/10/19/big-changes.html</link>
      <pubDate>Mon, 19 Oct 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/10/19/big-changes</guid>
      <description>&lt;h1 id=&quot;getting-big-things-done&quot;&gt;Getting Big Things Done&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;In one particular context.&lt;/p&gt;

&lt;p&gt;A while back, a colleague wanted to make a major change in the design of a system, the sort of change that was going to take a year or more, and many tens of person-years of effort. They asked me how to justify the project. This post is part of the email reply I sent. The advice is in the context of technical leadership work at a big company, but it may apply elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it the right solution?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I like to pay attention to ways I can easily fool myself. One of those ways is an &lt;em&gt;availability heuristic&lt;/em&gt; applied to big problems. I see a big problem that needs a big solution, and am strongly biased to believe that the first big solution that presents itself is the right one. It takes intentional effort to figure out whether the big solution is, indeed, a solution to the big problem. Bold action, after all, isn’t a solution itself.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Sometimes, in one of his more exuberant or desperate moods, Pa would go out in the veld and sprinkle brandy on the daisies to make them drunk so that they wouldn’t feel the pain of shriveling up and dying. (André Brink)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because I am so easily fooled in this way, I like to write my reasoning down. Two pages of prose normally does it, building an argument as to why this is the right solution to the problem. Almost every time, this exposes flaws in my reasoning, opportunities to find more data, or other solutions to explore. Thinking in my head doesn’t have this effect for me, but writing does. Or, rather, the exercise of writing and reading does.&lt;/p&gt;

&lt;p&gt;The first step is to write a succinct description of the problem, and what it means for the problem to be solved. Sometimes those are quantitative goals. Speeds and feeds. Sometimes, they are concrete goals. A product launch, or a document. Sometimes, it’s something more qualitative and harder to explain. Thinking about the problem bears a great deal of fruit.&lt;/p&gt;

&lt;p&gt;Then, the solution. The usual questions apply here, including cost, viability, scope and complexity. Most important is engaging with the problem statement. It’s easy to make the exercise useless if you disconnect the problem statement from the solution.&lt;/p&gt;

&lt;p&gt;It is important you feel comfortable with the outcome of this exercise, because losing faith in your own work is a sure way to have it fail. Confidence is one valuable outcome. Another one is a simpler solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it the right problem?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An elegant solution to the wrong problem is worse than no solution at all, at least in that it might fool people into thinking that the true problem has been solved, and to stop trying. You need to deeply understand the problem you are solving. Rarely, this will be an exercise only in technology or engineering. More commonly, large problems will span business, finance, engineering, management and more. You probably don’t understand all of these things. Be sure to seek the help of people who do.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I think I understand multiple perspectives on a problem, I like to write them down and run them by the people who explained the problem to me. They’ll be able to point out where you’re still wrong. Perhaps you’re confusing your net and operational margins, or your hiring targets make no sense, or your customers see a different problem from you. This requires that the people you consult trust you, and you trust them. Fortunately, non-engineers in engineering organizations are always looking out for allies and friends. Most are, like engineers, only too excited to explain their work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engage with the doubters, but don’t let them get you down&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Be prepared! And be careful not to do
Your good deeds when there’s no one watching you (Tom Lehrer)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You will never convince everybody of your point of view. By now, you have two powerful tools to help convince people: A clear statement of the problem that considers multiple points-of-view, and a clear statement of the solution. Some people will read those and be convinced. Others would never be convinced, because their objections lie beyond the scope of your thinking. A third group will have real, honest, feedback about the problem and proposed solution. That feedback is gold that should be mined. Unfortunately, separating the golden nuggets of feedback from the pyrite nuggets of doubt isn’t easy.&lt;/p&gt;

&lt;p&gt;The doubters will get you down. Perhaps they think the problem doesn’t exist, or that the solution is impractical. Perhaps they think you aren’t the person to do it. Perhaps they think the same resources should be spent on a different problem, or a different solution. You’ll repeat, repeat, and repeat. Get used to it. I’m still not used to it, but you should be.&lt;/p&gt;

&lt;p&gt;Again, writing is a tool I reach for. “Today, I’m doubting my solution because…” Sometimes that doubt will be something more about you than the project. That’s OK. Sometimes it’ll be about the project, and will identify a real problem. Often, it’ll just point to one of those unknown unknowns that all projects have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meet the stakeholders where they are&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most likely, you’re going to need to convince somebody to let you do the work. That’s good, because doing big things requires time, people and money. You don’t want to be working somewhere that’s willing to waste time, people or money. If they’re willing to waste time on your ill-conceived schemes, they’ll be willing to waste your time on somebody else’s ill-conceived schemes.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;May your wisdom grace us until the stars rain down from the heavens. (C.S. Lewis)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The best advice I’ve received about convincing stakeholders is to write for them, not you. Try to predict what questions they are going to ask, what concerns they will have, and what objections they will bring up and have answers for those in the text. That doesn’t mean you should be defensive. Don’t aim only to flatter. Instead, tailor your approach. It can help to have the advice of people who’ve been through this journey before.&lt;/p&gt;

&lt;p&gt;The previous paragraph may seem to you like &lt;em&gt;politics&lt;/em&gt;, you may have a distaste for politics, or believe you can escape it by moving to a different business. It is. You may. You can’t.&lt;/p&gt;

&lt;p&gt;Leadership willing to engage with your ideas and challenge you on them is a blessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build a team&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to build two teams. Your local team is the team of engineers who are going to help you write, review, test, deploy, operate, and so on. These people are critical to your success, not because they are the fingers to your brain, but because the details are at least as important as the big picture. Get into some details yourself. Don’t get into every detail. You can’t.&lt;/p&gt;

&lt;p&gt;Your extended team is a group of experts, managers, customer-facing folks, product managers, lawyers, designers and so on. Some of these people won’t be engaged day-to-day. You need to find them, get them involved, and draw on them when you need help. You’re not an expert in everything, but expertise in everything will be needed. Getting these people excited about your project, and bought into its success, is important.&lt;/p&gt;

&lt;p&gt;Finally, find yourself a small group of people you trust, and ask them to keep you honest. Check in with them, to make sure your ideas still make sense. Share the documents you wrote with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be willing to adapt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You will learn, at some point into doing your big project, that your solution is bullshit. You completely misunderstood the problem. You may feel like this leaves you back where you started. It doesn’t. Instead, you’ve stepped up your level of expertise. Most likely, you can adapt that carefully-considered solution to the new problem, but you might need to throw it out entirely. Again, write it down. Be specific. What have you learned, and what did it teach you? Look for things you can recover, and don’t throw things out prematurely.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Consensus is Harder Than It Looks</title>
      <link>http://brooker.co.za/blog/2020/10/05/consensus.html</link>
      <pubDate>Mon, 05 Oct 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/10/05/consensus</guid>
      <description>&lt;h1 id=&quot;consensus-is-harder-than-it-looks&quot;&gt;Consensus is Harder Than It Looks&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;And it looks pretty hard.&lt;/p&gt;

&lt;p&gt;In his classic paper &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.8330&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;How to Build a Highly Available System Using Consensus&lt;/a&gt; Butler Lampson laid out a pattern that’s become very popular in the design of large-scale highly-available systems. Consensus is used to deal with unusual situations like host failures (Lampson says &lt;em&gt;reserved for emergencies&lt;/em&gt;), and leases (time-limited locks) provide efficient normal operation. The paper lays out a roadmap for implementing systems of this kind, leaving just the implementation details to the reader.&lt;/p&gt;

&lt;p&gt;The core algorithm behind this paper, Paxos, is famous for its complexity and subtlety. Lampson, like many who came after him&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, tries to build a framework of specific implementation details around it to make it more approachable. It’s effective, but incomplete. The challenge is that Paxos’s subtlety is only one of the hard parts of building a consensus system. There are three categories of challenges that I see people completely overlook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Determinism&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“How can we arrange for each replica to do the same thing? Adopting a scheme first proposed by Lamport, we build each replica as a deterministic state machine; this means that the transition relation is a function from (state, input) to (new state, output). It is customary to call one of these replicas a ‘process’. Several processes that start in the same state and see the same sequence of inputs will do the same thing, that is, end up in the same state and produce the same outputs” - Butler Lampson (from &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.8330&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;How to Build a Highly Available System Using Consensus&lt;/a&gt;).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Conceptually, that’s really easy. We start with a couple of replicas with &lt;em&gt;state&lt;/em&gt;, feed them &lt;em&gt;input&lt;/em&gt;, and they all end up with &lt;em&gt;new state&lt;/em&gt;. Same inputs in, same state out. Realistically, it’s hard. Here are just some of the challenges:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Concurrency&lt;/em&gt;. Typical runtimes and operating systems use more than just your program’s state to schedule threads, which means that code that uses multiple threads, multiple processes, remote calls, or even just IO, can end up with non-deterministic results. The simple fix is to be resolutely single-threaded, but that has severe performance implications&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Floating Point&lt;/em&gt;. Trivial floating-point calculations are deterministic. Complex floating-point calculations, especially where different replicas run on different CPUs or have code built with different compilers, may not be&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. In &lt;a href=&quot;https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf&quot;&gt;Physalia&lt;/a&gt; we didn’t support floating point, because this was too hard to think about.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Bug fixes&lt;/em&gt;. Say the code that turns &lt;em&gt;state&lt;/em&gt; and &lt;em&gt;input&lt;/em&gt; into &lt;em&gt;new state&lt;/em&gt; has a bug. How do you fix it? You can’t just change it and then roll it out incrementally to different replicas. You don’t want to deploy all your replicas at once (we’re trying to build an HA system, remember?). So you need to come up with a migration strategy. Maybe a flag that activates at an agreed sequence number. Or complex migration code that changes &lt;em&gt;buggy new state&lt;/em&gt; into &lt;em&gt;good new state&lt;/em&gt;. Possible, but hard.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Code updates&lt;/em&gt;. Are you sure that version &lt;em&gt;N+1&lt;/em&gt; produces exactly the same output as version &lt;em&gt;N&lt;/em&gt; for all inputs? You shouldn’t be, because even in the well-specified world of cryptography &lt;a href=&quot;https://hdevalence.ca/blog/2020-10-04-its-25519am&quot;&gt;that’s not always true&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Corruption&lt;/em&gt;. In reality, &lt;em&gt;input&lt;/em&gt; isn’t just &lt;em&gt;input&lt;/em&gt;, it’s also a constant stream of failing components, thermal noise, cosmic rays, and other similar assaults on the castle of determinism. Can you survive them all?&lt;/li&gt;
&lt;/ul&gt;
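
&lt;p&gt;To make the determinism requirement concrete, here’s a minimal sketch (my own illustration, not from Lampson’s paper) of a replica as a pure state machine. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Replica&lt;/code&gt; class and its operations are hypothetical; the point is that the transition is a function of (state, input) only, so two replicas fed the same operations end in identical states:&lt;/p&gt;

```python
import hashlib
import json

class Replica:
    """A toy deterministic state machine: state is a dict of counters."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        # An op is a (key, delta) pair. No clocks, randomness, threads,
        # or floats: the new state depends only on (state, input).
        key, delta = op
        self.state[key] = self.state.get(key, 0) + delta
        return self.state[key]

    def fingerprint(self):
        # Canonical serialization, so identical states hash identically
        blob = json.dumps(self.state, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

ops = [("a", 1), ("b", 2), ("a", 3)]
r1, r2 = Replica(), Replica()
for op in ops:
    r1.apply(op)
    r2.apply(op)
assert r1.fingerprint() == r2.fingerprint()
```

&lt;p&gt;Every item in the list above is a way for real systems to break this property: add a thread pool, a float, or a bug fix to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apply&lt;/code&gt;, and the fingerprints can diverge.&lt;/p&gt;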

&lt;p&gt;And more. There’s always more.&lt;/p&gt;

&lt;p&gt;Some people will tell you that you can solve these problems by using &lt;em&gt;byzantine&lt;/em&gt; consensus protocols. Those people are right, of course. They’re also the kind of people who solved their rodent problem by keeping a leopard in their house. Other people will tell you that you can solve these problems with blockchain. Those people are best ignored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring and Control&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Although using a single, centralized, server is the simplest way to implement a service, the resulting service can only be as fault tolerant as the processor executing that server. If this level of fault tolerance is unacceptable, then multiple servers that fail independently must be used. - Fred Schneider (from &lt;a href=&quot;https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf&quot;&gt;Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The whole point of building a highly-available distributed system is to exceed the availability of a single system. If you can’t do that, you’ve added a bunch of complexity for nothing.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Complex systems run in degraded mode. - Richard Cook (from &lt;a href=&quot;https://how.complexsystems.fail/&quot;&gt;How Complex Systems Fail&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Depending on what you mean by &lt;em&gt;failed&lt;/em&gt;, distributed systems of &lt;em&gt;f+1&lt;/em&gt;, &lt;em&gt;2f+1&lt;/em&gt; or &lt;em&gt;3f+1&lt;/em&gt; nodes can entirely hide the failure of &lt;em&gt;f&lt;/em&gt; nodes from their clients. This, combined with a process of repairing failed nodes, allows us to build highly-available systems even in the face of significant failure rates. It also leads directly to one of the traps of building a distributed system: clients can’t tell the difference between the case where an outage is &lt;em&gt;f&lt;/em&gt; failures away, and where it’s just one failure away. If a system can tolerate &lt;em&gt;f&lt;/em&gt; failures, then &lt;em&gt;f-1&lt;/em&gt; failures may look completely healthy.&lt;/p&gt;
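
&lt;p&gt;As a hedged back-of-envelope illustration (my own arithmetic, not from the post): under the optimistic assumption that nodes fail independently, the availability of a majority-quorum system can be computed directly.&lt;/p&gt;

```python
from math import comb

def majority_available(n, p_up):
    """Probability that a majority of n nodes is up, assuming
    independent node failures (a big assumption in practice)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p_up ** k * (1 - p_up) ** (n - k)
               for k in range(need, n + 1))

# One node vs. five nodes (f = 2), each with 99% availability
print(round(majority_available(1, 0.99), 7))  # 0.99
print(round(majority_available(5, 0.99), 7))  # roughly 0.99999
```

&lt;p&gt;The five-node system masks two failures completely, which is exactly the trap: after two of the five nodes have failed, clients still see a healthy system that is now one failure away from an outage.&lt;/p&gt;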

&lt;p&gt;Consensus systems cannot be monitored entirely &lt;em&gt;from the outside&lt;/em&gt; (see &lt;a href=&quot;//brooker.co.za/blog/2016/01/03/correlation.html&quot;&gt;why must systems be operated?&lt;/a&gt;). Instead, monitoring needs to be deeply aware of the implementation details of the system, so it can tell which nodes are healthy and which need to be replaced. If it chooses the wrong nodes to replace, disaster will strike.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Control planes provide much of the power of the cloud, but their privileged position also means that they have to act safely, responsibly, and carefully to avoid introducing additional failures. - Brooker, Chen, and Ping (from &lt;a href=&quot;https://www.usenix.org/system/files/nsdi20-paper-brooker.pdf&quot;&gt;Millions of Tiny Databases&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Do You Really Need Strong Consistency?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It is possible to provide high availability and partition tolerance, if atomic consistency is not required. - Gilbert and Lynch&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The typical state-machine implementation of consensus provides a strong consistency property called &lt;em&gt;linearizability&lt;/em&gt;. In exchange, it can’t be available for all clients during a network partition. That’s probably why you chose it.&lt;/p&gt;

&lt;p&gt;Is that why you chose it? Do you need &lt;em&gt;linearizability&lt;/em&gt;? Or would something else, like &lt;em&gt;causality&lt;/em&gt;, be enough?&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; Using consensus when its properties aren’t really needed is a mistake a lot of folks seem to make. Service discovery, configuration distribution, and similar problems can all be handled adequately without strong consistency, and using strongly consistent tools to solve them makes systems less reliable rather than more. Strong consistency is not better consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Despite these challenges, consensus is an important building block in building highly-available systems. Distribution &lt;a href=&quot;http://brooker.co.za/blog/2020/01/02/why-distributed.html&quot;&gt;makes building HA systems easier&lt;/a&gt;. It’s a tool, not a solution.&lt;/p&gt;

&lt;p&gt;Think of using consensus in your system like getting a puppy: it may bring you a lot of joy, but with that joy comes challenges, and ongoing responsibilities. There’s a lot more to dog ownership than just getting a dog. There’s a lot more to high availability than picking up a Raft library off github.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Including &lt;a href=&quot;https://raft.github.io/raft.pdf&quot;&gt;Raft&lt;/a&gt;, which has become famous for being a more understandable consensus algorithm. &lt;a href=&quot;https://www.cs.rutgers.edu/~pxk/417/notes/virtual_synchrony.html&quot;&gt;Virtual Synchrony&lt;/a&gt; is less famous, but no less a contribution.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; There are some nice patterns for building deterministic high-performance systems, but the general problem is still an open area of research. For a good primer on determinism and non-determinism in database systems, check out &lt;a href=&quot;http://paperhub.s3.amazonaws.com/878608b83ccf413ea73acfd6b78860a1.pdf&quot;&gt;The Case for Determinism in Database Systems&lt;/a&gt; by Thomson and Abadi.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Bruce Dawson has an &lt;a href=&quot;https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/&quot;&gt;excellent blog post&lt;/a&gt; on the various issues and challenges.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Bailis et al’s &lt;a href=&quot;https://dsf.berkeley.edu/papers/vldb14-hats.pdf&quot;&gt;Highly Available Transactions: Virtues and Limitations&lt;/a&gt; paper contains a nice breakdown of the options here, and Aphyr’s post on &lt;a href=&quot;https://aphyr.com/posts/313-strong-consistency-models&quot;&gt;Strong Consistency Models&lt;/a&gt; is a very approachable breakdown of the topic. If you really want to go deep, check out Dziuma et al’s &lt;a href=&quot;https://projects.ics.forth.gr/tech-reports/2013/2013.TR439_Survey_on_Consistency_Conditions.pdf&quot;&gt;Survey on consistency conditions&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Focus on the Good Parts</title>
      <link>http://brooker.co.za/blog/2020/09/02/learning.html</link>
      <pubDate>Wed, 02 Sep 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/09/02/learning</guid>
      <description>&lt;h1 id=&quot;focus-on-the-good-parts&quot;&gt;Focus on the Good Parts&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Skepticism and cynicism can get in your way.&lt;/p&gt;

&lt;p&gt;Back in May, I wrote &lt;a href=&quot;//brooker.co.za/blog/2020/05/25/reading.html&quot;&gt;Reading Research: A Guide for Software Engineers&lt;/a&gt;, answering common questions I get about why and how to read research papers. In that post, I wrote about three modes of reading: &lt;em&gt;solution finding&lt;/em&gt;, &lt;em&gt;discovery&lt;/em&gt;, and &lt;em&gt;curiosity&lt;/em&gt;. In subsequent conversations, I’ve realized there’s another common issue that gets in engineers’ way when they read research, especially in the &lt;em&gt;discovery&lt;/em&gt; and &lt;em&gt;curiosity&lt;/em&gt; modes: too much skepticism.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;The chief deficiency I see in the skeptical movement is its polarization: Us vs. Them — the sense that we have a monopoly on the truth; that those other people who believe in all these stupid doctrines are morons; that if you’re sensible, you’ll listen to us; and if not, to hell with you.&lt;/em&gt; (from Carl Sagan’s &lt;em&gt;The Demon Haunted World&lt;/em&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I could blame it on comment thread culture, racing to make that top comment pointing out errors in the paper. I could blame it on the low signal-to-noise ratio of content in general. I could blame it on poor research, poor writing, or incorrect data. But whatever is to blame, many readers approach technical content with their first goal being to find errors and mistakes, gaps in logic, or incomplete justifications of statements. When a mistake is found, the reader is justified in throwing out the whole piece of writing (&lt;em&gt;unreliable!&lt;/em&gt;), the authors (&lt;em&gt;sloppy!&lt;/em&gt;), their institutions (&lt;em&gt;clueless!&lt;/em&gt;), or even the whole field (&lt;em&gt;substandard!&lt;/em&gt;). It’s also a perfect opportunity to write that comment or tweet pointing out the problems. After all, if you found the author’s mistake, doesn’t that make you smarter and better than the author?&lt;/p&gt;

&lt;p&gt;This approach gets in the way of your ability to learn from reading. I’d encourage you to take a different one: read with the goal of finding the good stuff. Dig for the ideas, the insights, the analyses and the data points that provide value. Look for what you can learn.&lt;/p&gt;

&lt;p&gt;I’m not suggesting that you don’t carefully approach what you read. You absolutely should make sure what you believe is well-supported. Don’t waste your life reading crap. Your time is too valuable for that.&lt;/p&gt;

&lt;p&gt;The flip side of this is relying too much on social proof. If you open the comment thread first, you’ll find that the piece you’re about to read is &lt;em&gt;great&lt;/em&gt; or it’s &lt;em&gt;crap&lt;/em&gt; or it’s &lt;em&gt;another piece of junk published by &lt;strong&gt;those people&lt;/strong&gt; (you know, them, the incompetent ones)&lt;/em&gt;. Then, when you finally read the paper, you’ll be less smart. You’ll be biased towards confirming the opinions of others, rather than reading and understanding the material. I’m not against comment threads, but I never read them first.&lt;/p&gt;

&lt;p&gt;Again, you can go too far in this direction. A lot of academic publishing is an exercise in social proof. Almost all the filtering we use to reduce the firehose of content down to a manageable stream depends on social proof. We use these tools because they’re powerful, and scalable. But remember that popularity with Hacker News commenters, and even publication in a prestigious conference or journal, is only weak evidence of quality. Unpopularity, and rejection, are weak evidence of a lack of quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fox and Brewer’s classic paper &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.411&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Harvest, Yield, and Scalable Tolerant Systems&lt;/a&gt; contains many great ideas. The framing of Harvest and Yield is very useful, and I’ve found it’s had a big influence on the way that I have approached system design over the years. The first time I read it, though, I put it down. The parts describing CAP (Sections 2 and 3) are confusing at best and wrong at worst (as I’ve &lt;a href=&quot;http://brooker.co.za/blog/2014/10/12/harvest-yield.html&quot;&gt;blogged about before&lt;/a&gt;). I couldn’t get past them.&lt;/p&gt;

&lt;p&gt;It was only after being encouraged by a colleague that I read the whole thing. Taken as a whole, it’s full of great ideas. If I had kept tripping over my skepticism, and getting stuck on the bad parts, I never would have been able to learn from it.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Surprising Economics of Load-Balanced Systems</title>
      <link>http://brooker.co.za/blog/2020/08/06/erlang.html</link>
      <pubDate>Thu, 06 Aug 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/08/06/erlang</guid>
      <description>&lt;h1 id=&quot;surprising-economics-of-load-balanced-systems&quot;&gt;Surprising Economics of Load-Balanced Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;The M/M/c model may not behave like you expect.&lt;/p&gt;

&lt;p&gt;I have a system with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; servers, each of which can only handle a single concurrent request, and has no internal queuing. The servers sit behind a load balancer, which contains an infinite queue. An unlimited number of clients offer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c * 0.8&lt;/code&gt; requests per second to the load balancer on average. In other words, we increase the offered load linearly with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; to keep the per-server load constant. Once a request arrives at a server, it takes one second to process, on average. How does the client-observed mean request time vary with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/erlang_c_plot.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Option A is that the mean latency decreases quickly, asymptotically approaching one second as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; increases (in other words, the time spent in queue approaches zero). Option B is constant. Option C is a linear improvement, and D is a linear degradation in latency. Which curve do you, intuitively, think that the latency will follow?&lt;/p&gt;

&lt;p&gt;I asked my Twitter followers the same question, and got an interestingly mixed result:
&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/erlang_twitter_poll.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Breaking down the problem a bit will help figure out which is the right answer. First, names. In the terminology of queuing theory, this is an &lt;a href=&quot;https://en.wikipedia.org/wiki/M/M/c_queue&quot;&gt;M/M/c&lt;/a&gt; queuing system: Poisson arrival process, exponentially distributed client service time, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; backend servers. In teletraffic engineering, it’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Agner_Krarup_Erlang&quot;&gt;Erlang’s&lt;/a&gt; delay system (or, because terminology is fun, M/M/n). We can use a classic result of queuing theory to analyze this system: Erlang’s C formula &lt;em&gt;E&lt;sub&gt;2,n&lt;/sub&gt;(A)&lt;/em&gt;, which calculates the probability that an incoming customer request is enqueued (rather than handled immediately), based on the number of servers (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;n&lt;/code&gt; aka &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt;), and the offered traffic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;. For the details, see page 194 of the &lt;a href=&quot;https://www.itu.int/dms_pub/itu-d/opb/stg/D-STG-SG02.16.1-2001-PDF-E.pdf&quot;&gt;Teletraffic Engineering Handbook&lt;/a&gt;. Here’s the basic shape of the curve (using our same parameters):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/erlang_c_result.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Follow the blue line up to half the saturation point, at 2.5 rps offered load, and see how the probability is around 13%. Now look at the purple line at half its saturation point, at 5 rps. Just 3.6%. So at half load the 5-server system handles 87% of traffic without queuing; with double the load and double the servers, the 10-server system handles 96.4% without queuing, which means only 3.6% of requests see any additional latency.&lt;/p&gt;

&lt;p&gt;It turns out the mean latency does, indeed, asymptotically approach one second. The right answer to the Twitter poll is A.&lt;/p&gt;
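
&lt;p&gt;Erlang’s C formula is short enough to compute directly. Here’s a sketch in Python (my own code; the function and parameter names are mine) that reproduces the 13% and 3.6% figures above:&lt;/p&gt;

```python
from math import factorial

def erlang_c(n, a):
    """Probability that an arriving request has to queue in an
    M/M/n system with n servers and offered traffic a (in Erlangs).
    Only meaningful when a is less than n (a stable system)."""
    top = (a ** n / factorial(n)) * (n / (n - a))
    bottom = sum(a ** k / factorial(k) for k in range(n)) + top
    return top / bottom

# Half load: 5 servers at 2.5 rps vs. 10 servers at 5 rps
print(round(erlang_c(5, 2.5), 3))   # 0.13: about 13% of requests queue
print(round(erlang_c(10, 5.0), 3))  # 0.036: about 3.6% queue
```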

&lt;p&gt;Using the mean to measure latency is controversial (although &lt;a href=&quot;http://brooker.co.za/blog/2017/12/28/mean.html&quot;&gt;perhaps it shouldn’t be&lt;/a&gt;). To avoid that controversy, we need to know whether the percentiles get better at the same rate. Doing that in closed form is somewhat complicated, but this system is super simple, so we can plot them out using a Monte-Carlo simulation. The results look like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/sim_result.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That’s entirely good news. The median (p50) follows the mean line nicely, and the high percentiles (99&lt;sup&gt;th&lt;/sup&gt; and 99.9&lt;sup&gt;th&lt;/sup&gt;) have a similar shape. No hidden problems.&lt;/p&gt;
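
&lt;p&gt;For the curious, the simulation behind plots like these fits in a few lines. Here’s a sketch (my own code, not necessarily what produced the plot) that exploits the fact that, with FIFO service and an infinite queue, each arrival simply starts on whichever server frees up earliest:&lt;/p&gt;

```python
import heapq
import random
import statistics

def simulate_mmc(c, rho=0.8, n_requests=200_000, seed=42):
    """Return per-request sojourn (queue plus service) times for an
    M/M/c system: Poisson arrivals at rho*c per second, exponential
    service with mean 1 second."""
    rng = random.Random(seed)
    t = 0.0
    free_at = [0.0] * c  # when each server next becomes idle
    heapq.heapify(free_at)
    sojourn = []
    for _ in range(n_requests):
        t += rng.expovariate(rho * c)            # next Poisson arrival
        start = max(t, heapq.heappop(free_at))   # wait for a free server
        service = rng.expovariate(1.0)
        heapq.heappush(free_at, start + service)
        sojourn.append(start + service - t)
    return sojourn

for c in (1, 2, 10):
    times = simulate_mmc(c)
    p99 = statistics.quantiles(times, n=100)[98]
    # M/M/1 theory predicts a mean of 5 seconds at rho = 0.8;
    # the mean drops quickly toward 1 second as c grows
    print(c, round(statistics.mean(times), 2), round(p99, 2))
```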

&lt;p&gt;It’s also good news for cloud and service economics. With larger &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; we get better latency at the same utilization, or better utilization for the same latency, all at the same per-server throughput. That’s not good news only for giant services, because most of this goodness happens at relatively modest &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt;. There are few problems related to scale and distributed systems that get easier as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt; increases. This is one of them.&lt;/p&gt;

&lt;p&gt;There are some reasonable follow-up questions. Are the results robust to our arbitrary choice of 0.8? Yes, they are&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Are the M/M/c assumptions of Poisson arrivals and exponential service time reasonable for typical services? I’d say they are reasonable, albeit wrong. Exponential service time is especially wrong: realistic services tend to be something more like log-normal. It may not matter. More on that another time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt; Dan Ports responded to my thread with a fascinating &lt;a href=&quot;https://twitter.com/danrkports/status/1291517540280070144&quot;&gt;Twitter thread&lt;/a&gt; pointing to &lt;a href=&quot;https://drkp.net/papers/latency-socc14.pdf&quot;&gt;Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency&lt;/a&gt; from SoCC’14 which looks at this effect in the wild.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Up to a point. As soon as the mean arrival rate exceeds the system’s ability to complete requests, the queue grows without bound and latency goes to infinity. In our case, that happens when the request load exceeds &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;c&lt;/code&gt;. More generally, for this system to be stable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;λ/cμ&lt;/code&gt; must be less than 1, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;λ&lt;/code&gt; is the mean arrival rate, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;μ&lt;/code&gt; is the mean rate at which a single server completes requests (the reciprocal of the mean service time).&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>A Story About a Fish</title>
      <link>http://brooker.co.za/blog/2020/07/28/fish.html</link>
      <pubDate>Tue, 28 Jul 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/07/28/fish</guid>
      <description>&lt;h1 id=&quot;a-story-about-a-fish&quot;&gt;A Story About a Fish&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Nothing&apos;s more boring than a fishing story.&lt;/p&gt;

&lt;p&gt;In the 1930s, Marjorie Latimer was working as a museum curator in East London. Not the eastern part of London as one may expect. This East London is a small city on South Africa’s south coast, named so thanks to colonialism’s great tradition of creative and culturally relevant place names. Latimer was a keen and knowledgeable naturalist, and had a deal with local fishermen that they would let her know if they found anything unusual in their nets. One morning in 1938, she got a call from a fishing boat captain named Hendrik Goosen. He’d found something very unusual indeed, and wanted Marjorie to look at it. The fish which Hendrik Goosen showed Marjorie Latimer was truly unusual. Unlike anything she had seen before.&lt;/p&gt;

&lt;p&gt;Latimer knew just the person to identify it: professor JLB Smith at Rhodes University in nearby Grahamstown (now Makhanda). He was away, so she had the unusual fish gutted and taxidermied, and sent sketches to the professor. He replied (in all-caps, following the fashion at the time):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;MOST IMPORTANT PRESERVE SKELETON AND GILLS&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Smith had immediately identified the fish as something well known to science. Many like it had been seen before. This one, however, was particularly surprising. It was alive, nearly 66 million years after the last of its kin had been thought dead. Latimer had found a &lt;a href=&quot;https://en.wikipedia.org/wiki/Coelacanth&quot;&gt;Coelacanth&lt;/a&gt;, a species of fish that had hardly evolved in the last 400 million years, and was believed to exist only in the fossil record.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/Marjorie_Courtenay-Latimer_and_Coelacanth.jpg&quot; alt=&quot;Marjorie Latimer and the Coelacanth&quot; /&gt;&lt;/p&gt;

&lt;p&gt;At the time, the Coelacanths were thought to be closely related to the Rhipidistia, which were thought to be an ancestor of all modern land-based vertebrates. The science on that topic has moved on, but Goosen’s chance find, combined with Latimer’s hard work in having it identified, created a special moment in the history of biology.&lt;/p&gt;

&lt;p&gt;I was thinking about this story last night, because my daughter has been learning about Coelacanths at school. In the 1940s, JLB Smith and his wife Margaret wrote and illustrated a beautiful book called &lt;a href=&quot;https://www.biodiversitylibrary.org/item/265240#page/9/mode/1up&quot;&gt;The Sea Fishes of Southern Africa&lt;/a&gt;. My grandmother studied biology at Rhodes during the time they were writing the book, and knew the Smiths and Marjorie Latimer. Margaret Smith gave her a signed copy of their book, sometime around 1950. I was fortunate to inherit the book, and to share the Smiths’ description and drawings of the Coelacanths with my daughter.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/smith_page_one.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://mbrooker-blog-images.s3.amazonaws.com/smith_page_two.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I hadn’t opened The Sea Fishes of Southern Africa in ten years, but re-reading Smith’s description of it was like a visit with my late grandmother. She never failed to share her excitement about, and appreciation for, all living things. I vividly remember her telling the Coelacanth story, and her small part in it, sharing the wonder of discovery and the importance of paying attention to the things around us. You never know when you’ll learn something new. Perhaps, as the Smiths write, &lt;em&gt;it is unwise to be too dogmatic&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;
Ross Goosen, grandson of Hendrik P Goosen, the captain of the I &amp;amp; J trawler Nerine who in 1938 caught the original Coelacanth off East London, reached out to say:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A lot is always made about JLB Smith’s contribution to the ‘Old four legs’ story, but if my grandfather had not used his experience of all those years at sea and had not contacted Marjorie Latimer about the strange fish that he had just caught, then the world would still be in the dark about the existence of this prehistoric fish.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    
    <item>
      <title>Code Only Says What it Does</title>
      <link>http://brooker.co.za/blog/2020/06/23/code.html</link>
      <pubDate>Tue, 23 Jun 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/06/23/code</guid>
      <description>&lt;h1 id=&quot;code-only-says-what-it-does&quot;&gt;Code Only Says What it Does&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Only loosely related to what it should do.&lt;/p&gt;

&lt;p&gt;Code says what it does. That’s important for the computer, because code is the way that we ask the computer to do something. It’s OK for humans, as long as we never have to modify or debug the code. As soon as we do, we have a problem. Fundamentally, debugging is an exercise in changing what a program does to match what it should do. It requires us to know what a program should do, which isn’t captured in the code. Sometimes that’s easy: What it does is crash, what it should do is &lt;em&gt;not crash&lt;/em&gt;. Outside those trivial cases, discovering intent is harder.&lt;/p&gt;

&lt;p&gt;Debugging when &lt;em&gt;should do&lt;/em&gt; is subtle, such as when building distributed systems protocols, is especially difficult. In our &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Millions of Tiny Databases&lt;/a&gt; paper, we say:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our code reviews, simworld tests, and design meetings frequently referred back to the TLA+ models of our protocols to resolve ambiguities in Java code or written communication.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem is that the implementation (in Physalia’s case the Java code) is both an imperfect implementation of the protocol, and an overly-specific implementation of the protocol. It’s overly-specific because it needs to be fully specified. Computers demand that, and no less, while the protocol itself has some leeway and wiggle room. It’s also overly-specific because it has to address things like low-level performance concerns that the specification can’t be bothered with.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Are those values in an ArrayList because order is actually important, or because O(1) random seeks are important, or some other reason? Was it just the easiest thing to write? What happens when I change it?&lt;/em&gt;&lt;/p&gt;
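&lt;p&gt;A comment can answer exactly these questions by recording which choices are load-bearing and which are accidental. A hypothetical sketch (in Python rather than Java):&lt;/p&gt;

```python
# Order is load-bearing: replay() must apply entries in commit order, so
# this is a list, not a set. The O(1) append is incidental, not the reason.
committed_entries = []

def commit(entry):
    committed_entries.append(entry)

def replay():
    # Rebuild state by applying entries in the order they were committed;
    # a later write to the same key wins only because order is preserved.
    state = {}
    for key, value in committed_entries:
        state[key] = value
    return state

commit(("x", 1))
commit(("x", 2))
print(replay())  # {'x': 2}
```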

&lt;p&gt;Business logic code, while lacking the cachet of distributed protocols, has even more of these kinds of problems. Code both over-specifies the business logic, and specifies it inaccurately. I was prompted to write this by a tweet from @mcclure111 where she hits the nail on the head:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;Since most software doesn&amp;#39;t have a formal spec, most software &amp;quot;is what it does&amp;quot;, there&amp;#39;s an incredible pressure to respect authorial intent when editing someone else&amp;#39;s code. You don&amp;#39;t know which quirks are load-bearing.&lt;/p&gt;&amp;mdash; mcc 🏳️‍⚧️🏳️‍🌈 (@mcclure111) &lt;a href=&quot;https://twitter.com/mcclure111/status/1274422600236765186?ref_src=twsrc%5Etfw&quot;&gt;June 20, 2020&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;This is a major problem with code: &lt;em&gt;You don’t know which quirks are load-bearing.&lt;/em&gt; You may remember, or be able to guess, or be able to puzzle it out from first principles, or not care, but all of those things are slow and error-prone. What can we do about it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design Documentation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Documentation is uncool. Most software engineers seem to come out of school thinking that documentation is below them (&lt;em&gt;tech writer work&lt;/em&gt;), or some weird thing their SE professor talked about that is as archaic as Fortran. Part of this is understandable. My own software engineering courses emphasized painstakingly documenting the implementation in UML. No other mention of documentation was made. Re-writing software in UML helps basically nobody. I finished my degree thinking that documentation was unnecessary busywork. Even the &lt;a href=&quot;https://agilemanifesto.org/&quot;&gt;Agile Manifesto&lt;/a&gt; agreed with me&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Working software over comprehensive documentation&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I discovered later was that design documentation, encoding the intent and decisions made during developing a system, helps teams be successful in the short term, and people be successful in the long term. Freed from fitting everything in my head, emboldened by the confidence that I could rediscover forgotten facts later, I could move faster. The same applies to teams.&lt;/p&gt;

&lt;p&gt;One thing I see successful teams doing is documenting not only the &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt; behind their designs, but the &lt;em&gt;how they decided&lt;/em&gt;. When it comes time to make changes to the system—either for debugging or in response to changing requirements—these documents are invaluable. It’s hard to decide whether it’s safe to change something when you don’t know why it’s like that in the first place. The record of how you decided is important because you are a flawed human, and understanding how you came to a decision is useful when that decision seems strange or surprising.&lt;/p&gt;

&lt;p&gt;This documentation process doesn’t have to be heavyweight. You don’t have to draw painstaking &lt;a href=&quot;https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model&quot;&gt;ER diagrams&lt;/a&gt; unless you think they are helpful. You should probably ignore UML entirely. Instead, describe the system in prose as clearly and succinctly as you can. One place to start is by building an RFC template for your team, potentially inspired by one that you find on the web. &lt;a href=&quot;https://static1.squarespace.com/static/56ab961ecbced617ccd2461e/t/5d792e5a4dac4074658ce64b/1568222810968/Squarespace+RFC+Template.pdf&quot;&gt;SquareSpace&lt;/a&gt;’s template seems reasonable. Some designs will fit well into that RFC format, others won’t. Prefer narrative writing where you can.&lt;/p&gt;

&lt;p&gt;Then, keep the documents. Store them somewhere safe. Soak them in vinegar &lt;a href=&quot;https://www.almanac.com/content/home-remedies-cough-relief&quot;&gt;and tie them around your chest&lt;/a&gt;. You’re going to want to make sure that the people who need to maintain the system can find them. As they are spelunking through history, help them feel more like a library visitor and less like Lara Croft.&lt;/p&gt;

&lt;p&gt;I’m not advocating for Big Design Up Front. Many of the most important things we learn about a project we learn during the implementation. Some of the most important things we learn years after the implementation is complete. Design documentation isn’t a static one-time ahead-of-time deliverable, but an ongoing process. Most importantly, design documentation is not a commitment to bad ideas. If it’s wrong, fix it and move forward. Documentation is not a deal with the devil.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Few topics invite a programmer flame war like comments. We’re told that comments are silly, or childish, or make it hard to show how manly you are in writing that convoluted mess of code. If it was hard to write, it should be hard to read. After all, you’re the James Joyce of code.&lt;/p&gt;

&lt;p&gt;That silliness aside, back to @mcclure111’s thread:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;This means comments that *reveal* authorial intent are valuable, and comments that reveal *there was no authorial intent* are even more valuable. Without those hints, you&amp;#39;re left editing superstitiously, preserving quirks even when you don&amp;#39;t know why. &lt;a href=&quot;https://t.co/YhvWnXjp9i&quot;&gt;https://t.co/YhvWnXjp9i&lt;/a&gt;&lt;/p&gt;&amp;mdash; mcc 🏳️‍⚧️🏳️‍🌈 (@mcclure111) &lt;a href=&quot;https://twitter.com/mcclure111/status/1274422825831596039?ref_src=twsrc%5Etfw&quot;&gt;June 20, 2020&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Comments allow us to encode &lt;em&gt;authorial intent&lt;/em&gt; into our code in a way that programming languages don’t always. Types, traits, interfaces, and variable names do put intent into code, but not completely (I see you, type system maximalists). These same things allow us to communicate a lack of intent—consider &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/RandomAccess.html&quot;&gt;RandomAccess&lt;/a&gt; vs &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html&quot;&gt;ArrayList&lt;/a&gt;—but are also incomplete. Well-commented code should make the intent of the author clear, especially in cases where that intent is either lost in the translation to code, or where implementation constraints hide the intent of the design. Code comments that link back to design documents are especially useful.&lt;/p&gt;

&lt;p&gt;Some languages need comments more than others. Some, like SQL, I find nearly always obscure the intent of the design behind implementation details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal Specification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://cacm.acm.org/magazines/2015/4/184705-who-builds-a-house-without-drawing-blueprints/fulltext&quot;&gt;Who Builds a House Without Drawing Blueprints?&lt;/a&gt; Leslie Lamport writes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The need for specifications follows from two observations. The first is that it is a good idea to think about what we are going to do before doing it, and as the cartoonist Guindon wrote: “Writing is nature’s way of letting you know how sloppy your thinking is.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;The second observation is that to write a good program, we need to think above the code level.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ve found that specification, from informal specification with narrative writing to formal specification with TLA+, makes writing programs faster and helps reduce mistakes. As much as I like that article, I think Lamport misses a key part of the value of formal specification: it’s a great communication tool. In developing some of the trickiest systems I’ve built, I’ve found that heavily-commented formal specifications are fantastically useful documentation. Specification languages are all about &lt;em&gt;intent&lt;/em&gt;, and some make it easy to clearly separate intent from implementation.&lt;/p&gt;

&lt;p&gt;Again, from our &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Millions of Tiny Databases&lt;/a&gt; paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We use TLA+ extensively at Amazon, and it proved exceptionally useful in the development of Physalia.  Our team used TLA+ in three ways: writing specifications of our protocols to check that we understand them deeply, model checking specifications against correctness and liveness properties using the TLC model checker, and writing extensively commented TLA+ code to serve as the documentation of our distributed protocols. While all three of these uses added value, TLA+’s role as a sort of automatically tested (via TLC),and extremely precise, format for protocol documentation was perhaps the most useful.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Formal specifications make excellent documentation. Like design docs, they aren’t immutable artifacts, but a reflection of what we have learned about the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building long-lasting, maintainable, systems requires not only communicating with computers, but also communicating in space with other people, and in time with our future selves. Communicating, recording, and indexing the intent behind our designs is an important part of that picture. Make time for it, or regret it later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; To be charitable to the Agile folks, &lt;em&gt;comprehensive&lt;/em&gt; does seem to be load-bearing.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Some Virtualization Papers Worth Reading</title>
      <link>http://brooker.co.za/blog/2020/06/08/virtualization.html</link>
      <pubDate>Mon, 08 Jun 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/06/08/virtualization</guid>
      <description>&lt;h1 id=&quot;some-virtualization-papers-worth-reading&quot;&gt;Some Virtualization Papers Worth Reading&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;A short, and incomplete, survey.&lt;/p&gt;

&lt;p&gt;A while back, Cindy Sridharan asked on Twitter for pointers to papers on the past, present and future of virtualization. I picked a few of my favorites, and given the popularity of that thread I decided to collect some of them here. This isn’t a literature survey by any means, just a collection of some papers I’ve found particularly interesting or useful. As usual, I’m biased towards papers I enjoyed reading, rather than those I had to slog through.&lt;/p&gt;

&lt;p&gt;Popek and Goldberg’s 1974 paper &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.141.4815&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Formal Requirements for Virtualizable Third Generation Architectures&lt;/a&gt; is rightfully a classic. They lay out a formal framework of conditions that a computer architecture must fulfill to support virtual machines. It’s 45 years old, so some of the information is dated, but the framework and core ideas have stood the test of time.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.cl.cam.ac.uk/research/srg/netos/papers/2003-xensosp.pdf&quot;&gt;Xen and the Art of Virtualization&lt;/a&gt;, from 2003, described the Xen hypervisor and a novel technique for running secure virtualization on commodity x86 machines. The exact techniques are less interesting than they were then, mostly because of hardware virtualization features on x86 like &lt;a href=&quot;https://en.wikipedia.org/wiki/X86_virtualization&quot;&gt;VT-x&lt;/a&gt;, but the discussion of the field and trade-offs is enlightening. Xen’s influence on the industry has been huge, especially because it was used as the foundation of Amazon EC2, which triggered the following decade’s explosion in cloud computing. &lt;a href=&quot;http://pages.cs.wisc.edu/~remzi/Classes/838/Spring2013/Papers/bugnion97disco.pdf&quot;&gt;Disco: Running Commodity Operating Systems on Scalable Multiprocessors&lt;/a&gt; from 1997 is very useful from a similar perspective (and thanks to Pekka Enberg for the tip on that one). Any paper that has &lt;em&gt;“our approach brings back an idea popular in the 1970s”&lt;/em&gt; in its abstract gets my attention immediately.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.vmware.com/pdf/asplos235_adams.pdf&quot;&gt;A Comparison of Software and Hardware Techniques for x86 Virtualization&lt;/a&gt;, from 2006, looks at some of the early versions of that x86 virtualization hardware and compares it to software virtualization techniques. As above, hardware has moved on since this was written, but the criticisms and comparisons are still useful to understand.&lt;/p&gt;

&lt;p&gt;The security, compatibility and performance trade-offs of different approaches to isolation are complex. On compatibility, &lt;a href=&quot;https://dl.acm.org/doi/10.1145/2901318.2901341&quot;&gt;A study of modern Linux API usage and compatibility: what to support when you’re supporting&lt;/a&gt; is a very nice study of how much of the Linux kernel surface area actually gets touched by applications, and what is needed to be truly compatible with Linux. Randal’s &lt;a href=&quot;https://arxiv.org/abs/1904.12226&quot;&gt;The Ideal Versus the Real: Revisiting the History of Virtual Machines and Containers&lt;/a&gt; surveys the history of isolation, and what that means in the modern world. Anjali’s &lt;a href=&quot;https://dl.acm.org/doi/pdf/10.1145/3381052.3381315&quot;&gt;Blending Containers and Virtual Machines: A Study of Firecracker and gVisor&lt;/a&gt; is another of a related genre, with some great data comparing three methods of isolation.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.acm.org/doi/10.1145/3132747.3132763&quot;&gt;My VM is Lighter (and Safer) than your Container&lt;/a&gt; from SOSP’17 has also been influential in changing the way a lot of people think about virtualization. A lot of people I talk to see virtualization as a heavy tool with multi-second boot times and very limited density, mostly because that’s the way it’s typically used in industry. Manco et al’s work wasn’t the first to burst that bubble, but they do it very effectively.&lt;/p&gt;

&lt;p&gt;Our own paper &lt;a href=&quot;https://www.amazon.science/publications/firecracker-lightweight-virtualization-for-serverless-applications&quot;&gt;Firecracker: Lightweight Virtualization for Serverless Applications&lt;/a&gt; describes Firecracker, a new open-source Virtual Machine Monitor (VMM) specialized for serverless workloads. The paper also covers how we use it in AWS Lambda, and some of what we see as the future challenges in this space. Obviously I’m biased here, being an author of that paper.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Reading Research: A Guide for Software Engineers</title>
      <link>http://brooker.co.za/blog/2020/05/25/reading.html</link>
      <pubDate>Mon, 25 May 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/05/25/reading</guid>
      <description>&lt;h1 id=&quot;reading-research-a-guide-for-software-engineers&quot;&gt;Reading Research: A Guide for Software Engineers&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Don&apos;t be afraid.&lt;/p&gt;

&lt;p&gt;One thing I’m known for at work is reading research papers, and referring to results in technical conversations. People ask me if, and how, they should read papers themselves. This post is a long-form answer to that question. The intended audience is working software engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why read research?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I read research in one of three mental modes.&lt;/p&gt;

&lt;p&gt;The first mode is &lt;em&gt;solution finding&lt;/em&gt;: I’m faced with a particular problem, and am looking for solutions. This isn’t too different from the way that you probably use Stack Overflow, but for more esoteric or systemic problems. Solution finding can work directly from papers, but I tend to find books more useful in this mode, unless I know an area well and am looking for something specific.&lt;/p&gt;

&lt;p&gt;A more productive mode is what I call &lt;em&gt;discovery&lt;/em&gt;. In this case, I’ve been working on a problem or in a space, and know something about it. In discovery mode, I want to explore around the space I know and see if there are better solutions. For example, when I was building a system using Paxos, I read a lot of literature about consensus protocols in general (including classics like &lt;a href=&quot;http://pmg.csail.mit.edu/papers/vr-revisited.pdf&quot;&gt;Viewstamped Replication&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, and newer papers like Raft). The goal in discovery mode is to find alternative solutions, opportunities for optimization, or new ways to think about a problem.&lt;/p&gt;

&lt;p&gt;The most intellectually gratifying mode for me is &lt;em&gt;curiosity&lt;/em&gt; mode. Here, I’ll read papers that just seem interesting to me, but aren’t related to anything I’m currently working on. I’m constantly surprised by how reading broadly has helped me solve problems, or just informed my approach. For example, reading about misuse-resistant cryptography primitives like &lt;a href=&quot;https://tools.ietf.org/html/rfc8452&quot;&gt;GCM-SIV&lt;/a&gt; has deeply informed my approach to API design. Similarly, reading about erasure codes around 2005 helped me solve an important problem for my team just this year.&lt;/p&gt;

&lt;p&gt;I’ve found reading for discovery and curiosity very helpful to my career. It has also given me tools that makes reading for solution finding more efficient. Sometimes, reading for curiosity leads to new paths. About five years ago I completely changed what I was working on after reading &lt;a href=&quot;https://dl.acm.org/doi/10.1145/1022594.1022596&quot;&gt;Latency lags bandwidth&lt;/a&gt;, which I believe is one of the most important trends in computing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need a degree to read research papers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. Don’t expect to be able to pick up every paper and understand it completely. You do need a certain amount of background knowledge, but no credentials. Try to avoid being discouraged when you don’t understand a paper, or sections of a paper. I’m often surprised when I revisit something after a couple years and find I now understand it.&lt;/p&gt;

&lt;p&gt;Learning a new field from primary research can be very difficult. When tackling a new area, books, blogs, talks, and courses are better options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I find papers worth reading?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That depends on the mode you’re in. In &lt;em&gt;solution finding&lt;/em&gt; and &lt;em&gt;discovery&lt;/em&gt; modes, search engines like Google Scholar are a great place to start. One challenge with searching is that you might not even know the right things to search for: it’s not unusual for researchers to use different terms from the ones you are used to. If you run into this problem, picking up a book on the topic can often help bridge the gap, and the references in books are a great way to discover papers.&lt;/p&gt;

&lt;p&gt;Following particular authors and researchers can be great for &lt;em&gt;discovery&lt;/em&gt; and &lt;em&gt;curiosity&lt;/em&gt; modes. If there’s a researcher who’s working in a space I’m interested in, I’ll follow them on Twitter or add search alerts to see when they’ve published something new.&lt;/p&gt;

&lt;p&gt;Conferences and journals are another great place to go. Most of the computer science research you’ll read is probably published at conferences. There are some exceptions. For example, I followed &lt;a href=&quot;https://dl.acm.org/journal/tos&quot;&gt;ACM Transactions on Storage&lt;/a&gt; when I was working in that area. Pick a couple of conferences in areas that you’re interested in, and read through their programs when they come out. In my area, &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/technical-sessions&quot;&gt;NSDI&lt;/a&gt; and &lt;a href=&quot;https://www.eurosys2020.org/program/&quot;&gt;Eurosys&lt;/a&gt; happened earlier this year, and OSDI is coming up. Jeff Huang has a &lt;a href=&quot;https://jeffhuang.com/best_paper_awards.html&quot;&gt;nice list of best paper winners&lt;/a&gt; at a wide range of CS conferences.&lt;/p&gt;

&lt;p&gt;A lot of research involves going through the graph of references. Most papers include a list of references, and as I read I note down which ones I’d like to follow up on and add them to my reading list. References form a directed (mostly) acyclic graph of research going into the past.&lt;/p&gt;

&lt;p&gt;Finally, some research bloggers are worth following. &lt;a href=&quot;https://blog.acolyer.org/&quot;&gt;Adrian Colyer’s blog&lt;/a&gt; is worth its weight in gold. I’ve written about research from &lt;a href=&quot;http://brooker.co.za/blog/2014/03/30/lamport-pub.html&quot;&gt;Leslie Lamport&lt;/a&gt;, &lt;a href=&quot;http://brooker.co.za/blog/2014/05/10/lynch-pub.html&quot;&gt;Nancy Lynch&lt;/a&gt;, and others, too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That’s quite a fire hose! How do I avoid drowning?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You don’t have to drink that whole fire hose. I know I can’t. Titles and abstracts can be a good way to filter out papers you want to read. Don’t be afraid to scan down a list of titles and pick out one or two papers to read.&lt;/p&gt;

&lt;p&gt;Another approach is to avoid reading new papers at all. Focus on the classics, and let time filter out papers that are worth reading. For example, I often find myself recommending Jim Gray’s 1986 paper on &lt;a href=&quot;https://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf&quot;&gt;The 5 Minute Rule&lt;/a&gt; and Lisanne Bainbridge’s 1983 paper on &lt;a href=&quot;https://www.ise.ncsu.edu/wp-content/uploads/2017/02/Bainbridge_1983_Automatica.pdf&quot;&gt;Ironies of Automation&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who writes research papers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Research papers in the areas of computer science I work in are generally written by one of three groups. First, researchers at universities, including professors, post docs, and graduate students. These are people whose job it is to do research. They have a lot of freedom to explore quite broadly, and do foundational and theoretical work.&lt;/p&gt;

&lt;p&gt;Second, engineering teams at companies publish their work. Amazon’s &lt;a href=&quot;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo&lt;/a&gt;, &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/agache&quot;&gt;Firecracker&lt;/a&gt;, &lt;a href=&quot;https://www.allthingsdistributed.com/files/p1041-verbitski.pdf&quot;&gt;Aurora&lt;/a&gt; and &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/brooker&quot;&gt;Physalia&lt;/a&gt; papers are examples. Here, work is typically more directly aimed at a problem to be solved in a particular context. The strength of industry research is that it’s often been proven in the real world, at scale.&lt;/p&gt;

&lt;p&gt;In the middle are industrial research labs. Bell Labs was home to some of the foundational work in computing and communications. Microsoft Research does a great deal of impressive work. Industry labs, as a broad generalization, also tend to focus on concrete problems, but can operate over longer time horizons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should I trust the results in research papers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The right answer to this question is &lt;em&gt;no&lt;/em&gt;. Nothing about being in a research paper guarantees that a result is right. Results can range from simply wrong, to flawed in more subtle ways&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;On the other hand, the process of peer review does help set a bar of quality for published results, and results published in reputable conferences and journals are generally trustworthy. Reviewers and editors put a great deal of effort into this, and it’s a real strength of scientific papers over informal publishing.&lt;/p&gt;

&lt;p&gt;My general advice is to read methods carefully, and verify results for yourself if you’re going to make critical decisions based on them. A common mistake is to apply a correct result too broadly, and assume it applies to contexts or systems it wasn’t tested on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should I distrust results that aren’t in research papers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. The process of peer review is helpful, but not magical. Results that haven’t been peer reviewed, or that were rejected in peer review, aren’t necessarily wrong. Some important papers have been rejected from traditional publishing, and were published in other ways. This &lt;a href=&quot;http://lamport.azurewebsites.net/pubs/pubs.html#lamport-paxos&quot;&gt;happened&lt;/a&gt; to Leslie Lamport’s classic paper introducing Paxos:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I submitted the paper to TOCS in 1990.  All three referees said that the paper was mildly interesting, though not very important, but that all the Paxos stuff had to be removed.  I was quite annoyed at how humorless everyone working in the field seemed to be, so I did nothing with the paper.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was eventually published 8 years later, and quite well received:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This paper won an ACM SIGOPS Hall of Fame Award in 2012.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There’s a certain dance one needs to know, and follow, to get published in a top conference or journal. Some of the steps are necessary, and lead to better research and better communities. Others are just for show.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What should I look out for in the methods section?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That depends on the field. In distributed systems, one thing to look out for is scale. Due to the constraints of research, systems may be tested and validated at a scale below what you’ll need to run in production. Think carefully about how the scale assumptions in the paper might impact the results. Both academic and industry authors have an incentive to talk up the strengths of their approach, and avoid highlighting the weaknesses. This is very seldom done to the point of dishonesty, but worth paying attention to as you read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I get time to read?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is going to depend on your personal circumstances, and your job. It’s not always easy. Long-term learning is one of the keys to a sustainable and successful career, so it’s worth making time to learn. One of the ways I like to learn is by reading research papers. You might find other ways more efficient, effective or enjoyable. That’s OK too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/penberg&quot;&gt;Pekka Enberg&lt;/a&gt; pointed me at &lt;a href=&quot;https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPaper.pdf&quot;&gt;How to Read a Paper&lt;/a&gt; by Srinivasan Keshav. It describes a three-pass approach to reading a paper that I like very much:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The first pass gives you a general idea about the paper. The second pass lets you grasp the paper’s content, but not its details. The third pass helps you understand the paper in depth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Murat Demirbas shared his post &lt;a href=&quot;http://muratbuffalo.blogspot.com/2013/07/how-i-read-research-paper.html&quot;&gt;How I Read a Research Paper&lt;/a&gt;, which contains a lot of great advice. Like Murat, I like to read on paper, although I have taken to doing my lighter-weight reading using &lt;a href=&quot;https://www.liquidtext.net/&quot;&gt;LiquidText&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; I wrote a &lt;a href=&quot;https://brooker.co.za/blog/2014/05/19/vr.html&quot;&gt;blog post about Viewstamped Replication&lt;/a&gt; back in 2014. It’s a pity VR isn’t more famous, because it’s an interestingly different framing that helped me make sense of a lot of what Paxos does.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Obviously stuff like maths is timeless, but even in fast-moving fields like systems there are papers worth reading from the 50s and 60s. I think about Sayre’s 1969 paper &lt;a href=&quot;https://dl.acm.org/doi/10.1145/363626.363629&quot;&gt;Is automatic “folding” of programs efficient enough to displace manual?&lt;/a&gt; when people talk about how modern programmers don’t care about efficiency.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; There’s a lot of research that looks at the methods and evidence of other research. For a start, and to learn interesting things about your own benchmarking, take a look at &lt;a href=&quot;https://www.usenix.org/conference/nsdi20/presentation/uta&quot;&gt;Is Big Data Performance Reproducible in Modern Cloud Networks?&lt;/a&gt; and &lt;a href=&quot;https://www.fsl.cs.sunysb.edu/docs/fsbench/fsbench-tr.html&quot;&gt;A Nine Year Study of File System and Storage Benchmarking&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Two Years With Rust</title>
      <link>http://brooker.co.za/blog/2020/03/22/rust.html</link>
      <pubDate>Sun, 22 Mar 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/03/22/rust</guid>
      <description>&lt;h1 id=&quot;two-years-with-rust&quot;&gt;Two Years With Rust&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;I like it. I hope it&apos;s going to be big.&lt;/p&gt;

&lt;p&gt;It’s been just over two years since I started learning Rust. Since then, I’ve used it heavily at my day job, including work in the &lt;a href=&quot;https://github.com/firecracker-microvm/firecracker&quot;&gt;Firecracker&lt;/a&gt; code base, and a number of other projects. Rust is a great fit for the systems-level work I’ve been doing over the last few years: often performance- and density-sensitive, always security-sensitive. I find the type system, object life cycle, and threading model both well-suited to this kind of work and fairly intuitive. Like most people, I still fight with the compiler from time-to-time, but we mostly get on now.&lt;/p&gt;

&lt;p&gt;Rust has also mostly replaced Go as my go-to language for writing small performance-sensitive programs, like the numerical simulators I use a lot. Go replaced C in that role for me, and joined R and Python as my day-to-day go-to tools. I’ve found that I still spend more time writing a Rust program than I do Go, and more than C (except where C is held back by a lack of sane data structures and string handling). I’ve also found that programs seem more likely to work on their first run, but haven’t made any effort to quantify that.&lt;/p&gt;

&lt;p&gt;Over my career, I’ve done for-pay work in C, C++, Java, Python, Ruby, Go, Rust, Scheme, Basic, Perl, Bash, TLA+, Delphi, Matlab, ARM and x86 assembly, and R (probably forgetting a few). There’s likely some of my code in each of those languages still running somewhere. I’ve also learned a bunch of other languages, because it’s something I enjoy doing. Recently, for example, I’ve been loving playing with &lt;a href=&quot;https://frinklang.org/&quot;&gt;Frink&lt;/a&gt;. I don’t tend to be highly opinionated about languages.&lt;/p&gt;

&lt;p&gt;However, in some cases I steer colleagues and teams away from particular choices. C and C++, for example, seem to be difficult and expensive to use in a way that avoids dangerous memory-safety bugs, and users need to be willing to invest deeply in their code if these bugs matter to them. It’s possible to write great safe C, but the path there requires a challenging blend of tools and humility. Rust isn’t a panacea, but is a really nice alternative in a space where options were fairly thin before. I find myself recommending and choosing it more and more often for small command-line programs, high-performance services, and system-level code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why I like Rust&lt;/strong&gt;
There are a lot of good programming languages in the world. There are even multiple that fit Rust’s broad description, and place in the ecosystem. This is a very good place, with real problems to solve. I’m not convinced that Rust is necessarily technically superior to its nearest neighbors, but there are some things it seems to do particularly well.&lt;/p&gt;

&lt;p&gt;I like how friendly and helpful the compiler’s error messages are. The free book and standard library documentation are all very good. The type system is nice to work with. The built-in tooling (rustup, cargo and friends) is easy and powerful. A standard formatting tool goes a long way to keeping code-bases tidy and bikesheds unpainted. Static linking and cross-compiling are built-in. The smattering of functional idioms seems to add a good amount of power and expressiveness. Features that actively lead to obtuse code (like macros) are discouraged. Out-of-the-box performance is pretty great. &lt;a href=&quot;https://doc.rust-lang.org/book/ch16-00-concurrency.html#fearless-concurrency&quot;&gt;Fearless Concurrency&lt;/a&gt; actually delivers.&lt;/p&gt;

&lt;p&gt;There’s a lot more, too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What might make Rust unsuccessful?&lt;/strong&gt;
There are also some things I don’t particularly like about Rust. Some of those are short-term. Learning how to write async networking code in Rust during the year or so before &lt;em&gt;async&lt;/em&gt; and &lt;em&gt;await&lt;/em&gt; were stabilized was a frustrating mess of inconsistent documentation and broken APIs. The compiler isn’t as smart about optimizations like loop unrolling and autovectorization as C compilers tend to be (even where it does a great job eliding the safety checks, and other Rust-specific overhead). Some parts of the specification, like aliasing rules and the exact definitions of &lt;a href=&quot;https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html&quot;&gt;atomic memory orderings&lt;/a&gt;, are still a little fuzzier than I would like. Static analysis tooling has a way to go. Allocating aligned memory is tricky, especially if you still want to use some of the standard data structures. And so on.&lt;/p&gt;

&lt;p&gt;In each of these cases, and more like them, the situation seems to have improved every time I look at it in detail. The community seems to be making great progress. &lt;em&gt;async&lt;/em&gt; and &lt;em&gt;await&lt;/em&gt; were particularly big wins.&lt;/p&gt;

&lt;p&gt;The biggest long-term issue in my mind is &lt;em&gt;unsafe&lt;/em&gt;. Rust makes what seems like a very reasonable decision to allow sections of code to be marked as &lt;em&gt;unsafe&lt;/em&gt;, which allows one to color outside the lines of the memory and life cycle guarantees. As the name implies, &lt;em&gt;unsafe&lt;/em&gt; code tends to be &lt;em&gt;unsafe&lt;/em&gt;. The big problem with &lt;em&gt;unsafe&lt;/em&gt; code isn’t that the code inside the block is unsafe, it’s that it can break the safety properties of safe code in subtle and non-obvious ways. Even safe code that’s thousands of lines away. This kind of action-at-a-distance can make it difficult to reason about the properties of any code-base that contains &lt;em&gt;unsafe&lt;/em&gt; code. For low-level systems code, that’s probably all of them.&lt;/p&gt;

&lt;p&gt;This isn’t a surprise to the community. The Rust community is very realistic about the costs and benefits of &lt;em&gt;unsafe&lt;/em&gt;. Sometimes that debate goes too far (as &lt;a href=&quot;https://words.steveklabnik.com/a-sad-day-for-rust&quot;&gt;Steve Klabnik has written about&lt;/a&gt;), but mostly the debate and culture seems healthy to me as a relative outsider.&lt;/p&gt;

&lt;p&gt;The problem is that this spooky behavior of &lt;em&gt;unsafe&lt;/em&gt; tends not to be obvious to new Rust programmers. The mental model I’ve seen nearly everybody start with, including myself, is that &lt;em&gt;unsafe&lt;/em&gt; blocks can break things inside them and so care needs to be paid to writing that code well. Unfortunately, that’s not sufficient.&lt;/p&gt;
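
&lt;p&gt;A tiny sketch (hypothetical code, not from any real project) of why that mental model falls short: the soundness of an &lt;em&gt;unsafe&lt;/em&gt; block usually rests on invariants maintained by the perfectly safe code around it.&lt;/p&gt;

```rust
// Hypothetical function whose unsafe block is sound today.
fn first_n(v: &[u8], n: usize) -> &[u8] {
    // The unsafe block below is only sound because of this safe check.
    // If a later change weakens it to `n <= v.len() + 1` (an ordinary
    // off-by-one in "safe" code), the unsafe block becomes a buffer
    // over-read without a single character inside it changing.
    assert!(n <= v.len());
    unsafe { std::slice::from_raw_parts(v.as_ptr(), n) }
}

fn main() {
    let data = [10u8, 20, 30, 40];
    println!("{:?}", first_n(&data, 2)); // prints [10, 20]
}
```

&lt;p&gt;Reviewing only the &lt;em&gt;unsafe&lt;/em&gt; block tells you nothing here; the safety argument lives in the surrounding safe code.&lt;/p&gt;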

&lt;p&gt;Better static and dynamic analysis tooling could help here, as well as some better help from the compiler, and alternatives to some uses of &lt;em&gt;unsafe&lt;/em&gt;. I suspect that the long-term success of Rust as a systems language is going to depend on how well the community and tools handle &lt;em&gt;unsafe&lt;/em&gt;. A lot of the value of Rust lies in its safety, and it’s still too easy to break that safety without knowing it.&lt;/p&gt;

&lt;p&gt;Another long-term risk is the size of the language. It’s been over 10 years since I last worked with C++ every day, and I’m nowhere near being a competent C++ programmer anymore. Part of that is because C++ has evolved, which is a very good thing. Part of it is because C++ is &lt;em&gt;huge&lt;/em&gt;. From a decade away, it seems hard to be a competent part-time C++ programmer: you need to be fully immersed, or you’ll never fit the whole thing in your head. Rust could go that way too, and it would be a pity.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Firecracker: Lightweight Virtualization for Serverless Applications</title>
      <link>http://brooker.co.za/blog/2020/02/19/firecracker.html</link>
      <pubDate>Wed, 19 Feb 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/02/19/firecracker</guid>
      <description>&lt;h1 id=&quot;firecracker-lightweight-virtualization-for-serverless-applications&quot;&gt;Firecracker: Lightweight Virtualization for Serverless Applications&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Our second paper for NSDI&apos;20.&lt;/p&gt;

&lt;p&gt;In 2018, we announced &lt;a href=&quot;https://firecracker-microvm.github.io/&quot;&gt;Firecracker&lt;/a&gt;, an &lt;a href=&quot;https://github.com/firecracker-microvm/firecracker&quot;&gt;open source&lt;/a&gt; VMM optimized for multi-tenant serverless and container workloads. We heard some interest from the research community, and in response wrote up our reasoning behind building Firecracker, and how it’s used inside AWS Lambda.&lt;/p&gt;

&lt;p&gt;That paper was accepted to NSDI’20, and is &lt;a href=&quot;https://www.amazon.science/publications/firecracker-lightweight-virtualization-for-serverless-applications&quot;&gt;available here&lt;/a&gt;. Here’s the abstract:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Serverless containers and functions are widely used for deploying and managing software in the cloud. Their popularity is due to reduced cost of operations, improved utilization of hardware, and faster scaling than traditional deployment methods. The economics and scale of serverless applications demand that workloads from multiple customers run on the same hardware with minimal overhead, while preserving strong security and performance isolation. The traditional view is that there is a choice between virtualization with strong security and high overhead, and container technologies with weaker security and minimal overhead. This tradeoff is unacceptable to public infrastructure providers, who need both strong security and minimal overhead. To meet this need, we developed Firecracker, a new open source Virtual Machine Monitor (VMM) specialized for serverless workloads, but generally useful for containers, functions and other compute workloads within a reasonable set of constraints. We have deployed Firecracker in two publicly available serverless compute services at Amazon Web Services (Lambda and Fargate), where it supports millions of production workloads, and trillions of requests per month. We describe how specializing for serverless informed the design of Firecracker, and what we learned from seamlessly migrating Lambda customers to Firecracker.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Like any project the size of Firecracker, it was developed by a team of people from vision to execution. I played only a small role in that, but it’s been great to work with the team (and the community) on getting Firecracker out, adding features, and using it in production at pretty huge scale.&lt;/p&gt;

&lt;p&gt;Firecracker is a little bit unusual among software projects in having an explicit goal of being simple and well-suited for a relatively small number of tasks. That doesn’t mean it’s simplistic. Choosing what to do, and what not to do, involved some of the most interesting decisions in its development. I’m particularly proud of how well the team made those decisions, and continues to make them.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Physalia: Millions of Tiny Databases</title>
      <link>http://brooker.co.za/blog/2020/02/17/physalia.html</link>
      <pubDate>Mon, 17 Feb 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/02/17/physalia</guid>
      <description>&lt;h1 id=&quot;physalia-millions-of-tiny-databases&quot;&gt;Physalia: Millions of Tiny Databases&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Avoiding Hard CAP Tradeoffs&lt;/p&gt;

&lt;p&gt;A few years ago, when I was still working on EBS, we started building a system called Physalia. Physalia is a custom transactional key-value store, designed to play the role of &lt;em&gt;configuration master&lt;/em&gt; in the EBS architecture. Last year, we wrote a paper about Physalia, and were thrilled that it was accepted to NSDI’20.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c2390/millions-of-tiny-databases.pdf&quot;&gt;Millions of Tiny Databases&lt;/a&gt; describes our problem and solution in detail. Here’s the abstract:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Starting in 2013, we set out to build a new database to act as the configuration store for a high-performance cloud block storage system, Amazon EBS.
This database needs to be not only highly available, durable, and scalable but also strongly consistent. We quickly realized that the constraints on availability imposed by the CAP theorem, and the realities of operating distributed systems, meant that we didn’t want one database. We wanted millions. Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients.
Physalia uses its knowledge of datacenter topology to place data where it is most likely to be available. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client.
This paper describes Physalia in the context of Amazon EBS, and some other uses within Amazon Web Services. We believe that the same patterns, and approach to design, are widely applicable to distributed systems problems like control planes, configuration management, and service discovery.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also wrote a post about Physalia &lt;a href=&quot;https://www.amazon.science/blog/amazon-ebs-addresses-the-challenge-of-the-cap-theorem-at-scale&quot;&gt;for the Amazon Science blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One aspect of Physalia that I’m particularly proud of is the work that we put into correctness. We used TLA+ extensively throughout the design, and as documentation during implementation. As &lt;a href=&quot;http://brooker.co.za/blog/2014/08/09/formal-methods.html&quot;&gt;we’ve published about before&lt;/a&gt;, TLA+ is really well suited to these kinds of systems. We also automatically generated unit tests, an approach I haven’t seen used elsewhere:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In addition to unit testing, we adopted a number of other testing approaches. One of those approaches was a suite of automatically-generated tests which run the Paxos implementation through every combination of packet loss and re-ordering that a node can experience. This testing approach was inspired by the TLC model checker, and helped us build confidence that our implementation matched the formal specification.&lt;/p&gt;
&lt;/blockquote&gt;
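
&lt;p&gt;The shape of that test generator can be sketched with a toy protocol standing in for Paxos: enumerate every subset of messages (loss) and every permutation of each subset (re-ordering), then check an invariant after each schedule. This is illustrative code, not the Physalia test suite:&lt;/p&gt;

```rust
// Toy "protocol": a last-writer-wins register keyed by version.
// The invariant we check is that, whatever the loss and ordering,
// the register ends at the newest version that was delivered.
#[derive(Clone, Copy)]
struct Msg {
    version: u64,
    value: u64,
}

#[derive(Default)]
struct Register {
    version: u64,
    value: u64,
}

impl Register {
    fn apply(&mut self, m: Msg) {
        // Stale (lower-versioned) messages are ignored.
        if m.version > self.version {
            self.version = m.version;
            self.value = m.value;
        }
    }
}

// All orderings of a small message set, built recursively.
fn permutations(items: &[Msg]) -> Vec<Vec<Msg>> {
    if items.is_empty() {
        return vec![vec![]];
    }
    let mut out = Vec::new();
    for i in 0..items.len() {
        let mut rest = items.to_vec();
        let picked = rest.remove(i);
        for mut tail in permutations(&rest) {
            tail.insert(0, picked);
            out.push(tail);
        }
    }
    out
}

fn main() {
    let msgs = [
        Msg { version: 1, value: 10 },
        Msg { version: 2, value: 20 },
        Msg { version: 3, value: 30 },
    ];
    let mut schedules = 0;
    // Every subset is a packet-loss pattern; every permutation of the
    // subset is a re-ordering.
    for mask in 0u32..(1 << msgs.len()) {
        let delivered: Vec<Msg> = (0..msgs.len())
            .filter(|&i| mask & (1 << i) != 0)
            .map(|i| msgs[i])
            .collect();
        let max_version = delivered.iter().map(|m| m.version).max().unwrap_or(0);
        for schedule in permutations(&delivered) {
            let mut reg = Register::default();
            for m in schedule {
                reg.apply(m);
            }
            assert_eq!(reg.version, max_version); // the invariant
            schedules += 1;
        }
    }
    println!("checked {} schedules", schedules); // prints: checked 16 schedules
}
```

&lt;p&gt;For three messages that’s only 16 schedules, but the count grows fast, which is exactly why generating the schedules mechanically beats writing them by hand.&lt;/p&gt;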

&lt;p&gt;Check out &lt;a href=&quot;https://assets.amazon.science/c4/11/de2606884b63bf4d95190a3c2390/millions-of-tiny-databases.pdf&quot;&gt;our paper&lt;/a&gt; if you’d like to learn more.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Why do we need distributed systems?</title>
      <link>http://brooker.co.za/blog/2020/01/02/why-distributed.html</link>
      <pubDate>Thu, 02 Jan 2020 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2020/01/02/why-distributed</guid>
      <description>&lt;h1 id=&quot;why-do-we-need-distributed-systems&quot;&gt;Why do we need distributed systems?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Building distributed systems is hard. It&apos;s expensive. It&apos;s complex. But we do it anyway.&lt;/p&gt;

&lt;p&gt;I grew up reading John Carmack’s .plan file. His stories about the development of Doom, Quake and the rest were a formative experience for me, and a big reason I was interested in computers beyond just gaming&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. I was a little bit disappointed to see this tweet:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;My formative memory of Python was when the Quake Live team used it for the back end work, and we wound up having serious performance problems with a few million users. My bias is that a lot (not all!) of complex “scalable” systems can be done with a simple, single C++ server.&lt;/p&gt;&amp;mdash; John Carmack (@ID_AA_Carmack) &lt;a href=&quot;https://twitter.com/ID_AA_Carmack/status/1210997702152069120?ref_src=twsrc%5Etfw&quot;&gt;December 28, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;This isn’t an isolated opinion, but I don’t think it’s a particularly good one. To be fair, there are a lot of good reasons not to build distributed systems. Complexity is one: distributed systems are legitimately harder to build, and significantly harder to understand and operate. Efficiency is another. As McSherry et al. point out in &lt;a href=&quot;https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf&quot;&gt;Scalability! But at what COST?&lt;/a&gt;, single-system designs can have great performance and efficiency. Modern computers are huge and fast.&lt;/p&gt;

&lt;p&gt;I was not so much disappointed in John, as in our success at building distributed systems tools that make this untrue. Distributed computing could be much easier, and needs to be much easier. We need to get to a point, with services, tooling and technology, that monolithic systems aren’t a good default. To understand why, let me answer the question in the post’s title.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed systems offer better availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The availability of a monolithic system is limited to the availability of the piece of hardware it runs on. Modern hardware is pretty great, and combined with a good datacenter and good management practices servers can be expected to fail with an annual failure rate (AFR) in the single-digit percentages. That’s OK, but not great in two ways. First, if you run a lot of systems, fixing these servers stacks up to an awful lot of toil. The toil is unavoidable, because if we’re building a monolithic system we need to store the system state on the one server, and so creating a new server takes work (and lost state, and understanding what the lost state means to your users). The second way they get you is with time-to-recovery (TTR): unless you’re super disciplined in keeping and testing backups, your rebuild process and all the rest, it’s been a couple of years since you last made a new one of these things. It’s going to take time.&lt;/p&gt;
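
&lt;p&gt;To put rough numbers on why TTR matters: availability is commonly modeled as MTBF / (MTBF + TTR), and replication buys back what a long TTR costs. A back-of-envelope sketch with made-up numbers:&lt;/p&gt;

```rust
// Standard availability model: the fraction of time a system is up.
fn availability(mtbf_hours: f64, mttr_hours: f64) -> f64 {
    mtbf_hours / (mtbf_hours + mttr_hours)
}

// A three-replica system that stays up while any two replicas are up,
// assuming (optimistically) that replicas fail independently.
fn quorum_2_of_3(a: f64) -> f64 {
    a.powi(3) + 3.0 * a.powi(2) * (1.0 - a)
}

fn main() {
    // Made-up numbers: one failure a decade, and two days to rebuild
    // a monolithic server from backups.
    let a = availability(10.0 * 365.0 * 24.0, 48.0);
    println!("single server: {:.5} available", a);
    println!("2-of-3 quorum: {:.7} available", quorum_2_of_3(a));
}
```

&lt;p&gt;Under these (generous) independence assumptions, the quorum’s downtime shrinks by a couple of orders of magnitude, because an outage now needs two overlapping failures.&lt;/p&gt;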

&lt;p&gt;Distributed systems incur cost and complexity because they continuously avoid getting into this state. Dedicated state stores, replication, consensus and all the rest add up to avoiding any one server being a single point of failure, but also hide the long TTR that comes with fixing systems. Modern ops practices, like infrastructure as code, immutable infrastructure, containers, and serverless reduce the TTR and toil even more.&lt;/p&gt;

&lt;p&gt;Distributed systems can also be placed nearer the users that need them. It doesn’t really matter if a system is available or not if clients can’t get to it, and &lt;a href=&quot;https://dl.acm.org/doi/10.1145/2643130&quot;&gt;network partitions happen&lt;/a&gt;. Despite the restrictions of the CAP theorem and friends, this extra degree of flexibility allows distributed systems to do much better than monolithic systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed systems offer better durability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Like availability, the durability of single storage devices is pretty great these days. The Backblaze folks release &lt;a href=&quot;https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2019/&quot;&gt;some pretty great stats&lt;/a&gt; that show that they see about 1.6% of their drives fail in any given year. This has been the case since &lt;a href=&quot;https://dl.acm.org/doi/10.5555/1267903.1267905&quot;&gt;at least the late 2000s&lt;/a&gt;. If you put your customer’s data on a single disk, you’re highly likely to still have it at the end of the year.&lt;/p&gt;

&lt;p&gt;For this blog, “highly likely” is good enough. For almost all meaningful businesses, it simply isn’t. Monolithic systems then have two choices. One is RAID. Keep the state on multiple disks, and replace them as they fail. RAID is a good thing, but only protects against a few drive failures. Not floods, fires, or explosions. Or correlated drive failure&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The other option is backups. Again, a good thing with a big downside. Backups require you to choose two things: how often you run them (and therefore how much data you lose when you need them), and how long they take to restore. For the stuff on my laptop, a daily backup and multi-hour restore is plenty. For business-critical data, not so much.&lt;/p&gt;

&lt;p&gt;Distributed storage systems continuously make multiple copies of a piece of data, allowing a great deal of flexibility around cost, time-to-recovery, durability, and other factors. They can also be built to be extremely tolerant to correlated failures, and avoid correlation outright.&lt;/p&gt;
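
&lt;p&gt;Back-of-envelope, the value of those extra copies is easy to see. A sketch, assuming (too generously) that drives fail independently and that there’s no repair between failures:&lt;/p&gt;

```rust
// Naive model: data is lost in a year only if every copy fails that
// year, with independent drive failures and no re-replication.
fn p_loss(afr: f64, copies: i32) -> f64 {
    afr.powi(copies)
}

fn main() {
    let afr = 0.016; // roughly the Backblaze per-drive annual failure rate
    println!("1 copy:   {:.1e} chance of loss per year", p_loss(afr, 1));
    println!("3 copies: {:.1e} chance of loss per year", p_loss(afr, 3));
    // About four thousand times better. Correlated failures (fires,
    // floods, bad firmware) erode this quickly, which is why real
    // systems also spread replicas across failure domains.
}
```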

&lt;p&gt;&lt;strong&gt;Distributed systems offer better scalability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As with availability and durability, distributing a system over many machines gives a lot of flexibility about how to scale it. Stateless systems are relatively easy to scale, and basic techniques like HTTP load balancers are great for an awful lot of use-cases. Stateful systems are harder to scale, both because you need to decide how to spread the state around, and because you need to figure out how to send users to the right place to get the state. These two problems are at the heart of a high percentage of the distributed systems literature, and more is published on them every single day.&lt;/p&gt;
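
&lt;p&gt;As one example of the flavor of those solutions, rendezvous hashing answers both questions, “where does this state live?” and “how do clients find it?”, in a few lines. A toy sketch, with hypothetical node names:&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Rendezvous (highest-random-weight) hashing: every client can compute
// which node owns a key, with no central directory to consult. The key
// goes to whichever node scores highest for it.
fn owner<'a>(key: &str, nodes: &[&'a str]) -> &'a str {
    *nodes
        .iter()
        .max_by_key(|&&node| {
            let mut h = DefaultHasher::new();
            (key, node).hash(&mut h);
            h.finish()
        })
        .expect("need at least one node")
}

fn main() {
    let nodes = ["node-a", "node-b", "node-c"];
    for key in ["user:1", "user:2", "cart:9"] {
        println!("{} lives on {}", key, owner(key, &nodes));
    }
    // When a node is removed, only the keys it owned move; every other
    // key keeps its owner. That property is what makes this useful.
    let smaller = ["node-a", "node-b"];
    for key in ["user:1", "user:2", "cart:9"] {
        let before = owner(key, &nodes);
        if before != "node-c" {
            assert_eq!(owner(key, &smaller), before);
        }
    }
}
```

&lt;p&gt;Real systems layer replication, health checking, and rebalancing on top, which is most of where the hard work (and the literature) lives.&lt;/p&gt;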

&lt;p&gt;The good news is that many good solutions to these problems are already available. They are available as services (as in the cloud), and available as software (open source and otherwise). You don’t need to figure this out yourself, and shouldn’t try (unless you are really sure you want to).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed systems offer better efficiency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Workloads are very seldom constant. Computers like to do things on the hour, or every day, or every minute. Humans, thanks to our particular foibles like sleeping and hanging out with our kids, tend to want to do things during the day, or on the holidays, or during the work week. Other humans like to do things in the evening, or late at night. This all means that the load on most systems varies, both randomly and &lt;em&gt;seasonally&lt;/em&gt;. If you’re running each thing on its own box you can’t take advantage of that&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Big distributed systems, like the cloud, can. They also give you tools (like automatic scaling) to take advantage of it economically.&lt;/p&gt;

&lt;p&gt;When you count all the factors that go into their cost, most computers aren’t that much more expensive to keep busy than they are to keep idle. That means it makes a lot of economic sense to keep computers as busy as possible. Monolithic systems find it hard to do that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No magic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, none of this stuff comes for free. Actually building (and, critically, operating) distributed systems that do better than monolithic systems on all these properties is difficult. The reality is seldom as attractive as the theory would predict.&lt;/p&gt;

&lt;p&gt;As an industry, we’ve made a fantastic amount of progress in making great distributed systems available over the last decade. But, as Carmack’s tweet shows, we’ve still got a lot to do. Despite all the theoretical advantages it’s still reasonable for technically savvy people to see monolithic systems as simpler and better. This is a big part of why I’m excited about serverless: it’s the start of a big opportunity to make all the magic of distributed systems even more widely and simply available.&lt;/p&gt;

&lt;p&gt;If we get this right, we can change the default. More availability, more durability, more efficiency, more scale, less toil. It’s going to be an interesting decade.&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Along with hacking on &lt;a href=&quot;https://github.com/GorillaStack/gorillas/blob/master/gorillas.bas&quot;&gt;gorillas.bas&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Which is a real thing. In &lt;a href=&quot;https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf&quot;&gt;Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?&lt;/a&gt; Schroeder and Gibson report that &lt;em&gt;Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.&lt;/em&gt; This situation hasn’t improved since 2007.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; I guess you can &lt;a href=&quot;https://www.mersenne.org/&quot;&gt;search for primes&lt;/a&gt;, or mine Ethereum, or something else. Unfortunately, these activities are seldom economically interesting.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Kindness, Wickedness and Safety</title>
      <link>http://brooker.co.za/blog/2019/08/12/kind-wicked.html</link>
      <pubDate>Mon, 12 Aug 2019 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2019/08/12/kind-wicked</guid>
      <description>&lt;h1 id=&quot;kindness-wickedness-and-safety&quot;&gt;Kindness, Wickedness and Safety&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;We must build kind systems.&lt;/p&gt;

&lt;p&gt;David Epstein’s book &lt;a href=&quot;https://www.amazon.com/Range-Generalists-Triumph-Specialized-World/dp/0735214484&quot;&gt;Range: Why Generalists Triumph in a Specialized World&lt;/a&gt; turned me on to the idea of Kind and Wicked learning environments, and I’ve found the idea to be very useful in framing all kinds of problems.&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; The idea comes from &lt;a href=&quot;https://pdfs.semanticscholar.org/5c5d/33b858eaf38f6a14b3f042202f1f44e04326.pdf&quot;&gt;The Two Settings of Kind and Wicked Learning Environments&lt;/a&gt;. The abstract gets right to the point:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Inference involves two settings: In the first, information is acquired (learning); in the second, it is applied (predictions or choices). Kind learning environments involve close matches between the informational elements in the two settings and are a necessary condition for accurate inferences. Wicked learning environments involve mismatches.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The authors go on to describe the two environments in terms of the information that we can learn from &lt;em&gt;L&lt;/em&gt; (for learning), and information that we use when we actually have to make predictions &lt;em&gt;T&lt;/em&gt; (for target). They break environments down into &lt;em&gt;kind&lt;/em&gt; or &lt;em&gt;wicked&lt;/em&gt; depending on how &lt;em&gt;L&lt;/em&gt; relates to &lt;em&gt;T&lt;/em&gt;. In kind environments, &lt;em&gt;L&lt;/em&gt; and &lt;em&gt;T&lt;/em&gt; are closely related: if you learn a rule from &lt;em&gt;L&lt;/em&gt; it applies at least approximately to &lt;em&gt;T&lt;/em&gt;. In wicked environments, &lt;em&gt;L&lt;/em&gt; is a subset or superset of &lt;em&gt;T&lt;/em&gt;, or the sets intersect only partially, or are completely unrelated.&lt;/p&gt;

&lt;p&gt;Simplifying this a bit more: in kind environments we learn the right lessons from experience; in wicked environments we learn the wrong lessons (or at least incomplete ones).&lt;/p&gt;

&lt;p&gt;From the paper again:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If kind, we have the necessary conditions for accurate inference. Therefore, any errors must be attributed to the person (e.g., inappropriate information aggregation). If wicked, we can identify how error results from task features, although these can also be affected by human actions. In short, our framework facilitates pinpointing the sources of errors (task structure and/or person).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This has interesting implications for thinking about safety, and the role of operators (and builders) in ensuring safety. In kind environments, operator mistakes can be seen as &lt;em&gt;human error&lt;/em&gt;, where the human learned the wrong lesson or did the wrong thing. In wicked environments, humans will always make errors, because there are risks that are not captured by their experience.&lt;/p&gt;

&lt;p&gt;Going back to &lt;a href=&quot;//brooker.co.za/blog/2019/06/17/chernobyl.html&quot;&gt;Anatoly Dyatlov’s question to the IAEA&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;How and why should the operators have compensated for design errors they did not know about?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dyatlov is saying that operating Chernobyl was a wicked environment. Operators applying their best knowledge and experience, even flawlessly, weren’t able to make the right inferences about the safety of the system.&lt;/p&gt;

&lt;p&gt;Back to the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Since kind environments are a necessary condition for accurate judgments, our framework suggests deliberately creating kind environments.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I found reading this to be something of a revelation. When building safe systems, we need to make those systems &lt;em&gt;kind&lt;/em&gt;. We need to deliberately create kind environments. If we build them so they are &lt;em&gt;wicked&lt;/em&gt;, then we set our operators, tooling and automation up for failure.&lt;/p&gt;

&lt;p&gt;Some parts of our field are inherently wicked. In large and complex systems the set of circumstances we learn from is always incomplete, because the system has so many states that there’s no way to have seen them all before. In security, there’s an active attacker who’s trying very hard to make the environment wicked.&lt;/p&gt;

&lt;p&gt;The role of the designer and builder of systems is to make the environment as kind as possible. Extract as much wickedness as possible, and try not to add any.&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; The book is worth reading. It contains a lot of interesting ideas, but like all popular science books also contains a lot of extrapolation beyond what the research supports. If you’re pressed for time, the &lt;a href=&quot;http://www.econtalk.org/david-epstein-on-mastery-specialization-and-range/&quot;&gt;EconTalk episode&lt;/a&gt; about the book covers a lot of the material.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>When Redundancy Actually Helps</title>
      <link>http://brooker.co.za/blog/2019/06/20/redundancy.html</link>
      <pubDate>Thu, 20 Jun 2019 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2019/06/20/redundancy</guid>
      <description>&lt;h1 id=&quot;when-redundancy-actually-helps&quot;&gt;When Redundancy Actually Helps&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Redundancy can harm more than it helps.&lt;/p&gt;

&lt;p&gt;Just after I joined the EBS team at AWS in 2011, the service &lt;a href=&quot;https://aws.amazon.com/message/65648/&quot;&gt;suffered a major disruption&lt;/a&gt; lasting more than two days to full recovery. Recently, on Twitter, &lt;a href=&quot;https://twitter.com/tacertain/status/1152459506464329729&quot;&gt;Andrew Certain said&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We were super dependent on having a highly available network to make the replication work, so having two NICs and a second network fabric seemed to be a way to improve availability. But the lesson of this event is that only some forms of redundancy improve availability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’ve been thinking about the second part of that a lot recently, as my team starts building a new replicated system. When does redundancy actually help availability? I’ve been breaking that down into four rules:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The complexity added by introducing redundancy mustn’t cost more availability than it adds.&lt;/li&gt;
  &lt;li&gt;The system must be able to run in degraded mode.&lt;/li&gt;
  &lt;li&gt;The system must reliably detect which of the redundant components are healthy and which are unhealthy.&lt;/li&gt;
  &lt;li&gt;The system must be able to return to fully redundant mode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These rules might seem obvious, even tautological, but each one serves as a trigger for deeper thinking and conversation.&lt;/p&gt;

&lt;h2 id=&quot;dont-add-more-risk-than-you-take-away&quot;&gt;Don’t add more risk than you take away&lt;/h2&gt;

&lt;p&gt;Andrew (or Kerry Lee, I’m not sure which) introduced this to the EBS team as &lt;em&gt;don’t be weird&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot; data-conversation=&quot;none&quot; data-dnt=&quot;true&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;So I think it reinforces two lessons:&lt;br /&gt;&lt;br /&gt;1/ Don&amp;#39;t be weird&lt;br /&gt;2/ Modality is bad&lt;/p&gt;&amp;mdash; Andrew Certain (@tacertain) &lt;a href=&quot;https://twitter.com/tacertain/status/1152460786171707393?ref_src=twsrc%5Etfw&quot;&gt;July 20, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;This isn’t a comment on people (who are more than welcome to be weird), but on systems. Weirdness and complexity add risk, both risk that we don’t understand the system that we’re building, and risk that we don’t understand the system that we are operating. When adding redundancy to a system, it’s easy to fall into the mistake of adding too much complexity, and underestimating the ways in which that complexity adds risk.&lt;/p&gt;

&lt;h2 id=&quot;you-must-be-able-to-run-in-degraded-mode&quot;&gt;You must be able to run in degraded mode&lt;/h2&gt;

&lt;p&gt;Once you’ve failed over to the redundant component, are you sure it’s going to be able to take the load? Even in one of the simplest cases, active-passive database failover, this is a complex question. You’re going from warm caches and full buffers to cold caches and empty buffers. Performance can differ significantly.&lt;/p&gt;

&lt;p&gt;As systems get larger and more complex, the problem gets more difficult. What components do you expect to fail? How many at a time? How much traffic can each component handle? How do we stop our cost reduction and efficiency efforts from taking away the capacity needed to handle failures? How do we continuously test that the failover works? What mechanism do we have to make sure there’s enough failover capacity? There’s typically at least as much investment in answering these questions as building the redundant system in the first place.&lt;/p&gt;

&lt;p&gt;Chaos testing, gamedays, and other similar approaches are very useful here, but typically can’t test the biggest failure cases in a continuous way.&lt;/p&gt;

&lt;h2 id=&quot;youve-got-to-fail-over-in-the-right-direction&quot;&gt;You’ve got to fail over in the right direction&lt;/h2&gt;

&lt;p&gt;When systems suffer partial failure, it’s often hard to tell what’s &lt;em&gt;healthy&lt;/em&gt; and what’s &lt;em&gt;unhealthy&lt;/em&gt;. In fact, different systems in different parts of the network often completely disagree on health. If your system sees partial failure and fails over towards the truly &lt;em&gt;unhealthy&lt;/em&gt; side, you’re in trouble. The complexity here comes from the distributed systems fog of war: telling the difference between bad networks, bad software, bad disks, and bad NICs can be surprisingly hard. Often, systems flap a bit before falling over.&lt;/p&gt;
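&lt;p&gt;One way to cut through that fog of war is to never trust a single observer. Here’s a toy sketch of the idea (an illustration, not any particular system’s implementation): declare a component unhealthy only when a majority of independent observers agree, so one observer stuck on the wrong side of a partition can’t trigger a failover in the wrong direction.&lt;/p&gt;

```python
from collections import Counter

def quorum_health(reports):
    """Decide health from independent observers' votes.

    reports: a list of "healthy" / "unhealthy" votes, one per observer.
    Returns "unhealthy" only when a strict majority says so; otherwise
    stays conservative and reports "healthy", so a single partitioned
    observer can't force a failover toward the truly unhealthy side.
    """
    votes = Counter(reports)
    majority = len(reports) // 2 + 1
    return "unhealthy" if votes["unhealthy"] >= majority else "healthy"

# One observer behind a partition sees the node as down; the quorum disagrees.
assert quorum_health(["healthy", "healthy", "unhealthy"]) == "healthy"
# When most observers agree, failing over is the right call.
assert quorum_health(["unhealthy", "unhealthy", "healthy"]) == "unhealthy"
```

&lt;p&gt;Real detectors are messier (votes arrive late, observers themselves fail), but the principle of requiring agreement before acting carries over.&lt;/p&gt;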

&lt;h2 id=&quot;the-system-must-be-able-to-return-to-fully-redundant-mode&quot;&gt;The system must be able to return to fully redundant mode&lt;/h2&gt;

&lt;p&gt;If your redundancy is a single shot, it’s not going to add much availability in the long term. So you need to make sure the system can safely get from one to two, or N to N+1, or N to 2N. This is relatively easy in some kinds of systems, but anything with a non-zero RPO or asynchronous replication or periodic backups can make it extremely difficult. In small systems, human judgement can help. In larger systems, you need an automated plan. Most likely, you’re going to make a better automated plan during daylight in the middle of the week during your design phase than at 3AM on a Saturday while trying to fix the outage.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Is Anatoly Dyatlov to blame?</title>
      <link>http://brooker.co.za/blog/2019/06/17/chernobyl.html</link>
      <pubDate>Mon, 17 Jun 2019 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2019/06/17/chernobyl</guid>
      <description>&lt;h1 id=&quot;is-anatoly-dyatlov-to-blame&quot;&gt;Is Anatoly Dyatlov to blame?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Without a good safety culture, operators are bound to fail.&lt;/p&gt;

&lt;p&gt;(Spoiler warning: contains spoilers for the HBO series Chernobyl, and for history).&lt;/p&gt;

&lt;p&gt;Recently, I enjoyed watching HBO’s new series Chernobyl. Like everybody else on the internet, I have some thoughts about it. I’m not a nuclear physicist or engineer, but I do think a lot about safety and the role of operators.&lt;/p&gt;

&lt;p&gt;The show tells the story of the accident at Chernobyl in April 1986, the tragic human impact, and the cleanup and investigation in the years that followed. One of the villains in the show is Anatoly Dyatlov, the deputy chief engineer of the plant. Dyatlov was present in the control room of reactor 4 when it exploded, and received a huge dose of radiation (the second, or perhaps third, large dose in his storied life of being near reactor accidents). HBO’s portrayal of Dyatlov is of an arrogant and aggressive man whose refusal to listen to operators was a major cause of the accident. Some first-hand accounts agree&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;, &lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;, &lt;a href=&quot;#foot6&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, and others disagree&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Either way, Dyatlov spent over three years in prison for his role in the accident.&lt;/p&gt;

&lt;p&gt;There’s little debate that the reactor’s design was deeply flawed. The International Nuclear Safety Advisory Group (INSAG) found&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; that certain features of the reactor “had a primary influence on the course of the accident and its consequences”. During the time before the accident, operators had put the reactor into a mode where it was unstable, with reactivity increases leading to higher temperatures, and further reactivity increases. The IAEA (and Russian scientists) also found that the design of the control rods was flawed, both in that they initially increased (rather than decreased) reactivity when first inserted, and in that the machinery to insert them moved too slowly. They also found issues with the control systems, cooling systems, and the fact that some critical safety measures could be manually disabled. Authorities had been aware of many of these issues since an accident at the Ignalina plant in 1983&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4, page 13&lt;/a&gt;&lt;/sup&gt;, but no major design or operational practice changes had been made by the time of the explosion in 1986.&lt;/p&gt;

&lt;p&gt;In the HBO series’ telling of the last few minutes before the event, Dyatlov was shown to dismiss concerns from his team that the reactor shouldn’t be run for long periods of time at low power. Initially, Soviet authorities claimed that the dangers of doing this were made clear to operators (and Dyatlov ignored procedures). Later investigations by the IAEA found no evidence that running the reactor in this dangerous mode was forbidden &lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4, page 11&lt;/a&gt;&lt;/sup&gt;. The same is true of other flaws in the plant. Operators weren’t clearly told that pushing the emergency shutdown (aka SCRAM, aka AZ-5) button could temporarily increase the reaction rate in some parts of the reactor. The IAEA also found that the reactors were safe in “steady state”, and the accident would not have occurred without the actions of operators.&lt;/p&gt;

&lt;p&gt;Who is to blame for the 1986 explosion at Chernobyl?&lt;/p&gt;

&lt;p&gt;In 1995, Dyatlov wrote an article in which he criticized both the Soviet and IAEA investigations&lt;sup&gt;&lt;a href=&quot;#foot5&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, and asked a powerful question:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;How and why should the operators have compensated for design errors they did not know about?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If operators make mistakes while operating systems which have flaws they don’t know about, is that “human error”? Does it matter if their ignorance of those flaws is because of their own inexperience, bureaucratic incompetence, or some vast KGB-lead conspiracy? Did Dyatlov deserve death for his role in the accident, as the series suggests? As Richard Cook says in “How Complex Systems Fail”&lt;sup&gt;&lt;a href=&quot;#foot7&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Catastrophe requires multiple failures – single point failures are not enough. … Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;After accidents, the overt failure often appears to have been inevitable and the practitioner’s actions as blunders or deliberate willful disregard of certain impending failure. … That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If Dyatlov and the other operators of the plant had known about the design issues with the reactor that had been investigated following the accident at Ignalina in 1983, would they have made the same mistake? It’s hard to believe they would have. If the reactor design had been improved following the same accident, would the catastrophe had occurred? The consensus seems to be that it wouldn’t have, and if it did then it would have taken a different form.&lt;/p&gt;

&lt;p&gt;From “How Complex Systems Fail”:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Most initial failure trajectories are blocked by designed system safety components. Trajectories that reach the operational level are mostly blocked, usually by practitioners.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The show’s focus on the practitioners’ failure to block the catastrophe, and perhaps on their unintentional triggering of it, seems unfortunate to me. The operators - despite their personal failings - had not been set up for success, either by arming them with the right knowledge, or by giving them the right incentives.&lt;/p&gt;

&lt;p&gt;From my perspective, the show is spot-on in its treatment of the “cost of lies”. Lies, and the incentive to lie, make it almost impossible to build a good safety culture. But not lying is not enough. A successful culture needs to find the truth, and then actively use it to both improve the system and empower operators. Until the culture can do that, we shouldn’t be surprised when operators blunder or even bluster their way into disaster.&lt;/p&gt;

&lt;h2 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; BBC, &lt;a href=&quot;https://www.bbc.com/news/world-europe-48580177&quot;&gt;Chernobyl survivors assess fact and fiction in TV series&lt;/a&gt;, 2019&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Svetlana Alexievich, “Voices from Chernobyl”.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Serhii Plokhy, “Chernobyl: The History of a Nuclear Catastrophe”. This is my favorite book about the disaster (I’ve probably read over 20 books on it), covering a good breadth of history and people without being too dramatic. There are a couple of minor errors in the book (like confusing GW and GWh in multiple places), but those can be overlooked.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; &lt;a href=&quot;https://www-pub.iaea.org/MTCD/publications/PDF/Pub913e_web.pdf&quot;&gt;INSAG-7 The Chernobyl Accident: Updating of INSAG-1&lt;/a&gt;, IAEA, 1992&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot5&quot;&gt;&lt;/a&gt; Anatoly Dyatlov, &lt;a href=&quot;https://www.neimagazine.com/features/featurewhy-insag-has-still-got-it-wrong&quot;&gt;Why INSAG has still got it wrong&lt;/a&gt;, NEI, 1995&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot6&quot;&gt;&lt;/a&gt; Adam Higginbotham, “Midnight in Chernobyl: The Untold Story of the World’s Greatest Nuclear Disaster”&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot7&quot;&gt;&lt;/a&gt; Richard Cook, &lt;a href=&quot;https://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf&quot;&gt;How Complex Systems Fail&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Some risks of coordinating only sometimes</title>
      <link>http://brooker.co.za/blog/2019/05/01/emergent.html</link>
      <pubDate>Wed, 01 May 2019 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2019/05/01/emergent</guid>
      <description>&lt;h1 id=&quot;some-risks-of-coordinating-only-sometimes&quot;&gt;Some risks of coordinating only sometimes&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Sometimes-coordinating systems have dangerous emergent behaviors&lt;/p&gt;

&lt;p&gt;A classic cloud architecture is built of small clusters of nodes (typically one to nine&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;), with coordination used inside
each cluster to provide availability, durability and integrity in the face of node failures. Coordination between
clusters is avoided, making it easier to scale the system while meeting tight availability and latency requirements. In
reality, however, systems sometimes do need to coordinate between clusters, or clusters need to coordinate with a
central controller. Some of these circumstances are operational, such as around adding or removing capacity. Others are
triggered by the application, where the need to present a client API which appears consistent requires either the system itself, or a layer above it, to coordinate across otherwise-uncoordinated clusters.&lt;/p&gt;

&lt;p&gt;The costs and risks of re-introducing coordination to handle API requests or provide strong client guarantees are well
explored in the literature. Unfortunately, other aspects of sometimes-coordinated systems do not get as much attention,
and many designs are not robust in cases where coordination is required for large-scale operations. Results like CAP and CALM&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; provide clear tools for thinking through when coordination must occur, but offer little help in understanding the dynamic behavior of the system when it does occur.&lt;/p&gt;

&lt;p&gt;One example of this problem is reacting to correlated failures. At scale, uncorrelated node failures happen all the
time. Designing to handle them is straightforward, as the code and design is continuously validated in production.
Large-scale correlated failures also happen, triggered by power and network failures, offered load, software bugs,
operator mistakes, and all manner of unlikely events. If systems are designed to coordinate during failure handling,
either as a mesh or by falling back to a controller, these correlated failures bring sudden bursts of coordination and
traffic. These correlated failures are rare, so the way the system reacts to them is typically untested at the scale at
which it is currently operating when they do happen. This increases time-to-recovery, and sometimes requires that
drastic action be taken to recover the system. Overloaded controllers, suddenly called upon to operate at thousands of
times their usual traffic, are a common cause of long time-to-recovery outages in large-scale cloud systems.&lt;/p&gt;

&lt;p&gt;A related issue is the work that each individual cluster needs to perform during recovery or even scale-up. In practice,
it is difficult to ensure that real-world systems have both the capacity required to run, and spare capacity for
recovery. As soon as a system can’t do both kinds of work, it runs the risk of entering a mode where it is too
overloaded to scale up. The causes of failure here are technical (load measurement is difficult, especially in
systems with rich APIs), economic (failure headroom is used very seldom, making it an attractive target to be optimized
away), and social (people tend to be poor at planning for relatively rare events).&lt;/p&gt;

&lt;p&gt;Another risk of sometimes-coordination is changing quality of results. It’s well known how difficult it is to program
against APIs which offer inconsistent consistency, but this problem goes beyond just API behavior. A common design for
distributed workload schedulers and placement systems is to avoid coordination on the scheduling path (which may be
latency and performance critical), and instead distribute or discover stale information about the overall state of the
system. In steady state, when staleness is approximately constant, the output of these systems is predictable. During
failures, however, staleness may increase substantially, leading the system to make worse choices. This may increase
churn and stress on capacity, further altering the workload characteristics and pushing the system outside its comfort
zone.&lt;/p&gt;

&lt;p&gt;The underlying cause of each of these issues is that the worst-case behavior of these systems may diverge significantly
from their average-case behavior, and that many of these systems are bistable with a stable state in normal operation,
and a stable state at “overloaded”. Within AWS, we are starting to settle on some patterns that help constrain the
behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination,
independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability for systems to continue to operate as best they can when they aren’t able to coordinate.&lt;/p&gt;
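&lt;p&gt;Here’s a toy sketch of the constant-work pattern (an illustration of the idea, not AWS code): the distributor ships the complete state table on every tick, whether one record changed or all of them did, so a mass failure can’t create a surge of coordination traffic.&lt;/p&gt;

```python
def push_full_state(state, subscribers):
    """Constant-work distribution: every tick, send the complete state
    table to every subscriber, regardless of how much (or how little)
    changed. The cost is identical in the quiet case and the disaster
    case, so a large failure can't push the distributor into a mode
    it has never operated in before.
    """
    snapshot = dict(state)  # full copy every tick, by design
    for subscriber in subscribers:
        subscriber.apply(snapshot)
    return len(subscribers) * len(snapshot)  # work done this tick

class Subscriber:
    def __init__(self):
        self.view = {}
    def apply(self, snapshot):
        self.view = snapshot  # replace wholesale; no delta bookkeeping

state = {"node-1": "up", "node-2": "up", "node-3": "down"}
subs = [Subscriber() for _ in range(4)]

quiet_tick = push_full_state(state, subs)
state["node-1"] = "down"  # mass failure...
state["node-2"] = "down"
busy_tick = push_full_state(state, subs)

assert quiet_tick == busy_tick  # same work whether nothing or everything changed
```

&lt;p&gt;The constant copying is wasteful in the average case, which is exactly the trade being made: predictable worst-case behavior bought with steady-state efficiency.&lt;/p&gt;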

&lt;p&gt;More work is needed in this space, both in understanding how to build systems which strongly avoid congestive collapse
during all kinds of failures, and in building tools to characterize and test the behavior of real-world systems.
Distributed systems and control theory are natural partners.&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes:&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Cluster sizing is a super interesting topic in its own right. Nine seems arbitrary here, but isn’t: it suits the most durable consensus systems, because nine nodes spread across three datacenters can tolerate one datacenter failure (losing three nodes) plus one host failure while still having a healthy majority. Chain replicated and erasure coded systems will obviously choose differently, as will anything with read replicas, or cost, latency or other constraints.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; See &lt;a href=&quot;https://arxiv.org/pdf/1901.01930.pdf&quot;&gt;Keeping CALM: When Distributed Consistency is Easy&lt;/a&gt; by Hellerstein and Alvaro. It’s a great paper, and a very powerful conceptual tool.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Learning to build distributed systems</title>
      <link>http://brooker.co.za/blog/2019/04/03/learning.html</link>
      <pubDate>Wed, 03 Apr 2019 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2019/04/03/learning</guid>
      <description>&lt;h1 id=&quot;learning-to-build-distributed-systems&quot;&gt;Learning to build distributed systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;A long email reply&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A common question I get at work is “how do I learn to build big distributed systems?”. I’ve written replies to that many times. Here’s my latest attempt.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Learning how to design and build big distributed systems is hard. I don’t mean that the theory is harder than any other field in computer science. I also don’t mean that information is hard to come by. There’s a wealth of information online, many distributed systems papers are very accessible, and you can’t visit a computer science school without tripping over a distributed systems course. What I mean is that learning the practice of building and running big distributed systems requires big systems. Big systems are expensive, and expensive means that the stakes are high. In industry, millions of customers depend on the biggest systems. In research and academia, the risks of failure are different, but no less immediate. Still, despite the challenges, doing and making mistakes is the most effective way to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn through the work of others&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most obvious answer, but still one worth paying attention to. If you’re academically minded, &lt;a href=&quot;https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/&quot;&gt;reading lists&lt;/a&gt; and &lt;a href=&quot;https://jeffhuang.com/best_paper_awards.html&quot;&gt;lists of best papers&lt;/a&gt; can give you a place to start to find interesting and relevant reading material. If you need a gentler introduction, blogs like &lt;a href=&quot;https://blog.acolyer.org/&quot;&gt;Adrian Colyer’s Morning Paper&lt;/a&gt; summarize and explain papers, and can also be a great way to discover important papers. There are a lot of distributed systems books I love, but I haven’t found an accessible introduction I particularly like yet.&lt;/p&gt;

&lt;p&gt;If you prefer to start with practice, many of the biggest distributed systems shops on the planet publish papers, blogs, and talks describing their work. Even Amazon, which has a reputation for being a bit secretive with our technology, has published papers like the &lt;a href=&quot;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;classic Dynamo paper&lt;/a&gt;, &lt;a href=&quot;https://www.allthingsdistributed.com/files/p1041-verbitski.pdf&quot;&gt;recent&lt;/a&gt; &lt;a href=&quot;https://dl.acm.org/citation.cfm?id=3183713.3196937&quot;&gt;papers&lt;/a&gt; on the Aurora database, and many more. Talks can be a valuable resource too. Here’s Jaso Sorenson &lt;a href=&quot;https://www.youtube.com/watch?v=yvBR71D0nAQ&quot;&gt;describing the design of DynamoDB&lt;/a&gt;, me and Holly Mesrobian &lt;a href=&quot;https://www.youtube.com/watch?v=QdzV04T_kec&quot;&gt;describing a bit of how Lambda works&lt;/a&gt;, and Colm MacCarthaigh &lt;a href=&quot;https://www.youtube.com/watch?v=O8xLxNje30M&quot;&gt;talking about some principles for building control planes&lt;/a&gt;. There’s enough material out there to keep you busy forever. The hard part is knowing when to stop.&lt;/p&gt;

&lt;p&gt;Sometimes (as I’ve &lt;a href=&quot;http://brooker.co.za/blog/2014/08/10/the-space-between.html&quot;&gt;written about before&lt;/a&gt;) it can be hard to close the gap between &lt;em&gt;theory&lt;/em&gt; papers and &lt;em&gt;practice&lt;/em&gt; papers. I don’t have a good answer to that problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get hands-on&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learning the theory is great, but I find that building systems is the best way to cement knowledge. Implement Paxos, or Raft, or Viewstamped Replication, or whatever you find interesting. Then test it. &lt;a href=&quot;https://github.com/jepsen-io/jepsen&quot;&gt;Fault injection&lt;/a&gt; is a great approach for that. Make notes of the mistakes you make (and you will make mistakes, for sure). Docker, EC2 and Fargate make it easier than ever to build test clusters, locally or in the cloud. I like Go as a language for building implementations of things. It’s well-suited to writing network services. It compiles fast, and produces executables that are easy to move around.&lt;/p&gt;
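&lt;p&gt;A fault-injection harness can start very small. Here’s a minimal sketch (in Python for brevity; all names and rates are invented) of a message layer that drops and duplicates messages deterministically, so any protocol bug you find can be replayed from the same seed:&lt;/p&gt;

```python
import random

class FlakyNetwork:
    """Toy message layer that injects faults: drops and duplicates.

    A seeded RNG makes runs reproducible, which helps when taking
    notes on the mistakes you find.
    """
    def __init__(self, drop_rate=0.5, dup_rate=0.1, seed=1):
        self.rng = random.Random(seed)
        self.drop_rate = drop_rate
        self.dup_rate = dup_rate
        self.inboxes = {}

    def send(self, node, msg):
        if self.drop_rate > self.rng.random():
            return  # injected fault: the message is silently lost
        copies = 2 if self.dup_rate > self.rng.random() else 1
        self.inboxes.setdefault(node, []).extend([msg] * copies)

    def deliver(self, node):
        msgs = self.inboxes.get(node, [])
        self.inboxes[node] = []
        return msgs

# A client that retries until acknowledged should survive a lossy link:
net = FlakyNetwork(drop_rate=0.5, seed=1)
for attempt in range(50):
    net.send("replica", ("write", "x", 1))
assert ("write", "x", 1) in net.deliver("replica")
```

&lt;p&gt;The interesting tests assert your protocol’s invariants (no two leaders, no lost acknowledged writes) while sweeping seeds and fault rates.&lt;/p&gt;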

&lt;p&gt;&lt;strong&gt;Go broad&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learning things outside the distributed systems silo is important, too. I learned control theory as an undergrad, and while I’ve forgotten most of the math I find the way of thinking very useful. Statistics is useful, too. ML. Human factors. Formal methods. Sociology. Whatever. I don’t think there’s shame in being narrow and deep, but being broader can make it much easier to find creative solutions to problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Become an owner&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re lucky enough to be able to, find yourself a position on a team, at a company, or in a lab that owns something big. I think the Amazon pattern of having the same team build and operate systems is ideal for learning. If you can, carry a pager. Be accountable to your team and your customers that the stuff you build works. Reality cannot be fooled.&lt;/p&gt;

&lt;p&gt;Over the years at AWS we’ve developed some great mechanisms for being accountable. &lt;a href=&quot;https://aws.amazon.com/blogs/opensource/the-wheel/&quot;&gt;The wheel&lt;/a&gt; is one great example, and &lt;a href=&quot;https://wa.aws.amazon.com/wat.concept.coe.en.html&quot;&gt;the COE process&lt;/a&gt; (similar to what the rest of the industry calls &lt;em&gt;blameless postmortems&lt;/em&gt;) is another. &lt;a href=&quot;https://github.com/danluu/post-mortems&quot;&gt;Dan Luu’s list of postmortems&lt;/a&gt; has a lot of lessons from around the industry. I’ve always enjoyed these processes, because they expose the weaknesses of systems, and provide a path to fixing them. Sometimes it can feel unforgiving, but the blameless part works well. Some COEs contain as many great distributed systems lessons as the best research papers.&lt;/p&gt;

&lt;p&gt;Research has different mechanisms. The goal (over a longer time horizon) is the same: good ideas and systems survive, and bad ideas and systems fall away. People build on the good ones with more good ideas, and the whole field moves forward. Being an owner is important.&lt;/p&gt;

&lt;p&gt;Another tool I like for learning is the &lt;em&gt;what-if COE&lt;/em&gt; or &lt;em&gt;premortem&lt;/em&gt;. These are COEs for outages that haven’t happened yet, but could happen. When building a new system, try writing its first COE before the outage happens. What are the weaknesses in your system? How will it break? When replacing an older system with a new one, look at some of the older one’s COEs. How would your new system perform in the same circumstances?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It takes time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This all takes time, both in the sense that you need to allocate hours of the day to it, and in the sense that you’re not going to learn everything overnight. I’ve been doing this stuff for 15 years in one way or another, and still feel like I’m scratching the surface. Don’t feel bad about others knowing things you don’t. It’s an opportunity, not a threat.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Control Planes vs Data Planes</title>
      <link>http://brooker.co.za/blog/2019/03/17/control.html</link>
      <pubDate>Sun, 17 Mar 2019 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2019/03/17/control</guid>
      <description>&lt;h1 id=&quot;control-planes-vs-data-planes&quot;&gt;Control Planes vs Data Planes&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Are there multiple things here?&lt;/p&gt;

&lt;p&gt;If you want to build a successful distributed system, one of the most important things to get right is the block diagram: what the components are, what each of them owns, and how they communicate with other components. It’s such a basic design step that many of us don’t think about how important it is, and how difficult and expensive it can be to make changes to the overall architecture once the system is in production. Getting the block diagram right helps with the design of database schemas and APIs, helps reason through the availability and cost of running the system, and even helps form the right org chart to build the design.&lt;/p&gt;

&lt;p&gt;One very common pattern when doing these design exercises is to separate components into a &lt;em&gt;control plane&lt;/em&gt; and a &lt;em&gt;data plane&lt;/em&gt;, recognizing the differences in requirements between these two roles.&lt;/p&gt;

&lt;h3 id=&quot;no-true-monoliths&quot;&gt;No true monoliths&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;microservices&lt;/em&gt; and &lt;em&gt;SOA&lt;/em&gt; design approaches tend to push towards more blocks, with each block performing a smaller number of functions. The &lt;em&gt;monolith&lt;/em&gt; approach is the other end of the spectrum, where the diagram consists of a single block. Arguments about these two approaches can be endless, but are ultimately not important. It’s worth noting, though, that there are almost no true monoliths. Some kinds of concerns are almost always separated out. Here’s a partial list:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Storage. Most modern applications separate business logic from storage and caching, and talk through APIs to their storage.&lt;/li&gt;
  &lt;li&gt;Load Balancing. Distributed applications need some way for clients to distribute their load across multiple instances.&lt;/li&gt;
  &lt;li&gt;Failure tolerance. Highly available systems need to be able to handle the failure of hardware and software without affecting users.&lt;/li&gt;
  &lt;li&gt;Scaling. Systems which need to handle variable load may add and remove resources over time.&lt;/li&gt;
  &lt;li&gt;Deployments. Any system needs to change over time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even in the most monolithic application, these are separate components of the system, and need to be built into the design. What’s notable here is that these concerns can be broken into two clean categories: data plane and control plane. Along with the monolithic application itself, &lt;em&gt;storage&lt;/em&gt; and &lt;em&gt;load balancing&lt;/em&gt; are data plane concerns: they are required to be up for any request to succeed, and scale O(N) with the number of requests the system handles. On the other hand, &lt;em&gt;failure tolerance&lt;/em&gt;, &lt;em&gt;scaling&lt;/em&gt; and &lt;em&gt;deployments&lt;/em&gt; are control plane concerns: they scale differently (either with a small fraction of N, with the rate of change of N, or with the rate of change of the software) and can break for some period of time before customers notice.&lt;/p&gt;

&lt;h3 id=&quot;two-roles-control-plane-and-data-plane&quot;&gt;Two roles: control plane and data plane&lt;/h3&gt;

&lt;p&gt;Every distributed system has components that fall roughly into these two roles: data plane components that sit on the request path, and control plane components which help that data plane do its work. Sometimes, the control plane components aren’t components at all, but rather people and processes, but the pattern is the same. With this pattern worked out, the block diagram of the system starts to look something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/control_data_binary.png&quot; alt=&quot;Data plane and control plane separated into two blocks&quot; /&gt;&lt;/p&gt;

&lt;p&gt;My colleague Colm MacCárthaigh likes to think of control planes from a control theory approach, separating the system (the data plane) from the controller (the control plane). That’s a very informative approach, and you can hear him talk about it here:&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/O8xLxNje30M&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;I tend to take a different approach, looking at the scaling and operational properties of systems. As in the example above, data plane components are the ones that scale with every request&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;, and need to be up for every request. Control plane components don’t need to be up for every request, and instead only need to be up when there is work to do. Similarly, they scale in different ways. Some control plane components, such as those that monitor fleets of hosts, scale with O(N/M), where &lt;em&gt;N&lt;/em&gt; is the number of requests and &lt;em&gt;M&lt;/em&gt; is the number of requests each host can handle. Other control plane components, such as those that handle scaling the fleet up and down, scale with O(dN/dt). Finally, control plane components that perform work like deployments scale with code change velocity.&lt;/p&gt;
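&lt;p&gt;To make those scaling regimes concrete, here’s a toy back-of-envelope model in Python (every number is invented for illustration):&lt;/p&gt;

```python
REQS_PER_HOST = 1_000  # M: requests/sec one data-plane host serves (assumed)

def data_plane_hosts(n):
    """The data plane scales O(N) with the request rate n."""
    return n / REQS_PER_HOST

def monitoring_work(n):
    """Fleet monitoring scales O(N/M): one health check per host."""
    return data_plane_hosts(n)

def scaling_work(n_then, n_now):
    """Scaling the fleet scales O(dN/dt): hosts launched or retired."""
    return abs(data_plane_hosts(n_now) - data_plane_hosts(n_then))

# Traffic doubles from 1M to 2M requests/sec:
assert data_plane_hosts(2_000_000) == 2 * data_plane_hosts(1_000_000)
assert monitoring_work(2_000_000) == 2_000.0  # tiny next to 2M req/s
assert scaling_work(1_000_000, 2_000_000) == 1_000.0  # only the delta
# Steady traffic means no scaling work at all, however large N is:
assert scaling_work(2_000_000, 2_000_000) == 0.0
```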

&lt;p&gt;Finding the right separation between control and data planes is, in my experience, one of the most important things in a distributed systems design.&lt;/p&gt;

&lt;h3 id=&quot;another-view-compartmentalizing-complexity&quot;&gt;Another view: compartmentalizing complexity&lt;/h3&gt;

&lt;p&gt;In their classic paper on &lt;a href=&quot;https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf&quot;&gt;Chain Replication&lt;/a&gt;, van Renesse and Schneider write about how chain replicated systems handle server failure:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In response to detecting the failure of a server that is part of a chain (and, by the fail-stop assumption, all such failures are detected), the chain is reconfigured to eliminate the failed server.  For this purpose, we employ a service, called the &lt;em&gt;master&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fair enough. Chain replication can’t handle these kinds of failures without adding significant complexity to the protocol. So what do we expect of the master?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In what follows, we assume the master is a single process that never fails.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Oh. Never fails, huh? They then go on to say that they approach this by replicating the &lt;em&gt;master&lt;/em&gt; on multiple hosts using Paxos. If they have a Paxos implementation available, then why not just use that and not bother with this Chain Replication thing at all? The paper doesn’t say&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, but I have my own opinion: it’s interesting to separate them because Chain Replication offers a different set of performance, throughput, and code complexity trade-offs than Paxos&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. It is possible to build a single code base (and protocol) which handles both, but at the cost of coupling these two different concerns. Instead, by making the &lt;em&gt;master&lt;/em&gt; a separate component, the chain replicated data plane implementation can focus on the things it needs to do (scale, performance, optimizing for every byte). The control plane, which only needs to handle the occasional failure, can focus on what it needs to do (extreme availability, locality, etc). Each of these different requirements adds complexity, and separating them out allows a system to compartmentalize its complexity, and reduce coupling by offering clear APIs and contracts between components.&lt;/p&gt;

&lt;h3 id=&quot;breaking-down-the-binary&quot;&gt;Breaking down the binary&lt;/h3&gt;

&lt;p&gt;Say you build an awesome data plane based on chain replication, and an awesome control plane (&lt;em&gt;master&lt;/em&gt;) for that data plane. At first, because of its lower scale, you can operate the control plane manually. Over time, as your system becomes successful, you’ll start to have too many instances of the control plane to manage by hand, so you build a control plane for that control plane to automate the management. This is the first way the control/data binary breaks down: at some point control planes need their own control planes. Your &lt;em&gt;controller&lt;/em&gt; is somebody else’s &lt;em&gt;system under control&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;One other way the binary breaks down is with specialization. The &lt;em&gt;master&lt;/em&gt; in the chain replicated system handles fault tolerance, but may not handle scaling, or sharding of chains, or interacting with customers to provision chains. In real systems there are frequently multiple control planes which control different aspects of the behavior of a system. Each of these control planes has its own differing requirements, requiring different tools and different expertise. Control planes are not homogeneous.&lt;/p&gt;

&lt;p&gt;These two problems highlight that the idea of control planes and data planes may be too reductive to be a core design principle. Instead, it’s a useful tool for helping identify opportunities to reduce and compartmentalize complexity by introducing good APIs and contracts, to ensure components have a clear set of responsibilities and ownership, and to use the right tools for solving different kinds of problems. Separating the control and data planes should be a heuristic tool for good system design, not a goal of system design.&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes:&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Or potentially with every request. Things like caches complicate this a bit.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; It does compare Chain Replication to other solutions, but doesn’t specifically talk about the benefits of separation. Murat Demirbas pointed out that Chain Replication’s ability to serve linearizable reads from the tail is important. He also pointed me at the &lt;a href=&quot;https://www.usenix.org/legacy/event/usenix09/tech/full_papers/terrace/terrace.pdf&quot;&gt;Object Storage on CRAQ&lt;/a&gt; paper, which talks about how to serve reads from intermediate nodes. Thanks, Murat!&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; For one definition of Paxos. Lamport’s &lt;a href=&quot;https://www.microsoft.com/en-us/research/publication/vertical-paxos-and-primary-backup-replication/#&quot;&gt;Vertical Paxos&lt;/a&gt; paper sees chain replication as a flavor of Paxos, and more recent work by Heidi Howard et al on &lt;a href=&quot;https://arxiv.org/pdf/1608.06696v1.pdf&quot;&gt;Flexible Paxos&lt;/a&gt; makes the line even less clear.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Telling Stories About Little's Law</title>
      <link>http://brooker.co.za/blog/2018/06/20/littles-law.html</link>
      <pubDate>Wed, 20 Jun 2018 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2018/06/20/littles-law</guid>
      <description>&lt;h1 id=&quot;telling-stories-about-littles-law&quot;&gt;Telling Stories About Little’s Law&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Building Up Intuition with Narrative&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Little&apos;s_law&quot;&gt;Little’s Law&lt;/a&gt; is widely used as a tool for understanding the behavior of distributed systems. The law says that the mean concurrency in the system (𝐿) is equal to the mean rate at which requests arrive (λ) multiplied by the mean time that each request spends in the system (𝑊):&lt;/p&gt;

&lt;p&gt;𝐿 = λ𝑊&lt;/p&gt;

&lt;p&gt;As I’ve &lt;a href=&quot;//brooker.co.za/blog/2017/12/28/mean.html&quot;&gt;written about before&lt;/a&gt;, Little’s law is useful because it gives us a clear way to reason about the capacity of a system, which is often difficult to observe directly, based on quantities like arrival rate (requests per second) and latency which are easier to measure directly. Concurrency is a useful measure of capacity in real systems, because it directly measures consumption of resources like threads, memory, connections, file handles and anything else that’s numerically limited. It also provides an indirect way to think about contention: if the concurrency in a system is high, then it’s likely that contention is also high.&lt;/p&gt;
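&lt;p&gt;A quick worked example (with invented numbers) shows how little arithmetic is involved:&lt;/p&gt;

```python
# L = lambda * W, with invented numbers.
arrival_rate = 500.0  # lambda: requests per second
mean_time = 0.2       # W: seconds each request spends in the system
concurrency = arrival_rate * mean_time  # L

assert concurrency == 100.0
# At 500 req/s and 200 ms mean latency, about 100 requests are in
# flight on average: a thread-per-request server needs roughly 100
# threads, and a 50-connection pool would be the bottleneck.
```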

&lt;p&gt;I like Little’s Law as a mathematical tool, but also as a narrative tool. It provides a powerful way to frame stories about system behavior.&lt;/p&gt;

&lt;h3 id=&quot;feedback&quot;&gt;Feedback&lt;/h3&gt;
&lt;p&gt;The way Little’s Law is written, each of the terms is a long-term average, and λ and 𝑊 are independent. In the real world, distributed systems don’t tend to actually behave this nicely.&lt;/p&gt;

&lt;p&gt;Request time (𝑊) tends to increase as concurrency (𝐿) increases. &lt;a href=&quot;https://en.wikipedia.org/wiki/Amdahl%27s_law&quot;&gt;Amdahl’s Law&lt;/a&gt; provides the simplest model of this: each request has some portion of work which is trivially parallelizable, and some portion of work that is forced to be serialized in some way. Amdahl’s law is also wildly optimistic: most real-world systems don’t see throughput level out under contention, but rather see throughput drop as contention rises beyond some limit. The &lt;a href=&quot;http://www.perfdynamics.com/Manifesto/USLscalability.html&quot;&gt;universal scalability law&lt;/a&gt; captures one model of this behavior. The fundamental reason for this is that contention itself has a cost.&lt;/p&gt;

&lt;p&gt;Even in the naive, beautiful, Amdahl world, latency increases as load increases because throughput starts to approach some maximum. In the USL world, this increase can be dramatically non-linear. In both cases 𝑊 is a function of 𝐿.&lt;/p&gt;

&lt;p&gt;Arrival rate (λ) also depends on request time (𝑊), and typically in a non-linear way. There are three ways to see this relationship:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Arrival rate drops as request time increases (λ ∝ 1/𝑊). In this model there is a finite number of clients, and each has its own finite concurrency (or, in the simplest case, calls serially in a loop). As each call takes longer, clients keep their concurrency fixed, so the arrival rate drops.&lt;/li&gt;
  &lt;li&gt;Arrival rate does not depend on latency. If clients don’t change their behavior based on how long requests take, or on requests failing, then there’s no relationship. The widely-used Poisson process client model behaves this way.&lt;/li&gt;
  &lt;li&gt;Arrival rate increases as request time increases (λ ∝ 𝑊). One cause of this is &lt;em&gt;timeout and retry&lt;/em&gt;: if a client sees a request exceed some maximum time (so 𝑊&amp;gt;𝜏) they may retry. If that timeout 𝜏 is shorter than their typical inter-call time, this will increase the per-client rate. Other kinds of stateful client behavior can also kick in here. For example, clients may interpret long latencies as errors that don’t only need to be retried, but can trigger entire call chains to be retried.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of these effects means that the dynamic behavior of distributed systems has scary cliffs. Systems have plateaus, where 𝑊, 𝐿 and λ are either close-to-independent or inversely proportional and everything behaves well, and cliffs, where direct proportionality kicks in and they spiral down to failure. Throttling, admission control, back pressure, backoff and other mechanisms can play a big role in avoiding these cliffs, but the cliffs still exist.&lt;/p&gt;

&lt;h3 id=&quot;arrival-processes-and-spiky-behavior&quot;&gt;Arrival Processes and Spiky Behavior&lt;/h3&gt;
&lt;p&gt;The mean, &lt;a href=&quot;//brooker.co.za/blog/2017/12/28/mean.html&quot;&gt;like all descriptive statistics&lt;/a&gt;, doesn’t tell the whole story about data. The mean is very convenient in the mathematics of Little’s law, but tends to hide effects caused by high-percentile behavior. Little’s law’s use of long-term means also tends to obscure the fact that real-world statistical processes are frequently non-stationary: they include trends, cycles, spikes and seasonality which are not well-modeled as a single stationary time series. Non-stationary behavior can affect 𝑊, but is most noticeable in the arrival rate λ.&lt;/p&gt;

&lt;p&gt;There are many causes for changes in λ. Seasonality is a big one: the big gift-giving holidays, big sporting events, and other large correlated events can significantly increase arrival rate during some period of time. Human clients tend to exhibit significant daily, weekly, and yearly cycles. People like to sleep. For many systems, though, the biggest cause of spikes is the combination of human biases and computer precision: &lt;em&gt;cron&lt;/em&gt; jobs. When humans pick a time for a task to be done (&lt;em&gt;backup once a day&lt;/em&gt;, &lt;em&gt;ping once a minute&lt;/em&gt;), they don’t tend to pick a uniformly random time. Instead, they cluster the work around the boundaries of months, days, hours, minutes and seconds. This leads to significant spikes of traffic, and pushes the distribution of arrival time away from the Poisson process ideal&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Depending on how you define &lt;em&gt;long term mean&lt;/em&gt;, these cyclic changes in λ can either show up in the distribution of λ as high percentiles, or show up in λ being non-stationary. Depending on the data and the size of the spikes it’s still possible to get useful results out of Little’s law, but they will be less precise and potentially more misleading.&lt;/p&gt;

&lt;h3 id=&quot;telling-stories&quot;&gt;Telling Stories&lt;/h3&gt;
&lt;p&gt;Somewhat inspired by Little’s law, we can build up a difference equation that captures more of real-world behavior:&lt;/p&gt;

&lt;p&gt;W&lt;sub&gt;n+1&lt;/sub&gt; = 𝑊(L&lt;sub&gt;n&lt;/sub&gt;, λ&lt;sub&gt;n&lt;/sub&gt;, t)&lt;/p&gt;

&lt;p&gt;λ&lt;sub&gt;n+1&lt;/sub&gt; = λ(L&lt;sub&gt;n&lt;/sub&gt;, W&lt;sub&gt;n&lt;/sub&gt;, t)&lt;/p&gt;

&lt;p&gt;L&lt;sub&gt;n+1&lt;/sub&gt; = λ&lt;sub&gt;n+1&lt;/sub&gt; 𝑊&lt;sub&gt;n+1&lt;/sub&gt;&lt;/p&gt;

&lt;p&gt;I find that this is a powerful mental model, even if it’s lacking some precision and is hard to use for clean closed-form results. Breaking the behavior of the system down into time steps provides a way to tell a story about the way the system behaves in the next time step, and how the long-term behavior of the system emerges. It’s also useful for building simple simulations of the dynamics of systems.&lt;/p&gt;
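&lt;p&gt;A minimal sketch of such a simulation, with invented forms for 𝑊(·) and λ(·) (a queueing-style latency curve, and clients that retry once on timeout), shows both the healthy fixed point and the cliff:&lt;/p&gt;

```python
# Simulating the difference equations above. Every constant and the
# shapes of W(.) and lam(.) are invented for illustration.
CAPACITY = 200.0   # concurrency at which the system saturates (assumed)
BASE_RATE = 500.0  # offered load, requests/sec (assumed)
TIMEOUT = 0.1      # client timeout, seconds (assumed)

def W(L):
    """Latency grows sharply as concurrency L approaches capacity."""
    util = min(L / CAPACITY, 0.999)
    return 0.01 / (1.0 - util)

def lam(w):
    """Clients retry once when latency w exceeds their timeout."""
    return BASE_RATE * (2.0 if w > TIMEOUT else 1.0)

def simulate(L0, steps=20):
    L = L0
    for _ in range(steps):
        w_n = W(L)      # W_{n+1} = W(L_n)
        l_n = lam(w_n)  # lambda_{n+1} = lambda(W_{n+1}), simplified
        L = l_n * w_n   # L_{n+1} = lambda_{n+1} * W_{n+1}
    return L

assert 10.0 > simulate(10.0)       # settles at a healthy fixed point
assert simulate(190.0) > CAPACITY  # tips over the cliff and stays there
```

&lt;p&gt;Swapping in a smarter λ(·) (backoff, load shedding) is then a cheap way to explore whether a mechanism actually breaks the cycle.&lt;/p&gt;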

&lt;p&gt;Telling stories about our systems, for all its potential imprecision, is a powerful way to build and communicate intuition.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The system was ticking along nicely, then just after midnight a spike of requests arrived from a flash sale. This caused latency to increase because of increased lock contention on the database, which in turn caused 10% of client calls to time-out and be retried. A bug in backoff in our client meant that this increased the call rate to 10x the normal rate for this time of day, further increasing contention.&lt;/em&gt; And so on…&lt;/p&gt;

&lt;p&gt;Each step in the story evolves by understanding the relationship between latency, concurrency and arrival rate. The start of the story is almost always some triggering event that increases latency or arrival rate, and the end is some action or change that breaks the cycle. Each step in the story offers an opportunity to identify something to make the system more robust. Can we reduce the increase in 𝑊 when λ increases? Can we reduce the increase in λ when 𝑊 exceeds a certain bound? Can we break the cycle without manual action?&lt;/p&gt;

&lt;p&gt;The typical resiliency tools, like backoff, backpressure and throttling, are all answers to these types of questions, but are far from the full set of answers. Telling the stories allows us to look for more answers.&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Network engineers have long known that the Poisson model is less bursty than many real systems. &lt;a href=&quot;https://pdfs.semanticscholar.org/3226/e025b4ab4afa664b2c9b0418227ee76ac13c.pdf&quot;&gt;An Empirical Workload Model for Driving Wide-Area TCP/IP Network Simulations&lt;/a&gt; and &lt;a href=&quot;http://cs.uccs.edu/~cchow/pub/master/xhe/doc/p226-paxson-floyd.pdf&quot;&gt;Wide Area Traffic: The Failure of Poisson Modeling&lt;/a&gt; are classics in that genre. I’m not aware of good research on this problem in microservice or SoA architectures, but I’m sure there are some interesting results to be found there.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Availability and availability</title>
      <link>http://brooker.co.za/blog/2018/02/25/availability-liveness.html</link>
      <pubDate>Sun, 25 Feb 2018 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2018/02/25/availability-liveness</guid>
      <description>&lt;h1 id=&quot;availability-and-availability&quot;&gt;Availability and availability&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Translating math into engineering.&lt;/p&gt;

&lt;p&gt;It’s well known that the term Availability in the CAP theorem (as formally defined by &lt;a href=&quot;https://dl.acm.org/citation.cfm?id=564601&quot;&gt;Gilbert and Lynch&lt;/a&gt;) means something different from the term &lt;em&gt;availability&lt;/em&gt; that’s commonly used by the designers, builders and operators of distributed systems. Gilbert and Lynch define availability for the CAP theorem as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;every request received by a non-failing node in the system must result in a response.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s cool, and useful for the mathematical analysis that’s needed to prove the CAP theorem. Most builders and users of distributed systems, on the other hand, define &lt;em&gt;availability&lt;/em&gt; as the percentage of requests that their clients see as successful, or something close to that. The terms, like ‘clients’ and ‘successful’ and ‘see’, are pretty fuzzy. Not much good for analysis, but more useful for capturing what people care about.&lt;/p&gt;

&lt;p&gt;This isn’t a new observation. You can find a whole lot of writing about it online. Some of that writing is pretty great. What I don’t see addressed as often is how to translate one into the other, using the CAP (or PACELC or whatever) reasoning about Availability to help us think about &lt;em&gt;availability&lt;/em&gt;. In reality, are Available systems more available than Consistent systems?&lt;/p&gt;

&lt;p&gt;This post isn’t a complete answer to that question, but does include some of the things worth thinking about in that space.&lt;/p&gt;

&lt;h3 id=&quot;harvest-and-yield&quot;&gt;Harvest and Yield&lt;/h3&gt;

&lt;p&gt;Before I dive into this topic, it’s worth talking about Harvest and Yield, from a paper by &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.411&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Fox and Brewer&lt;/a&gt;. The paper itself has some flaws (as I’ve &lt;a href=&quot;http://brooker.co.za/blog/2014/10/12/harvest-yield.html&quot;&gt;blogged about before&lt;/a&gt;), but the underlying concept is very useful. The core is about graceful degradation, and how it’s useful for systems to return partial or stale answers when they aren’t able to answer authoritatively.&lt;/p&gt;

&lt;p&gt;The paper makes its case well, but whether its conclusions are practically useful depends on what promises you make to your clients. If the direct clients of your service are people, then you’re likely to be able to get away with graceful degradation. If your clients are computers, they’re likely expecting a complete, authoritative response. That’s mostly because when people program computers they don’t think through all of the edge cases introduced by other computers leaving out some information. This isn’t a hard-and-fast rule. Sometimes computers can tolerate partial responses, and sometimes humans can’t.&lt;/p&gt;

&lt;p&gt;In other words, Harvest and Yield is a partial answer, useful when you can use it.&lt;/p&gt;

&lt;h3 id=&quot;taking-availability-to-clients&quot;&gt;Taking Availability to Clients&lt;/h3&gt;

&lt;p&gt;How does CAP’s big-A availability translate to clients? The most useful simple answer is that if you’ve decided you want a Consistent system, then clients on the minority side of a network partition get nothing, and clients on the majority side don’t have any problems. Once the partition heals (or moves around), those minority clients might be able to make progress again. If you’ve chosen an Available system, everybody is able to make progress all the time.&lt;/p&gt;

&lt;p&gt;The reality is fuzzier than this simple answer.&lt;/p&gt;

&lt;p&gt;The first reason is that, in real systems, there isn’t typically a binary choice between A and C. Part of the reason for that is that the definition of Consistency in CAP is also different from the common sense one clients probably care about, and it’s possible to give some clients meaningfully consistent experiences without losing A. The details of that are for another day&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Let’s assume that you’ve chosen a common-sense definition of consistency that requires real strong Consistency properties. Then you run into a second problem: many meaningful workloads from clients don’t only read or write a single atomic piece of data. Some workloads are nice and clean and translate into a single database transaction on a single database. Some are messier, and require pulling data from many different shards of different databases, or from other services. Over time, many of the nice clean workloads turn into ugly messy workloads.&lt;/p&gt;

&lt;p&gt;There’s also a third problem. The vast majority of the patterns that are used for building real-world large-scale systems do different amounts of work in the happy case and the failure case. Master election, for example, is a very commonly-used pattern. Paxos implementations typically optimize away one round most of the time. Raft is explicitly modal.&lt;/p&gt;

&lt;p&gt;Clients on the majority side of a partition are theoretically able to continue, but only if they are on the same majority side of all the data and services they depend on. There’s also likely to be some cost to continuing: the system needs to detect the problem and shift from happy mode into failure-handling mode. Depending on the design of the system, this can take a significant period of time.&lt;/p&gt;

&lt;h3 id=&quot;failure-detection-and-remediation&quot;&gt;Failure Detection and Remediation&lt;/h3&gt;

&lt;p&gt;The first step to surviving a network partition (or any other failure), is figuring out what happened. Sometimes, what happened is a nice clean host failure that everybody can agree on. The real world is uglier: host failures may be partial, network failures may show up as latency or congestion rather than failure, and systems could be cycling between up and down.&lt;/p&gt;

&lt;p&gt;Whether you’ve chosen an Available (A) system or a Consistent (C) system, your system needs to be able to identify failures. How quickly you can do that, and how the system behaves in the mean time, is fundamental to &lt;em&gt;availability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;There are many ways to detect failures: timeouts, direct health pings, latency thresholds, error rate thresholds, TCP connection state (a special case of latency threshold), and even hardware magic like physical-layer connection state. None of those are instantaneous, and most will eat some requests while deciding to fail over. If that happens often, &lt;em&gt;availability&lt;/em&gt; will be decreased.&lt;/p&gt;
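
&lt;p&gt;To make that concrete, here’s a minimal sketch of one of the simplest of these mechanisms: an error-rate threshold over a sliding window. The &lt;code&gt;ErrorRateDetector&lt;/code&gt; class and its parameters are my own invention for illustration. Note how the detector has to eat a full window of failed requests before it trips:&lt;/p&gt;

```python
class ErrorRateDetector:
    """Hypothetical sketch of an error-rate-threshold failure detector:
    declare a peer unhealthy when, over a sliding window of recent calls,
    the fraction of failures reaches a threshold."""

    def __init__(self, window=10, threshold=0.5):
        self.window = window        # how many recent calls to remember
        self.threshold = threshold  # failure fraction that trips the detector
        self.results = []           # True = success, False = failure

    def record(self, success):
        self.results.append(success)
        if len(self.results) > self.window:
            self.results.pop(0)

    def healthy(self):
        # Stay optimistic until there's a full window of evidence; this is
        # one reason detection is never instantaneous.
        if len(self.results) < self.window:
            return True
        failures = self.results.count(False)
        return failures / self.window < self.threshold

detector = ErrorRateDetector()
for _ in range(10):
    detector.record(False)   # the detector "eats" these failed requests
print(detector.healthy())    # prints False: the peer is now considered down
```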

&lt;p&gt;Failure remediation is where the distributed systems protocol literature shines. &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt; or &lt;a href=&quot;http://www.pmg.csail.mit.edu/papers/vr.pdf&quot;&gt;Viewstamped Replication&lt;/a&gt; or &lt;a href=&quot;https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf&quot;&gt;Chord&lt;/a&gt; or one of hundreds of other papers provide answers to that problem to fit all kinds of different situations. I’m not going to go into that topic, but even if you nail the implementation of failure handling, you’ve still not solved your client’s problem.&lt;/p&gt;

&lt;p&gt;When a failure is fixed, who needs to learn about the new location of the data and how quickly they can learn about it? While clients are trying to talk to the old, broken primary or trying to talk to the other side of a network partition, they aren’t going to be making progress. Again, whether you’ve chosen A or C, &lt;em&gt;availability&lt;/em&gt; suffers. Available systems do have a bit of an easier time of this than Consistent systems. They might be able to fail over more aggressively. They also don’t have to solve the age-old “oops, I just flipped into the side of the partition away from my clients” problem.&lt;/p&gt;

&lt;h3 id=&quot;where-do-failures-happen&quot;&gt;Where do failures happen?&lt;/h3&gt;

&lt;p&gt;Network partitions do happen. From the perspective of the client of a Consistent system, the system is down if they are partitioned away from the majority of the nodes in the system. From the perspective of the client of an Available system, the system is down if they are partitioned away from all the nodes in the system.&lt;/p&gt;

&lt;p&gt;Whether that’s a useful distinction or not depends on where the clients are relative to the larger system. If the system is in a single datacenter in a single location, and the clients are spread around the global Internet, a client that can’t reach a majority of the nodes most likely can’t reach any of them. On the other hand, if the clients are in the same datacenter as the system, then the probabilities are going to be different. More generally, if nodes are spread around the network in about the same way as clients are, A and C are going to be practically different.&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;In practice, CAP Available doesn’t mean ‘highly available to clients’. In practice, picking an Available design over a Consistent one means that it’s going to be more available to some clients in a fairly limited set of circumstances. That may very well be worth it, but it’s in no way a panacea for availability.&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Although, do check out &lt;a href=&quot;https://dl.acm.org/citation.cfm?doid=2460276.2462076&quot;&gt;Bailis and Ghodsi&lt;/a&gt; for a very readable introduction to the landscape of consistency.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Balls Into Bins In Distributed Systems</title>
      <link>http://brooker.co.za/blog/2018/01/01/balls-into-bins.html</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2018/01/01/balls-into-bins</guid>
      <description>&lt;h1 id=&quot;balls-into-bins-in-distributed-systems&quot;&gt;Balls Into Bins In Distributed Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Throwing things can be fun.&lt;/p&gt;

&lt;p&gt;If you’ve come across the &lt;a href=&quot;https://en.wikipedia.org/wiki/Balls_into_bins&quot;&gt;Balls Into Bins&lt;/a&gt; problem, you probably heard about it in the context of hash tables. When you hash things into a hash table (especially with &lt;a href=&quot;https://en.wikipedia.org/wiki/Hash_table#Separate_chaining&quot;&gt;separate chaining&lt;/a&gt;) it’s really useful to be able to ask “If I throw 𝑀 balls into 𝑁 bins, what is the distribution of balls in bins?” You can see how this is fundamental to hash tables: the amortized complexity argument for hash tables depends on there being some &lt;em&gt;load factor&lt;/em&gt; (i.e. 𝑀/𝑁) for which most bins contain a small number of items. Once this stops being true, lookup and insertion time on hash tables starts to get ugly. So from that perspective it’s already clearly an important problem.&lt;/p&gt;

&lt;h3 id=&quot;load-balancing-and-work-allocation&quot;&gt;Load Balancing and Work Allocation&lt;/h3&gt;
&lt;p&gt;Hash tables aren’t the only place that the Balls Into Bins problem is interesting. It comes up often in distributed systems, too. For one example, think about a load balancer (in this case a distributor of independent requests) sending load to some number of backends. Requests (𝑀) are balls, the backends are bins (𝑁), and typically there are multiple requests going to each backend (𝑀 &amp;gt; 𝑁). If we know how to solve for the number of balls in each bin, we can understand the limits of random load balancing, or whether we need a stateful load balancing algorithm like &lt;em&gt;least connections&lt;/em&gt;. This is an important question to ask, because sharing consistent state limits scalability, and sharing eventually-consistent state can even &lt;a href=&quot;//brooker.co.za/blog/2012/01/17/two-random.html&quot;&gt;make load balancing decisions worse&lt;/a&gt;. Load balancing is much easier if it can be done statelessly.&lt;/p&gt;

&lt;p&gt;A related problem is push-based work allocation. Here, there is some co-ordinator handing out work items to a fleet of workers, and trying to have those workers do approximately equal amounts of work. One way that systems end up with this pattern is if they are using &lt;a href=&quot;https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/&quot;&gt;shuffle sharding&lt;/a&gt; or consistent hashing to distribute work items (or records). These hashing-based methods can be great for scaling, and so are widely used across all kinds of large-scale systems. Just as with load balancing, it’s interesting to be able to understand how well requests are distributed.&lt;/p&gt;

&lt;p&gt;Traditionally, papers about this problem have been most concerned about the expectation of the maximum number of balls in a bin (“how bad can it get?”), but other statistics like the expectation of the mean and expectation of the median can be interesting when planning and designing for load. It’s also interesting to understand the variance of the maximum, and the size of the right tail on the distribution of the maximum. If the maximum can get really high, but will do so infrequently, then load testing can be difficult.&lt;/p&gt;

&lt;h3 id=&quot;closed-form-analysis&quot;&gt;Closed-Form Analysis&lt;/h3&gt;
&lt;p&gt;Gaston Gonnet’s &lt;a href=&quot;https://cs.uwaterloo.ca/research/tr/1978/CS-78-46.pdf&quot;&gt;Expected Length of the Longest Probe Sequence in Hash Code Searching&lt;/a&gt; was one of the first papers to tackle analyzing the problem, in the context of separate-chaining hash tables&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Michael Mitzenmacher’s PhD thesis (&lt;a href=&quot;https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf&quot;&gt;the power of two choices in randomized load balancing&lt;/a&gt;) simplifies Gonnet’s analysis and finds, for the 𝑀=𝑁 case, that the maximum number of balls is&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; 𝝝(log 𝑁/log log 𝑁). That’s not a curve you’ll come across often, so this is what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/logn_loglogn.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In other words, it grows, but grows pretty slowly. Most real-world cases are probably going to have 𝑀&amp;gt;𝑁, and many will have 𝑀≫𝑁. To understand those cases, we can turn to one of my favorite papers in this area, &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.4186&quot;&gt;Balls Into Bins: A Simple and Tight Analysis&lt;/a&gt; by Raab and Steger. They provide a great overview of the problem, a useful survey of the literature, and a table of bounds on the maximum number of balls in any bin, depending on the relationship between 𝑀 and 𝑁. The proofs are interesting, and somewhat enlightening, but not necessary to understand to find the paper useful.&lt;/p&gt;

&lt;p&gt;While this kind of analysis is very useful, when I’ve needed to solve these problems in the past, I haven’t tended to use the results directly. Instead, I’ve used them to sanity check the output of simulations. This is where &lt;em&gt;engineering practice&lt;/em&gt; diverges a bit from &lt;em&gt;computer science theory&lt;/em&gt; (although it does it in a pretty theoretically rigorous way).&lt;/p&gt;

&lt;h3 id=&quot;simulating-the-problem&quot;&gt;Simulating the Problem&lt;/h3&gt;
&lt;p&gt;There are a couple of limitations on the usefulness of the closed-form analysis of this problem. One problem is that it’s fairly difficult to understand clearly (at least for me), and quite complex to communicate. The bigger problem, though, is that it’s quite inflexible: extending the analysis to include cases like balls of different sizes (as requests are in the real world) and balls coming out of bins (requests completing) is difficult, and difficult to code review unless you are lucky enough to work with a team that’s very mathematically sophisticated. The good news, though, is that this problem is exceptionally easy to simulate.&lt;/p&gt;

&lt;p&gt;When I think about doing these kinds of simulations, I don’t generally think about using specialized simulation tools or frameworks (although you could certainly do that). Instead, I generally think about just writing a few tens of lines of Python or R which directly try the thing that I want an answer for many times, and then output data in a form that’s easy to plot. Computer simulation is a broad and subtle topic, but this kind of thing (throw balls into bins, count, repeat) avoids many of the subtleties, because you can avoid floating point (it’s just counting) and because you can avoid being too concerned about the exact values.&lt;/p&gt;
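
&lt;p&gt;Such a simulation really is just a few lines. A minimal sketch in Python (the function names and trial counts here are my own, arbitrary choices):&lt;/p&gt;

```python
import random
from collections import Counter

def max_bin(m, n):
    """Throw m balls uniformly at random into n bins;
    return the count in the fullest bin."""
    counts = Counter(random.randrange(n) for _ in range(m))
    return max(counts.values())

def expected_max(m, n, trials=1000):
    """Estimate the expectation of the maximum by repetition."""
    return sum(max_bin(m, n) for _ in range(trials)) / trials

print(expected_max(100, 100))    # around 4.2 for M = N = 100
print(expected_max(1000, 100))   # around 18.8 for M = 10N
```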

&lt;p&gt;Knowing the closed-form analysis makes it easy to sanity-check the simulation. According to Gonnet, the 𝑀=𝑁 case should approach log𝑁/loglog𝑁 (1+𝑜(1)), and we can plot that curve (choosing a value of 𝑜(1) to minimize the difference) alongside the simulation results to see if the simulation matches the theory. The results look pretty good.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/bb_sim_vs_model.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Gonnet’s paper also contains a table of example values, which compare very well to our simulated and modelled numbers. That all increases our confidence that the simulation is telling us sensible things.&lt;/p&gt;

&lt;p&gt;You can also extend this basic counting to be closer to real-world load-balancing. Follow a Poisson process (a fancy way of saying “use exponentially distributed random numbers to decide how long to wait”) to add random balls into bins over time, and follow your completion time distribution (probably exponential too) to pull them out of the bins. Every so often, sample the size of the biggest bin. Next, output those samples for analysis. If you have real arrival-time data, and completion-time distributions, you can use those to avoid making &lt;em&gt;any&lt;/em&gt; statistical assumptions. Which is nice.&lt;/p&gt;
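
&lt;p&gt;Here’s a sketch of what that extension might look like, with made-up parameters: exponentially distributed inter-arrival times, exponentially distributed completion times, a uniformly random choice of bin for each arrival, and periodic samples of the busiest bin:&lt;/p&gt;

```python
import heapq
import random

def simulate(arrival_rate, service_rate, n_bins, duration, sample_every=1.0):
    """Minimal event-driven sketch: requests (balls) arrive as a Poisson
    process, go to a uniformly random backend (bin), and leave after an
    exponentially distributed completion time. Periodically sample the
    size of the fullest bin."""
    bins = [0] * n_bins
    departures = []            # min-heap of (finish_time, bin_index)
    samples = []
    t, next_sample = 0.0, sample_every
    while t < duration:
        t_next = t + random.expovariate(arrival_rate)
        # Handle, in time order, any departures and samples that are due
        # before the next arrival.
        while next_sample <= t_next or (departures and departures[0][0] <= t_next):
            if departures and departures[0][0] <= next_sample:
                _, b = heapq.heappop(departures)
                bins[b] -= 1
            else:
                samples.append(max(bins))
                next_sample += sample_every
        t = t_next
        b = random.randrange(n_bins)   # random load balancing
        bins[b] += 1
        heapq.heappush(departures, (t + random.expovariate(service_rate), b))
    return samples

samples = simulate(arrival_rate=100, service_rate=11, n_bins=10, duration=100)
print(sum(samples) / len(samples))  # time-averaged size of the busiest bin
```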

&lt;p&gt;When you’ve got the basic simulation, it’s easy to add in things like different-sized requests, or bursts of traffic, or the effects of scaling up and down the backends.&lt;/p&gt;

&lt;h3 id=&quot;some-basic-results&quot;&gt;Some Basic Results&lt;/h3&gt;
&lt;p&gt;For small 𝑀 and 𝑁, the constant factors are a big problem. With 𝑀=𝑁=100, I get an expected maximum of around 4.2. In other words, we can expect the busiest backend to be over 4x busier than the average. That means that you either need to significantly over-scale all your backends, or put up with the problems that come with hotspotting. This problem also gets worse (although very very slowly, going back to the closed-form) with scale.&lt;/p&gt;

&lt;p&gt;In the closer-to-reality case with 𝑀=1000 and 𝑁=100, the gap shrinks. The expected maximum comes to 18.8, compared to a mean (aka 𝑀/𝑁) of 10. That still means that the hottest backend gets 80% more traffic, but the gap is closing. By 𝑀=10000 and 𝑁=100, the gap has closed to 25%, which starts to be close to the realm of acceptable. At 𝑀=100,000, the gap closes to 8%. In most distributed systems contexts, 8% is probably within the variation in performance due to other factors.&lt;/p&gt;

&lt;p&gt;Still, the conclusion of all of this is that random load balancing (and random shuffle-sharding, and random consistent hashing) distributes load rather poorly when 𝑀 is small. Load-sensitive load balancing, either stateful or stateless with an algorithm &lt;a href=&quot;//brooker.co.za/blog/2012/01/17/two-random.html&quot;&gt;like best-of-two&lt;/a&gt;, is still very much interesting and relevant. The world would be simpler and more convenient if that wasn’t the case, but it is.&lt;/p&gt;
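
&lt;p&gt;The same simulation skeleton makes it easy to see why best-of-two is interesting. This sketch (my own illustration, with made-up trial counts) compares purely random placement against picking the emptier of two random bins:&lt;/p&gt;

```python
import random

def max_load(m, n, choices=1):
    """Place m balls into n bins; with choices=2, pick two random
    bins and use the emptier one ('best of two')."""
    bins = [0] * n
    for _ in range(m):
        candidates = [random.randrange(n) for _ in range(choices)]
        best = min(candidates, key=lambda b: bins[b])
        bins[best] += 1
    return max(bins)

trials = 200
rand = sum(max_load(1000, 100, choices=1) for _ in range(trials)) / trials
two = sum(max_load(1000, 100, choices=2) for _ in range(trials)) / trials
print(rand, two)  # best-of-two lands much closer to the mean of 10
```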

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes:&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; That’s a Big Theta, if you’re not familiar with it &lt;a href=&quot;https://en.wikipedia.org/wiki/Big_O_notation#Family_of_Bachmann%E2%80%93Landau_notations&quot;&gt;wikipedia has a good explanation&lt;/a&gt; of what it means. If you don’t feel like reading that, and replace it with a big O in your head, that’s close enough in this case.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; That paper also contains some great analysis on the numerical properties of different hash-table probing strategies versus separate chaining. If you like algorithm analysis, the conclusion section is particularly interesting.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    
    <item>
      <title>Is the Mean Really Useless?</title>
      <link>http://brooker.co.za/blog/2017/12/28/mean.html</link>
      <pubDate>Thu, 28 Dec 2017 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2017/12/28/mean</guid>
      <description>&lt;h1 id=&quot;is-the-mean-really-useless&quot;&gt;Is the Mean Really Useless?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Don&apos;t be too mean to the mean.&lt;/p&gt;

&lt;p&gt;“The mean is useless” is a commonly-repeated statement in the systems observation and monitoring world. As people correctly point out, the mean (or average&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) tends to hide information about outliers, tends to be optimistic for many metrics, and can even be wildly misleading in presence of large outliers. That doesn’t mean that the average is useless, just that you need to be careful of how you interpret it, and what you use it for.&lt;/p&gt;

&lt;h3 id=&quot;all-descriptive-statistics-are-misleading&quot;&gt;All descriptive statistics are misleading&lt;/h3&gt;

&lt;p&gt;All descriptive statistics are misleading, and potentially dangerous. The most prosaic reason for that is that they are summaries: by their nature they don’t capture the entire reality of the data they are summarizing. There is no way for a single number to capture everything you need to know about a large set of numbers. &lt;a href=&quot;https://en.wikipedia.org/wiki/Anscombe%27s_quartet&quot;&gt;Anscombe’s quartet&lt;/a&gt; is the most famous illustration of this problem: four data sets that have very different graphs, but the same mean and variance in 𝑥 and 𝑦, and the same linear trend. Thanks to &lt;a href=&quot;http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html&quot;&gt;Alberto Cairo&lt;/a&gt; and Autodesk, there’s an even more fun example: &lt;a href=&quot;https://www.autodeskresearch.com/publications/samestats&quot;&gt;the datasaurus dozen&lt;/a&gt;&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;iframe width=&quot;560&quot; height=&quot;315&quot; src=&quot;https://www.youtube.com/embed/DbJyPELmhJc&quot; frameborder=&quot;0&quot; gesture=&quot;media&quot; allow=&quot;encrypted-media&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;There are other, more subtle, reasons that descriptive statistics are misleading too. One is that statistics in real-world computer systems change with time, and you can get very different results depending on how those changes in time align with when you sample and how long you average for. Point-in-time sampling can lead to completely missing some detail, especially when the sampling time is aligned to wall-clock time with no jitter.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/cpu_sampling.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this example, we’ve got a machine that runs a periodic job (like a cron job) every minute, and it uses all the CPU on the box for a second. If we sample periodically, aligned to the minute boundary, we’ll think the box has 100% CPU usage. Instead, if we sample periodically aligned to any other second, we’ll think it’s completely idle. If, instead, we sample every second and emit a per-minute summary, we’ll get a mean of 1.7% usage, a median of 0% and a 99th percentile of 100%. None of those really tell us what’s going on. Graphing the time series helps in this case, punting the problem to our brain’s ability to summarize graphs, but that’s hard to do at scale. The darling of the monitoring crowd, histograms, don’t really help here either&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. That’s obviously a fantastically contrived example. Check out &lt;a href=&quot;https://www.azul.com/files/HowNotToMeasureLatency_LLSummit_NYC_12Nov2013.pdf&quot;&gt;this presentation from Azul systems&lt;/a&gt; for some real-world ones.&lt;/p&gt;
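
&lt;p&gt;Those summary numbers are easy to reproduce. A quick sketch, using the nearest-rank definition of percentile (other percentile definitions would give slightly different answers):&lt;/p&gt;

```python
import math

# One minute of per-second CPU samples: the cron job pegs the CPU for one
# second, and the box is idle for the other 59.
samples = [100.0] + [0.0] * 59

mean = sum(samples) / len(samples)
ordered = sorted(samples)
median = ordered[len(samples) // 2]
p99 = ordered[math.ceil(0.99 * len(samples)) - 1]  # nearest-rank percentile

print(round(mean, 1), median, p99)  # 1.7 0.0 100.0
```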

&lt;p&gt;Ok, so descriptive statistics suck. For many operational tasks, medians and percentiles suck less than the mean. But that shouldn’t be taken to imply that averages are useless.&lt;/p&gt;

&lt;h3 id=&quot;throughput&quot;&gt;Throughput&lt;/h3&gt;

&lt;p&gt;The throughput of a serial system, how many items of work it can do in a period of time, is one over the mean latency. That changes when the system can do multiple work items at once, either due to pipelining (like CPUs, networks and storage) or due to true parallelism (again, like CPUs, networks, &lt;a href=&quot;http://brooker.co.za/blog/2014/07/04/iostat-pct.html&quot;&gt;and storage&lt;/a&gt;), but mean latency remains the denominator.&lt;/p&gt;

&lt;p&gt;Consider a serial system that processes requests in 1ms 99% of the time, and 1s 1% of the time. The mean latency is 10.99ms, and the throughput is 91 requests per second. What the monitoring people are saying when they talk about the mean being bad is that neither of those figures (10.99ms or 91 rps) tells you that 1% of requests are having a really bad time. That’s all true. But both of those numbers are still very useful for capacity planning.&lt;/p&gt;
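
&lt;p&gt;As a quick check of that arithmetic:&lt;/p&gt;

```python
# The example above: a serial system where 99% of requests take 1 ms and
# 1% take 1000 ms.
mean_latency_ms = 0.99 * 1 + 0.01 * 1000  # 10.99 ms
throughput_rps = 1000 / mean_latency_ms   # serial: one request at a time
print(mean_latency_ms, round(throughput_rps))  # 10.99 91
```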

&lt;p&gt;The mean throughput number, 91 requests per second in our example, allows us to compare our expected rate of traffic with the capacity of the system. If we’re expecting 10 requests per second at peak this holiday season, we’re good. If we’re expecting 100, then we’re in trouble. Once we know we’re in capacity trouble we can react by adding a second processor (doubling throughput in theory), or by trying to reduce latency (probably starting with that 1000ms outlier). Just looking at our latency graphs doesn’t tell us that.&lt;/p&gt;

&lt;h3 id=&quot;contention-and-littles-law&quot;&gt;Contention and Little’s Law&lt;/h3&gt;

&lt;p&gt;Another place the mean is really useful is in the context of &lt;a href=&quot;https://en.wikipedia.org/wiki/Little%27s_law&quot;&gt;Little’s Law&lt;/a&gt;: 𝐋=𝛌𝐖.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are a lot of reasons that this law is interesting and useful, but the biggest one for system operators is &lt;em&gt;concurrency&lt;/em&gt; and how it relates to scale. In almost all computer systems concurrency is a limited resource. In thread-per-request (and process-per-request) systems the limit is often the OS or language scheduler. In non-blocking, evented, and green-thread systems limits include memory, open connections, the language scheduler, and backend limitations like database connections. In modern serverless systems like AWS Lambda, you can &lt;a href=&quot;http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html&quot;&gt;provision concurrency directly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like throughput, Little’s law gives us a way to reason about the long-term capacity of a system, and how close we are to it. In large-scale distributed systems many of the limited resources can be difficult to measure directly, so these capacity measures are also useful in understanding the load on resources we can’t observe.&lt;/p&gt;
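
&lt;p&gt;As a small illustrative sketch (the numbers here are made up), Little’s law turns an arrival rate and a mean latency into a concurrency requirement, which can be compared directly against a concurrency limit:&lt;/p&gt;

```python
def concurrency(arrival_rate, mean_latency_s):
    """Little's law: long-term average concurrency L = lambda * W."""
    return arrival_rate * mean_latency_s

# 500 requests/second at a mean latency of 200 ms means, on average,
# 100 requests in flight at once.
L = concurrency(500, 0.200)
print(L)  # 100.0

# Run the law backwards: with a cap of 128 database connections, mean
# latency can't exceed 128 / 500 seconds before the cap binds.
max_latency = 128 / 500
print(max_latency)  # 0.256
```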

&lt;p&gt;It’s very useful to build an intuition around Little’s law, because it provides a handle onto some of the dynamic behaviors of computer systems. In real-world systems (often due to contention), latency (𝐖) tends to increase along with concurrency (𝐋), meaning that the actual reaction of 𝐋 to increasing arrival rate (𝛌) can be seriously non-linear. Similarly, timeout-and-retry leads the arrival rate to increase as latency increases, again leading to non-linear effects.&lt;/p&gt;

&lt;p&gt;Little’s law isn’t true of the other descriptive statistics. You can’t plug in a percentile of 𝛌 or 𝐖 and expect to get a correct percentile of 𝐋. Except in exceptional circumstances, it only works on the mean.&lt;/p&gt;

&lt;h3 id=&quot;request-size-and-volume&quot;&gt;Request Size and Volume&lt;/h3&gt;

&lt;p&gt;Mean request size (or packet size, or response size, etc) is another extremely useful mean. It’s useful precisely because of the way the mean is skewed by outliers. Remember that the mean is defined as the sum divided by the count: if you multiply it back by the count you can extract the sum. When it comes to storage, or even things like network traffic, that total is a very useful capacity measure. Percentiles, by their nature, are robust to outliers, but the measure you’re actually interested in (“how much storage am I using?”) may be driven by outliers.&lt;/p&gt;
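
&lt;p&gt;A tiny illustration of that point, with made-up sizes: one large outlier dominates the total, the percentile never sees it, but the mean preserves it:&lt;/p&gt;

```python
# Request sizes with one large outlier: percentiles are robust to the
# outlier, but the capacity question ("how much storage?") is driven by it.
sizes = [1] * 99 + [10_000]  # sizes in KB, made up for illustration

mean = sum(sizes) / len(sizes)
p99 = sorted(sizes)[98]       # nearest-rank 99th percentile of 100 values
total = mean * len(sizes)     # multiplying the mean back out recovers the sum

print(mean, p99, round(total))  # 100.99 1 10099
```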

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Graphs, percentiles, medians, maximums, and moments are all extremely useful tools if you’re interested in monitoring systems. But I feel that, in their fervor to promote these tools, people have over-stated the case against the mean. In some quarters there even seems to be a religious fervor against the average, and immediate judgments of incompetence against anybody who uses it. That’s unfortunate, because the average is a tool that’s well-suited to some important tasks. Like all statistics, it needs to be used with care, but don’t believe the anti-mean zealots (and, importantly, don’t be mean).&lt;/p&gt;

&lt;h3 id=&quot;footnotes&quot;&gt;Footnotes:&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; In this post I use &lt;em&gt;mean&lt;/em&gt; and &lt;em&gt;average&lt;/em&gt; more-or-less interchangeably, even though that isn’t technically correct. You know what I mean.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; The &lt;a href=&quot;https://www.autodeskresearch.com/sites/default/files/SameStats-DifferentGraphs.pdf&quot;&gt;paper on how the datasaurus dozen were made&lt;/a&gt; is worth reading.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Histograms do, obviously, help in other cases, as do other tools like Box plots. Sometimes you have to graph and summarize data in multiple ways before finding the one that answers the question you need to answer. John Tukey’s &lt;a href=&quot;https://www.amazon.com/Exploratory-Data-Analysis-John-Tukey/dp/0201076160&quot;&gt;Exploratory Data Analysis&lt;/a&gt; is the classic book in that field.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Why Must Systems Be Operated?</title>
      <link>http://brooker.co.za/blog/2016/01/03/correlation.html</link>
      <pubDate>Sun, 03 Jan 2016 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2016/01/03/correlation</guid>
      <description>&lt;h1 id=&quot;why-must-systems-be-operated&quot;&gt;Why Must Systems Be Operated?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Latent Failures and the Safety Margin of Systems&lt;/p&gt;

&lt;p&gt;Mirrored RAID&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; is a classic way of increasing storage durability. It’s also a classic example of a system that’s robust against independent failures, but fragile against dependent failure. &lt;a href=&quot;http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf&quot;&gt;Patterson et al’s 1988&lt;/a&gt; paper, which popularized mirroring, even covered the problem:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As mentioned above we make the same assumptions that disk manufacturers make – that the failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A 2-way mirrored RAID can be in three possible states: a state with no failures, a state with one failure, or a state with two failures. The system moves between the first and second states, and the second and third states, when a failure happens. It can return from the second state to the first by repair. In the third state, data is lost, and returning becomes an exercise in &lt;em&gt;disaster recovery&lt;/em&gt; (like restoring a backup). The classic Markov model looks like this, with the failure rate λ and repair rate μ:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/markov_2stage.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
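
&lt;p&gt;Under the model’s assumptions of independent, exponentially distributed failures and repairs, the mean time to data loss is easy to estimate by simulating the chain directly. A sketch, with made-up rates:&lt;/p&gt;

```python
import random

def time_to_data_loss(fail_rate, repair_rate):
    """Simulate the naive independent-failure Markov model: the state is
    the number of failed disks in a two-disk mirror; data is lost when
    both have failed."""
    t, failed = 0.0, 0
    while failed < 2:
        if failed == 0:
            t += random.expovariate(2 * fail_rate)  # either disk can fail first
            failed = 1
        else:
            # Race between repairing the failed disk and losing the second.
            t_repair = random.expovariate(repair_rate)
            t_fail = random.expovariate(fail_rate)
            if t_fail < t_repair:
                t, failed = t + t_fail, 2
            else:
                t, failed = t + t_repair, 0
    return t

# Disks failing every ~1000 hours on average, repaired in ~10 hours: the
# model predicts a mean time to data loss of roughly
# repair_rate / (2 * fail_rate**2), about 50,000 hours here.
runs = [time_to_data_loss(1 / 1000, 1 / 10) for _ in range(2000)]
print(sum(runs) / len(runs))
```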

&lt;p&gt;This model clearly displays its naive thinking: it assumes that the failure rate of 2 disks is double the failure rate of a single disk&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. All experienced system operators know that’s not true in practice. A second disk failure seems more likely to happen soon after a first. This happens for three reasons.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Failures with the same cause. These failures, like Patterson’s earthquakes and power surges, affect both drives at the same time. A roof falling in on a server can move its RAID from state 1 to state 3 pretty quickly. Operator mistakes are also a common (and maybe dominant) source of these kinds of failures.&lt;/li&gt;
  &lt;li&gt;Failures triggered by the first failure. When the first drive fails, it triggers a failure of the second drive. In a RAID, the second drive is going to be put under high load as the system attempts to get back to two good copies. This extra load increases the probability of the second drive failing.&lt;/li&gt;
  &lt;li&gt;Latent failures. These cases start with the system believing (and the system operator believing) that the system is in stage one. When a failure occurs, it very quickly learns that the second &lt;em&gt;good&lt;/em&gt; copy isn’t so good&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/markov_2stage_corr.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The third case, latent failures, may be the most interesting to system designers. They are a great example of the fact that systems&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; often don’t know how far they are from failure. In the simple RAID case, a storage system with a latent failure &lt;em&gt;believes&lt;/em&gt; that it’s in the first state, but actually is in the second state. This problem isn’t, by any means, isolated to RAID.&lt;/p&gt;

&lt;p&gt;Another good example of the same problem is a system with a load balancer and some webservers behind it. The load balancer runs health checks on the servers, and only sends load to the servers that it believes are healthy. This system, like mirrored RAID, is susceptible to having outages caused by failures with the same cause (flood, earthquake, etc), failures triggered by the first failure (overload), and latent failures. The last two are vastly more common than the first: the servers fail one-by-one over time, and the system stays up until it either dies of overload, or the last server fails.&lt;/p&gt;

&lt;p&gt;In both the load-balancer and RAID cases, &lt;em&gt;black box&lt;/em&gt; monitoring of the system is not sufficient. Black box monitoring, including external monitors, canaries, and so on, only tells the system which side of an &lt;em&gt;externally visible failure boundary&lt;/em&gt; it is on. Many kinds of systems, including nearly every kind that includes some redundancy, can move towards this boundary through multiple failures without crossing it. Black-box monitoring misses these internal state transitions. Catching them can significantly improve the actual, real-world, durability and availability of a system.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/failure_state_space.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Presented that way, it seems obvious. However, I think there’s something worth paying real attention to here: complex systems, the kind we tend to build when we want to build failure-tolerant systems, have a property that simple systems don’t. Simple systems, like a teacup, are either working or they aren’t. There is no reason to invest in maintenance (beyond the occasional cleaning) until a failure happens. Complex systems are different. They need to be constantly maintained to allow them to achieve their optimum safety characteristics.&lt;/p&gt;

&lt;p&gt;This requires deep understanding of the behavior of the system, and involves complexities that are often missed in planning and management activities. If planning for, and allocating resources to, maintenance activities is done without this knowledge (or, worse, only considering external failure rates) then it’s bound to under-allocate resources to the real problems.&lt;/p&gt;

&lt;p&gt;That doesn’t mean that all maintenance must, or should, be done by humans. It’s possible, and necessary at scale, to automate many of the tasks needed to keep systems far from the failure boundary. You’ve just got to realize that your automation is now part of the system, and the same conclusions apply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Also known as RAID 1. Despite nearly a decade working on computer storage, my brain refuses to store the bit of which of RAID 0 and RAID 1 is mirroring, and which is striping.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; And a whole lot more. &lt;a href=&quot;http://domino.watson.ibm.com/library/CyberDig.nsf/papers/BD559022A190D41C85257212006CEC11/$File/rj10391.pdf&quot;&gt;Hafner and Rao&lt;/a&gt; is a good place to start for a more complete picture of RAID reliability.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; In storage systems the most common cause of these kinds of issues is &lt;em&gt;latent sector errors&lt;/em&gt;. &lt;a href=&quot;https://www.usenix.org/legacy/event/fast10/tech/full_papers/schroeder.pdf&quot;&gt;Understanding latent sector errors and how to protect against them&lt;/a&gt; is a good place to start with the theory, and &lt;a href=&quot;http://research.cs.wisc.edu/wind/Publications/latent-sigmetrics07.pdf&quot;&gt;An Analysis of Latent Sector Errors in Disk Drives&lt;/a&gt; presents some (possibly dated) data on their frequency.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; Here, the system includes its operators, both human and automated.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Heuristic Traps for Systems Operators</title>
      <link>http://brooker.co.za/blog/2015/11/05/heuristics.html</link>
      <pubDate>Thu, 05 Nov 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/11/05/heuristics</guid>
      <description>&lt;h1 id=&quot;heuristic-traps-for-systems-operators&quot;&gt;Heuristic Traps for Systems Operators&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;What can we learn from avalanche safety?&lt;/p&gt;

&lt;p&gt;Powder magazine’s new feature &lt;a href=&quot;http://features.powder.com/human-factor-2.0/chapter-1&quot;&gt;The Human Factor 2.0&lt;/a&gt; is a fantastic read. It’s a good disaster story, like the New York Times’ &lt;a href=&quot;http://www.nytimes.com/projects/2012/snow-fall/#/?part=tunnel-creek&quot;&gt;Snow Fall: The Avalanche At Tunnel Creek&lt;/a&gt;, but looks deeply at a very interesting topic: the way that we make risk decisions. I think there are interesting lessons there for operators and designers of computer systems.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“That’s the brutal thing,” Donovan said. “It’s hard to get experience without exposing yourself to risk.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The consequences of making bad decisions in back-country skiing can be life and death. As designers, builders and operators of systems of computers, our bad decisions are generally less dramatic. But they are real. Data loss, customer disappointment, and business failure have all come as the results of making poor risk decisions. One way to mitigate those risks is experience, and its product: intuition. Unfortunately, as David Page’s article reminds us, intuition isn’t always the route to safety.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The difficulty, he explained, especially when it comes to continuing education for veteran mountain guides and other professionals, is breaking through a culture of expertise that is based on savoir-faire - in other words, on some deep combination of knowledge and instinct derived from experience - especially if that experience may happen to include years of imperfect decisions and sheer luck. “When we talk about human behavior, people feel a bit attacked in their certainty, in their habits”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem with intuition is exactly that it is not the product of conscious thought. Intuition is &lt;a href=&quot;http://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555&quot;&gt;Daniel Kahneman’s&lt;/a&gt; System 1: effortless, implicit, and based on heuristics rather than reason. It’s well documented that these heuristics serve us very well in many situations, especially when speed of movement is important above all else, but they don’t lead to us making good decisions in the long term.&lt;/p&gt;

&lt;p&gt;Ian McCammon’s &lt;a href=&quot;http://avalanche-academy.com/uploads/resources/Traps%20Reprint.pdf&quot;&gt;Evidence of heuristic traps in recreational avalanche accidents&lt;/a&gt; is great evidence of this effect in practice. He presents four heuristics that appear to correlate with snow sports accidents:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The familiarity heuristic, or taking increased risks in familiar places or situations.&lt;/li&gt;
  &lt;li&gt;The social proof heuristic, or taking increased risks because others are doing it.&lt;/li&gt;
  &lt;li&gt;The commitment heuristic, or taking increased risks because we want to appear consistent with our words, keep our promises, or feel committed to the situation. Here, we’re &lt;em&gt;stepped in so far&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;The scarcity heuristic, or taking increased risks to take advantage of limited resources or opportunities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;McCammon presents evidence that the &lt;em&gt;social proof&lt;/em&gt;, &lt;em&gt;commitment&lt;/em&gt; and &lt;em&gt;scarcity&lt;/em&gt; heuristics do correlate with avalanche deaths. Even more interesting is the effect of experience and training. The &lt;em&gt;familiarity&lt;/em&gt; and &lt;em&gt;social proof&lt;/em&gt; heuristics correlate most strongly in those with advanced training in avalanche safety. The strength of the &lt;em&gt;familiarity&lt;/em&gt; heuristic’s effect is remarkable: advanced training appears to lead to clearly better decisions in unfamiliar situations, but equal or worse quality decisions in familiar situations.&lt;/p&gt;

&lt;p&gt;This all applies to the decisions that systems operators make every day. In systems operations, much like in avalanche safety, there is a strong familiarity heuristic. In my experience, operators don’t tend to reflect on the safety of operations they are familiar with as much as unfamiliar operations. This is logical, of course, because if we stopped to think through every action we’d be immobile. Still, it’s critical to reevaluate the safety of familiar operations periodically, especially if conditions change.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;First, most accidents happen on slopes that are familiar to the victims. While it’s likely that people tend to recreate more often on slopes they are familiar with, the high percentage of accidents on familiar slopes suggests that familiarity alone does not correspond to a substantially lower incidence of triggering an avalanche.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The social proof heuristic (and its buddy, appeal to authority) are also common to systems operators. Things that multiple people are doing are seen as safer, despite evidence to the contrary. Like the familiarity heuristic, this one makes some sense on the surface. Successful completion of tasks by others &lt;em&gt;is&lt;/em&gt; evidence of their safety. What is irrational is overweighting this evidence.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;All of this suggests that the social proof heuristic may have some marginal value in reducing risk, but in view of the large number of accidents that occur when social proof cues are present it cannot be considered in any way reliable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, the commitment heuristic is our noble wish to be true to our word, to keep our promises, and to get things done before it’s dark out. Committing to get work done is a very important social force - it’s what operators are paid for, after all - but can lead to poor risk decisions. This is where I see very interesting overlaps with &lt;a href=&quot;http://brooker.co.za/blog/2014/06/29/rasmussen.html&quot;&gt;Jens Rasmussen’s model of risk management&lt;/a&gt;. The commitment heuristic aligns well with what Rasmussen describes as the “gradient towards least effort” in systems operations.&lt;/p&gt;

&lt;p&gt;There is great value in looking at the way that people in unrelated fields make risk decisions and exercise heuristics, because it allows us to use our &lt;em&gt;system 2&lt;/em&gt; (in Kahneman’s terminology) to train our &lt;em&gt;system 1&lt;/em&gt;, and recognize where it might be leading us astray.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Is there a CAP theorem for Durability?</title>
      <link>http://brooker.co.za/blog/2015/09/26/cap-durability.html</link>
      <pubDate>Sat, 26 Sep 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/09/26/cap-durability</guid>
      <description>&lt;h1 id=&quot;is-there-a-cap-theorem-for-durability&quot;&gt;Is there a CAP theorem for Durability?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Expanding the taxonomy of distributed systems.&lt;/p&gt;

&lt;p&gt;The CAP theorem considers only two of the axes of tradeoffs in distributed systems design. There are many others, including operability, security, latency, integrity, efficiency, and durability. I was recently talking over a beer or two with a colleague about whether there is a CAP theorem for durability (DAP theorem?). These are my thoughts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is durability?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To have a meaningful conversation, we need to talk about what durability is. It’s typically given a few meanings:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Persistence of information to stable storage, to tolerate loss of in-memory (volatile) state. This is the &lt;em&gt;D&lt;/em&gt; in ACID.&lt;/li&gt;
  &lt;li&gt;Loss of the data stored in a database. This is typically measured using population statistics. &lt;a href=&quot;https://en.wikipedia.org/wiki/Annualized_failure_rate&quot;&gt;Annualized failure rate (AFR)&lt;/a&gt;, and Mean Time to Data Loss (MTTDL) are typical, easy to understand, (but flawed&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) metrics.&lt;/li&gt;
  &lt;li&gt;Loss of recently committed transactions or recently-written data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On single-node systems, these topics are deeply connected. Persistence to stable storage is required to keep data around across crashes. RAID and backups&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; are widely used to protect against permanent loss of the single system. Traditionally, non-zero &lt;a href=&quot;https://en.wikipedia.org/wiki/Recovery_point_objective&quot;&gt;RPO&lt;/a&gt; is tolerated on node failure.&lt;/p&gt;

&lt;p&gt;Distributed systems can be different. Instead of having a single gold-plated node with its own great durability properties, distributed databases spread the risk out over multiple machines. That unlinks the topics of persistence to stable storage and loss of data, where systems can be tolerant to some number of node reboots without any stable storage.&lt;/p&gt;

&lt;p&gt;For the rest of this post I’ll define durability as “the ability to tolerate t node failures without losing data”. It’s a flawed but hopefully useful definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What has this all got to do with CAP?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The limits on consistency are well-known. CAP is one boundary in one system model and set of definitions, another (possibly more useful) one is from the all-time classic &lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf&quot;&gt;Consensus in the Presence of Partial Synchrony&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;For fail-stop or omission faults we show that t-resilient consensus is possible iff N ≥ 2t + 1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What that means is that you can build systems that keep on going&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; even if &lt;em&gt;t&lt;/em&gt; things fail, as long as at least &lt;em&gt;t + 1&lt;/em&gt; things don’t fail. It also means you can keep going in the majority side of a network partition, if one exists. In the Gilbert and Lynch &lt;em&gt;total availability&lt;/em&gt; sense, that means the system is not available&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. In the common sense, the system is still available for everybody on the majority side of the partition.&lt;/p&gt;

&lt;p&gt;There’s a similar definition possible for durability: “&lt;em&gt;For fail-stop or omission faults, t-resilient durability is possible iff N ≥ t + 1&lt;/em&gt;”.&lt;/p&gt;
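&lt;p&gt;As a throwaway sketch (in Python, with function names I made up for this post), the consensus bound and the durability bound can be put side by side:&lt;/p&gt;

```python
# The two resilience bounds discussed above, restated as minimum
# cluster sizes. These just encode the inequalities from the text.

def min_nodes_for_consensus(t):
    """Dwork/Lynch/Stockmeyer bound: t-resilient consensus needs N >= 2t + 1."""
    return 2 * t + 1

def min_nodes_for_durability(t):
    """t-resilient durability, as defined here, needs only N >= t + 1."""
    return t + 1

for t in (1, 2, 3):
    print(t, min_nodes_for_consensus(t), min_nodes_for_durability(t))
```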

&lt;p&gt;The next step in developing the DAP theorem is to define &lt;em&gt;failed&lt;/em&gt;. We quickly descend into rule-lawyering.&lt;/p&gt;

&lt;p&gt;Alternative 1: Nodes that we can’t talk to count as failed. If we’re in a system of &lt;em&gt;N = t + 1&lt;/em&gt; nodes, and we can’t talk to &lt;em&gt;k&lt;/em&gt; other nodes, we can accept writes. That’s because, in this state, we’ve already got &lt;em&gt;k&lt;/em&gt; failures, so we only need to tolerate another &lt;em&gt;t - k&lt;/em&gt;. That’s not a very helpful definition.&lt;/p&gt;

&lt;p&gt;Alternative 2: Nodes that we can’t talk to don’t count as failed. If we’re in a system of &lt;em&gt;N = t + 1&lt;/em&gt; nodes, we can only accept writes if we can talk to another &lt;em&gt;t&lt;/em&gt; nodes.&lt;/p&gt;
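&lt;p&gt;To make the rule-lawyering concrete, here are the two definitions written as write-acceptance predicates (a sketch with invented function names, for a system of &lt;em&gt;N = t + 1&lt;/em&gt; nodes that can currently reach some number of its &lt;em&gt;t&lt;/em&gt; peers):&lt;/p&gt;

```python
# Both definitions of "failed", as write-acceptance rules for a node in
# a system of N = t + 1 nodes that can reach `reachable` of its t peers.
# Function names are made up for this sketch.

def can_accept_write_alt1(t, reachable):
    # Alternative 1: the unreachable peers count as failed, using up part
    # of the failure budget, so a write can always be accepted.
    return True

def can_accept_write_alt2(t, reachable):
    # Alternative 2: unreachable peers don't count as failed, so all t
    # peers must be reachable before a write is accepted.
    return reachable >= t

print(can_accept_write_alt1(2, 0), can_accept_write_alt2(2, 0))
```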

&lt;p&gt;In alternative 1, you can stay common-sense available on both sides of a partition. In alternative 2, any partition causes unavailability in both senses. Neither is a very useful definition, and our DAP theorem doesn’t seem useful at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Towards a useful rule&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Abadi’s &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;PACELC&lt;/a&gt; could be a better fit for durability. Let’s revisit Abadi’s definition:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Replacing C with our definition of D (the ability to tolerate t node failures without losing data), and defining A as the common-sense version of availability (&lt;em&gt;at least some clients are able to make writes&lt;/em&gt;), we get PADELD.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;if there is a partition (P), how does the system trade off availability and durability (A and D); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and durability (D)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That actually does seem to be helpful, in the sense that it could be used to have a real conversation about real systems. In the happy &lt;em&gt;E&lt;/em&gt; case, the tradeoff between latency and durability could be between synchronous and asynchronous replication, or it could be between different write quorum sizes. Asynchronous replication reduces latency because fewer steps are required, or particularly expensive steps (like cross-WAN replication) are skipped. Smaller write quorums (for example, writing to 2 of 3 replicas) also reduce latency, especially outlier latency, because writes can be acked while replication is still proceeding to slower replicas. In both cases, a failure is unlikely to lead to complete data loss, but rather some non-zero RPO, where recent writes are more likely to be lost than colder data.&lt;/p&gt;
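&lt;p&gt;The quorum-size effect on outlier latency is easy to see in a toy Monte Carlo simulation (the latency distribution below is invented, not measured from any real system):&lt;/p&gt;

```python
import random

# Toy Monte Carlo of write-ack latency: the write acks when the k-th
# fastest of n replicas finishes, so a 2-of-3 quorum hides the slowest
# replica's outliers. The latency numbers here are made up.

def mean_ack_latency(k, n=3, trials=50_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lats = []
        for _ in range(n):
            lat = 1.0 + rng.expovariate(1.0)   # typical-case latency
            if rng.random() > 0.99:            # rare 50ms outlier
                lat += 50.0
            lats.append(lat)
        total += sorted(lats)[k - 1]           # ack when k replicas are done
    return total / trials

print("3-of-3 mean ack:", round(mean_ack_latency(3), 2))
print("2-of-3 mean ack:", round(mean_ack_latency(2), 2))
```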

&lt;p&gt;In the partition &lt;em&gt;P&lt;/em&gt; case, the tradeoff is between availability and durability. The concerns here are the same as in the &lt;em&gt;E&lt;/em&gt; case, and the implementation flavors will be very similar. The partition case is meaningfully distinct, because systems may either change behavior based on failure detection (choosing to lower durability during partitions), or may best-effort replicate but give up after some latency target has been breached.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;PD/ED is a pure synchronous replication pattern, where writes are rejected if they can’t be offered full durability.&lt;/li&gt;
  &lt;li&gt;PA/ED could be a system with either modal or latency-target based behavior that generally chooses durability, but may fall back to availability if that can’t be achieved.&lt;/li&gt;
  &lt;li&gt;PA/EL is a pure asynchronous or quorum replication system, which offers a non-zero RPO for &lt;em&gt;t&lt;/em&gt; failures at all times.&lt;/li&gt;
  &lt;li&gt;PD/EL appears to be meaningless.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;PADELD&lt;/em&gt; may actually be a useful taxonomy of durability behaviors. Durability, at least if we only consider RPO and &lt;em&gt;t-resiliency&lt;/em&gt;, is also a less subtle topic than consistency, so it may even be a more useful tool than PACELC in its own right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; Greenan et al’s &lt;a href=&quot;http://web.eecs.utk.edu/~plank/plank/papers/Hot-Storage-2010.pdf&quot;&gt;Mean time to meaningless&lt;/a&gt; does a good job of explaining why these metrics aren’t ideal descriptions of true storage system behavior. They propose a different metric, NoMDL, which captures some of the missing subtlety but may be significantly more difficult to understand.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Although backups are a kind of distributed replica.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; By &lt;em&gt;keep on going&lt;/em&gt; I mean &lt;em&gt;keep on doing the consensus thing&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; The &lt;em&gt;total availability&lt;/em&gt; definition from &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=564601&amp;amp;CFID=716755369&amp;amp;CFTOKEN=66839118&quot;&gt;Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services&lt;/a&gt; is “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.” A more common sense definition is something like “For a distributed system to be continuously available, some requests received by the system must result in a response within some goal latency”. Gilbert and Lynch’s definition leads to a more beautiful CAP theorem, but probably a less helpful one.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>CALISDO: Threat Modeling for Distributed Designs</title>
      <link>http://brooker.co.za/blog/2015/06/20/calisto.html</link>
      <pubDate>Sat, 20 Jun 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/06/20/calisto</guid>
      <description>&lt;h1 id=&quot;calisdo-threat-modeling-for-distributed-designs&quot;&gt;CALISDO: Threat Modeling for Distributed Designs&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Some steps towards a mnemonic threat model for distributed systems.&lt;/p&gt;

&lt;p&gt;Threat modeling from the security field, and business impact analysis from the continuity management field, are powerful and influential ways of structured thinking about particular kinds of problems. The power of threat modeling comes from its structure. By imposing a structure on the thought process, we reduce the number of things that we miss, and make the information more analyzable and accessible. Two popular classic tools for structuring threat modeling are &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ee823878%28v=cs.20%29.aspx&quot;&gt;STRIDE&lt;/a&gt; and &lt;a href=&quot;http://blogs.msdn.com/b/david_leblanc/archive/2007/08/13/dreadful.aspx&quot;&gt;DREAD&lt;/a&gt;, both originally from Microsoft. While, on the surface, the mnemonics appear cheesy (what is this, the high school science fair?), in practice they are easy to remember, easy to use, and relatively difficult to misunderstand.&lt;/p&gt;

&lt;p&gt;Can we apply the same kind of structured thinking to analyzing the trade offs in distributed systems we design?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CALISDO&lt;/em&gt; is my first attempt at a mnemonic for doing STRIDE-style modeling of distributed systems designs.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Consistency&lt;/em&gt; How do clients experience the consistency of data in the system?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Availability&lt;/em&gt; How do clients experience the availability of the system for operations?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Latency&lt;/em&gt; How long does it take for operations to complete? If the system is eventually consistent, how long does it take for data to be visible?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Integrity&lt;/em&gt; Under what circumstances could data be corrupted?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Scalability&lt;/em&gt; How does the system scale under load?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Durability&lt;/em&gt; Under what circumstances could data be lost?&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Operational Costs&lt;/em&gt; What does it take to operate the system? How much will that cost?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;em&gt;Consistency&lt;/em&gt;, the focus is on client-visible guarantees, in the sense typically used by distributed systems (i.e. more closely related to &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;I&lt;/em&gt; than &lt;em&gt;C&lt;/em&gt; of ACID). Key questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;When is data visible to clients?&lt;/li&gt;
  &lt;li&gt;How are concurrent updates handled?&lt;/li&gt;
  &lt;li&gt;Are operations atomic from the client’s perspective?&lt;/li&gt;
  &lt;li&gt;Can the effects of rolled-back or aborted transactions be seen by other clients?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;em&gt;Availability&lt;/em&gt;, the focus is on how clients experience the ability to interact with the system. This obviously includes classic CAP- and PACELC-style trade offs, but practical concerns are likely to be as important in real systems. Issues such as redundancy, failover, load balancing, infrastructure diversity and hardware quality can all have a significant influence on availability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Latency&lt;/em&gt; focuses on how much time clients have to wait for operations to complete. Within datacenters, and local areas like AWS regions, compute and storage performance typically dominate. For systems spread around large areas of the world, such as websites and CDNs, networking and locality concerns may dominate. Latency analysis should consider both the happy case when everything is working, and degradation behavior under conditions such as failed dependencies and network packet loss. Issues such as data buffering, caching and pre-computation are also important to latency.&lt;/p&gt;

&lt;p&gt;Data &lt;em&gt;Integrity&lt;/em&gt; is critical to the client experience of distributed systems. Analysis in this area should include both end-to-end properties (such as checksums, error-correcting codes and authenticated encryption modes) and local properties (such as the BER and UBER of storage devices and channels). Key questions should cover how often corruption is expected to happen, how it is detected, and how its existence is communicated to clients. Integrity analysis should also recognize that some types of data (typically indexes and metadata) can have high &lt;em&gt;leverage&lt;/em&gt;, effectively turning small amounts of corruption into wide-scale issues. Extra attention should be paid to the integrity of this data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scalability&lt;/em&gt; looks at how the system’s behavior changes as load increases. Attention should be paid to two parts of the scaling curve: the rise in goodput in response to offered load up to saturation, and the drop in goodput as load increases beyond saturation. Scalability should consider scale-up (larger hardware), scale-out (spreading load across components) and load allocation (how load is distributed across components). In nearly all real systems, scalability is limited by either bottlenecks in the architecture, or by hot-spotting caused by uneven distribution of load.&lt;/p&gt;
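&lt;p&gt;The two-part scaling curve can be sketched with a toy model (the shape and parameters are purely illustrative):&lt;/p&gt;

```python
# Toy goodput-vs-offered-load curve: goodput tracks offered load up to a
# saturation point, then degrades as retries and queueing overhead eat
# capacity. The parameters are illustrative, not measurements.

def goodput(offered, capacity=100.0, overload_penalty=0.5):
    if offered > capacity:
        # beyond saturation, each extra unit of offered load costs goodput
        return max(0.0, capacity - overload_penalty * (offered - capacity))
    return offered

for load in (50, 100, 150, 300):
    print(load, goodput(load))
```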

&lt;p&gt;&lt;em&gt;Durability&lt;/em&gt; deals with data loss. Many distributed systems are stateless, or only handle soft-state that may be lost without major cost. Falling into one of these two categories should be seen as a goal. Systems should only store durable state if they have no other sensible choice. Durability requires attention to the failure rates of individual storage components (hard drives, SSDs, etc) and redundancy used to handle these failures (replication, RAID, etc). Attention should be paid to the blast radius of failures, and potential causes of correlated failure (such as sharing an enclosure). Recognition of the role of latent failures on durability is very important.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Operational Cost&lt;/em&gt; covers both the human cost of operations, and the hardware and services cost of operating the system. On human operations, particular attention should be given to single points of failure, where a human operator needs to take action to prevent or end an outage. Human costs also include deployment, updates, upgrades, and root-cause investigations. Hardware and services costs dominate the overall cost of operating large systems, but human costs often dominate for smaller systems. System designers must be aware of decisions that trade one operational cost for another.&lt;/p&gt;

&lt;p&gt;All of these categories are inter-related. The theoretical tradeoffs between availability and latency are well known, but in practice also involve scalability and durability. Similarly, latency and scalability are often in tension: architectural decisions taken to improve scalability can often add latency. Operational costs are typically in tension with all the other categories. There are also some areas where different categories pull together. For example, good design decisions for durability are often similar to good decisions for availability and integrity.&lt;/p&gt;

&lt;p&gt;It’s also worth noting that CALISDO doesn’t include many critical security properties. Analysis of the security properties of a system is also needed, and may interact with decisions around &lt;em&gt;integrity&lt;/em&gt;, &lt;em&gt;scalability&lt;/em&gt; and &lt;em&gt;operational cost&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;CALISDO isn’t an exhaustive list of design concerns for distributed systems, but it seems like a good start at not forgetting the obvious. Taking each aspect of a system design - along with the end-to-end system - and breaking it down into these categories makes it harder to miss obvious deficiencies, mistakes, and oversights.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Sodium Carbonate, and Ramenized Pasta</title>
      <link>http://brooker.co.za/blog/2015/05/24/sodium-carbonate.html</link>
      <pubDate>Sun, 24 May 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/05/24/sodium-carbonate</guid>
      <description>&lt;h1 id=&quot;sodium-carbonate-and-ramenized-pasta&quot;&gt;Sodium Carbonate, and Ramenized Pasta&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Doing chemistry with baking soda, cabbage and an oven.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://blog.ideasinfood.com/ideas_in_food/2014/10/ramenized.html&quot;&gt;Ramenizing things&lt;/a&gt; is super popular right now. The idea seems to have originated from the &lt;a href=&quot;http://blog.ideasinfood.com/ideas_in_food/&quot;&gt;Ideas In Food&lt;/a&gt; crew, and is obviously adapted from the practice of making ramen in Japan and similar alkaline noodles in China. In those recipes, the secret is incorporating an alkaline ingredient (such as &lt;em&gt;kansui&lt;/em&gt;) into the dough when making the noodles. This leads to a tender and springy noodle, without egg or other proteins. Ramenizing is an attempt to cheat, to introduce the ramen texture into western pasta, rice, and other foods.&lt;/p&gt;

&lt;p&gt;Most people online seem to be doing their ramenizing with &lt;a href=&quot;http://en.wikipedia.org/wiki/Sodium_bicarbonate&quot;&gt;Sodium Bicarbonate&lt;/a&gt;, the least interesting of all bases. It’s the &lt;a href=&quot;http://en.wikipedia.org/wiki/Fred_Rogers&quot;&gt;Fred Rogers&lt;/a&gt; of bases: always friendly, always useful, and you don’t mind having it around your home. Baking soda’s mild-manneredness is good for safety, but not so good for actually being an effective base. What else could we use?&lt;/p&gt;

&lt;p&gt;First, there’s &lt;a href=&quot;http://en.wikipedia.org/wiki/Lye&quot;&gt;Lye&lt;/a&gt;. Lye has a long tradition as a kitchen ingredient, most notably for making pretzels. Lye is scary. You don’t want to mess with Lye. Then there’s slaked lime, or &lt;a href=&quot;http://en.wikipedia.org/wiki/Calcium_hydroxide&quot;&gt;calcium hydroxide&lt;/a&gt;. It’s also popular in the kitchen, both as an alkali and as a source of calcium for firming up pickles. Slaked lime’s biggest claim to fame is as the motive power behind &lt;a href=&quot;http://en.wikipedia.org/wiki/Nixtamalization&quot;&gt;Nixtamalization&lt;/a&gt;, the magical process that both softens the corn husk and staves off malnutrition. Having slaked lime around the house is a bit less scary than lye, but still not common.&lt;/p&gt;

&lt;p&gt;Finally, there’s sodium carbonate. This is baking soda’s big brother, a powerful alkali with a long history as a food ingredient. Sodium carbonate grabbed the attention of the online food world back in 2010, when Harold McGee wrote about &lt;a href=&quot;http://www.nytimes.com/2010/09/15/dining/15curious.html?_r=1&quot;&gt;making it at home&lt;/a&gt;. Now, I’m not somebody to doubt Mr McGee. &lt;em&gt;On Food and Cooking&lt;/em&gt; is one of my favorite books, and an endless source of fascinating information about food. It’s also a useful source of information to share with the kind of person who needs some encouragement to go home after a dinner party.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/scale_with_sodium_carbonate.jpg&quot; alt=&quot;Scale with 7g of Sodium Carbonate&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Making sodium carbonate at home is super simple. I took 11g (about 2 teaspoons) of baking soda, spread it out on a wide glass dish, and baked it in the oven at 150°C (300°F) for about an hour. That yielded about 7g of a fine, white powder. The problem with a fine, white powder is that it could have been anything. Most likely, it could still have been sodium bicarbonate.&lt;/p&gt;

&lt;p&gt;The first test is to see if the masses work out. The reaction I’m going for is:&lt;/p&gt;

&lt;p&gt;2 NaHCO&lt;sub&gt;3&lt;/sub&gt; → Na&lt;sub&gt;2&lt;/sub&gt;CO&lt;sub&gt;3&lt;/sub&gt; + H&lt;sub&gt;2&lt;/sub&gt;O + CO&lt;sub&gt;2&lt;/sub&gt;&lt;/p&gt;

&lt;p&gt;Mass-wise, that’s 168 parts of sodium bicarbonate to 106 parts of sodium carbonate, and 62 parts of things that will evaporate away. Assuming I started with pure sodium bicarbonate, I should have ended up with 6.94g of sodium carbonate. Close enough.&lt;/p&gt;
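&lt;p&gt;That arithmetic is easy to check with approximate molar masses (a sketch, not lab-grade chemistry):&lt;/p&gt;

```python
# Check the stoichiometry above: 2 NaHCO3 -> Na2CO3 + H2O + CO2.
# Molar masses in g/mol, rounded.
NA, H, C, O = 22.990, 1.008, 12.011, 15.999

nahco3 = NA + H + C + 3 * O    # about 84 g/mol
na2co3 = 2 * NA + C + 3 * O    # about 106 g/mol

start_g = 11.0                 # baking soda in
expected_g = start_g * na2co3 / (2 * nahco3)
print(round(expected_g, 2))    # close to the 7g measured
```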

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/indicators_two_solutions.jpg&quot; alt=&quot;indicators with sodium carbonate and bicarbonate&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The best test would be to test the pH of the ramenizing solution I made from the salt. Unfortunately, I don’t own any kind of pH measuring device. I had to turn to every kid’s favorite food-grade non-toxic indicator: red cabbage water. Making red cabbage water is super easy: start with that red cabbage that’s been lying in your fridge since the 90s, chop it up fine, fill a cup with it, top up with water, and microwave until warm. Depending on the age of your cabbage, the resulting smell is either fairly pleasant, or disastrous. Either way, ignore it. The smell’s not important. It’s all about color.&lt;/p&gt;

&lt;p&gt;I made two solutions, both 1% sodium &lt;em&gt;x&lt;/em&gt;carbonate and 0.5% salt. Then, I dropped a quarter teaspoon of each into separate glasses of cabbage water. A control glass got a &lt;em&gt;ml&lt;/em&gt; of &lt;a href=&quot;http://www.fivestarchemicals.com/wp-content/uploads/Star-San-HB4.pdf&quot;&gt;StarSan&lt;/a&gt;, which turned the cabbage water a stunning pink. The bicarbonate glass stayed stubbornly purple. The sodium carbonate glass went a beautiful baby blue. Success!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/two_pastas.jpg&quot; alt=&quot;two pastas&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The next step was to soak pasta in the two solutions. After two hours, the dried pasta had become pliant and stretchy while remaining firm in both solutions. One of the things people look for in alkali noodles is a yellow color. The carbonate pasta was much yellower, having taken on the deep yellow hue of egg pasta. The bicarbonate pasta remained wan.&lt;/p&gt;

&lt;p&gt;Soapiness is probably not desirable in a pasta dish, so I washed both samples a few times before boiling for three minutes in plenty of salted water. Both samples were firm, with a bit of chewiness but none of the stick-to-your-teethiness of undercooked pasta. Of the two I preferred the carbonate pasta, a little springier and stretchier, with no signs of being soft. My test subject preferred the bicarbonate pasta, citing less chew. On looks, the yellowness of the carbonate sample won out over the pallid bicarbonate one.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/final_bowl.jpg&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The final bowl, when combined with a meat sauce and some fresh basil (thanks Aerogarden, for my unending basil glut), was excellent. The texture of the ramenized pasta is different from both fresh and dried pastas. It’s springier and chewier, which I enjoyed. The end product would probably be more at home with a broth than a chunky meat sauce, but it was still good.&lt;/p&gt;

&lt;p&gt;An experiment I’ll repeat.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>The Zero, One, Infinity Disease</title>
      <link>http://brooker.co.za/blog/2015/04/11/zero-one.html</link>
      <pubDate>Sat, 11 Apr 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/04/11/zero-one</guid>
      <description>&lt;h1 id=&quot;the-zero-one-infinity-disease&quot;&gt;The Zero, One, Infinity Disease&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Numbers are important.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“The only reasonable numbers are zero, one and infinity.” (&lt;a href=&quot;http://www.amazon.com/dp/0195113063/&quot;&gt;Bruce MacLennan&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rules and heuristics are important. Within our own heads, they are mental shortcuts we use to save ourselves from needing to reason everything out from first principles. Between us, they are devices that we can use to communicate ideas and share complex concepts. Rules of thumb are named patterns of thinking, pointers to complex discussions that can be used in place of talking things through from the beginning every time. They have huge power. Every so often, we should go through our rules and heuristics and throw out the ones that hurt more than they help.&lt;/p&gt;

&lt;p&gt;I have a candidate: the &lt;a href=&quot;http://en.wikipedia.org/wiki/Zero_one_infinity_rule&quot;&gt;zero-one-infinity rule&lt;/a&gt;. In spirit, zero-one-infinity is valuable. It counsels against arbitrary limits, and points out that arbitrary limits are a strong hint that a system or piece of code was poorly considered.&lt;/p&gt;

&lt;p&gt;The value is lost when the word &lt;em&gt;arbitrary&lt;/em&gt; is forgotten.&lt;/p&gt;

&lt;p&gt;Numerical instincts are a critical part of the engineer’s toolkit. Having the ability to understand the scale and size of a problem, to estimate quickly, and think in terms of upper and lower bounds is exceptionally useful for both science and engineering. This includes the ability to look at a number, or graph, or formula, and quickly decide whether it looks &lt;em&gt;about right&lt;/em&gt; or &lt;em&gt;definitely wrong&lt;/em&gt;. Many of the best engineers and scientists (most famously &lt;a href=&quot;http://en.wikipedia.org/wiki/Fermi_problem&quot;&gt;Enrico Fermi&lt;/a&gt;) have numerical intuition as a strength, or even as a superpower. When it’s used well, intuition is irreplaceable. It tells us where to measure, where to calculate, and when to calculate or measure a second time. It’s the pure distillation of hard experience into numbers.&lt;/p&gt;

&lt;p&gt;Numerical intuition is closely related to another very useful tool: statistical intuition. Statistical intuition is a feeling about how often things happen, what the distribution of things looks like, and how likely it is that the unlikely will turn out to be true. Statistical intuition is often hard won, and can be very easily fooled. Humans, as a species, aren’t very good at intuitively understanding statistical concepts. Still, the best engineers and scientists keep practicing. They can guess the general shape of distributions, and build general effects (like the laws of large and small numbers) into their everyday thinking.&lt;/p&gt;

&lt;p&gt;Statistical and numerical intuition are most useful when they work over a large range of scales. Experts make a mental shift from linear to exponential estimation as numbers get too big or too small, from multiplying to adding, and from dividing to subtracting. They discard the mantissa, and use only the exponent.&lt;/p&gt;

&lt;p&gt;These intuitive strengths give designers something of a superpower. They become good at finding solutions that don’t make sense at one or zero, and would never work at infinity, but are perfectly suited to their actual range of uses. They recognize where systems are far from their physical limits, which can be an opportunity to push for lower costs or more performance. They can estimate how close bottlenecks are, and where optimization will really matter.&lt;/p&gt;

&lt;p&gt;Zero-one-infinity is often taken to counsel against numerical instincts.&lt;/p&gt;

&lt;p&gt;Beyond instinct and intuition, absolute numbers are critical to computing. Absolute values, not arbitrary values but real ones, rule the physical world around us. Real limits of storage, bandwidth and latency dominate every field of computing. Real customer requirements, of numbers of entries and request rates, and request patterns, rule over the businesses we build with computers. Computing lives in a world of numbers.&lt;/p&gt;

&lt;p&gt;Zero-one-infinity is often taken as counsel against numbers.&lt;/p&gt;

&lt;p&gt;This is dangerous in two ways. First, it limits our ambitions of solving real-world problems. The ghost of infinity haunts us. There are very relevant, real problem domains where solving problems like the traveling salesman, program termination, or exact cover is very practical. Not only domains where we can accept approximate solutions, but domains where we can compute exact solutions. When we talk about infinity, we run the risk of forgetting that there’s huge value in solving problems at finite scales.&lt;/p&gt;

&lt;p&gt;The second problem is more subtle. The success of zero-one-infinity and friends, perhaps exacerbated by our habit of educating all computing people as computer scientists, makes it unfashionable, uncool or unacceptable to talk about real limitations. None of the physical systems we build can scale to infinity on any axis, but it’s hard to shake the feeling that we should be embarrassed about that. Instead of finding and documenting the limits of our systems, we pretend they don’t exist. Perhaps if we don’t talk about physical limits, we can keep pretending we don’t have any.&lt;/p&gt;

&lt;p&gt;That’s the core danger of zero-one-infinity. The most important questions about the scaling of systems are “&lt;em&gt;what are the limits?&lt;/em&gt;”, “&lt;em&gt;how do I know when I’m close to the limits?&lt;/em&gt;”, “&lt;em&gt;what happens when I hit the limits?&lt;/em&gt;”. The core question about each number should be “&lt;em&gt;where did this number come from?&lt;/em&gt;”, not “&lt;em&gt;why not infinity?&lt;/em&gt;”. Of course it can’t be infinity. It’s never going to be infinity. Let’s stop pretending it can be, and have a real conversation about numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;apy&lt;/em&gt; pointed out &lt;a href=&quot;https://www.sqlite.org/limits.html&quot;&gt;SQLite’s limits page&lt;/a&gt;, which makes a similar point.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Historical Context&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bruce MacLennan, the originator of the rule, was kind enough to get in contact with me about this post. He said:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Of course, the Zero-One-Infinity Principle was intended as a design principle for programming languages, and similar things, in order to keep them cognitively manageable. I formulated it in the early 70s, when I was working on programming language design and annoyed by all the arbitrary numbers that appeared in some of the languages of the day. I certainly have no argument against estimates, limits, or numbers in general! As you said, the problem is with &lt;em&gt;arbitrary&lt;/em&gt; numbers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;I don’t think I used it in print before I wrote my 1983 PL book. Dick Hamming encouraged me to organize it around principles (a la Kernighan &amp;amp; Plauger and Strunk &amp;amp; White), and the Zero-One-Infinity Principle was one of the first. (FWIW, the name “Zero-One-Infinity Principle” was inspired by George Gamow’s book, “One, Two, Three… Infinity,” which I read in grade school.)&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    
    <item>
      <title>How Amazon Web Services Uses Formal Methods</title>
      <link>http://brooker.co.za/blog/2015/03/29/formal.html</link>
      <pubDate>Sun, 29 Mar 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/03/29/formal</guid>
      <description>&lt;h1 id=&quot;how-amazon-web-services-uses-formal-methods&quot;&gt;How Amazon Web Services Uses Formal Methods&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Now in CACM.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-services-uses-formal-methods/fulltext&quot;&gt;How Amazon Web Services Uses Formal Methods&lt;/a&gt; is in this month’s Communications of the ACM. This version isn’t changed much from the versions that have been online for a few months, but it’s great to see it get some more attention.&lt;/p&gt;

&lt;p&gt;In the same issue of CACM is Leslie Lamport’s &lt;a href=&quot;http://cacm.acm.org/magazines/2015/4/184705-who-builds-a-house-without-drawing-blueprints/fulltext&quot;&gt;Who Builds a House without Drawing Blueprints?&lt;/a&gt;. Fans of his writing won’t find anything new in there, but it’s a perspective and opinion that I love to see gain more traction.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We think in order to understand what we are doing. If we understand something, we can explain it clearly in writing. If we have not explained it in writing, then we do not know if we really understand it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And the conclusion:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thinking does not guarantee that you will not make mistakes. But not thinking guarantees that you will.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s a very good take on the subject. As our experiences at Amazon have shown, specification can be an extremely powerful tool in the system designer’s and programmer’s toolbox. It’s even more useful in a team setting, where the ability to communicate particularly tough ideas formally and concisely really helps collaboration.&lt;/p&gt;

&lt;p&gt;Other good formal methods reading this week:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://smalldatum.blogspot.com/2015/03/formal-methods-in-real-world.html&quot;&gt;A post by Mark Callaghan&lt;/a&gt; about using Spin for MySQL development. I haven’t spent as much time with Spin (or Promela) as I would like, but it’s very interesting.&lt;/li&gt;
  &lt;li&gt;Adrian Colyer wrote a good mini-series this week on SPL, one looking at &lt;a href=&quot;http://blog.acolyer.org/2015/03/25/samc-semantic-aware-model-checking-for-fast-discovery-of-deep-bugs-in-cloud-systems/&quot;&gt;deep bugs in distributed systems&lt;/a&gt; and the other at &lt;a href=&quot;http://blog.acolyer.org/2015/03/23/combining-static-model-checking-with-dynamic-enforcement-using-the-statecall-policy-language/&quot;&gt;the background of SPL&lt;/a&gt;. He finished up with &lt;a href=&quot;http://blog.acolyer.org/2015/03/26/lineage-driven-fault-injection/&quot;&gt;Lineage-Driven Fault Injection&lt;/a&gt;. All three posts, and the papers behind them, are good reading.&lt;/li&gt;
  &lt;li&gt;Not really from this week, or this decade, or century, but still worth it - Lamport’s article brought me back to &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/what-good.pdf&quot;&gt;What Good is Temporal Logic?&lt;/a&gt;, one of my favorite papers from him. It’s extremely interesting to see how his thinking, and chosen framing, has evolved in the last 32 years.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    <item>
      <title>Jitter: Making Things Better With Randomness</title>
      <link>http://brooker.co.za/blog/2015/03/21/backoff.html</link>
      <pubDate>Sat, 21 Mar 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/03/21/backoff</guid>
      <description>&lt;h1 id=&quot;jitter-making-things-better-with-randomness&quot;&gt;Jitter: Making Things Better With Randomness&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Jitter is a good thing.&lt;/p&gt;

&lt;p&gt;Two weeks ago, I wrote an article titled &lt;a href=&quot;http://www.awsarchitectureblog.com/2015/03/backoff.html&quot;&gt;Exponential Backoff and Jitter&lt;/a&gt; for the AWS Architecture blog. It looks at OCC in particular, but the lessons are applicable to all distributed systems. The bottom line is that exponential backoff is good, but not sufficient to prevent both wasted time and wasted effort.&lt;/p&gt;
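&lt;p&gt;The &lt;em&gt;full jitter&lt;/em&gt; variant from that article can be sketched in a few lines. This is a minimal illustration rather than the article’s exact code; the function name and default parameters here are my own:&lt;/p&gt;

```python
import random

def full_jitter_backoff(attempt, base=0.1, cap=10.0):
    # Exponential backoff sets the size of the window; full jitter
    # then sleeps a uniformly random time within it, so competing
    # clients spread out instead of retrying in lockstep.
    return random.uniform(0, min(cap, base * 2 ** attempt))

# A retry loop would call time.sleep(full_jitter_backoff(attempt))
# after each failed attempt.
```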

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/expo_backoff.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Communication in distributed systems isn’t the only place that adding randomness comes in handy. It’s a remarkably widespread idea that’s found use across many areas of engineering. The basic pattern across all these fields is the same: randomness is a way to prevent systematically doing the wrong thing when you don’t have enough information to do the right thing.&lt;/p&gt;

&lt;p&gt;One classic distributed systems example is in the paper &lt;a href=&quot;http://ee.lbl.gov/papers/sync_94.pdf&quot;&gt;The Synchronization of Periodic Routing Messages&lt;/a&gt; (thanks &lt;a href=&quot;https://news.ycombinator.com/user?id=tptacek&quot;&gt;tptacek&lt;/a&gt;). Sally Floyd&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; and Van Jacobson&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; simulate synchronization emerging in previously unsynchronized systems communicating over a network. This leads to short-lived spikes in contention, and other correlated effects on the network. Their solution is to add randomness, which breaks the loop that creates synchronization. While the exact set of protocols and technologies they look at is very 1990s, the lessons are timeless.&lt;/p&gt;

&lt;p&gt;Closely related to these uses of jitter is dither, or adding noise to prevent artifacts when quantizing. Dither is most visible in images, where it can make a huge difference in quality&lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://commons.wikimedia.org/wiki/File:Dithering_example_undithered_web_palette.png&quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/Dithering_example_undithered_web_palette.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://commons.wikimedia.org/wiki/File:Dithering_example_dithered_web_palette.png&quot;&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/Dithering_example_dithered_web_palette.png&quot; alt=&quot;&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Technically, dither is a way to remove correlation between quantization error and the signal being quantized. That sounds complex, but the underlying concept is extremely simple. Imagine a simple system where we’re rounding a vector of reals to the nearest integer. If those reals are nicely distributed, it works well, but sometimes it works very poorly. If we start with&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ 1.4, 1.4, 1.3, 1.4, 1.2, 1.4, 1.1, 1.0, 1.4 ]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;it rounds to&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;leaving the error&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ 0.4, 0.4, 0.3, 0.4, 0.2, 0.4, 0.1, 0.0, 0.4 ]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are two problems here. We’ve introduced a bias, because all the errors are positive. The error also looks a whole lot like the signal, and there’s clearly information in the signal that’s left in the error. The solution is to add some noise, the simplest case being uniform noise of up to half a quantization level in either direction (in our case, between -0.5 and 0.5).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ 1.52, 1.09, 1.34, 1.04, 1.31, 1.83, 0.93, 1.49, 1.67 ]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After rounding, we’re left with the error&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ -0.6,  0.4,  0.3,  0.4,  0.2, -0.6,  0.1,  0.0, -0.6 ]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which has much less bias (a total of -0.4 versus +2.6), and the remaining noise doesn’t look like the underlying signal. That’s a good thing if you care about spectral artifacts.&lt;/p&gt;
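&lt;p&gt;The whole worked example above fits in a few lines. This is a minimal sketch with my own variable names; since the noise is random, the exact dithered errors differ from run to run:&lt;/p&gt;

```python
import random

signal = [1.4, 1.4, 1.3, 1.4, 1.2, 1.4, 1.1, 1.0, 1.4]

# Plain rounding: every error is positive and tracks the signal.
plain_error = [x - round(x) for x in signal]

# Dithered rounding: add uniform noise between -0.5 and 0.5 before
# rounding, then measure the error against the original signal.
dithered = [round(x + random.uniform(-0.5, 0.5)) for x in signal]
dither_error = [x - q for x, q in zip(signal, dithered)]

bias_plain = sum(plain_error)    # about +2.6: systematically high
bias_dither = sum(dither_error)  # typically much closer to zero
```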

&lt;p&gt;One point of talking about jitter and dither together is to point out the similarities. In both cases, we’re looking to spread out our error. In the case of jitter it’s error that we have because we don’t have complete knowledge of our distributed system. In the case of dither it’s error we’re introducing to have the opportunity to throw out some information. The other point is to invite thought about the advanced techniques of dither (such as &lt;a href=&quot;http://en.wikipedia.org/wiki/Error_diffusion&quot;&gt;error diffusion&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Noise_shaping&quot;&gt;noise shaping&lt;/a&gt;) and whether they have useful analogs in distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; &lt;a href=&quot;http://www.icsi.berkeley.edu/icsi/gazette/2007/09/sally-floyd-sigcomm-award&quot;&gt;Apparently&lt;/a&gt; “the eighth most highly cited researcher in all of computer science”, which is impressive.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Every time I hear Van Jacobson’s name, I wonder what his first name is.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Images from &lt;a href=&quot;http://en.wikipedia.org/wiki/user:Wapcaplet&quot;&gt;Wapcaplet&lt;/a&gt; on wikimedia commons.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Electoral Trouble in Sybilania</title>
      <link>http://brooker.co.za/blog/2015/03/03/sybil.html</link>
      <pubDate>Tue, 03 Mar 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/03/03/sybil</guid>
      <description>&lt;h1 id=&quot;electoral-trouble-in-sybilania&quot;&gt;Electoral Trouble in Sybilania&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;A Small Town Struggles to Achieve a Fair Vote.&lt;/p&gt;

&lt;p&gt;Sybilania&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; is a small town on the banks of the Orange river, near where the river turns north toward the Augrabies falls. The main street runs sleepily from the exclusive retirement communities on the river to the grounds of the Northern Cape Rugby Club, champions eight years in a row. On the way, it passes some knitting shops, the church, the grocer, Piet’s print shop, and the cavernous town hall. Like every small town, there is no greater event in Sybilania’s calendar than the local elections. Everybody wants to be, or at least know, the mayor.&lt;/p&gt;

&lt;p&gt;The recall of Mayor Piet had rocked the town. Not so much the recall itself - nobody had liked him anyway - but the reason. Electoral fraud! It was a crime against politics, a crime against morals, and a crime against the very ideals of democracy. Bridge games, golf courses and changing rooms were all filled with talk of the next election. Sybilania’s political scene was dominated by three parties: the Rugby party, populated by the town’s fit and athletic; the River party, populated by the wealthy retirees who lived on spacious estates near the water; and the Bridge party, dominated by the town’s regional-champion card enthusiasts. The three parties didn’t agree on much, but they did agree that there would be no ballot stuffing ever again.&lt;/p&gt;

&lt;p&gt;You see, that summer the town had voted by mail for the first time. Piet, playboy owner of Piet’s print shop, had won in a landslide. The town had been surprised, both by Piet’s victory, and the record turnout of three times the town’s estimated population. An electoral commission was formed, and tasked with finding a fair way to run elections. They had their work cut out for them. The local branch of Home Affairs had closed in the ’30s, and nearly nobody had ID.&lt;/p&gt;

&lt;p&gt;The first commission-guided election was held at the high school. The school’s fence had been collapsing for years, and the election was a perfect time to form a work party. Everybody in town reported to the school early one morning, dug holes, and raised posts. The person who dug each hole was allowed to carve the name of a candidate onto each post. A strong and athletic young lady from the rugby club took the mayor’s office that year, in a landslide. The other parties demanded that the electoral commission tear down the proof-of-work&lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; system. They left the fence standing.&lt;/p&gt;

&lt;p&gt;The next election happened in a small room at the office of the town surveyor. The town had been divided into a fine grid, the title for the land each grid segment was on was consulted, and its owner was called to ask for their vote. The process ran well into the night, and most of the next day. By mid-afternoon, the mayor from the River party was confirmed. While they knew better than to demand another recall election, the community demanded a replacement for proof-of-stake.&lt;/p&gt;

&lt;p&gt;The most recent election was run on the rugby club’s main field, right between the posts (which, incidentally, hold the Guinness record for tallest posts in the Southern Hemisphere). Everybody in town arrived wearing a mask (to make sure the vote was secret), and arranged themselves in a wide circle around the field. In turn, each citizen shouted out the name of a candidate, followed by the tally they heard from the previous voter, updated with their vote. As the vote went around the ring, everybody could hear that the tally was fairly kept, and no cheating occurred. The crowd fixed a few mistakes over the course of the afternoon, but left happy that the election was free and fair. The Bridge party won comfortably. A few dissenters still complain about seeing the buses&lt;sup&gt;&lt;a href=&quot;#foot4&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; of bridge clubs from neighbouring towns in the car park that day, but nothing has ever been proven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; &lt;a href=&quot;http://research.microsoft.com/pubs/74220/IPTPS2002.pdf&quot;&gt;The Sybil Attack&lt;/a&gt; by John R. Douceur is a very readable paper, almost definitely more readable than this fiction. I recommend it.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; Ralph Merkle proposed the &lt;a href=&quot;http://www.merkle.com/1974/PuzzlesAsPublished.pdf&quot;&gt;use of puzzles&lt;/a&gt; for proof-of-work, because it’s hard to make computers dig holes.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; You could probably call the weaker people in the community &lt;a href=&quot;http://www.collinjackson.com/research/papers/iptps.pdf&quot;&gt;strength-challenged imposters&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot4&quot;&gt;&lt;/a&gt; New personalities for Sybil can come from either a printing press or a bridge-club bus.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>Does Bitcoin Solve Byzantine Consensus?</title>
      <link>http://brooker.co.za/blog/2015/02/28/bitcoin.html</link>
      <pubDate>Sat, 28 Feb 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/02/28/bitcoin</guid>
      <description>&lt;h1 id=&quot;does-bitcoin-solve-byzantine-consensus&quot;&gt;Does Bitcoin Solve Byzantine Consensus?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;An Interesting New Publication on Bitcoin and Consensus.&lt;/p&gt;

&lt;p&gt;The Bitcoin community is a fascinating mixture of political idealists, technology enthusiasts, entrepreneurs, investors and others. One group that’s increasingly prominent is distributed systems researchers, attracted to some of the interesting problems around Bitcoin and the blockchain. There’s plenty of interesting work to come, but some valuable research has already been done. Much of this work focuses on the theoretical core of bitcoin, and shows real progress towards answering concerns about bitcoin’s safety and liveness bounds.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;https://eprint.iacr.org/2014/765.pdf&quot;&gt;The Bitcoin Backbone Protocol: Analysis and Applications&lt;/a&gt;, Garay, Kiayias and Leonardos write about the core of Bitcoin, which they call the &lt;em&gt;backbone&lt;/em&gt;. The argument for the correctness of the core of bitcoin from Satoshi’s original paper is far from satisfying:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The majority decision is represented by the longest chain, which has the greatest proof-of-work effort invested in it. If a majority of CPU power is controlled by honest nodes, the honest chain will grow the fastest and outpace any competing chains.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Garay et al. attack this core argument directly, and analyze the exact safety and liveness properties of the protocol. The contribution that’s going to launch a million online arguments is that bitcoin does not solve the Byzantine agreement problem&lt;sup&gt;&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is because in case the adversary finds a solution first, then every honest player will extend the adversary’s solution and switch to the adversarial input hence abandoning the original input.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Nakamoto’s protocol does not quite solve BA since it does not satisfy Validity with overwhelming probability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Their argument hinges on the &lt;em&gt;validity&lt;/em&gt; property of Byzantine agreement (or, rather, &lt;em&gt;strong validity&lt;/em&gt; &lt;sup&gt;&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;), and showing that the chosen value may not be one of the inputs to an honest player. In their definition of Byzantine agreement, agreements are only &lt;em&gt;valid&lt;/em&gt; if they pick the input of one of the honest players. That doesn’t appear to be true of the bitcoin protocol as implemented.&lt;/p&gt;

&lt;p&gt;Reducing the practical importance of this result, they also prove a &lt;em&gt;chain quality&lt;/em&gt; property. This property puts an upper bound on how often a dishonest player’s entry will be added to the chain &lt;sup&gt;&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. That’s obviously critically important for liveness, and preventing denial-of-service against honest players.&lt;/p&gt;

&lt;p&gt;I find this kind of research on Bitcoin very interesting. The community has very strong opinions on the safety and liveness of bitcoin. Until recently, there was little evidence to support these opinions. Proving bitcoin’s distributed systems properties is very useful, even though there are still many interesting questions around topics like scalability and &lt;a href=&quot;http://www.jbonneau.com/doc/BMCNKF15-IEEESP-bitcoin.pdf&quot;&gt;economic incentives&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Footnotes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt; See &lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Lynch/podc85.pdf&quot;&gt;Easy Impossibility Proofs for Distributed Consensus Problems&lt;/a&gt; for very approachable definitions of the problem, and obviously &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf&quot;&gt;The Byzantine Generals Problem&lt;/a&gt; for the classic definition.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt; They cite Neiger &lt;a href=&quot;https://smartech.gatech.edu/bitstream/handle/1853/6776/GIT-CC-93-45.pdf&quot;&gt;Distributed Consensus Revisited&lt;/a&gt;, who provides a definition of &lt;em&gt;strong validity&lt;/em&gt; (stronger than &lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Lynch/podc85.pdf&quot;&gt;Fischer, Lynch and Merritt&lt;/a&gt;’s) and a nice justification for why that’s desirable.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt; Obviously related to well known &lt;a href=&quot;https://freedom-to-tinker.com/blog/randomwalker/why-the-cornell-paper-on-bitcoin-mining-is-important/&quot;&gt;selfish mining&lt;/a&gt; attacks.&lt;/li&gt;
&lt;/ol&gt;
</description>
    </item>
    
    <item>
      <title>A Quiet Defense of Patterns</title>
      <link>http://brooker.co.za/blog/2015/01/25/patterns.html</link>
      <pubDate>Sun, 25 Jan 2015 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2015/01/25/patterns</guid>
      <description>&lt;h1 id=&quot;a-quiet-defense-of-patterns&quot;&gt;A Quiet Defense of Patterns&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Twenty years late to the party.&lt;/p&gt;
&lt;p&gt;I find myself coming back to &lt;a href=&quot;http://www.dreamsongs.com/Files/PatternsOfSoftware.pdf&quot;&gt;Patterns of Software&lt;/a&gt; every few years. I think about it often, mostly when I am doing code reviews. One great part is the front matter: a short debate between the author and Christopher Alexander, first author of the much-celebrated &lt;em&gt;A Pattern Language&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The elements of this language are entities called patterns. Each pattern describes a problem which occurs over and over in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice. - &lt;em&gt;A Pattern Language&lt;/em&gt;, Alexander et al&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In some programming circles, Alexander’s book is treated with religious reverence. A kind of Tao Te Ching of oblique anecdotes. Concrete enough to sound solid, but not enough to be actionable. A source of in-jokes and unhelpful advice. It’s also a source of conflict for this same group, because it was an inspiration for something widely reviled: the &lt;a href=&quot;http://en.wikipedia.org/wiki/Design_Patterns&quot;&gt;Gang of Four&lt;/a&gt; book.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;No design patterns are necessary. In any language. - &lt;a href=&quot;http://programmers.stackexchange.com/a/157946/92093&quot;&gt;Jan Hudec&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When Design Patterns first came out, back in the mid-90s, it captivated me. My access to technical books was limited, and I didn’t have a copy of the book itself, but for a short time I was obsessed with the debate about it. It incited anger, it incited self-righteousness, it incited smugness about &lt;em&gt;missing language features&lt;/em&gt;. For each of these loud critics, it seemed to have an equal and opposite supporter. From the community’s reaction, I couldn’t wait to read &lt;em&gt;Design Patterns&lt;/em&gt;. Judging by the controversy, I felt like it must be a deeply important book, with something profound to say about software and those who build it.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;As for literary criticism in general: I have long felt that any reviewer who expresses rage and loathing for a novel or a play or a poem is preposterous. He or she is like a person who has put on full armor and attacked a hot fudge sundae or a banana split - Kurt Vonnegut&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my mind, Erich Gamma was a Martin Luther figure. He had written a book that was creating a whole new church, ripping open old wounds and providing new courage to both sides. Imagine my disappointment when I finally got my hands on a copy. Instead of Luther’s protest, I found a taxonomy written by stamp collectors.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;My overall bias is that technology, science, engineering, and company organization are all secondary to the people and human concerns in the endeavor. Companies, ideas, processes, and approaches ultimately fail when humanity is forgotten, ignored, or placed second. Alexander knew this, but his followers in the software pattern language community do not. Computer scientists and developers don’t seem to know it, either. - Richard P Gabriel&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Gabriel is right about how many of us have missed the point of Alexander’s work. We’ve seen it as an exercise in taxonomy, or phylogeny, and missed the fact that it’s primarily a human, rather than technical, endeavour. We should go looking for that aspect of it again, because the human side of our field is broken. We could use all the help we can get. We’ve also missed the range of scale of Alexander’s work, concerned with patterns from the deeply technical to broad ideas with scope across entire societies. To live up to Alexander’s vision in our own field we would need to be doing something much deeper than the Gang of Four did. &lt;em&gt;Design Patterns&lt;/em&gt; isn’t software’s &lt;em&gt;A Pattern Language&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The more we can feel all the connections in the language, the more rich and subtle are the things we say at the most ordinary times.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Design Patterns&lt;/em&gt; isn’t software’s &lt;em&gt;A Pattern Language&lt;/em&gt;. It doesn’t have to be. The concept is much more useful.&lt;/p&gt;

&lt;p&gt;The most obvious way that it’s useful is in enabling high-bandwidth conversations by building shared context. Two people with a common set of patterns find it easier to communicate - even if the goal is to reject certain patterns - than those without one. Another advantage, and common area of criticism, is in education. Teaching common patterns makes people more effective communicators, and naming and classifying patterns makes them easier to teach.&lt;/p&gt;

&lt;p&gt;A third advantage, perhaps less obvious, is that writing down our shared context lowers the barrier to entry. High bandwidth conversations are needed for efficient teamwork. Effective teams build, and use, a shared context. This is healthy for the team, but can make it difficult to break in. Context can become an impenetrable shield that makes it more difficult to bring others into the group. Whether we intend it or not, this can make groups appear exclusive or exclusionary.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Functional languages are extremely expressive. In a functional language one does not need design patterns because the language is likely so high level, you end up programming in concepts that eliminate design patterns all together. - &lt;a href=&quot;http://www.defmacro.org/ramblings/fp.html&quot;&gt;Slava Akhmechet&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Describing and naming patterns is a powerful way to build context, but is not exhaustive. There is no risk of getting to the point where we have described all patterns, and reduced all communication to references to patterns. Context can’t replace communication. At the same time, a list of common patterns isn’t a monotonically growing thing. Patterns are frequently split, combined, superseded, replaced, destroyed or forgotten.&lt;/p&gt;

&lt;p&gt;Patterns themselves are also dependent on context. Some apply well to object-orientated programming, some to functional programming, some to running design meetings, some to mentoring and some to building large-scale systems. This isn’t a weakness of the idea of patterns, but a strength. They are sensitive to scale, too. Some patterns of success at one scale, or in one context, may be patterns of failure at another scale, or in another context. Claims that a particular list of patterns is complete, either in support or criticism, are likely wrong.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;16 of 23 patterns have qualitatively simpler implementation in Lisp or Dylan than in C++ for at least some uses of each pattern. - &lt;a href=&quot;http://norvig.com/design-patterns/design-patterns.pdf&quot;&gt;Peter Norvig&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While recognizing different scales of patterns is critical, they can’t be totally ordered by scale. The debate around patterns-as-flaws in programming languages appears to make this mistake: claiming superiority by demonstrating that some patterns are irrelevant due to their scale. This school of thought then claims that the patterns at its own scale and above aren’t really patterns at all, because it believes it has no use for patterns.&lt;/p&gt;

&lt;p&gt;This thinking is flawed in two ways. The glaring flaw is in the restrictive definition of patterns. The more subtle flaw is in not recognizing that they have patterns of their own at similar scales to the ones that were rejected. Abstraction is extremely powerful, but operating at higher levels of abstraction doesn’t appear to imply higher productivity or reduced needs for patterns as a medium for sharing context.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Design patterns are a form of complexity. As with all complexity, I’d rather see developers focus on simpler solutions before going straight to a complex recipe of design patterns. - &lt;a href=&quot;http://blog.codinghorror.com/rethinking-design-patterns/&quot;&gt;Jeff Atwood&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shared context and communication is important, but so is programmer productivity. As we well know, productivity comes from more than the ability to type fast. More than any other factor, productivity comes from solving the right problems. Sometimes that means using solutions that exist in libraries or the platform. More often that means re-using solutions we’ve found before, or ones we’ve learned from others. Being productive requires two things: a rich mental library, and the skills to access that library.&lt;/p&gt;

&lt;p&gt;A rich mental library can only be built by experience. Experience isn’t best accumulated with, or measured with, time. Instead, it’s built by solving problems and reading and understanding the solutions of others.&lt;/p&gt;

&lt;p&gt;As important as the size of the library are the skills to access it. The first step is matching the current problem to the library, or pattern matching. The second step is taking past solutions and adapting them to the exact context. This is seldom a mental (or physical) copy-and-paste exercise. The third part of using this mental library of patterns is taste. Taste means knowing when not to use a pattern. It means carefully adapting patterns to the context of the problem.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thinking doesn’t guarantee that we won’t make mistakes. But not thinking guarantees that we will. - &lt;a href=&quot;http://www.wired.com/2013/01/code-bugs-programming-why-we-need-specs/&quot;&gt;Leslie Lamport&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I love programming. I’ve fallen in love with the craft of programming. I see similar love in the criticism of design patterns (and formal methods, but that’s another topic). That love of craft is great, and important.&lt;/p&gt;

&lt;p&gt;When it comes to building working software in the long term, the emotional pursuit of craft is not as important as the human pursuit of teamwork, or the intellectual pursuit of correctness. The idea of patterns is one of the most powerful we have. The critics may be right that it devalues the craft, but we would all do well to remember that the craft of software is a means, not an end.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Make Your Program Slower With Threads</title>
      <link>http://brooker.co.za/blog/2014/12/06/random.html</link>
      <pubDate>Sat, 06 Dec 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/12/06/random</guid>
      <description>&lt;h1 id=&quot;make-your-program-slower-with-threads&quot;&gt;Make Your Program Slower With Threads&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;How much do context switches matter?&lt;/p&gt;

&lt;p&gt;Years ago, while taking a numerical methods course, I wrote some code to calculate the expected number of shared birthdays in a group. The code is very simple: each attempt constructs a vector of N birthdays, then counts the duplicates. The outer loop runs millions of attempts, and calculates the mean number of shared birthdays across all the samples. It’s little more than a tight loop around a pseudo-random number generator.&lt;/p&gt;
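&lt;p&gt;&lt;em&gt;A minimal single-threaded sketch of that experiment (in Python rather than the original code; names and the attempt count are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
import random

def count_shared(days):
    # A run of k equal birthdays contributes k - 1 shared birthdays,
    # so the count is people minus distinct days.
    return len(days) - len(set(days))

def simulate(n, attempts, rng):
    # Mean number of shared birthdays in a group of n, over many attempts.
    # The inner list comprehension is the hot loop: one PRNG call per birthday.
    total = 0
    for _ in range(attempts):
        days = [rng.randrange(365) for _ in range(n)]
        total += count_shared(days)
    return total / attempts

print(simulate(23, 100000, random.Random(7)))
```

&lt;p&gt;&lt;em&gt;For a group of 23 people the mean comes out near 0.69 shared birthdays.&lt;/em&gt;&lt;/p&gt;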

&lt;p&gt;I was also learning about threading at the time, and decided that I could speed up my program by running it on the lab’s shiny dual-core machine. I knew that communicating between threads was expensive, so I had each of my threads calculate their attempts in parallel, and merge the results right at the end. I was expecting a great speedup. Much to my disappointment, though, the multi-threaded version was slower. Much, much, slower.&lt;/p&gt;

&lt;p&gt;Much like the &lt;a href=&quot;http://en.wikipedia.org/wiki/Birthday_problem&quot;&gt;birthday paradox&lt;/a&gt; runs counter to our intuition about statistics, the behavior of bad multi-threaded programs runs counter to our intuition about computer performance. We’re used to computers being much faster than they used to be, and single-threaded efficiency mattering less than it used to in most cases. Counter to that intuition, the gap between &lt;em&gt;good&lt;/em&gt; and &lt;em&gt;bad&lt;/em&gt; multithreaded programs has gotten worse over time.&lt;/p&gt;

&lt;p&gt;To illustrate just how bad it can be, I replicated my program from back then. It’s not much more than a multi-threaded tight loop around &lt;em&gt;random(3)&lt;/em&gt;. It’s nice and quick single-threaded: running 10 million attempts in under 7 seconds. Going up to two threads makes it a bit faster, down to less than 6 seconds. When we hit three threads (on my four core Haswell E3-1240), it all goes horribly wrong:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/threads_bar.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To figure out what’s wrong, we can turn to Linux’s excellent &lt;a href=&quot;https://perf.wiki.kernel.org/index.php/Tutorial&quot;&gt;perf&lt;/a&gt; tool. Running the 1-thread and 4-thread versions with &lt;em&gt;perf stat&lt;/em&gt; make it obvious that something’s going on. For 1 thread:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;     3,788,352 L1-dcache-load-misses #0.03% of all L1-dcache hits
43,399,424,441 instructions  #1.46  insns per cycle
           734 context-switches
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and for four threads:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  4,110,904,396 L1-dcache-load-misses #6.88% of all L1-dcache hits
248,853,610,160 instructions # 0.51  insns per cycle
     15,993,647 context-switches
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Two things are going wrong here. One is that we’re seeing many more L1 cache misses with more threads, but the bigger issue is that we’re seeing &lt;em&gt;a whole lot more&lt;/em&gt; context switches. The effect of both of these is visible in the much lower &lt;em&gt;instructions per cycle&lt;/em&gt; of the second version. There’s no nice constant for the cost of a context switch, but a good modern estimate is around 3μs. Multiplying 3μs by 16 million context switches gives 48 seconds, which is a good hint that we’re headed in the right direction. So, what’s causing the context switches?&lt;/p&gt;

&lt;p&gt;Back to &lt;em&gt;perf&lt;/em&gt;, this time running &lt;em&gt;perf record&lt;/em&gt; on the processes, followed by &lt;em&gt;perf report&lt;/em&gt;. First, the top few rows for the single-threaded version:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Overhead   Command  Shared Object    Symbol
# ........  ........  ..............   ........................
62.01%  birthday  libc-2.19.so         [.] msort_with_tmp.part.0
11.40%  birthday  libc-2.19.so         [.] __memcpy_sse2        
10.19%  birthday  birthday             [.] simulate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re spending 62% of the time sorting the array, which is used to find the duplicates. That’s about what I would have guessed. What about the version with four threads?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Overhead   Command  Shared Object  Symbol
# ........  ........  .............  ............
46.80%  birthday  [kernel.kallsyms]  [k] _raw_spin_lock   
 8.86%  birthday  libc-2.19.so       [.] __random           
 3.42%  birthday  libc-2.19.so       [.] __lll_lock_wait_private
 3.23%  birthday  [kernel.kallsyms]  [k] try_to_wake_up       
 2.95%  birthday  libc-2.19.so       [.] __random_r        
 2.79%  birthday  libc-2.19.so       [.] msort_with_tmp.part.0
 2.10%  birthday  [kernel.kallsyms]  [k] futex_wake 
 1.46%  birthday  [kernel.kallsyms]  [k] system_call  
 1.35%  birthday  [kernel.kallsyms]  [k] get_futex_value_locked 
 1.15%  birthday  [kernel.kallsyms]  [k] futex_wait_setup  
 1.14%  birthday  [kernel.kallsyms]  [k] futex_wait 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Well, that’s suspicious. There aren’t any locks in my code, but there are a whole lot of references to locks in the trace. &lt;em&gt;raw_spin_lock&lt;/em&gt; is obviously a candidate, and it’s suspicious to see so many &lt;a href=&quot;http://en.wikipedia.org/wiki/Futex&quot;&gt;futex&lt;/a&gt;-related calls. Something’s taking locks, and the fact that &lt;em&gt;random&lt;/em&gt; is near the top of the list makes it a likely candidate. Before we dive in there, though, let’s confirm that we’re doing a lot of syscalls:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo perf stat -e &apos;syscalls:sys_e*&apos; ./birthday
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which spits out a long list of system calls, most (like &lt;em&gt;mmap&lt;/em&gt;) with just a handful of hits. There are two huge outliers:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;46,889,267 syscalls:sys_enter_futex
46,889,267 syscalls:sys_exit_futex
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That confirms it: something’s taking a lot of futexes. Knowing whether it’s &lt;em&gt;random&lt;/em&gt; or not requires a dive into the &lt;em&gt;glibc&lt;/em&gt; source, which nearly instantly &lt;a href=&quot;https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/random.c;h=c75d1d96adecf5ac894ca752a4c54647014bd746;hb=9752c3cdbce2b3b8338abf09c8b9dd9e78908b8a#l194&quot;&gt;reveals something suspicious&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; /* POSIX.1c requires that there is mutual exclusion for the `rand&apos; and
  `srand&apos; functions to prevent concurrent calls from modifying common
   data.  */
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And, &lt;a href=&quot;https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/random.c;h=c75d1d96adecf5ac894ca752a4c54647014bd746;hb=9752c3cdbce2b3b8338abf09c8b9dd9e78908b8a#l292&quot;&gt;just a little bit further down&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; __libc_lock_lock (lock);
 (void) __random_r (&amp;amp;unsafe_state, &amp;amp;retval);
 __libc_lock_unlock (lock);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Getting rid of the locks means getting rid of one of two things: shared state, or the necessity to prevent concurrent modification to that state. It seems like the former is easier: reasoning about a data-race-safe PRNG is tricky. There are many good ways to get rid of shared state in the PRNG. Linux has one particularly convenient way: the C library exposes a reentrant random number generator called &lt;a href=&quot;http://man7.org/linux/man-pages/man3/random_r.3.html&quot;&gt;random_r&lt;/a&gt; (which is used by &lt;em&gt;random&lt;/em&gt;, as you can see from the snippet above). Dropping &lt;em&gt;random_r&lt;/em&gt; in place of &lt;em&gt;random&lt;/em&gt; has an amazing effect:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/threads_bar_second.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As expected, the context switches are way down and instructions per cycle is nicely improved:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;     4,166,540 L1-dcache-load-misses  # 0.04% of all L1-dcache hits
40,201,461,769 instructions # 1.43  insns per cycle
           572 context-switches
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
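&lt;p&gt;&lt;em&gt;The same fix, sketched in Python rather than the original C: give each thread its own PRNG instance instead of sharing one behind a lock (Python’s module-level random functions share a single hidden instance, much as random(3) shares one locked state).&lt;/em&gt;&lt;/p&gt;

```python
import random
import threading

def worker(out, i, draws):
    # Each thread owns its PRNG state, so nothing is shared between
    # threads: the analogue of swapping random(3) for random_r(3).
    rng = random.Random(i)
    out[i] = sum(rng.randrange(365) for _ in range(draws))

results = [0, 0, 0, 0]
threads = [threading.Thread(target=worker, args=(results, i, 10000))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```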

&lt;p&gt;I recognize that spinning on a tight loop on &lt;em&gt;random&lt;/em&gt; is a contrived example, but it’s not too far away from reality. Many programs that multi-thread for performance end up with library or system calls inside relatively tight loops. Our intuition about these things tends to follow &lt;a href=&quot;http://en.wikipedia.org/wiki/Amdahl%27s_law&quot;&gt;Amdahl’s law&lt;/a&gt;. At worst, it’s tempting to think, these things count as a non-parallel portion of code and lower the maximum achievable parallel speedup. In the real world, though, that’s not the case. Multi-threaded programs can, and very often do, run much more slowly than the equivalent single-threaded program.&lt;/p&gt;

&lt;p&gt;It’s just another thing that makes writing multi-threaded code difficult.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Two Farmers and Common Knowledge</title>
      <link>http://brooker.co.za/blog/2014/11/30/two-farmers.html</link>
      <pubDate>Sun, 30 Nov 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/11/30/two-farmers</guid>
      <description>&lt;h1 id=&quot;two-farmers-and-common-knowledge&quot;&gt;Two Farmers and Common Knowledge&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;A legislative solution to a technical problem.&lt;/p&gt;

&lt;p&gt;When their beloved father passed away, Jan and Marie inherited his famous wine estate. Jan and his family were given the historic homestead and half the grape harvesting equipment. Marie’s family were given the other half, and a graceful home with stunning views down a craggy valley in the Bottelary Hills. As a final practical joke, the old man put a strange condition into his will: one small vineyard was split between them, and they could only keep their farms if they met once a year on the same morning with all their inherited equipment to harvest the grapes together. The will insisted that they meet together at the vineyard, at a time when the grapes were perfectly ripe.&lt;/p&gt;

&lt;p&gt;The first few years went well. When either decided that the next day would be right for the harvest, they would send a farm worker on his bicycle to tell the other. Farm workers are highly reliable, and they harvested simultaneously for many years. Then, one year around Christmas, a Karaoke joint opened in the village between the two homes. It’s a poorly kept secret that farmers and farm workers love Karaoke, and the temptation to drop in to croon along with Sinatra often turned out to be too strong for the passing bicycling farm worker. The workers loved karaoke so much, they could sing essentially forever.&lt;/p&gt;

&lt;p&gt;In early July of the next year, Jan and Marie met to make a plan for that year’s harvest. Their farms’ remote locations meant that the telephone was out of the question, and SMS and email were yet to be discovered. They needed some way to make their cycling workers reliable again, and that meant finding a technical solution to the problem.&lt;/p&gt;

&lt;p&gt;“Ok,” said Jan. “All you need to do is send another message back when you get my message. When I get that message, I know you got my message.”&lt;/p&gt;

&lt;p&gt;“Don’t be dumb Jan. What if that one doesn’t arrive?”&lt;/p&gt;

&lt;p&gt;“Oh. Then you need to send a message back saying you got that message.”&lt;/p&gt;

&lt;p&gt;“Ja. Then you know that I know that you know about the harvest, and I know that you know.”&lt;/p&gt;

&lt;p&gt;“But if that one doesn’t come, then you don’t know that I know that you know that I know.”&lt;/p&gt;

&lt;p&gt;“One more message, and I know you know that I know, and you know that I know that you know that I know. Right?”&lt;/p&gt;

&lt;p&gt;Some hours, and most of a bottle of pot-stilled brandy, later Jan and Marie were still arguing.&lt;/p&gt;

&lt;p&gt;“Then one more guy, and you know I know you know I know you know I know you know I know…” Marie counted off the “you knows” and “I knows”, and Jan kept the tally with bottle corks. As the cork pile grew, the pair realized the approach wasn’t going to work, and talk shifted to discouraging Karaoke in the community. After making a plan to ask for it to be mentioned in that Sunday’s sermon, and drafting a list of reasons it wasn’t moral, Marie realized it was in vain.&lt;/p&gt;

&lt;p&gt;“Any chance, Jan. Any small chance and it’s all for &lt;a href=&quot;http://en.wiktionary.org/wiki/niks#Afrikaans&quot;&gt;niks&lt;/a&gt;. The messenger doesn’t have to get lost, it’s enough for one of us not to be sure. Besides, nobody listens to the &lt;a href=&quot;http://en.wiktionary.org/wiki/dominee#Afrikaans&quot;&gt;dominee&lt;/a&gt;.”&lt;/p&gt;

&lt;p&gt;The two farmers continued to struggle with the problem for days, without any result. Eventually, on a visit to the local library, they came across a &lt;a href=&quot;http://65.54.113.26/Publication/3768450/some-issues-in-distributed-processes-communication&quot;&gt;paper from 1979&lt;/a&gt; by Yemini and Cohen. In it, they found their worst fears confirmed: they were going to lose the farm. As long as there is any probability that a messenger gets lost, no algorithm can guarantee they meet on the same morning to harvest. It got worse. In a 1984 &lt;a href=&quot;https://www.cs.cornell.edu/home/halpern/papers/common_knowledge.pdf&quot;&gt;paper by Halpern and Moses&lt;/a&gt; they found that even if all the messengers did eventually arrive, they still couldn’t agree, unless the amount of Karaoke they sang was bounded. They felt like all hope was lost.&lt;/p&gt;

&lt;p&gt;Just as they were ready to leave the library in desperation, Jan read something that made his heart jump:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;On the other hand, if messages are guaranteed to be delivered within ε units of time, then ε-coordinated attack can be accomplished.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;“Hey Marie! Did you see that paper by &lt;a href=&quot;http://researcher.watson.ibm.com/researcher/files/us-fagin/apal99.pdf&quot;&gt;Fagin&lt;/a&gt;? We need to read that Halpern and Moses one again!”&lt;/p&gt;

&lt;p&gt;“Moses?”&lt;/p&gt;

&lt;p&gt;“Probably not the same Moses.”&lt;/p&gt;

&lt;p&gt;Jan and Marie made a copy of the paper, and headed home. Putting the Halpern and Moses paper next to their father’s will, they discovered something amazing. The will didn’t say they have to meet at the field at exactly the same time! It just had to be the same morning. If they could set the maximum amount of Karaoke a messenger could sing to ε, they could meet at the fields within ε of each other. As long as ε is less than a morning, they could keep the farm. Being a well-funded land owner, Marie’s run for local government was as quick as her career was short. She stayed long enough to propose only one law. To this day, it’s illegal to sing karaoke for more than ε seconds in that small &lt;a href=&quot;http://en.wiktionary.org/wiki/dorp#English&quot;&gt;dorp&lt;/a&gt; in that valley in the Bottelary Hills.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Exactly-Once Delivery May Not Be What You Want</title>
      <link>http://brooker.co.za/blog/2014/11/15/exactly-once.html</link>
      <pubDate>Sat, 15 Nov 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/11/15/exactly-once</guid>
      <description>&lt;h1 id=&quot;exactly-once-delivery-may-not-be-what-you-want&quot;&gt;Exactly-Once Delivery May Not Be What You Want&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;It&apos;s hard to get, but that&apos;s OK, because you don&apos;t want it.&lt;/p&gt;

&lt;p&gt;Last week, there was a good discussion on &lt;a href=&quot;http://lobste.rs&quot;&gt;lobste.rs&lt;/a&gt; about &lt;a href=&quot;https://lobste.rs/s/ecjfcm/why_is_exactly-once_messaging_not_possible_in_a_distributed_queue&quot;&gt;why exactly-once messaging is not possible&lt;/a&gt;. The discussion was kicked off with a link to a paper from Patel et al titled &lt;a href=&quot;http://datasys.cs.iit.edu/publications/2014_SCRAMBL14_HDMQ.pdf&quot;&gt;Towards In-Order and Exactly-Once Delivery using Hierarchical Distributed Message Queues&lt;/a&gt;, which claims to contribute:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;… a highly scalable distributed queue service using hierarchical architecture that supports exactly once delivery, message order, large message size, and message resilience.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I haven’t evaluated the author’s other claims in detail, but the claim of exactly once delivery caught my eye.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;There is no chance of getting two get requests for the same message. When a HTTP message request comes in, a message is sent through HTTP response and the message is deleted at the same time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While I’m not fully satisfied about their &lt;em&gt;at the same time&lt;/em&gt;, they don’t seem to be claiming to break any fundamental laws here. What I do feel is fundamental, though, is that this definition of &lt;em&gt;exactly once delivery&lt;/em&gt; isn’t the one that most systems builders would find useful. The effect that most people are interested in is actually exactly-once processing: each message having a particular side effect exactly once.&lt;/p&gt;

&lt;p&gt;I like to think about this in terms of redundancy. Fault-tolerant distributed systems deal with all kinds of failures, but it’s often practically useful to break them into two categories: node failure and message loss. Node failures can be tolerated with redundancy &lt;em&gt;in space&lt;/em&gt;, having multiple copies of a piece of data on multiple nodes. Message loss can be tolerated with redundancy &lt;em&gt;in time&lt;/em&gt;, sending the same message multiple times if it doesn’t seem to have been received. Replicated databases are redundant in space, and &lt;a href=&quot;http://en.wikipedia.org/wiki/Transmission_Control_Protocol&quot;&gt;TCP&lt;/a&gt; is a great example of redundancy in time. &lt;em&gt;Side note: I stole this characterization from the excellent talk titled &lt;a href=&quot;http://www.youtube.com/watch?v=ggCffvKEJmQ&quot;&gt;Outwards From The Middle of the Maze&lt;/a&gt; by Peter Alvaro.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Think about a simple system making use of an &lt;em&gt;exactly once&lt;/em&gt; queue. There’s some producer (which we’ll mostly ignore), the queue (which we’ll pretend has magic durability and availability properties), and the focus of the discussion: a fleet of consumers. The producer makes tasks asking the consumers to alter the world in some way. We could build this system in a few ways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The queue passes each task to exactly one consumer exactly once. If the consumer fails the task is lost, and the system does each task &lt;em&gt;at most once&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;We could ask the consumer to acknowledge the message once it’s been processed. If the consumer fails to do that after some amount of time, the queue will offer it to another consumer. This makes the system tolerant of consumer failure, but a consumer that merely stalls could pick the work up again when it recovers, causing multiple deliveries. This system ends up being &lt;em&gt;at least once&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;To fix the stall problem, we could put a timeout on the task itself, saying “don’t perform this task at all if you can’t get it done by five o’clock on Friday”. While we can do this in a way that doesn’t require the queue and consumer to synchronize their clocks, we at least have to depend on the relative rates of their clocks being bounded.&lt;/li&gt;
  &lt;li&gt;We could pass the task to multiple consumers, and ask them to co-ordinate amongst themselves which of them will execute it. That’s a reasonable solution from the queue’s perspective, but it just moves the problem down to the consumer.&lt;/li&gt;
&lt;/ul&gt;
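&lt;p&gt;The second design above, acknowledgements plus a redelivery timer, can be sketched in a few lines. This is a toy model rather than any real queue’s API; the visibility-timeout name and the numbers are assumptions for illustration:&lt;/p&gt;

```python
# Toy model of the second design: an at-least-once queue with a visibility
# timeout. A task handed to a consumer is offered again if it is not
# acknowledged in time, so a consumer that merely stalls causes a duplicate.
import heapq

class AtLeastOnceQueue:
    def __init__(self, visibility_timeout):
        self.timeout = visibility_timeout
        self.tasks = {}        # task_id mapped to payload, until acked
        self.redelivery = []   # heap of (deadline, task_id) pairs

    def put(self, task_id, payload, now):
        self.tasks[task_id] = payload
        heapq.heappush(self.redelivery, (now, task_id))

    def receive(self, now):
        while self.redelivery:
            deadline, task_id = self.redelivery[0]
            if now >= deadline and task_id in self.tasks:
                heapq.heappop(self.redelivery)
                # Schedule a redelivery in case this consumer stalls.
                heapq.heappush(self.redelivery, (now + self.timeout, task_id))
                return (task_id, self.tasks[task_id])
            if task_id in self.tasks:
                return None    # nothing visible yet
            heapq.heappop(self.redelivery)   # already acked; drop the entry
        return None

    def ack(self, task_id):
        self.tasks.pop(task_id, None)

q = AtLeastOnceQueue(visibility_timeout=30)
q.put("t1", "bake a pizza", now=0)
assert q.receive(now=0) == ("t1", "bake a pizza")   # first delivery
assert q.receive(now=31) == ("t1", "bake a pizza")  # stalled: delivered again
q.ack("t1")
assert q.receive(now=62) is None                    # acked; no more deliveries
```

&lt;p&gt;The second delivery at &lt;code&gt;now=31&lt;/code&gt; is exactly the duplicate in question: nothing failed, a consumer just took too long.&lt;/p&gt;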

&lt;p&gt;And so on. There’s always a place to slot in &lt;a href=&quot;http://en.wikipedia.org/wiki/Turtles_all_the_way_down&quot;&gt;one more turtle&lt;/a&gt;. The bad news is that I’m not aware of a nice solution to the general problem for all side effects, and I suspect that no such solution exists. On the bright side, there are some very nice solutions that work really well in practice. The simplest is &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2187821&quot;&gt;idempotence&lt;/a&gt;. This is a very simple idea: we make the tasks have the same effect no matter how many times they are executed.&lt;/p&gt;

&lt;p&gt;Consider Bob, distributed systems enthusiast and pizza restaurateur. When people order from Bob, their orders go into a persistent queue. Bob’s workers take a pizza order off the queue, bake it, deliver it, and go back to the queue. Occasionally one of Bob’s workers gets bored and leaves early in the middle of a task, in which case Bob gives the order to a different worker. Sometimes, this means that multiple pizzas arrive at the customer’s house (though never fewer than one pizza). Bob doesn’t want people to end up with excess pizza, so he does something very smart: he gives each order a unique identifier. On arriving, the pizza delivery guy asks the home owner if they had received a pizza with that order ID before. If the home owner says yes, the pizza guy takes the duplicate pie with him. If not, he leaves the pie. Each home owner gets exactly one pie, and everybody is happy.&lt;/p&gt;

&lt;p&gt;In Bob’s world, pizza baking and delivery is an &lt;em&gt;at least once&lt;/em&gt; operation, but pizza delivery into the customer’s house happens &lt;em&gt;exactly once&lt;/em&gt; thanks to the fact that his deliveries are idempotent. Bob’s obviously got a strong incentive to reduce pizza waste. He tries to make sure that &lt;em&gt;at least once&lt;/em&gt; is also &lt;em&gt;approximately once&lt;/em&gt;, which is easy most of the year, but can be a real challenge when it’s stormy out and the big game is on.&lt;/p&gt;
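&lt;p&gt;Bob’s scheme can be sketched as an idempotent consumer: delivery stays at least once, but the side effect is applied at most once per order ID. A minimal sketch with hypothetical names; in a real system the deduplication set would have to be durable and shared, which is its own distributed systems problem:&lt;/p&gt;

```python
# Minimal sketch of idempotent processing on top of at-least-once delivery.
# The queue may hand the same task over more than once; the consumer records
# processed IDs so the side effect happens once per ID.

def make_consumer(side_effect):
    seen = set()   # order IDs whose side effect has already happened

    def handle(order_id, payload):
        if order_id in seen:
            return "duplicate"   # take the extra pie back
        side_effect(payload)
        seen.add(order_id)
        return "processed"

    return handle

delivered = []
handle = make_consumer(delivered.append)
handle("order-1", "margherita")
handle("order-1", "margherita")   # redelivery after a stalled worker
assert delivered == ["margherita"]
```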

&lt;p&gt;I think there are two lessons here for people building distributed systems. One is that end-to-end system semantics matter much more than the semantics of an individual building block, and sometimes what seems like a very desirable semantic for a building block may end up making the end-to-end problem harder. The other is that simple, practical, solutions like unique IDs can make really hard problems much easier, and allow us to build and ship real systems that work in predictable ways.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Ice Cream and Distributed Systems</title>
      <link>http://brooker.co.za/blog/2014/10/25/ice-cream.html</link>
      <pubDate>Sat, 25 Oct 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/10/25/ice-cream</guid>
      <description>&lt;h1 id=&quot;ice-cream-and-distributed-systems&quot;&gt;Ice Cream and Distributed Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Can we serve a fair amount of ice cream?&lt;/p&gt;

&lt;p&gt;When I was a child, I really liked to eat ice cream. It’s still pretty great, but back then I was somewhat fanatical about it. My parents knew that the delicious mix of fat and sugar was best served only occasionally, and would carefully limit my intake. I, of course, found a way around the system. First, I’d go to my mother and ask if it was time for ice cream, and she would give me an answer. If she answered in the negative, I’d ask my father the same question. That strategy increased my chances of an affirmative answer, because the decisions that my parents made were not consistent. Occasionally, I’d even eat some served by my mother, and then try my father for a second bowl.&lt;/p&gt;

&lt;p&gt;After a while of running this scheme, my parents figured it out. They decided that they needed to give me a consistent answer, and the only way to do that was to talk to each other every time I asked the question. Their coordination approach worked great. It guaranteed a consistent answer, and only made young Marc wait a little longer for his question to be answered.&lt;/p&gt;

&lt;p&gt;It all broke down when my parents went to work. Being a child, I could find a good excuse to speak to either parent at any time, but their jobs prevented them from speaking to each other. Once again, I could use the situation to my rich, sweet, and creamy advantage. With my parents unable to communicate, I was able to force an inconsistent decision. Did my parents miss a trick that would have allowed them to make a consistent ice cream serving decision without being able to talk to one another?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Assume that the network consists of at least two nodes. Thus it can be divided into two disjoint, non-empty sets: {G1, G2}. The basic idea of the proof is to assume that all messages between G1 and G2 are lost. Then if a write occurs in G1, and later a read occurs in G2, then the read operation cannot return the results of the earlier write operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Assuming that my parents didn’t have watches, and had to make the decision based only on the messages they have received and their internal state, &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.67.6951&quot;&gt;Gilbert and Lynch&lt;/a&gt; proved that they can’t have made a consistent decision in general. That’s a general result about writes and reads. Could they do better in this specific case?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting crafty with clocks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Around that time, my parents’ ice cream policy was that I got one bowl a week. They met before work every day when I hadn’t reached my weekly allocation, and decided that for the next eight hours only one of them could hand out an ice cream decision. If it was my mother’s day, and I called my father for a decision, he told me that he couldn’t give me one. As long as I could contact my mother, I could get a consistent answer. If I couldn’t reach my mother, I was out of luck. The one saving factor, though, was that if mom worked late, dad would notice the eight hours had expired and make a decision.&lt;/p&gt;

&lt;p&gt;Soon, being the crafty young man I was, I realized that dad’s watch ran slightly faster than mom’s. When he got home, I’d go to my dad and ask for a serving of vanilla. He’d look at his watch, see that eight hours had expired, and assume that my mom had lost the authority to make the decision. After checking that the bowl in the freezer was still full, to make sure my mom hadn’t decided during the day, he’d allow me to eat. I’d wolf down my frozen treat, and call my mom. Her slower watch told her the eight hours weren’t up yet, and she’d give me the go ahead for a second bowl. I’d beaten the system!&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[the client’s lease time] is shortened by … the allowance E for clock skew.
As a minimum, the correct functioning of leases requires only that clocks have a known bounded drift&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If my parents had read &lt;a href=&quot;http://web.stanford.edu/class/cs240/readings/89-leases.pdf&quot;&gt;Gray and Cheriton&lt;/a&gt;, they would have known how to fix their lease protocol. My mom and dad would have had to measure the skew between the rate of their watches, and added some time (E) to the time my dad waited before assuming my mom didn’t have the lease anymore.&lt;/p&gt;
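&lt;p&gt;A sketch of the repaired lease check, under the assumption of a known drift allowance E: the lease holder stops deciding at the deadline itself, while everyone else waits an extra E past the deadline before assuming the lease has lapsed. The duration and the value of E here are made up for illustration:&lt;/p&gt;

```python
# Sketch of a lease check with a clock-skew allowance E, in the spirit of
# Gray and Cheriton: the holder stops using the lease at grant + duration,
# while anyone else waits an extra E, so a slightly fast watch cannot
# create two decision-makers at once.

LEASE_DURATION = 8 * 3600   # Mom's decision window, in seconds
E = 60                      # allowance for bounded clock drift (assumed)

def holder_may_decide(now, grant_time):
    # The holder is conservative: the lease is gone at the deadline itself.
    return grant_time + LEASE_DURATION > now

def taker_may_assume_expired(now, grant_time):
    # A non-holder is also conservative: it waits an extra E.
    return now > grant_time + LEASE_DURATION + E

t = LEASE_DURATION + 30     # Dad's fast watch, inside the skew window
assert not holder_may_decide(t, 0.0)          # Mom has stopped deciding
assert not taker_may_assume_expired(t, 0.0)   # and Dad must still wait
```

&lt;p&gt;With the extra wait of E, dad’s fast watch can no longer open a window in which both parents believe they hold the decision.&lt;/p&gt;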

&lt;p&gt;&lt;strong&gt;Putting the results back together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thanks to a diet fad sweeping the nation, my parents decided that ice cream wasn’t as bad as they assumed. Being responsible parents they still wanted to track my consumption to check their hypothesis. During the work day, my mom and dad went back to making inconsistent decisions, and each just kept their own records of how often they said yes. Once they were home together again, they could add up their numbers to get an accurate total.&lt;/p&gt;

&lt;p&gt;Tracking flavors was a little bit harder. Every time I called to request a scoop, they would write down which flavor I got permission for. Occasionally I’d go to the freezer and find that flavor was out, and I’d call and ask to reduce my tally. Being a small child with a short memory, I couldn’t remember if mom or dad had recorded the &lt;em&gt;yes&lt;/em&gt;, so I’d call one at random to record the &lt;em&gt;no&lt;/em&gt;. That didn’t matter, because they could still total their independent counts at the day’s end to get an accurate tally. An accurate tally, that is, until disaster struck.&lt;/p&gt;

&lt;p&gt;I’d received permission from mom to eat some strawberry gelato, but found none in the usual place between the ice trays and frozen juice. I called her back to report the failure, but the line dropped before I could say goodbye. Distraught at being rudely disconnected from my mother, I called my dad to report the same thing. When the tally was done at the end of the day, my parents were baffled by the count of negative one. Had we invented anti-gelato?&lt;/p&gt;

&lt;p&gt;My parents unfortunately hadn’t kept track of developments in &lt;em&gt;conflict-free replicated data types&lt;/em&gt;. If they had, they could have solved this problem with an &lt;a href=&quot;https://hal.inria.fr/hal-00738680/PDF/RR-8083.pdf&quot;&gt;OR set&lt;/a&gt;, by tracking additions and removals with unique tags. If they had been armed with that paper, and research on CRDTs new and old, could they have gone back to restricting my intake? The intuition is obvious: if we can count something independently, and we can manage a set independently, can we enforce one bowl per day independently? Unfortunately not. The important difference is that &lt;em&gt;add one&lt;/em&gt; and &lt;em&gt;add to set&lt;/em&gt; are &lt;a href=&quot;http://en.wikipedia.org/wiki/Commutative_property&quot;&gt;commutative&lt;/a&gt;, while &lt;em&gt;reduce by one if the count is greater than zero&lt;/em&gt; isn’t.&lt;/p&gt;
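&lt;p&gt;A tiny sketch of why the independent tallies worked until the dropped call: increments and decrements merge by summing, so they commute, but nothing stops the same removal being recorded twice:&lt;/p&gt;

```python
# Independent per-parent tallies merge by summing, because increments and
# decrements commute. Nothing, however, stops the same removal from being
# recorded twice, which is how the count of negative one appeared.

def merge(tallies):
    return sum(tallies.values())

tallies = {"mom": 0, "dad": 0}
tallies["mom"] += 1   # permission for strawberry gelato
tallies["dad"] -= 1   # the "no gelato" report goes to Dad...
tallies["mom"] -= 1   # ...and, after the dropped call, to Mom as well
assert merge(tallies) == -1   # anti-gelato

# A guard like "only decrement if my local count is above zero" does not
# commute, and would not have helped: Dad's local count was already zero
# while the matching increment lived on Mom's side.
```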

&lt;p&gt;&lt;strong&gt;Getting everybody to agree&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Following their frustrating battle with flavor tracking, my parents asked their part-time housekeeper Mary to help with the problem. After losing faith in fad diet books, my parents both dedicated part of their work day to investigating the health properties of ice cream, and frequently changed their opinion. They also wanted to keep careful track of how much I was served, in case the dosage turned out to be important to my health. Mom and Dad agreed with Mary that she could allow me to eat some as long as all of them agreed when I asked. Mary was happy with this, but there was one big problem: she didn’t like talking on the phone. Fortunately, she loved to send text messages. Unfortunately, texts were still strangely expensive back then.&lt;/p&gt;

&lt;p&gt;Mary, Mom and Dad sat down and tried to figure out how to all agree on the problem with the fewest number of messages. Mary invented a simple scheme: when I asked her if I could have some ice cream, she messaged both my mom and dad to ask for their opinion, while asking them not to change their opinion until hearing back from her. If they both agreed, she’d go ahead and let them know she was going to serve dessert. If either said no, she let them know that the bowl would remain empty. The protocol, which they called &lt;a href=&quot;http://en.wikipedia.org/wiki/Two-phase_commit_protocol&quot;&gt;two-phase commit&lt;/a&gt; after the frozen and liquid phases of ice cream, took four messages to complete. Could Mary do better and save some money on text messages?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Any commitment protocol … requires at least 2(n - 1) messages to commit a transaction in the absence of processor failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Luckily for them, my parents didn’t waste too much time thinking about the problem. Mary came across a paper from &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=806705&quot;&gt;Cynthia Dwork and Dale Skeen&lt;/a&gt; which laid out what she needed to know. As long as Mary was sending text messages, there was no way to do better than her protocol.&lt;/p&gt;
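&lt;p&gt;Mary’s protocol can be sketched as follows, counting only the messages needed to reach the commit decision and leaving the outcome notifications out of the tally. With my mom and dad as the two participants, that’s the 2(n - 1) = 4 messages of the lower bound:&lt;/p&gt;

```python
# Sketch of Mary's two-phase commit over text messages: a vote-request
# round and the replies. The decision notifications are not counted here.

def two_phase_commit(votes):
    # votes: participant name mapped to a yes/no vote (coordinator excluded)
    messages = 0
    replies = []
    for name, vote in votes.items():
        messages += 1        # coordinator texts a vote request
        replies.append(vote)
        messages += 1        # participant texts its vote back
    decision = "commit" if all(replies) else "abort"
    return (decision, messages)

# Two participants (Mom and Dad), so n = 3 nodes and 2(n - 1) = 4 messages.
assert two_phase_commit({"mom": True, "dad": True}) == ("commit", 4)
assert two_phase_commit({"mom": True, "dad": False})[0] == "abort"
```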
</description>
    </item>
    
    <item>
      <title>Harvest and Yield: Not A Natural Cure for Tradeoff Confusion</title>
      <link>http://brooker.co.za/blog/2014/10/12/harvest-yield.html</link>
      <pubDate>Sun, 12 Oct 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/10/12/harvest-yield</guid>
      <description>&lt;h1 id=&quot;harvest-and-yield-not-a-natural-cure-for-tradeoff-confusion&quot;&gt;Harvest and Yield: Not A Natural Cure for Tradeoff Confusion&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Comments on a 15 year old paper.&lt;/p&gt;

&lt;p&gt;As I wrote about in my &lt;a href=&quot;http://brooker.co.za/blog/2014/07/16/pacelc.html&quot;&gt;post on PACELC&lt;/a&gt;, I don’t think the CAP theorem is the right way for teachers to present distributed systems tradeoffs. I also don’t think it’s ideal for working practitioners, despite its wide use. I prefer Abadi’s &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;PACELC&lt;/a&gt;, but there are legitimate criticisms of that one too. One criticism is that it’s poorly formalized, which makes it hard to apply to precise statements. Another is that PC/EL is an awkward edge case. There are more. Fox and Brewer’s &lt;em&gt;harvest&lt;/em&gt; and &lt;em&gt;yield&lt;/em&gt; model, from &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.411&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Harvest, Yield, and Scalable Tolerant Systems&lt;/a&gt;, is a &lt;a href=&quot;http://codahale.com/you-cant-sacrifice-partition-tolerance/&quot;&gt;widely promoted&lt;/a&gt; alternative.&lt;/p&gt;

&lt;p&gt;While I like the concepts of &lt;em&gt;harvest&lt;/em&gt; and &lt;em&gt;yield&lt;/em&gt;, I find it hard to recommend the paper. Both &lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/&quot;&gt;Eric Brewer&lt;/a&gt; and &lt;a href=&quot;http://www.eecs.berkeley.edu/Faculty/Homepages/fox.html&quot;&gt;Armando Fox&lt;/a&gt; have made big contributions to the field, and I like many of their papers. I just don’t like this one.&lt;/p&gt;

&lt;p&gt;I’ll start with what I dislike most about it. From the first page:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Partition-resilience means that the system as whole can survive a partition between data replicas.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;CA without P: Databases that provide distributed transactional semantics can only do so in the absence of a network partition separating server peers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I find these statements awkward, and feel like they support the mistaken belief that CA is a valid option. You certainly can’t pick CA where your C is linearizability (as in &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Gilbert and Lynch’s&lt;/a&gt; proof) &lt;a href=&quot;http://www.bailis.org/blog/linearizability-versus-serializability/#fn:hardness&quot;&gt;or serializability&lt;/a&gt;. If you’re allowed to pick CA, either your definition of C is weaker than either of those, your definition of A doesn’t require minority partitions to make progress, or you’re in denial about network partitions (which &lt;a href=&quot;http://aphyr.com/posts/288-the-network-is-reliable&quot;&gt;do exist&lt;/a&gt;). Compare the definition of CP:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;CP without A: In the event of a partition, further transactions to an ACID database may be blocked until the partition heals, to avoid the risk of introducing merge conflicts (and thus inconsistency).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Is that saying that CA exists, but just introduces inconsistencies that CP doesn’t? Overall, I don’t follow Fox and Brewer’s thinking about &lt;em&gt;partition tolerance&lt;/em&gt; in this paper.&lt;/p&gt;

&lt;p&gt;On to the model itself, which concerns itself with:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;large applications whose output behavior tolerates &lt;em&gt;graceful degradation&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The idea of &lt;em&gt;graceful degradation&lt;/em&gt; is that a partial response may be more useful to the client than no response, so you can directly trade off between the completeness of the response and the availability of the system. Many real-world systems can tolerate partial responses, especially if you can provide some bounds on the definition of partial. Using probabilistic data structures like &lt;a href=&quot;http://en.wikipedia.org/wiki/Bloom_filter&quot;&gt;Bloom filters&lt;/a&gt; and the &lt;a href=&quot;http://www.cse.unsw.edu.au/~cs9314/07s1/lectures/Lin_CS9314_References/cm-latin.pdf&quot;&gt;count-min sketch&lt;/a&gt; is a widely accepted technique for scaling systems, and it makes sense to apply the same ideas to availability.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We assume that clients make queries to servers, in which case there are at least two metrics for correct behavior: &lt;em&gt;yield&lt;/em&gt;, which is the probability of completing a request, and &lt;em&gt;harvest&lt;/em&gt;, which measures the fraction of the data reflected in the response, i.e. the completeness of the answer to the query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Yield&lt;/em&gt; is the availability metric that most practitioners end up working with, and it’s worth noting that it’s different from CAP’s &lt;em&gt;A&lt;/em&gt;. The authors don’t define it formally, but treat it as a long-term probability of response rather than the probability of a response conditioned on there being a failure. That’s a good common-sense definition, and one that fits well with the way that most practitioners think about availability.&lt;/p&gt;
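&lt;p&gt;To make the two metrics concrete, here’s a toy model of a query fanned out over data shards. Nothing here comes from the paper; the shape is just one common way the tradeoff shows up in practice:&lt;/p&gt;

```python
# Toy model of harvest and yield for a query fanned out over data shards:
# yield is the fraction of requests answered at all, harvest is the
# fraction of the data reflected in each answer.

def fan_out_query(shards, degrade_gracefully):
    alive = [s for s in shards if s["up"]]
    if len(alive) == len(shards) or degrade_gracefully:
        harvest = len(alive) / len(shards)
        return {"answered": True, "harvest": harvest}
    return {"answered": False, "harvest": 0.0}   # fail the whole request

shards = [{"up": True}, {"up": True}, {"up": False}, {"up": True}]

strict = fan_out_query(shards, degrade_gracefully=False)
partial = fan_out_query(shards, degrade_gracefully=True)
assert not strict["answered"]       # yield suffers: no answer at all
assert partial["answered"]          # yield maintained...
assert partial["harvest"] == 0.75   # ...at the cost of harvest
```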

&lt;blockquote&gt;
  &lt;p&gt;In the presence of faults there is typically a tradeoff between providing no answer (reducing yield) and providing an imperfect answer (maintaining yield, but reducing harvest).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s a very powerful idea, and one worth sharing.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Even when the 100%-harvest answer is useful to the client, it may still be preferable to trade response time for harvest when client-to-server bandwidth is limited, for example, by intelligent degradation to low-bandwidth formats.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another good idea, and one that has been &lt;a href=&quot;http://www.opera.com/mobile/mini&quot;&gt;widely used&lt;/a&gt;. As good an idea as it is, though, the paper is conflating at least three separate ideas: shrinking data to conserve bandwidth, responding with cached data to reduce latency, and responding with a partial response to preserve availability. A more precise definition of &lt;em&gt;harvest&lt;/em&gt; would be very useful, as would definitions of the different availability, latency and bandwidth tradeoffs.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The actual benefit is the ability to provision each subsystem’s state management separately, providing strong consistency or persistent state only for the subsystems that need it, not for the entire application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s yet another good idea, as is the concept of &lt;em&gt;orthogonal mechanisms&lt;/em&gt; from section 5. Again, the problem is that the idea isn’t fully developed, and has some significant edge cases.&lt;/p&gt;

&lt;p&gt;I really like the concepts of &lt;em&gt;harvest&lt;/em&gt; and &lt;em&gt;yield&lt;/em&gt;, and many of the other ideas in this paper. I just find the whole thing hard to recommend as a unit. It feels like a bag full of unmarked tools. A sharp scalpel. A rusty hammer. A glass bottle of &lt;a href=&quot;http://pipeline.corante.com/archives/2010/02/23/things_i_wont_work_with_dioxygen_difluoride.php&quot;&gt;FOOF&lt;/a&gt;. A nice microscope. There’s a lot in there to like, but sticking your hand in and rummaging around may do more harm than good.&lt;/p&gt;

&lt;p&gt;In any case, &lt;em&gt;harvest&lt;/em&gt; and &lt;em&gt;yield&lt;/em&gt; isn’t really a CAP replacement. The CAP theorem is one boundary of the space of possible designs, a fence that can’t be crossed. Fox and Brewer’s ideas are more about the shape of the landscape inside the fence. That’s useful knowledge, but it’s really in a different category from CAP.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Essential Barbara Liskov</title>
      <link>http://brooker.co.za/blog/2014/09/21/liskov-pub.html</link>
      <pubDate>Sun, 21 Sep 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/09/21/liskov-pub</guid>
      <description>&lt;h1 id=&quot;the-essential-barbara-liskov&quot;&gt;The Essential Barbara Liskov&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Some of my favorite Barbara Liskov publications.&lt;/p&gt;

&lt;p&gt;Barbara Liskov is one of the greats of computer science. Over a research career nearing 45 years, she’s had a resounding impact on multiple different fields, and received an impressive list of honors and awards, including the &lt;a href=&quot;http://amturing.acm.org/award_winners/liskov_1108679.cfm&quot;&gt;2009 Turing Award&lt;/a&gt;. In the same spirit as &lt;a href=&quot;http://brooker.co.za/blog/2014/03/30/lamport-pub.html&quot;&gt;The Essential Leslie Lamport&lt;/a&gt; and &lt;a href=&quot;http://brooker.co.za/blog/2014/05/10/lynch-pub.html&quot;&gt;The Essential Nancy Lynch&lt;/a&gt;, I thought I’d write about some of my favorite Liskov papers. These are just papers I like or found particularly interesting, and I’m likely to have missed some you like.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;What does it mean for one type to be a subtype of another? We argue that this is a semantic question having to do with the behavior of the objects of the two types: the objects of the subtype ought to behave the same as those of the supertype as far as anyone or any program using the supertype objects can tell.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.acm.org/citation.cfm?doid=62139.62141&quot;&gt;Data abstraction and hierarchy&lt;/a&gt;, &lt;a href=&quot;http://dl.acm.org/citation.cfm?doid=197320.197383&quot;&gt;A behavioral notion of subtyping&lt;/a&gt; and &lt;a href=&quot;http://reports-archive.adm.cs.cmu.edu/anon/1999/CMU-CS-99-156.ps&quot;&gt;Behavioral subtyping using invariants and constraints&lt;/a&gt; are why most working programmers would recognize Liskov’s name. The &lt;em&gt;Liskov Substitution Principle&lt;/em&gt;, widely known as the &lt;em&gt;L&lt;/em&gt; in SOLID, is a widely-followed rule about the relationship between the behavior of a supertype and the behavior of its subtypes.&lt;/p&gt;

&lt;p&gt;I’ll readily admit that types are an area of computer science that I’m not very familiar with, but I found these papers easy to follow and very applicable. For working programmers, there’s not much material there that isn’t covered in the &lt;a href=&quot;http://en.wikipedia.org/wiki/Liskov_substitution_principle&quot;&gt;wiki page&lt;/a&gt;, but it’s worth reading to see how Liskov lays out the arguments for the principle. If you’re interested in the history and thinking behind the rule, these are great papers to read.&lt;/p&gt;
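&lt;p&gt;For the code-minded, the principle is easy to state as a sketch: code written against a supertype should keep working, unchanged, for any subtype instance. A minimal Python illustration (my example, not one from the papers):&lt;/p&gt;

```python
# Sketch of the substitution principle: callers written against the
# supertype cannot tell which subtype they were handed.

class Shape:
    def area(self):
        raise NotImplementedError

class Rectangle(Shape):
    def __init__(self, w, h):
        self.w, self.h = w, h
    def area(self):
        return self.w * self.h

class Square(Rectangle):
    # Behaves exactly like a 1:1 Rectangle, so substitution holds here.
    def __init__(self, side):
        super().__init__(side, side)

def total_area(shapes):
    # Written against Shape; must work for every well-behaved subtype.
    return sum(s.area() for s in shapes)

assert total_area([Rectangle(2, 3), Square(4)]) == 22
```

&lt;p&gt;With mutable setters, the familiar square-extends-rectangle design breaks this rule, which is exactly the kind of behavioral mismatch the papers formalize.&lt;/p&gt;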

&lt;blockquote&gt;
  &lt;p&gt;Availability is achieved through replication.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Transaction processing depends on forcing information to backups so that a majority of cohorts know about particular events.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://pmg.csail.mit.edu/papers/vr.pdf&quot;&gt;Viewstamped Replication: A New Primary Copy Method to Support Highly Available Distributed Systems&lt;/a&gt; deserves to be recognized as one of the most influential papers in distributed systems. Viewstamped replication predates Lamport’s Paxos, but solves the same problem in a very similar (though &lt;a href=&quot;http://www.cs.cornell.edu/fbs/publications/viveLaDifference.pdf&quot;&gt;distinct&lt;/a&gt;) way. The viewstamped replication paper remains both readable and relevant, although some of the descriptions and formalisms used show the paper’s age. If you only have time to read one paper on Viewstamped Replication, the recent &lt;a href=&quot;http://pmg.csail.mit.edu/papers/vr-revisited.pdf&quot;&gt;Viewstamped Replication Revisited&lt;/a&gt; is probably a better bet, because it provides a clearer description of the protocol and the design decisions it makes.&lt;/p&gt;

&lt;p&gt;I’ve &lt;a href=&quot;http://brooker.co.za/blog/2014/05/19/vr.html&quot;&gt;written before&lt;/a&gt; about viewstamped replication, and why I think it should be more widely recognized for the contributions it made to consensus, and the idea of &lt;a href=&quot;https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf&quot;&gt;state machine replication&lt;/a&gt;. It would be great to see more knowledge about VR among distributed systems practitioners.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Unlike other multistamp (or vector clock) schemes, our scheme is based on time rather than on logical clocks: each entry in a multistamp contains a timestamp representing the clock time at some server in the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Version vectors (or vector clocks or multistamps) are a very widely used scheme for versioning data in distributed systems, but keeping them short in the face of reconfiguration and scaling is a real challenge. The &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=259425&quot;&gt;Lazy consistency using loosely synchronized clocks&lt;/a&gt; paper presents one approach, using loosely synchronized physical clocks. An interesting aspect of it is that it breaks from the orthodoxy that using clocks for ordering is bad (which it is, unless you are careful with your safety properties). If you find reading this worthwhile, I’d also recommend &lt;a href=&quot;ftp://ftp.cs.ucla.edu/tech-report/1997-reports/970022.ps.Z&quot;&gt;Dynamic Version Vector Maintenance&lt;/a&gt; and &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=266711&quot;&gt;Flexible update propagation for weakly consistent replication&lt;/a&gt;.&lt;/p&gt;
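&lt;p&gt;For readers who haven’t met version vectors, here is a minimal sketch of the classic scheme (per-node counters, pointwise-max merge) that these papers build on:&lt;/p&gt;

```python
# Classic version vectors: one counter per node. Merging takes the
# pointwise maximum; one version dominates another when it has seen
# everything the other has. When neither dominates, the writes were
# concurrent.

def merge(a, b):
    keys = set(a).union(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

def dominates(a, b):
    return all(a.get(k, 0) >= b.get(k, 0) for k in set(a).union(b))

v1 = {"node1": 2, "node2": 0}
v2 = {"node1": 1, "node2": 3}
assert not dominates(v1, v2) and not dominates(v2, v1)   # concurrent writes
assert merge(v1, v2) == {"node1": 2, "node2": 3}
```

&lt;p&gt;The vector grows with the number of nodes, which is exactly the pressure the paper relieves by using loosely synchronized physical timestamps instead of per-node logical counters.&lt;/p&gt;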

&lt;blockquote&gt;
  &lt;p&gt;One reason why Byzantine-fault-tolerant algorithms will be important in the future is that they can allow systems to continue to work correctly even when there are software errors. Not all errors are survivable; our approach cannot mask a software error that occurs at all replicas. However, it can mask errors that occur independently at different replicas, including non-deterministic software errors, which are the most problematic and persistent errors since they are the hardest to detect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Byzantine faults &lt;a href=&quot;http://www.rvs.uni-bielefeld.de/publications/DriscollMurphyv19.pdf&quot;&gt;happen all the time&lt;/a&gt; in the real world. Handling them, and making safe progress in distributed systems despite them, is a huge challenge. Just getting the theory right is tricky. Getting to a practical implementation of Byzantine fault tolerance is even harder. That’s what makes &lt;a href=&quot;http://pmg.csail.mit.edu/papers/osdi99.pdf&quot;&gt;Practical Byzantine fault tolerance&lt;/a&gt; so important. Castro and Liskov describe a practically implementable Byzantine fault tolerant system that works in a realistic system model. The key contribution here, over precursors like &lt;a href=&quot;https://www.cs.unc.edu/~reiter/papers/1994/CCS.pdf&quot;&gt;Rampart&lt;/a&gt;, is safety in real-world system models, especially removing assumptions about synchronicity.&lt;/p&gt;

&lt;p&gt;On a similar topic, &lt;a href=&quot;http://pmg.csail.mit.edu/papers/rodrigo_tr05.pdf&quot;&gt;Byzantine clients rendered harmless&lt;/a&gt; is also worth reading. Many approaches to Byzantine-fault-tolerant replication make both safety and liveness assumptions about the behavior of clients. Strengthening the protocol against client behavior is very important to practical systems.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Space Between Theory and Practice in Distributed Systems</title>
      <link>http://brooker.co.za/blog/2014/08/10/the-space-between.html</link>
      <pubDate>Sun, 10 Aug 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/08/10/the-space-between</guid>
      <description>&lt;h1 id=&quot;the-space-between-theory-and-practice-in-distributed-systems&quot;&gt;The Space Between Theory and Practice in Distributed Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;How do we learn synthesis?&lt;/p&gt;

&lt;p&gt;Teaching and learning about distributed systems, like any complex topic, requires real thought about what to teach and what to learn. It would be great to have enough time to teach and learn everything, but there’s just too much material out there. Even if we did have the time to cover every paper, result, code base, experience report and blog post in the field, we’d still need to choose an order to cover them in. There’s a natural order to the things to be learned that makes them much easier to learn. That’s why I was excited to see Henry Robinson’s &lt;a href=&quot;http://the-paper-trail.org/blog/distributed-systems-theory-for-the-distributed-systems-engineer/&quot;&gt;Distributed systems theory for the distributed systems engineer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It’s a good list of things to learn about, from the practical to the theoretical. I like the way it’s broken up into clear sections, and uses examples from many real-world systems from industry. This is a list I’m going to be recommending to people for a while, and going to be working through myself. Unfortunately, Henry’s list reflects a greater gap in the overall literature: the gap between theory and practice.&lt;/p&gt;

&lt;p&gt;From &lt;a href=&quot;http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo&lt;/a&gt; and &lt;a href=&quot;https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf&quot;&gt;Cassandra&lt;/a&gt; to &lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/en/us/archive/chubby-osdi06.pdf&quot;&gt;Chubby&lt;/a&gt; and &lt;a href=&quot;https://www.usenix.org/legacy/event/usenix10/tech/full_papers/Hunt.pdf&quot;&gt;ZooKeeper&lt;/a&gt; there’s a wealth of content available on the design and implementation of real systems. Some of these papers go into real depth on seemingly small details (like &lt;a href=&quot;http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf&quot;&gt;Paxos Made Live&lt;/a&gt;) while others concern themselves with high-level architecture. Combined with things like &lt;a href=&quot;https://blogs.oracle.com/jag/resource/Fallacies.html&quot;&gt;Deutsch’s 8 Fallacies&lt;/a&gt; and Jeff Hodges’ &lt;a href=&quot;http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/&quot;&gt;Notes on Distributed Systems for Young Bloods&lt;/a&gt;, there’s a lot of advice available about the practical side of distributed systems.&lt;/p&gt;

&lt;p&gt;On the theoretical side, there’s also a wealth of material. Robinson points to &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;the CAP proof&lt;/a&gt; and the &lt;a href=&quot;http://macs.citadel.edu/rudolphg/csci604/ImpossibilityofConsensus.pdf&quot;&gt;FLP&lt;/a&gt; result, and admits that he’s only just scratching the surface. There are thousands of good theoretical results out there, from the usual suspects like &lt;a href=&quot;http://brooker.co.za/blog/2014/03/30/lamport-pub.html&quot;&gt;Lamport&lt;/a&gt; and &lt;a href=&quot;http://brooker.co.za/blog/2014/05/10/lynch-pub.html&quot;&gt;Lynch&lt;/a&gt; to areas like &lt;a href=&quot;http://www.amazon.com/Distributed-Computing-Through-Combinatorial-Topology/dp/0124045782&quot;&gt;topology&lt;/a&gt; and &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/papers/Abraham11Distributed.pdf&quot;&gt;game theory&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I feel like if I went through everything I’ve read on distributed systems and arranged it on a spectrum from &lt;em&gt;theory&lt;/em&gt; to &lt;em&gt;practice&lt;/em&gt; the two ends would be really well populated, but the middle would be disturbingly empty. Worse, a graph of the citation links between them would show a low density of edges from theory to practice. I strongly believe that a deep knowledge of theory makes practitioners smarter and better. I believe that a deep knowledge of practice makes researchers’ work more relevant. It would be great to see more material in this gap.&lt;/p&gt;

&lt;p&gt;Many will point out at this stage that it’s not a complete gap. I’ll admit that there’s some great material there, including &lt;a href=&quot;http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf&quot;&gt;Paxos Made Live&lt;/a&gt; on the theory end of practice and Kenneth Birman’s &lt;a href=&quot;http://www.amazon.com/Guide-Reliable-Distributed-Systems-High-Assurance/dp/1447124154/&quot;&gt;Guide to Reliable Distributed Systems&lt;/a&gt; or Butler Lampson’s &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/blampson/58-Consensus/Acrobat.pdf&quot;&gt;How to Build a Highly Available System
Using Consensus&lt;/a&gt; on the practice end of theory. There are also blogs like &lt;a href=&quot;http://the-paper-trail.org/blog/&quot;&gt;Henry’s&lt;/a&gt; and &lt;a href=&quot;http://aphyr.com/&quot;&gt;Aphyr’s&lt;/a&gt; which do a good job in that gap. Despite this, I still see some big gaps in material. An example may be the easiest way to illustrate it:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If FLP says consensus is impossible with one faulty process, and faults happen all the time in practice, how are real systems built with consensus?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are a few ways to answer this question. One starts with pointing out that FLP talks about it being &lt;em&gt;not always possible&lt;/em&gt; to solve consensus, rather than &lt;em&gt;never possible&lt;/em&gt;. Another way is to point out that the real world is richer than FLP’s idealized model, and the problem can be solved with clocks or &lt;a href=&quot;http://brooker.co.za/blog/2014/01/12/ben-or.html&quot;&gt;a random oracle&lt;/a&gt;. A third way is to laugh derisively at the asker and point out that the answer is in &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt; (&lt;a href=&quot;https://www.hackerschool.com/manual#sec-environment&quot;&gt;feigning surprise&lt;/a&gt; &lt;em&gt;What‽ You haven’t read Paxos Made Simple‽&lt;/em&gt;).&lt;/p&gt;
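&lt;p&gt;To make the second answer concrete, here’s a toy, lock-step sketch of a Ben-Or-style randomized binary consensus in Python. It is emphatically &lt;em&gt;not&lt;/em&gt; the full asynchronous algorithm (a real adversarial scheduler is far nastier than random sampling), and all names and parameters are illustrative, but it shows the escape hatch: when no majority emerges, a local coin flip breaks the symmetry that the FLP adversary exploits.&lt;/p&gt;

```python
import random

def ben_or(n=7, f=2, seed=0, max_rounds=10000):
    """Toy, lock-step sketch of Ben-Or-style randomized binary consensus.
    Each round every process hears from only n-f peers (modeling an
    unlucky scheduler); local coins break the symmetry FLP relies on."""
    random.seed(seed)
    est = [random.randint(0, 1) for _ in range(n)]
    for rounds in range(1, max_rounds + 1):
        # Phase 1: propose v only if a strict majority of the n-f
        # estimates you heard carried v. Overlapping majorities mean two
        # processes can never propose conflicting values.
        props = []
        for _ in range(n):
            heard = random.sample(est, n - f)
            props.append(next((v for v in (0, 1)
                               if heard.count(v) > n // 2), None))
        # Phase 2: decide v on f+1 witnesses, adopt v on one witness,
        # otherwise flip a coin.
        est, decided = [], []
        for _ in range(n):
            heard = [p for p in random.sample(props, n - f) if p is not None]
            if len(heard) >= f + 1:
                est.append(heard[0]); decided.append(True)
            elif heard:
                est.append(heard[0]); decided.append(False)
            else:
                est.append(random.randint(0, 1)); decided.append(False)
        if all(decided):
            return est[0], rounds
    raise RuntimeError("did not terminate (probability ~0)")
```

&lt;p&gt;Agreement holds in every execution of this sketch; termination is only probabilistic. That distinction is exactly how randomization sidesteps FLP, which demands deterministic termination.&lt;/p&gt;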

&lt;p&gt;Despite these ‘obvious’ answers, it’s actually a really interesting question. On one side we see a researcher saying that consensus isn’t always possible, and on the other we hear practitioners talking about how they built highly-available systems using consensus algorithms. Who is right? Does the researcher have their head too far in the clouds? Is the practitioner so ignorant of theory that they have built a ticking time bomb?&lt;/p&gt;

&lt;p&gt;That’s the gap I am talking about: material that explains how the practice is synthesized from the theory, and how the theory is based on analysis of the practice. The exercise of synthesis is very seldom straightforward, but we too frequently leave it to the imagination. In this context, I use &lt;em&gt;synthesis&lt;/em&gt; to mean the process of gathering ideas from the literature and putting them together into a whole working system. Related processes include the analysis of other systems, where we break them down into their constituent parts and see what makes them work (or &lt;a href=&quot;http://aphyr.com/tags/Jepsen&quot;&gt;not work&lt;/a&gt;). These are among the most important processes behind successful engineering, but are written about least.&lt;/p&gt;

&lt;p&gt;I would love to see more material focused on exactly this synthesis problem in distributed systems, because I think it would help improve the quality of practice, and strengthen the dialog between practitioners and researchers. That’s good for all of us.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Use of Formal Methods at Amazon Web Services</title>
      <link>http://brooker.co.za/blog/2014/08/09/formal-methods.html</link>
      <pubDate>Sat, 09 Aug 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/08/09/formal-methods</guid>
      <description>&lt;h1 id=&quot;use-of-formal-methods-at-amazon-web-services&quot;&gt;Use of Formal Methods at Amazon Web Services&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;How we&apos;re using TLA+ at AWS&lt;/p&gt;

&lt;p&gt;Late last year, we published &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf&quot;&gt;Use of Formal Methods at Amazon Web Services&lt;/a&gt; about our experiences with using formal methods at Amazon Web Services (AWS). The focus is on &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html&quot;&gt;TLA+&lt;/a&gt;, and why we think it’s a great fit for the kind of work we do.&lt;/p&gt;

&lt;p&gt;From the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In order to find subtle bugs in a system design, it is necessary to have a precise description of that design. There are at least two major benefits to writing a precise design; the author is forced to think more clearly, which helps eliminate ‘plausible hand-waving’, and tools can be applied to check for errors in
the design, even while it is being written. In contrast, conventional design documents consist of prose, static diagrams, and perhaps pseudo-code in an ad hoc untestable language. Such descriptions are far from precise; they are often ambiguous, or omit critical aspects such as partial failure or the granularity of concurrency (i.e. which constructs are assumed to be atomic). At the other end of the spectrum, the final executable code is unambiguous, but contains an overwhelming amount of detail. We needed to be able to capture the essence of a design in a few hundred lines of precise description.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf&quot;&gt;full paper&lt;/a&gt; is worth reading if you’re interested in formal methods.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>CAP and PACELC: Thinking More Clearly About Consistency</title>
      <link>http://brooker.co.za/blog/2014/07/16/pacelc.html</link>
      <pubDate>Wed, 16 Jul 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/07/16/pacelc</guid>
      <description>&lt;h1 id=&quot;cap-and-pacelc-thinking-more-clearly-about-consistency&quot;&gt;CAP and PACELC: Thinking More Clearly About Consistency&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;CAP is confusing. PACELC is better, but still not ideal.&lt;/p&gt;

&lt;p&gt;In some sense, &lt;em&gt;the CAP theorem&lt;/em&gt; has been too successful. With its snappy name and apparently easy-to-understand behavior, CAP has become the go-to way of talking about tradeoffs in distributed systems. Despite its apparent simplicity, confusion, misunderstandings, misrepresentations and &lt;a href=&quot;http://stackoverflow.com/questions/12346326/nosql-cap-theorem-availability-and-partition-tolerance&quot;&gt;debates&lt;/a&gt; about CAP are widespread. Criticism of CAP is also widespread, both in its use as a teaching tool and as a way of reasoning about tradeoffs. &lt;a href=&quot;http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean&quot;&gt;Dan Weinreb&lt;/a&gt;, &lt;a href=&quot;http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/&quot;&gt;Henry Robinson&lt;/a&gt;, and &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext&quot;&gt;Michael Stonebraker&lt;/a&gt; have written good examples of the genre. It would be great to find a tool for teaching, and learning about, these tradeoffs that doesn’t have the same shortcomings as CAP.&lt;/p&gt;

&lt;p&gt;In this post, my focus is on CAP as a tool for learning and teaching about distributed systems tradeoffs, rather than its use by experienced practitioners. The problems with CAP as a teaching tool, as I see them, are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The popular belief is that CAP means &lt;em&gt;pick any two&lt;/em&gt;. The confusion is about the existence of CA systems, which pretend that partition tolerance is optional, or claim that partitions don’t happen. Marketing from database vendors plays into this, too. In reality, &lt;a href=&quot;http://codahale.com/you-cant-sacrifice-partition-tolerance/&quot;&gt;you can’t sacrifice partition tolerance&lt;/a&gt;, because &lt;a href=&quot;http://aphyr.com/posts/288-the-network-is-reliable&quot;&gt;partitions happen&lt;/a&gt; in real large-scale systems all the time. There is some evidence that the popularity of this misconception is waning, but it would be premature to announce its death.&lt;/li&gt;
  &lt;li&gt;The second misconception is that the CAP theorem means &lt;em&gt;you can’t be consistent and available during partitions&lt;/em&gt;. That’s not true, at least in the case of &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Gilbert and Lynch’s&lt;/a&gt; version of the theorem. Specifically, the CAP theorem only prevents &lt;em&gt;everybody&lt;/em&gt; from being consistent and available, not &lt;em&gt;anybody&lt;/em&gt; (some literature calls this &lt;em&gt;always available&lt;/em&gt;). It doesn’t prevent clients and replicas on the majority side of simple partitions from making progress, and experiencing both consistency and availability. There are other restrictions on this, but they aren’t CAP restrictions.&lt;/li&gt;
  &lt;li&gt;The third misconception is that the &lt;em&gt;consistency&lt;/em&gt; in CAP is all or nothing, and that you can’t offer any consistency guarantees at all during partitions. In reality, many very useful consistency models can be offered on all sides of a partition. Implementation tricks like session stickiness and client-side caching can allow systems to offer useful models like &lt;em&gt;read your writes&lt;/em&gt;, &lt;em&gt;monotonic reads&lt;/em&gt; and even &lt;em&gt;causal consistency&lt;/em&gt;. &lt;a href=&quot;http://research.microsoft.com/pubs/192621/sigtt611-bernstein.pdf&quot;&gt;Bernstein and Das&lt;/a&gt;, and &lt;a href=&quot;http://arxiv.org/pdf/1302.0309v2.pdf&quot;&gt;Bailis et al&lt;/a&gt; have good overviews of some of the possibilities.&lt;/li&gt;
  &lt;li&gt;The fourth misconception is that &lt;em&gt;eventual consistency&lt;/em&gt; is all about CAP, and that everybody would choose strong consistency for every application if it wasn’t for the CAP theorem.&lt;/li&gt;
  &lt;li&gt;Another source of confusion is the different versions of the CAP theorem, from &lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf&quot;&gt;Brewer’s original version&lt;/a&gt;, to his &lt;a href=&quot;http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed&quot;&gt;later writings&lt;/a&gt; to &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Gilbert and Lynch’s proof&lt;/a&gt;. Some seem to call the former &lt;em&gt;Brewer’s Conjecture&lt;/em&gt; and the latter &lt;em&gt;the CAP theorem&lt;/em&gt;, but this usage is far from universal. Typically, they’re both just called &lt;em&gt;CAP&lt;/em&gt; or &lt;em&gt;the CAP theorem&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
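&lt;p&gt;The second misconception is worth making concrete. A minimal sketch of a majority-quorum register (hypothetical names throughout, not any particular database) shows how clients on the majority side of a partition keep reading and writing consistently, while only the minority side loses availability:&lt;/p&gt;

```python
class QuorumRegister:
    """Toy majority-quorum read/write register (a sketch, not a real DB)."""
    def __init__(self, n=5):
        self.n = n
        self.replicas = {i: (0, None) for i in range(n)}  # id -> (version, value)

    def _quorum(self, reachable):
        need = self.n // 2 + 1              # strict majority: 3 of 5
        if len(reachable) < need:
            raise TimeoutError("minority side: cannot assemble a quorum")
        return sorted(reachable)[:need]

    def write(self, value, reachable):
        q = self._quorum(reachable)
        version = 1 + max(self.replicas[i][0] for i in q)
        for i in q:                          # any two majorities intersect,
            self.replicas[i] = (version, value)  # so later reads see this

    def read(self, reachable):
        q = self._quorum(reachable)
        return max(self.replicas[i] for i in q)[1]  # newest version wins

reg = QuorumRegister(n=5)
majority, minority = {0, 1, 2}, {3, 4}       # a clean 3|2 partition
reg.write("x=1", reachable=majority)         # majority side: available
assert reg.read(reachable=majority) == "x=1" # ... and consistent
try:
    reg.write("x=2", reachable=minority)     # minority side: unavailable
except TimeoutError:
    print("minority write timed out, as CAP requires")
```

&lt;p&gt;CAP forbids &lt;em&gt;every&lt;/em&gt; client making progress here, and indeed the minority side blocks; it says nothing against the majority side continuing to serve consistent operations.&lt;/p&gt;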

&lt;p&gt;This doesn’t mean that the CAP theorem, and Brewer’s more general CAP conjecture, isn’t useful as a teaching and learning tool. I think its popularity has led to its usefulness being overstated, and along the way a lot of the subtlety has been lost. I find that unfortunate, because it means more confused students, and more confused practitioners. It would be great to replace CAP with something equally snappy, but with fewer subtle edges.&lt;/p&gt;

&lt;p&gt;Daniel Abadi’s &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;Consistency Tradeoffs in Modern Distributed Database System Design&lt;/a&gt; proposes an alternative: PACELC. Is that what we need?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A more complete portrayal of the space of potential consistency tradeoffs for DDBSs can be achieved by rewriting CAP as PACELC (pronounced “pass-elk”): if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dan Weinreb &lt;a href=&quot;http://danweinreb.org/blog/improving-the-pacelc-taxonomy/&quot;&gt;suggests&lt;/a&gt; making it PACELCA instead, a formulation that I prefer:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The reason the acronym doesn’t need to be “PACELCA” is that if there are no partitions, then the system must be available. Adding an “A” to the second part is redundant. But for me (maybe not for you), putting in the redundant “A” in the “E” case helps me. A PA/EL system is always “available”, and calling it PA/ELA makes it easier for me to see that availability is always there.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PACELC mostly takes aim at the first and fourth misconceptions I listed: that CA systems exist, and that eventual consistency is all about CAP. First, it doesn’t allow you to ignore partition tolerance. Second, it makes it clear that even in the absence of partitions there’s a tradeoff between consistency and latency. In the general case, though &lt;a href=&quot;http://www.bailis.org/papers/hat-hotos2013.pdf&quot;&gt;not all cases&lt;/a&gt;, consistency &lt;a href=&quot;http://www.cs.utexas.edu/users/dahlin/papers/cac-tr.pdf&quot;&gt;requires a level of coordination&lt;/a&gt; which prevents systems from being &lt;em&gt;always available&lt;/em&gt; (in the Gilbert and Lynch sense), and increases latency when no partition is present. The matter of latency, which is of great practical importance in real-world systems, isn’t captured at all in CAP.&lt;/p&gt;
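&lt;p&gt;The latency half of PACELC is easy to see in a toy simulation (the numbers are entirely made up): a consistent write must wait for a majority of replicas, so it pays for the slower ones, while an eventually-consistent write can acknowledge after the first reply:&lt;/p&gt;

```python
import random

# Toy model of the E (L vs C) half of PACELC. Hypothetical numbers:
# each replica round trip is uniform between 0.5ms and 5ms.
def write_latencies(n=5, quorum=3, trials=10000, seed=42):
    random.seed(seed)
    ec_total = c_total = 0.0
    for _ in range(trials):
        rtts = sorted(random.uniform(0.5, 5.0) for _ in range(n))
        ec_total += rtts[0]           # EL: ack after the fastest replica
        c_total += rtts[quorum - 1]   # EC: wait for a majority quorum
    return ec_total / trials, c_total / trials

ec, c = write_latencies()
print(f"eventually consistent: {ec:.2f}ms, consistent: {c:.2f}ms")
```

&lt;p&gt;No partition is involved anywhere in this model, yet the coordination cost shows up on every single write. That cost is invisible in CAP.&lt;/p&gt;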

&lt;p&gt;So far so good for PACELC.&lt;/p&gt;

&lt;p&gt;The first two categories of PACELC are very clear. PC/EC is the most consistent class of systems, which never give up consistency. PA/EL systems don’t try hard to be consistent, and rather take the opportunity to reduce latency and gain availability by reducing coordination. The trickier ground starts with PA/EC. These types of systems give up consistency when there is a partition, and are consistent when there isn’t. That’s more subtle than it looks. When is there a partition? How long does the network need to be down before there is a partition? Is a single dropped connection or lost packet a partition? That may seem like nitpicking, but there’s an important line to be drawn between &lt;em&gt;partition&lt;/em&gt; and &lt;em&gt;not partition&lt;/em&gt;. PACELC doesn’t help there.&lt;/p&gt;

&lt;p&gt;If PA/EC is tricky, PC/EL is madness. What does it mean to be more consistent during a partition? Daniel Abadi &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;says that’s the wrong question&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;PNUTS is a PC/EL system. In normal operation, it gives up consistency for latency; however, if a partition occurs, it trades availability for consistency. This is admittedly somewhat confusing: according to PACELC, PNUTS appears to get more consistent upon a network partition. However, PC/EL should not be interpreted in this way. PC does not indicate that the system is fully consistent; rather it indicates that the system does not reduce consistency beyond the baseline consistency level when a network partition occurs—instead, it reduces availability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s interesting, but not really helpful, because it introduces a subtle edge to PACELC of the same kind that exists in CAP. All is not lost. I feel that PACELC is a more useful model for beginners in thinking about distributed systems tradeoffs, but it’s still not the ideal solution. It would be great if a simple easily-understood replacement existed. Any ideas?&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Two traps in iostat: %util and svctm</title>
      <link>http://brooker.co.za/blog/2014/07/04/iostat-pct.html</link>
      <pubDate>Fri, 04 Jul 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/07/04/iostat-pct</guid>
      <description>&lt;h1 id=&quot;two-traps-in-iostat-util-and-svctm&quot;&gt;Two traps in iostat: %util and svctm&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;These commonly-used fields in iostat shouldn&apos;t be commonly-used.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;iostat&lt;/em&gt;, from the excellent &lt;a href=&quot;http://sebastien.godard.pagesperso-orange.fr/&quot;&gt;sysstat&lt;/a&gt; suite of utilities, is the go-to tool for evaluating IO performance on Linux. It’s obvious why that’s the case: sysstat is very useful, solid, and widely installed. System administrators can do a lot worse than taking a look at &lt;em&gt;iostat -x&lt;/em&gt;. There are some serious caveats lurking in &lt;em&gt;iostat&lt;/em&gt;’s output, though, two of which are greatly magnified on newer machines with solid state drives.&lt;/p&gt;

&lt;p&gt;To explain what’s wrong, let me compare two lines of &lt;em&gt;iostat&lt;/em&gt; output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Device:     rrqm/s   wrqm/s       r/s     w/s    rkB/s    wkB/s avgrq-sz 
sdd           0.00     0.00  13823.00    0.00 55292.00     0.00     8.00
             avgqu-sz   await r_await w_await  svctm  %util
                 0.78    0.06    0.06    0.00   0.06  78.40

Device:     rrqm/s   wrqm/s       r/s     w/s    rkB/s    wkB/s avgrq-sz
sdd           0.00     0.00  72914.67    0.00 291658.67    0.00     8.00
             avgqu-sz   await r_await w_await  svctm  %util
                15.27    0.21    0.21    0.00   0.01 100.00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both of these lines are from the same device (a &lt;a href=&quot;http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/about/SSD840EVO.html&quot;&gt;Samsung 840 EVO&lt;/a&gt; SSD), and both are from read-only 4kB random loads. What differs here is the level of parallelism: in the first load the mean queue depth is only 0.78, and in the second it’s 15.27. Same pattern, more threads.&lt;/p&gt;

&lt;p&gt;The first problem we run into with this output is the &lt;em&gt;svctm&lt;/em&gt; field, widely taken to be &lt;em&gt;the average amount of time an operation takes&lt;/em&gt;. The iostat man page describes it as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The average service time (in milliseconds) for I/O requests that were issued to the device.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and goes on to say:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The average service time (svctm field) value is meaningless, as I/O statistics are now calculated at block level, and we don’t know when the disk driver starts to process a request.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The reasons the man page states for this field being meaningless are true, as are the warnings in the sysstat code. The calculation behind &lt;em&gt;svctm&lt;/em&gt; is fundamentally broken, and doesn’t really have a clear meaning. Inside iostat, svctm in an interval is calculated as &lt;em&gt;time the device was doing some work&lt;/em&gt; / &lt;em&gt;number of IOs&lt;/em&gt;: the amount of time spent doing work, divided by the amount of work done. Going back to our two workloads, we can compare their service times:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;svctm
0.06
0.01
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
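&lt;p&gt;The arithmetic behind those numbers is easy to reproduce (assuming a one-second reporting interval, and reading &lt;em&gt;%util&lt;/em&gt; as the fraction of time the device was busy):&lt;/p&gt;

```python
# iostat's svctm is just (busy time in the interval) / (I/Os completed).
# Assuming a one-second interval, with %util as the busy fraction:
def svctm_ms(util_pct, ios_per_sec):
    busy_ms = util_pct / 100.0 * 1000.0   # ms the device was "busy"
    return busy_ms / ios_per_sec

print(round(svctm_ms(78.4, 13823.00), 2))   # 0.06 -- matches line one
print(round(svctm_ms(100.0, 72914.67), 2))  # 0.01 -- matches line two
```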

&lt;p&gt;Taken literally, this means the device was responding in 60µs when under little load, and 10µs when under a lot of load. That seems unlikely, and indeed the load generator &lt;a href=&quot;https://github.com/axboe/fio&quot;&gt;fio&lt;/a&gt; tells us it’s not true. So what’s going on?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/Laptop-hard-drive-exposed-Evan-Amos.jpg&quot; alt=&quot;Hard Drive Exposed By Evan-Amos (Own work) CC-BY-SA-3.0 via Wikimedia Commons&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Magnetic hard drives are serial beings. They have a few tiny heads, ganged together, that move over a spinning platter to a single location where they do some IO. Once the IO is done, and no sooner, they move on. Over the years, they’ve gathered some shiny capabilities like &lt;a href=&quot;http://en.wikipedia.org/wiki/Native_Command_Queuing&quot;&gt;NCQ&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Tagged_Command_Queuing&quot;&gt;TCQ&lt;/a&gt; that make them appear parallel (mostly to allow reordering), but they’re still the same old horse-and-carriage sequential devices they’ve always been. Modern hard drives expose some level of concurrency, but no true parallelism. SSDs, like the Samsung 840 EVO in this test, are different. SSDs can and do perform operations in parallel. In fact, the only way to achieve their peak performance is to offer them parallel work to do.&lt;/p&gt;

&lt;p&gt;While SSDs vary in the details of their internal construction, most have the ability to access multiple flash &lt;em&gt;packages&lt;/em&gt; (groups of chips) at a time. This is a big deal for SSD performance. Individual flash chips actually don’t have great bandwidth, so the ability to group the performance of many chips together is essential. The chips are completely independent, and because the controller doesn’t need to block on requests to the chip, the drive is truly doing multiple things at once. Without the single electromechanical head as a bottleneck, even single SSDs can have a fairly large amount of internal parallelism. This diagram from &lt;a href=&quot;http://research.microsoft.com/pubs/63596/usenix-08-ssd.pdf&quot;&gt;Agrawal et al&lt;/a&gt; shows the high-level architecture:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/agrawal-ssd-arch.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If Jane does one thing at a time, and doing ten things takes Jane 20 minutes, each thing has taken Jane an average of two minutes. The mean time between asking Jane to do something and Jane completing it is two minutes. Alice, like Jane, can do ten things in twenty minutes, but she works on multiple things in parallel. Looking only at Alice’s throughput (the number of things she gets done in a period of time) what can we say about Alice’s latency (the amount of time it takes her from start to finish for a task)? Very little. We know it’s less than 10 minutes. If she’s busy the whole time, we know it’s 2 or more minutes. That’s it.&lt;/p&gt;
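&lt;p&gt;Little’s law makes the Jane-and-Alice argument quantitative: mean concurrency equals throughput times mean latency. As a sanity check (taking iostat’s columns at face value), &lt;em&gt;avgqu-sz&lt;/em&gt; divided by &lt;em&gt;r/s&lt;/em&gt; should recover &lt;em&gt;await&lt;/em&gt; for both sample lines, and it does:&lt;/p&gt;

```python
# Little's law: mean queue depth = throughput x mean latency, so
# await (ms) should equal avgqu-sz / (r/s) * 1000 for each sample line.
def await_ms(avgqu_sz, reads_per_sec):
    return avgqu_sz / reads_per_sec * 1000.0

print(round(await_ms(0.78, 13823.00), 2))    # 0.06 -- line one's await
print(round(await_ms(15.27, 72914.67), 2))   # 0.21 -- line two's await
```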

&lt;p&gt;Let’s go back to that iostat output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Device:     rrqm/s   wrqm/s       r/s     w/s    rkB/s    wkB/s avgrq-sz 
sdd           0.00     0.00  13823.00    0.00 55292.00     0.00     8.00
             avgqu-sz   await r_await w_await  svctm  %util
                 0.78    0.06    0.06    0.00   0.06  78.40

Device:     rrqm/s   wrqm/s       r/s     w/s    rkB/s    wkB/s avgrq-sz
sdd           0.00     0.00  72914.67    0.00 291658.67    0.00     8.00
             avgqu-sz   await r_await w_await  svctm  %util
                15.27    0.21    0.21    0.00   0.01 100.00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What’s going on with &lt;em&gt;%util&lt;/em&gt;, then? The first line is telling us that the drive is 78.4% utilized at 13823 reads per second. The second line is telling us that the drive is 100% utilized at 72914 reads per second. If it takes 14 thousand to fill it to 78.4%, wouldn’t we expect it to only be able to do 18 thousand in total? How is it doing 73 thousand?&lt;/p&gt;
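&lt;p&gt;The back-of-the-envelope version of that question:&lt;/p&gt;

```python
# If %util scaled linearly with load, as it does for a one-at-a-time
# device, then 78.4% busy at 13823 reads/s would put the ceiling near:
serial_limit = 13823.00 / 0.784
print(round(serial_limit))  # ~17631 reads/s -- yet the drive did 72914
```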

&lt;p&gt;The problem here is the same: parallelism. When iostat says &lt;em&gt;%util&lt;/em&gt;, it means “Percentage of CPU time during which I/O requests were issued to the device”. The percentage of the time the drive was doing &lt;em&gt;at least one thing&lt;/em&gt;. If it’s doing 16 things at the same time, that doesn’t change. Once again, this calculation works just fine for magnetic drives (and Jane), which do only one thing at a time. The amount of time they spend doing one thing is a great indication of how busy they really are. SSDs (and RAIDs, and Alice), on the other hand, can do multiple things at once. If you can do multiple things in parallel, the percentage of time you’re doing &lt;em&gt;at least one thing&lt;/em&gt; isn’t a great predictor of your performance potential. The iostat man page does provide a warning:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Device saturation occurs when this value is close to 100% for devices serving requests serially.  But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As a measure of general IO busyness &lt;em&gt;%util&lt;/em&gt; is fairly handy, but as an indication of how much the system is doing compared to what it can do, it’s terrible. Iostat’s &lt;em&gt;svctm&lt;/em&gt; has even fewer redeeming qualities: it’s simply misleading for most modern storage systems and workloads. Both of these fields are likely to mislead more than inform on modern SSD-based storage systems, and should be treated with extreme care.&lt;/p&gt;

&lt;p&gt;&lt;sub&gt;Hard drive image by Evan-Amos (Own work) (&lt;a href=&quot;http://creativecommons.org/licenses/by-sa/3.0&quot;&gt;CC-BY-SA-3.0&lt;/a&gt; or &lt;a href=&quot;http://www.gnu.org/copyleft/fdl.html&quot;&gt;GFDL&lt;/a&gt;), via Wikimedia Commons&lt;/sub&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Operations Gradient: Improving Safety in Complex Systems</title>
      <link>http://brooker.co.za/blog/2014/06/29/rasmussen.html</link>
      <pubDate>Sun, 29 Jun 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/06/29/rasmussen</guid>
      <description>&lt;h1 id=&quot;the-operations-gradient-improving-safety-in-complex-systems&quot;&gt;The Operations Gradient: Improving Safety in Complex Systems&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Can we improve the safety of complex systems by listening to operators more?&lt;/p&gt;

&lt;p&gt;This week, I watched &lt;a href=&quot;https://www.youtube.com/watch?v=PGLYEDpNu60&amp;amp;feature=youtu.be&quot;&gt;an excellent lecture&lt;/a&gt; by &lt;a href=&quot;http://www.ctlab.org/Cook.cfm&quot;&gt;Richard Cook&lt;/a&gt;. He goes into some detail about why failures happen, through the lens of Rasmussen’s model of system safety. If you build or maintain any kind of complex system, don’t miss this lecture.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;What is surprising is not that there are so many accidents, it’s that there are so few.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model that takes up most of the lecture is best expressed in Rasmussen’s &lt;a href=&quot;http://www.sciencedirect.com/science/article/pii/S0925753597000520&quot;&gt;Risk Management in a Dynamic Society: A Modelling Problem&lt;/a&gt;, a classic paper that deserves more attention among engineers. The core insight of the model is captured in Figure 3 of Rasmussen’s paper:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/rasmussen-figure3.png&quot; alt=&quot;Rasmussen, 1997, Figure 3&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Rasmussen describes the process of developing systems as an &lt;em&gt;adaptive search&lt;/em&gt; within a boundary defined by a set of economic constraints (it’s not economically viable to run the system beyond this boundary), engineering effort constraints (there are not enough actors to push the system beyond this boundary), and safety constraints (the system has failed beyond this boundary). The traditional balance between engineering effort and economic return plays out in pushing the operating point of the system away from two of these boundaries. From the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;During the adaptive search the actors have ample opportunity to identify an &lt;em&gt;effort gradient&lt;/em&gt; and management will normally supply an effective &lt;em&gt;cost gradient&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The combination of optimizing for these two gradients tends to push the operating point towards the safety boundary (or &lt;em&gt;boundary of acceptable performance&lt;/em&gt;). A conscious push for safety (or &lt;em&gt;availability&lt;/em&gt;, &lt;em&gt;durability&lt;/em&gt; and other safety-related properties) forces the operating point away from this boundary. One danger of this is that the position of the safety boundary is not always obvious, and it’s also not a single clean line. From the paper:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;in systems designed according to a defence-in-depth strategy, the defenses are likely to degenerate systematically through time, when pressure towards cost-effectiveness is dominating.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a key point, because defence-in-depth is frequently seen as a very good thing. Its danger in this model is that it turns the safety boundary into a gradient, and significant degeneration in safety can happen before there is any accident. In response to this, we estimate an error margin, we put up an organizational “&lt;em&gt;perceived boundary of acceptable performance&lt;/em&gt;”, and we put systems in place to monitor that boundary. That’s a good idea, but doesn’t solve the problem. In Cook’s talk he says “&lt;em&gt;repeated experience with successful operations leads us to believe that the margin is too conservative, that we have space in there that we could use&lt;/em&gt;”. We want to use that space, because that allows us to optimize both for economics (getting further from our economic failure boundary) and engineering effort (getting further from our effort boundary). The response to this is organizational pressure to shrink the margin.&lt;/p&gt;

&lt;p&gt;On the other hand, growing the margin, or at least the perceived margin, doesn’t necessarily increase safety:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;… drivers tend to try to keep their arousal at a desired, constant level and, consequently, go faster if conditions become too undemanding. … traffic safety is hard to improve beyond a certain limit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rasmussen cites &lt;a href=&quot;http://www.tandfonline.com/doi/abs/10.1080/00140138108924870?journalCode=terg20#preview&quot;&gt;Taylor&lt;/a&gt; to provide evidence of &lt;em&gt;risk homeostasis&lt;/em&gt; in traffic safety. This effect suggests that people (and organizations) will push the limits of safety to a certain perceived level, and that increasing perceived safety will encourage risky behavior. While Rasmussen cites some data for this, and it has been suggested by many others, it is hard to reconcile some of these claims with the evidence that real-world traffic safety has improved significantly since these claims were made. Whether risk compensation behavior exists, and the extent to which it contributes to a &lt;em&gt;risk homeostasis&lt;/em&gt; effect, appears to be an area of active research. In skiing, for example, there &lt;a href=&quot;http://journals.lww.com/epidem/Fulltext/2012/11000/Does_Risk_Compensation_Undo_the_Protection_of_Ski.35.aspx&quot;&gt;does not appear to be a significant risk compensation effect&lt;/a&gt; with helmet use. Still, this effect may be a significant one, and should be considered before attempting to widen the perceived safety margin of a system.&lt;/p&gt;

&lt;p&gt;Another way to move the operating point away from the safety boundary is to shift it in a big discontinuity after accidents occur. Using accidents as a driver away from the safety boundary is difficult for three reasons. The biggest one is that it requires accidents to occur, and that no feedback is provided between accidents. This &lt;em&gt;wait until you crash before turning the corner&lt;/em&gt; approach can have very high costs, especially in safety and life critical systems. Another difficulty is the effect of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Availability_heuristic&quot;&gt;availability heuristic&lt;/a&gt;, our natural tendency to discount evidence that is difficult to recall. The longer it has been since an accident occurred, the smaller the push it provides away from the safety boundary. The third difficulty is that investigating accidents is really hard, and knowing which direction to move the operating point relies on good investigations. Simple &lt;em&gt;it was human error&lt;/em&gt; conclusions are unlikely to move the operating point in the right direction.&lt;/p&gt;

&lt;p&gt;Is all lost? No, but to make progress we may need to change the way that we think about measuring the safety of our systems. &lt;a href=&quot;http://www.ctlab.org/documents/Cook%20and%20Nemeth-Observations%20of%20the%20Usefulness%20of%20Error.pdf&quot;&gt;Cook and Nemeth&lt;/a&gt; make a distinction between those at the &lt;em&gt;sharp end&lt;/em&gt; (operators) and those at the &lt;em&gt;blunt end&lt;/em&gt; (management).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Those who are closest to the blunt (management) end are most remote from sharp end operations and are concerned with maintaining the organization, and threats to the organization are minimized by casting adverse events as anomalies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of treating crossings of the safety boundary as anomalies, we should incorporate more feedback from the &lt;em&gt;sharp end&lt;/em&gt; into the process that chooses the system operating point. This sharp-end gradient, mostly supplied by operators of complex systems, can provide a valuable third gradient (along with the gradients towards least effort and towards efficiency). The advantage of this approach is that it is continuous, in that it doesn’t rely on big accidents and investigations, and adaptive, in that it constantly measures the local position of the safety boundary and provides a slope away from it. Getting this right requires constant attention from operators, and a conscious decision to include operators in the decision-making process.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Viewstamped Replication: The Less-Famous Consensus Protocol</title>
      <link>http://brooker.co.za/blog/2014/05/19/vr.html</link>
      <pubDate>Mon, 19 May 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/05/19/vr</guid>
      <description>&lt;h1 id=&quot;viewstamped-replication-the-less-famous-consensus-protocol&quot;&gt;Viewstamped Replication: The Less-Famous Consensus Protocol&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;The first practical consensus protocol may be the least famous.&lt;/p&gt;

&lt;p&gt;There’s no doubt that Paxos is the most famous distributed consensus protocol. Distributed systems reading lists (such as &lt;a href=&quot;http://dancres.github.io/Pages/&quot;&gt;this one by Dan Creswell&lt;/a&gt; and &lt;a href=&quot;http://christophermeiklejohn.com/distributed/systems/2013/07/12/readings-in-distributed-systems.html&quot;&gt;this one by Christopher Meiklejohn&lt;/a&gt;) nearly all include at least one paper describing it. That’s as it should be. There’s no doubt Paxos has been extremely influential in industry, and has formed the basis of many extremely successful systems. Lamport’s &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt; is very readable, and papers like Google’s &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=1281103&quot;&gt;Paxos Made Live&lt;/a&gt; have helped raise the visibility of good Paxos implementation techniques.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf&quot;&gt;Raft&lt;/a&gt;, on the other hand, is the most fashionable distributed consensus protocol. It seems like everybody’s implementing it right now. I don’t know if it’s as &lt;em&gt;understandable&lt;/em&gt; as it claims to be, but it’s definitely in vogue. While I’m not a big fan of technology fads, I find it difficult to be upset about this one. Anything that encourages people to use well-proven distributed algorithms instead of crafting their own is good in my book.&lt;/p&gt;

&lt;p&gt;By comparison to Paxos and Raft, one distributed consensus protocol seems frequently overlooked: Oki and Liskov’s &lt;a href=&quot;http://www.pmg.csail.mit.edu/papers/vr.pdf&quot;&gt;Viewstamped Replication&lt;/a&gt;. Introduced in May 1988 in &lt;a href=&quot;http://www.pmg.csail.mit.edu/papers/MIT-LCS-TR-423.pdf&quot;&gt;Brian Oki’s PhD thesis&lt;/a&gt;, Viewstamped Replication predates the first publication of Paxos by about a year. If you’re looking for intrigue you may be disappointed: both Lamport and Liskov claim the inventions were independent. First, from &lt;a href=&quot;http://pmg.csail.mit.edu/papers/vr-revisited.pdf&quot;&gt;Viewstamped Replication Revisited&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;VR was originally developed in the 1980s, at about the same time as Paxos, but without knowledge of that work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and from Keith Marzullo’s comments in the 1998 reprinting of &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf&quot;&gt;The Part-Time Parliament&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The author was also apparently unaware that the view management protocol by Oki and Liskov seems to be equivalent to the Paxon protocol.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In many ways, the Paxos protocol as described in The Part-Time Parliament and the Viewstamped Replication protocol are surprisingly different. Paxos’ &lt;em&gt;Synod&lt;/em&gt; protocol is the basic building block of consensus, and is used more-or-less directly for data replication. On the other hand, in VR, normal requests are merely stamped with a &lt;em&gt;view number&lt;/em&gt; on their way through the primary, and are sent to all the replicas in parallel. The similarities start to become apparent when looking at how that &lt;em&gt;view number&lt;/em&gt; is chosen: VR’s &lt;em&gt;view change&lt;/em&gt; protocol. In fact, the view change protocol described in Section 4.2 of &lt;a href=&quot;http://pmg.csail.mit.edu/papers/vr-revisited.pdf&quot;&gt;Viewstamped Replication Revisited&lt;/a&gt; bears a striking resemblance to the Paxos Synod protocol, especially when compared to the description in Section 2.2 of &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt;.&lt;/p&gt;
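The normal-case flow sketched above — the primary stamping each request with its current view number and sending it to all replicas in parallel — can be illustrated with a toy model. This is my own sketch, not the paper's message formats; all class and method names are invented:

```python
class Replica:
    """Backup that accepts entries stamped with its current view number."""
    def __init__(self, view_number=0):
        self.view = view_number
        self.log = []

    def prepare(self, entry):
        view, _, _ = entry
        if view != self.view:
            return False  # stamped with a stale or future view: reject
        self.log.append(entry)
        return True


class Primary:
    """Stamps each request with (view, op_number): a viewstamp."""
    def __init__(self, replicas, view_number=0):
        self.view = view_number
        self.op_number = 0
        self.replicas = replicas
        self.log = []

    def handle_request(self, op):
        self.op_number += 1
        entry = (self.view, self.op_number, op)
        self.log.append(entry)
        # Sent to all replicas "in parallel" (sequentially in this toy model).
        acks = sum(r.prepare(entry) for r in self.replicas)
        # Committed once a majority of the group (primary included) has it.
        total = len(self.replicas) + 1
        return acks + 1 > total // 2
```

The interesting part is not this happy path but how a new view number is agreed upon when the primary fails; that view change is where the resemblance to the Synod protocol appears.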

&lt;p&gt;It’s easy to believe that these two protocols are, in fact, the same. That doesn’t appear to be the case. A new paper by van Renesse et al., titled &lt;a href=&quot;http://www.cs.cornell.edu/fbs/publications/viveLaDifference.pdf&quot;&gt;Vive La Difference: Paxos vs. Viewstamped Replication vs. Zab&lt;/a&gt;, looks at Paxos and VR through the lenses of refinement and abstraction, and finds they are not exactly equivalent due to &lt;em&gt;design decisions&lt;/em&gt; in the way they refine a model the paper calls Multi-Consensus. One of the key differences is active (Paxos) vs. passive (VR) replication:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Passive vs. Active Replication: In active replication, at least f + 1 replicas each must execute operations. In passive replication, only the sequencer executes operations, but it has to propagate state updates to the backups.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don’t have an answer for why Paxos is so much more famous than Viewstamped Replication. The first publication of Viewstamped Replication was more readable, though less entertaining, than the first publication of Paxos. Implemented &lt;em&gt;out of the paper&lt;/em&gt;, VR likely has better performance properties than Paxos, for similar implementation effort and complexity. Barbara Liskov is more widely known among programmers and computer scientists than Leslie Lamport, thanks to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Liskov_substitution_principle&quot;&gt;Liskov substitution principle&lt;/a&gt;. I can’t think of a good explanation at all.&lt;/p&gt;

&lt;p&gt;Whatever the cause, both &lt;a href=&quot;http://www.pmg.csail.mit.edu/papers/vr.pdf&quot;&gt;Viewstamped Replication&lt;/a&gt; and &lt;a href=&quot;http://pmg.csail.mit.edu/papers/vr-revisited.pdf&quot;&gt;Viewstamped Replication Revisited&lt;/a&gt; are well worth including on your distributed systems reading list.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Essential Nancy Lynch</title>
      <link>http://brooker.co.za/blog/2014/05/10/lynch-pub.html</link>
      <pubDate>Sat, 10 May 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/05/10/lynch-pub</guid>
      <description>&lt;h1 id=&quot;the-essential-nancy-lynch&quot;&gt;The Essential Nancy Lynch&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Some of my favorite Nancy Lynch publications.&lt;/p&gt;

&lt;p&gt;While reading distributed systems papers, one of the names that comes up most often is Nancy Lynch’s. From a standard textbook for university distributed systems courses (&lt;a href=&quot;http://www.amazon.com/dp/1558603484&quot;&gt;Distributed Algorithms&lt;/a&gt;), to some of the earliest successful results on consensus, to the proof of the CAP theorem, Lynch’s name is everywhere. In the same spirit as &lt;a href=&quot;http://brooker.co.za/blog/2014/03/30/lamport-pub.html&quot;&gt;The Essential Leslie Lamport&lt;/a&gt;, I thought I’d write about some of my favorite Nancy Lynch papers. The criteria are the same as last time: I like these papers for some reason. I’d probably make a different list if I wrote this post again next week.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;What good are impossibility results, anyway? They don’t seem very useful at first, since they don’t allow computers to do anything they couldn’t previously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf&quot;&gt;A Hundred Impossibility Proofs for Distributed Computing&lt;/a&gt; is a great read. It covers a huge amount of ground across most of the distributed systems field as it stood in 1989, and presents an overwhelming number of results. The focus is on impossibility results and bounds (as the title suggests), but the paper frequently wanders off this path.&lt;/p&gt;

&lt;p&gt;This paper is worth reading on its own, but it’s also a really great way to discover distributed systems papers you haven’t seen before. With 103 references, there’s plenty to keep you busy if you’re looking for papers and books to read. Despite covering some deep results quite formally, the paper remains readable even without deep expertise in some of the areas it covers. A Hundred Impossibility Proofs is also a great piece of history, a snapshot of the distributed systems world 25 years ago.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading:&lt;/em&gt; It presents a huge number of results in a very compact and readable package. You won’t get bored reading this paper.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;However the &lt;em&gt;read&lt;/em&gt; request does not begin until after the write request … has completed. This therefore contradicts the atomicity property, proving that no such algorithm exists.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The algorithm that doesn’t exist is one that implements a writeable data object guaranteeing both consistency and availability in all executions in an asynchronous system. In other words, one that solves one definition of Brewer’s CAP theorem. &lt;a href=&quot;http://theory.lcs.mit.edu/tds/papers/Gilbert/Brewer6.ps&quot;&gt;Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services&lt;/a&gt; is important to the practice of distributed systems today, because it’s held up as a proof of the CAP theorem (which it is), and provides definitions of the conditions under which the CAP theorem is true. The paper spends about half its length looking at ways to circumvent the CAP theorem in partially synchronous networks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading:&lt;/em&gt; It both proves the CAP theorem, and debunks many of the common statements of it. The proof itself is simple and succinct, and provides real insight into why CAP is true.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It is easy to see that all correct processors make decisions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://theory.lcs.mit.edu/tds/papers/Lynch/jacm88.pdf&quot;&gt;Consensus in the Presence of Partial Synchrony&lt;/a&gt; is one of three solutions to the consensus problem from the late 1980s. The others, Oki and Liskov’s &lt;a href=&quot;http://www.pmg.csail.mit.edu/papers/vr.pdf&quot;&gt;Viewstamped Replication&lt;/a&gt; and Lamport’s &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf&quot;&gt;Paxos&lt;/a&gt;, are arguably more general and perhaps more interesting solutions, but this one is still very influential. The algorithms in &lt;a href=&quot;http://theory.lcs.mit.edu/tds/papers/Lynch/jacm88.pdf&quot;&gt;Consensus in the Presence of Partial Synchrony&lt;/a&gt; are interesting because they break the problem up differently from both Oki and Liskov and Lamport, and provide real insight into the structure of the consensus problem. The algorithms appear to be more complex and more cumbersome than either VR or Paxos, mostly because of the way they execute ballots in rounds. Still, this is very interesting stuff.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading:&lt;/em&gt; This is more a piece of history than the other papers in this list, but it’s still worth reading because it provides a view of a common problem from a different angle.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;we show the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. … the stopping of a single process at an inopportune time can cause any distributed commit protocol to fail to reach agreement. Thus, this important problem has no robust solution without further assumptions about the computing environment or still greater restrictions on the kind of failures to be tolerated!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are fairly few results in computer science that are seen as so influential that they have a widely recognized initialism. The FLP result, named after Fischer, Lynch and Paterson, is one such result. It may be even more notable because it’s an impossibility result: instead of describing how to do something, FLP simply states that it can’t be done.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;http://theory.lcs.mit.edu/tds/papers/Lynch/pods83-flp.pdf&quot;&gt;Impossibility of distributed consensus with one faulty process&lt;/a&gt;, Fischer, Lynch and Paterson show that no asynchronous protocol exists that can always reach consensus, even in the case of a single process failure, no matter how many participants there are. It’s easy to see why this result is so influential. Still, as famous as FLP is, it seems to be less widely known (and much less widely misinterpreted) than CAP.&lt;/p&gt;

&lt;p&gt;The thing that stands out for me in this paper is the beauty and simplicity of the proof. I found Lemma 2 and Lemma 3 in the paper both surprising and enlightening. The proof in the paper is worth reading, but may be easiest to approach once you already understand it. Henry Robinson’s &lt;a href=&quot;http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/&quot;&gt;A Brief Tour of FLP Impossibility&lt;/a&gt; is a great place to start.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading:&lt;/em&gt; It’s both an important result and a beautiful proof.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;at each round, until termination is reached, each process sends its latest value to all processes (including itself). On receipt of a collection V of values, the process computes a certain function f(V) as its next value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Lynch/jacm86.pdf&quot;&gt;Reaching approximate agreement in the presence of faults&lt;/a&gt; is one of my favorite distributed systems papers. I like it because it contains some incredibly simple, beautiful and yet non-obvious algorithms. I also like it because it takes a different look at a common problem. It looks at the FLP result and says &lt;em&gt;“OK, if I can’t get exact agreement, how close can I get?”&lt;/em&gt;. It turns out that you can get arbitrarily close to agreement with guaranteed termination. In some ways, this is the flip side of &lt;a href=&quot;http://brooker.co.za/blog/2014/01/12/ben-or.html&quot;&gt;Ben-Or’s consensus with probability 1&lt;/a&gt; results.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is worth reading because&lt;/em&gt; it’ll change the way you think about agreement in distributed systems.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Failure Detectors, and Non-Blocking Atomic Commit</title>
      <link>http://brooker.co.za/blog/2014/04/14/failure-detectors.html</link>
      <pubDate>Mon, 14 Apr 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/04/14/failure-detectors</guid>
      <description>&lt;h1 id=&quot;failure-detectors-and-non-blocking-atomic-commit&quot;&gt;Failure Detectors, and Non-Blocking Atomic Commit&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Non-blocking atomic commit is harder than uniform consensus. Why would that be?&lt;/p&gt;

&lt;p&gt;Many of the most interesting results in distributed systems have come from looking at problems known to be impossible under one set of constraints, and finding how little those constraints can be relaxed before the problem becomes possible. One great example is how adding a &lt;a href=&quot;http://brooker.co.za/blog/2014/01/12/ben-or.html&quot;&gt;random Oracle&lt;/a&gt; to the asynchronous system model used by &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf&quot;&gt;FLP&lt;/a&gt; makes consensus possible. That result is very interesting, but not as practically important as the idea of failure detectors.&lt;/p&gt;

&lt;p&gt;The theoretical importance of detecting failures in the asynchronous model dates back to work in the 1980s from &lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Stockmeyer/DolevDS83-focs.pdf&quot;&gt;Dolev, Dwork and Stockmeyer&lt;/a&gt; and &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.3423&quot;&gt;Dwork, Lynch and Stockmeyer&lt;/a&gt;. The latter of these papers is very interesting, because it describes what is arguably the first practical consensus algorithm, before the publication of Viewstamped Replication and Paxos. More on that another time. A great, detailed, description and characterization of failure detectors can be found in &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf&quot;&gt;Unreliable Failure Detectors for Reliable Distributed Systems&lt;/a&gt; by Chandra and Toueg. They also introduced the concept of &lt;em&gt;unreliable failure detectors&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In this paper, we propose an alternative approach to circumvent such impossibility results, and to broaden the applicability of the asynchronous model of computation. Since impossibility results for asynchronous systems stem from the inherent difficulty of determining whether a process has actually crashed or is only “very slow,” we propose to augment the asynchronous model of computation with a model of an external failure detection mechanism that can make mistakes. In particular, we model the concept of &lt;em&gt;unreliable failure detectors&lt;/em&gt; for systems with &lt;em&gt;crash failures&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The failure detectors that Chandra and Toueg describe are distributed, rather than global, failure detectors. Each process uses local state to keep a list of other processes that it suspects have failed, and adds and removes processes from this list based on communication with other processes. A local failure detector like that can make two kinds of mistakes: putting processes that haven’t failed onto the list, and not putting processes that have failed onto the list. Remember that, as in all real distributed systems, there is no central oracle that can tell a node whether its list is correct. These two kinds of mistakes lead to the definition of two properties. From &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf&quot;&gt;Chandra and Toueg&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;completeness&lt;/em&gt; requires that a failure detector eventually suspects every process that actually crashes, while &lt;em&gt;accuracy&lt;/em&gt; restricts the mistakes that a failure detector can make.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s start with the best combination of these properties, a failure detector where every correct process eventually permanently suspects every crashed process of failing (&lt;em&gt;strong completeness&lt;/em&gt;), and never suspects a non-crashed process (&lt;em&gt;strong accuracy&lt;/em&gt;). This failure detector, &lt;strong&gt;&lt;em&gt;P&lt;/em&gt;&lt;/strong&gt;, can be seen as the ideal asynchronous failure detector. It doesn’t make mistakes, and it does the best it can at detection while remaining asynchronous. At the other end of the scale is ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt;. With this failure detector, every crashed process is eventually permanently suspected by some correct process, and eventually some correct process is not suspected by any correct process.  ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt;, unlike &lt;strong&gt;&lt;em&gt;P&lt;/em&gt;&lt;/strong&gt;, can make lots and lots of mistakes, for arbitrarily long amounts of time.&lt;/p&gt;
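In practice, local failure detectors like these are usually built from heartbeats and timeouts. Here is a minimal sketch (my own construction, not from the paper; time is passed in explicitly to keep the example deterministic):

```python
class LocalFailureDetector:
    """Per-process suspect list, in the style of Chandra and Toueg.

    Doubling the timeout after each mistake is one classic way to
    approximate eventual weak accuracy under partial synchrony: if the
    network eventually behaves, the timeout eventually grows past the
    real message delay, and false suspicions stop.
    """
    def __init__(self, peers, now=0.0, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heard = {p: now for p in peers}
        self.suspects = set()

    def on_heartbeat(self, peer, now):
        self.last_heard[peer] = now
        if peer in self.suspects:
            # We suspected a live process: a mistake. Back off.
            self.suspects.discard(peer)
            self.timeout[peer] *= 2

    def tick(self, now):
        # Suspect every peer we haven't heard from within its timeout.
        for peer, last in self.last_heard.items():
            if now - last > self.timeout[peer]:
                self.suspects.add(peer)
        return set(self.suspects)
```

Note that nothing here is reliable: a slow network makes the detector wrongly suspect live processes, which is exactly the kind of mistake the unreliable-failure-detector model permits.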

&lt;p&gt;Before going further, it’s worth introducing one piece of notation. Even informal writing about failure detectors tends to make heavy use of the ◇ operator from &lt;a href=&quot;http://en.wikipedia.org/wiki/Temporal_logic&quot;&gt;temporal logic&lt;/a&gt;. Don’t be put off by the notation: ◇F simply means &lt;em&gt;F is eventually true&lt;/em&gt;; there is some state in the future where F is true. To better understand that, let’s compare the failure detectors ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt;. Both of these meet the weak completeness condition:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Weak Completeness.&lt;/em&gt; Eventually every process that crashes is permanently suspected by some correct process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; meets the weak accuracy condition:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Weak Accuracy&lt;/em&gt;. Some correct process is never suspected.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt;, meanwhile, only meets the strictly weaker eventual weak accuracy condition:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Eventual Weak Accuracy&lt;/em&gt;. There is a time after which some correct process is never suspected by any correct process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Comparing those two makes the difference more obvious. ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; is allowed to make mistakes early on (before &lt;em&gt;a time&lt;/em&gt;) that &lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; isn’t allowed to make.&lt;/p&gt;
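The difference between the two accuracy conditions is easy to state as checks over a finite trace of suspect-sets. This is only a toy illustration (real executions are infinite, so a finite trace can merely approximate these properties):

```python
def always(pred, trace):
    """The "always" operator over a finite trace: pred holds in every state."""
    return all(pred(state) for state in trace)

def eventually_always(pred, trace):
    """◇ applied to "always": from some state onward, pred holds forever."""
    return any(always(pred, trace[i:]) for i in range(len(trace)))

# A trace of suspect-sets in which process "p" is wrongly suspected
# at first, and then never again.
trace = [{"p"}, {"p"}, set(), set()]

def never_suspects_p(suspects):
    return "p" not in suspects

# Weak accuracy for p ("p is never suspected") fails on this trace,
# but eventual weak accuracy (never suspected after some time) holds.
weak = always(never_suspects_p, trace)
eventual_weak = eventually_always(never_suspects_p, trace)
```

So a trace with early mistakes satisfies the ◇ version of the condition but not the plain one, which is exactly the gap between ◇**W** and **W**.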

&lt;p&gt;The existence of these classes of failure detectors allows meaningful comparisons to be made about the difficulty of different distributed problems, much like &lt;a href=&quot;http://en.wikipedia.org/wiki/Complexity_class&quot;&gt;complexity classes&lt;/a&gt; allow us to compare the difficulty of computational problems. For example, &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p685-chandra.pdf&quot;&gt;it is known&lt;/a&gt; that consensus can be solved using ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; if only a minority of processes fail. The problem known as non-blocking atomic commit (NB-AC), on the other hand, &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.6456&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;cannot be solved&lt;/a&gt; with ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; if there is a single failure. In a very meaningful sense, NB-AC is &lt;em&gt;harder than&lt;/em&gt; consensus. When I first learned about that result, I found it surprising: my assumption had been that uniform consensus was equivalent to the hardest problems in distributed systems.&lt;/p&gt;

&lt;p&gt;First, let’s define the NB-AC and consensus problems. They have a lot in common, both being non-blocking agreement problems. Both consensus and NB-AC attempt to get multiple processes to agree on a single value without blocking in the presence of failures. &lt;a href=&quot;http://en.wikipedia.org/wiki/2PC&quot;&gt;Two-phase commit&lt;/a&gt; is, like NB-AC and consensus, an agreement protocol, but it is a blocking one. The presence of a single failure will cause 2PC to block forever.&lt;/p&gt;
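For contrast, here is roughly what two-phase commit looks like, with comments marking where a crash causes the protocol to block. This is a toy single-process model with invented names, not a real implementation:

```python
class Participant:
    def __init__(self, vote):
        self.vote = vote        # "yes" or "no"
        self.decision = None

    def prepare(self):
        # In a real system this is an RPC; if this participant has
        # crashed, the coordinator waits here forever.
        return self.vote

    def finish(self, decision):
        self.decision = decision

def two_phase_commit(participants):
    # Phase 1: voting. Commit only if every participant votes yes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    # Phase 2: completion. If the coordinator crashes between the two
    # phases, participants that voted yes are stuck: they can neither
    # commit nor abort unilaterally. That is the "blocking" in
    # "blocking protocol".
    for p in participants:
        p.finish(decision)
    return decision
```

NB-AC asks for the same outward behavior as this protocol, but with the termination guarantee 2PC lacks.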

&lt;p&gt;&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.6456&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Guerraoui&lt;/a&gt; defines consensus with three conditions:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Agreement&lt;/em&gt;: No two correct participants decide different values&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Uniform-Validity:&lt;/em&gt; If a participant decides &lt;em&gt;v&lt;/em&gt;, then &lt;em&gt;v&lt;/em&gt; must have been proposed by some participant&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Termination:&lt;/em&gt; Every correct process eventually decides&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Uniform consensus expands the &lt;em&gt;agreement&lt;/em&gt; condition to a stronger one, called &lt;em&gt;uniform agreement&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Uniform-Agreement&lt;/em&gt;: No two participants (correct or not) decide different values.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consensus is, therefore, about &lt;em&gt;deciding&lt;/em&gt;. NB-AC, on the other hand, is about &lt;em&gt;accepting&lt;/em&gt; or voting on whether to &lt;em&gt;commit&lt;/em&gt; a transaction. &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.6456&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Guerraoui&lt;/a&gt; defines it with four conditions:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Uniform-Agreement&lt;/em&gt;: No two participants AC-decide different values.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Uniform-Validity:&lt;/em&gt; If a participant AC-decides &lt;em&gt;commit&lt;/em&gt;, then all participants voted ‘‘yes’’.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;Termination:&lt;/em&gt; Every correct process eventually AC-decides&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;NonTriviality&lt;/em&gt;: If all participants vote &lt;em&gt;yes&lt;/em&gt;, and there is no failure, then every correct participant eventually AC-decides &lt;em&gt;commit&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice how similar this appears to be to the uniform consensus problem. Guerraoui describes how it is the last one of these conditions, &lt;em&gt;NonTriviality&lt;/em&gt;, which has the effect of requiring that a solution to NB-AC has precise knowledge about failures. To meet the &lt;em&gt;Termination&lt;/em&gt; condition, eventually each process needs to &lt;em&gt;commit&lt;/em&gt; or &lt;em&gt;abort&lt;/em&gt;. Eventual strong accuracy doesn’t provide the knowledge required to make that decision, because it admits a time &lt;em&gt;t&lt;/em&gt; where a process is simply delayed but its vote is ignored (violating the &lt;em&gt;uniform validity&lt;/em&gt; or &lt;em&gt;NonTriviality&lt;/em&gt; conditions depending on the vote). Weak accuracy doesn’t provide the right knowledge either, because it allows an incorrect abort (and hence violation of &lt;em&gt;NonTriviality&lt;/em&gt;) based on incomplete knowledge of the failed set.&lt;/p&gt;

&lt;p&gt;If you only have unreliable failure detectors, uniform consensus is &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.6456&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;no harder than consensus&lt;/a&gt;, though reliable failure detectors (like &lt;strong&gt;&lt;em&gt;P&lt;/em&gt;&lt;/strong&gt;) &lt;a href=&quot;http://infoscience.epfl.ch/record/88273/files/CBS04.pdf?version=1&quot;&gt;make consensus easier&lt;/a&gt; than uniform consensus. Therefore, the addition of the &lt;em&gt;uniform agreement&lt;/em&gt; requirement doesn’t explain why consensus can be solved with ◇&lt;strong&gt;&lt;em&gt;W&lt;/em&gt;&lt;/strong&gt; and NB-AC can’t. Instead, it’s that seemingly harmless &lt;em&gt;NonTriviality&lt;/em&gt; condition that makes NB-AC harder. That’s a great example of how intuition is often a poor guide in distributed systems problems: seemingly similar problems, with very similar definitions, may end up with completely different difficulties.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Essential Leslie Lamport</title>
      <link>http://brooker.co.za/blog/2014/03/30/lamport-pub.html</link>
      <pubDate>Sun, 30 Mar 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/03/30/lamport-pub</guid>
      <description>&lt;h1 id=&quot;the-essential-leslie-lamport&quot;&gt;The Essential Leslie Lamport&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Some of my favourite Leslie Lamport publications.&lt;/p&gt;

&lt;p&gt;After it was announced that Leslie Lamport had won &lt;a href=&quot;http://amturing.acm.org/award_winners/lamport_1205376.cfm&quot;&gt;the 2013 A.M. Turing award&lt;/a&gt;, the link to his &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html&quot;&gt;list of publications&lt;/a&gt; found popularity on most of the tech-related sites I visit. It’s an excellent page, with a long (and growing) list of Lamport’s publications, and witty comments by the author on each one. The whole list is worth a read, but can feel overwhelming, so I thought I’d try to distill it down to some papers that I feel are really worth reading, if you read nothing else on that page. The criterion is simple: I like these papers for some reason. I’d probably make a different list if I wrote this post again next week.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The algorithm is quite simple. It is based upon one commonly used in bakeries, in which a customer receives a number upon entering the store. The holder of the lowest number is the next one served.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/bakery.pdf&quot;&gt;A New Solution of Dijkstra’s Concurrent Programming Problem&lt;/a&gt; Lamport describes the mutual exclusion problem formally posed by &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=365617&quot;&gt;Dijkstra&lt;/a&gt;, and presents a solution to it. The &lt;em&gt;bakery algorithm&lt;/em&gt; is remarkable. Unlike the earlier solutions, which depended on shared memory locations with fairly restrictive behaviors, the bakery algorithm works without any underlying mutual exclusion. Lamport’s invention, or discovery, of this algorithm seems to have kicked off a cascade of other solutions. &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.2277&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Szymanski’s&lt;/a&gt;, which seems to have been the first to offer both strong fairness and use of a fixed number of shared variables of a bounded size, is particularly interesting.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading&lt;/em&gt;: The bakery algorithm, while not very relevant to today’s concurrent software due to changes in memory models, is very simple, very beautiful, and solves a complex problem in an innovative way. It’s simply a beautiful piece of computer science.&lt;/p&gt;
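&lt;p&gt;The ticket-taking scheme is simple enough to sketch directly. Here is an illustrative Python translation (my names, not the paper’s notation), with the caveat that modern memory models don’t provide the sequentially consistent reads and writes the algorithm assumes; under CPython’s global interpreter lock the sketch behaves as intended:&lt;/p&gt;

```python
import threading

N = 2                      # two competing threads, for illustration
entering = [False] * N     # "choosing a ticket" flags
number = [0] * N           # tickets; 0 means "not interested"
counter = 0                # the shared resource

def lock(i):
    # Take a ticket one higher than any ticket currently held.
    entering[i] = True
    number[i] = 1 + max(number)
    entering[i] = False
    for j in range(N):
        if j == i:
            continue
        # Wait for thread j to finish choosing its ticket...
        while entering[j]:
            pass
        # ...then defer to j while it holds a lower ticket
        # (ties are broken by thread id).
        while number[j] != 0 and (number[i], i) > (number[j], j):
            pass

def unlock(i):
    number[i] = 0

def worker(i):
    global counter
    for _ in range(1000):
        lock(i)
        counter += 1       # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)             # 2000: no increments were lost
```

&lt;p&gt;Note that no underlying atomic test-and-set is needed: each shared variable is written by only one thread, and only read by the others.&lt;/p&gt;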

&lt;blockquote&gt;
  &lt;p&gt;In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation &lt;em&gt;“happened before”&lt;/em&gt; is therefore only a partial ordering of the events in the system. We have found that problems often arise because people are not fully aware of this fact and its implications.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;Being able to totally order the events can be very useful in implementing a distributed system. In fact, the reason for implementing a correct system of logical clocks is to obtain such a total ordering.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf&quot;&gt;Time, Clocks and the Ordering of Events in a Distributed System&lt;/a&gt; is Lamport’s most widely cited paper by a large margin. Google Scholar claims over 8000 citations, &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.3682&amp;amp;rank=1&quot;&gt;CiteSeer&lt;/a&gt; finds 2335. His next most cited paper, &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf&quot;&gt;The Byzantine Generals Problem&lt;/a&gt;, has fewer than a quarter as many citations. CiteSeer’s view of &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/similar?doi=10.1.1.142.3682&amp;amp;type=cc&quot;&gt;co-citations&lt;/a&gt; of this paper is interesting. The top two are Fischer, Lynch and Paterson’s &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.6760&quot;&gt;Impossibility of Distributed Consensus with One Faulty Process&lt;/a&gt;, which contains the famous &lt;em&gt;FLP impossibility&lt;/em&gt; result, and Lamport’s own The Byzantine Generals Problem. Browsing through the papers that cite both of these, it’s disturbingly easy to find cases where they are included more for name recognition than any real relevance to the subject at hand.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading&lt;/em&gt;: Aside from the sad realities of citation practices, this is legitimately a fascinating, important and highly influential paper. The paper points to &lt;a href=&quot;https://tools.ietf.org/html/rfc677&quot;&gt;The Maintenance of Duplicate Databases&lt;/a&gt; as the first published description of logical clocks in distributed systems, and both extends and clarifies the idea. Time, Clocks… presents a way to extract a deterministic total ordering (note, not &lt;em&gt;the&lt;/em&gt; total ordering) from a partial ordering of events. It also introduces the idea of replicated state machines, which Lamport later expanded on in &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/implementation.pdf&quot;&gt;The Implementation of Reliable Distributed Multiprocess Systems&lt;/a&gt;. It deserves to be recognized for both of these contributions, and is well worth reading because it presents the foundation of these ideas in a way that’s easy to understand. Don’t be ashamed if you skip over the proof in the Appendix, but it’s worth your time if you’d like to get a deeper understanding of why logical clocks work.&lt;/p&gt;
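&lt;p&gt;The clock rules themselves fit in a few lines. Here is an illustrative Python sketch (my names, not the paper’s notation): increment the clock on each local event, and on receipt advance it past the message’s timestamp. Appending a process id to each timestamp breaks ties, yielding a deterministic total order consistent with &lt;em&gt;happened before&lt;/em&gt;:&lt;/p&gt;

```python
class LamportClock:
    """Illustrative logical clock (my names, not the paper's notation)."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def local_event(self):
        self.time += 1                    # tick between successive events
        return self.stamp()

    def send(self):
        self.time += 1
        return self.time                  # the timestamp rides on the message

    def receive(self, msg_time):
        # Advance past the sender's timestamp, then tick.
        self.time = max(self.time, msg_time) + 1
        return self.stamp()

    def stamp(self):
        # Breaking ties by process id extends the partial "happened before"
        # order to one (arbitrary but deterministic) total order.
        return (self.time, self.pid)

a, b = LamportClock("A"), LamportClock("B")
e1 = a.local_event()                      # (1, 'A')
t = a.send()                              # message stamped 2
e2 = b.receive(t)                         # (3, 'B'): ordered after the send
e3 = b.local_event()                      # (4, 'B')
print(sorted([e3, e1, e2]))               # [(1, 'A'), (3, 'B'), (4, 'B')]
```

&lt;p&gt;Sorting the stamps recovers an order in which every receive comes after its send, which is all many replicated state machine designs need.&lt;/p&gt;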

&lt;blockquote&gt;
  &lt;p&gt;We cannot ensure that the states of all processes and channels will be recorded at the same instant because there is no global clock; however, we require that the recorded process and channel states form a “meaningful” global system state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The algorithm described in &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf&quot;&gt;Distributed Snapshots: Determining Global States of a Distributed System&lt;/a&gt;, written by Lamport with &lt;a href=&quot;http://infospheres.caltech.edu/people/mani&quot;&gt;K. Mani Chandy&lt;/a&gt;, does something that is apparently impossible: it creates a consistent global snapshot of a distributed system without requiring global synchronization. It does this by ensuring that a snapshot is taken of the local state of each process in the system, along with every message in flight to that process at the time that local snapshot was taken. The &lt;a href=&quot;http://en.wikipedia.org/wiki/Snapshot_algorithm#Working&quot;&gt;wikipedia page&lt;/a&gt; has a nice summary of how it works.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading&lt;/em&gt;: the algorithm presented in this paper, like many great algorithms, does something that seems really difficult in a way that, in retrospect, appears to be simple or even trivial. The paper also does a nice job of explaining the system model, and breaking down each step of the algorithm in clear terms. It’s worth reading because it’s a fascinating piece of computer science, and because of the way it presents its ideas.&lt;/p&gt;
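&lt;p&gt;The two core rules, record your own state before relaying the marker, and record each incoming channel’s messages until a marker arrives on it, can be sketched in a toy token-passing system. Everything below is illustrative (my names; delivery is simulated sequentially rather than concurrently):&lt;/p&gt;

```python
from collections import deque

MARKER = "MARKER"                         # sentinel marker message
channels = {"pq": deque(), "qp": deque()} # FIFO channels, by name

class Proc:
    def __init__(self, state, ins, outs):
        self.state = state                # local token count
        self.ins, self.outs = ins, outs   # incoming / outgoing channel names
        self.recorded = None              # recorded local state
        self.chan_rec = {}                # channel name -> recorded messages
        self.closed = set()               # channels whose recording has ended

    def record(self):
        # Record local state, start recording every incoming channel, and
        # send a marker on every outgoing channel (after all earlier sends).
        self.recorded = self.state
        for ch in self.ins:
            self.chan_rec.setdefault(ch, [])
        for ch in self.outs:
            channels[ch].append(MARKER)

    def deliver(self, ch, msg):
        if msg == MARKER:
            if self.recorded is None:
                self.record()             # first marker seen: record now
            self.closed.add(ch)           # channel ch's recording is complete
        else:
            self.state += msg             # tokens arrive
            if self.recorded is not None and ch not in self.closed:
                self.chan_rec[ch].append(msg)  # in flight at snapshot time

p = Proc(7, ins=["qp"], outs=["pq"])
q = Proc(3, ins=["pq"], outs=["qp"])

p.state -= 5; channels["pq"].append(5)    # p sends 5 tokens to q...
p.record()                                # ...then initiates the snapshot
q.state -= 3; channels["qp"].append(3)    # q sends 3 tokens, pre-marker

procs = {"pq": q, "qp": p}                # channel name -> receiving process
while any(channels.values()):
    for name, ch in list(channels.items()):
        if ch:
            procs[name].deliver(name, ch.popleft())

total = (p.recorded + q.recorded
         + sum(p.chan_rec["qp"]) + sum(q.chan_rec["pq"]))
print(total)                              # 10: every token is accounted for
```

&lt;p&gt;The recorded states form a “meaningful” global state: all ten tokens are accounted for, even though no two records were taken at the same instant.&lt;/p&gt;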

&lt;blockquote&gt;
  &lt;p&gt;Designing a concurrent program is a difficult task; no formalism can make it easy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
  &lt;p&gt;when designing a concurrent program, we cannot restrict our attention to what is true before and after its execution; we must also consider what happens &lt;em&gt;during&lt;/em&gt; its execution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lamport has written a great deal on temporal logic, including the deservedly heavily-cited &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-actions.pdf&quot;&gt;The Temporal Logic of Actions&lt;/a&gt;, but none of his other papers on the subject are (in my opinion) as well written as the earlier &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/what-good.pdf&quot;&gt;What Good Is Temporal Logic?&lt;/a&gt; This paper came fairly early on in the development of TLA, and isn’t a complete picture of the idea, but what is there is presented in a way that is both precise and compelling. The paper compares the approach to Milner’s work on the Calculus of Communicating Systems, explains the philosophy of temporal logic, explores the value of specifications, and presents the idea of refinement mappings. One point of interest is the conclusion about the expense of mechanical verification, something that has been substantially improved since 1983 more by hardware power than by theory.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading&lt;/em&gt;: it is excellent scientific writing, and covers a lot of ground in not much text.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;as commerce flourished, priests began wandering in and out of the Chamber while the Synod was in progress&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf&quot;&gt;The Part-Time Parliament&lt;/a&gt; is important because of what it contains, and when it was published. It describes, in a fun but impractical way, the Paxos algorithm for distributed consensus. Originally submitted in 1989 (or 1990, depending on who you believe), it came at around the same time as Oki and Liskov’s work on &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=62549&quot;&gt;viewstamped replication&lt;/a&gt; and is one of the earliest solutions for a problem of both practical and theoretical importance. &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf&quot;&gt;Paxos Made Simple&lt;/a&gt;, &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/web-dsn-submission.pdf&quot;&gt;Cheap Paxos&lt;/a&gt;, &lt;a href=&quot;http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&amp;amp;id=966&quot;&gt;Fast Paxos&lt;/a&gt; and &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/vertical-paxos.pdf&quot;&gt;Vertical Paxos and Primary-Backup Replication&lt;/a&gt; all contain more easily understandable descriptions of various variants of Paxos.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why this is worth reading&lt;/em&gt;: it’s fun, has a great sense of humor, and no sense of self-importance to go with its real importance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Don’t read this if&lt;/em&gt;: you’re trying to understand or implement Paxos.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Snark, Chord, and Trust in Algorithms</title>
      <link>http://brooker.co.za/blog/2014/03/08/model-checking.html</link>
      <pubDate>Sat, 08 Mar 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/03/08/model-checking</guid>
      <description>&lt;h1 id=&quot;snark-chord-and-trust-in-algorithms&quot;&gt;Snark, Chord, and Trust in Algorithms&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Good journals, well-known authors and informal proofs are not sufficient.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://cacm.acm.org/magazines/2014/2/171689-mars-code/fulltext&quot;&gt;Mars Code&lt;/a&gt;, in February’s CACM, is a very interesting look from the outside at some of the software engineering practices that helped make the &lt;a href=&quot;http://mars.jpl.nasa.gov/msl/&quot;&gt;Mars Science Laboratory&lt;/a&gt; mission successful. The authors cover a lot of ground in the article, from code reviews to coding standards to model extraction and model checking. While writing about model checking, they tell the story of &lt;a href=&quot;http://people.csail.mit.edu/shanir/publications/evenbetterDCAS2001.pdf&quot;&gt;Snark&lt;/a&gt;, a non-blocking deque algorithm.&lt;/p&gt;

&lt;p&gt;The Snark paper looks great. The algorithm is presented clearly, including both clear text descriptions and blocks of C-like pseudocode. It’s published in a well-respected journal, &lt;a href=&quot;http://www.springer.com/computer/lncs?SGWID=0-164-0-0-0&quot;&gt;LNCS&lt;/a&gt;. It dedicates four and a half of its fifteen pages to a well-structured and clearly written sketch proof of the correctness of the algorithm, with clear diagrams explaining some of the tricky cases. It’s got &lt;a href=&quot;http://en.wikipedia.org/wiki/Guy_L._Steele,_Jr.&quot;&gt;Guy Steele&lt;/a&gt; on the author list. It’s from Sun. As far as my &lt;em&gt;is this paper likely to be trustworthy?&lt;/em&gt; heuristics go, this one doesn’t raise many suspicions.&lt;/p&gt;

&lt;p&gt;Unfortunately, Snark is broken. Not subtly broken on an obscure liveness measure, but fundamentally broken in that it’s unsafe. In &lt;a href=&quot;http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/DCAS.pdf&quot;&gt;DCAS is not a Silver Bullet for Nonblocking Algorithm Design&lt;/a&gt;, Doherty et al lay out two bugs in the Snark algorithm. It’s worth noting, though, that &lt;em&gt;et al&lt;/em&gt; in this case includes most of the authors of the original paper. The paper explains the bugs well, and it’s interesting how subtly 30 lines of code can be broken. It’s well worth reading, especially if you are interested in non-blocking algorithms. Later, Leslie Lamport used the same bugs as a test of the &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/pluscal.html&quot;&gt;PlusCal&lt;/a&gt; language. In &lt;a href=&quot;http://research.microsoft.com/pubs/64627/dcas.pdf&quot;&gt;Checking a Multithreaded Algorithm with +CAL&lt;/a&gt; he explains how he model checked the original and fixed Snark algorithms using TLC.&lt;/p&gt;

&lt;p&gt;Snark is not an isolated example either. Another great example is &lt;a href=&quot;http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf&quot;&gt;Chord&lt;/a&gt;. It checks the same boxes: top authors, a top venue, a good institution, and a detailed sketch proof. In Chord’s case, it’s also widely implemented and well respected. It even won the &lt;a href=&quot;http://www.sigcomm.org/awards/test-of-time-paper-award&quot;&gt;2011 SIGCOMM Test of Time Paper Award&lt;/a&gt;. Like Snark, Chord appears to be subtly broken. Pamela Zave, in &lt;a href=&quot;http://public.research.att.com/~pamela/chord-ccr.pdf&quot;&gt;Using Lightweight Modeling To Understand Chord&lt;/a&gt;, found a number of places where either Chord is broken, or the version presented in the paper (and proven in the proofs!) isn’t safe. More details can be found in the slides for a talk titled &lt;a href=&quot;http://www.cs.cornell.edu/conferences/formalnetworks/pamela-slides-i.pdf&quot;&gt;How to make Chord Correct&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The purpose of pointing out these examples isn’t self-superiority or even schadenfreude, but to demonstrate that the heuristics we use when deciding whether to trust a published algorithm don’t always work. Even talented teams of really smart people get this stuff wrong all the time. Not only are multi-threaded and distributed algorithms tricky to reason about, but techniques for proving their correctness are also complex and often not approachable. Most practitioners lack the training to verify the correctness of the proofs that do exist. Often, it’s even quite challenging to write down the requirements sufficiently clearly, though techniques like &lt;a href=&quot;http://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/refinement%20mappings.pdf&quot;&gt;Refinement Mappings&lt;/a&gt; can often help.&lt;/p&gt;

&lt;p&gt;Modeling, specification and model checking play an important role. Lamport demonstrated issues in Snark using &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html&quot;&gt;TLA+&lt;/a&gt;, Doherty et al used &lt;a href=&quot;http://spinroot.com/spin/whatispin.html&quot;&gt;Spin&lt;/a&gt;, and Zave used &lt;a href=&quot;http://alloy.mit.edu/alloy/&quot;&gt;Alloy&lt;/a&gt;. Each of these tools has areas of strength, but each allows computer-aided checking of specifications or models. Lamport puts it this way:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;a hand proof is no substitute for model checking. As the two-sided queue example shows, it is easy to make a mistake in an informal proof.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tools like TLA+ have big advantages for practitioners: they tend to be much more approachable than proof techniques, allow exploration of algorithm variants without starting from scratch, and allow easy exploration of the effect of a model on the correctness of a specification. They also look like code, and the tools frequently look and behave like the editors and compilers most programmers are familiar with. If you can pick up Haskell or Scheme, and have a passing familiarity with basic set theory, you can learn TLA+, PlusCal or Spin. As another benefit, formal specifications of algorithms like &lt;a href=&quot;https://ramcloud.stanford.edu/~ongaro/raft.tla&quot;&gt;Raft&lt;/a&gt; and &lt;a href=&quot;http://research.microsoft.com/pubs/64634/web-dsn-submission.pdf&quot;&gt;Cheap Paxos&lt;/a&gt; are often (though &lt;a href=&quot;https://groups.google.com/forum/#!topic/raft-dev/yu-wOUx-gnA&quot;&gt;not always&lt;/a&gt;) more complete and precise than either the text or pseudocode descriptions given in papers. These specifications allow automated model checking, and provide evidence that such model checking has been done. A specification provides some evidence that the author has thought through all the edge cases of an implementation, at least within the boundaries of the model.&lt;/p&gt;

&lt;p&gt;Of course, model checking isn’t a panacea. It’s just as easy to make mistakes in either the specification or invariants as it is in a proof. Lamport explains some other limitations of model checking:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Model checking is no substitute for proof. Most algorithms can be checked only on instances of an algorithm that are too small to give us complete confidence in their correctness. Moreover, a model checker does not explain why the algorithm works.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Traditional software engineering practices, such as unit tests, are another promising approach. Approaches for testing distributed and multi-threaded algorithm implementations, with good coverage of ordering and failure cases, are still in their infancy, but there is a lot of very encouraging progress. More broadly, and more in industry than academia, we need a cultural shift. A developer who developed their own block cipher and dismissed concerns with &lt;em&gt;it’s easy to reason that it is secure&lt;/em&gt; would get laughed at by the security community. Somebody who proposed a new sorting algorithm and refused to demonstrate that it actually sorts things would be the target of derision. People who propose their own distributed algorithms, both in the research press and in practice, are too frequently allowed to get away with &lt;em&gt;trust me&lt;/em&gt; and &lt;em&gt;it’s obvious&lt;/em&gt; as demonstrations of correctness. I’m not accusing the authors of the Snark or Chord papers of this, but it is very common. Depending on what you are doing, that’s not good enough. You don’t need to read much more than the &lt;a href=&quot;http://www.wired.com/opinion/2013/01/code-bugs-programming-why-we-need-specs/#disqus_thread&quot;&gt;comments&lt;/a&gt; on Lamport’s &lt;a href=&quot;http://www.wired.com/opinion/2013/01/code-bugs-programming-why-we-need-specs/&quot;&gt;Why We Should Build Software Like We Build Houses&lt;/a&gt; to see evidence of a strong anti-design and anti-specification bent in software engineering practitioners. While these formal techniques and methods aren’t needed for much of the work programmers do, there are many types of systems for which they are extremely helpful (and should more often be required).&lt;/p&gt;

&lt;p&gt;If you’re looking for a distributed algorithm, and your business, your life or your reputation are on the line, don’t accept &lt;em&gt;I copied it from a paper&lt;/em&gt;. Don’t accept &lt;em&gt;it’s obviously right&lt;/em&gt;. Don’t accept correctness based on reputation. Think twice before trusting only an informal proof.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Distributed Consensus: Beating Impossibility with Probability One</title>
      <link>http://brooker.co.za/blog/2014/01/12/ben-or.html</link>
      <pubDate>Sun, 12 Jan 2014 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2014/01/12/ben-or</guid>
      <description>&lt;h1 id=&quot;distributed-consensus-beating-impossibility-with-probability-one&quot;&gt;Distributed Consensus: Beating Impossibility with Probability One&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Distributed systems models are critical to understanding impossibility results&lt;/p&gt;

&lt;p&gt;Reading Nancy Lynch’s 1989 paper &lt;a href=&quot;http://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf&quot;&gt;A Hundred Impossibility Proofs for Distributed Computing&lt;/a&gt; was the first time I came to a real understanding of the value of impossibility proofs. Before reading it, I was aware of many of the famous impossibility proofs, including &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Brewer’s CAP Theorem&lt;/a&gt;, &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf&quot;&gt;FLP impossibility&lt;/a&gt; and the &lt;a href=&quot;http://research.microsoft.com/pubs/64633/bertinoro.pdf&quot;&gt;lower bounds of number of rounds needed for consensus&lt;/a&gt;, but I’d always held existence proofs to be somehow more important. My attitude was along these lines:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;What good are impossibility results, anyway? They don’t seem very useful at first, since they don’t allow computers to do anything they couldn’t previously.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Following that question (in Section 3.5 of &lt;em&gt;A Hundred Impossibility Proofs&lt;/em&gt;), Lynch goes on to justify the importance of impossibility proofs. The whole case is worth reading, but the one that resonates with me most strongly as a practitioner is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;… the effect of the impossibility result might be to make a systems developer clarify his/her claims about what the system accomplishes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nearly 25 years have passed since the publication of this paper, and that remains something of a hopeful dream. Despite the efforts of Lynch, &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/state-the-problem.pdf&quot;&gt;Lamport&lt;/a&gt;, &lt;a href=&quot;http://aphyr.com/tags/jepsen&quot;&gt;Aphyr&lt;/a&gt;, &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;Daniel Abadi&lt;/a&gt; and many others, there’s still a long way to go in having distributed systems developers clearly state the guarantees their systems are making.&lt;/p&gt;

&lt;p&gt;Another effect of impossibility proofs, and the clear definition of the models in which they exist, has been research into how little it is possible to change the model to get around the impossibility result. Easily my personal favorite result in this area is another paper from the 1980s, Michael Ben-Or’s &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=806707&quot;&gt;Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols&lt;/a&gt; from 1983 (sadly, I can’t seem to find an open-access version of that paper), and a similar result by Rabin in the same year. Ben-Or looked at the &lt;a href=&quot;http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/&quot;&gt;FLP impossibility result&lt;/a&gt;, and discovered an algorithm which can achieve consensus with &lt;a href=&quot;http://en.wikipedia.org/wiki/Almost_surely&quot;&gt;probability one&lt;/a&gt; in a slightly modified system model.&lt;/p&gt;

&lt;p&gt;The first two sections of the paper lay out the problem to be solved, describe the properties of the solution and present the system model. The system model is the standard asynchronous message passing one, with the additional ability of each process to make non-deterministic decisions. This is the key difference between the problem Ben-Or is solving, and the problem FLP proves is impossible. At each &lt;em&gt;step&lt;/em&gt; (i.e. after receiving a message), a process can make a decision based on its internal state, the message state, and some probabilistic state. FLP’s processes can’t do the last of these: the decisions they make must be deterministic based only on their internal state, and the message state. This is a great illustration of the importance of models in distributed systems proofs. A slight variation of the model turns the problem from an impossible one into one that is both possible and not particularly complex.&lt;/p&gt;

&lt;p&gt;The other key point from section 2, which the correctness of the whole algorithm depends on, is this one:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If for all &lt;em&gt;P&lt;/em&gt;, &lt;em&gt;x_p = v&lt;/em&gt;, then the decision must be &lt;em&gt;v&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the paper’s language, &lt;em&gt;x_p&lt;/em&gt; is the original binary input made by process &lt;em&gt;P&lt;/em&gt;. This is different from the &lt;em&gt;majority wins&lt;/em&gt; model which is frequently used when informally talking about consensus. The algorithm is correct if it chooses 1, as long as at least one process originally proposed 1. In a system with five processes, if four propose 0 and one proposes 1, then 1 is still a correct solution. If all five propose 0, only 0 is the correct solution. This definition of correctness becomes critical when we look at the algorithm itself.&lt;/p&gt;

&lt;p&gt;The algorithm proceeds in rounds, with four steps per round. In the first step of each round, each process broadcasts its &lt;em&gt;x_p&lt;/em&gt;, along with the round number. It then waits until it receives &lt;em&gt;N - t&lt;/em&gt; of these first-step messages, where &lt;em&gt;N&lt;/em&gt; is the number of processes, and &lt;em&gt;t&lt;/em&gt; is the number of faulty processes (more on &lt;em&gt;t&lt;/em&gt; later). The second step then depends on the set of messages received.&lt;/p&gt;

&lt;p&gt;If more than N/2 messages have the same &lt;em&gt;v&lt;/em&gt;, then the process broadcasts a message the paper calls a &lt;em&gt;D-message&lt;/em&gt;, basically just a message indicating that the process has seen a majority of the same value. Obviously if there have been no failures, this happens on the first round (because it’s binary consensus, and there’s always a majority). Similarly, in the trivial case, where all &lt;em&gt;x_p&lt;/em&gt; were the same, all processes will send &lt;em&gt;D-message&lt;/em&gt;s. On the other hand, if a process has seen a split vote, it sends a message indicating that it’s still unsure.&lt;/p&gt;

&lt;p&gt;In the third step, each process waits for &lt;em&gt;N-t&lt;/em&gt; of the step 2 messages, and counts how many of those were &lt;em&gt;D-messages&lt;/em&gt;. If it gets only one &lt;em&gt;D-message&lt;/em&gt; it sets &lt;em&gt;x_p&lt;/em&gt; to the &lt;em&gt;v&lt;/em&gt; in that message for future rounds. If a process gets more than &lt;em&gt;t&lt;/em&gt; &lt;em&gt;D-messages&lt;/em&gt;, we’re done and can decide on &lt;em&gt;v&lt;/em&gt;. In this case, all the &lt;em&gt;D-messages&lt;/em&gt; will have the same &lt;em&gt;v&lt;/em&gt;, because it’s not possible in step 2 for more than one &lt;em&gt;v&lt;/em&gt; to be in more than N/2 messages. At this point, the algorithm may be feeling oddly similar to &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf&quot;&gt;Paxos’&lt;/a&gt; Synod protocol. Finally, if no &lt;em&gt;D-messages&lt;/em&gt; were received, the process does something interesting - it randomly selects a new &lt;em&gt;v&lt;/em&gt; with probability 0.5.&lt;/p&gt;

&lt;p&gt;This is where things start to get interesting for the correctness criterion. If a process gets to this random selection part of step 3 in the first round, it must be true that &lt;em&gt;x_p&lt;/em&gt; didn’t have the same value for all &lt;em&gt;P&lt;/em&gt;. If that isn’t the case, all the processes could choose a different &lt;em&gt;v&lt;/em&gt;, and break the correctness of the protocol. For this protocol to be correct, it must decide in a single round in the trivial case, and not allow random re-selection. The protocol combines two things, random re-selection and non-triviality, which are not obviously compatible at first glance.&lt;/p&gt;
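&lt;p&gt;The rounds above are easy to simulate. The sketch below uses my own names and considers only crash-free executions, where each process simply happens to hear from a random &lt;em&gt;N - t&lt;/em&gt; of its peers; in the unanimous case it decides in the first round, as the correctness condition requires:&lt;/p&gt;

```python
import random

def ben_or_round(values, N, t, rng):
    # Steps 1 and 2: each process tallies N - t first-step votes, and sends
    # a D-message if one value had a strict majority of all N processes.
    step2 = []
    for p in range(N):
        votes = [values[q] for q in rng.sample(range(N), N - t)]
        if votes.count(0) > N / 2:
            step2.append(("D", 0))
        elif votes.count(1) > N / 2:
            step2.append(("D", 1))
        else:
            step2.append(("?", None))     # split vote: still unsure
    # Step 3: each process examines N - t second-step messages.
    new_values, decisions = [], []
    for p in range(N):
        msgs = [step2[q] for q in rng.sample(range(N), N - t)]
        d = [v for tag, v in msgs if tag == "D"]
        if len(d) > t:
            decisions.append(d[0])        # more than t D-messages: decide v
            new_values.append(d[0])
        elif d:
            new_values.append(d[0])       # adopt the D-message's value
        else:
            new_values.append(rng.choice((0, 1)))  # flip a fair coin
    return new_values, decisions

def ben_or(initial, t, rng):
    values, N = list(initial), len(initial)
    for rounds in range(1, 1001):         # cap the demo at 1000 rounds
        values, decisions = ben_or_round(values, N, t, rng)
        if decisions:
            return decisions[0], rounds
    return None, rounds

rng = random.Random(42)
print(ben_or([0, 0, 0, 0, 0], t=1, rng=rng))  # (0, 1): unanimity decides at once
print(ben_or([1, 1, 0, 0, 1], t=1, rng=rng))  # decides 0 or 1, probability one
```

&lt;p&gt;Note how the unanimous case never reaches the coin flip: every process sees a majority, sends a &lt;em&gt;D-message&lt;/em&gt;, and decides in round one, which is exactly why random re-selection and non-triviality can coexist.&lt;/p&gt;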

&lt;p&gt;The number of rounds used by this algorithm, and its Byzantine fault-tolerant counterpart, is surprisingly low. For many executions, consensus can be reached on the first round, and the number of rounds increases as slowly as you would expect the number of randomly selected ties to increase. Here’s the number of rounds needed for each of 100k runs of the N=5 t=1 case based on a simulation of the algorithm:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/ben_or_rounds.png&quot; alt=&quot;Number of rounds required to reach consensus&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I find this paper particularly interesting for a few reasons. The first reason is that it demonstrates how sensitive the FLP result is to the problem statement and model in which it is proven. As distributed systems practitioners who use academic research and formal models to inform our designs (as we should), we need to be careful to not over- or understate what various results actually mean. It’s possible, and actually extremely common, to read the CAP and FLP results to mean something like &lt;em&gt;distributed consensus is impossible&lt;/em&gt;, when they actually mean &lt;em&gt;exactly this problem is impossible in exactly this system model&lt;/em&gt;. These results should only be extended to other problems and other models with care.&lt;/p&gt;

&lt;p&gt;The second reason is that it’s a very creative solution to a tricky problem. Backed into a corner by FLP, Ben-Or found a very creative solution that still solves a useful problem in a meaningful system model. For practitioners like me, that’s the dream. I want to solve real problems in real systems, and I really admire solutions like this. The third reason is that it’s a great reminder, when faced with a claim that a system is solving an apparently impossible problem, that we should ask exactly what problem is being solved, and in exactly what system model. It would be easy to package up Ben-Or’s result in a press release titled &lt;em&gt;New Algorithm Proves FLP Wrong&lt;/em&gt;, but that would be missing the point entirely.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Restricted Transactional Memory on Haswell</title>
      <link>http://brooker.co.za/blog/2013/12/16/intel-rtm.html</link>
      <pubDate>Mon, 16 Dec 2013 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2013/12/16/intel-rtm</guid>
      <description>&lt;h1 id=&quot;restricted-transactional-memory-on-haswell&quot;&gt;Restricted Transactional Memory on Haswell&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Exploring the performance of Intel&apos;s RTM&lt;/p&gt;

&lt;p&gt;In my last post, I looked at the performance of HLE on Intel’s Haswell processor, and found that while it offered a nice speedup in some cases, it can cost performance in others. Still, Intel’s &lt;a href=&quot;http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell&quot;&gt;TSX&lt;/a&gt; is an extremely exciting technology. In this post, I look at the other half of TSX, which Intel calls &lt;strong&gt;Restricted Transactional Memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you’ve never heard of transactional memory before, it’s worth reading up. As usual, &lt;a href=&quot;http://en.wikipedia.org/wiki/Transactional_memory&quot;&gt;the Wikipedia article&lt;/a&gt; isn’t a bad place to start. Some languages, like Clojure, offer &lt;a href=&quot;http://clojure.org/concurrent_programming&quot;&gt;software transactional memory&lt;/a&gt; out of the box. When it fits, STM can be an extremely nice way to write concurrent programs. The programming model can be simpler, and some classes of bugs (correctness, mostly, rather than liveness) are easier to avoid.&lt;/p&gt;

&lt;p&gt;Unlike all the great STM libraries, the current interfaces available to RTM, at least in the form of compiler builtins, don’t offer much in the way of a simpler programming model. They do, however, offer us a great way to taste some of the performance benefits that Intel promises for RTM. First, let’s take a look at what Intel says about RTM. Starting with the &lt;a href=&quot;http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf&quot;&gt;Intel® 64 and IA-32 Architectures Optimization Reference Manual&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Intel® Transactional Synchronization Extensions (Intel TSX) aim to improve the performance of lock-protected critical sections while maintaining the lock-based programming model&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OK, so a simpler programming model isn’t really Intel’s aim here. I’m still pretty sure that there are great opportunities for TM libraries, some of which are already starting to appear, like &lt;a href=&quot;https://github.com/amidvidy/xsync&quot;&gt;xsync&lt;/a&gt;. Some more from the manual:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Intel TSX allows the processor to determine dynamically whether threads need to serialize through lock-protected critical sections, and to perform serialization only when required. This lets hardware expose and exploit concurrency hidden in an application due to dynamically unnecessary synchronization through a technique known as lock elision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the high-level view. When the CPU detects that the lock isn’t held, it tries to run without it. If that all goes horribly wrong, because another core tried to do the same thing, the processor undoes its misdeeds and tries again with the lock. It’s clever stuff, and with inter-core synchronization becoming more and more of a bottleneck in multicore systems, it’s not surprising Intel’s investing in it.&lt;/p&gt;

&lt;p&gt;Let’s take a look at how RTM performs. For today’s test, I wrote a simple separately-chained hash table, and inserted random &lt;em&gt;long&lt;/em&gt;s into it until it reached 1000% occupancy. I also made insertions a little artificially slow by doing tail, rather than head, insertions when chaining. For synchronization, I followed Example 12-3 in the &lt;a href=&quot;http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf&quot;&gt;manual&lt;/a&gt;, which shows a pattern for the use of RTM. For the fallback lock, I used the simple spinlock from Example 12-4 (without HLE, because you can’t use both together on Haswell). In unoptimized assembly, the lock function ended up looking like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    movq    %rdi, -8(%rbp)
    movl    $-1, %eax
    xbegin  .L10
.L10:
    cmpl    $-1, %eax
    jne     .L12
    movq    -8(%rbp), %rax
    movl    (%rax), %eax
    testl   %eax, %eax
    jne     .L13
    jmp     .L9
.L13:
    xabort  $255
.L12:
    movq    -8(%rbp), %rax
    movq    %rax, %rdi
    call    orig_spinlock
.L9:
    leave
    ret
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There you can see two of the three new operations that make up RTM. Interestingly, the &lt;tt&gt;xbegin&lt;/tt&gt; operation can jump (hence the label argument), but the patterns in the manual, when used with GCC’s builtins, don’t use that functionality. Next, we see that we test the lock, and if it’s free we return right away. Finally, if somebody else has taken the lock, we &lt;tt&gt;xabort&lt;/tt&gt; to end our transaction and fall back on the lock path (in this case by calling orig_spinlock, which is my spinlock implementation). The unlock side tests the lock again to differentiate between the elided path and the locked path, and calls &lt;tt&gt;xend&lt;/tt&gt; on the elided path. Nothing very complex at all code-wise.&lt;/p&gt;
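&lt;p&gt;Since the source for these tests isn’t released yet, here’s a minimal, portable sketch of the data structure under test: separate chaining with tail insertion. The names and bucket count here are mine, not the original benchmark’s; in the benchmark, each insert would be wrapped in the RTM lock and unlock shown above.&lt;/p&gt;

```c
#include <stdlib.h>

/* A separately-chained hash table with tail insertion, sketching the
 * structure described in the post. Tail insertion walks the whole
 * chain, which is what makes the inserts artificially slow. Names and
 * sizes are illustrative, not from the original benchmark. */

#define NBUCKETS 1024

struct node {
    long key;
    struct node *next;
};

static struct node *buckets[NBUCKETS];

static void insert(long key) {
    struct node *n = malloc(sizeof(*n));
    n->key = key;
    n->next = NULL;
    unsigned long b = (unsigned long)key % NBUCKETS;
    if (buckets[b] == NULL) {
        buckets[b] = n;
        return;
    }
    struct node *cur = buckets[b];
    while (cur->next != NULL)   /* walk to the tail: the slow part */
        cur = cur->next;
    cur->next = n;
}
```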

&lt;p&gt;First, let’s look at the results of doing a million inserts into a 100k bucket hash table by number of threads. I ran each test 100 times, and inter-test variance was very low (less than 5%). The RTM implementation is in red, and the straight spinlock-based one (with no HLE) is in blue:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/tsx_rtm_threads.png&quot; alt=&quot;RTM vs baseline&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The RTM version is repeatably very marginally slower with a single thread (about 0.2% on average), but otherwise faster across the board. The spinlock implementation succumbs to increasing contention costs and is &lt;em&gt;slower&lt;/em&gt; than single-threaded, while RTM is faster until 4 threads. The really impressive performance here is at 2 threads, where the RTM version is much, much faster. It shows true parallel speedup on this task, which is impressive considering it’s nearly entirely dominated by lock contention. Comparing performance directly:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/tsx_rtm_speedup.png&quot; alt=&quot;RTM vs baseline relative&quot; /&gt;&lt;/p&gt;

&lt;p&gt;RTM is nicely quicker across the board, and runs in only 63% of the time with two threads. It’s a really great little performance gain, with very little programmer effort.&lt;/p&gt;

&lt;p&gt;What happens if we throw HLE into the mix on this program? I added an HLE version of the code (following Example 12-4) to the two I already had. This was the result:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/tsx_rtm_hle_threads.png&quot; alt=&quot;RTM, HLE vs baseline&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That’s extremely interesting. The HLE implementation is a solid 18% slower than baseline on the single-threaded version, then shows a massive performance advantage until 8 threads. It doesn’t improve total parallel speedup beyond three threads, but does very effectively prevent slowdown.&lt;/p&gt;

&lt;p&gt;I started looking at this hoping to find some good answers, and it just left me with more questions. Clearly, HLE and RTM have great potential to improve multi-threaded performance on contended data structures, but it’s not clear-cut when they should be used. So far, my experiments have shown RTM is better across the board than a plain lock, while HLE can be even better than that, at a potential cost in single-threaded performance.&lt;/p&gt;

&lt;p&gt;I suspect it’s going to take a while for compiler and library writers to untangle all of this. We’re not going to be seeing the full performance benefits of these features any time soon.&lt;/p&gt;

&lt;p&gt;I’ll release the source for these tests as soon as I can.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Hardware Lock Elision on Haswell</title>
      <link>http://brooker.co.za/blog/2013/12/14/intel-hle.html</link>
      <pubDate>Sat, 14 Dec 2013 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2013/12/14/intel-hle</guid>
      <description>&lt;h1 id=&quot;hardware-lock-elision-on-haswell&quot;&gt;Hardware Lock Elision on Haswell&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Exploring the performance of Intel&apos;s HLE&lt;/p&gt;

&lt;p&gt;A couple of months ago, I bought myself a new home PC, upgrading from my old Core2 Q6600 to a shiny new &lt;a href=&quot;http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29&quot;&gt;Haswell&lt;/a&gt;-based Xeon E3-1240v3. Honestly, I don’t use my home PC that much, so the biggest draw for upgrading was trying out some of the features in Haswell, and getting loads of ECC RAM to support another project.&lt;/p&gt;

&lt;p&gt;The biggest thing I was excited about with Haswell is Intel’s new &lt;a href=&quot;http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell&quot;&gt;TSX&lt;/a&gt;, a step towards true hardware transactional memory on commodity processors. Transactional memory is a very exciting idea, and seeing better support for it in hardware is really great. TSX provides two broad sets of functionality: restricted transactional memory (RTM), and hardware lock elision (HLE). HLE can be seen as a subset of RTM, offering backward compatibility with pre-Haswell processors. I started my investigations by looking at HLE.&lt;/p&gt;

&lt;p&gt;Intel describes Haswell’s HLE like this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If multiple threads execute critical sections protected by the same lock but they do not perform any conflicting operations on each other’s data, then the threads can execute concurrently and without serialization. Even though the software uses lock acquisition operations on a common lock, the hardware is allowed to recognize this, elide the lock, and execute the critical sections on the two threads without requiring any communication through the lock if such communication was dynamically unnecessary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Based on this description, I was expecting HLE to work best on low-contention locks, possibly significantly increasing performance. Intel’s backward-compatible HLE is based on two new instruction prefixes (rather than new instructions): XACQUIRE (F2) and XRELEASE (F3). You basically put the XACQUIRE prefix on the instruction that starts your critical section, and XRELEASE on the instruction that ends it. There are a bunch of good ways to implement locks on x86, but most commonly the start instruction will be an &lt;em&gt;xchg&lt;/em&gt; or &lt;em&gt;cmpxchg&lt;/em&gt;, and the end will be one of those, or just a &lt;em&gt;mov&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Luckily, to try this out I didn’t need to write any assembly, because GCC 4.8 supports HLE through &lt;a href=&quot;http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins&quot;&gt;atomic builtins&lt;/a&gt;, thanks to the work of &lt;a href=&quot;http://halobates.de/adding-lock-elision-to-linux.pdf&quot;&gt;Andi Kleen&lt;/a&gt;. Taking advantage of HLE is as simple as passing an additional &lt;tt&gt;__ATOMIC_HLE_ACQUIRE&lt;/tt&gt; flag to &lt;tt&gt;__atomic_exchange_n&lt;/tt&gt; in your lock implementation, and &lt;tt&gt;__ATOMIC_HLE_RELEASE&lt;/tt&gt; to your unlock. GCC then emits the prefixes on the lock instructions.&lt;/p&gt;
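&lt;p&gt;As a concrete illustration, here’s a portable test-and-set spinlock built on those builtins. This is my sketch, not the benchmark code from this post: it omits the HLE flags so it runs on any GCC/Clang target, and the comments mark where the elided variant differs.&lt;/p&gt;

```c
#include <pthread.h>

/* A test-and-set spinlock on GCC's atomic builtins. This portable
 * sketch omits lock elision; on GCC 4.8+ targeting Haswell, OR
 * __ATOMIC_HLE_ACQUIRE / __ATOMIC_HLE_RELEASE into the memory-order
 * arguments and GCC emits the xacquire/xrelease prefixes. */

static void spin_lock(int *lock) {
    /* Elided variant: __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        ;  /* spin until we swap a 1 in and see a 0 come back */
}

static void spin_unlock(int *lock) {
    /* Elided variant: __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE */
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}

static int lock_word;
static long counter;

/* Each worker takes the lock around a tiny critical section, in the
 * spirit of the count-sort benchmark described below. */
static void *worker(void *arg) {
    long iters = (long)arg;
    for (long i = 0; i < iters; i++) {
        spin_lock(&lock_word);
        counter++;
        spin_unlock(&lock_word);
    }
    return NULL;
}
```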

&lt;p&gt;Implementing a spinlock with &lt;tt&gt;&lt;a href=&quot;http://software.intel.com/en-us/blogs/2013/05/20/using-hle-and-rtm-with-older-compilers-with-tsx-tools&quot;&gt;__atomic_exchange_n&lt;/a&gt;&lt;/tt&gt;, the difference in the emitted assembly is very simple:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-	xchgl	(%rdi), %eax
+	xacquire xchgl	(%rdi), %eax
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To test all of this out, I implemented a multithreaded count-sort of 100MB of random integers into 10000 buckets in C. The count buckets were protected by striped spinlocks, with some number of buckets sharing a single spinlock. Each thread looped over its unique data, and for each item took the spinlock, increased the bucket count and released the spinlock. Obviously a very short critical section, and probably better implemented without locks (directly with an atomic &lt;em&gt;cmpxchg&lt;/em&gt;, for example), but I’m only starting out here.&lt;/p&gt;
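&lt;p&gt;The striping itself is simple: each bucket maps to one of a smaller pool of locks. Here’s a portable sketch of that per-item step (structure mine, numbers illustrative), with pthread mutexes standing in for the spinlocks:&lt;/p&gt;

```c
#include <pthread.h>

/* Striped locking for the bucket counters: NBUCKETS counters share
 * NLOCKS locks, so bucket b is protected by lock b % NLOCKS. pthread
 * mutexes stand in here for the post's (optionally HLE-elided)
 * spinlocks; the structure is the same. */

#define NBUCKETS 10000
#define NLOCKS   64

static int counts[NBUCKETS];
static pthread_mutex_t locks[NLOCKS];

static void init_locks(void) {
    for (int i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&locks[i], NULL);
}

/* One iteration of a worker's loop: count a single value. */
static void count_one(int value) {
    int bucket = value % NBUCKETS;
    pthread_mutex_t *lock = &locks[bucket % NLOCKS];  /* the striping */
    pthread_mutex_lock(lock);
    counts[bucket]++;
    pthread_mutex_unlock(lock);
}
```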

&lt;p&gt;I then ran two versions of the program for various thread counts and lock counts: a version with the HLE prefixes and a version without them and measured the difference in performance:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/tsx_all_wireframe.png&quot; alt=&quot;HLE performance differences wireframe&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s a cut through the number of locks, for 3 threads (blue) and 10 threads (red). With 3 threads, HLE is faster across the board, and with 10 threads, HLE is a wash with more locks and a big loss with fewer. Both win big with a single lock:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/tsx_3_10.png&quot; alt=&quot;HLE performance cut&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The results are actually very interesting. For this (admittedly extremely lock-intensive) program, the version with HLE takes nearly twice as long as the one without when run with a single thread. However, when run with one lock and 10 threads, a massive amount of contention, the HLE version is more than &lt;strong&gt;6 times faster&lt;/strong&gt;. That’s pretty amazing.&lt;/p&gt;

&lt;p&gt;If we peek under the hood a bit, we should be able to see what’s going on inside the processor to make these big performance differences. As always with low-level CPU performance stuff, &lt;a href=&quot;https://perf.wiki.kernel.org/index.php/Tutorial&quot;&gt;perf stat&lt;/a&gt; is a great place to start. First, let’s look at a case where HLE is much faster, with 3 threads and 10 locks. The HLE version is more than 2x faster in this case.&lt;/p&gt;

&lt;p&gt;A first pass with perf stat shows this for the HLE version:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1,333,417,638 instructions # 0.20  insns per cycle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and this for the non-HLE version:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1,333,549,171 instructions # 0.09  insns per cycle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The number of executed instructions is nearly the same, as expected, but instructions per cycle is very different. Obviously, the CPU is doing something other than running instructions, and that something is most likely waiting for memory. Looking at some of the cache counters makes this more obvious. First, the version with HLE:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; 27,804,277 LLC-stores
 25,302,175 LLC-loads
412,030,786 L1-dcache-loads
 65,940,443 L1-dcache-load-misses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and the version without:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; 67,958,384 LLC-stores
 32,624,733 LLC-loads
400,883,047 L1-dcache-loads
 89,777,892 L1-dcache-load-misses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The big difference in LLC-stores (writes to &lt;em&gt;Last Level Cache&lt;/em&gt;, L3 in our case) explains a big part of the issue here. The difference in L1-dcache-load-misses is also significant. The non-HLE version is making a lot more trips to memory. HLE is clearly doing its job. In the single-lock case, with HLE’s 6x performance improvement, the difference in LLC-stores is huge (6x, actually).&lt;/p&gt;

&lt;p&gt;So what about the single thread case? Why does HLE make that slower? I don’t understand enough about Intel’s architecture to make a good guess, but the 2x slower HLE-enabled version is doing about 2x the LLC-stores. I suspect this is an implementation issue with HLE in TSX, and I hope it’s an artifact of this benchmark rather than a general performance issue with TSX. More testing, especially more testing with significant critical sections, should tell.&lt;/p&gt;

&lt;p&gt;I’ll release the source for these tests as soon as I can. I’m hoping to find some time to play with RTM over the holidays, too.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Beyond iostat: Storage performance analysis with blktrace</title>
      <link>http://brooker.co.za/blog/2013/07/14/io-performance.html</link>
      <pubDate>Sun, 14 Jul 2013 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2013/07/14/io-performance</guid>
      <description>&lt;h1 id=&quot;beyond-iostat-storage-performance-analysis-with-blktrace&quot;&gt;Beyond iostat: Storage performance analysis with blktrace&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;An underappreciated set of IO analysis tools.&lt;/p&gt;

&lt;p&gt;If you’ve spent much time at all investigating IO performance on Linux, you’re no doubt already familiar with &lt;em&gt;iostat&lt;/em&gt; from the venerable &lt;a href=&quot;http://sebastien.godard.pagesperso-orange.fr/&quot;&gt;sysstat&lt;/a&gt; package. &lt;em&gt;iostat&lt;/em&gt; is the go-to tool for Linux storage performance monitoring with good reason: it’s available nearly everywhere, it works on the vast majority of Linux machines, and it’s relatively easy to use and understand. Some of what it measures can be &lt;a href=&quot;http://dom.as/2009/03/11/iostat/&quot;&gt;subtle&lt;/a&gt;, and the exact definitions of its measurements can be &lt;a href=&quot;http://www.xaprb.com/blog/2010/09/06/beware-of-svctm-in-linuxs-iostat/&quot;&gt;confusing, and even contentious&lt;/a&gt;, but it’s still a great start.&lt;/p&gt;

&lt;p&gt;Sometimes, though, you need to go into more detail than iostat can provide. The aggregate view from &lt;em&gt;iostat&lt;/em&gt; is simple, but makes it difficult to tell which processes are doing which IOs. Averaging over a period of time can hide subtle performance issues and the real causes of many IO-related problems. To get around these issues, you’ll want to go deeper. If you have the guts for it, a recent kernel, and a good understanding of IO performance issues, you’ll want to reach for &lt;a href=&quot;http://git.kernel.org/cgit/linux/kernel/git/axboe/blktrace.git/tree/README&quot;&gt;blktrace&lt;/a&gt; and friends. The blktrace toolkit provides an extremely powerful way to look at the exact IO performance of a Linux machine, at a wide range of levels of detail, and is vastly more capable than the simple &lt;em&gt;iostat&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For a start, let’s look at the performance of a random read workload to a magnetic drive, with 16k IOs. The manufacturer’s spec sheet says this drive should be delivering about 120 IOs per second on a completely random load. &lt;em&gt;iostat -x&lt;/em&gt; has this to say about the drive:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Device:       rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz
sdb             0.00     0.00  124.67    0.00  3989.33     0.00    32.00
  avgqu-sz   await  svctm  %util
     1.00    8.01   8.02 100.00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As expected, we’re doing about 125 random IOs per second, each at 16k (32.00 512 byte sectors), at a mean service time of around 8ms. That’s pretty much exactly what we would expect from a 7200 RPM magnetic drive. Nothing to see there, then. Next up is a drive I’m a little bit suspicious about. The demanding IO application I’ve been running on it has been sluggish, but other than that I have no real evidence that it’s bad. SMART checks out, for as little as that’s worth. &lt;em&gt;iostat -x&lt;/em&gt; indicates that things are a little slow, but not off the charts:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Device:        rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz
sdf              0.00     0.00   41.33   50.67  1322.67  1621.33    32.00
  avgqu-sz   await  svctm  %util
      1.00   10.90  10.86  99.87
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Time to turn to &lt;em&gt;blktrace&lt;/em&gt; and see what we can find out. The first step to using &lt;em&gt;blktrace&lt;/em&gt; is to capture an IO trace for a period of time. Here, I’ve chosen 30 seconds:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;blktrace -w 30 -d /dev/sdf -o sdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;em&gt;blktrace&lt;/em&gt; command writes down, in a group of files starting with &lt;em&gt;sdf&lt;/em&gt;, a trace of all the IOs being performed to that disk. The trace is stored in a binary format, which obviously doesn’t make for convenient reading. The tool for that job is &lt;em&gt;blkparse&lt;/em&gt;, a simple interface for analyzing the IO traces dumped by &lt;em&gt;blktrace&lt;/em&gt;. They are packaged together, so you’ll have &lt;em&gt;blkparse&lt;/em&gt; if you have &lt;em&gt;blktrace&lt;/em&gt;. When given a &lt;em&gt;blktrace&lt;/em&gt; file, &lt;em&gt;blkparse&lt;/em&gt; outputs a stream of events like these:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;8,32   0    19190    28.774795629  2039  D   R 94229760 + 32 [fio]
8,32   0    19191    29.927624071     0  C   R 94229760 + 32 [0]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;At this point I’ll have to come clean and admit that the “demanding IO application” is actually the IO benchmark tool, &lt;a href=&quot;http://freecode.com/projects/fio&quot;&gt;fio&lt;/a&gt;, but that doesn’t change the results. What you are looking at, in each of those events, is a fixed-field format like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;major,minor cpu sequence timestamp pid action rwbs offset + size [process_name]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is nothing more than a stream of events - “this thing happened at this time”. The first one means “At 28.774 seconds, a read (R) was issued to the driver (D)”. The second one means “At 29.92 seconds, a read (R) was completed (C)”. These are just two example events among many. &lt;em&gt;blktrace&lt;/em&gt; writes down a large number of event types, so you’ll end up with multiple events for each IO. Events include when the IO is queued (Q), merges (M), &lt;a href=&quot;http://lwn.net/Articles/438256/&quot;&gt;plugging&lt;/a&gt; (P), unplugging (U), and others. Let’s look at a second example, a single traced direct read from this device:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dd if=/dev/sdf bs=1k of=/dev/null count=1 iflag=direct
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That should be simple, right? It turns out that the Linux block IO layer is actually doing a bunch of work here:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;8,32   3        1     0.000000000  2208  Q   R 0 + 2 [dd]
8,32   3        2     0.000002113  2208  G   R 0 + 2 [dd]
8,32   3        3     0.000002891  2208  P   N [dd]
8,32   3        4     0.000004193  2208  I   R 0 + 2 [dd]
8,32   3        5     0.000004802  2208  U   N [dd] 1
8,32   3        6     0.000005487  2208  D   R 0 + 2 [dd]
8,32   0        1     0.000744872     0  C   R 0 + 2 [0]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the IO is queued (Q), a request struct is allocated (G), the queue is &lt;a href=&quot;http://lwn.net/Articles/438256/&quot;&gt;plugged&lt;/a&gt; (P), the IO is scheduled (I), the queue is unplugged (U), the IO is dispatched to the device (D), and the IO is completed (C). All of that took only 744us, which makes me suspect that it was served out of cache by the device. That’s a really simple example. Once merging and read ahead behaviors come into play, the traces can be difficult to understand. There’s still a really big gap between having this IO trace, and being able to say something about the performance of the drive. If you’re anything like me, you’re considering the possibilities of writing a tool to parse these traces and come up with aggregate statistics about the whole trace. Luckily, one has already been written: &lt;a href=&quot;http://www.cse.unsw.edu.au/~aaronc/iosched/doc/btt.html&quot;&gt;btt&lt;/a&gt;.&lt;/p&gt;
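&lt;p&gt;Before turning to &lt;em&gt;btt&lt;/em&gt;, it’s worth seeing how little code a basic parse takes. This sketch (mine, not part of the blktrace tools) pulls the fields out of an event line with sscanf, which is enough to compute the D2C latency of the fio example above by hand:&lt;/p&gt;

```c
#include <stdio.h>

/* Parse the fixed-field blkparse event format:
 *   major,minor cpu sequence timestamp pid action rwbs offset + size [process]
 * A sketch, not a full blkparse parser -- some event types (plugs, for
 * example) carry different payloads. */

struct event {
    int major, minor, cpu, pid;
    long seq, offset, size;
    double ts;      /* seconds since trace start */
    char action;    /* Q, G, I, D, C, ... */
    char rwbs[8];   /* read/write/sync flags */
};

static int parse_event(const char *line, struct event *e) {
    return sscanf(line, "%d,%d %d %ld %lf %d %c %7s %ld + %ld",
                  &e->major, &e->minor, &e->cpu, &e->seq, &e->ts,
                  &e->pid, &e->action, e->rwbs,
                  &e->offset, &e->size) == 10;
}

/* Device latency (D2C) is just completion timestamp minus dispatch
 * timestamp for a matching pair of events. */
static double d2c(const struct event *d, const struct event *c) {
    return c->ts - d->ts;
}
```

Run on the two fio events above, this reports a D2C of about 1.15 seconds for that IO.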

&lt;p&gt;Passing our trace through &lt;em&gt;btt&lt;/em&gt; gives us a whole lot of output, but the really interesting stuff (in this case) is in the first section. In fact, two lines tell us a huge amount about this disk:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
D2C               0.000580332   0.010877224   1.152828442        2744
Q2C               0.000584308   0.010880923   1.152832326        2744
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, Q2C is the total IO time (time from being queued to being completed, just like in the example above), and D2C is the IO time spent in the device. The values are in seconds, and N is the number of IOs observed. As was obvious from the &lt;em&gt;iostat&lt;/em&gt; output, the queue time isn’t very high, so most of what is going on with performance is the device (D2C) time. Here, that’s shown by the relatively small difference between the D2C and Q2C lines. The minimum IO time, 584us, is very short. Those IOs must be served from cache somewhere. The mean time, 10.8ms, is slightly high for what we would expect from this drive (its sibling averaged just over 8ms), but isn’t crazy. The maximum, at 1.15s, clearly shows that there’s something amiss about this drive. Our healthy drive’s maximum D2C over the same test was only 160ms.&lt;/p&gt;
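&lt;p&gt;Put another way, time spent queued in the kernel is just Q2C minus D2C. Plugging the averages from the table above into that subtraction (a trivial calculation, but worth making explicit):&lt;/p&gt;

```c
/* Queue time is total IO time minus device time. With the btt averages
 * above, Q2C - D2C = 0.010880923 - 0.010877224, about 3.7 microseconds:
 * the device accounts for essentially all of this disk's latency. */
static double queue_time(double q2c, double d2c) {
    return q2c - d2c;
}
```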

&lt;p&gt;If you want even more IO latency detail, &lt;em&gt;btt&lt;/em&gt; is capable of exporting nearly all the statistics it calculates in raw form. For example, the -l flag outputs all the samples of D2C latencies. Combined with the plotting capabilities of &lt;a href=&quot;http://www.r-project.org/&quot;&gt;R&lt;/a&gt; or &lt;a href=&quot;http://matplotlib.org/&quot;&gt;matplotlib&lt;/a&gt;, you can quickly make graphs of the finest details of a system’s IO performance. Two lines of R gave me this IO latency histogram:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/io_latency_hist.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Another useful capability of &lt;em&gt;btt&lt;/em&gt; is extracting the offsets of IOs (with the &lt;em&gt;-B&lt;/em&gt; flag). The offsets can then be plotted, showing an amazing amount of detail about the current IO workload. In this example, I did an &lt;em&gt;apt-get update&lt;/em&gt; then &lt;em&gt;apt-get install libssl1.0.0&lt;/em&gt; on my aging Ubuntu desktop. The whole thing took about 90s, and issued about 7100 IO operations. Here’s what the IO pattern looked like:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/aptget-io-pattern.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That’s an incredible set of capabilities. Creating a plot of the IO latency histogram of a running system is tremendously powerful, as is graphing access patterns over time. It’s also just scratching the surface of the uses of the &lt;em&gt;blktrace&lt;/em&gt; family. Other capabilities include counting seeks (though that’s getting less interesting as SSDs take over the world), understanding merge behavior, and analyzing the overhead of various parts of the Linux IO subsystem. This is a set of tools that should be more widely known and used.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some Patterns of Engineering Design Meetings</title>
      <link>http://brooker.co.za/blog/2013/05/25/patterns-of-design.html</link>
      <pubDate>Sat, 25 May 2013 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2013/05/25/patterns-of-design</guid>
      <description>&lt;h1 id=&quot;some-patterns-of-engineering-design-meetings&quot;&gt;Some Patterns of Engineering Design Meetings&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;On discussing designs.&lt;/p&gt;

&lt;p&gt;I spend a lot of time in engineering design discussions and meetings, talking about how we are going to introduce new features, solve problems, and increase capacity. For me, a good design meeting is the highlight of any day - sharing ideas, comparing options and considering alternatives can be extremely rewarding. Good design meetings, unfortunately, aren’t the rule. Frequently, meetings don’t go anywhere. Earlier in my career, I thought that there was a direct correlation between the complexity of the problem to be solved and the productivity of the design meetings. It felt natural that solving harder problems would be more difficult.&lt;/p&gt;

&lt;p&gt;Recently, though, I’ve been thinking about some of the things that go right and wrong in design meetings, and have come to the conclusion that there is very little correlation between problem complexity and meeting productivity. Instead, the problem typically seems to be that we are starting in the wrong place. To explain what I mean, let me start with the simplest design process in a single-person isolated world.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Identify the current state of the system, &lt;em&gt;where you are&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Identify the goal state of the system, &lt;em&gt;where you are going&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Decide how you are going to get from the current state of the system to the goal state, &lt;em&gt;how do you get there&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/design_base.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Naively, it appears as though a group design meeting should focus on step three (&lt;em&gt;how do we get there?&lt;/em&gt;). The group considers a variety of different paths from &lt;em&gt;where we are&lt;/em&gt; to &lt;em&gt;where we are going&lt;/em&gt;, compares their merits, and chooses one of the alternatives. The focus is always on the design, which is what most engineers find most interesting.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/design_two.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, it is this naive belief - the belief that design discussions should focus on design - that derails most design discussions. The most common manifestation of this problem is when different stakeholders enter a design meeting with different goals in mind. Imagine trying to discuss driving directions with somebody before agreeing on where you are going. “No”, you’ll say, “we have to turn &lt;em&gt;left&lt;/em&gt; on 5th Avenue”. “Absolutely not! &lt;em&gt;Right&lt;/em&gt; on 5th!”. “Left!”. “Right!”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/design_goal.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There are three forms of this problem that I have observed. One is the simple disagreement, where participants don’t have the same goals in mind. All participants have concrete goals, but they aren’t the same. The second form is when one or more of the participants haven’t really thought about where they want to go. The destination isn’t concrete in their head, so they can’t meaningfully contribute to the discussion about the route. The third is a mismatch of models. In this case, the participants may agree on the goal, but don’t see it the same way. I want to spend time outdoors, and my colleague wants to play a game of skill. We’d both be happy with a round of golf, but often we can’t see that because we don’t realize that we are looking at the problem from different angles.&lt;/p&gt;

&lt;p&gt;Agreeing on goals can take some time, but it’s almost always time well invested. It’s much more wasteful to start discussing the design before you agree on goals. In that case, it becomes very unlikely that the discussion will find consensus, and even less likely that the consensus will have any real value. I’ve found that starting with goals and requirements is critical to a successful design meeting. This isn’t an argument for being inflexible - goals should be allowed to evolve if the discussion of the solution shows them to be incorrect or inadequate - but an argument for explicit consensus.&lt;/p&gt;

&lt;p&gt;The other end of the design arrow is just as important to agree on. Before you can meaningfully discuss directions, you need to agree not only on where you are going, but also on where you are now. At first glance, it appears obvious that everybody should agree on where we are now. The current state of the world appears to be a simple and concrete thing.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/design_current.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately that’s not the case. At best, everybody brings a different perspective on the current state of the world to the discussion. Different experience, different levels of exposure to various pieces, and different job roles all contribute to different perspectives. Beyond this, however, is a deeper problem with common understanding. Every system designed by humans is designed to fit not into the real world, but to fit into a mental model of the real world. We don’t have the capability to include all possible factors in a design, so we build a simplified mental model to fit the design into. Models necessarily differ between people. The most common differences between mental models are disagreements about model fidelity (how important a factor needs to be before it is included in the model) and about factor importance (how important any given factor is).&lt;/p&gt;

&lt;p&gt;I don’t believe that it is possible, or even desirable, to completely match the models of different participants in design meetings. On the other hand, it is critical to spend some time understanding the differences between these models. They will always differ in detail, but they need to have the same general shape for productive discussion.&lt;/p&gt;

&lt;p&gt;So far, I’ve mostly written about these patterns of disagreement in the context of meeting productivity. The much more serious issue is meeting outcomes. When we discuss design without agreeing on the current and goal states of the system, we run a high risk of making very bad design decisions. The outcome of a good design meeting is often a compromise, mixing various aspects of each proposal or idea into a coherent whole.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/design_compromise.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The outcome of a bad design meeting is a false compromise, where various aspects of each proposal are mixed to make a design that doesn’t match any model of the current world, and doesn’t achieve anybody’s goals. This is the most common cause of bad design I have seen in my career: mismatched goals and mismatched models leading to a non-solution. The best way, in my opinion, to avoid this mistake is to make the steps of finding consensus on &lt;em&gt;where we are&lt;/em&gt; and &lt;em&gt;where we are going&lt;/em&gt; explicit and upfront, and not to be ashamed to loop back to them during the discussion of &lt;em&gt;how are we going to get there?&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/design_bad_compromise.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Exploring TLA+ with two-phase commit</title>
      <link>http://brooker.co.za/blog/2013/01/20/two-phase.html</link>
      <pubDate>Sun, 20 Jan 2013 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2013/01/20/two-phase</guid>
      <description>&lt;h1 id=&quot;exploring-tla-with-two-phase-commit&quot;&gt;Exploring TLA+ with two-phase commit&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Using testable pseudocode to test a distributed algorithm&lt;/p&gt;

&lt;p&gt;There are very few distributed algorithms more widely known by working programmers than the &lt;a href=&quot;http://en.wikipedia.org/wiki/Two-phase_commit_protocol&quot;&gt;two-phase commit&lt;/a&gt; atomic commit protocol. It’s a great algorithm to use for teaching purposes: two-phase commit is extremely simple to write down, yet has significant caveats. Some of these shortcomings are obvious, and easily noticed by most students, and some are much more subtle. At a high level, two-phase commit is an atomic commit protocol: it ensures that changes across multiple database systems are either applied to all the systems or to none of them. Assuming a serial stream of transactions, two-phase commit ensures atomicity - the transaction is either fully applied or not applied at all.&lt;/p&gt;

&lt;p&gt;A single coordinator (let’s call her Alice) runs a group of fried chicken restaurants, and wants each restaurant manager (the literature calls them &lt;em&gt;cohorts&lt;/em&gt;, let’s call them Bob and Chuck) to paint their green restaurant blue. Alice really cares that her customers get a consistent fried chicken experience, so wants to make sure that all the managers do the work or none of them do it.  If Alice simply asked Bob to do the work, then asked Chuck, she’d be in trouble. If Bob went ahead and did the work, but Chuck couldn’t (say he didn’t have enough paint), Alice would need to ask Bob to undo his work. If Bob was then out of green paint, Alice would be stuck with inconsistent restaurant colors. In Alice’s world, that’s a catastrophe.&lt;/p&gt;

&lt;p&gt;Instead, Alice uses two-phase commit. First, she calls Bob and Chuck and asks them to check if they can repaint today. When both acknowledge they can, Alice calls them and asks them to go ahead. For this to work, she doesn’t have to get both of them on the same conference call. She just needs to call them one after the other. Alice also needs to be sure that Bob and Chuck won’t lie to her about being able to do the work, and that Bob and Chuck will keep answering their phones. If Bob leaves work early after he’s acknowledged that he can do the work, but before he does it, Chuck will be left with the cans open and ladders up, and Alice won’t be sure if Bob did the painting or not. She doesn’t know what to tell Chuck.&lt;/p&gt;
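&lt;p&gt;Before formalizing anything, the happy-path version of Alice’s two phone calls can be sketched in plain Python. This is only an illustration - the &lt;em&gt;Manager&lt;/em&gt; class and its methods are hypothetical stand-ins, not code from anywhere real:&lt;/p&gt;

```python
class Manager:
    """A restaurant manager who can vote on, and then do, the repaint."""
    def __init__(self, name, has_paint):
        self.name, self.has_paint, self.color = name, has_paint, "green"

    def can_paint(self):
        # Phase 1: answer the "can you repaint today?" call.
        return self.has_paint

    def paint(self):
        # Phase 2: act on the "go ahead" call.
        self.color = "blue"

def two_phase_commit(managers):
    # Phase 1: call everybody, one after the other, and collect votes.
    decision = all(m.can_paint() for m in managers)
    # Phase 2: everybody commits, or nobody does.
    if decision:
        for m in managers:
            m.paint()
    return decision

crew = [Manager("bob", True), Manager("chuck", False)]
print(two_phase_commit(crew))           # False: chuck is out of paint
print({m.name: m.color for m in crew})  # both restaurants stay green
```

&lt;p&gt;Of course, this sketch quietly assumes exactly what the rest of this post questions: that nobody stops answering the phone between the two phases.&lt;/p&gt;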

&lt;p&gt;Even for such a simple protocol, two-phase commit has some subtle downsides and the distributed nature of the algorithm makes it exceptionally hard to reason about in prose. We could make little dolls of Alice, Bob and Chuck and act out every possible scenario, but that would take a really long time. Even if we managed to do that (and not screw up), we’d need to start the whole exercise again if Alice opened a third chicken frying location. What if we could have a computer do that checking for us? What if we could write down the protocol clearly and precisely, then write down everything we need to make sure is true, then have a computer run through every possible scenario and tell us if it works. That would be good, right?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Leslie_Lamport&quot;&gt;Leslie Lamport&lt;/a&gt;’s &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html&quot;&gt;TLA+ tools&lt;/a&gt; allow us to do exactly that - write pseudocode implementations of complex algorithms, and ask the computer to exhaustively check them. Going through every possible path in a code base is a painstaking and time consuming process without any creativity required - the exact kind of problem that computers excel at. Let’s see how we can use TLA+ to ask the computer to solve Alice’s problem. I’ve used the &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/pluscal.html&quot;&gt;PlusCal&lt;/a&gt; algorithm language here, because I find it much easier to write and understand than raw TLA+. First, let’s define some things about the world:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;variables
    managers = { &quot;bob&quot;, &quot;chuck&quot;, &quot;dave&quot; };
    restaurant_stage = [ i \in managers |-&amp;gt; &quot;start&quot; ];   
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, we’re telling PlusCal that there are three managers (Bob, Chuck and Dave), and creating an array of states (one per restaurant) with the initial state of each set to “start”. Next, we need to define how each manager behaves:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;process (Restaurant \in managers) {
    c: await restaurant_stage[self] = &quot;propose&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each manager waits for a call from Alice, proposing that they repaint their restaurant. They’ll be happy to wait forever in this stage, patiently staring at the phone while their employees cut, spice, fry and sell chicken after chicken.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    either {
        restaurant_stage[self] := &quot;accept&quot;;
    } or {
        restaurant_stage[self] := &quot;refuse&quot;;
    };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the next stage, the managers are allowed to do one of two things - either accept the work that’s been given to them, or refuse to do the work. Using &lt;em&gt;either&lt;/em&gt; tells PlusCal that we can go down either of these paths non-deterministically.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    c1: await (restaurant_stage[self] = &quot;commit&quot;) 
    	  \/ (restaurant_stage[self] = &quot;abort&quot;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;They then wait for the next call from Alice, giving them the go ahead to paint, or telling them to put away the ladders.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    if (restaurant_stage[self] = &quot;commit&quot;) {
        restaurant_stage[self] := &quot;committed&quot;;
    } else {
        restaurant_stage[self] := &quot;aborted&quot;;
    }
  }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, they act on Alice’s orders - either painting or aborting. Next, we have to specify how Alice behaves. To simplify that code substantially, we can use PlusCal’s handy macro feature:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;macro SetAll(state, k) {
    while (k # {}) {
        with (p \in k) {
           restaurant_stage[p] := state;
           k := k \ {p};
       };
    };
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This macro loops over every restaurant (in non-deterministic order), and sends them a message. Let’s use it to define Alice’s behavior:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;process (Controller = &quot;alice&quot;) 
variable k, aborted = FALSE;
{
    n: k := managers;        
    n2: SetAll(&quot;propose&quot;, k);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First up, create the process and define the local variables. Then, send a message to each manager proposing the change.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    k := managers;
    n3: while (k # {}) {
            with (p \in k) {
                await (restaurant_stage[p] = &quot;accept&quot;) 
	    	  \/ (restaurant_stage[p] = &quot;refuse&quot;);
                if (restaurant_stage[p] = &quot;refuse&quot;) {
                    aborted := TRUE;
                };
                k := k \ {p};
            };
       };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Wait for each manager to return the call (checking in non-deterministic order), and write down whether anybody wants to abort the operation.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    k := managers;
    if (aborted = TRUE) {
        n6: SetAll(&quot;abort&quot;, k);
    } else {
        n4: SetAll(&quot;commit&quot;, k);
   }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If all the managers were happy to continue, then tell everybody to continue. That’s the end of the specification of Alice’s behavior, and the end of our PlusCal program. Writing down the program like this is valuable already. The precision of the PlusCal language, and the way it ignores many of the other challenges that would complicate real code, forces you to think clearly and completely about the behavior of each player. Programmers are all aware that fuzzy thinking doesn’t last long when you have to translate it to code, and this is even more true of PlusCal. Just the act of writing the program this way is valuable. In terms of value, though, we’re only just getting started.&lt;/p&gt;

&lt;p&gt;TLA+ includes a &lt;em&gt;model checker&lt;/em&gt; called TLC. In short, it runs through every possible path of the code and checks some invariants at each stage. Remember all of those &lt;em&gt;non-deterministic&lt;/em&gt; steps in the code? When it hits those, it takes all possible paths. To make TLC useful, we need to tell it what it should check, both invariants (things that are true in every state) and properties (things that must become true). The simplest check is one that PlusCal generates itself:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Termination == &amp;lt;&amp;gt;(\A self \in ProcSet: pc[self] = &quot;Done&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the TLA+ language, this means “for all &lt;em&gt;self&lt;/em&gt; in the set of processes (alice, bob, chuck and dave), check that the program counter eventually reaches &lt;em&gt;Done&lt;/em&gt;”. The &lt;em&gt;Done&lt;/em&gt; state is a magic state that means the code has fallen off the end of our process. This is a valuable thing to check, because it makes sure that all the processes run the entire algorithm. Next, we define an invariant:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;StateOK == /\ (\A i \in managers: restaurant_stage[i] \in {&quot;start&quot;, &quot;propose&quot;,
	        &quot;accept&quot;, &quot;commit&quot;, &quot;abort&quot;, &quot;committed&quot;, &quot;aborted&quot;, &quot;refuse&quot;})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This simply makes sure that &lt;em&gt;restaurant_stage&lt;/em&gt;, the variable we have used to simulate the telephone, never goes off into a state we don’t know about. Then, we want to check if all the restaurants either get painted or don’t:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Committed == /\ \/ &amp;lt;&amp;gt;(\A i \in managers: restaurant_stage[i] = &quot;committed&quot;)
                \/ &amp;lt;&amp;gt;(\A i \in managers: restaurant_stage[i] = &quot;aborted&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
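&lt;p&gt;The flavor of what TLC does - enumerate every possibility, check a property in each - can be mimicked for a drastically simplified model in a few lines of Python. This toy enumerates only the votes, not the interleavings of calls, so it explores far fewer states than TLC does; the &lt;em&gt;ignore_bob&lt;/em&gt; flag is my own addition, anticipating the kind of coordinator bug introduced below:&lt;/p&gt;

```python
from itertools import product

managers = ["bob", "chuck", "dave"]

def decide(votes, ignore_bob=False):
    # Coordinator rule: commit only if every considered vote is "accept".
    # ignore_bob=True drops a "refuse" from bob, mimicking a coordinator
    # bug such as a poorly handled timeout.
    considered = [v for m, v in zip(managers, votes)
                  if not (ignore_bob and m == "bob" and v == "refuse")]
    verdict = "committed" if all(v == "accept" for v in considered) else "aborted"
    return [verdict] * len(managers)

def violations(ignore_bob=False):
    # Exhaustively enumerate every combination of votes and check the
    # invariant: if anybody refused, everybody must end up aborted.
    bad = []
    for votes in product(["accept", "refuse"], repeat=len(managers)):
        if "refuse" in votes and decide(votes, ignore_bob) != ["aborted"] * 3:
            bad.append(votes)
    return bad

print(len(violations()))                 # 0: the correct rule is atomic
print(len(violations(ignore_bob=True)))  # 1: bob refuses, the rest commit
```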

&lt;p&gt;Running the code through the handy TLC model checker will check that all of these things are true. Even for this little program, TLC found 718 states the program can be in, 296 of them unique. If Alice opened another two restaurants, these numbers would increase to 21488 states, and 5480 unique states. Long before Alice runs a multinational chicken empire, we’d have no chance of enumerating all these states by hand - let alone doing it correctly. To further illustrate the value of TLA+, let’s introduce a subtle bug into the system, one that allows Alice to ignore a refuse message from Bob (in the real world, this could be a poorly handled timeout). Replace this line:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if (restaurant_stage[p] = &quot;refuse&quot;) {
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;with this one:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if ((restaurant_stage[p] = &quot;refuse&quot;) /\ (p # &quot;bob&quot;)) {
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That change lets Alice ignore the ‘refuse’ message from Bob. Running the model checker TLC again reveals something odd: the model still passes all of our checks, when it shouldn’t. We need to tell TLC to check one other invariant: that everybody aborts when somebody asks to. There are several ways to do this. One is to add an explicit invariant; another is to use the &lt;em&gt;assert&lt;/em&gt; functionality in PlusCal. We can track whether each restaurant asked for an abort like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;or {
  restaurant_stage[self] := &quot;refuse&quot;;
  refused := TRUE;
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then assert that when we are asked to commit:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if (restaurant_stage[self] = &quot;commit&quot;) {
  assert(refused = FALSE);
  restaurant_stage[self] := &quot;committed&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running TLC again reveals that the program is broken, and that reverting Alice’s behavior to always care about Bob’s opinion fixes the issue. This reveals the most enlightening thing I’ve found about playing with TLA+. It’s extremely easy to write a TLA+ specification and a set of invariants that work. What’s much harder is coming up with a set of invariants that cover all the cases we actually care about, and making sure that modifications to the specification break those invariants. This is a great lesson about writing unit tests too - you have to be very honest to avoid studying to the test if you write both the code and the tests.&lt;/p&gt;

&lt;p&gt;In a larger sense, it would be really cool to have a tool that does for TLA+ what &lt;a href=&quot;http://jester.sourceforge.net/&quot;&gt;Jester&lt;/a&gt; does for Java: make random modifications to the specification and show cases where the invariants are not violated. This would be very interesting for building quality invariants, but also for automated exploration of the space of algorithms which meet a given set of invariants.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>C++11's atomic and volatile, under the hood on x86</title>
      <link>http://brooker.co.za/blog/2013/01/06/volatile.html</link>
      <pubDate>Sun, 06 Jan 2013 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2013/01/06/volatile</guid>
      <description>&lt;h1 id=&quot;c11s-atomic-and-volatile-under-the-hood-on-x86&quot;&gt;C++11’s atomic and volatile, under the hood on x86&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;How do C++11&apos;s atomic and volatile work?&lt;/p&gt;

&lt;p&gt;In my previous post &lt;a href=&quot;http://brooker.co.za/blog/2012/11/13/increment.html&quot;&gt;Java’s Atomic and volatile, under the hood on x86&lt;/a&gt; I looked at Atomic and volatile in Java, and how they affect the generated assembly. In this post, I’m looking at &lt;a href=&quot;http://en.cppreference.com/w/cpp/atomic/atomic&quot;&gt;std::atomic&lt;/a&gt; and &lt;em&gt;volatile&lt;/em&gt; in C++.&lt;/p&gt;

&lt;p&gt;Like in Java, it’s &lt;a href=&quot;http://stackoverflow.com/questions/8819095/concurrency-atomic-and-volatile-in-c11-memory-model&quot;&gt;well known&lt;/a&gt; that &lt;em&gt;std::atomic&lt;/em&gt; and &lt;em&gt;volatile&lt;/em&gt; have different meanings in C++, but it’s still interesting to take a look at how that translates to what actually gets run. Let’s start with a very simple program:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for (int i = 0; i &amp;lt; 500000000; i++) {
x += 0x3;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then define &lt;em&gt;x&lt;/em&gt; in one of three ways:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;long x;
volatile long x;
std::atomic_long x;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before digging directly into the assembly, we can compare the run-time of the three programs (on a Core2 Q6600 compiled with gcc4.6 -O2):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;long&lt;/em&gt; took 0.0018s&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;volatile&lt;/em&gt; took 1.9s&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;atomic_long&lt;/em&gt; took 8.5s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s clear from the difference in run times that these three programs do produce significantly different code. The step up from &lt;em&gt;long&lt;/em&gt; to &lt;em&gt;volatile long&lt;/em&gt; is roughly 1000x, and it’s another 4x up to &lt;em&gt;atomic_long&lt;/em&gt;. Starting with the assembly for the &lt;em&gt;long&lt;/em&gt; version, we can see why it’s so fast:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;addq    $1500000000, %rsi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Oh gcc, you’re sneaky. The compiler has completely discarded the loop, and folded the entire calculation into a single constant (500000000 * 3 = 1500000000). Without the guarantees of &lt;em&gt;atomic&lt;/em&gt; or &lt;em&gt;volatile&lt;/em&gt;, it’s free to make optimizations like this. Next, the volatile version:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  movl    $500000000, %eax
.L2:
  movq    x(%rip), %rdx
  addq    $3, %rdx
  subl    $1, %eax
  movq    %rdx, x(%rip)
  jne     .L2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The inclusion of volatile has forced the compiler to not only run the loop, but also load and store the variable from memory on every iteration (the two &lt;tt&gt;movq&lt;/tt&gt; instructions). The overhead of this is clearly significant, but it’s hard to separate the effects of the two. Moving the load and store out of the loop breaks the &lt;em&gt;volatile&lt;/em&gt; guarantee, but keeps the loop:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  movl	$500000000, %eax
  movq	x(%rip), %rdx
.L2:
  addq	$3, %rdx
  subl	$1, %eax
  jne	.L2
  movq	%rdx, x(%rip)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That version takes about 0.3s to run, so it’s clear that &lt;a href=&quot;http://norvig.com/21-days.html#answers&quot;&gt;memory access time&lt;/a&gt; still dominates the run time of the program. It’s also somewhat interesting that gcc has chosen to load, modify and store instead of doing the add directly to memory. We can modify the original &lt;em&gt;volatile&lt;/em&gt; version to do that:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  movl	$500000000, %eax
.L2:
  addq	$3, x(%rip)
  subl	$1, %eax
  jne	.L2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s much less code, but it’s no faster. In fact, taking the average of a very large number of runs shows that it’s about 1% slower on my hardware. &lt;em&gt;perf stat&lt;/em&gt; tells the story for the unmodified load-modify-store code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;3,006,477,937 cycles
2,504,227,487 instructions #    0.83  insns per cycle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and for the modified code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;3,035,431,555 cycles
1,504,575,146 instructions #    0.50  insns per cycle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The modified code saves two fifths of the instructions, but drops the instructions per cycle by slightly more. After that diversion, on to the &lt;em&gt;atomic_long&lt;/em&gt; version of our original program:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   movl    $500000000, %eax
.L2:
   lock addq       $3, x(%rip)
   subl    $1, %eax
   jne     .L2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;GCC is generating smarter code than &lt;a href=&quot;http://brooker.co.za/blog/2012/11/13/increment.html&quot;&gt;Java does with volatile&lt;/a&gt;, but uses a similar basic approach. It uses the &lt;em&gt;lock&lt;/em&gt; instruction prefix, which makes the read-modify-write atomic and acts as a memory barrier. In this case, gcc applies the prefix to the same instruction that does the addition, which makes the code much simpler (and much faster on some machines). The performance impact of that lock prefix, compared to the nearly identical code above without it, is quite clear. &lt;em&gt;perf stat&lt;/em&gt; says:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;13,522,633,468 cycles
1,513,530,642 instructions #    0.11  insns per cycle
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s 0.11 instructions per cycle compared to 0.5 for the non-locked version above. Comparing these two pieces of code makes the difference in intent between &lt;em&gt;atomic&lt;/em&gt; and &lt;em&gt;volatile&lt;/em&gt; quite clear. Even ignoring memory ordering issues, the &lt;em&gt;volatile&lt;/em&gt; code can lead to the incorrect answer with concurrent access. Take another look at that code:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  movl    $500000000, %eax
.L2:
  movq    x(%rip), %rdx
  addq    $3, %rdx
  subl    $1, %eax
  movq    %rdx, x(%rip)
  jne     .L2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If another writer wrote to the location of &lt;em&gt;x&lt;/em&gt; between the first and second &lt;tt&gt;movq&lt;/tt&gt; instructions, their update would be completely lost. Clearly, volatile doesn’t imply atomic. The picture gets even worse on a multi-processor machine, where the lack of any barriers means that the results are highly unlikely to be correct. Those of us who move between C++ and Java need to be very clear about the difference in meaning between C++ volatile and Java volatile. It’s really unfortunate that the designers of Java didn’t choose a different keyword.&lt;/p&gt;
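&lt;p&gt;To make the lost update concrete, that interleaving can be acted out step by step. This is a single-threaded Python simulation of the race, not real concurrent code, and the starting value is arbitrary:&lt;/p&gt;

```python
x = 5            # the shared variable, starting at an arbitrary value

r1 = x           # writer 1: first movq - load x into a register (reads 5)
x = x + 3        # a second writer runs its whole load/add/store: x is now 8
x = r1 + 3       # writer 1: addq, then second movq - stores 5 + 3 = 8

print(x)         # 8, not 11: the second writer's increment is lost
```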
</description>
    </item>
    
    <item>
      <title>Java's Atomic and volatile, under the hood on x86</title>
      <link>http://brooker.co.za/blog/2012/11/13/increment.html</link>
      <pubDate>Tue, 13 Nov 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/11/13/increment</guid>
      <description>&lt;h1 id=&quot;javas-atomic-and-volatile-under-the-hood-on-x86&quot;&gt;Java’s Atomic and volatile, under the hood on x86&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;How exactly do AtomicInteger and volatile do their magic in Java?&lt;/p&gt;

&lt;p&gt;It’s well known that &lt;em&gt;volatile&lt;/em&gt; in Java doesn’t mean the same thing as &lt;em&gt;atomic&lt;/em&gt;. As &lt;a href=&quot;http://jeremymanson.blogspot.com/2007/08/volatile-does-not-mean-atomic.html&quot;&gt;Jeremy Manson says&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you do an increment of a volatile integer, you are actually performing three separate operations:
1) Read the integer to a local.
2) Increment the local
3) Write the integer back out to the volatile field.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the other hand, the Java memory model is known to offer looser guarantees than some real hardware implementations. Most modern desktops (and most current servers) are powered by x86-family processors, which have fairly predictable memory behavior, especially &lt;a href=&quot;http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu&quot;&gt;compared to some other processor families&lt;/a&gt;. Despite these stronger guarantees, &lt;em&gt;volatile int&lt;/em&gt; doesn’t behave like AtomicInteger, even on x86 and even for a very simple operation like counting.&lt;/p&gt;

&lt;p&gt;Why not?&lt;/p&gt;

&lt;p&gt;To understand what is going on under the hood, let’s start with a very simple piece of code, which does only three things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Launches M threads&lt;/li&gt;
  &lt;li&gt;Loops a large number of times (N) in each thread, incrementing a shared variable.&lt;/li&gt;
  &lt;li&gt;Joins the threads.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The threads run in parallel, sharing the value of the variable. For it to end up with the right value at the end (M*N), two things need to be true. First, changes made by one thread must be immediately &lt;em&gt;visible&lt;/em&gt; to the other threads. Second, changes made to the variable must be &lt;em&gt;atomic&lt;/em&gt; - each thread must perform the load, increment and store as one effective operation. Visibility is not enough, because it allows something like this to happen:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Thread 1 loads 5&lt;/li&gt;
  &lt;li&gt;Thread 2 loads 5&lt;/li&gt;
  &lt;li&gt;Thread 1 increments its private copy to 6.&lt;/li&gt;
  &lt;li&gt;Thread 2 increments its private copy to 6.&lt;/li&gt;
  &lt;li&gt;Thread 2 stores 6&lt;/li&gt;
  &lt;li&gt;Thread 1 stores 6&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if changes are immediately visible, increments done by one thread can be overwritten and lost by others. To see the effects of this, we can start with a version of the program with a non-&lt;em&gt;volatile&lt;/em&gt; shared variable, which offers neither &lt;em&gt;visibility&lt;/em&gt; nor &lt;em&gt;atomicity&lt;/em&gt;. On my system, based on a four-core Intel Core2 Q6600 CPU, with 3 threads each counting to 5 million the result is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Final value 7147559, expected 15000000, time 0.05ms&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That worked rather poorly, but was nice and quick. Adding &lt;em&gt;volatile&lt;/em&gt; adds visibility, but not atomicity. With the same parameters:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Final value 5191650, expected 15000000, time 2286ms&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this particular sample, it’s actually worse, and much, much slower. It’s worth noting that both of these versions return some seemingly random value between 5 million and 15 million, mostly clustered near 5 million. Obviously &lt;a href=&quot;http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicInteger.html&quot;&gt;AtomicInteger&lt;/a&gt;, which guarantees both &lt;em&gt;atomicity&lt;/em&gt; and &lt;em&gt;visibility&lt;/em&gt;, will solve the problem:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Final value 15000000, expected 15000000, time 3041ms&lt;/p&gt;
&lt;/blockquote&gt;
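&lt;p&gt;The change needed for the Atomic version is small: the shared field becomes an &lt;em&gt;AtomicInteger&lt;/em&gt; and the increment becomes &lt;em&gt;incrementAndGet&lt;/em&gt;. A sketch, with illustrative names:&lt;/p&gt;

```java
// Sketch only: same shape as the plain-int counter, but
// AtomicInteger.incrementAndGet() makes each increment an atomic
// read-modify-write, so no updates are lost.
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounter {
    static final AtomicInteger counter = new AtomicInteger();

    static int run(int m, int n) {
        counter.set(0);
        Thread[] threads = new Thread[m];
        for (int i = 0; i < m; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < n; j++) {
                    counter.incrementAndGet(); // atomic and visible
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.get();
    }
}
```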

&lt;p&gt;The Atomic version is correct. It’s also slower, but not by a huge margin. Clearly these three versions are leading to very different CPU behavior, so let’s turn to &lt;a href=&quot;https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly&quot;&gt;-XX:+PrintAssembly&lt;/a&gt; to see if we can figure out what’s going on. First, the increment code from the non-&lt;em&gt;volatile&lt;/em&gt; version:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;add    $0x10,%ecx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice how the increment has turned into an add of 16. The compiler has optimized away much of the looping, and is doing 1/16th as many passes through the actual loop as we specified. I’m somewhat surprised that it even does that amount of work. On to the &lt;em&gt;volatile&lt;/em&gt; version. Here, we can see the three separate steps:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mov    0xc(%r10),%r8d ; Load
inc    %r8d           ; Increment
mov    %r8d,0xc(%r10) ; Store
lock addl $0x0,(%rsp) ; StoreLoad Barrier
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I wrote about the StoreLoad barrier in &lt;a href=&quot;http://brooker.co.za/blog/2012/09/10/volatile.html&quot;&gt;my previous post on Java’s volatile&lt;/a&gt;. Its exact semantics &lt;a href=&quot;http://preshing.com/20120710/memory-barriers-are-like-source-control-operations&quot;&gt;can be subtle&lt;/a&gt;, but the quick version is that it does two things: it makes every store before the &lt;em&gt;lock addl&lt;/em&gt; visible to other processors, and ensures that every load after the &lt;em&gt;lock addl&lt;/em&gt; gets at least the version visible at the time it is executed. In this case, volatile gives &lt;em&gt;visibility&lt;/em&gt;, in that each of the processors immediately gets the version from the other processors after each increment. What it doesn’t give is &lt;em&gt;atomicity&lt;/em&gt;. Any stores that happen on other processors between the load and the store are lost, as in the example above. &lt;a href=&quot;http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicInteger.html&quot;&gt;AtomicInteger&lt;/a&gt; fixes this problem by seeking the processor’s help to be truly atomic.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mov    0xc(%r11),%eax       ; Load
mov    %eax,%r8d            
inc    %r8d                 ; Increment
lock cmpxchg %r8d,0xc(%r11) ; Compare and exchange
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To understand how this works, we need to understand what &lt;em&gt;cmpxchg&lt;/em&gt; does. The &lt;a href=&quot;http://download.intel.com/products/processor/manual/325383.pdf&quot;&gt;Intel Software Developer’s Manual&lt;/a&gt; describes it as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Compares the value in the EAX register with the destination operand. If the two values are equal, the source operand is loaded into the destination operand. Otherwise, the destination operand is loaded into the EAX register.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, we’re loading the value, incrementing it, then only writing it into memory if nobody else has overwritten it with a different value since we loaded it. Obviously, the operation needs to be retried if the store to memory fails. Indeed, the Java implementation does just that. I’ll spare you the verbose assembly version, and present the Java version from OpenJDK:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;public final int incrementAndGet() {
  for (;;) {
    int current = get();
    int next = current + 1;
    if (compareAndSet(current, next))
      return next;
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;em&gt;compareAndSet&lt;/em&gt; function is a native implementation from Unsafe.java. Counting the instructions from the disassemblies would suggest that the Atomic version should be, at most, 100 times slower than the non-Atomic, non-volatile version. So why is it tens of thousands of times slower? The best evidence comes from modifying the programs so the threads don’t contend - simply running one, then the other. Removing the contention doesn’t do anything to the non-Atomic, non-volatile version: it still runs in about 0.05ms. The performance difference when contention is removed from the volatile and Atomic versions is rather startling. The volatile version drops from 2286ms to 0.1ms, and the Atomic version drops from 3041ms to 0.15ms. In both cases, the serial version is around twenty thousand times faster. In trivial examples like this, it’s clear that &lt;a href=&quot;http://en.wikipedia.org/wiki/Amdahl%27s_law&quot;&gt;Amdahl’s Law&lt;/a&gt;, which assumes that parallelizing a program doesn’t make it slower, is highly over-optimistic. This is &lt;a href=&quot;http://en.wikipedia.org/wiki/Parallel_slowdown&quot;&gt;parallel slowdown&lt;/a&gt; in action.&lt;/p&gt;

&lt;p&gt;To understand what’s making this amazing 10-thousand-fold difference in program runtime, we can turn to &lt;a href=&quot;https://perf.wiki.kernel.org/index.php/Main_Page&quot;&gt;Linux performance counters&lt;/a&gt;, and run the JVM under the &lt;em&gt;perf stat&lt;/em&gt; program. Here are some highlights of the report for the two versions, with the no-contention (fast) version first:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   806,970,762 instructions              #    0.52  insns per cycle 
   165,637,507 branches                  #   57.996 M/sec         
     2,293,172 branch-misses             #    1.38% of all branches
   195,282,430 L1-dcache-loads           #   68.375 M/sec          
     4,348,143 L1-dcache-load-misses     #    2.23% of all L1-dcache hits
     1,978,387 LLC-loads                 #    0.693 M/sec                
       110,472 LLC-load-misses           #    5.58% of all LL-cache hits 
   0.713189740 seconds time elapsed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then the slow version, with contention:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; 1,066,079,264 instructions              #    0.08  insns per cycle      
   227,089,065 branches                  #   17.347 M/sec                
    10,659,632 branch-misses             #    4.69% of all branches      
   408,262,745 L1-dcache-loads           #   31.187 M/sec                
    24,905,458 L1-dcache-load-misses     #    6.10% of all L1-dcache hits
    21,168,959 LLC-loads                 #    1.617 M/sec                
    12,133,097 LLC-load-misses           #   57.32% of all LL-cache hits 
   3.271893871 seconds time elapsed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We must be careful interpreting these results, because they are polluted with data not related to our program under test, like the JVM startup sequence. Still, the differences are quite obvious. Our first smoking gun is instructions per cycle (the first line) - this measures how many CPU instructions were executed per clock cycle, and so how often the CPU could actually do work without being blocked on other things, like loading data from RAM. Higher is obviously better. The fast no-contention version gets a respectable 0.52 instructions per cycle. The slow version scores a terrible 0.08 - indicating 92% of cycles were wasted. Moving down the report indicates why this is the case, with the LLC-load-misses line being the most damning. We went from 110 thousand LLC (last-level cache) misses in the fast version, to over 12 million in the slow version. The slow version is spending most of its time waiting for data from RAM.&lt;/p&gt;

&lt;p&gt;Why does contention cause an otherwise identical program to behave so differently? The answer lies in the way that the CPUs in my multi-core system are forced to keep their caches in sync to meet the requirement of &lt;em&gt;visibility&lt;/em&gt;. The short version is that each core keeps track of which cache lines (chunks of memory in cache) are shared with other CPUs, and puts in extra effort to tell other CPUs about writes it makes to those lines. If you want the long version, Ulrich Drepper explains it very well in section 3 of his excellent &lt;a href=&quot;http://www.akkadia.org/drepper/cpumemory.pdf&quot;&gt;What Every Programmer Should Know About Memory&lt;/a&gt;. I highly recommend reading it.&lt;/p&gt;

&lt;p&gt;The most important lesson here is how critical profiling is to writing high-performing parallel programs. Sometimes our intuition (more parallel equals faster) can be dangerously incorrect. The second lesson is the importance of minimizing contention and communication between cores.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Are volatile reads really free?</title>
      <link>http://brooker.co.za/blog/2012/09/10/volatile.html</link>
      <pubDate>Mon, 10 Sep 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/09/10/volatile</guid>
      <description>&lt;h1 id=&quot;are-volatile-reads-really-free&quot;&gt;Are volatile reads really free?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Some claim that reads from volatile variables are free in Java on x86. Is that claim true?&lt;/p&gt;

&lt;p&gt;There is a somewhat common belief among Java programmers that reads from &lt;em&gt;volatile&lt;/em&gt; variables are free. &lt;a href=&quot;http://stackoverflow.com/questions/1090311/are-volatile-variable-reads-as-fast-as-normal-reads&quot;&gt;Are volatile variable ‘reads’ as fast as normal reads?&lt;/a&gt; from StackOverflow is a perfect example, because it’s the top result I get from Google for most related searches. The top answer says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;On an x86, there is no additional overhead associated with volatile reads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and goes on to say:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;JSR-133 classifies four barriers “LoadLoad, LoadStore, StoreLoad, and StoreStore”. Depending on the architecture, some of these barriers correspond to a “no-op”, meaning no action is taken, others require a fence. There is no implicit cost associated with the Load itself, though one may be incurred if a fence is in place. In the case of the x86, only a StoreLoad barrier results in a fence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reading Doug Lea’s &lt;a href=&quot;http://gee.cs.oswego.edu/dl/jmm/cookbook.html&quot;&gt;The JSR-133 Cookbook for Compiler Writers&lt;/a&gt; gives the same impression. It says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Issue a StoreStore barrier before each volatile store.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Issue a StoreLoad barrier after each volatile store. Note that you could instead issue one before each volatile load, but this would be slower for typical programs using volatiles in which reads greatly outnumber writes. Alternatively, if available, you can implement volatile store as an atomic instruction (for example XCHG on x86) and omit the barrier. This may be more efficient if atomic instructions are cheaper than StoreLoad barriers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Issue LoadLoad and LoadStore barriers after each volatile load.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Doug lists the StoreStore, LoadLoad and LoadStore barriers as no-ops on x86. It makes some sense to read this as a validation of the idea that volatile reads are free. No extra instructions are issued for the reads, and instructions are what make computers take time, so no more time is taken. Right?&lt;/p&gt;

&lt;p&gt;The first step to answering that question is seeing if the Java 1.6 JRE actually behaves like Doug says it should. Possibly the easiest way to do this is to add the &lt;em&gt;-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly&lt;/em&gt; options to the Java command line, and see the assembly being generated by the JIT. I wrote a very small toy program which accesses a volatile variable in a loop (to force the JIT to kick in).&lt;/p&gt;
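&lt;p&gt;That toy program isn’t shown here; something as small as this sketch (names are illustrative) is enough to get the read loop JIT-compiled:&lt;/p&gt;

```java
// Sketch only: read a volatile field in a hot loop so the JIT compiles the
// method, letting -XX:+PrintAssembly show the generated read and write code.
public class VolatileReadLoop {
    volatile long x = 42; // the volatile field under test

    long sumReads(int iters) {
        long sum = 0;
        for (int i = 0; i < iters; i++) {
            sum += x; // one volatile read per iteration
        }
        return sum; // using the sum keeps the loop from being eliminated
    }
}
```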

&lt;p&gt;Here’s the code it generated for both volatile and non-volatile reads:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nop                       ;*synchronization entry
mov    0x10(%rsi),%rax    ;*getfield x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And for volatile writes:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;xchg   %ax,%ax
movq   $0xab,0x10(%rbx)
lock addl $0x0,(%rsp)     ;*putfield x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The JVM is behaving exactly like it should according to the spec, with one minor difference. It’s using &lt;em&gt;lock addl $0x0,(%rsp)&lt;/em&gt; as the StoreLoad barrier. The gory details can be found in chapter 8 of volume 3A of &lt;a href=&quot;http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html&quot;&gt;Intel’s IA-32 developer manual&lt;/a&gt;, but the short version is:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Locked operations are atomic with respect to all other memory operations and all externally visible events. […] Locked instructions can be used to synchronize data written by one processor and read by another processor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While &lt;em&gt;addl $0x0,(%rsp)&lt;/em&gt; does nothing, &lt;em&gt;lock addl $0x0,(%rsp)&lt;/em&gt; behaves like the StoreLoad barrier that Doug Lea talks about. The minor difference from Lea’s description is the &lt;em&gt;xchg %ax,%ax&lt;/em&gt; before the store. &lt;em&gt;xchg&lt;/em&gt; does have some memory ordering influence, but I’m not sure of the role it plays here - it seems to me like this would be correct without it, so there must be something I am missing.&lt;/p&gt;

&lt;p&gt;Anyway, it looks like what the JVM is doing in this case follows Doug Lea’s recommendations. We can expect volatile reads to be free, right?&lt;/p&gt;

&lt;p&gt;To test the theory, I wrote a program which does the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Launches 3 reader threads, which try to do 500 million reads from a shared variable and stores to a local variable&lt;/li&gt;
  &lt;li&gt;Launches 1 writer thread, which increments the shared variable 500 million times&lt;/li&gt;
  &lt;li&gt;Synchronizes their start times, and times them to completion&lt;/li&gt;
&lt;/ul&gt;
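&lt;p&gt;In outline, the benchmark looks something like this sketch (illustrative names; the real program also synchronizes start times, which is omitted here):&lt;/p&gt;

```java
// Sketch only: reader threads each do many reads of a shared variable into a
// local, while one writer thread increments the shared variable. Timing the
// run with and without 'volatile' on the field reproduces the experiment.
public class ReaderWriterBench {
    static volatile long shared; // drop 'volatile' for the baseline variant
    static volatile long sink;   // keeps reader loops from being removed

    static long runNanos(int readers, int iterations) {
        shared = 0;
        Thread[] threads = new Thread[readers + 1];
        for (int i = 0; i < readers; i++) {
            threads[i] = new Thread(() -> {
                long local = 0;
                for (int j = 0; j < iterations; j++) {
                    local = shared; // shared read, stored to a local
                }
                sink = local;
            });
        }
        threads[readers] = new Thread(() -> {
            for (int j = 0; j < iterations; j++) {
                shared++; // single writer, so no increments are lost
            }
        });
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return System.nanoTime() - start;
    }
}
```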

&lt;p&gt;I ran it on my quad-core machine (Intel Q6600), and did three runs. Once, with a non-volatile shared variable, once with a volatile shared variable and once with a volatile shared variable and no writer thread. The graph below summarizes the results.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/volatile_results_graph.png&quot; alt=&quot;Graph of experimental results.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this experiment, volatile writes are nearly 100x more expensive than normal writes. That corresponds well with the &lt;em&gt;volatile writes are expensive&lt;/em&gt; mental model. What’s most notable, though, is that volatile reads when there is a writer are about 25x more expensive than non-volatile reads. The included error bars show one standard deviation both sides of the mean run time, and clearly show that the speed of volatile read access also varies widely. If the code’s the same, then what’s going on?&lt;/p&gt;

&lt;p&gt;What we are seeing is the effect of the processor trying to keep its contract with the &lt;em&gt;lock&lt;/em&gt; instruction, and make sure that the data written by the writer is visible to the readers. Memory access and cache coherency are some of the most expensive things that modern processors do, and it’s not easy to predict their performance impact from the assembly code. On the single-threaded single-core x86 processors of the past it was hard enough to read an assembly dump and predict performance. On modern multicore systems, it’s extraordinarily difficult.&lt;/p&gt;

&lt;p&gt;The third set of results tells another interesting story about volatile reads, one that’s closer to the meaning of volatile in C. To illustrate the difference, I wrote a much simpler program which reads a variable, increments it, and writes it back to another variable. Without volatile, the compiler generates code like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; nop                       ;*getfield isum
 add    %r10,%r11
 add    %r10,%r11
 ... 12 more adds ...
 add    %r10,%r11
 mov    %r11,0x10(%rbx)    ;*putfield isum
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The loop is unrolled, and the variables simply stored in registers. With volatile, the code looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;test   %r11d,%r11d
je     0x00007f89bd05e7b4  ;*synchronization entry
mov    0x18(%r11),%r10    ;*getfield x
add    %r10,%r9           ;*ladd
mov    %r9,0x10(%rbx)     ;*putfield isum
mov    0xc(%rbx),%r11d    ;*getfield this$0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’ve only included one piece of the unrolled loop here. The compiler wrapped five sections like this in the larger outer loop, reducing the loop overhead somewhat. It does a whole lot more work per step than the non-volatile version (the details are unimportant), and isn’t nearly as aggressive about unrolling (I assume because it’s making a heuristic decision based on code size).&lt;/p&gt;

&lt;p&gt;It appears as though reads of volatile variables are not free in Java on x86, or at least on the tested setup. It’s true that the difference isn’t so huge (especially for the read-only case) that it’ll matter in any but the most performance-sensitive cases, but that’s a different statement from &lt;em&gt;free&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;EDIT: It’s worth noting that I am not criticizing Doug Lea’s &lt;a href=&quot;http://gee.cs.oswego.edu/dl/jmm/cookbook.html&quot;&gt;The JSR-133 Cookbook for Compiler Writers&lt;/a&gt;. He doesn’t say that volatile reads are free, and doesn’t even suggest it. That’s just something that many people seem to have inferred from his writing.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Highly contended and fair locking in Java</title>
      <link>http://brooker.co.za/blog/2012/09/10/locking.html</link>
      <pubDate>Mon, 10 Sep 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/09/10/locking</guid>
      <description>&lt;h1 id=&quot;highly-contended-and-fair-locking-in-java&quot;&gt;Highly contended and fair locking in Java&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;How do explicit locks compare to volatile access?&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;http://brooker.co.za/blog/2012/09/10/volatile.html&quot;&gt;my last post on Java’s volatile&lt;/a&gt;, I showed how (in one set of experiments) Java volatile variable reads don’t come for free. The cost of accessing a highly-contended volatile variable in one micro-benchmark came out at about 100x the cost of accessing a non-volatile variable. How does that compare to locking?&lt;/p&gt;

&lt;p&gt;In the last post, I presented the results of a program which:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Launches 3 reader threads, which do 500 million reads from a shared variable and stores to a local variable&lt;/li&gt;
  &lt;li&gt;Launches 1 writer thread, which increments the shared variable 500 million times&lt;/li&gt;
  &lt;li&gt;Synchronizes their start times, and times them to completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this post, I modified it to make the variable non-&lt;em&gt;volatile&lt;/em&gt;, and add explicit locking using a &lt;a href=&quot;http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/locks/ReentrantLock.html&quot;&gt;ReentrantLock&lt;/a&gt;. Performance dropped substantially. The graph below is an update of the graph from the last post, including the locking version.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/lock_results_graph.png&quot; alt=&quot;Graph of lock results&quot; /&gt;&lt;/p&gt;
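&lt;p&gt;The locked variant can be sketched like this (illustrative names; passing &lt;em&gt;true&lt;/em&gt; to the constructor gives the fair version discussed below):&lt;/p&gt;

```java
// Sketch only: a shared variable guarded by a ReentrantLock for both reads
// and writes. new ReentrantLock(true) would give the (much slower) fair lock.
import java.util.concurrent.locks.ReentrantLock;

public class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock(); // non-fair default
    private long value;

    void increment() {
        lock.lock();
        try {
            value++;
        } finally {
            lock.unlock(); // always release, even on exception
        }
    }

    long read() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```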

&lt;p&gt;For reads, the locking version of this test is about 33x more expensive than the volatile version (and over 3000x more than the incorrect unsynchronized version). Writes are about 15x more expensive. To put this in perspective, it’s still only 545 nanoseconds per lock operation, so individual operations are not really expensive in absolute terms.&lt;/p&gt;

&lt;p&gt;The other effect of locking vs. volatile is starvation of threads. The documentation for &lt;a href=&quot;http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/locks/ReentrantLock.html&quot;&gt;ReentrantLock&lt;/a&gt; says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The constructor for this class accepts an optional fairness parameter. When set true, under contention, locks favor granting access to the longest-waiting thread. Otherwise this lock does not guarantee any particular access order. Programs using fair locks accessed by many threads may display lower overall throughput (i.e., are slower; often much slower) than those using the default setting, but have smaller variances in times to obtain locks and guarantee lack of starvation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my tests, I had fairness disabled and saw significant thread starvation. In about one run in five, the program actually runs in a totally serial order - each thread running to completion before any of the other threads run. I have seen thread starvation issues in real-world code too, but never to this extent. Fair locking fixed this problem, but increased the per-lock time to 32µs (from 0.5µs for the non-fair version). That’s an increase of about 60x, for a total of approximately 200x more expensive than a volatile access.&lt;/p&gt;

&lt;p&gt;Fair locks in the version of Java I was using are based on the &lt;a href=&quot;http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/concurrent/locks/AbstractQueuedSynchronizer.java#AbstractQueuedSynchronizer.acquireQueued%28java.util.concurrent.locks.AbstractQueuedSynchronizer.Node%2Cint%29&quot;&gt;AbstractQueuedSynchronizer&lt;/a&gt;. It uses a type of lock queue &lt;a href=&quot;http://gee.cs.oswego.edu/dl/papers/aqs.pdf&quot;&gt;modified from Craig, Landin, Hagersten locks&lt;/a&gt;, which provide a FIFO queue of waiters on a lock without needing to depend on another lower-level locking primitive. It’s truly fascinating stuff, and well worth reading.&lt;/p&gt;

&lt;p&gt;Given the very high costs of fair locks under contention, it’s probably best to avoid them in those situations. There are situations where they work well and are very necessary. From &lt;a href=&quot;http://gee.cs.oswego.edu/dl/papers/aqs.pdf&quot;&gt;Doug Lea&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Even though they may perform poorly under high contention when protecting briefly-held code bodies, fair locks work well, for example, when they protect relatively long code bodies and/or with relatively long inter-lock intervals&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s probably best to avoid writing high-contention code unless absolutely necessary. Where it is necessary, however, &lt;em&gt;volatile&lt;/em&gt; should be tested against locking, because it’s likely to be much faster. Fair locking shouldn’t be used in performance sensitive code unless its guarantees are really needed. You’ll likely want to do your own testing, and not assume your application will behave like my test.&lt;/p&gt;

</description>
    </item>
    
    <item>
      <title>Expect Less, Get More?</title>
      <link>http://brooker.co.za/blog/2012/09/02/expect-less.html</link>
      <pubDate>Sun, 02 Sep 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/09/02/expect-less</guid>
      <description>&lt;h1 id=&quot;expect-less-get-more&quot;&gt;Expect Less, Get More?&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;On what newly hired engineers think they need to do.&lt;/p&gt;

&lt;p&gt;Hiring good software engineers is really difficult. It’s hard to find good people, hard to filter good hires from bad hires, and possibly even harder to decide what ‘good’ really means. When we find a good hire, we want to make sure they can reach their potential as quickly as possible. Most good software managers and leaders realize that even excellent people may take some time to really contribute. They have a lot to learn, and must be given the time to learn, however eager we are to get them contributing to our projects.&lt;/p&gt;

&lt;p&gt;I recently came across a pair of papers from &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/abegel/&quot;&gt;Andrew Begel&lt;/a&gt; at Microsoft Research, and &lt;a href=&quot;http://cseweb.ucsd.edu/~bsimon/&quot;&gt;Beth Simon&lt;/a&gt; at UCSD. In &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/abegel/papers/icer-begel-2008.pdf&quot;&gt;Novice Software Developers, All Over Again&lt;/a&gt; and &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/abegel/papers/sigcse-begel-2008.pdf&quot;&gt;Struggles of New College Graduates in their First Software Development Job&lt;/a&gt;, they present the results from a study where they followed eight new Microsoft hires for a total of 85 hours.&lt;/p&gt;

&lt;p&gt;All of the material in these studies is very interesting, but the section I found most relevant to my interests is the list of &lt;em&gt;Misconceptions Which Hinder&lt;/em&gt;. The universal misconceptions they list all seem to match my own experiences early in my career, and seem to align well with what I see colleagues struggling with. Perhaps the most poignant of these is the first one:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;1. I must do everything myself so that I look good to my manager.&lt;/em&gt;
This misconception is particularly dangerous, especially in large, complex development environments. […] the perceived need to “perform” and not “reveal deficiencies” makes for much wasted time. It also seems to contribute to poor communication and a longer acclimatization. Communication suffered both by waiting too long to seek help and by trying to cover up issues that the [engineer] perhaps felt he “should know.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fact that this was universally felt is especially interesting given the diverse educational backgrounds of the study subjects:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Subjects W, X, Y and Z had BS degrees, V had an MS, and U, R, and T had PhDs, all in computer science or software engineering. 2 were educated in the US, 2 in China, 1 in Mexico, 1 in Pakistan, 1 in Kuwait, and 1 in Australia. All 3 PhDs were earned in US universities.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Starting in industry after completing my PhD, I felt the same pressure that these candidates report. I felt like I had been hired for ‘‘what I knew’’, and by admitting that I didn’t know something, I’d make my manager and mentor reconsider the decision to hire me. Until reading Begel and Simon’s study, I hadn’t really thought that this was a widespread feeling amongst people early in their careers. Thinking about this now, I realized two things. First, this pressure came from me, and not from the outside. What feedback I received on my performance was almost uniformly positive. Second, it’s clear that I wasn’t hired for my intimate knowledge of the proprietary systems I was working on.&lt;/p&gt;

&lt;p&gt;While Begel and Simon’s study is a small one, it is good evidence that this is a widespread problem amongst newly hired engineers. As mentors and managers, we need to make requirements more clear to new hires. It’s not surprising that somebody who has been around for only a few months doesn’t have a complete knowledge of our system, and we should clearly communicate that we don’t expect them to. As software designers, we can also mitigate this problem by making sure our systems are loosely coupled and interfaces well defined - the amount of knowledge that a newbie needs to make effective changes to our software is a useful metric for good design.&lt;/p&gt;

&lt;p&gt;After reading these papers, I found the same research covered in the excellent book &lt;a href=&quot;http://www.amazon.com/Making-Software-Really-Works-Believe/dp/0596808321&quot;&gt;Making Software: What Really Works, and Why We Believe It&lt;/a&gt;. It’s well worth picking up if you have any interest in what we know about the process of making software.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Latency lags bandwidth</title>
      <link>http://brooker.co.za/blog/2012/02/11/latency-lags-bandwidth.html</link>
      <pubDate>Sat, 11 Feb 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/02/11/latency-lags-bandwidth</guid>
      <description>&lt;h1 id=&quot;latency-lags-bandwidth&quot;&gt;Latency lags bandwidth&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;On the growing gap between capacity, bandwidth and latency.&lt;/p&gt;

&lt;p&gt;One of the key engineering challenges for most high-performance storage systems is minimizing the number of disk seeks that are required to access stored data. Some of this requires smart techniques like reordering, but the majority of the win comes from caching - storing as much data as you can in memory, or at least away from slow disk platters. Database systems do the same, as do modern operating systems - they constantly cache reads and buffer writes in memory in an attempt to hide disk latency. The classic &lt;a href=&quot;ftp://ftp.research.microsoft.com/pub/tr/tr-99-100.pdf&quot;&gt;rule of thumb&lt;/a&gt; is that it’s worth caching pages in memory which are likely to be accessed again within five minutes. This five minute rule has proven to be amazingly constant over &lt;a href=&quot;ftp://ftp.research.microsoft.com/pub/tr/tr-97-33.pdf&quot;&gt;several decades&lt;/a&gt; and many orders of magnitude of computer speed, capacity and development.&lt;/p&gt;

&lt;p&gt;Another way of stating the same thing is that page sizes have stayed approximately constant, disk access latencies have stayed approximately constant, and the ratio between the cost of a byte of RAM and a byte of disk has stayed approximately constant. We spend an ever increasing amount of our cheap RAM on hiding the crappy latency of our cheap storage.&lt;/p&gt;

&lt;p&gt;Caching has been very successful. So successful, in fact, that it has effectively hidden from all but the biggest applications the ever-growing split between capacity, bandwidth and latency in our storage systems.&lt;/p&gt;

&lt;p&gt;From &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=1022596&quot;&gt;Latency Lags Bandwidth&lt;/a&gt;, a 2003 paper by David Patterson:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/mbrooker_patterson_llb.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For every decade that Patterson measured for the paper, disks got on average 50 times bigger, 12 times faster at doing bulk transfers, and only 2.4 times faster at seeking. That paper came out in 2003, but there isn’t much indication that the picture has changed substantially since then. The other way of looking at that is even more disturbing: the time to read a complete disk with random IO is increasing by a factor of 22 every decade, or about 36% a year.&lt;/p&gt;

&lt;p&gt;The problem with the success of caching at hiding latency is that the cliff is getting steeper: the ratio between the speed of a cache hit and a cache miss is changing. Many applications nominally use disk, but may not be able to afford cache misses at all.&lt;/p&gt;

&lt;p&gt;There are two solutions to this: simply give up on disks for online data (RAM is the new disk), or expend bandwidth to reduce apparent latency. Both of these options are already widely seen in production systems. &lt;a href=&quot;http://www.mongodb.org&quot;&gt;MongoDB&lt;/a&gt;, &lt;a href=&quot;http://redis.io/&quot;&gt;Redis&lt;/a&gt; and &lt;a href=&quot;http://www.voltdb.com/&quot;&gt;VoltDB&lt;/a&gt; are good examples of the first and &lt;a href=&quot;http://www.oracle.com/technetwork/database/berkeleydb/learnmore/bdb-je-architecture-whitepaper-366830.pdf&quot;&gt;BDB-JE&lt;/a&gt; is a good example of the second. Neither of these approaches is ideal, however. Storing data in RAM requires very careful attention to durability, while bandwidth intensive methods have to deal with the very real and widening gap between capacity and bandwidth and the tradeoff between write and read speeds.&lt;/p&gt;

&lt;p&gt;This sets the stage for the explosive rise of SSDs, hybrid disks and hybrid storage systems, which is very exciting. Unfortunately, they only change the constants and not the fundamental arithmetic. Latency is going to keep lagging, and our data is going to keep getting further away.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The properties of crash-only software</title>
      <link>http://brooker.co.za/blog/2012/01/22/crash-only.html</link>
      <pubDate>Sun, 22 Jan 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/01/22/crash-only</guid>
      <description>&lt;h1 id=&quot;the-properties-of-crash-only-software&quot;&gt;The properties of crash-only software&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;My thoughts about a classic paper&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.usenix.org/events/hotos03/tech/full_papers/candea/candea.pdf&quot;&gt;Crash-only software&lt;/a&gt; by Candea and Fox is a very interesting paper which is well worth your time if you spend any time designing software or services. Re-reading it today, I noticed how useful the section headers of section 3 &lt;em&gt;Properties of Crash-Only Software&lt;/em&gt; appear outside the context of the paper.&lt;/p&gt;

&lt;p&gt;The properties the authors list are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;All important non-volatile state is managed by dedicated state stores&lt;/li&gt;
  &lt;li&gt;Components have externally enforced boundaries&lt;/li&gt;
  &lt;li&gt;All interactions between components have a timeout&lt;/li&gt;
  &lt;li&gt;All resources are leased, rather than permanently allocated&lt;/li&gt;
  &lt;li&gt;Requests are entirely self-describing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regardless of what you think of the value of crash-only software, it is difficult to argue with this list of properties. Even outside of the context of the paper, each of these makes sense to me as a good design practice. The way I understand them is like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All important non-volatile state is managed by dedicated state stores&lt;/strong&gt;. Either you care about your state or you don’t. If you do, store it somewhere safe where it won’t get lost and can be recovered quickly in event of failure. If you don’t, and your data is either purely volatile or a cache, then be explicit in your design that you don’t care about it. Don’t half-care about data. Store data explicitly and not implicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components have externally enforced boundaries&lt;/strong&gt;. Keep logically separate components as separate as possible, ensuring that they don’t interact except via a well-defined API. Try to limit implicit side channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All interactions between components have a timeout&lt;/strong&gt;. Not only that, but consider the possibility that what you are trying to do will never succeed. All individual calls should have a timeout, and all attempts to retry should be limited. Waiting forever is not a substitute for knowing how to handle failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All resources are leased, rather than permanently allocated&lt;/strong&gt;. If you own something, don’t give it away to somebody else if you still value it. You can lend it to them for as long as they can justify its use. If you no longer care about something, you can give it away. If something still has value to your business, then keep it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requests are entirely self-describing&lt;/strong&gt;. Don’t keep implicit context in protocols and interactions. To me, self-describing doesn’t imply ‘has no schema’ or ‘doesn’t correspond to a well defined API’. It means that requests should contain all the context that is needed to complete them, and if that is not possible keep the state in a dedicated state store. The API should also be explicit about idempotency and the safety of retries.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The power of two random choices</title>
      <link>http://brooker.co.za/blog/2012/01/17/two-random.html</link>
      <pubDate>Tue, 17 Jan 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/01/17/two-random</guid>
      <description>&lt;h1 id=&quot;the-power-of-two-random-choices&quot;&gt;The power of two random choices&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Using less information to make better decisions.&lt;/p&gt;

&lt;p&gt;In many large-scale web services, multiple layers of stateless and stateful services are separated by load balancers. Load balancing can be done with dedicated hardware, with dedicated software load balancers, using DNS trickery or through a load-balancing mechanism in the client. In large systems, the resources and constraints at each layer can vary widely. Some layers are stateless, and easily scale horizontally to many machines. Other layers may be more constrained, either due to the need to access state or contention for some other shared resource.&lt;/p&gt;

&lt;p&gt;Centralized load balancing solutions can distribute load over a fleet of machines very well. They track the amount of load they are sending to each machine (usually based on a simple measurement like connection count). Because they are centralized, load balancers typically don’t need to worry about load sent from other sources. They have complete control over the distribution of load.&lt;/p&gt;

&lt;p&gt;Despite this advantage, dedicated load balancers are often undesirable. They add cost, latency, complexity, and may introduce a single point of failure. Handing the task of load balancing to each upstream client is also possible, but introduces the challenge of fairly balancing the load from multiple places. In large systems with large numbers of clients and fairly homogeneous calls, a purely random system like DNS round robin can work very well. In smaller systems, systems where each downstream service can only handle a small number of concurrent requests, and systems where requests are heterogeneous, it’s often desirable to load balance better than random.&lt;/p&gt;

&lt;p&gt;Perfect distributed load balancing could be done, at least in the happy case, by distributing information about system load across all the clients. The overhead of constantly sharing the exact load information between different sources can be high, so it’s tempting to have each source work off a cached copy. This data can periodically be refreshed from downstream, or from other clients.&lt;/p&gt;

&lt;p&gt;It turns out that’s not a great idea.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf&quot;&gt;The Power of Two Random Choices: A Survey of Techniques and Results&lt;/a&gt;, Mitzenmacher et al. survey some research very relevant to this problem. The entire survey is good reading, but one of the most interesting results is about the effects of delayed data (like the cached load results mentioned above) on load balancing. The results are fairly logical in retrospect, but probably don’t match most engineers’ first expectations.&lt;/p&gt;

&lt;p&gt;Using stale data for load balancing leads to a herd behavior, where requests will herd toward a previously quiet host for much longer than it takes to make that host very busy indeed. The next refresh of the cached load data will put the server high up the load list, and it will become quiet again. Then busy again as the next herd sees that it’s quiet. Busy. Quiet. Busy. Quiet. And so on.&lt;/p&gt;

&lt;p&gt;One possible solution would be to give up on load balancing entirely, and just pick a host at random. Depending on the load factor, that can be a good approach. With many typical loads, though, picking a random host degrades latency and reduces throughput by wasting resources on servers which end up unlucky and quiet.&lt;/p&gt;

&lt;p&gt;The approach taken by the studies surveyed by Mitzenmacher is to try two hosts, and pick the one with the least load. This can be done directly (by querying the hosts) but also works surprisingly well on cached load data.&lt;/p&gt;

&lt;p&gt;To illustrate how well this works, I ran a simulation of a very simplistic system. Every second one request arrives at a system with 10 servers. Every 8 seconds the oldest request (if any) is removed from the queue on each server. Load balancing is done with a cached copy of the server queue lengths, updated every N seconds. I considered four approaches: choose a random server, choose the best server, best of two random choices and best of three random choices.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://s3.amazonaws.com/mbrooker-blog-images/mbrooker_best_of_two_result.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As you might expect, the &lt;em&gt;pick the best&lt;/em&gt; approach worked best when perfect undelayed information was available. In the same case, the random pick approach worked poorly, leading to the worst queue times of any of the approaches. As the age of the data increases, the &lt;em&gt;pick the best&lt;/em&gt; approach becomes worse and worse because of herding and soon overtakes the random approach as the worst one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Best of 3&lt;/em&gt; starts in second place, and puts in a good performance. It gains first place, only to cede it to &lt;em&gt;best of 2&lt;/em&gt; as the delay increases. Clearly, the behavior for &lt;em&gt;best of k&lt;/em&gt; will approach the behavior of &lt;em&gt;best&lt;/em&gt; as k approaches the number of servers. &lt;em&gt;Best of 2&lt;/em&gt; remains the strong leader all the way to the end of this simulation. Given these parameters it would start losing to the random approach around a refresh interval of 70. It is the clear leader for intervals over the range from 10 to 70, which is an impressive performance for such a simple approach.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Best of 2&lt;/em&gt; is good because it combines the best of both worlds: it uses real information about load to pick a host (unlike random), but rejects herd behavior much more strongly than the other two approaches.&lt;/p&gt;

&lt;p&gt;Take a look at &lt;a href=&quot;http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf&quot;&gt;The Power of Two Random Choices&lt;/a&gt; for a much stronger mathematical argument, and some more surprising places this algorithm works really well.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The benefits of having data</title>
      <link>http://brooker.co.za/blog/2012/01/10/drive-failure.html</link>
      <pubDate>Tue, 10 Jan 2012 00:00:00 +0000</pubDate>
      <author>marcbrooker@gmail.com (Marc Brooker)</author>
      <guid>http://brooker.co.za/blog/2012/01/10/drive-failure</guid>
      <description>&lt;h1 id=&quot;the-benefits-of-having-data&quot;&gt;The benefits of having data&lt;/h1&gt;

&lt;p class=&quot;meta&quot;&gt;Two ways to look at drive failures and temperature.&lt;/p&gt;

&lt;p&gt;Almost all recent articles and papers I have read on hard drive failure rates refer to either &lt;a href=&quot;http://www.usenix.org/events/fast07/tech/full_papers/pinheiro/pinheiro_html/&quot;&gt;Failure Trends in a Large Disk Drive Population&lt;/a&gt; from Google, or &lt;a href=&quot;http://www.seagate.com/docs/pdf/whitepaper/drive_reliability.pdf&quot;&gt;Estimating Drive Reliability in Desktop Computers and Consumer Electronics Systems&lt;/a&gt; from Seagate. Despite both sounding and looking authoritative, these papers come to some wildly different conclusions, and couldn’t be more different in their approach.&lt;/p&gt;

&lt;p&gt;How does temperature affect drive failure rate? The Seagate paper says an increase from 25C to 30C increases it by 27%. The Google paper suggests a decrease of around 10%. How can the two most widely used studies differ by so much? It’s really because these papers use completely different approaches: the Google study uses simple analysis, while the Seagate paper uses powerful and sophisticated models, accelerated aging, and complex statistical tools. Despite sounding less authoritative, the Google paper is much better.&lt;/p&gt;

&lt;p&gt;The Seagate paper doesn’t actually present the results of testing drives at different temperatures. Instead, all the drives were tested using a standard &lt;a href=&quot;http://en.wikipedia.org/wiki/Accelerated_aging&quot;&gt;accelerated aging&lt;/a&gt; approach, in an oven heated to 42C. Another standard accelerated aging technique, the &lt;a href=&quot;http://en.wikipedia.org/wiki/Arrhenius_equation&quot;&gt;Arrhenius Model&lt;/a&gt;, was used to estimate the effect of temperature on failure rates. The Seagate paper goes on to use Weibull modeling, and a fairly sophisticated Bayesian approach to estimating the Weibull parameters. The underlying, and unmentioned, assumption is that the failure rate of drives is proportional to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Reaction_rate_constant&quot;&gt;reaction rate constant&lt;/a&gt;, the rate at which an unconstrained chemical reaction would proceed at a given temperature. No attempt is made to justify this choice, other than appealing to standard textbooks describing the approach.&lt;/p&gt;

&lt;p&gt;The Google paper, on the other hand, doesn’t use any statistical concepts that would be unfamiliar to an undergraduate engineering student. Instead, they use the failure data from over a hundred thousand disk drives collected over nearly a year of constant measurement.&lt;/p&gt;

&lt;p&gt;The difference between these papers’ conclusions on a rather fundamental question - how does temperature affect drive failure? - is a great example of how there is no substitute for data. Statistical tools and small-sample laboratory testing are certainly useful, and should not be ignored, but their conclusions are a poor approximation of the real thing.&lt;/p&gt;

&lt;p&gt;The Seagate paper is, sadly, an excellent example of why so many engineers (in our industry, and outside of it) are suspicious of statistical approaches. The paper not only stretches a little data into big conclusions, but does it while sounding authoritative. It uses complex statistical jargon not to explain, but to shield. &lt;em&gt;How dare you question my conclusions? You haven’t seen a Weibull model since college, and probably don’t even know what Eta looks like!&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    

  </channel>
</rss>
