Marc's Blog

About Me

My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in machining, welding, cooking and skiing.

I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I work on databases, serverless, and serverless databases. Before that, I worked on EC2 and EBS.
All opinions are my own.

Links

My Publications and Videos
@marcbrooker on Mastodon @MarcJBrooker on Twitter

Simple Simulations for System Builders

Even the most basic numerical methods can lead to surprising insights.

It’s no secret that I’m a big fan of formal methods. I use P and TLA+ often. I like these tools because they provide clear ways to communicate about even the trickiest protocols, and allow us to use computers to reason about the systems we’re designing before we build them1. These tools are typically focused on safety (Nothing bad happens) and liveness (Something good happens (eventually))2. Safety and liveness are crucial properties of systems, but far from being all the properties we care about. As system designers we typically care about many other things that aren’t strictly safety or liveness properties. For example:

The formal tools we typically use don’t do a great job of answering these questions. There are many ways to answer them, of course, from closed-form analysis3 to prototyping. One of my favorite approaches is one I call simple simulation: writing small simulators that simulate the behavior of simple models, where the code can be easily read, reviewed, and understood by people who aren’t experts on simulation or numerical methods.

A Quick Example

If you hang around with skiers or snowboarders, you’ll have heard a lot of talk over the last couple of winters about how crowded resorts have become, and how much time they now spend waiting to ride the ski lift4. Resort operators say that visits have been up only quite modestly, but skiers are seeing much longer waits. Is somebody lying? Or could we see significant increases in wait times with only modest increases in traffic?

To help explore this question, I wrote a small example simulator in Python which you can check out.

It starts off by building a model of each skier, who can be in one of three states: skiing down the hill, queuing, or riding up on the lift:

      +-------------------------------------------+       
      |                                           |       
      v                                           +       
+-------------+      +-------------+      +-------------+
|   Waiting   |----->| Riding Lift |----->|   Skiing    |
+-------------+      +-------------+      +-------------+

Then, it models the chair fairly explicitly, pulling folks from the queue and delivering them to the top of the mountain after a delay. Each skier, lift, and slope creates some events, which the simulation simply reacts to in virtual time order. The whole thing comes out to about 170 lines, with loads of comments.

That’s simple enough, but can we learn anything from it?

It turns out that, despite the extreme simplicity of the model, the results are interesting and run a little bit counter to our intuition. From example, here’s the result showing the percentage of time each skier spends skiing, versus the number of virtual skiers in our simulation:

I suspect that most people’s intuition would have this as a fairly linear relationship, and the pronounced knee in the curve would be a surprise. I don’t know what the realities are of ski resort attendance, but these simulations do suggest that its plausible that small increases in attendance could lead to long wait times.

As another example, my post on Serial, Parallel and Quorum Latencies is powered by a simple simulator.

It’s exactly these kinds of small insights that bring me back to building small simulators over and over.

How do I get started?

Start simply. You can use any programming language you like (I tend to reach for Python first), don’t need to learn any frameworks or libraries (although there are some good ones), and often don’t have to write more than a few tens of lines of code. The coding side, in other words, is relatively easy.

The hard part is modeling. Simply, coming up with an abstract model of your system and its actors, and choosing what to include and what to exclude. What’s important and what’s irrelevant. What’s the big picture, and what’s detail. The success of simulations of all sizes depends on making good choices here.

Think about the ski lift example. I modeled skier speed variations, and lift speed variations, and the periodic arrival of chairs. I didn’t model weather, or fatigue, or lunch time, or any one of many other factors that could change the result. Are those important? Maybe! But to answer our core question (“is it plausible that small increases in visits could lead to long increases in waiting?”) it didn’t seem like we needed to include them.

Then, when you have the model, convert it to code. I like to do this as literally and straightforwardly as possible. It’s very attractive to build in some abstraction that simplifies the code at the cost of obscuring the model. I avoid that as much as possible: being able to correlate the model and the code seems important to helping other people understand the assumptions. Our goal is to make the model and its assumptions obvious, not obscured.

Finally, explore and test. Play with the parameters and see what happens. Compare your intuition to the results. Look at the data coming out of the simulation. Try simple cases and check if they match. Validate against real systems if you can. How much effort to spend here depends a lot on how much is riding on the simulation being exact, but at least some validation is always warranted.

But what about….

Simple simulations aren’t the last word in computational or numerical methods. You can write simulations that are arbitrarily sophisticated, very carefully validated, and exquisitely crafted. Depending on what you’re trying to do that may be worth the effort. But I’ve seen a lot of people avoid reaching for simulation at all under the assumption that they have to be sophisticated. Often, you don’t. In the majority of cases I’ve seen, the results are robust, validation is fairly simple, and simplicity beats sophistication. Don’t let the depth of the field dissuade you from getting started.

Footnotes

  1. I was also one of the authors on How Amazon Web Services Uses Formal Methods which appeared in CACM back in 2015. Also check out the introduction/framing Leslie Lamport wrote in the same issue: Who Builds a House Without Drawing Blueprints?
  2. These succinct descriptions of safety and liveness come from Defining Liveness by Alpern and Schneider, which is well worth reading if you’re interested in going deeper on what liveness means.
  3. For example, modelling the durability of replicated and erasure-coded storage systems can be done fairly easily in closed-form (see, for example Notes on Reliability Models for Non-MDS Erasure Codes). The benefit is that the models are nice and clean and can be thrown in a spreadsheet. The downside is that they get complex quickly when you try include things like non-MDS erasure codes, correlated failure, and so on. The messy realities of life complicate modelling.
  4. Often this increase in traffic has been blamed on lower pass or ticket prices, which seems reasonable to believe. On the other hand, the same folks often complain about how expensive skiing has become. Clearly, the sport is both too cheap and too expensive, a real challenge for resort operators!