Marc's Blog

About Me

My name is Marc Brooker. I like to build things that work, and do cool stuff. I like building big things. I also dabble in machining, welding, cooking, and skiing.

I am an engineer at Amazon Web Services (AWS) in Seattle, where I work on agentic AI, especially safety and policy for agentic AI. Before that, I worked on EC2, EBS, databases, serverless, and serverless databases.
All opinions are my own.

Links

My Publications and Videos
@marcbrooker on Mastodon @MarcJBrooker on Twitter

Is this blog written by AI?

Not Just Scale

Bookmarking this so I can stop writing it over and over.

It seems like everywhere I look on the internet these days, somebody’s making some form of the following argument:

You don’t need distributed systems! Computers are so fast these days you can serve all your customers off a single machine!

This argument is silly and reductive.

But first, let’s look for the kernel of truth.

One Machine Is All You Need?

This argument is based on a kernel of truth: modern machines are extremely powerful, can do vast amounts of work every second, and can fit all the data belonging to even some rather large businesses in memory. Thousands, or even millions, of requests per second are achievable. Hundreds of gigabits per second. Terabytes of memory, and even more storage. Gigabytes per second of storage bandwidth. Millions of IOPS. Modern machines are super fast, and software which can take advantage of that speed can achieve incredible things.

It’s also true that many systems are distributed thoughtlessly, or wastefully, or in ways that increase complexity and reduce efficiency.

At the time I’m writing this, EC2 offers single instances with 32TiB of memory and 896 vCPUs, and 200Gbps of network bandwidth.

Many very important workloads can fit on one such machine.

Or could, if scale was all we cared about.

It’s Not Just Scale

Scale, and scalability, is only a small part of the overall reason distributed systems are interesting. Other practical reasons include:

Availability. Systems made up of multiple redundant components can offer levels of availability that single-machine systems simply can’t match. Distributed systems achieve exponentially better availability at linear cost, in a way that’s almost entirely hands-off. Single-machine systems can only achieve higher availability by increasing time-to-failure or reducing time-to-recovery, which can be both complex and expensive.
Durability. Making multiple copies of data on multiple machines is the only credible way to make online data highly durable. Single-system approaches like RAID are based on a fundamentally flawed assumption about failure independence. Offline data can be made durable with offline backups, of course.
Utilization. Multi-tenant systems achieve lower costs and higher utilization by packing multiple diverse workloads onto the same resources. This works in two ways: it allows workloads on different seasonal cycles to share resources in the shorter-term, and allows systems with unpredictable load spikes (or failure recovery spikes) to share spare burst resources. This allows both faster recovery, and significantly improved peak-to-average ratios. In many practical systems the efficiency wins from improving the peak-to-average ratio exceed the opportunities to tune single-machine systems for greater efficiency.
Latency. By being able to spread short-term spikes of load over a larger pool of resources, distributed systems can reduce tail latencies caused by short-term system overload.
Specialization. Distributed systems made of multiple components allow those components to be specialized for workloads that are latency-sensitive, throughput-sensitive, locality-sensitive, compute-intensive, memory-intensive, or whatever other unusual properties we had. A great example is from the MemoryDB paper where we saw how composition of specialized components allowed the overall system to both significantly reduce memory demand and bring down tail latency.
Isolation. Building systems out of multiple components allows components to be optimized for the security properties that matter to them. For example, notice how in the Firecracker paper, Lambda’s architecture combines strongly-isolated components that execute customer code, with multi-tenant components that perform simple data-lookup tasks.
Changes. Perhaps the most universal requirement in systems, and one that is frequently overlooked, is the ability to deal with change. Deploying new code as the business evolves. Patching security issues. Reacting to customer issues. In distributed systems, its typical to take advantage of the high availability mechanisms of the system to allow for safe zero-impact patching and deployments. Single-machine systems are harder to change.

These properties allow systems to achieve something important: simplicity.

Simplicity is a System Property

It is trivial to make any component in a system simpler, by moving its responsibilities to other parts of the system. Or by deciding that some of its responsibilities are redundant. It is common to see reductive views of simplicity that consider only part of a system’s responsibilities, dismissing important requirements or ignoring the way they’re actually achieved.

Let’s consider deployments as an example. In many distributed designs, deployments work by replacing or re-imaging machines when changes need to be made. Often, this uses the same mechanisms that ensure high-availability: traffic is moved away from a machine, changes are made and validated, and traffic returns. Single-machine deployments are typically harder to change: changes must be made online, to a running system, or under the pressure of downtime. Validating changes is difficult, because it’s all or nothing. The problems of single-machine deployments are solvable, but typically at the cost of higher system complexity: complex operational procedures, skilled operators, high judgement, coordination with customers, etc. It’s easy to ignore this complexity when admiring the simplicity of a single machine deployment. In the moment we look at it, none of this system complexity is visible.

Simplicity is a property of systems, not components. Systems include people and processes.

Another trap in the simplicity debate is confusing simple with familiar. Years of using Linux may make system administration tasks feel simple. Years of using IaC frameworks may make cloud deployments feel simple. In reality, both are rather complex, but its easy to conclude that the one we’re more familiar with is the simpler one.

Of course, scale also matters in real systems, in a number of ways. One of those ways is organizational scale.

Scaling Organizations

Just like computer systems, organizations scale by avoiding coordination. The more the organization needs different pieces to coordinate with one another to work, the less it is going to be able to grow. Organizations that wish to grow without grinding to a halt need to be carefully design, and continuously optimized, to reduce unnecessary coordination. Approaches like microservices and SoA are tools that allow technical organizations to avoid coordinating over things like data models, implementation choices, fleet management, tool choices, and other things that aren’t core to their businesses. APIs are, fundamentally, contracts that move coordination from human-to-human to system-to-system, and constrain that coordination in ways that allow systems to handle it efficiently.

You might be able to run all your business logic on a single box, but as your organization grows you’ll likely find the coordination necessary to do that slows you down more and more.

Finally, scale does matter.

The Scale Ceiling

As a business owner, there’s nothing quite like the joy and misery of a full store. Joy, because its an indication of a successful business. Misery, because a larger store would have been able to serve more customers. The queue out the door is turning people away, and with those people go their business. Opening a second location could take months, as could adding space. The opportunity is slipping away.

A smart business needs to be correctly scaled. A hundred thousand square feet is too much for a taco truck. All that space is expensive, and distracting. Fifty square feet is too few for a supermarket. Folks can barely get into the door. A pedestrian bridge and a train bridge are built differently. Scale matters, both up and down.

This isn’t a hard idea. It’s right at the soul of what engineering aims to achieve as a field. The smartest thing that new engineers can do is focus on the needs of their businesses. Both now and in the future. Learn what drives the costs and scalability needs of your business. Know how it makes money. Understand the future projections, and the risks that come with them. Ignore the memes and strong opinions.

Marc's Blog

About Me

Links

Not Just Scale

Similar Posts

Something Completely Different