A common question I get at work is “how do I learn to build big distributed systems?”. I’ve written replies to that many times. Here’s my latest attempt.
Learning how to design and build big distributed systems is hard. I don’t mean that the theory is harder than any other field in computer science. I also don’t mean that information is hard to come by. There’s a wealth of information online, many distributed systems papers are very accessible, and you can’t visit a computer science school without tripping over a distributed systems course. What I mean is that learning the practice of building and running big distributed systems requires big systems. Big systems are expensive, and expensive means the stakes are high. In industry, millions of customers depend on the biggest systems. In research and academia, the risks of failure are different, but no less immediate. Still, despite the challenges, doing the work and making mistakes along the way is the most effective way to learn.
Learn through the work of others
This is the most obvious answer, but still one worth paying attention to. If you’re academically minded, reading lists and lists of best papers can give you a place to start to find interesting and relevant reading material. If you need a gentler introduction, blogs like Adrian Colyer’s Morning Paper summarize and explain individual papers, and are also a great way to discover important ones. There are a lot of distributed systems books I love, but I haven’t found an accessible introduction I particularly like yet.
If you prefer to start with practice, many of the biggest distributed systems shops on the planet publish papers, blogs, and talks describing their work. Even Amazon, which has a reputation for being a bit secretive with our technology, has published papers like the classic Dynamo paper, recent papers on the Aurora database, and many more. Talks can be a valuable resource too. Here’s Jaso Sorenson describing the design of DynamoDB, Holly Mesrobian and me describing a bit of how Lambda works, and Colm MacCarthaigh talking about some principles for building control planes. There’s enough material out there to keep you busy forever. The hard part is knowing when to stop.
Sometimes (as I’ve written about before) it can be hard to close the gap between theory papers and practice papers. I don’t have a good answer to that problem.
Get hands-on
Learning the theory is great, but I find that building systems is the best way to cement knowledge. Implement Paxos, or Raft, or Viewstamped Replication, or whatever you find interesting. Then test it. Fault injection is a great approach for that. Make notes of the mistakes you make (and you will make mistakes, for sure). Docker, EC2 and Fargate make it easier than ever to build test clusters, locally or in the cloud. I like Go as a language for building implementations of things. It’s well-suited to writing network services. It compiles fast, and makes executables that are easy to move around.
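To make the fault injection idea concrete, here’s a minimal sketch in Go of the kind of thing I mean. The Transport interface and FlakyTransport names are made up for illustration, not taken from any particular library: the point is to route every message between your nodes through one wrapper that can drop or delay it, with a seeded random source so failing runs are reproducible.

```go
// flaky.go: a minimal fault-injection wrapper for message delivery.
// Transport and FlakyTransport are hypothetical names used for illustration;
// the idea is to concentrate all the unreliability in one place so your
// Paxos/Raft/VR implementation can be tested against it.
package flaky

import (
	"math/rand"
	"time"
)

// Transport delivers an opaque message to a named node.
type Transport interface {
	Send(to string, msg []byte) error
}

// FlakyTransport wraps a real Transport and injects faults:
// it silently drops some messages and delays others.
type FlakyTransport struct {
	Inner    Transport
	DropRate float64       // probability a message is dropped
	MaxDelay time.Duration // upper bound on injected delay
	rng      *rand.Rand
}

func New(inner Transport, dropRate float64, maxDelay time.Duration, seed int64) *FlakyTransport {
	return &FlakyTransport{
		Inner:    inner,
		DropRate: dropRate,
		MaxDelay: maxDelay,
		rng:      rand.New(rand.NewSource(seed)), // seeded so failures are reproducible
	}
}

func (f *FlakyTransport) Send(to string, msg []byte) error {
	if f.rng.Float64() < f.DropRate {
		return nil // dropped: the receiver never sees the message
	}
	if f.MaxDelay > 0 {
		time.Sleep(time.Duration(f.rng.Int63n(int64(f.MaxDelay))))
	}
	return f.Inner.Send(to, msg)
}
```

Wrap your cluster’s real network layer in something like this during tests, turn the drop rate up, and see whether your consensus implementation still makes progress and stays safe.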
Go broad
Learning things outside the distributed systems silo is important, too. I learned control theory as an undergrad, and while I’ve forgotten most of the math I find the way of thinking very useful. Statistics is useful, too. ML. Human factors. Formal methods. Sociology. Whatever. I don’t think there’s shame in being narrow and deep, but being broader can make it much easier to find creative solutions to problems.
Become an owner
If you’re lucky enough to be able to, find yourself a position on a team, at a company, or in a lab that owns something big. I think the Amazon pattern of having the same team build and operate systems is ideal for learning. If you can, carry a pager. Be accountable to your team and your customers that the stuff you build works. Reality cannot be fooled.
Over the years at AWS we’ve developed some great mechanisms for being accountable. The wheel is one great example, and the COE process (similar to what the rest of the industry calls blameless postmortems) is another. Dan Luu’s list of postmortems has a lot of lessons from around the industry. I’ve always enjoyed these processes, because they expose the weaknesses of systems, and provide a path to fixing them. Sometimes it can feel unforgiving, but the blameless part works well. Some COEs contain as many great distributed systems lessons as the best research papers.
Research has different mechanisms. The goal (over a longer time horizon) is the same: good ideas and systems survive, and bad ideas and systems fall away. People build on the good ones with more good ideas, and the whole field moves forward. Being an owner is important.
Another tool I like for learning is the what-if COE or premortem. These are COEs for outages that haven’t happened yet, but could happen. When building a new system, try writing its first COE before the outage actually happens. What are the weaknesses in your system? How will it break? When replacing an older system with a new one, look at some of the older one’s COEs. How would your new system perform in the same circumstances?
It takes time
This all takes time, both in the sense that you need to allocate hours of the day to it, and in the sense that you’re not going to learn everything overnight. I’ve been doing this stuff for 15 years in one way or another, and still feel like I’m scratching the surface. Don’t feel bad about others knowing things you don’t. It’s an opportunity, not a threat.