Threat modeling from the security field, and business impact analysis from the continuity management field, are powerful and influential ways of structured thinking about particular kinds of problems. The power of threat modeling comes from its structure. By imposing a structure on the thought process, we reduce the number of things that we miss, and make the information more analyzable and accessible. Two popular classic tools for structuring threat modeling are STRIDE and DREAD, both originally from Microsoft. While, on the surface, the mnemonics appear cheesy (what is this, the high school science fair?), in practice they are easy to remember, easy to use, and relatively difficult to misunderstand.
Can we apply the same kind of structured thinking to analyzing the trade offs in distributed systems we design?
CALISDO is my first attempt at a mnemonic for doing STRIDE-style modeling of distributed systems designs.
For Consistency, the focus is on client-visible guarantees, in the sense typically used by distributed systems (i.e. more closely related to A and I than C of ACID). Key questions:
For Availability, the focus is on how clients experience the ability to interact with the system. This obviously includes classic CAP- and PACELC-style trade offs, but practical concerns are likely to be as important in real systems. Issues such as redundancy, failover, load balancing, infrastructure diversity and hardware quality can all have a significant influence on availability.
Latency focuses on how much time clients have to wait for operations to complete. Within datacenters, and local areas like AWS regions, compute and storage performance typically dominate. For systems spread around large areas of the world, such as websites and CDNs, networking and locality concerns may dominate. Latency analysis should consider both the happy case when everything is working, and degradation behavior under conditions such as failed dependencies and network packet loss. Issues such as data buffering, caching and pre-computation are also important to latency.
Data Integrity is critical to the client experience of distributed systems. Analysis in this area should include both end-to-end properties (such as checksums, error-correcting codes and authenticated encryption modes) and local properties (such as the BER and UBER of storage devices and channels). Key questions should cover how often corruption is expected to happen, how it is detected, and how its existence is communicated to clients. Integrity analysis should also recognized that some types of data (typically indexes and metadata) can have high leverage, effectively turning small amounts of corruption into wide-scale issues. Extra attention should be payed to integrity of this data.
Scalability looks at how the system's behavior changes as load increases. Attention should be paid to two parts of the scaling curve: the rise in goodput in response to load offered up to saturation, and the drop in goodput with increased load beyond saturation. Scalability should consider scale-up (larger hardware), scale-out (spreading load across components) and load allocation (how load is distributed across components). In nearly all real systems, scalability is limited by either bottlenecks in the architecture, or by hot-spotting caused by uneven distribution of load.
Durability deals with data loss. Many distributed systems are stateless, or only handle soft-state that may be lost without major cost. Falling into one of these two categories should be seen as a goal. Systems should only store durable state if they have no other sensible choice. Durability requires attention to the failure rates of individual storage components (hard drives, SSDs, etc) and redundancy used to handle these failures (replication, RAID, etc). Attention should be paid to the blast radius of failures, and potential causes of correlated failure (such as sharing an enclosure). Recognition of the role of latent failures on durability is very important.
Operational Cost covers both the human cost of operations, and the hardware and services cost of operating the system. On human operations, particular attention should be given to single points of failure, where a human operator needs to take action to prevent or end an outage. Human costs also include deployment, updates and upgrades and root-cause investigations. Hardware costs and services costs dominate the overall cost of operating large systems, but human costs often dominate for smaller systems. System designers must be aware of decisions that trade one operational cost for another.
All of these categories are inter-related. The theoretical tradeoffs between availability and latency are well known, but in practice also involve scalability and durability. Similarly, latency and scalability are often in tension: architectural decisions taken to improve scalability can often add latency. Operational costs are typically in tension with all the other categories. There are also some areas where different categories pull together. For example, good design decisions for durability are often similar to good decisions for availability and integrity.
It's also worth noting that CALISDO doesn't include many critical security properties. Analysis of the security properties of a system is also needed, and may interact with decisions around integrity, scalability and operational cost.
CALISDO isn't an exhaustive list of design concerns for distributed systems, but it seems like a good start at not forgetting everything obvious. Taking each aspect of a system design - along with the end-to-end system - and breaking it down into these categories, makes it harder to miss obvious deficiencies, mistakes, and oversights.