A classic cloud architecture is built of small clusters of nodes (typically one to nine[1]), with coordination used inside each cluster to provide availability, durability, and integrity in the face of node failures. Coordination between clusters is avoided, making it easier to scale the system while meeting tight availability and latency requirements. In reality, however, systems sometimes do need to coordinate between clusters, or clusters need to coordinate with a central controller. Some of these circumstances are operational, such as adding or removing capacity. Others are triggered by the application, where the need to present a client API that appears consistent requires either the system itself, or a layer above it, to coordinate across otherwise-uncoordinated clusters.
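A minimal sketch of this shape, with illustrative names rather than any particular system's API: writes coordinate only with the handful of replicas inside one cluster, while the choice of cluster is computed locally from the key, with no cross-cluster protocol.

```python
import hashlib

class Node:
    """A single replica; stores key/value pairs and acknowledges writes."""
    def __init__(self):
        self.data = {}

    def store(self, key, value):
        self.data[key] = value
        return True

class Cluster:
    """A few replicas; coordination happens only inside this boundary."""
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        # A write is durable once a majority of this cluster's replicas acknowledge it.
        acks = sum(1 for n in self.nodes if n.store(key, value))
        return acks > len(self.nodes) // 2

def owning_cluster(key, clusters):
    # No cross-cluster coordination: any front-end can compute the owner
    # from the key alone.
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return clusters[h % len(clusters)]

clusters = [Cluster([Node() for _ in range(3)]) for _ in range(4)]
owning_cluster("item-42", clusters).write("item-42", "payload")
```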
The costs and risks of re-introducing coordination to handle API requests or provide strong client guarantees are well explored in the literature. Unfortunately, other aspects of sometimes-coordinated systems do not get as much attention, and many designs are not robust in cases where coordination is required for large-scale operations. Results like CAP and CALM[2] provide clear tools for thinking through when coordination must occur, but offer little help in understanding the dynamic behavior of the system when it does occur.
One example of this problem is reacting to correlated failures. At scale, uncorrelated node failures happen all the time. Designing to handle them is straightforward, as the code and design are continuously validated in production. Large-scale correlated failures also happen, triggered by power and network failures, offered load, software bugs, operator mistakes, and all manner of unlikely events. If systems are designed to coordinate during failure handling, either as a mesh or by falling back to a controller, these correlated failures bring sudden bursts of coordination and traffic. Because correlated failures are rare, the system's reaction to them is typically untested at the scale at which it is operating when they do happen. This increases time-to-recovery, and sometimes requires that drastic action be taken to recover the system. Overloaded controllers, suddenly called upon to operate at thousands of times their usual traffic, are a common cause of long time-to-recovery outages in large-scale cloud systems.
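A back-of-the-envelope model makes the shape of the problem concrete. The fleet size, failure rate, and controller capacity below are invented for illustration, not drawn from any real system:

```python
nodes = 100_000              # fleet size (assumed)
steady_failure_rate = 1e-6   # per-node failures per second (assumed)
controller_capacity = 50     # recovery requests/second the controller is sized for (assumed)

# Steady state: recovery requests trickle in far below the controller's capacity.
steady_load = nodes * steady_failure_rate        # 0.1 requests/second

# Correlated event: 10% of the fleet needs the controller at once.
burst = 0.10 * nodes                             # 10,000 requests
drain_seconds = burst / controller_capacity      # ~200 seconds, ignoring retries

print(f"steady load: {steady_load:.1f} requests/second")
print(f"time to drain the burst: {drain_seconds:.0f} seconds")

# In practice retries, timeouts, and clients re-requesting make the real backlog
# grow faster than it drains, which is how the controller gets stuck overloaded.
```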
A related issue is the work that each individual cluster needs to perform during recovery or even scale-up. In practice, it is difficult to ensure that real-world systems have both the capacity required to run and spare capacity for recovery. As soon as a system can't do both kinds of work, it runs the risk of entering a mode where it is too overloaded to scale up. The causes of failure here are technical (load measurement is difficult, especially in systems with rich APIs), economic (failure headroom is used very seldom, making it an attractive target to be optimized away), and social (people tend to be poor at planning for relatively rare events).
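One mitigation is to make the recovery headroom explicit in code rather than implicit in a capacity plan, so foreground traffic can never consume it. The sketch below is a hypothetical admission controller, not a real API:

```python
class AdmissionController:
    """Partition a node's capacity so recovery work always has headroom."""

    def __init__(self, total_capacity, recovery_reserve):
        self.total = total_capacity        # total units of concurrent work
        self.reserve = recovery_reserve    # capacity never sold to foreground traffic
        self.foreground = 0
        self.recovery = 0

    def admit_foreground(self, cost=1):
        # Foreground requests may only use capacity above the reserve.
        if self.foreground + self.recovery + cost <= self.total - self.reserve:
            self.foreground += cost
            return True
        return False

    def admit_recovery(self, cost=1):
        # Recovery and scale-up work may dip into the reserve.
        if self.foreground + self.recovery + cost <= self.total:
            self.recovery += cost
            return True
        return False

    def release(self, cost=1, recovery=False):
        if recovery:
            self.recovery -= cost
        else:
            self.foreground -= cost
```

Enforcing the reserve at admission time makes it much harder for the headroom to be quietly optimized away, at the cost of idle capacity on days when nothing is failing.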
Another risk of sometimes-coordination is changing quality of results. It's well known how difficult it is to program against APIs which offer inconsistent consistency, but this problem goes beyond just API behavior. A common design for distributed workload schedulers and placement systems is to avoid coordination on the scheduling path (which may be latency- and performance-critical), and instead distribute or discover stale information about the overall state of the system. In steady state, when staleness is approximately constant, the output of these systems is predictable. During failures, however, staleness may increase substantially, leading the system to make worse choices. This may increase churn and stress on capacity, further altering the workload characteristics and pushing the system outside its comfort zone.
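A toy simulation illustrates the effect. A single placer (standing in for many placers that cannot see each other's recent decisions) sends each task to the node its stale snapshot says is least loaded; all parameters are invented for illustration:

```python
import random

def simulate(staleness, num_nodes=20, steps=2000, arrivals=20, service=10, seed=1):
    """Peak per-node load when the placer's view refreshes every `staleness` steps."""
    rng = random.Random(seed)
    load = [0] * num_nodes      # tasks actually in flight on each node
    view = list(load)           # the placer's (possibly stale) snapshot of that load
    peak = 0
    for t in range(steps):
        if t % staleness == 0:
            view = list(load)
        for _ in range(arrivals):
            # Send the task to whichever node *looked* least loaded when the
            # snapshot was taken; recent placements and completions are invisible.
            m = min(view)
            load[rng.choice([n for n in range(num_nodes) if view[n] == m])] += 1
        for n in range(num_nodes):
            # Each in-flight task completes this step with probability 1/service.
            load[n] -= sum(1 for _ in range(load[n]) if rng.random() < 1 / service)
        peak = max(peak, max(load))
    return peak

for s in (1, 5, 20, 100):
    print(f"view refreshed every {s:3d} steps: peak per-node load = {simulate(s)}")
```

In this toy model, a fresh view keeps the fleet roughly balanced, while an aging view herds every arrival in the window onto whichever nodes last looked idle, so peak per-node load (and the churn needed to recover from it) grows with staleness.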
The underlying cause of each of these issues is that the worst-case behavior of these systems may diverge significantly from their average-case behavior, and that many of these systems are bistable, with one stable state in normal operation and another at “overloaded”. Within AWS, we are starting to settle on some patterns that help constrain the behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination, independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability of systems to continue to operate as best they can when they aren’t able to coordinate.
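For example, a constant-work consumer might fetch and apply its entire configuration on a fixed cadence rather than reacting to deltas, so its workload is the same on a quiet day as during a mass failure. The sketch below assumes a hypothetical file path and routing API, not a specific AWS implementation:

```python
import json
import time

CONFIG_PATH = "/var/run/routes.json"   # full desired state, rewritten upstream on its own cadence
INTERVAL_SECONDS = 10

def install_route(route, target):
    # Placeholder for the real work; the important property is that applying a
    # route is idempotent, so re-applying an unchanged config is harmless.
    pass

def apply_full_config(config):
    # Apply everything, every time. The cost of a pass does not depend on how
    # much (or how little) changed since the last one.
    for route, target in config.items():
        install_route(route, target)

def run():
    while True:
        with open(CONFIG_PATH) as f:
            apply_full_config(json.load(f))
        time.sleep(INTERVAL_SECONDS)   # the same amount of work every interval, by design
```

The waste on a quiet day is the price of knowing that the worst possible day asks the system for exactly the same amount of work.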
More work is needed in this space, both in understanding how to build systems which strongly avoid congestive collapse during all kinds of failures, and in building tools to characterize and test the behavior of real-world systems. Distributed systems and control theory are natural partners.