In this morning’s re:Invent keynote, Matt Garman announced Aurora DSQL. We’re all excited (some of us extremely excited) to have this preview release in customers’ hands. Over the next few days, I’m going to write a few posts about what DSQL is, how it works, and how to make the best use of it. This post looks at the product itself, along with a little bit of a personal story.
The official AWS documentation for Aurora DSQL is a great place to start to understand what DSQL is and how to use it.
What is Aurora DSQL?
Aurora DSQL is a new serverless SQL database, optimized for transaction processing, and designed for the cloud. DSQL is designed to scale up and down to serve workloads of nearly any size, from your hobby project to your largest enterprise application. All the SQL stuff you expect is there: transactions, schemas, indexes, joins, and so on, all with strong consistency and isolation[5].
DSQL offers active-active multi-writer capabilities in multiple availability zones (AZs) in a single region, or across multiple regions. Reads and writes, even in read-write transactions, are fast and local, requiring no cross-region communication (or cross-AZ communication in single-region setups). Transaction commit goes across regions (for multi-region setups) or AZs (for single-region setups), ensuring that your transactions are durable, isolated, and atomic.
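To make that concrete, here’s a minimal sketch of what active-active looks like from the application’s point of view. The endpoints, credentials, and table are placeholders (DSQL actually uses IAM-based auth tokens rather than static passwords), but the shape is the point: both regional endpoints accept writes, statements run locally, and commit does the cross-region coordination.

```python
# A hypothetical two-region active-active setup: one endpoint per region,
# both writable. Hostnames, credentials, and the events table are placeholders.
import psycopg

def connect(host: str) -> psycopg.Connection:
    return psycopg.connect(
        host=host,
        dbname="postgres",
        user="admin",
        password="<auth token>",  # placeholder; DSQL uses IAM auth tokens
        sslmode="require",
    )

east = connect("your-cluster.dsql.us-east-1.on.aws")
west = connect("your-cluster.dsql.us-west-2.on.aws")

# Writes are accepted at both endpoints; each statement runs locally,
# and the cross-region work happens once, at commit time.
for conn, note in ((east, "written in us-east-1"), (west, "written in us-west-2")):
    with conn.cursor() as cur:
        cur.execute("INSERT INTO events (note) VALUES (%s)", (note,))
    conn.commit()  # durable, isolated, and atomic across regions
```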
DSQL is PostgreSQL compatible, offering a subset of PostgreSQL’s (huge) SQL feature set. You can connect with your favorite PostgreSQL client (even the psql CLI), use your favorite ORMs and frameworks, etc. We’ll be adding more PostgreSQL-compatible features over time, making it easy to bring your existing code to DSQL.
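Because DSQL speaks the PostgreSQL wire protocol, existing drivers and ORMs connect to it like any other PostgreSQL database. Here’s a minimal sketch with SQLAlchemy, using a placeholder endpoint and token:

```python
# Connecting with SQLAlchemy, exactly as for any other PostgreSQL database.
# The endpoint and token below are placeholders, not real values.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg://admin:YOUR_TOKEN@your-cluster.dsql.us-east-1.on.aws/postgres"
)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar_one())
```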
DSQL is serverless. Here, we mean that you create a cluster in the AWS console (or API or CLI), and that cluster will include an endpoint. You connect your PostgreSQL client to that endpoint. That’s all you have to do: management, scalability, patching, fault tolerance, durability, etc. are all built right in. You never have to worry about infrastructure.
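Creating a cluster programmatically looks roughly like the sketch below. This assumes boto3’s dsql client; treat the exact operation, parameter, and response field names as assumptions and check the SDK documentation.

```python
# Sketch: creating a DSQL cluster with boto3's "dsql" client. The parameter
# and response field names here are assumptions; consult the SDK docs.
import boto3

dsql = boto3.client("dsql", region_name="us-east-1")
cluster = dsql.create_cluster(deletionProtectionEnabled=False)
print(cluster["identifier"])  # cluster identifier, used in its endpoint hostname
```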
As we launch Aurora DSQL, we’re talking a lot about multi-region active-active, but that’s not the only thing it’s for. We built DSQL to be a great choice for single-region applications of all sizes, from a few requests per day to thousands per second and beyond.
A Personal Story
In 2020 I was working on serverless compute at AWS, spending most of my time with the great AWS Lambda team[1]. As always, I spent a lot of time talking to customers, and realized that I was hearing two consistent things from serverless and container customers:
Existing relational database offerings weren’t a great fit for the fast-moving, scalable world of serverless and containers. These customers loved relational databases and SQL, for all the reasons folks have loved relational for forty years, but felt a lot of friction between the needs of serverless compute and existing relational products. Amazon RDS Proxy helped with some of this friction, but the friction wasn’t going away.
Large, highly regulated AWS customers with global businesses were building applications across multiple AWS regions, but running into a tricky architectural trade-off. They could pick multi-region active-active (with DynamoDB Global Tables, for example), but lose out on SQL, ACID, and strong cross-region consistency. Or they could choose active-standby (with Aurora Global Database, for example), but lose the peace of mind of having their application actively running in multiple places, and the ability to serve strongly consistent data to customers from their closest region. These customers wanted both things.
At the same time, a few pieces of technology were coming together. One was a set of new virtualization capabilities, including Caspian (which could dynamically and securely adjust the resources allocated to a virtual machine), Firecracker[3] (a lightweight VMM for fast-scaling applications), and the VM snapshotting technology we were using to build Lambda SnapStart. We used Caspian to build Aurora Serverless v2[2], bringing vertical auto scaling to Aurora’s full feature set.
The second was EC2 time sync, which brings microsecond-accurate time to EC2 instances around the globe. High-quality physical time is hugely useful for all kinds of distributed system problems. Most interestingly, it unlocks ways to avoid coordination within distributed systems, offering better scalability and better performance. The new horizontal sharding capability for Aurora Postgres, Aurora Limitless Database, uses these clocks to make cross-shard transactions more efficient.
The third was Journal, the distributed transaction log we’d used to build critical parts of multiple AWS services (such as MemoryDB, the Valkey-compatible durable in-memory database[4]). Having a reliable, proven primitive that offers atomicity, durability, and replication between both availability zones and regions simplifies a lot of things about building a database system (after all, Atomicity and Durability are half of ACID).
The fourth was AWS’s strong formal methods and automated reasoning tool set. Formal methods allow us to explore the space of design and implementation choices quickly, and also help us build reliable and dependable distributed system implementations[6]. Distributed databases, and especially fast distributed transactions, are a famously hard design problem, with tons of interesting trade-offs, lots of subtle traps, and a need for a strong correctness argument. Formal methods allowed us to move faster and think bigger about what we wanted to build.
Finally, AWS has been building big cloud systems for a long time (S3 turns 19 next year, can you believe it?), and we have a ton of experience. Along with that experience comes an incredible pool of talented engineers, scientists, and leaders who know how to build and operate things. If there’s one thing that’s AWS’s real secret sauce, it’s that our engineers and leaders are close to the day-to-day operation of our services[7], bringing a constant flow of real-world lessons about how to improve our existing services and build better new ones.
The combination of all of these things made it the right time to think big about building a new distributed relational database. We knew we wanted to solve some really hard problems on behalf of our customers, and we were starting to see how to solve them.
So, in 2021 I started spending a lot more time with the database teams at AWS, including the incredibly talented teams behind Aurora and QLDB. We built a team to go do something audacious: build a new distributed database system, with SQL and ACID, global active-active, scalability both up and down (with independent scaling of compute, reads, writes, and storage), PostgreSQL compatibility, and a serverless operational model. I’m proud of the incredibly talented group of people that built this, and can’t wait to see how our customers use it.
One Big Thing
There are a lot of interesting benefits to the approach we’ve taken with DSQL, but there’s one I’m particularly excited about (the same one Matt highlighted in the keynote): the way that latency scales with the number of statements in a transaction. For cross-region active-active, latency is all about round-trip times. Even if you’re 20ms away from the quorum of regions, making a round trip (such as to a lock server) on every statement really hurts latency. In DSQL, local in-region reads can be as fast as 1.2ms, so adding 20ms on top of that would really hurt.
From the beginning, we took avoiding this as a key design goal for our transaction protocol, and we achieved it. In Aurora DSQL, you only incur additional cross-region latency on COMMIT, not for each individual SELECT, UPDATE, or INSERT in your transaction (from any of the endpoints in an active-active setup). That’s important, because even in the relatively simple world of OLTP, having tens or even hundreds of statements in a transaction is common. It’s only when you COMMIT (and then only when you COMMIT a read-write transaction) that you incur cross-region latency. Read-only transactions, and read-only autocommit SELECTs, are always in-region and fast (and strongly consistent and isolated).
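Here’s a sketch of what that means in practice. The endpoint, credentials, and accounts table are placeholders; the point is where the cross-region cost lands.

```python
# A multi-statement read-write transaction. Each statement below runs
# locally in-region; the cross-region round trip happens once, at commit.
# Endpoint, credentials, and schema are placeholders.
import psycopg

conn = psycopg.connect(
    host="your-cluster.dsql.us-east-1.on.aws",  # placeholder endpoint
    dbname="postgres",
    user="admin",
    password="<auth token>",  # placeholder; DSQL uses IAM auth tokens
    sslmode="require",
)

with conn.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (1,))               # local read
    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,)) # local write
    cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = %s", (2,)) # local write
conn.commit()  # the only step that pays cross-region latency
```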
In designing DSQL, we wanted to make sure that developers can take advantage of the full power of transactions, and the full power of SQL. Later this week I’ll share more about how that works under the covers. The goal was to simplify the work of developers and architects, and make it easier to build reliable, scalable systems in the cloud.
A Few Other Things
In Aurora DSQL, we’ve chosen to offer strong consistency and snapshot isolation. Having observed teams at Amazon build systems for over twenty years, we’ve found that application programmers find dealing with eventual consistency difficult, and exposing eventual consistency by default leads to application bugs. Eventual consistency absolutely does have its place in distributed systems[8], but strong consistency is a good default. We’ve designed DSQL for strongly consistent in-region (and in-AZ) reads, giving many applications strong consistency with few trade-offs.
We’ve also picked snapshot isolation by default. We believe that snapshot isolation[9] is, in distributed databases, a sweet spot that offers both a high level of isolation and few performance surprises. Again, our goal here is to simplify the lives of operators and application programmers. Higher isolation levels push a lot of performance tuning complexity onto the application programmer, and lower levels tend to be hard to reason about. As we talk more about DSQL’s architecture, we’ll get into how we built for snapshot isolation from the ground up.
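To illustrate what snapshot isolation buys the application programmer, here’s a small sketch (placeholder endpoint and schema again). In PostgreSQL terms, snapshot isolation corresponds to the REPEATABLE READ level, so the sketch requests it explicitly: the reader sees one consistent snapshot for its whole transaction, even while a concurrent writer commits.

```python
# Two connections to the same database: a reader holding a snapshot and a
# writer committing a change mid-way. Under snapshot isolation (PostgreSQL's
# REPEATABLE READ), the reader's view does not change within its transaction.
# Endpoint, credentials, and the accounts table are placeholders.
import psycopg

DSN = ("host=your-cluster.dsql.us-east-1.on.aws dbname=postgres "
       "user=admin password=YOUR_TOKEN sslmode=require")  # placeholder

reader = psycopg.connect(DSN)
writer = psycopg.connect(DSN)

with reader.cursor() as r:
    r.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
    r.execute("SELECT balance FROM accounts WHERE id = 1")  # snapshot taken here
    before = r.fetchone()

    with writer.cursor() as w:
        w.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")
    writer.commit()  # the writer's change is now committed

    r.execute("SELECT balance FROM accounts WHERE id = 1")
    after = r.fetchone()
    assert before == after  # the reader still sees its original snapshot
reader.commit()  # a new transaction after this point would see the update
```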
Picking a serverless operational model and PostgreSQL compatibility was also based on our goal of simplifying the work of operators and builders. Tons of folks know (and love) Postgres already, and we didn’t want them to have to learn something new. For many applications, moving to Aurora DSQL is as simple as changing a few connection-time lines. Other applications may need larger changes, but we’ll be working to reduce and simplify that work over time.
Footnotes

…REPEATABLE READ level on these tests.