Last week at PGConf NYC, I had the pleasure of hearing Andres Freund talk about the great work he's been doing to bring async IO to Postgres 18. One particular result caught my eye: a large difference in performance between forward and reverse scans, seemingly driven by read ahead1. The short version is that IO layers (like Linux's) optimize performance by proactively pre-fetching data ahead of the current read point in a file, so it's already cached when needed. Notably, most of these systems don't do this backwards. This leads to a big difference in performance between forward scans (where the pages are already in the cache when they're needed) and backward scans (where the database needs to block on IO to fetch the next page).
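To make the asymmetry concrete, here's a minimal sketch (in Python, and not anything from the talk) of the two access patterns: a forward scan reads in exactly the pattern kernel read-ahead is built for, while a backward scan over the same blocks defeats it. The `posix_fadvise` hint is optional and Linux-specific; on a cold cache, the backward scan will typically spend far more time blocked on IO.

```python
import os

PAGE = 8192  # 8 KiB blocks; any size works for the illustration


def scan_forward(path):
    """Read the file front to back, the pattern read-ahead is designed for."""
    with open(path, "rb", buffering=0) as f:
        # Optional hint (Linux-only): tell the kernel we'll read sequentially,
        # which typically widens its read-ahead window.
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while f.read(PAGE):
            pass


def scan_backward(path):
    """Read the same blocks back to front; prefetching rarely helps here."""
    size = os.path.getsize(path)
    with open(path, "rb", buffering=0) as f:
        last = (size - 1) // PAGE * PAGE
        for offset in range(last, -1, -PAGE):
            # Each read lands just *before* the previous one, so the kernel's
            # forward-looking prefetch never has the next block ready and the
            # read blocks on IO instead of hitting the page cache.
            f.seek(offset)
            f.read(PAGE)
```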
This led me to thinking more about a particular hypothesis behind many database designs: a temporal-spatial locality hypothesis2. Before we get there, let's talk about locality more generally, because it might be the single most important idea in database performance (and computer systems performance generally).
Almost all database systems take advantage of these forms of locality (temporal locality, where recently accessed data is likely to be accessed again soon, and spatial locality, where data near recently accessed data is likely to be accessed soon), and would lose significant performance without them.
Stacks of books could be written about these ideas. Stacks of books have been written about these ideas. We could talk about cache-oblivious algorithms, or non-polluting read and write instructions, or have an argument about linked lists. Instead, I want to zoom in on a particular idea in databases: the temporal-spatial hypothesis.
The hypothesis I mean has a definition something like this:
Temporal-spatial locality is the idea that data that was written at approximately the same time will be read at approximately the same time, and therefore should be stored near each other4.
Is the Temporal-Spatial Hypothesis true?
In a streaming system, where keys are written in order by a producer and read in order by subscribers, the temporal-spatial hypothesis is trivially true. Most real-world streaming systems are highly optimized based on this assumption, and choose on-disk formats and APIs that take advantage of it. Time-series, metrics, and observability systems mostly behave the same way. Readers in these systems are normally interested in a window of time, using the same concept of time that writers had.
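As a sketch of why this is so convenient (the class and method names here are mine, not any particular system's API): when records are stored in write order, a time-window read is a binary search for the edges followed by one contiguous, read-ahead-friendly scan.

```python
import bisect


class AppendOnlyLog:
    """Toy log: records are kept in write (== time) order."""

    def __init__(self):
        self.timestamps = []  # sorted for free, because we only ever append "now"
        self.records = []

    def append(self, ts, payload):
        self.timestamps.append(ts)
        self.records.append(payload)

    def read_window(self, start_ts, end_ts):
        # Find the window's edges, then return one contiguous slice. On disk
        # this is a single sequential range of pages: exactly what read-ahead
        # and large sequential IOs are good at.
        lo = bisect.bisect_left(self.timestamps, start_ts)
        hi = bisect.bisect_right(self.timestamps, end_ts)
        return self.records[lo:hi]
```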
Hash-based databases (like DynamoDB) reject the temporal-spatial hypothesis. Or at least don't optimize for it. When new rows are created, each is assigned a large random key (often by hashing the natural primary key), which spreads the rows over the storage space. This has huge write-time advantages, especially in mitigating write-time hot spots. They pay the cost at read time: an in-order scan of a table is no more efficient than a random scan.
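A sketch of the idea (the hash function and the sixteen partitions are stand-ins I picked, not DynamoDB's actual internals): the natural key is hashed, and the hash alone decides where the row lives, so keys written back-to-back land all over the keyspace.

```python
import hashlib

NUM_PARTITIONS = 16  # arbitrary; real systems have many more


def partition_for(natural_key: str) -> int:
    # Hash the natural key and use the hash to place the row.
    digest = hashlib.md5(natural_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Ten rows created back-to-back are scattered across partitions, which evens
# out write load, but means an in-order scan of order-0 .. order-9 hops all
# over storage instead of reading one contiguous range.
print([partition_for(f"order-{i}") for i in range(10)])
```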
Relational schemas that assign large surrogate keys (such as `uuid`s) to items similarly reject the hypothesis3. Distributed and sharded databases get the same hot-spot avoiding benefits, which can be significant. Single-box databases may get better write performance from reduced lock or latch contention, or worse write performance because of reduced cache effectiveness. `uuid` keys can significantly reduce read performance in both types of systems, again by defeating spatial locality and making effective cache sizes smaller. These schemas may still be a good choice for data modelling reasons. Many such schemas restore the temporal ordering, through a layer of indirection, by indexing on a timestamp column.
The flip side of that is that a lot of relational schemas use time-ordered primary keys (SERIAL, AUTO_INCREMENT, etc) even when reads aren't correlated in time. This leads to increased write-time contention with no particular read-time benefit. In turn, database systems spend significant effort weakening the guarantees around these time-ordered keys. These optimizations include making them not truly ordered (see CACHE in Postgres, for example), and giving them their own special isolation rules (see the rollback and visibility behavior of nextval in Postgres, for example).
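Here's a toy model of what a cached sequence trades away (an illustration of the behavior, not Postgres's implementation): each session grabs a block of values up front, so interleaved writers hand out values that are unique but no longer match commit order.

```python
import itertools

BLOCK = 32  # analogous to CACHE 32 on a sequence
_blocks = itertools.count(start=1, step=BLOCK)


class Session:
    """One database session holding a private cache of sequence values."""

    def __init__(self):
        self._next = next(_blocks)
        self._limit = self._next + BLOCK

    def nextval(self):
        if self._next >= self._limit:  # cache exhausted: grab a new block
            self._next = next(_blocks)
            self._limit = self._next + BLOCK
        val = self._next
        self._next += 1
        return val


a, b = Session(), Session()
# Two sessions inserting in turn get 1, 33, 2, 34, 3, 35: unique, roughly
# time-ordered, but not strictly ordered by commit, and with gaps if a
# session's unused cache is thrown away.
print([s.nextval() for s in (a, b, a, b, a, b)])
```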
I don’t have a lot of data to back this up, but my belief is that the temporal-spatial hypothesis is weakly true for OLTP workloads generally, but perhaps only to the extent that recent keys are hotter keys (the temporal hypothesis), rather than a strong correlation between keys written at similar times. However, there are some workloads for which it is a key optimization, and these workloads need to either use data structures optimized for this fact, or very carefully design their schemas with knowledge of their database system’s underlying data structures.
Attempting to be a little crisper
Let me attempt to be a little crisper about the temporal-spatial hypothesis, and how it differs from other common ideas about locality. Let's say we have keys $K = \{K_0, \ldots, K_i, K_{i+1}, \ldots\}$, and that for each key $K_i$ we have time since write $K^{W}_i$, time since the last access (read or write) $K^{A}_i$, and a probability $P_r(K_i)$ that it'll be read again soon.
Then, let's say we have the array $\omega$, which is the set $K$ ordered by $K^W$ (e.g. $\omega_i$ is the key written immediately after $\omega_{i-1}$). Writing $\omega^{A}_{i}$ for the time since the key $\omega_i$ was last accessed (the same quantity as $K^{A}$, just indexed in write order), our temporal-spatial hypothesis is, roughly:
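$$\omega^{A}_{i} \approx 0 \implies P_r(\omega_{i+1}) \gg P_r(\omega_{j}) \quad \text{for a randomly chosen } j$$

In words: having just read or written $\omega_i$ makes its write-order neighbors ($\omega_{i+1}$ here, and symmetrically $\omega_{i-1}$) much more likely to be read soon than a key picked at random. That's only a rough way to formalize it, but it captures the idea.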
Notice how, if the natural order of keys matches the write order of keys, and keys are stored in natural order, the spatial and temporal-spatial hypotheses are the same.
Footnotes
`O(log N)` worst-case lookups based on key equality.