RFC 9562 defines UUID Version 7. This has made a lot of people very angry and been widely regarded as a bad move[1]. More seriously, UUIDv7 has received a lot of criticism, despite seemingly achieving what it set out to do.
The legitimate criticism seems to be on a few points. V7 UUIDs:
E6EE7F40... format doesn’t work, because of deterministic first digits.
Before thinking about how we might fix these issues, let’s understand why folks are drawn to UUIDv7. Most of the use-cases I see are related to increasing database insert performance. To quote the RFC:
Time-ordered monotonic UUIDs benefit from greater database-index locality because the new values are near each other in the index. As a result, objects are more easily clustered together for better performance. The real-world differences in this approach of index locality versus random data inserts can be one order of magnitude or more.
This effect is very real. Random DB keys like UUIDv4 destroy spatial locality (as I’ve written about before), making database caches less effective, almost always reducing insert performance, and reducing query performance where queries have substantial temporal locality. The slight upside to this is that they also avoid hot spotting in distributed or sharded architectures.
Can we both have good insert performance and avoid the downsides of UUIDv7? Yes, I believe we can.
Let’s keep the overall format from the RFC:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             H(unix_ts_ms | unix_ts_ms >> N | id)              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          ..........           |  ver  |        rand_a         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|var|                          rand_b                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            rand_b                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Instead of the unix_ts_ms field being the unmodified ts_ms, let’s replace it with: H(unix_ts_ms | unix_ts_ms >> N | id), where H is a cryptographic hash function, N is a parameter that trades off locality and leading entropy, and id is an arbitrary identifier for some unit of infrastructure (e.g. a database cluster ID, customer ID, region ID). The choice of N is a trade-off between ID spread and database cache locality. For most installations and query patterns, I’d expect N >= 12 to restore full insert performance.
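To get a feel for N: shifting a millisecond timestamp right by N bits gives a value that changes once every 2^N milliseconds, so N = 12 corresponds to windows of 4096 ms, roughly four seconds. The sketch below does only that arithmetic. The helper names window_ms and bucket are hypothetical, chosen for illustration, and it assumes the locality comes from the coarsened unix_ts_ms >> N term.

```python
def window_ms(n: int) -> int:
    """Number of consecutive millisecond timestamps sharing one value of ts >> n."""
    return 1 << n


def bucket(unix_ts_ms: int, n: int) -> int:
    """Coarsened timestamp: constant within a window of 2**n milliseconds."""
    return unix_ts_ms >> n


if __name__ == "__main__":
    # Larger N means longer windows: more locality, less spread.
    for n in (0, 8, 12, 16, 20):
        print(f"N={n:2d}: coarsened timestamp changes every {window_ms(n) / 1000:.3f} s")
```

Within one window every timestamp maps to the same bucket, which is what lets inserts issued close together in time land close together in the index.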
The use of an arbitrary id along with a cryptographic hash function allows providers to choose the radius that they issue UUIDs over. An empty id would produce a single global stream of UUIDs. A fine-grained per-client id would produce a per-client stream of UUIDs, at the cost of some locality. Per-server, per-cluster, per-AZ, and other scopes for ids add flexibility to trade off between the locality advantages and disadvantages of UUIDs.
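A minimal sketch of how such an ID could be assembled in Python, under several assumptions that are not in the text: H is SHA-256 truncated to 48 bits, the hash input is the coarsened timestamp (unix_ts_ms >> N) followed by the id bytes, and ver and var are set to 7 and 0b10 as in the RFC layout. The formula above also lists the full unix_ts_ms; how that term is combined is not spelled out, so it is left out here. Treat this as one reading, not a reference implementation. The names hashed_prefix, make_uuid, and scope_id are hypothetical.

```python
import hashlib
import os
import time
import uuid
from typing import Optional


def hashed_prefix(unix_ts_ms: int, n: int, scope_id: bytes) -> bytes:
    """48-bit prefix: truncated SHA-256 over the coarsened timestamp and the id.

    Assumption: only unix_ts_ms >> n and scope_id are hashed (see lead-in).
    """
    coarse = (unix_ts_ms >> n).to_bytes(8, "big")  # fixed width, so no ambiguity with scope_id
    return hashlib.sha256(coarse + scope_id).digest()[:6]


def make_uuid(scope_id: bytes = b"", n: int = 12,
              unix_ts_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUID with the RFC layout, but a hashed value in the timestamp field."""
    if unix_ts_ms is None:
        unix_ts_ms = time.time_ns() // 1_000_000
    # 6 bytes of hashed prefix, then 10 random bytes for rand_a and rand_b.
    raw = bytearray(hashed_prefix(unix_ts_ms, n, scope_id) + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70  # ver field = 7 (assumed)
    raw[8] = (raw[8] & 0x3F) | 0x80  # var field = 0b10
    return uuid.UUID(bytes=bytes(raw))


if __name__ == "__main__":
    print(make_uuid(b"cluster-1"))
    print(make_uuid(b"cluster-1"))  # same leading 12 hex digits if issued in the same window
    print(make_uuid(b"cluster-2"))  # different scope, different prefix
```

Under these assumptions, IDs issued within the same 2^N ms window and the same scope share their first 48 bits, while a different scope or a later window produces an unrelated prefix.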
Depending on how you read the RFC, this may be allowed by the letter of the law. It does say:
Implementations MAY alter the actual timestamp.
But it does seem to clearly violate the spirit. Still, I think this is a UUID format that avoids a lot of the downsides of UUIDv7, while keeping most of the database performance benefits. As for whether you should use UUID entropy for security, that’s a different topic.
Footnotes