An Advent CAPlendar.
Published 01 December 2025 at Yours, Kewbish. 17,773 words. Subscribe via RSS.
Introduction
For a couple years in high school, I dedicated my Decembers to Wastl’s Advent of Code, a daily series of Christmas-themed programming puzzles. The regular cadence of problems kept the completionist in me hooked, and the problems were satisfyingly challenging beyond my ability. I lacked the background to make it through most days after the first week or two on my own, so I’d sheepishly poke around the subreddit the next morning for hints. Through AOC, I learned a lot of Python builtins like counters, zips, and list comprehensions, and it was always a fun activity to look forward to during December.
Last year, I found out about a similar project called Advent of Distributed Systems. AODS breaks up the Fly.io Distributed Systems challenges into weekly miniprojects, aiming to get you through to the final Raft KV store by the end of the month.
This fall, I realized I had a growing paper backlog and a bucket list of startups and technologies I’d like to get around to reading someday, but never made the time for. Inspired by the subject matter of AODS and the cadence of AOC, I’m putting together my own holiday countdown.
This post is my 2025 Advent CAPlendar, exploring topics in distributed systems (hence the title pun), databases, formal verification, and security. I’m planning to read some foundational papers like DynamoDB and Zookeeper as well as some newer works out of MIT PDOS and Berkeley SkyLab. There are also some neat systems startups and protocols that I keep hearing about, so I’m also planning to glean what I can about how they work from their documentation. You can see a calendar of topics below1.
2025 Advent CAPlendar
I’m also trying something new and starting a popup newsletter to accompany this series. Signing up will get you a short weekend roundup list of the posts that week. To assuage the privacy-conscious: I’ll only use this for Advent CAPlendar-related posts, I’ll be running this by hand (no external data processors, etc.), and I promise I’ll be too busy studying for finals to think about spamming you otherwise!
1. The CAP and CAL Theorems
I figured it’d be most fitting to start this calendar by referencing the titular pun itself. The CAP theorem is the idea that a distributed system can only satisfy two of the following three guarantees:
- Consistency: reads return the most recently written value. This behaviour is also known as the linearizability consistency level.
- Availability: every request received by a non-failing node receives a response. Note that this doesn’t mean every request receives a response, which I think is the typical framing of uptime and high availability; it’s predicated here on the request being received by a non-failing node.
- Partition tolerance: the system keeps behaving as-is even if the network drops arbitrarily many messages between nodes.
The idea is that if you have a network partition, you can decide to either eschew consistency and just return responses, even if they’re not the latest values, or stick to your guns and not respond to requests, sacrificing availability.
The CAP theorem got picked on a bunch. The most obvious argument is that the pithy “pick 2 out of 3” framing conveniently sidesteps that you can’t ignore partition tolerance, or that you might end up with fewer than two properties. As well, if your system doesn’t fit the definitions of consistency (e.g. isn’t linearizable), the CAP theorem shrugs its shoulders and gives up — it won’t really tell you anything. Martin Kleppmann’s paper also highlights some vagueness in the definitions in the original CAP proof. This is summarized in Dominik Tornow’s blog post, which illustrates how the conjecture and proof are a bit like ships passing in the night, given their definitions don’t quite line up.
The CAP theorem could be a good rule of thumb, but I think its variants that focus on latency instead are far more interesting. PACELC is an extension of CAP, stating that under Partitioning, you have to choose between Availability and Consistency, Else you’ll have to make the tradeoff between Latency and Consistency. This makes sense, since in the absence of partitions stronger consistency models will still impose more acknowledgements and latency.
Kleppmann’s paper goes further, modelling availability in terms of operation latency to match more industry-standard intuitions of uptime SLAs. “If a service can sustain low operation latency, even as network delay increases dramatically, it is more tolerant of network problems than a service whose latency increases” is a framing that’s much easier to compare against real systems. He also defines some operation latency lower bounds in terms of network latency at different consistency levels, which match intuition nicely.
Almost a decade later, Edward Lee takes this even further, defining the CAL theorem. This trades off against Consistency, Availability, and network Latency, but at a more gradual level than the all-or-nothing assumptions of the CAP theorem. The CAL theorem subsumes the CAP theorem, which is defined as a special case. It’s a pity this isn’t more widely known, otherwise I’d have a cleaner title for this Advent series.
I think the neatest part of Lee’s paper is the Lingua Franca framework he introduces. It’s a coordination language for distributed, realtime software for cyber-physical systems like autonomous vehicles. The key feature I picked up on was the fault handlers that trigger when you hit some apparent latency bounds, in which you can specify callbacks and explicitly define whether you’d like to relax your consistency or availability properties. If you choose to sacrifice availability, it provides centralized coordination primitives; if not it provides decentralized coordination with error bounds on clock synchronization to relax consistency. Lee has a video tutorial for how to do so here. It looks like a nice abstraction layer to keep the CAL theorem top of mind while developing a distributed system.
I think the most valuable part of the CAP/CAL theorems for me is the quick gut-check mental model. The CAP theorem makes instinctive sense. I also had the opportunity to attend Lee’s Distinguished Lecture at UBC, and it was just as clearly laid out and intuitive. Writing this up has highlighted some of the caveats of each model, though, which I’ll need to remember to keep in mind.
2. I’m FLPping Out
The CAP theorem is one of the classic impossibility results in distributed systems, and back at DARE last year I learned about another: the FLP. Unlike the CAP theorem, FLP’s initialism doesn’t serve as a neat mnemonic: it stands for Fischer, Lynch, and Paterson, all very storied researchers that I’d recommend checking out if you have some time.
The FLP result states that no asynchronous consensus algorithm can guarantee termination if there’s even one silently faulty process. Asynchronous in this case means that we can’t bound the amount of time it takes to pass messages, and the paper also assumes no access to synchronized clocks (this would make things easier with timeouts). Consensus means that all non-faulty nodes decide on the same value, like 1 in the set {0, 1}, for instance. If you turn the FLP on its head a bit, it shows that it’s impossible to tell whether a silent process has failed and the system will never terminate, or whether latency’s just high.
They model consensus via one big message buffer system, shared by all processes and requiring explicit send/receive operations. They define a configuration as the set of processes running, their internal states, and this message buffer. A reachable config is a config with some series of events applied on top. They outlaw trivial protocols that always decide the same value, and define a concept of ‘partial correctness’ meaning that no reachable configuration has more than one decision value, and that for each of 0 and 1, some reachable configuration decides that value. I’ll hand-wave a bit below so look into the paper if you feel unconvinced.
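To make the model a little more concrete, here’s a toy rendering in Python of a configuration and a single step; the state representation, the transition-function signature, and the coin-flip delivery are my own illustrative choices rather than anything from the paper.

```python
import random

class Config:
    """A configuration: every process's internal state plus the shared message buffer."""
    def __init__(self, states):
        self.states = states  # process id -> internal state
        self.buffer = []      # multiset of (destination, message) pairs in flight

def step(config, process, transition):
    """One event: `process` attempts to receive a message and applies its transition.
    The buffer is allowed to return nothing even when messages are pending (asynchrony),
    as long as every message is eventually delivered if receives keep being attempted."""
    pending = [m for m in config.buffer if m[0] == process]
    msg = random.choice(pending) if pending and random.random() < 0.5 else None
    if msg is not None:
        config.buffer.remove(msg)
    new_state, outgoing = transition(config.states[process], None if msg is None else msg[1])
    config.states[process] = new_state
    config.buffer.extend(outgoing)  # messages sent by this step, as (destination, message)
    return config
```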
The proof proceeds as follows:
- Lemma 1: Given two sequences of events that involve disjoint sets of processes, you can apply them to the same starting configuration in either order and you’ll end up with the same final configuration, because the sets of processes doing work are entirely distinct.
- Lemma 2: Any such protocol has an initial configuration that can eventually lead to a configuration that’s decided 0 and, via a different schedule, to a separate configuration that’s decided 1 — this is called an initial bivalent config. Bivalent means both 0 and 1 are still reachable outcomes, so nothing has been decided yet, which is bad news.
- Lemma 3: From a bivalent config C and any applicable event e, there’s some config reachable from C where applying e still leaves you in a bivalent config. You can see where this is going now.
- The main result: we have to start with a bivalent config, and from any bivalent config the schedule can be steered into another bivalent config, so we can forever avoid reaching a univalent, decided configuration. This means there can’t be such a protocol that always terminates.
The paper also includes an example protocol that does achieve asynchronous terminating consensus if we assume a fixed number of live processes and that processes can’t die and become faulty during consensus. This consensus protocol creates a clique of known live nodes, where each node uses information received from only the other clique members to deterministically decide. This does terminate but is significantly more constrained.
I like that in the conclusion they mention “These results do not show that such problems cannot be “solved” in practice; rather, they point up the need for more refined models of distributed computing […]”. It’s quite a poetic takeaway. Something similar arose with the extensions to the CAP theorem, and indeed I can find plenty of recent work still iterating on modelling the problem and defining the impossibility boundary: here, here, and here, just to name a few.
The FLP paper itself is very short and approachable, so I’d recommend giving it a direct read! The definitions aren’t too onerous to wade through and the contradiction structure follows quite intuitively. The paper is quite theoretical and technical, but I need to get better at understanding and eventually producing work like this to do more convincing research.
3. Consistency Canvases
In my first week of my internship at the Sorbonne, I was talking about consistency models with my mentor, who promptly whipped out an annotated, thick paper with crazy-looking diagrams illustrating the different consistency anomalies given certain models. He offhandedly said he could send it to me so I could give it a read later, but just the brief glimpse of the diagrams scared me off enough to never press the topic while I was in Paris. However, I was thinking about consistency again while brainstorming ideas for this calendar, and this certainly feels like a paper I’d put off, so I asked him if he still remembered which paper it was. I was uncertain how to feel when the PDF landed in my inbox, but I wanted to keep an open mind.
The terrifying behemoth in question is Cerone et al.’s A Framework for Transactional Consistency Models with Atomic Visibility, which I was happy to learn was only 14 pages (not a behemoth at all!) It formally defines consistency models (read atomic, snapshot isolation, etc.) for transactional systems. Prior work on formalizing models was one-off and scattered, “often tied to database internals”. Cerone et al. present a more unified, general approach that can be used for higher-level reasoning about applications using said transactional systems.
They define six axioms:
- INT (internal consistency): Within a transaction, sequential semantics are respected, so a read observes the value most recently written by the same transaction (or the value it first read), avoiding unrepeatable reads.
- EXT (external consistency): A transaction’s reads from the outside observe the value written by the latest visible transaction, as determined by the visibility and arbitration relations, avoiding dirty reads.
- TRANSVIS (transitive visibility): If a transaction T1 is visible to another, T2, and T2 is itself visible to another, T3, T1 should also be visible to T3.
- NOCONFLICT (no concurrent updates): Two concurrent transactions may not both modify the same object.
- PREFIX (prefix consistency): All transactions become visible in the same order defined by arbitration order; if T1 is observed by T2, T2 also observes all of T1’s arbitration relation predecessors.
- TOTALVIS (total visibility/serializability): The visibility relation is totally ordered.
The six consistency models they define all satisfy some of these axioms. For example, read atomic, the weakest level, only satisfies INT (internal consistency) and EXT (external consistency); snapshot isolation, the strongest non-serializable model, satisfies PREFIX (arbitration order prefix) and NOCONFLICT (no concurrent modifications) as well. In order, the six models they define are read atomic, causal consistency, parallel snapshot isolation, prefix consistency, snapshot isolation, and serializability.
The diagrams I saw were the illustrations of anomalies that can occur if certain axioms don’t hold. These are neatly formatted on page 6, which alongside page 10 lays out the key ideas of the paper. For instance, if TOTALVIS (total order/serializability) is not satisfied, then write skew can occur. This means that modifying two different objects (e.g. withdrawing from two separate accounts) while expecting a shared condition to hold (e.g. total balance over some threshold) can violate the condition unless there’s some total order. These diagrams were much clearer than I’d thought — they use concrete examples like bank accounts or posting comments to ground the anomalies.
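To make write skew a little more concrete, here’s a toy Python version of the classic two-account example (my own sketch, not something from the paper): two snapshot-isolated transactions with disjoint write sets each preserve the invariant against their own snapshot, yet break it together.

```python
# Invariant: the two accounts' combined balance must stay at or above zero.
db = {"x": 70, "y": 80}

def withdraw(snapshot, account, amount):
    """Runs against a private snapshot; returns a write set, or None if the check fails."""
    if snapshot["x"] + snapshot["y"] - amount >= 0:
        return {account: snapshot[account] - amount}
    return None

# Both transactions start from the same snapshot of the database.
snap1, snap2 = dict(db), dict(db)
w1 = withdraw(snap1, "x", 100)  # sees 70 + 80 = 150, so the check passes
w2 = withdraw(snap2, "y", 100)  # same snapshot, same check, also passes

# The write sets ({"x"} vs {"y"}) are disjoint, so NOCONFLICT lets both commit.
db.update(w1)
db.update(w2)
print(db, db["x"] + db["y"])    # {'x': -30, 'y': -20} -50: invariant violated
```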
Interestingly, their work also outlines some more implementation-driven algorithms that satisfy their framework’s models, “closer to the intuition of practitioners”. They prove how these operational specifications align with their axioms and properties. On a meta-level, they’re structured a bit like the ones in an ICDT submission we made for specifying database backends — straightforward and ready to be directly implemented. They define a set of constraints separate from the model axioms earlier in the paper, and present a similar mapping of consistency model to constraints needed. Each of the original axioms follows from some combination of these constraints: for example, the transitive visibility axiom follows from causal delivery because replicas will only receive a transaction’s logs after receiving its predecessors.
As I’d mentioned yesterday, I want to improve my appetite for theory-heavy work, so reading this paper was a good chance to exercise that skill again. It certainly wasn’t as intimidating as I thought it’d be. The consistency diagrams look complex at first glance but are fairly straightforward even with the formal lens. I’m itching to get back to some more applied research, though, which will be the focus of the next few days.
4. CRDTs Don’t Grow On Trees
On the topic of consistency, I wanted to read about some CRDTs: causal trees and replicated growable arrays. Both are very similar ideas for ensuring that concurrent edits to a doc (e.g. two friends working on a paper while both offline) are resolved in an orderly, meaningful way. Their basic variants operate on a per-character basis, but more advanced variants support editing ranges of text.
The main idea with causal trees is that each user is assigned an entity ID, and an idea similar to Lamport timestamps is used to track time. The start of the text is the root of the tree, and when a user adds a character, an edge is created from the character’s causal predecessor. If another user concurrently edits, the tree forks at the common causal predecessor and continues branching. A user’s timestamp, which is attached to each character they add, increments on each insertion and, when branches merge, jumps to the maximum timestamp across the merging branches. Intuitively, this means timestamps represent how much content a user has seen before they insert something else. I’d highly recommend checking out the visualizations on Farley’s post for an interactive look.
To merge branches, the head events are first sorted by decreasing timestamp (since this means the user saw more recent content as they were making their edit), then by decreasing entity ID (to deterministically create a total order). To recover the final text, we then take a depth-first pre-order traversal of the tree.
Deleting text uses a common CRDT strategy called tombstoning, where the inserted character is not actually removed from the tree, but just marked as a tombstone and skipped on final rendering. This is because if you hard-delete the node in the tree, no matter what design decision you take, you’ll end up with tradeoffs. Surely you can’t just delete the children of the node along with it, so this means you’ll have to reattach them somewhere. Reattaching to the root of the tree is easy and deterministic but doesn’t make semantic sense (characters would jump to the start of the text). Reattaching to the deleted node’s parent sounds like a good idea, but if there are concurrent operations, things might end up out of order across clients if the immediate ancestor hasn’t been received yet. Sort order would also change among the ancestor’s siblings. All this means it’s just easier to preserve the structure of the tree and skip the node later.
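Here’s a rough Python sketch of those mechanics as I understand them: insertion under a causal predecessor, sibling ordering by timestamp then entity ID, tombstoned deletes, and pre-order readback. It’s deliberately simplified and doesn’t follow the paper’s actual data formats.

```python
class Node:
    def __init__(self, char, timestamp, entity, parent=None):
        self.char, self.timestamp, self.entity = char, timestamp, entity
        self.parent, self.children, self.tombstone = parent, [], False

class CausalTree:
    def __init__(self, entity):
        self.root = Node(None, 0, None)  # sentinel for the start of the text
        self.entity = entity
        self.clock = 0                   # Lamport-style counter

    def insert(self, char, predecessor):
        """Insert `char` as a child of its causal predecessor node."""
        self.clock += 1
        node = Node(char, self.clock, self.entity, parent=predecessor)
        predecessor.children.append(node)
        # Siblings sort by decreasing timestamp, then decreasing entity ID.
        predecessor.children.sort(key=lambda n: (n.timestamp, n.entity), reverse=True)
        return node

    def delete(self, node):
        node.tombstone = True            # keep the node, just hide it on readback

    def observe(self, remote_clock):
        """When merging a remote branch, bump our clock past what we've seen."""
        self.clock = max(self.clock, remote_clock)

    def read(self):
        """Depth-first pre-order traversal, skipping tombstones."""
        out = []
        def visit(node):
            if node.char is not None and not node.tombstone:
                out.append(node.char)
            for child in node.children:
                visit(child)
        visit(self.root)
        return "".join(out)
```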
The original causal trees paper goes into more detail suggesting a data format based on Unicode to represent the timestamp numbers with special characters like ॐ representing the root of the tree. It gets pretty crazy: there’s a diagram for how each data format converts to others and what methods combine them. Said methods are implemented not in a typical programming language, but via PCRE regexes. The author makes a fair point for this: they assert that higher-level languages like JS would have too much overhead creating individual objects for each character due to GC, hence the need for a text-only format and portable regex manipulation. I enjoy a cheeky regex puzzle from time to time, so it was really neat seeing regexes be pushed here, but this was a bit of an arcane choice. There are also a lot of fiber-arts-related names for terms in the paper, which makes it a bit challenging to initially parse (in my head “weave” and “weft” occupy too similar of a semantic space).
There’s another very similar CRDT called a replicated growable array. Instead of a tree, though, it’s a flat list that’s easy to read back. Each edit is still tagged with a timestamp and entity ID, and concurrent inserts at the same position are ordered by those tags. In practice, it’s faster due to simpler traversal and implementation, so RGAs end up being used in production libraries like Automerge and Yjs.
Thinking through why tombstones were necessary over raw deletions was a fun challenge, and I also liked coming to terms with the odd vocabulary of the causal trees paper. Reading it first made understanding the RGA paper much easier, so I’d recommend this order. Neither paper has overwhelming amounts of formalism either, which I appreciated. I got surprisingly interested in these couple papers now that I’ve gone through the basic theory of CRDTs a few times, so I’m looking forward to continuing this trend a bit in tomorrow’s writeup.
5. A Fugue State
Yesterday, we discussed causal trees and replicated growable arrays, two text editing CRDTs. Both CRDTs have a problem, though: while the concurrent edits neatly resolve for forward edits (e.g. typing as you normally would), what happens for backwards edits (e.g. typing a character, then moving your cursor to the start of the text, typing another character, and so on)? This is an issue called interleaving: both causal trees and RGAs are fine with forward interleaving but are susceptible to backward interleaving. If we think of typing “bread” backwards with a causal tree, we’d end up with a shallow tree, with leaves for each character. If another, offline user concurrently types “cakes” backwards in the same document, we’ll end up with the same shallow tree situation, but the text will be interleaved since the timestamps are the same. You’ll end up with “bcraekaeds” — not quite ideal. Matthew Weidner and Martin Kleppmann’s Fugue paper tackles this with a new CRDT that limits both forward and backward interleaving.
Causal trees and RGAs prevent forward interleaving because of the tree/list structure preserving “inserted after” information, but there’s no structural relationship that preserves “inserted before”, leading to the backward interleaving. Fugue defines the concepts of “left” and “right” origins respectively: left origins track “inserted after” information and right origins “inserted before”. This comes together in the definition of maximal non-interleaving:
- if B is the first thing inserted after A (forward insertion), A and B should be consecutive in the final list.
- if A is the last thing inserted before B (hence B is A’s right origin, backward insertion), A and B should be consecutive in the final list, except where we might need to prioritize another concurrent forward-typing insertion.
Fugue is also tree-based, though its tree isn’t strictly binary. Its tree has the unique concept of “left/right children”: each child of a node must be designated left or right, but there can be multiple left or right children, and left/right children are sorted amongst themselves by ID. On insertion, if the left origin’s node has no right child, then the current insert is made a right child. This handles the forward-typing case. If the left origin’s node already has a right child, the current insert is made a left child of the right origin, handling the backward-typing case. To recover the final text, perform an in-order traversal of the tree.
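Here’s a minimal Python sketch of that insertion rule and the in-order readback; it follows the description above rather than the paper’s full pseudocode, so treat it as an approximation.

```python
class FugueNode:
    def __init__(self, char, node_id, parent=None, side=None):
        self.char, self.id = char, node_id     # node_id: any comparable ID, e.g. (counter, replica)
        self.parent, self.side = parent, side  # side is "L" or "R" relative to the parent
        self.left, self.right = [], []         # multiple children allowed on each side
        self.tombstone = False                 # deletes would tombstone, as with causal trees

def insert(char, node_id, left_origin, right_origin):
    """left_origin: the node we're typing after; right_origin: the node we're typing before."""
    if not left_origin.right:
        # Forward-typing case: become a right child of the left origin.
        node = FugueNode(char, node_id, parent=left_origin, side="R")
        left_origin.right.append(node)
        left_origin.right.sort(key=lambda n: n.id)
    else:
        # Backward-typing / concurrent case: become a left child of the right origin.
        node = FugueNode(char, node_id, parent=right_origin, side="L")
        right_origin.left.append(node)
        right_origin.left.sort(key=lambda n: n.id)
    return node

def read(node, out=None):
    """In-order traversal: left children, then the node itself, then right children."""
    out = [] if out is None else out
    for child in node.left:
        read(child, out)
    if node.char is not None and not node.tombstone:
        out.append(node.char)
    for child in node.right:
        read(child, out)
    return "".join(out)
```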
This tree will more or less stay binary, except in the case of concurrent inserts. In that case, there can be multiple left/right children, but the important thing is that the series of edits based on those left/right children stay together in a branch. Let’s take the “bread” and “cakes” edit example I considered for the causal trees/RGA case. In Fugue, the updates for typing either word backwards would become a branch of nodes, each a left child. Now, merging these two users’ updates would cause there to be two left children branches on the root node, so we’ll get “breadcakes” (or “cakesbread”, depending on the users’ entity IDs) as intended.
Non-interleaving thus holds intuitively, since concurrent inserts get separated into separate branches, and the in-order traversal makes sure runs of text are consecutive instead of interleaved. Some interleaving is still inevitable, but this happens only in limited edge cases with multiple interacting concurrent updates. In practice, this should rarely be of concern.
The maximal non-interleaving property is so named because it defines a unique order over the operations. This is important, since it means no other non-trivial ordering properties can be proven on top — this is as strong a statement about interleaving as you’ll ever make. Kleppmann et al. had previously proposed another interleaving property, which will never hold on any CRDT, and their prior CRDT proposal was proven to not converge in some cases.
As described earlier, Fugue will prevent backward interleaving, but still won’t satisfy maximal non-interleaving in some of these edge cases. The paper thus defines FugueMax, a slightly more complicated variant that sorts multiple right children in reverse order of their right origins instead of by lexicographic node ID. FugueMax is proven to support maximal non-interleaving, but the paper argues the extra complexity isn’t worth it.
In terms of performance, Fugue is fairly comparable to the SOTA Yjs in terms of network bandwidth, load times, throughput, etc. FugueMax fares a little worse but is still reasonable. One thing that struck me was that even Yjs has a lot of metadata overhead, on the order of tens of bytes per character insertion. This put into perspective just how performant modern systems are, if we can afford that overhead and remain perfectly usable.
I really liked the continuation between first looking at causal trees and RGAs, then realizing the edge cases and interleaving problems, then reading Fugue — it was like tracing through each logical step in the frontier of CRDT capability. The way each paper builds intuitively from the last makes the problem space seem very approachable. I think I read somewhere once that (paraphrased) “good papers make you feel like you’re making discoveries” and that’s certainly been true for this series.
6. Braid and Abet
I mentioned how the causal trees paper discussed two days ago made an almost off-putting number of weaving references. Today, we’re picking up the trend again with Braid!
Braid is a project working towards interoperable state synchronization, with the goal of making “read[ing] & writ[ing] distributed state as easily as a local variable”. They’re run as an Invisible College community and are actively developing the ecosystem, with an IETF draft, a Braid-based filesystem, Automerge support, and an example Chrome extension, just to name a few initiatives.
There are four major aspects of Braid-HTTP:
- Subscriptions attached to GET requests, for real-time synchronization
- Versioning via new Version and Parent headers, which extend ETag headers as a better indicator of time (as opposed to a content-based hash)
- Custom concurrent edit handling via Merge-Type headers
- More efficient patch updates, ability to patch a range of data
The spec draft lists these advances in a different order, but I’ve listed them in order of what’s most interesting to me. The demo, with very tangible updates, makes the first couple feel very magical, especially knowing all this is done over an HTTP extension rather than custom code.
The Merge-Type is an interesting space for customizability. They have one implementation available with Diamond Types, the fastest CRDTs available, written by one of the Braid spec draft coauthors, Joseph Gentle. Besides the performance boosts, from my understanding this is a pretty standard CRDT in theory. They have another implementation called a ‘simpleton’, which is a bit like an OT in that there’s some rebasing of peer edits. The idea is that CRDT-ifying everything is a waste if most data is read instead of written, and most writes are sequential instead of concurrent. Simpleton merging means that all conflicts resolve to the client’s version, and the server does the work of rebasing edits and bringing clients up to speed. Clients ignore all received versions that don’t include their current version in the history and send edits as patches to the server. This gets you consistency and simplicity but comes at the cost of clients being half-duplex, unable to read other peers’ updates within an RTT of writing, since the server needs to be in the loop to rebase.
They’ve recently been talking about a unification of OT and CRDTs called the time machine. It’s a “CRDT that implements OT, and is implemented via OT implemented via an internal CRDT”. At first I thought this was vaguely reminiscent of the Eg-walker paper, but then I put two and two together and realized Gentle coauthored both the paper and this spec. In this time machine framework, OTs provide the mechanism for allowing events to “time travel” and be applied out of order to different CRDT versions, and CRDTs let us reason about the different views of clients given distributed time. History is stored in a neutral format, and different peers can use different merge strategies while interoperating, since the OT-on-CRDT lets us transform remote histories for local usage.
I’d recommend first reading the Braid landing page and watching a demo video to contextualize what Braid can do, before reading this review of Braid by the causal trees author and the Braid community’s response to go into more detail. The time machine meeting summary can stand alone in case you’re less interested in the rest of the protocol. It’s neat to see that I can recognize names in the space (e.g. Gentle) and have enough of a knowledge base that I can connect across familiar work now. Overall, I’m looking forward to seeing where Braid goes, particularly the time machine CRDT piece!
7. The ElectricSQL Slide
I found out about ElectricSQL when I was in Paris, after looking up a sticker I saw in the lab. Surprise: my supervisor is also a technical advisor there. My first exposure to the company was its team page, where I realized they’d accumulated a good number of the CRDT Mafia2 as advisors and had received angel investments from a couple Ink & Switch-adjacent folks I knew. I figured if they’d managed to convince my supervisor to advise them, they must be working on something interesting.
Electric is a sync engine from Postgres DBs to client apps. It enables live reactivity, like Braid-HTTP, and boasts a wide range of framework integrations, from old favourites like React to CRDT interfaces with Yjs. It syncs subsets of data, which are called “shapes” in their framework, a bit reminiscent of the goals of Gritzko’s RDX. Shapes are defined with a table, with the option to specify a WHERE clause or specific columns — it’s like a database view. The WHERE clauses can tank performance (the docs say perf is inversely proportional to the number of Shapes) unless they include field = constant clauses, which have a special index.
Electric sits in front of Postgres, consuming the logical replication log and massaging it into a so-called Shape Log. The Shape Log is conceptually similar to Postgres’s log, but filtered for the relevant data. Clients will first request the entirety of this Shape Log, which might need to be split up over multiple requests. Electric remaps initial query results to insert operations so the shapes don’t have to be predefined. Once the client has processed the Shape Log and gets an ‘up-to-date’ header response, it switches into Live Mode, repeatedly issuing requests and holding each request open until new data is received or it times out. Clients are responsible for then taking the Shape Log and turning it into materialized data for the app, which can be done via the Typescript/Elixir clients or a hand-rolled implementation.
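As a rough picture of that request loop, here’s a hypothetical Python client; the endpoint, query parameters, header values, and response format are all illustrative placeholders rather than Electric’s actual API, and error handling is omitted.

```python
import requests

SHAPE_URL = "https://example.com/shape-log"  # hypothetical endpoint for a single Shape

def sync_shape(apply_op):
    offset = 0
    live = False
    while True:
        # During the initial sync this returns the next chunk of the Shape Log immediately;
        # in live mode the server holds the request open until new data arrives (long polling).
        resp = requests.get(SHAPE_URL, params={"offset": offset, "live": str(live).lower()})
        for op in resp.json():  # shape log entries: insert/update/delete operations
            apply_op(op)
            offset += 1
        if resp.headers.get("up-to-date") == "true":
            live = True         # caught up: switch from catch-up requests to long polls
```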
Auth is left to the user to worry about, a nice concession that allows for flexibility. There are two main patterns suggested: proxy-based auth and gatekeeper auth. Proxy-based auth does as advertised: you can add an authorization header to the client’s request, pass it to the proxy, validate it, and pass Electric’s response back if authorized. The gatekeeper auth pattern is a bit more interesting: you’ll need to build a gatekeeper endpoint in your API, which issues Shape-scoped auth tokens. Then there needs to be another authorizing proxy to check the claims of the token against the request parameters, which is less complicated because we’re only checking for an exact match. There are additional implementation details and examples here.
I didn’t poke too hard into the technical details of the implementation, but I did scroll down their blog post list and found this neat gem about Rich CRDTs. These are CRDTs that additionally support composition, compensations (side effects), and reservations (escrows, fine-grained locking). I couldn’t find additional papers or implementation examples anywhere, but I thought the idea was compelling at least.
This was a good opportunity to dive into one of the live sync tools available — Electric itself lists many alternatives which seem to be clustered around a similar space. I’m happy to see the CRDT/sync/local-first-adjacent community is flourishing with so many up-and-coming products. I hate to bring AI into the mix, but with increasing LLM/“magic” integrations into more and more apps, I expect we’ll want to see more syncing of context (and the Electric blog agrees). The research legwork the community’s already invested in will be such a strong foundation for this.
8. LiveStore, Laugh, Love
One of ElectricSQL’s integrations is with a platform called LiveStore, and one of the technical advising team members at Electric happens to be its founder, Johannes Schickling. Electric also funds (funded?) development of LiveStore and there seems to be a close partnership between the two teams. Before reading up about either, I’d heard of both in passing, but at a distance, both seemed vaguely similar: sync-y reactive database-shaped tools. I was curious to see how they differed.
LiveStore is similarly a sync and state management framework, but is geared more towards local-first apps. It also provides live reactivity on the client side, but is based on SQLite and an event-sourcing model, separating reads and writes. Reads go through the SQLite layer. Separately, each write event that’s created on a client is committed and pushed to the sync backend, which accepts the push and then actively pokes the other clients to pull the new event. LiveStore defines a flexible event model API and automatically manages the persistence and recomputation of state.
The reactivity system is based on three types of state:
- Reactive SQL queries, which additionally support callbacks and explicit dependencies to manage relationships with other queries and state
- Reactive signals, state values that aren’t in tables but should be persisted
- Reactive computed values, which can perform additional processing on signals
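To illustrate the general shape of signals and computed values, here’s a generic reactivity sketch in Python; it’s not LiveStore’s actual TypeScript API, just the rough idea.

```python
class Signal:
    """A reactive value that notifies dependents when it changes."""
    def __init__(self, value):
        self.value, self.subscribers = value, []

    def set(self, value):
        self.value = value
        for subscriber in self.subscribers:
            subscriber.recompute()

class Computed:
    """A derived value, recomputed whenever any of its input signals change."""
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs
        for signal in inputs:
            signal.subscribers.append(self)
        self.recompute()

    def recompute(self):
        self.value = self.fn(*(s.value for s in self.inputs))

first, last = Signal("Ada"), Signal("Lovelace")
full = Computed(lambda a, b: f"{a} {b}", first, last)
first.set("Grace")
print(full.value)  # "Grace Lovelace"
```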
LiveStore grew out of an earlier research project, Riffle, based at MIT CSAIL, with Geoffrey Litt of Ink & Switch as part of the team. One of the interesting key ideas here was the separation of concerns between state management and syncing (loosely, the separation between LiveStore and the sync provider, like ElectricSQL). As their Riffle blog post reads, “[i]f an app developer could rely on a sufficiently powerful local state management layer, then their UI code could just read and write local data, without worrying about synchronizing data […] or other chores […]”. The local-first aspect allows local queries and reactive state to update in the time it takes to render a single frame. It’s also neat to see the progression of the work, from the demo music library app in the Riffle blog posts to Overtone to the underlying tech being split out into LiveStore. There aren’t many technical details on how the reactive query performance is optimized in either the blog posts or the UIST paper as far as I can find, which makes sense for the given audience, but I would have liked to dive deeper into that piece.
LiveStore is still under pre-1.0 active development, so I’m excited to see where the project continues growing. One thing that stood out about even LiveStore’s early, incomplete docs was the explicit Design Decisions page. It’s short but collects all the main considerations in one place, so I could get an overview of the project’s motivations without trying to read between the lines elsewhere. Making decisions and learnings explicit makes them easier to take to heart, and it’s also a big part of what makes books like AOSA compelling. This is a strategy I’ll try to adopt for some of my research projects moving forward — I keep a research log, but the decisions get blurred together with all the other WIP notes. Explicitly calling them out somewhere will help context-share as well as make paper writing clearer and more straightforward.
9. Making a Convex-ion
I’m rounding out my miniseries of live, reactive sync systems with Convex, a framework I keep hearing about on Twitter.
Convex is an ACID compliant, reactive datastore that lets apps atomically live-update as queries change. It doesn’t use SQL and doesn’t require ORMs — like Electric it leans more heavily on the Typescript type system. Compared to the other systems, Convex is more of a platform, running queries as if they were serverless functions within their architecture. It’s more opinionated and abstracts away any database optimization that developers might have to perform.
A Convex deployment includes the database itself alongside a sync worker and a function runner. Queries and updates must be expressed as functions, which are bundled and executed by the V8 runtime in the function runner. The database stores tables in a transaction log format that tracks all versions of documents, a bit like a write-ahead log with CRDTs. Operations are wrapped in transactions, which Convex ensures are run serializably using optimistic concurrency control and tracking read and write sets. The transactions implementation was quite cool to read about, since I recognized many of the ideas from my formally-specified database project during my research internship in Paris.
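As a rough sketch of that approach (my own simplification, not Convex’s implementation): each transaction records the version of everything it reads, and at commit time the store checks that none of those reads have since been overwritten.

```python
class Store:
    """A toy versioned store with optimistic concurrency control."""
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> commit timestamp of the last write
        self.clock = 0

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:            # read your own uncommitted writes
            return txn["writes"][key]
        txn["reads"][key] = self.versions.get(key, 0)  # remember the version we observed
        return self.data.get(key)

    def write(self, txn, key, value):
        txn["writes"][key] = value          # buffered until commit

    def commit(self, txn):
        # Validation: abort if anything we read has been overwritten since we read it.
        for key, seen_version in txn["reads"].items():
            if self.versions.get(key, 0) != seen_version:
                return False                # caller retries the whole transaction
        self.clock += 1
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.versions[key] = self.clock
        return True
```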
To sync, the database reuses an algorithm similar to the one it uses to ensure serializability in order to detect which queries are affected by an update. After rerunning dependent queries, results are pushed to the relevant sync worker and transferred via Websocket to the client, which updates its local state.
ElectricSQL supports transactional causal+ consistency, which means that there are atomic (and highly available) transactions that respect causal consistency, and that this is implemented with CRDTs. LiveStore doesn’t have transactions and is eventually consistent with a total order on events in its log. In contrast, Convex provides serializability: “We believe that any isolation level less than serializable is just too hard a programming model for developers.” This is both rather funny and rather correct, and I appreciate that it eliminates this whole class of issues in favour of mechanisms like transaction abort-retry that are more intuitive to implement.
I noted that Convex is leaning quite heavily into highlighting how optimized their platform is for AI — they have a whole landing page detailing the best prompts and rules as well as quickstart examples for building LLM integrations. Because they lean more into Typescript schemas, they claim LLMs are better equipped to generate Convex code (which I believe). Their quickstart page contains the typical React and other frameworks, but also has a plain textbox to directly generate an app. They’ve built a whole platform-on-a-platform, Chef, to oneshot entire realtime apps, which is very compelling to demo. Also, one application they show that meshes nicely with reactive frameworks is LLM streaming output — it makes sense that we might want to store the output but also stream a call’s progress in real time. Convex was started around the same time as the other projects (2020, in comparison to ElectricSQL and LiveStore, which both seem to have been started in 2021) but seems to have enjoyed more developer popularity, and I wonder if this recent LLM push had anything to do with adoption.
This writeup was mostly based on the very thorough How Convex Works blog post — I appreciate the transparency and clear walkthrough. The life-of-a-request diagrams were particularly helpful, and I’d have liked to see something similar with the other platforms to get a better sense of the flow in one place.
Reading about three adjacent realtime frameworks in a row let me compare-and-contrast them more easily: learning about each system contextualized it among the others, and I got to “map out” the space. I think three is enough sync frameworks for now, but there are many others I’d still like to read about sometime, like Turso, Jazz, Zero, and InstantDB. Check those out if you’re interested in exploring further, but for now, I plan to turn back to some notable database systems papers in my next few writeups.
10. MapReduce Your Expectations
This marks the start of a series of more mainstream database papers. I have this sense I should have read these by now, if only to build more context for what folks keep referencing in design discussions or online. The first system I chose was MapReduce, which seemed most approachable.
Functional programmers will find MapReduce intuitive to work with. It does exactly what it says on the tin: map and reduce values. The input is a large series of key-value pairs, which are then transformed into intermediate key-value pairs with the user-defined map function, and collected by intermediate key into a series of final values with the user-defined reduce function.
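The canonical example is word count; a minimal in-memory Python rendition of the programming model (ignoring all of the distribution machinery) looks something like this:

```python
from collections import defaultdict

def map_fn(_name, contents):
    """User-defined map: (document name, contents) -> intermediate (word, 1) pairs."""
    for word in contents.split():
        yield word, 1

def reduce_fn(_word, counts):
    """User-defined reduce: (word, list of counts) -> total count."""
    return sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                        # map phase
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)  # group by intermediate key
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

print(mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```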
The framework aims to parallelize computation, so it makes use of a special master worker orchestrating many other workers. Input data is partitioned into splits and distributed among map worker nodes, which write their results locally. Intermediate pairs are partitioned (via a hash of the key, for example) and then split up amongst the reduce workers. These reduce workers use RPCs to read the buffered data from map workers’ disks, sort it, then perform the reduce operation. The final results are written out to global storage, and the master worker is responsible for waking up the user program once everything completes.
MapReduce also takes care of fault-tolerance concerns for the user. If a task times out, the master worker will mark it as failed and rerun it. Completed reduce tasks are safe since their results are stored in global storage, but if a machine that ran a completed map task fails, the master worker will rerun its map tasks, since their output lives on that machine’s now-inaccessible local disk. Interestingly, MapReduce supports non-deterministic tasks, simply providing weaker semantics and guarantees. One neat point is that stragglers, or slow/delayed tasks, can impact performance for the entire job, so once a job is close to completion, the master worker also schedules backup executions of the remaining in-progress tasks to race against the existing executions. This is an interesting technique, especially given nondeterministic functions are allowed, but they reported it noticeably improved performance.
The magic of MapReduce is that mapping and reducing is all users have to worry about. They don’t have to figure out how to optimally place data to reduce latency (the master worker will take locality into consideration when assigning tasks) or think about fault tolerance and durability. Because MapReduce takes care of these reliability and performance concerns in a common abstraction layer, teams don’t have to keep reinventing the wheel, which increases maintainability and productivity. I assume the MapReduce team played the role of a dev infra startup within the company — it’s cool that this type of framework and subteam forms in industry. Working on infra teams like this in the past was really fun and I enjoyed the impact and considerations we’d have to make.
The two meta points that drew my interest most were how the paper demonstrates how widely this simple pattern has been applied, and how they decided to discuss some of the actual library evolution and document real usage examples along with concrete results. For example, in addition to their more contrived experiments, they’d explain what size partitions they’d tend to choose, why, then explain the CPU utilization they observed with the configuration after real usage. I rarely see this with more academic papers: being able to share these patterns is a real strength of industry papers.
This paper was actually assigned as one of my class readings this term — I was already planning to read it, so I was happy to kill two birds with one stone. I’ll note that I based this writeup on the shorter Communications of the ACM version as opposed to the original OSDI paper. It’s more concise and glosses over the extensions to the basic model, like allowing users to define custom reduce partitioning logic, and is ironically more highly cited, which perhaps speaks to how easy to understand it was. These details weren’t of the most interest at my level, though, so this version sufficed. One day I’ll revisit the full version of this paper, but in the meantime I think reading shorter versions prepared by the same authors could be a good strategy for reading papers that seem intimidating, like the ones that follow.
11. Fantastic Beasts and How to Lock Them
Zookeeper is one of those system names that sticks with you: what beasts3 could they possibly be herding that warranted such a name? I’d heard about Zookeeper in many a distributed systems reading list, so I figured it was time to give it a read. I was surprised as soon as I opened the PDF: I’d always heard of Zookeeper in the AWS context, so I’d thought Zookeeper was Amazon’s brainchild, but it turns out it originated at Yahoo, who published the original paper at ATC 2010.
In it, Zookeeper is described as a coordination kernel with wait-free, hierarchical objects. It provides primitives that can be used to build locks, group management, and more, offering developers flexibility and modularity in how they’re actually implemented. I found it helpful to visualize it as a filesystem with extra message ordering guarantees. Requests are pipelined and asynchronously processed, and Zookeeper’s API provides notifications to watch for updates to objects.
Zookeeper’s main primitive is the znode: data nodes stored in a hierarchical namespace and accessible via JSON PATCH-like paths. These can be freely created and deleted by workers, and can even be made ephemeral, cleaned up when the session that created them terminates. Znodes use version tags for concurrency control, like CosmosDB etags.
These znodes can be used to build a variety of utilities, as described in the paper:
- Live config management can be implemented by having a “ready” sentinel object that a leader process can delete before making modifications to config and recreate when it’s ready to be processed. Workers can subscribe to updates on the ready object to re-process the config.
- Group management can be implemented with ephemeral nodes that are created as child znodes on some well-known znode. These nodes are created and die in lockstep with the processes that created them, and group membership can be calculated by listing the children of the znode.
- Locks can be implemented by creating a designated znode with the ephemeral flag and letting all processes race to create it. To avoid a thundering-herd effect, processes can instead create sequential ephemeral znodes and watch only the previous node in the queue; read locks can be implemented by watching only the previous write lock node (see the sketch after this list).
- The paper then goes on to explain some real-world use cases that remix the above abstractions. In a few, they mention leader election being implemented via Zookeeper, but don’t mention it in the rest of the paper: I would have really liked to see how it was put together from znodes.
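Here’s roughly what the queue-based lock recipe looks like with the kazoo Python client, assuming its create/get_children/exists API; retries, error handling, and session loss are all glossed over.

```python
import threading
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/locks/resource")

def acquire_lock():
    # Ephemeral + sequential: the znode disappears if our session dies, and the
    # sequence number gives us a position in the lock queue.
    path = zk.create("/locks/resource/lock-", ephemeral=True, sequence=True)
    my_name = path.split("/")[-1]
    while True:
        children = sorted(zk.get_children("/locks/resource"))
        if children[0] == my_name:
            return path  # lowest sequence number holds the lock
        predecessor = children[children.index(my_name) - 1]
        released = threading.Event()
        # Watch only the znode immediately before ours to avoid a thundering herd.
        if zk.exists(f"/locks/resource/{predecessor}", watch=lambda event: released.set()):
            released.wait()

def release_lock(path):
    zk.delete(path)
```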
Zookeeper replicates all data in in-memory DBs that each contain the whole tree. It uses a special atomic broadcast protocol, ZAB, for requests that require coordination, like writes. For these, Zookeeper guarantees linearizable writes, but it serves reads from each local replica. This means that reads can be stale, so they also describe their sync API that forces all writes issued before the read to finish first. Zookeeper also has a WAL and uses idempotent transactions in its request processor, making recovery a straightforward process of replaying events off a fuzzy snapshot.
Again, I’ve realized I’ve been over-intimidated by a paper: it was much easier to parse than I expected. Perhaps this speaks to the quality of the papers and why they’ve been recommended so frequently. I found the structure of explaining the programming model of znodes first, followed by examples, then finally jumping into implementation made for a very clear storyline — this is a recipe I hope I’ll remember to draw on in the future.
12. Grecian Politics
Once upon a time I bodged together an implementation of Raft, one of the two most well-known consensus protocols. Masquerading as a distributed systems enthusiast, this felt like an apt thing to do. When people invoked log replication and leader heartbeats, I, too, could nod my head along. Unfortunately, I hadn’t gotten around to learning about the other main consensus protocol, Paxos. Whenever folks would bring it up, all that came to mind were vague impressions of it being harder to understand and implement than Raft. But I figure now’s as good an excuse as any to finally read through it, so I added it to my reading list.
It turns out my suspicions of Paxos being a bit convoluted were well-founded. Based on the success of his Byzantine generals parable, Lamport’s original paper frames the protocol as a story of a parliament on a Greek island, complete with characters with Greek-esque gibberish names. While this has Lamport’s trademark quirkiness written all over it, the other distributed systems pioneers who read it didn’t pick up on the algorithm. His original submission was rejected, and it was only published later with “more serious” annotations. I based this writeup on the easier-to-parse “Paxos made simple” writeup Lamport made, whose abstract merely states “The Paxos algorithm, when presented in plain English, is very simple.” Thanks, Lamport, I’ll try my best.
There are three roles in Paxos, which can be played simultaneously by each node in a larger system: acceptors, proposers, and learners. The proposers propose pairs of the form (proposal number, value), which the acceptors evaluate and potentially commit to. Then, learners will learn the final, consensus’ed value. The hard part of Paxos is making sure that only a single proposed value is chosen, and that processes only learn a value has been chosen once it actually has been. The paper handwaves away the problem of electing a leader (presumably done via whoever wins the consensus race), but the leader serves as a distinguished proposer and learner to optimize some communication.
There are also three phases in Paxos:
- Prepare: a proposer sends a proposal number (yes, don’t-ask-to-ask seems not to apply here) to the acceptors, requesting 1) a promise that the acceptor won’t accept any lower-numbered proposals and 2) the highest-numbered proposal the acceptor has already accepted, if any.
- Accept: once the proposer receives responses from a majority of acceptors, it issues its own proposal with the agreed-upon proposal number. The value of the proposal is the value of the highest-numbered accepted proposal it received in its prepare responses, or, if no such accepted proposals exist, its own choice of value. Acceptors accept the proposal unless they’ve already promised a higher-numbered one.
- Learn: the acceptors contact the distinguished learner (or all learners in general), which then gossips the result to the other learners.
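Here’s a minimal single-decree sketch of the prepare and accept phases in Python; the message passing, failures, retries with higher numbers, and learners are all elided, and the structure is my own simplification.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number we've promised
        self.accepted = None  # (number, value) of the highest-numbered proposal we've accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised:  # we never promised anyone a higher number
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, my_value):
    # Phase 1: gather promises from a majority.
    promises = [resp for ok, resp in (a.prepare(n) for a in acceptors) if ok]
    if len(promises) <= len(acceptors) // 2:
        return None  # no majority; retry with a higher proposal number
    # Use the value of the highest-numbered already-accepted proposal, else our own.
    already_accepted = [p for p in promises if p is not None]
    value = max(already_accepted, key=lambda p: p[0])[1] if already_accepted else my_value
    # Phase 2: ask the acceptors to accept (n, value).
    accepts = sum(a.accept(n, value) for a in acceptors)
    return value if accepts > len(acceptors) // 2 else None
```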
Paxos prioritizes consistency over availability, and only makes progress if the majority of nodes are running. Progress can also only be guaranteed with a single distinguished proposer, the role of the leader I mentioned previously.
One neat thing this paper covered was a section on implementing a distributed state machine with Paxos. Each proposed value would be a deterministic command, and machines would only execute commands once agreed upon. I never realized this was a focus or even feature of Paxos.
Having come from a Raft-centric background in consensus, the main differences from Raft I noticed with Paxos were the lack of log replication, the lack of explicit leader election, and reports of it being tricky to implement, whereas Raft has a reputation for being clearer. There are a few variants and implementations of Paxos that address some of the other tradeoffs, including Multi-Paxos, which allows a series of values to be decided upon. It’s implemented at Google in the Chubby lock service, which was also mentioned in the Zookeeper paper. There’s also Flexible Paxos, which allows more flexible interpretations of what qualifies as a valid quorum between the phases of the protocol.
I’m not sure if the paper lives up to its abstract (untangling the propositions is not what I’d call “very simple”) but it’s certainly demystified the protocol. I quite liked the structure of the paper as well: it’s very TLA+-coded (and fitting) of Lamport to formulate the algorithm in terms of its safety invariants first. The declarative proofs leading to each new bit of the protocol were an interesting choice but made the protocol’s construction more incremental. I’m not sure if I’ll ever have to implement Paxos in a system — everything consensus-related I’ve worked with so far has been Raft — but if I ever do, I’ll feel a little more prepared, and perhaps a little more scared, thanks to this paper.
13. We Call It A Wrench
When I was interning this last summer, I learned through the grapevine that the person sitting behind me was the former Director of Spanner at Google. I understood the weight and aura of such a title: I’d read about Spanner in DDIA before, and I’d heard it referenced frequently enough to get the gist that it was a Big Important System. I’d never taken more than a cursory look at the paper abstract, though, and given I’d just read about Paxos, which backs Spanner, I thought it was an apt time to take a deeper look.
Spanner is a multi-version, semi-relational database with automatic geographic redistribution and resharding. It uses sharded Paxos state machines to provide externally-consistent (more about this later) distributed transactions. Spanner deployments are organized in universes (very grand), with the unit of physical isolation being a zone in a universe. Zones serve data to clients via spanservers, which keep track of a data structure called a tablet. These tablets are bags of KV pairs with timestamps. Each tablet gets its own Paxos state machine for replication. While it’s not discussed in much detail in the paper, a placement driver at the universe level then automates the movement of these tablets. I found this part a bit overly detailed, especially since there are abstractions and components that aren’t discussed later on.
Spanner has the interesting property of supporting globally-consistent reads at a timestamp. As well, it enables externally-consistent reads and writes: external consistency is more or less linearizability, if you squint, but for transactions with multiple operations instead of linearizability’s definition based on only individual atomic ops.
This is all thanks to the TrueTime API that Spanner defines: instead of timestamps bearing a single value, which presents issues with clock skew between machines, timestamps build in this very uncertainty as an interval. There are time masters per datacenter and time daemons per machine, which sync and track the worst-case clock error. Writes wait out this uncertainty interval before their commits become visible (commit wait). The parts about choosing timestamps for transactions were interesting to me because I’d had to deal with a similar problem for a prior research internship: read-only transactions choose the earliest time that would not violate visibility to avoid backing up other concurrent transactions, and schema change transactions and operations that depend on said changes set their timestamp well into the future. I think it was worth reading about Spanner for this unique timestamp mechanism and the guarantees it gives alone — sections 3 and 4 cover this in detail. For a less academic take, Spanner also has developer docs available.
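The commit-wait idea is small enough to sketch: a TrueTime timestamp is an interval [earliest, latest], and a write picks a commit timestamp at the top of the interval and then waits until that timestamp is guaranteed to have passed everywhere before becoming visible. A toy Python version, with a made-up uncertainty bound:

```python
import time

EPSILON = 0.007  # assumed worst-case clock uncertainty in seconds (made up for illustration)

def tt_now():
    """TrueTime-style interval: the true time lies somewhere in [earliest, latest]."""
    local = time.time()
    return local - EPSILON, local + EPSILON

def commit(apply_writes):
    _, latest = tt_now()
    commit_ts = latest          # pick a timestamp no earlier than the current interval
    apply_writes(commit_ts)
    # Commit wait: don't make the write visible (or release locks) until commit_ts has
    # definitely passed on every machine, i.e. until even the earliest possible time is later.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```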
I’d expected more math proofs and complicated causality diagrams of the sort I encountered in some earlier FLP/CRDT papers, but I was pleasantly surprised the paper went without. I found the “industry voice” quite clear given the emphasis on applied deployment details. I was also surprised how open Google was about Spanner’s relationship with other internal storage products like BigTable — in some parts, the paper read almost like an internal onboarding doc for when to use certain systems. It makes sense, since back then they’d already published papers on those other systems.
I was curious about Google’s more recent distributed systems publications, so I went to go poke around. It seems like they continue to put out “bigger” papers on new systems that build on their existing services. I’m happy that’s the case: sometimes it seems like I’ve missed the heyday of the early 2010s for all the systems I keep hearing about, but it’s encouraging to see there’s still plenty of work and research to be done.
14. DynamoDB Contra Dynamo
Amazon is dogmatically adherent to its leadership principles: interview there and you’ll likely need to speak to a time you exhibited “customer obsession” or one of the fourteen other platitudes. It turns out this values-driven approach extends to its systems research as well. Folks suggested I read about Dynamo, an “always-writeable” KV store that optimizes heavily for availability over consistency, published by Amazon at SOSP 2007. It was a fascinating read, both technically and culturally:
In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers’ relentless focus on performance from the perspective of the customers’ experience. […] Amazon’s engineering and optimization efforts are not focused on averages.
If the engineers were focused on an above-average customer experience, I’m sure the paper authors were too: I found it quite clear and thorough. The bulk of the paper describes how they applied well-known techniques (e.g. consistent hashing for partitioning), and it’s neat that the intended academic novelty lies in the evaluation and production-readiness optimizations.
For some extra context, Dynamo focuses on being highly available instead of consistent. It’s a KV-only store that supports small-blob storage. There’s no isolation and no multi-key operations. Dynamo stores multiple versions of each object, so if concurrent writes conflict, the application can define a custom conflict resolution pathway, or fall back to the system’s last-writer-wins strategy. It’s supposed to be a leaderless service, with each node having enough information to route and handle requests — the paper describes it like a “zero-hop DHT”.
Keys are partitioned across nodes via a “Dynamo ring”, which just uses consistent hashing. Each key is replicated at some clockwise successor nodes. Merkle trees are used to reduce transferred data during replica synchronization. The Dynamo ring is updated manually, with updates being gossiped around. Some nodes are well-known and serve as seeds to relay information about newly joined nodes.
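As a toy illustration of the ring walk (leaving out Dynamo’s virtual nodes, hinted handoff, and persistence; the node names and choice of md5 are mine), a consistent-hashing preference list might look like this:

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy Dynamo-style ring: nodes own positions; keys go to the next N clockwise nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.positions = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def preference_list(self, key: str):
        """Return the N distinct nodes responsible for this key, walking clockwise."""
        points = [p for p, _ in self.positions]
        start = bisect_right(points, self._hash(key)) % len(self.positions)
        nodes, i = [], start
        while len(nodes) < min(self.replicas, len(self.positions)):
            node = self.positions[i % len(self.positions)][1]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("cart:12345"))  # e.g. ['node-c', 'node-d', 'node-a']
```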
In Dynamo, multiple versions of data can (rarely) coexist. They’re labelled with vector clocks, and the node serving the read is responsible for collapsing these branches. There’s an interesting vector clock optimization where they keep essentially an LRU cache of the nodes that last updated a key in the clock, discarding entries for older node updates.
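Here’s a rough sketch of that truncation trick, with made-up names and a toy size limit, treating the clock as a map from node to a (counter, last-updated) pair:

```python
# Made-up names and a toy size limit; Dynamo's actual threshold and metadata differ.
CLOCK_SIZE_LIMIT = 10

def record_write(clock, node, now):
    """clock maps node -> (counter, last_update_time)."""
    counter, _ = clock.get(node, (0, 0.0))
    clock = {**clock, node: (counter + 1, now)}
    if len(clock) > CLOCK_SIZE_LIMIT:
        oldest = min(clock, key=lambda n: clock[n][1])
        del clock[oldest]  # bounds clock size, at the risk of false conflicts later
    return clock

def descends(a, b):
    """True if clock a has seen everything in b, so b's version can be discarded."""
    return all(a.get(node, (0, 0.0))[0] >= counter for node, (counter, _) in b.items())
```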
Dynamo allows users to tune knobs defining replication: each key is replicated to N nodes, each write request goes to W of them, and each read to R of them, with W + R > N so that the read and write quorums overlap and at least one read hits the latest value. The usual values for these parameters are W=2, R=2, N=3. Dynamo uses a sloppy quorum: it takes the first N healthy nodes, which might require walking the ring beyond the first N positions. There’s metadata with each request that includes hints on the intended replica, so a replica further along the ring can buffer updates and catch the intended replica up once it comes back online. As well, if a node serving the read gets any stale versions among its R read responses, it can update those nodes without falling back to the Merkle tree synchronization — this is called read repair.
The implementation has three components: a request coordinator, membership and failure detection protocol, and a local DB engine. I thought it was interesting the underlying DB architecture was pluggable, which enables different engines according to application workloads. It seems perhaps feasible to use more traditional relational DBMSes and replicate tables instead of just keys, and I’m wondering if they ever tried this or if the overhead of running a DBMS was contrary to the high-availability and performance angle.
Anyways, I thought I was done with Paxos-based systems, but it turns out I wasn’t. Amazon came out with another paper on DynamoDB at ATC 2022, which describes DynamoDB as we know it today in its AWS form. The design draws on the principles of Dynamo, but all of the architecture today is different. It now supports indexes, full tables, and a single-leader multi-Paxos replication strategy instead of the leaderless ring-based/Merkle tree approach in Dynamo. There are now microservices in the microservices: DynamoDB is itself built in small components with a centralized administration service. This paper contains much less actual architecture discussion than the first — I can’t help but wonder if this is due to the increasing corporatisation and legal clampdown I’d expect at today’s scale. Unfortunately, this second paper contained far fewer iconic company-culture statements (I found none in my brief skim), so I wasn’t motivated to read it in its entirety.
It was a bit disappointing to read a paper then find out it’d been completely dismantled in current deployments, but it’s also quite exciting that there’s been so much change and progress. I think there’s still plenty of value in reading the original Dynamo whitepaper, particularly as it surveys so many design techniques. I’ll try to keep the practical optimizations in mind: they’re smaller adjustments that could slot more easily into systems I work on later and still make a big impact.
15. I Mean No (Redis)respect
Inspired by the NYSRG, a good friend and I spent a couple hours reading through the Redis 1.3.6 codebase. It was much more tractable then than now: as part of paper writing for my formally-verified databases project in Paris, my supervisor asked me to compute the LoC of the Redis core, and the whole project now stands at ~100K LoC of C, with the core at ~25K LoC. However, back in 1.3.6, its core was a single 9K LoC file, redis.c. Two hours was enough to learn the basics of what Redis does, skim through the code and pick out the key helper libraries, and explore some of its current enterprise extensions.
The big redis.c file is loosely structured as follows:
- L1: Definitions, prototypes, globals
- L802: Utility functions
- L2587: Redis object creation, like strings and zsets (ordered sets)
- L2981: RDB (snapshotting) saving and loading
- This is a high-performance way to ensure durable backups.
- L3703: Command implementations
- Redis supports a lot of data types so there are a lot of commands to walk through here! They’re all quite intuitive.
- L6712: Batch command implementations, a bit further down the file
- L6608: Cache expiry and removal
- There are separate functions for expiring keys due to TTL and due to memory pressure.
- L6981: Replication helpers
- Redis replication is eventually consistent and has both full and partial synchronization steps. There are some neat resynchronization helpers here.
- Redis replicas are read-only, unless you use Enterprise and get into their Active-Active setup (more below)
- L7412: AOF (write-ahead logging) implementation
- This allows full reconstruction of data after a crash, so it supports better durability and avoids crash corruption, but it has performance overheads
- L7950: Virtual memory, a now deprecated feature for swapping objects to disk
- L9076: Main function
There are also two other main files: the custom event loop and the backing hash table. Neither is very tricky to read through but they’re very clean implementation examples.
A neat thing I learned about the event loop was that it’s single threaded. This still admits some race conditions from clients but ensures atomicity for individual commands and makes Redis behaviour easier to reason about. There’s some asynchronous I/O for background RDB/AOF file saving, but this is handled by forking a child process and I assume this is excluded from the “single-threaded” considerations.
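This isn’t Redis’s actual ae.c, but a minimal single-threaded multiplexing loop in Python shows the shape of the pattern: one thread services every connection, so each “command” (here just an uppercase echo) runs to completion before the next is handled.

```python
import selectors
import socket

# One thread multiplexes every client socket; each "command" (an uppercase echo here)
# runs to completion before the loop moves on, so commands are naturally atomic.
sel = selectors.DefaultSelector()

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data.upper())
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 6380))  # arbitrary local port, not Redis's 6379
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:  # the single event-loop thread
    for key, _ in sel.select():
        key.data(key.fileobj)
```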
On a whim, I looked up “Redis CRDTs”, wondering if they might have some experimental datatype support, but I was surprised to find a whole whitepaper. I didn’t read it as it was behind an email paywall, but from their docs, it seems that they’re used to support the geo-distributed replication features for the Enterprise (!) product, allowing clients to both read and write from anywhere in the cluster. The whitepaper page is a bit plastered over with marketing-speak (“intelligently resolving conflicts”, “unprecedented advantages”, “other techniques such as LWW”, which is also a CRDT…) but it’d be interesting to learn more about what their implementation looks like sometime.
I didn’t get too deep into the weeds of Redis’s details during my foray, but what struck me was how simple, elegant, and approachable everything was. The most interesting areas for me were the two durability features (RDB vs AOF files) as well as the replication system. I haven’t yet had the chance to work with Redis in an internship (surprising given its ubiquity) but I think this initial look will be a good foundation for when the time comes.
16. Incremental Gymnastics
The frequency illusion struck particularly resoundingly recently, this time for the concept of incremental view maintenance. A supervisor mentioned it as a continuation of a research direction, someone else posted a blog post that came up on Twitter, and I stumbled across another reference on someone else’s blog. The second time I came across the term, I resolved to look into it some other time; the third encounter bumped it to the top of this reading list.
Incremental view maintenance tackles the problem that materialized views in databases improve read performance, but must themselves be brought up to date efficiently. If a view tracks a like count, for instance, it needs to update whenever a new post is made or liked. Materialized views are defined just like a query, so an immediate idea would be to re-run the query every time any data it references changes. This turns out to be very expensive, however, so folks have been working on smarter ways of computing results based on just the deltas between snapshots, or in more parallel ways.
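Here’s the core contrast in a few lines of Python, using the like-count example above with illustrative names: recompute the whole view, or apply just the delta.

```python
likes = [("post1", "alice"), ("post1", "bob"), ("post2", "carol")]

def recompute_view(rows):
    """The naive approach: rebuild the whole like-count view on every change, O(|likes|)."""
    view = {}
    for post, _ in rows:
        view[post] = view.get(post, 0) + 1
    return view

def apply_delta(view, inserted=(), deleted=()):
    """The incremental approach: touch only the rows that changed, O(|delta|)."""
    for post, _ in inserted:
        view[post] = view.get(post, 0) + 1
    for post, _ in deleted:
        view[post] -= 1
    return view

view = recompute_view(likes)
view = apply_delta(view, inserted=[("post2", "dave")])
print(view)  # {'post1': 2, 'post2': 2}
```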
The first search result that comes up for “incremental view maintenance” is a dense Postgres implementation proposal, which walks through an intuitive solution based on change sets. I’d recommend reading it for an easy-to-understand overview of one way to do this. There have been three other main advances, rooted in academia, though some have since become the basis for infra companies:
- Timely dataflow, which orders changes linearly by timestamps.
- Differential dataflow, which imposes a partial order.
- DBSP circuits, which model computation on changes as circuits.
Timely dataflow is a dataflow layer that’s designed for incremental and iterative work. It tags records with a special timestamp that pinpoints where in the computation a particular piece of data is, so the system knows what’s stable and safe to process versus still being recomputed and in flux. This time-awareness lets the system avoid repeatedly recomputing from scratch and incrementally update results when later-timestamped data arrives. Time is tracked outside of the dataplane, so tasks get a meta-level, global view of time to determine whether they can progress. This separation also allows different tasks to be computing different points in time at the same time, enabling more concurrent computation.
Differential dataflow is built on top of timely dataflow, as a slightly higher-level abstraction. Instead of keeping track of a single version of data, it keeps multiple versions in an index, with timely dataflow timestamps used to build a partial order. This allows the system to reuse previously computed values. Differential dataflow is more of a framework (an example of how to use it here), whereas timely dataflow is more of a low-level runtime scheduler. Differential dataflow has low adoption given its relative immaturity, but Materialize is an example of a startup providing a live SQL layer that’s based on differential dataflow.
I found the DBSP circuits paper most interesting given how concrete the theory was and the clarity of the examples. They develop a theory of database updates as a stream of changes, inspired by digital signal processing techniques. Operators and functions on these streams can be composed into circuits. They then show how you can rewrite any relational algebra and SQL operators into circuits that operate on these streams (see Table 1), and expand this up to general incremental recursive programs. For instance, you can rewrite a UNION as a + between streams followed by a distinct operator. This greatly improves the efficiency of view maintenance, since the circuits give an incremental version of a view’s query that returns just the change delta. This theory has been formalized, although this is just barely touched on in the paper. DBSP is the highest-level of these three approaches, constraining time and state management in exchange for a simpler model — as a developer, DBSP circuits most easily match my existing mental models. As for real-world applications, Feldera is a startup focusing heavily on incremental processing based on DBSP.
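To make the Z-set flavour of this concrete, here’s a toy sketch with my own naming. Relations become maps from rows to integer weights, UNION becomes a + of streams followed by distinct, and each step consumes and emits only deltas. (The distinct here is recomputed naively over the accumulated input for brevity; deriving a properly incremental circuit for operators like this is exactly DBSP’s contribution.)

```python
# Z-sets: relations as {row: weight}; deltas use the same representation with +/- weights.
def zadd(a, b):
    """The + of two Z-sets (or of a Z-set and a delta)."""
    out = dict(a)
    for row, w in b.items():
        out[row] = out.get(row, 0) + w
        if out[row] == 0:
            del out[row]
    return out

def distinct(z):
    """Collapse positive weights to 1, dropping rows with weight <= 0."""
    return {row: 1 for row, w in z.items() if w > 0}

class IncrementalUnion:
    """Maintains distinct(A + B) as deltas to A and B stream in."""
    def __init__(self):
        self.acc = {}    # accumulated A + B
        self.view = {}   # current distinct(A + B)

    def step(self, delta):
        self.acc = zadd(self.acc, delta)
        new_view = distinct(self.acc)
        out = {row: new_view.get(row, 0) - self.view.get(row, 0)
               for row in set(new_view) | set(self.view)
               if new_view.get(row, 0) != self.view.get(row, 0)}
        self.view = new_view
        return out  # the change to the view, itself a Z-set

union = IncrementalUnion()
print(union.step({("alice",): 1, ("bob",): 1}))  # {('alice',): 1, ('bob',): 1}
print(union.step({("alice",): 1}))               # {} : a duplicate insert doesn't change the view
print(union.step({("alice",): -2}))              # {('alice',): -1} : alice drops out entirely
```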
So far, it seems like these three approaches are the frontier of the field, and there’s a very cutely titled paper addressing additional advances. As you can surmise from the choice of topics for this calendar, this sort of live, reactive, incrementally updating dataflow is fascinating to me on an aesthetic level. I’m eager to see where the research goes next, particularly if the Postgres proposal goes anywhere or if these techniques are integrated into any more mainstream databases.
17. Timberlake on Databases
As should be evident from these calendar entries so far, I have a soft spot for punny and cute titles. I saw What Goes Around Comes Around in the supplemental readings for my DB course, so when I saw What Goes Around Comes Around… And Around… I added it to my TBR list with a little extra relish4.
The former is a chapter in the famous Red Book of DBMS research, which I also need to get through at some point. It was published in 2005, chronicling the major trends in prior databases research and application til that point, from hierarchical data to ER to the XML heyday. Clearly, a lot has changed since then — I saw vestiges of XML only in my early web-dev days, and we joke more about web-scale databases than ORMs nowadays. So, about twenty years later, the latter follow-up paper was published to bring us up to date.
WGACAAA is structured in two parts: one with a series of history-lesson-like summaries of various paradigms, and one on hotter topics in DBMSes. Even at a first glance, these keywords feel more familiar: document DBs, vector DBs, columnar stores, blockchain. Some of my favourite anecdotes mentioned:
- The term MapReduce is thrown around so often that I’d always figured it’d persisted, but the paper covers a throwdown between Google and the DBMS community, and the eventual demise of MapReduce and the budding vendors that’d grown up around it.
- Cloud databases took the compute time-sharing of the late 70s and shared-disk DBMS ideas and made them trendy cash cows again.
This was also the first time I’d heard of a few concepts, including:
- Column-family DBMSes (as opposed to columnar data stores), as used in Google’s BigTable. These are like document DBs but only allow one level of key nesting along with arrays in values.
- Array databases, primarily used in scientific computing. These are challenging due to arbitrary or irregular nesting of array dimensions. This is a reason there are few vendors, and none in the major clouds.
- Hardware-accelerated DBMSes, which these days means using FPGAs and GPUs. It’s quite interesting that as early as the 80s, vendors were already trying to fiddle with hardware.
I learned about some of these DBMS models completely separately from traditional RDBMSes, so it’s interesting how the paper ties back each feature to new SQL standard updates. For example, I’ve spent some time working with Neo4j and thought the graph traversals were very distinct from a typical SQL use-case, but it turns out SQL:2023 introduced graph traversal support. Array database features can likewise be simulated by SQL:2022 features. And of course, document DB JSON features are finding their way into SQL systems. The conclusion of the first part of the paper makes a strong case that all mainstream, profitable roads lead back to SQL and RDBMSes.
I like one of the closing lines of the paper: “In other words, stand on the shoulders of those who came before and not on their toes.” (see above point about cute lines). This might be the first survey paper I’ve read, and its balance between cautionary tales and future-looking forecasts is executed very well. It’s a fairly recent overview, and thanks to the slower DB research cadence it should remain relevant for a while. It’s very easy to digest and provides valuable retrospectives on the historical provenance and controversies in the DB community so I’d certainly recommend perusing it. This feels like a fitting way to wrap up this DB section, and it’s comforting to see I’ve already covered quite a few of the foundational systems mentioned in this overview paper.
18. Without a Trace
One of my sidequests this summer was trying to improve the performance of some action in a service. I was helpfully pointed to where I could find the traces, only to realize there was something wrong with the spans — they were overflowing their parents and reappearing in odd ways. I ended up dropping that work since I couldn’t understand the spans, but I recently figured it’d be good to go back to one of the first big distributed tracing papers to see if it might help in the future.
Dapper is a Google paper describing their homegrown distributed tracing setup and how internal teams use it. Its two main goals were to be low-level enough to be deployable everywhere and to be continuously running for as many services as possible.
They mention that there are two typical types of tracing systems: black-box ones that use statistical regression to link events, and annotation-based ones that have better accuracy in exchange for the development cost of adding annotations. Dapper is based on the latter school of thought, although given the wide range of services deployed at Google, they had unique concerns about where to put this annotation. They ended up annotating events at a very low level in common libraries, like the thread-local storage, the thread scheduler, and the RPC library. This avoids requiring application-specific annotations, although these are also supported.
The base unit of a trace is a span, which has timestamped records and timing data and everything else you’d expect. All spans in a trace execution keep a parent span ID and share a globally unique trace ID, which makes setting up the root span a bit more expensive. This span data is written first to local log files, then pulled by the Dapper daemons (what a name) on a periodic basis, and finally written into BigTable.
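A toy version of that span shape might look like the following. The field and function names are mine, not Dapper’s, but the key bit is that a child span copies the trace ID and records its parent’s span ID so a collector can reassemble the tree later:

```python
import random
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: int
    span_id: int
    parent_id: int | None
    name: str
    start: float = field(default_factory=time.time)
    annotations: list = field(default_factory=list)

def start_trace(name):
    """Root span: fresh trace ID, no parent."""
    return Span(random.getrandbits(64), random.getrandbits(64), None, name)

def start_child(parent, name):
    """Propagated across an RPC by copying trace_id and using the caller's span as the parent."""
    return Span(parent.trace_id, random.getrandbits(64), parent.span_id, name)

root = start_trace("frontend.request")
child = start_child(root, "backend.search")
child.annotations.append((time.time(), "cache miss"))  # application-level annotation, opt-in
```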
All this ideally occurs within a short few minutes’ delay to enable more realtime firefighting when issues arise. Indeed, the paper spends a lot of time explaining the design decisions made to minimize client overhead and latency. They found trace generation overhead to be negligible, on the order of nanoseconds, and even the expensive local log writing takes place asynchronously in the background with little overhead. I thought it was interesting that they explicitly called out that trace collection and analysis can be turned off easily, and hence are less pressing performance-wise, but the collection processes are still scheduled at the lowest priority in the kernel.
Spans are sampled twice for latency and storage improvements: once at the service level and once at the BigTable storage level. They used to tune this globally at a sample rate of 1/1024, but were rolling out adaptive sampling at the time of writing. One aspect I didn’t see covered in the paper was how much financial, not just performance, cost Dapper imposed. I know Datadog and other tracing SaaSes are extremely expensive even with heavy sampling, so I’m curious what the log retention costs looked like even with the adaptive sampling.
The paper goes into detail about how the Ads team onboarded to Dapper along with some other anecdotal experiences, but the most interesting use case they mentioned was how Dapper was used for security audits. Application-level annotations were made opt-in explicitly to avoid leaking PII by logging function values in spans. In addition, the span data in BigTable can then be used to automatically verify and enforce security policies: for instance, ensuring services don’t unintentionally share private information with one another. I thought this was a super cool use case and tied in well with the vague aesthetics of what I’m researching now.
I found the case studies and future work proposed in the paper quite interesting and applicable too — there are neat ideas for how tracing could be applied in general to other services. My mysterious spans from this summer were no more elucidated, but at least I understand how to implement such a tracing system now.
19. It’s About Time
Temporal is a relatively famous runtime tool that executes functions and workflows durably, guaranteeing that code will keep running even if it crashes midway and abstracting away reliability and recovery work. I’d heard about Temporal both from working at Stripe and while at OpenAI — both times I sat next to another intern working on a Temporal-related integration project. I’d gotten the gist of “workflows are durable” but had always wondered how exactly it made good on its code replayability promises.
Temporal consists of a cloud service, operated either on-prem via their open source platform or by them, as well as worker processes. The worker processes actually run the functions, and the Temporal server choreographs these executions. One of the main Temporal abstractions is Workflows, which manage their own local state, execute deterministically, and can communicate with other Workflows via message passing. The core promise of Temporal is that Workflows are run exactly once and to completion, even if this takes years and in the face of extensive network or server failures.
The other main abstraction is the concept of Activities, which are normal function executions that complete a single, well-defined task, like fetching from a database and so on. Activities may be retried repeatedly, requiring idempotency keys for side effects. There can be code outside of Activity steps, but anything that might fail or otherwise requires interaction with the Temporal Service needs to be in an Activity. When Activities complete, the worker sends the result back to the Temporal service in an event.
It’s these events that build up an Event History, which is durably persisted, presumably in the DB that accompanies the Temporal service. Besides tasks finishing, many other lifecycle events are recorded and logged. This Event History can then be replayed to restore execution state in case of failures. Temporal considers workers (and hence the Activity tasks they’re running) to have failed if they time out, and will automatically retry the task from that point.
To continue executing code, the new worker requests the Event History from the Temporal Service. It can rerun code directly, since Workflows are deterministic, so they don’t actually have to store or track state anywhere fancy. This deterministic piece is key to ensure new executions match existing event history. When the new worker reaches an already-executed Activity, it can look back in the Event History to find the task’s completion event and directly use its result, instead of enqueuing a new Activity task to the Temporal Service. As the docs emphasize, the Worker thus doesn’t re-execute the Activity, so there’s no way execution results can differ on this new worker. When the new worker encounters an Activity that wasn’t completed (i.e. the site of the crash), it can start executing code live and enqueue new Activity tasks.
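Here’s a heavily simplified sketch of that replay mechanism. This is not the Temporal SDK, just the idea: an in-memory dict stands in for the durably persisted Event History, and activity names stand in for event IDs.

```python
# Not the Temporal SDK: a heavily simplified replay loop where an in-memory dict stands in
# for the durable Event History (keyed here by activity name; real histories are per-execution
# and keyed by event IDs).
history = {}

def activity(name, fn):
    if name in history:        # completed before the crash: reuse the recorded result
        return history[name]
    result = fn()              # first un-replayed activity: execute live...
    history[name] = result     # ...and record its completion event
    return result

def workflow(order_id):
    # Workflow code must be deterministic so a replay takes exactly the same path.
    charge = activity("charge_card", lambda: {"charged": order_id})
    label = activity("print_label", lambda: {"label": f"pkg-{order_id}"})
    return charge, label

workflow("o-1")          # original execution records both activity results
print(workflow("o-1"))   # "replay" on a new worker: results come from history, nothing re-executes
```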
Temporal surfaces its internals less prominently than other dev infra companies I’ve written about here. I had to go digging quite far in the docs (on a language-specific SDK page!) to get to a satisfactory level of detail regarding how it maintains durability, for instance. While I think Temporal could do a better job at centralizing and clarifying their design deep dives, it’s already popular enough that this probably isn’t holding back adoption. Swyx, the former Head of Developer Experience, also articulates that Temporal is intended to be an iPhone moment, so maybe they’re explicitly not optimizing for people to look at their internals and instead focusing on the experience and ubiquity. (I actually recommend this iPhone post as a precursor to reading the Temporal docs since it builds better intuition about how you’d naturally come to a similar solution.)
It was quite neat to finally learn about a platform I’d heard so much about from fellow interns. I’d like to build something with Temporal one day just to feel the magic myself — hopefully I’ll have a chance with an upcoming class project soon.
20. Who’s D Boss?
This last summer in the Bay, I was finally able to attend the mythical SF and South Bay Systems meetups, events I’d followed for a while from the sidelines. I managed to go to the South Bay Systems meetup where Andy Pavlo spoke, and I noticed one of the organizers was wearing a DBOS shirt. My friend (another organizer) had mentioned DBOS offhandedly to me in DMs a while back as well. The original vision sounded incredibly interesting: a database-based operating system where you can freely log and inspect system state as just another table would be a field day for Quantified Self, local-first, and malleable software folks alike. Now, DBOS has evolved from its research roots into a cloud company that supports durable functions as a service.
If you read yesterday’s post on Temporal, you’ll see why I grouped these two together. Luckily, DBOS folks have put together a helpful couple paragraphs to differentiate the two, which is useful for getting a better sense of what DBOS does in terms of what we already know about Temporal. Both provide durability, but Temporal is a framework that requires re-architecting your application around its concepts of Activities and Workflows, whereas DBOS enables drop-in annotations. DBOS is limited to Postgres and supports fewer languages than Temporal, but their blog post describing the differences is quite compelling. In particular, they mention how durable code is usually at the heart of business applications and hence more expensive and risky to refactor or segregate out. The library approach that DBOS champions seems easier to integrate and avoids depending on yet another orchestrator service.
The DBOS library provides abstractions for workflows, queues, and messaging, backed by a Postgres database, described here. Workflows are annotated in existing code and checkpointed into Postgres (dbos.workflow_status). As with Temporal, workflows are required to be deterministic; DBOS workflows also need to be idempotent, whereas Temporal merely suggests it. Workflows can receive messages in stream topics. Workflow recovery is similar to Temporal: inputs and outputs are stored in the DBOS Postgres, and a recovering workflow is re-run with its stored inputs, reusing checkpointed outputs until it reaches a step with no checkpoint, at which point it continues executing normally from where the system crashed. This poses interesting issues when the code needs to change, as inputs and outputs may be malformed for new versions, so DBOS also tracks the application version and only recovers workflows with matching versions, otherwise putting the burden on developers to manually remap tasks safely.
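As a purely hypothetical sketch of that checkpoint-and-replay idea (not the actual DBOS API or its dbos.workflow_status schema; the table name, functions, and the use of sqlite in place of Postgres are all mine):

```python
import json
import sqlite3  # stand-in for Postgres so the sketch is self-contained

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE step_outputs (workflow_id TEXT, step_idx INT, output TEXT)")

def run_workflow(workflow_id, steps):
    """Re-run a workflow, reusing any step outputs checkpointed before a crash."""
    done = {idx: json.loads(out) for idx, out in db.execute(
        "SELECT step_idx, output FROM step_outputs WHERE workflow_id = ?", (workflow_id,))}
    results = []
    for idx, step in enumerate(steps):
        if idx in done:              # checkpointed earlier: replay the stored output
            results.append(done[idx])
            continue
        output = step(*results)      # first un-checkpointed step: execute live, then checkpoint
        db.execute("INSERT INTO step_outputs VALUES (?, ?, ?)",
                   (workflow_id, idx, json.dumps(output)))
        results.append(output)
    return results

print(run_workflow("wf-1", [lambda: {"user": "kewbish"},
                            lambda prev: {"welcome_email_to": prev["user"]}]))
```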
DBOS also enables durable queues of workflows. Its implementation is unique: instead of there being separate worker servers, as in Temporal, all application servers can poll for workflows. One key point is that these application servers would be owned by you, as opposed to the separate, Temporal-controlled worker servers. Recovery from these distributed application servers proceeds automatically, detecting when executors may be unhealthy and enqueueing tasks to another server.
Take a brief look through the Postgres schema if you haven’t already — it’s quite intuitive and there’s not anything surprising. This is actually how I found out about the message streaming communication feature in the first place, as I hadn’t seen it in the architecture overview.
To be honest, I was slightly more drawn to the grandiose hopes of a database-oriented OS, but perhaps the current DBOS Cloud offering is a more realistic entryway onto that path (or perhaps just a good funding source for continuing more experimental research). For those interested in other durable workflow runners, I have few links besides the Temporal writeup I did yesterday; for those interested in other speculative “OS” ideas, on the other hand, I quite like Alexander Obenauer’s lab notes and the very aesthetic MercuryOS by Jason Yuan.
21. Where’s Waldo?
I came across the Waldo paper when I was working on Kintsugi: I’d read Emma Dauterman’s SafetyPin paper quite extensively while doing my literature search, so I’d seen Waldo but not gotten around to reading it yet. The combination of databases and secret sharing caught my eye. Aesthetically, I’m drawn to odd experimental databases, and the promise of privacy that was fast enough to approach practical usage was quite alluring.
Waldo is a time series database that protects both the data it stores and the queries that operate on it. Something like this would be applicable in many sensitive fields, and they mention use cases like health data where even queries can reveal PII/private information about a patient. Two more obvious ways to approach this would be to use oblivious RAM or generic multi-party computation techniques, but their latency and cost make them impractical solutions. Waldo instead creatively applies function secret sharing (FSS), encoding the private data into a data structure and using FSS to evaluate queries across multiple parties. Instead of thinking of a SELECT as a SQL query, for example, we can express it as a function that returns 1 for certain input values, and this more mathematical rendition is what’s shared between nodes. Waldo also supports more index types than prior work: additive aggregates over multiple predicates and arbitrary aggregates over a time interval are both basic features in typical databases that Waldo is also able to provide.
If you just take FSS for granted and squint through the rest of the paper, the building blocks are themselves already really neat. For example, FSS usually deals with public data, since the servers evaluating the function need the same input. Instead, Waldo splits the data up structurally into a one-hot index encoding per feature. Each possible data value gets a row, with the entry (x, y) having value 1 if record x had data value y. Each entry gets multiplied by the FSS evaluation, which is 0 if it’s not at the right row anyways, so in the end Waldo can hide the data value as well, with the added upshot of cheap appends. To combine multiple predicates, Waldo takes vector FSS outputs, which can be more easily combined with another predicate’s results. To support arbitrary aggregates, Waldo lets the user define an aggregation tree that is itself FSS’ed to hide both data and query.
Rather than just taking FSS for granted, though, I thought it’d be a good chance to build on my prior forays into secret sharing and learn about FSS itself. This paper introduces the idea. While I’d expected FSS to be for sharing typical polynomial “math” functions, for which you can simply Shamir-secret-share the coefficients of the polynomial, it turns out the paper focuses on distributed point functions f_α(x), which are 0 everywhere except at a point (α, β). Instead of doing the typical Lagrange interpolation, they turn to a tree model. A dealer generates n keys, k1 through kn, and for the FSS evaluation function Eval, Eval(k1, x) ⊕ … ⊕ Eval(kn, x) = f_α(x). Each key has a root seed which determines the structure of a pseudo-randomly generated tree. Each key’s tree is the same, except along the path to α, letting the final ⊕ operation produce a value at α. To make the trees agree on all other paths, the dealer generates XOR masks, called correction words, which are also stored in the keys. The pseudo-random generation makes each tree (and thus key) look random on its own, but with all the shares the function can be evaluated. You can use this same tree idea to construct decision trees (i.e. if conditions) that can be privately evaluated as well, which I assume is what Waldo uses behind the scenes. I skimmed over the rest of the FSS math in the paper covering how the DB features are implemented because I didn’t quite have the capacity to wrangle more numbers in my head, but I do want to come back to it sometime!
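To ground the intuition, here’s the strawman two-server version in Python: just XOR-mask the whole truth table of a point function with β = 1. This is exactly the O(domain)-sized scheme that the PRG-tree-plus-correction-words construction above compresses down to keys of size O(log domain).

```python
import secrets

# Strawman two-server DPF: XOR-mask the whole truth table of f_{alpha,beta} with beta = 1.
DOMAIN = 16            # f is defined on inputs 0..15
ALPHA, BETA = 6, 1

table = [BETA if x == ALPHA else 0 for x in range(DOMAIN)]
key1 = [secrets.randbits(1) for _ in range(DOMAIN)]   # uniformly random on its own
key2 = [t ^ k for t, k in zip(table, key1)]           # also looks uniformly random on its own

def eval_share(key, x):
    return key[x]

# Each server evaluates its own share; XORing the outputs recovers f(x),
# but neither key alone reveals alpha or beta.
assert all(eval_share(key1, x) ^ eval_share(key2, x) == table[x] for x in range(DOMAIN))
```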
I thought Waldo was a super interesting idea, and I spent some time afterward going down a rabbithole of NDSS presentations on FSS being applied in other privacy-aware contexts. I’m eager to see if private databases like Waldo gain popularity as folks start wanting to protect their personal data from ubiquitous LLM agents, or if people in general trust labs to get things right.
I think there was a gap in my background that made the details in the paper a bit harder to approach: like most crypto-adjacent papers it’s quite dense, and even the overview video left me a bit confused. I was drawn in by the FSS application yet had to wrestle with the intuition for it, so ultimately I don’t think I got as much out of this paper as I’d have liked. I want to read more on oblivious RAM and MPC (the two techniques deemed impractical) at some point to contextualize current work, for instance. As well, I might try to implement some FSS stuff: it took me at least a week to really get my head around dynamic proactive secret sharing, so perhaps FSS will come more naturally after some focused time.
22. Pass the Yogurt
Every so often, it’s fun to poke around various distributed research labs’ latest papers. It’s a direct reflection of what a particular microcosm of researchers is thinking most about, and it’s neat to start tracing their interests together. This is how I found this paper and the next: both caught my attention because of the interesting formal twists they proposed. Today, I’ll be discussing Parfait, a collection of frameworks and a modular approach for verifying HSMs; tomorrow’s paper covers a framework to verify distributed systems.
Parfait is a stack of tools that, when chained together, ensures that information doesn’t leak from a supposedly secure system using secure hardware like HSMs. For example, there might be minute timing differences in computation that could leak information about the data a program is processing. Existing leakage models are either not end-to-end from software all the way down to registers and hardware, or don’t scale the iterative refinement process well.
Parfait uses a host of different proof tools and languages: its five stages are an app specification in F*, then a verified app implementation in Low*, which compiles via the KaRaMel compiler to C, after which the verified CompCert compiler generates the assembly, which is finally proved by physical simulation. The hardware is also Verilog-verified. These stages each can catch certain classes of bugs: for instance, software logic bugs are caught at the Low* level, and timing attacks can be prevented at the last hardware stage. Parfait is made up of two sub-frameworks that encompass these stages: Starling comprises the software steps, and Knox2 takes care of the hardware verification. Here, only the app spec and the eventual client driver are part of the trusted computing base.
Between these stages, equivalence is proved using information-preserving refinement (IPR). The IPR from F* to Low* is done via a lockstep proof, and the IPR from assembly to hardware is via proving the existence of an observationally equivalent mapping. This stack seems convoluted, but lets developers leverage existing tools and codebases, like HACL* for formally-verified cryptography. It also makes Parfait more realistically applicable rather than relying on custom tooling every step of the way. The examples of each of these proofs and their mappings are very concrete. I thought they got a bit repetitive, but maybe I’m not the right audience because I just trusted it would work.
I liked the case studies they showed for ECDSA signing and password hashing, based on two different processors. The developer-time arguments seem very convincing: both prototypes are implemented in very few lines of F* and were ported to different platforms in a couple of hours. The verification time also seems more or less reasonable — I liked the practical note about reducing loop bounds as a way to reduce verification time.
I also wanted to call out the graphs and figures in this paper! It’s clear a lot of time and thought went into them: they were laid out very neatly, with color-coded captions to match. I’m also just a sucker for nice diagrams with consistent colors and fonts, especially across figures in the same paper. This is certainly something I’ll aspire to in future work.
Overall, this was a very interesting read, especially on the framing level, and I learned a lot presentation-wise from this paper. We just received the reviews back for a recent paper submission and there’s a lot of inspiration I’ll draw from how the verification correctness and performance arguments are made here in our next submission.
23. Miss The Grove For The Trees
Today’s entry is another MIT PDOS verification paper, in a completely different vein from yesterday’s but just as interesting! Grove is a Rocq library for verifying Go distributed systems at the implementation level in a modular, easy-to-reason-about way. In particular, it focuses on time-based leases and the concept of ownership, which prior verification libraries don’t yet support. For example, Grove can help verify read consistency in lease-based systems that avoid coordination, ensure the durability and crash recovery of systems, and check that reconfiguration correctly waits for leases to expire.
Grove is based on concurrent separation logic (CSL), which is a very intimidating term for a program logic that reasons about threads while treating subsystems as modular black boxes. This helps “separate” concerns — one thread doesn’t have to worry about other concurrently executing code to prove its own invariants. Some core concepts include assertions, which ascribe ownership onto something (e.g. this thread has control over this object) and thus enable modularity because they limit the layer of concern to just this thread. As well, there are ghost resources, which can span multiple nodes, providing a higher level of abstraction, but which can still be owned by specific nodes. An example of this might be the epoch object, which can be read by all nodes but is owned by a central reconfiguration service. CSL proofs take some input state, apply a specific function, then assert some output state, which makes the logic a natural fit for verifying implementation-level concerns.
I think it was a bit unclear what Grove was at first, without having read some of the other PDOS verification papers and without knowing what CSL was. I understood that Grove was a “library”, but as an implementation-driven person I was wondering if it was a Go import, a new DSL, or some new IDE. It becomes clearer only near the end of the paper in the implementation section that Grove is indeed a Rocq library. It’s built off of their prior Perennial library, which is itself built on Goose, which converts a subset of Go to Rocq for verification. Maybe it’s obvious from context if one has a bit more verification background, but I was a bit lost initially.
I found the library’s CSL abstractions quite interesting, though I admit I skimmed over them. Grove supplies special per-node assertions, network and file content resources, and time-bounded invariants for lease assertions. In general, I thought their explanation of the “separation” bit and the composability was compelling: proven ownership enforces well-behaved code, since a thread can’t make a mistake and trigger some other unintended behaviour in another thread via a shared RPC.
Their demo system, vKV, was also quite impressive. It requires about the same proof-to-code ratio as prior work. It’s also very high performance: 67-73% of Redis’s performance, which is crazy for verified code. I like that they specifically mentioned how implementing leases into the library and the system improved performance. For instance, read-only operations could then bypass the expensive run-exactly-once machinery — an optimization I’ve used in my own prior work.
I had two other main questions while reading the paper:
- Why not just use TLA+? I’m regularly surrounded by TLA+ folks and have been working somewhat adjacent to the ecosystem for a while so naturally thought of it when I saw “verifying distributed systems”.
- The paper points out, though, that Grove aims to catch implementation-level bugs, not just the abstract modelling-level ones TLA+ is usually used to reason about. I’d been thinking about PGo so long I forgot we had the implementation translation to worry about.
- Why are liveness properties not supported? This is explicitly called out in the paper but alas it was never explained why it was out of scope.
- Maybe they had Grove research separate from their other work on liveness, and only started the liveness piece more recently?
Just like Parfait, I liked some presentational aspects of the paper, particularly how they present modularity and composability. The intro has specific callouts for different audiences (systems vs. verification folks) and how they could get value from the same paper. It was nice framing that helps with papers like this that straddle both fields, since each field’s norms and lenses are a bit different. It would have been nice to interleave some Grove excerpts earlier on and clarify what it was to the unwitting reader, but I also appreciated how they had motivating examples and built intuition up slowly upfront. Overall, plenty of inspiration to be drawn from the writing style of Grove as well!
24. The (Fizz)bee’s Knees
In my past research, I’ve tended to work adjacent to formal verification (implementing a specification, writing libraries for specs) while never actually doing any of it. I’ve tried my best to pick things up over time, but specification languages like TLA+ still seem verbose and intimidating, far removed from the imperative code I’m used to writing. Andrew Helwer wrote a post about how companies can see the value of TLA+ but still not want to invest in engineers themselves working on the specs or to continue iterating in-house after a contract’s over. So it seems existing formal methods tooling is scary, hard to pick up, and difficult to truly integrate into the development cycle. Fizzbee, a Python-inspired specification language to visualize and test distributed systems, aims to make formal verification more accessible.
I’d first heard of Fizzbee via LinkedIn: for some reason, its creator JP had sent a connection request. I have a policy of declining connection requests unless I directly know the person, so I declined as I hadn’t heard of JP at the time (sorry). The formal verification mention in their headline caught my eye, though, and I remember clicking through to the Fizzbee website for a skim. This last summer, I saw Fizzbee mentioned again at a couple of Bay Area systems meetups, so I resolved to include it when building this reading list.
At first glance, the Fizzbee syntax and quickstart guide feel refreshing. There’s a certain mental barrier that comes with reading through TLA+ disjunctions and fairness operators: it’s like my eyes are glazing over of their own accord. But Fizzbee’s syntax is recognizable and intuitive, reading like plain Python. The website is also designed cleanly, which beats having to page through a scan of Specifying Systems.
Fizzbee has a couple main concepts, including roles, actions, functions, and invariants. Roles seem a bit like classes and can encapsulate ephemeral state. Actions are intended to represent external triggers, while functions are called by actions or other functions. Both are vaguely reminiscent of TLA+ (maybe I just can’t think of anything else when I see Init) but read more like Python functions. The language supports something like list comprehensions with a for syntax, but requires an atomic keyword to prevent state space explosion from the loop iterations. The invariant definitions can be used to specify liveness and safety properties as expected. In particular, the fairness keywords seem nicely intuitive: instead of <>[] we have eventually and always. Each Fizzbee spec also has some frontmatter that can be used to constrain state exploration, like limiting the maximum number of executions of an action. It seems like some care needs to be taken when writing specs to avoid explosion: there are plenty of warnings and information panels littered throughout the guide.
JP has an interview with Chris Riccomini on Materialized View, where he makes a similar point as Helwer above: no one really wants to read TLA+ because it’s too elusive. Even if you can convince your team (and yourself) to go through the upfront modelling, it’s rare for enthusiasm to stay high and for specs to stay relevant given the high mental load. I think this is where Fizzbee’s similarity with Python really shines. This is great for adoption, since the amount of new syntax you have to learn is limited, and the spec can stay relatively close to the code. If I’m writing a spec, thinking about system properties and invariants is draining enough; I don’t want to have to remember if <>[] or []<> was the fairness property I intended. This is touched on a bit in the interview, but I wonder if it’ll be easier to get LLMs to help with writing specifications given an implementation, since models likely have plenty of Python training data but lack context for TLA+ or other more niche tools. On the flip side, I also wonder if the slight differences from Python will trip LLMs up in an ironic case of the uncanny valley.
I also wanted to highlight Fizzbee’s interactive tooling, which also seems helpful for adoption. Fizzbee can automatically generate sequence diagrams and system visualizations that can be stepped through incrementally. This makes it easier to articulate traces and counterexamples instead of trawling through a TLC log. All this happens in a web playground so users don’t have to futz around with TLC and Java. Some web tooling for TLA+ exists, like Spectacle, but I really appreciate how visualization is a first-party concept for Fizzbee.
Fizzbee’s really impressive work for what seems to be a solo dev. The first time I read the landing page wasn’t that long ago, but it seems like a lot of progress has been made since. I’m eager to see what features crop up next and how adoption plays out — perhaps I’ll give it a try myself for the next systems paper I read.
25. Look Mom, No Hands
I thought I’d wrap up this calendar by briefly shilling some of my own recent research. It’s a bit lighter reading and a good opportunity to checkpoint my progress.
Earlier this spring, I’d settled on my bachelor’s honours thesis idea. My PhD mentor’d been working for a while on PGo, a tool to automatically compile deployable distributed systems from formal specifications in a dialect of TLA+. He’s also been working on a related TraceLink project to map execution traces to the underlying model, which ensures neither deviates from the other. This work gets us a “model → code → traces” pipeline. My thesis was meant to close the loop by turning traces and code into formal specifications. These could then be compiled, and we’d get this virtuous cycle of proven-correct nirvana.
At a first glance, this seems like an intractable synthesis problem. With traditional static techniques like Rosette or static slicing through traces, this would be. But it’s 2025, NeurIPS and ICLR are exploding at the seams, and I’d be interning at a large American AI startup. The solution, of course, would be to throw LLMs at it and see what stuck. I jest, but not really. The model checker or proof assistant can indicate whether our generated spec is internally consistent, and with trace validation we’ll know that we’re not hallucinating orthogonal to the codebase too. This seems like a good start.
Verification is a really interesting domain for me because of the rigor and practical benefits that formal guarantees supply. These boons are generally recognized by companies big and small, who agree that formal methods would be great if only they didn’t require so much manpower5. This seems like a good place to be in: we’re limited by effort and ROI calculations, not for want of demand. Supposing we manage to significantly reduce the cost and burden of maintaining formal specs alongside code, it seems like even casual, curious adoption would see an uptick. Even putting aside the question of whether we should specify software and how much it’d really help, making verification a viable option to consider is a clear net positive. Right now, it feels a bit like this academic luxury, bottled AOC in eastern France and Switzerland: I think there’s low-hanging fruit to improve accessibility.
With all these grand motivations in mind, I set out to implement an agent and understand the current SOTA of TLA+ capability. By September, I was ready to triumphantly present my shoddy loop to my supervisor as a first step. The slight catch: someone else had done it already, and done it better. Over the summer, the TLAi+ Challenge had been taking place, and unbeknownst to us, a team from UIUC and NJU had won first prize for an agent to convert source codebases into TLA+ specs with thoughtful integrations of existing tools. Rats.
This story ends happily ever after, though. My supervisor sent an email regarding collaborations, and a week later we found ourselves in a Zoom call chatting about next steps. We’ve been making progress based on their initial submission, Specula, and we’re currently refactoring it to provide better feedback to the agent. We’ve also recently put out a benchmark, SysMoBench, measuring stock LLMs on their ability to produce syntactically correct and runtime-meaningful specs. The TL;DR is that they don’t do very well, even just considering the issue of generating valid syntax, and there’s still a ways to go before generating trace-validable specs becomes commonplace. You can see a preview of this work on arXiv. I’m really excited by this line of research and hope to have more results to share by the time I submit my thesis.
Anyways, this concludes this edition of my Advent CAPlendar (if you’ve missed a post, consider checking out the others here). I’ve quite enjoyed getting exposed to so many interesting papers and ideas through this project. I hope to be able to keep up the habit, if a bit less intensely, throughout the rest of next year: compiling these posts has already given me so many new branching off points and further readings. We’ll see how high my TBR papers list stacks up, and if I’ll have a more focused line of inquiry, but for now I’m planning to do another calendar next year. Happy holidays, and til then!
1. After spending a while doing infra work, it was refreshing to go back to my frontend roots and put this together! I learned some new tricks: did you know CSS has counters? Did you know you can set a separate header ID or that you can have slashes in them? ↩︎
2. A mentor’s nickname for the larger community of researchers, typically in Europe, working on CRDTs and sync technology. ↩︎
3. There’s a joke in here somewhere about the Indian boar romping about on the cover of Designing Data-Intensive Applications and zookeepers needing to wrangle them… ↩︎
4. See other examples of Andy Pavlo’s banger titles on his papers page. ↩︎
5. See this post and this post to get a sense of adoption issues from folks much closer to the “convincing people to do TLA+” side. ↩︎