‹ go back

An Advent CAPlendar.

Published 01 December 2025 at Yours, Kewbish. 7,835 words. Subscribe via RSS.


Introduction

For a couple years in high school, I dedicated my Decembers to Wastl’s Advent of Code, a daily series of Christmas-themed programming puzzles. The regular cadence of problems kept the completionist in me hooked, and the problems were satisfyingly challenging beyond my ability. I lacked the background to make it through most days after the first week or two on my own, so I’d sheepishly poke around the subreddit the next morning for hints. Through AOC, I learned a lot of Python builtins like counters, zips, and list comprehensions, and it was always a fun activity to look forward to during December.

Last year, I found out about a similar project called Advent of Distributed Systems. AODS breaks up the Fly.io Distributed Systems challenges into weekly miniprojects, aiming to get you through to the final Raft KV store by the end of the month.

This fall, I realized I had a growing paper backlog and a bucket list of startups and technologies I’d like to get around to reading someday, but never made the time for. Inspired by the subject matter of AODS and the cadence of AOC, I’m putting together my own holiday countdown.

This post is my 2025 Advent CAPlendar, exploring topics in distributed systems (hence the title pun), databases, formal verification, and security. I’m planning to read some foundational papers like DynamoDB and Zookeeper as well as some newer works out of MIT PDOS and Berkeley SkyLab. There are also some neat systems startups and protocols that I keep hearing about, so I’m also planning to glean what I can about how they work from their documentation. You can see a calendar of topics below1.

2025 Advent CAPlendar

I’m also trying something new and starting a popup newsletter to accompany this series. Signing up will get you a short weekend roundup list of the posts that week. To assuage the privacy-conscious: I’ll only use this for Advent CAPlendar-related posts, I’ll be running this by hand (no external data processors, etc.), and I promise I’ll be too busy studying for finals to think about spamming you otherwise!

1. The CAP and CAL Theorems

I figured it’d be most fitting to start this calendar by referencing the titular pun itself. The CAP theorem is idea that a distributed system can only satisfy two of the following three guarantees:

  • Consistency: reads return the most recently written value. This behaviour is also known as the linearizability consistency level.
  • Availability: every request by a non-failing node receives a response. Note that this doesn’t mean every request receives a response, which I think is the typical definition of uptime via high availability; it’s predicated here on being received by a non-failing node.
  • Partition tolerance: the system keeps behaving as-is even as arbitrary nodes fail.

The idea is that if you have a network partition, you can decide to either eschew consistency and just return responses, even if they’re not the latest values, or stick to your guns and not respond to requests, sacrificing availability.

The CAP theorem got picked on a bunch. The most obvious argument is that the pithy “pick 2 out of 3” framing conveniently sidesteps that you can’t ignore partition tolerance, or that you might end up with less than two properties. As well, if your system doesn’t fit the definitions of consistency (e.g. isn’t linearizable), the CAP theorem shrugs its shoulders and gives up — it won’t really tell you anything. Martin Kleppmann’s paper also highlights some vagueness in definitions in the original CAP proof. This is summarized in Dominik Tornow’s blog post, which illustrates how the conjecture and proof are a bit like passing ships in the night given their definitions are a bit off.

The CAP theorem could be a good rule of thumb, but I think its variants that focus on latency instead are far more interesting. PACELC is an extension of CAP, stating that under Partitioning, you have to choose between Availability and Consistency, Else you’ll have to make the tradeoff between Latency and Consistency. This makes sense, since in the absence of partitions stronger consistency models will still impose more acknowledgements and latency.

Kleppmann’s paper goes further, modelling availability in terms of operation latency to match more industry-standard intuitions of uptime SLAs. “If a service can sustain low operation latency, even as network delay increases dramatically, it is more tolerant of network problems than a service whose latency increases.” sounds easier to compare to real systems. He also defines some operation latency lower bounds in term of network latency at different consistency levels which clearly match intuition.

Almost a decade later, Edward Lee takes this even further, defining the CAL theorem. This trades off against Consistency, Availability, and network Latency, but at a more gradual level than the all-or-nothing assumptions of the CAP theorem. The CAL theorem subsumes the CAP theorem, which is defined as a special case. It’s a pity this isn’t more widely known, otherwise I’d have a cleaner title for this Advent series.

I think the neatest part of Lee’s paper is the Lingua Franca framework he introduces. It’s a coordination language for distributed, realtime software for cyber-physical systems like autonomous vehicles. The key feature I picked up on was the fault handlers that trigger when you hit some apparent latency bounds, in which you can specify callbacks and explicitly define whether you’d like to relax your consistency or availability properties. If you choose to sacrifice availability, it provides centralized coordination primitives; if not it provides decentralized coordination with error bounds on clock synchronization to relax consistency. Lee has a video tutorial for how to do so here. It looks like a nice abstraction layer to keep the CAL theorem top of mind while developing a distributed system.

I think the most valuable part of the CAP/CAL theorems for me is the quick gut check mental model. The CAP theorem appeals to instinctive sense. I also had the opportunity to attend Lee’s Distinguished Lecture at UBC, and it was just as clearly laid out and intuitive. Writing this up has highlighted some of the caveats of each model, though, which I’ll need to remember to keep in mind.

2. I’m FLPping Out

The CAP theorem is one of the classic impossibility results in distributed systems, and back at DARE last year I learned about another: the FLP. Unlike the CAP theorem, FLP’s initialism doesn’t serve as a neat mnemonic: it stands for Fischer, Lynch, and Paterson, each very storied researchers that I’d recommend checking out if you have some time.

The FLP states that every asynchronous consensus algorithm may not terminate if there’s at least one silently faulty process. Asynchronous in this case means that we can’t bound the amount of time it takes to pass messages, and the paper also assumes no access to synchronized clocks (this would make things easier with timeouts.) Consensus means that all non-faulty nodes decide on the same value, like 1 in the set {0, 1}, for instance. If you rotate the FLP on its head a bit, it shows that it’s impossible to determine if the system will never terminate and has failed or if latency’s just high.

They model consensus via one big message buffer system, shared by all processes and requiring explicit send/receive notifications. They define a configuration as the set of processes running, their internal states, and this message buffer. A reachable config is a config with some series of messages being applied on top. They outlaw protocols that always decide either 0 or 1, and define a concept of ‘partial correctness’ that means all configurations have exactly one decision value, which could be null, and some reachable configuration will eventually decide 0 or 1. I’ll hand-wave a bit below so look into the paper if you feel unconvinced.

The proof proceeds as follows:

  • Lemma 1: Given two configurations that have disjoint sets of processes, you can replay the events of one onto the other and vice versa and you’ll end up with the same final configuration because the sets of processes doing work are entirely distinct.
  • Lemma 2: Any such protocol has an initial configuration that can eventually lead to a configuration that’s decided 0 and a separate configuration that’s decided 1 — this is called an initial, bivalent config. Bivalent means we can’t decide on either 0 or 1, which is bad news.
  • Lemma 3: From a bivalent config C, applying an event to any configs reachable from C will still result in at least one bivalent config being reachable. You can see where this is going now.
  • Lemma 4: If we have to start with a bivalent config, and can only get to states where we still have some bivalent configs, we’ll never get into a config that only has univalent configs that have decided on other values. This means there can’t be such a protocol that always terminates.

The paper also includes an example protocol that does achieve asynchronous terminating consensus if we assume a fixed number of live processes and that processes can’t die and become faulty during consensus. This consensus protocol creates a clique of known live nodes, where each node uses information received from only the other clique members to deterministically decide. This does terminate but is significantly more constrained.

I like that in the conclusion they mention “These results do not show that such problems cannot be “solved” in practice; rather, they point up the need for more refined models of distributed computing […]”. It’s quite a poetic takeaway. Something similar arose with the extensions to the CAP theorem, and indeed I can find plenty of recent work still iterating on modelling the problem and defining the impossibility boundary: here, here, and here, just to name a few.

The FLP paper itself is very short and approachable, so I’d recommend giving it a direct read! The definitions aren’t too onerous to wade through and the contradiction structure follows quite intuitively. The paper is quite theoretical and technical, but I need to get better at understanding and eventually producing work like this to do more convincing research.

3. Consistency Canvases

In my first week of my internship at the Sorbonne, I was talking about consistency models with my mentor, who promptly whipped out an annotated, thick paper with crazy-looking diagrams illustrating the different consistency anomalies given certain models. He offhandedly said he could send it to me so I could give it a read later, but just the brief glimpse of the diagrams scared me off enough to never press the topic while I was in Paris. However, I was thinking about consistency again while brainstorming ideas for this calendar, and this certainly feels like a paper I’d put off, so I asked him if he still remembered which paper it was. I was uncertain how to feel when the PDF landed in my inbox, but I wanted to keep an open mind.

The terrifying behemoth in question is Cerone et al’s A Framework for Transactional Consistency Models with Atomic Visibility, which I was happy to learn was only 14 pages (not a behemoth at all!) It formally defines consistency models (read atomic, snapshot isolation, etc.) for transactional systems. Prior work on formalizing models was one-off and scattered, “often tied to database internals”. Cerone et al. present a more unified, general approach that can be used for higher-level reasoning about applications using said transactional systems.

They define six axioms:

  • INT (internal consistency): Within a transaction, sequential semantics are respected, so reads observe the most recently written value, avoiding unrepeatable reads.
  • EXT (external consistency): A transaction observes the maximum value determined by the visibility and arbitration relations, avoiding dirty reads.
  • TRANSVIS (transitive visibility): If a transaction T1 is visible to another, T2, and T2 is itself visible to another, T3, T1 should also be visible to T3.
  • NOCONFLICT (no concurrent updates): Two concurrent transactions may not both modify the same object.
  • PREFIX (prefix consistency): All transactions become visible in the same order defined by arbitration order; if T1 is observed by T2, T2 also observes all of T1’s arbitration relation predecessors.
  • TOTALVIS (total visibility/serializability): The visibility relation is totally ordered.

The six consistency models they define all satisfy some of these axioms. For example, read atomic, the weakest level, only satisfies INT (internal consistency) and EXT (external consistency); snapshot isolation, the strongest non-serializable model, satisfies PREFIX (arbitration order prefix) and NOCONFLICT (no concurrent modifications) as well. In order, the six models they define are read atomic, causal consistency, parallel snapshot isolation, prefix consistency, snapshot isolation, and serializability.

The diagrams I saw were the illustrations of anomalies that can occur if certain axioms don’t hold. These are neatly formatted on page 6, which alongside page 10 lay out the key ideas of the paper. For instance, if TOTALVIS (total order/serializability) is not satisfied, then write skew can occur. This means that modifying two different objects (e.g. withdrawing from two separate accounts) while expecting a shared condition to hold (e.g. total balance over some threshold) can violate the condition unless there’s some total order. These diagrams were much clearer than I’d thought — they use concrete examples like bank accounts or posting comments to make the anomalies more concrete.

Interestingly, their work also outlines some more implementation-driven algorithms that satisfy their framework’s models, “closer to the intuition of practitioners”. They prove how these operational specifications align with their axioms and properties. On a meta-level, they’re structured a bit like the ones in an ICDT submission we made for specifying database backends — straightforward and ready to be directly implemented. They define a set of constraints separate from the model axioms earlier in the paper, and present a similar mapping of consistency model to constraints needed. Each of the original axioms follows from some combination of these constraints: for example, the transitive visibility axiom follows from causal delivery because replicas will only receive a transaction’s logs after receiving its predecessors.

As I’d mentioned yesterday, I want to improve my appetite for theory-heavy work, so reading this paper was a good chance to exercise that skill again. It certainly wasn’t as intimidating as I thought it’d be. The consistency diagrams look complex at first glance but are fairly straightforward even with the formal lens. I’m itching to get back to some more applied research, though, which will be the focus of the next few days.

4. CRDTs Don’t Grow On Trees

On the topic of consistency, I wanted to read about some CRDTs: causal trees and replicated growable arrays. Both are very similar ideas for ensuring that concurrent edits to a doc (e.g. two friends working on a paper while both offline) are resolved in an orderly, meaningful way. Their basic variants operate on a per-character basis, but more advanced variants support editing ranges of text.

The main idea with causal trees is that each user is assigned an entity ID, and an idea similar to Lamport timestamps is used to track time. The start of the text is the root of the tree, and when a user adds a character, an edge is created from the character’s causal predecessor. If another user concurrently edits, the tree forks at the common causal predecessor and continues branching. A user’s timestamp, which is attached to each character added, increments on each character added and is the maximum timestamp of all merging branches. Intuitively, this means timestamps represent how much content a user has seen before they insert something else. I’d highly recommend checking out the visualizations on Farley’s post for an interactive look.

To merge branches, the head events are first sorted by decreasing timestamp (since this means the user saw more recent content as they were making their edit), then by decreasing entity ID (to deterministically create a total order). To recover the final text, we then take a depth-first pre-order traversal of the tree.

Deleting text uses a common CRDT strategy called tombstoning, where the inserted character is not actually removed from the tree, but just marked as a tombstone and skipped on final rendering. This is because if you hard-delete the node in the tree, no matter what design decision you take, you’ll end up with tradeoffs. Surely you can’t just delete the children of the node along with it, so this means you’ll have to reattach them somewhere. Reattaching to the root of the tree is easy and deterministic but doesn’t make semantic sense (characters would jump to the start of the text). Reattaching to the deleted node’s parent sounds like a good idea, but if there are concurrent operations, things might end up out of order across clients if the immediate ancestor hasn’t been received yet. Sort order would also change among the ancestor’s siblings. All this means it’s just easier to preserve the structure of the tree and skip the node later.

The original causal trees paper goes into more detail suggesting a data format based on Unicode to represent the timestamp numbers with special characters like ॐ representing the root of the tree. It gets pretty crazy: there’s a diagram for how each data format converts to others and what methods combine them. Said methods are implemented not in a typical programming language, but via PCRE regexes. The author makes a fair point for this: they assert that higher-level languages like JS would have too much overhead creating individual objects for each character due to GC, hence the need for a text-only format and portable regex manipulation. I enjoy a cheeky regex puzzle from time to time, so it was really neat seeing regexes be pushed here, but this was a bit of an arcane choice. There are also a lot of fiber-arts-related names for terms in the paper, which makes it a bit challenging to initially parse (in my head “weave” and “weft” occupy too similar of a semantic space).

There’s another very similar CRDT called a replicated growable array. Instead of a tree, though, it’s a flat list that’s easy to read back. Each edit is still tagged with a timestamp and entity ID. For concurrent edits, the insertion pointers are offset for each concurrent client. In practice, it’s faster due to simpler traversal and implementation, so RGAs end up being used in production libraries like Automerge and Yjs.

Thinking through why tombstones were necessary over raw deletions was a fun challenge, and I also liked coming to terms with the odd vocabulary of the causal types paper. Reading it first made understanding the RGA paper much easier, so I’d recommend this order. Neither paper has overwhelming amounts of formalism either, which I appreciated. I got surprisingly interested in these couple papers now that I’ve gone through the basic theory of CRDTs a few times, so I’m looking forward to continuing this trend a bit in tomorrow’s writeup.

5. A Fugue State

Yesterday, we discussed causal trees and replicated growable arrays, two text editing CRDTs. Both CRDTs have a problem, though: while the concurrent edits neatly resolve for forward edits (e.g. typing as you normally would), what happens for backwards edits (e.g. typing a character, then moving your cursor to the start of the text, typing another character, and so on)? This is an issue called interleaving: both causal trees and RGAs are fine with forward interleaving but are susceptible to backward interleaving. If we think of typing “bread” backwards with a causal tree, we’d end up with a shallow tree, with leaves for each character. If another, offline user concurrently types “cakes” backwards in the same document, we’ll end up with the same shallow tree situation, but the text will be interleaved since the timestamps are the same. You’ll end up with “bcraekaeds” — not quite ideal. Matthew Weidner and Martin Kleppmann’s Fugue paper tackles this with a new CRDT that limits both forward and backward interleaving.

Causal trees and RGAs prevent forward interleaving because of the tree/list structure preserving “inserted after” information, but there’s no structural relationship that preserves “inserted before”, leading to the backward interleaving. Fugue defines the concepts of “left” and “right” origins respectively: left origins track “inserted after” information and right origins “inserted before”. This comes together in the definition of maximal non-interleaving:

  • if B is the first thing inserted after A (forward insertion), A and B should be consecutive in the final list.
  • if A is the last thing inserted before B (hence B is A’s right origin, backward insertion), A and B should be consecutive in the final list, except where we might need to prioritize another concurrent forward-typing insertion.

Fugue is also represented in a non-binary tree. However, its tree has the unique concepts of “left/right children”: each child of a node must be designated left or right, but there can be multiple left or right children, and left/right children are sorted amongst themselves by ID. On insertion, if the left origin’s node has no right child, then the current insert is made a right child. This handles the forward-typing case. If the left origin’s node already has a right child, the current insert is made a left child of the right origin, handling the backward-typing case. To recover the final text, perform an in-order traversal of the tree.

This tree will more or less stay binary, except in the case of concurrent inserts. In that case, there can be multiple left/right children, but the important thing is that the series of edits based on those left/right children stay together in a branch. Let’s take the “bread” and “cakes” edit example I considered for the causal trees/RGA case. In Fugue, the updates for typing either word backwards would become a branch of nodes, each a left child. Now, merging these two users’ updates would cause there to be two left children branches on the root node, so we’ll get “breadcakes” (or “cakesbread”, depending on the users’ entity IDs) as intended.

Non-interleaving thus holds intuitively, since the concurrent inserts get separated into separate branches. In-order traversal thus makes sure runs of text are consecutive instead of interleaved. Some interleaving is still inevitable, but this happens only in limited edge cases with multiple interacting concurrent updates. In practice, this should rarely be of concern.

The maximal non-interleaving property is so named because it defines a unique order over the operations. This is important, since this means no other non-trivial properties can be proven on top — this interleaving as as strong a statement as you’ll ever make. Kleppmann et al. had previously proposed another interleaving property, which will never hold on any CRDT, and their prior CRDT proposal was proven to not converge in some cases.

As described earlier, Fugue will prevent backward interleaving, but still won’t satisfy maximal non-interleaving in some of these edge cases. The paper thus defines FugueMax, a slightly more complicated variant that sorts multiple right children in reverse order of their right origins instead of by lexicographic node ID. FugueMax is proven to support maximal non-interleaving, but the paper argues the extra complexity isn’t worth it.

In terms of performance, Fugue is fairly comparable to the SOTA Yjs in terms of network bandwidth, load times, throughput, etc. FugueMax fares a little worse but is still reasonable. One thing that struck me was that even Yjs has a lot of metadata overhead, on the order of tens of bytes per character insertion. This put into perspective just how performant modern systems are, if we can afford that overhead and remain perfectly usable.

I really liked the continuation between first looking at causal trees and RGAs, then realizing the edge cases and interleaving problems, then reading Fugue — it was like tracing through each logical step in the frontier of CRDT capability. The way each paper builds intuitively from the last makes the problem space seem very approachable. I think I read somewhere once that (paraphrased) “good papers make you feel like you’re making discoveries” and that’s certainly been true for this series.

6. Braid and Abet

I mentioned how the causal trees paper discussed two days ago made an almost-offputting number of weaving references. Today, we’re picking up the trend again with Braid!

Braid is a project working towards interopable state synchronization, with the goal of making “read[ing] & writ[ing] distributed state as easily as a local variable”. They’re run as an Invisible College community and are actively developing the ecosystem, with an IETF draft, a Braid-based filesystem, Automerge support and example Chrome extension, just to name a few initiatives.

There’s four major aspects of Braid-HTTP:

  • Subscriptions attached to GET requests, for real-time synchronization
  • Versioning via new Version and Parent headers, which extend ETag headers as a better indicator of time (as opposed to a content-based hash)
  • Custom concurrent edit handling via Merge-Type headers
  • More efficient patch updates, ability to patch a range of data

The spec draft lists these advances in a different order, but I’ve listed them in order of what’s most interesting to me. The demo, with very tangible updates, makes the first couple feel very magical, especially knowing all this is done over an HTTP extension rather than custom code.

The Merge-Type is an interesting space for customizability. They have one implementation available with Diamond Types, the fastest CRDTs available, written by one of the Braid spec draft coauthors Joseph Gentle. Besides the performance boosts, from my understanding this is a pretty standard CRDT in theory. They have another implementation called a ‘simpleton’, which is a bit like an OT in that there’s some rebasing of peer edits. The idea is that CRDT-ifying everything is a waste if most data is read instead of written, and most writes are sequential instead of concurrent. Simpleton merging means that all conflicts resolve to the client’s version, and the server does the work of rebasing edits and bringing clients up to speed. Clients ignore all received versions without its current version in the history and send edits as patches to the server. This gets you consistency and simplicity but comes at the cost of clients being half-duplex, unable to read other peers’ updates within a RTT of writing due to the server needing to be in the loop to rebase.

They’ve recently been talking about an unification of OT and CRDTs called the time machine. It’s a “CRDT that implements OT, and is implemented via OT implemented via an internal CRDT”. At first I thought this was vaguely reminiscent of the Eg-walker paper, but then I put two and two together and realized Gentle coauthored both the paper and this spec. In this time machine framework, OTs provide the mechanism for allowing events to “time travel” and be applied out of order to different CRDT versions, and CRDTs let us reason about the different views of clients given distributed time. History is stored in a neutral format, and different peers can use different merge strategies while interoperating since the OT-on-CRDT lets us transform remote histories for local usage.

I’d recommend first reading the Braid landing page and watching a demo video to contextualize what Braid can do, before reading this review of Braid by the causal trees author and the Braid community’s response to go into more detail. The time machine meeting summary can stand alone in case you’re less interested with the rest of the protocol. It’s neat to see that I can recognize names in the space (e.g. Gentle) and have enough of a knowledge base that I can connect across familiar work now. Overall, I’m looking forward to seeing where Braid goes, particularly the time machine CRDT piece!

7. The ElectricSQL Slide

I found out about ElectricSQL when I was in Paris, after looking up a sticker I saw in the lab. Surprise: my supervisor is also a technical advisor there. My first exposure to the company was its team page, where I realized they’d accumulated a good number of the CRDT Mafia2 as advisors and had received angel investments from a couple Ink & Switch-adjacent folks I knew. I figured if they’d managed to convince my supervisor to advise them, they must be working on something interesting.

Electric is a sync engine from Postgres DBs to client apps. It enables live reactivity, like Braid-HTTP, and boasts a wide range of framework integrations, from old favourites like React to CRDT interfaces with Yjs. It syncs subsets of data, which are called “shapes” in their framework, a bit reminiscent of the goals of Gritzko’s RDX. Shapes are defined with a table, with the option to specify a WHERE clause or specific columns — it’s like a database view. The WHERE clauses can tank performance (the docs say perf is inversely proportional to the number of Shapes) unless they include field = constant clauses, which have a special index.

Electric sits in front of Postgres, consuming the logical replication log and massaging it into a so-called Shape Log. The Shape Log is conceptually similar to Postgres’s log, but just filter for the relevant data. Clients will first request the entirety of this Shape Log, which might need to be split up over multiple requests. Electric remaps initial query results to insert operations so the shapes don’t have to be predefined. Once the client has processed the Shape Log and gets an ‘up-to-date’ header response, it switches into Live Mode, repeatedly issuing requests and holding each request open until new data is received or it times out. Clients are responsible for then taking the Shape Log and turning them into materialized data for the app, which can be done via Typescript/Elixir clients or a hand-rolled implementation.

Auth is left to the user to worry about, a nice concession that allows for flexibility. There are two main patterns suggested: proxy-based auth and gatekeeper auth. Proxy-based auth does as advertised: you can add an authorization header to the client’s request, pass it to the proxy, validate it, and pass Electric’s response back if authorized. The gatekeeper auth pattern is a bit more interesting: you’ll need to build a gatekeeper endpoint in your API, which issues Shape-scoped auth tokens. Then there needs to be another authorizing proxy to check the claims of the token against the request parameters, which is less complicated because we’re only checking for an exact match. There are additional implementation details and examples here.

I didn’t poke too hard into the technical details of implementation, but I did scroll down their blog post list and found this neat gem about Rich CRDTs. These are CRDTs that additionally support composition, compensations (side effects), and reservations (escrows fine-grained locking). I couldn’t find additional papers or implementation examples anywhere, but I thought the idea was compelling at least.

This was a good opportunity to dive into one of the live sync tools available — Electric itself lists many alternatives which seem to be clustered around a similar space. I’m happy to see the CRDT/sync/local-first-adjacent community is flourishing with so many up-and-coming products. I hate to bring AI into the mix, but with increasing LLM/“magic” integrations into more and more apps, I expect we’ll want to see more syncing of context (and the Electric blog agrees). The research legwork the community’s already invested in will be such a strong foundation for this.

8. LiveStore, Laugh, Love

One of ElectricSQL’s integrations is with a platform called LiveStore, and one of the technical advising team members at Electric happens to be its founder, Johannes Schickling. Electric also funds (funded?) development of LiveStore and there seems to be a close partnership between the two teams. Before reading up about either, I’d heard of both in passing, but at a distance, both seemed vaguely similar: sync-y reactive database-shaped tools. I was curious to see how they differed.

LiveStore is similarly a sync and state management framework, but is geared more towards local-first apps. It also provides live reactivity on the client-side, but is based on SQLite and an event-sourcing model, separating reads and writes. Read events go through the SQLite layer. Separately, each write event that’s created on a client is committed and pushed to the sync backend, which then accepts the push. It then actively pokes the other clients to pull the new event. LiveStore defines a flexible event model API and automatically manages the persistence and recomputating of state.

The reactivity system is based on three types of state:

  • Reactive SQL queries, which additionally support callbacks and explicit dependencies to manage relationships with other queries and state
  • Reactive signals, state values that aren’t in tables but should be persisted
  • Reactive computed values, which can perform additional processing on signals

LiveStore was based on an earlier research project, Riffle. The project was based at MIT CSAIL, and Geoffrey Litt at Ink & Switch was part of the project. One of the interesting key ideas here was the separation of concerns between state management and syncing (loosely, the separation between LiveStore and the sync provider, like ElectricSQL). As their Riffle blog post reads, “[i]f an app developer could rely on a sufficiently powerful local state management layer, then their UI code could just read and write local data, without worrying about synchronizing data […] or other chores […]”. The local-first aspect allows for local queries and reactive state to update in the time it takes to render a single frame. It’s also neat to see the progression of the work, from the demo music library app in Riffle blog posts to Overtone to the underlying tech being split out into LiveStore. There aren’t many technical details on how the reactive query performance is optimized in either the blog posts or the UIST paper as I can find, which makes sense for the given audience, but I would have liked to dive deeper into that piece.

LiveStore is still under pre-1.0 active development, so I’m excited to see where the project continues growing. One thing that stood out about even LiveStore’s early, incomplete docs was the explicit Design Decisions page. It’s short but collects all the main considerations in one place, so I could get an overview of the project’s motivations without trying to read between the lines elsewhere. Making decisions and learnings explicit makes them easier to take to heart, and it’s also a big part of what makes books like AOSA compelling. This is a strategy I’ll try to adopt for some of my research projects moving forward — I keep a research log, but the decisions get blurred together with all the other WIP notes. Explicitly calling them out somewhere will help context-share as well as make paper writing clearer and more straightforward.

9. Making a Convex-ion

I’m rounding out my miniseries of live, reactive sync systems with ConvexSQL, a framework I keep hearing about on Twitter.

Convex is an ACID compliant, reactive datastore that lets apps atomically live-update as queries change. It doesn’t use SQL and doesn’t require ORMs — like Electric it leans more heavily on the Typescript type system. Compared to the other systems, Convex is more of a platform, running queries as if they were serverless functions within their architecture. It’s more opinionated and abstracts away any database optimization that developers might have to perform.

A Convex deployment includes the database itself alongside a sync worker and a function runner. Queries and updates must be expressed as functions, which are bundled and executed by the V8 runtime in the function runner. The database stores tables in a transaction log format that tracks all versions of documents, a bit like a write-ahead log with CRDTs. Operations are wrapped in transactions, which Convex ensures are run serializably using optimistic concurrency control and tracking read and write sets. The transactions implementation was quite cool to read about, since I recognized many of the ideas from my formally-specified database project during my research internship in Paris.

To sync, the database can use a similar algorithm as what they use to ensure serializability to detect updates in queries. After rerunning dependent queries, results are pushed to the relevant sync worker and transferred via Websocket to the client, which updates its local state.

ElectricSQL supports transactional causal consistency+, which means that there are atomic (and high-availability) transactions that respect causal consistency and that this is implemented with CRDTs. LiveStore doesn’t have transactions and is eventually consistent with a total order on events in its log. In contrast, Convex provides serializability: “We believe that any isolation level less than serializable is just too hard a programming model for developers.” This is both rather funny and rather correct, and I appreciate that it eliminates this whole class of issues in favour of mechanisms like transaction abort-retry that are more intuitive to implement.

I noted that Convex is leaning quite heavily into highlighting how optimized their platform is for AI — they have a whole landing page detailing the best prompts and rules as well as quickstart examples for building LLM integrations. Because they lean more into Typescript schemas, they claim LLMs are better equipped to generate Convex code (which I believe). Their quickstart page contains the typical React and other frameworks, but also has a plain textbox to directly generate an app. They’ve built a whole platform-on-a-platform, Chef, to oneshot entire realtime apps, which is very compelling to demo. Also, one application they show that meshes nicely with reactive frameworks is LLM streaming output — it makes sense that we might want to store the output but also stream a call’s progress in real time. Convex was started around the same time as the other projects (2020, in comparison to ElectricSQL and LiveStore, which both seem to have been started in 2021) but seems to have enjoyed more developer popularity, and I wonder if this recent LLM push had anything to do with adoption.

This writeup was mostly based on the very thorough How Convex Works blog post — I appreciate the transparency and clear walkthrough. The life-of-a-request diagrams were particularly helpful, and I’d have liked to see something similar with the other platforms to get a better sense of the flow in one place.

Reading about three adjacent realtime frameworks in a row let me compare-and-contrast them more easily: learning about each system contextualized it among the others, and I got to “map out” the space. I think three is enough sync frameworks for now, but there are many others I’d still like to read about sometime, like Turso, Jazz, Zero, and InstantDB. Check those out if you’re interested in exploring further, but for now, I plan to turn back to some notable database systems papers in my next few writeups.

10. MapReduce Your Expectations

This marks the start of a series of more mainstream database papers. I have this sense I should have read these by now, if only to build more context for what folks keep referencing in design discussions or online. The first system I chose was MapReduce, which seemed most approachable.

Functional programmers will find MapReduce intuitive to work with. It does exactly what it says on the tin: map and reduce values. The input is a large series of key-value pairs, which are then transformed into intermediate key-value pairs with the user-defined map function, and collected by intermediate key into a series of final values with the user-defined reduce function.

The framework aims to parallelize computation, so it makes use of a special master worker orchestrating many other workers. Input data is partitioned into splits and distributed among map worker nodes, which write the results locally. Intermediate pair tasks are partitioned (via a hash, for example) and then split up amongst the reduce workers. These reduce workers use RPCs to read the buffered data from map workers’ disks, sorts it, then performs the reduce operation. The final result is sent back to the master worker, which is also responsible for waking up the user program to pass the final results.

MapReduce also takes care of fault-tolerance concerns for the user. If a task times out, the master worker will mark it as failed and rerun it. Completed reduce tasks are safe since its results are stored in global storage, but if a machine that ran a map task failed, the master worker will also rerun its map task since the local disk becomes inaccessible. Interestingly, MapReduce supports non-deterministic tasks, simply providing weaker semantics and guarantees. One neat point is that stragglers, or slow/delayed tasks, can impact performance for the entire job, so once a job is close to completion, the master worker also schedules backup executions of the remaining in-progress tasks to race against existing executions. This is an interesting technique especially given nondeterministic functions are allowed, but they reported this noticeably improved performance.

The magic of MapReduce is that mapping and reducing is all users have to worry about. They don’t have to figure out how to optimally place data to reduce latency (the master worker will take locality into consideration when assigning tasks) or think about fault tolerance and durability. Because MapReduce takes care of these reliability and performance concerns in a common abstraction layer, teams don’t have to keep reinventing the wheel, which increases maintainability and productivity. I assume the MapReduce team played the role of a dev infra startup within the company — it’s cool that this type of framework and subteam forms in industry. Working on infra teams like this in the past was really fun and I enjoyed the impact and considerations we’d have to make.

The two meta points that drew my interest most were how the paper demonstrates how widely this simple pattern has been applied, and how they decided to discuss some of the actual library evolution and document real usage examples along with concrete results. For example, in addition to their more contrived experiments, they’d explain what size partitions they’d tend to choose, why, then explain the CPU utilization they observed with the configuration after real usage. I rarely see this with more academic papers: being able to share these patterns is a real strength of industry papers.

This paper was actually assigned as one of my class readings this term — I was already planning to read it, so I was happy to kill two birds with one stone. I’ll note that I based this writeup on the shorter Communications of the ACM version as opposed to the original OSDI paper. This version is a more concise version that glosses over the extensions to the basic model, like allowing users to define custom reduce partitioning logic, and is ironically more highly cited, which perhaps speaks to how easy to understand it was. These details weren’t of the most interest at my level, though, so this version sufficed. One day I’ll revisit the full version of this paper, but in the meantime I think reading shorter versions prepared by the same authors could be a good strategy for reading papers that seem intimidating, like the ones that follow.

11. Fantastic Beasts and How to Lock Them

Zookeeper is one of those system names that sticks with you: what beasts3 could they possibly be herding that warranted such a name? I’d heard about Zookeeper in many a distributed systems reading list, so I figured it was time to give it a read. I was surprised as soon as I opened the PDF: I’d always heard of Zookeeper in the AWS context, so I’d thought Zookeeper was Amazon’s brainchild, but it turns out it originated at Yahoo, who published the original paper at ATC 2010.

In it, Zookeeper is described as a coordination kernel with wait-free, hierarchical objects. It provides primitives that can be used to build locks, group management, and more, offering developers flexibility and modularity in how they’re actually implemented. I found it helpful to visualize it as a filesystem with extra message ordering guarantees. Requests are pipelined and asynchronously processed, and Zookeeper’s API provides notifications to watch for updates to objects.

Zookeeper’s main primitive is the znode concept, which are sets of data nodes stored in a hierarchical order and accessible via JSON PATCH-like paths. These can be freely created and deleted by workers, and can be even made ephemeral, cleaned up when its parent terminates. Znodes use version tags for concurrency control, like CosmosDB etags.

These znodes can be used to build a variety of utilities, as described in the paper:

  • Live config management can be implemented by having a “ready” sentinel object that a leader process can delete before making modifications to config and recreate when it’s ready to be processed. Workers can subscribe to updates on the ready object to re-process the config.
  • Group management can be implemented with ephemeral nodes that are created as child znodes on some well-known znode. These nodes are created and die in lockstep with their parent processes, and group membership can be calculated by listing the children of the znode.
  • Locks can be implemented by creating a designated node with an ephemeral flag and letting all nodes race to create the object. To avoid a thundering-herd effect, nodes can be set up to only watch the previous node in the queue; read locks can be implemented by watching only the previous write lock node.
  • The paper then goes on to explain some real-world use cases that remix the above abstractions. In a few, they mention leader election being implemented via Zookeeper, but don’t mention it in the rest of the paper: I would have really liked to see how it was put together from znodes.

Zookeeper replicates all data in in-memory DBs that each contain the whole tree. It uses a special atomic broadcast protocol, ZAB, for requests that require coordination, like writes. For these, Zookeeper guarantees linearizable state writes, but it serves reads from each local replica. These means that reads can be stale, so they also describe their sync API that forces all writes issued before the read to finish first. Zookeeper also has a WAL and uses idempotent transactions in its request processor, making recovery a straightforward process of replaying events off a fuzzy snapshot.

Again, I’ve realized I’ve been over-intimidated by a paper: it was much easier to parse than I expected. Perhaps this speaks to the quality of the papers and why they’ve been recommended so frequently. I found the structure of explaining the programming model of znodes first, followed by examples, then finally jumping into implementation made for a very clear storyline — this is a recipe I hope I’ll remember to draw on in the future.


  1. After spending a while doing infra work, it was refreshing to go back to my frontend roots and put this together! I learned some new tricks: did you know CSS has counters? Did you know you can set a separate header ID or that you can have slashes in them↩︎

  2. A mentor’s nickname for the larger community of researchers, typically in Europe, working on CRDTs and sync technology. ↩︎

  3. There’s a joke in here somewhere about the Indian boar romping about on the cover of Designing Data-Intensive Applications and zookeepers needing to wrangle them… ↩︎


‹ go back