
Truth or DARE.

Published 06 October 2024 at Yours, Kewbish. 5,854 words. Subscribe via RSS.


Introduction

I think distributed systems are neat, but I also don’t think I understand them well enough to warrant my level of interest in them. I’ve never taken a distributed systems course, have only stumbled my way through the Fly.io distributed systems challenges, and have never wrestled with more complicated design constraints while architecting something from the ground up. Whenever folks ask me about what particular subfield I’m interested in, I don’t have a concrete answer. Sure, I’ve read Designing Data-intensive Applications and have covered the concepts behind different consistency models at least three times, but I still tend to butcher their distinctions when explaining them. If distributed systems were a band, I’d definitely be labelled as a fake fan. I’ve wanted to do something about this for a while — I’ve wanted to finally understand distributed systems and be a real distsys engineer. So when a great opportunity cropped up, I shipped myself off to Germany to spend a week among brutalist buildings, looming castles, and most importantly, a cohort of grad students and eminent researchers, to see if I could make some progress.

The Second ACM Europe Summer School on Distributed and Replicated Environments is not only a bit of a mouthful — it’s also a summer program that brings together grad students interested in distributed systems to participate in lectures and labs taught by leading professors in the area. Last year, it took place in Brussels, Belgium, and this year, we were hosted in Darmstadt, Germany, at the Technische Universität Darmstadt. Each day of the program consisted of several chunks of either lectures, which were more formal presentations about some new research, or labs, which were structured like a programming assignment and set us free to tackle some extension problem related to the lecture. A list of the speakers is available here, and the program for this year is here.

I’m currently an undergrad, so I would normally be ineligible. However, Professor Kleppmann (author of the aforementioned Designing Data-intensive Applications) was supervising me for my current research internship and also speaking at DARE, so he encouraged me to apply. I’m very grateful that I was invited to apply, and immensely surprised that my prior research experience and quickly-put-together motivation letter sufficed to make up for my lack of a degree.

My previous experience in distributed systems work was primarily through a CPSC448 directed studies that touched on formal verification, though I didn’t do any actual proving. As well, my work at Cloudflare and at Stripe was fairly relevant: while I wasn’t directly working on quote-unquote distributed systems, I was doing infrastructure work that was appropriately flavoured as such. My motivation letter listed these and name-dropped an unrelated paper I’d worked on in the software engineering space. I don’t think the applications were particularly selective, since the HotCRP listed 32/36 submissions accepted, and I’d assume at least a couple of those were test submissions. I was still very nicely surprised when I heard back about my acceptance, though, and I’m again very thankful I was granted the opportunity.

I’d taken notes during each lecture and wanted to revisit them, so this is a post summarizing my main takeaways from the talks and from the program overall. Each of the subsections here could be a blog post in itself — this will be the longest post on my blog to date, but that’s because it covers a packed week of dense lecture material. I’ve written up my notes in the same order as the lectures occurred, so you’ll get to relive the learning at DARE as it happened. Professors and other students: I might’ve made some mistakes or simplifications in the below — please forgive me!

Martin Kleppmann — BFT Decentralized Access Control Lists

Professor Kleppmann’s lecture turned out to be very closely aligned to what I’ll be working on this term with him, and it was nice getting the high-level summary lecture at DARE before diving into extension work. His talk revolved around a decentralized access control list protocol, where members can be added and removed from a group chat. This seems simple, but there are tricky edge cases when concurrency gets involved: if two people concurrently remove each other, what happens? What happens if a user who is removed by another user concurrently adds their own guest user? The protocol also needs to be Byzantine-fault tolerant, meaning that the system must continue to provide a consistent access control model despite having nodes that don’t correctly follow the protocol. All in all, not as easy as it seems.

We started by covering some background: how can you implement a BFT system? One approach is version vectors, where every replica keeps track of its idea of the state of the other replicas. However, they aren’t safe in a Byzantine context, since a malicious replica can send different updates to different nodes, with version vectors crafted to confuse the two nodes into thinking they’re in sync when they’ve actually diverged.
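
To make the version vector idea concrete, here’s a minimal sketch (my own illustration in Python, not code from the lecture) of comparing and merging vectors; the Byzantine problem is that nothing stops a lying replica from reporting different counters to different peers.

# A replica's version vector maps replica ids to how many of that replica's
# updates it has seen.
def merge(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def dominates(a: dict, b: dict) -> bool:
    """True if a has seen everything b has seen."""
    return all(a.get(k, 0) >= n for k, n in b.items())

v_alice = {"alice": 3, "bob": 1}
v_bob = {"alice": 2, "bob": 2}
print(dominates(v_alice, v_bob), dominates(v_bob, v_alice))  # False False: concurrent
print(merge(v_alice, v_bob))  # {'alice': 3, 'bob': 2}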

You can solve this, though, by using a hash graph, like in Git. Each ‘commit’ or operation contains the hash of the previous one, which transitively commits to the hash of the predecessor’s predecessor, and so on. If the hashes of the most recently received operations are the same, you can assume that the two nodes are in sync, relying on the property that cryptographic hash collisions are hard to find. Otherwise, you can keep moving backwards in the graph until the two nodes converge on a common point, then exchange all updates after that point. This works in a Byzantine context, but is inefficient in the number of round trips, since both nodes have to continuously communicate as they ‘move backwards’ in the hash graph. It’d be nice if we could figure out the set of updates to share more efficiently.
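
Here’s a toy hash graph in that spirit — a sketch of the general Git-like idea, not the actual protocol from the talk. Each operation names the hashes of its predecessors, so comparing heads is enough to detect whether two replicas have diverged.

import hashlib, json

def op_hash(payload, preds):
    body = json.dumps({"payload": payload, "preds": sorted(preds)}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

class HashGraph:
    def __init__(self):
        self.ops = {}       # hash -> (payload, predecessor hashes)
        self.heads = set()  # hashes of operations with no successor yet

    def append(self, payload):
        h = op_hash(payload, self.heads)
        self.ops[h] = (payload, set(self.heads))
        self.heads = {h}
        return h

a, b = HashGraph(), HashGraph()
a.append("add alice"); b.append("add alice")
assert a.heads == b.heads  # identical histories produce identical heads
b.append("add bob")
assert a.heads != b.heads  # divergence is visible from the heads alone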

Generally, you can speed up set membership checks by using Bloom filters to approximate things. A paper he worked on applied that exact idea, which seems so obvious once you think about it. Each node sends a Bloom filter of the operations it currently knows about; the other node walks backwards through its own operations, checks whether each one appears in that filter, and sends over any that don’t. This repeats until the nodes are in sync again — no more redundant updates. Someone once told me that the best research is the intuitively obvious stuff, and I think this paper is no different. Applying a relatively basic CS concept to a new problem space in exactly the way the concept should be used makes the work both easier to understand and to defend. I hope that I’ll be able to hone my ‘finding obvious directions’ skills more to take advantage of this.
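
A rough sketch of that reconciliation step, assuming a hand-rolled Bloom filter (the actual paper’s protocol has more rounds and structure than this):

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _indexes(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= 1 << idx

    def might_contain(self, item):
        return all(self.bits & (1 << idx) for idx in self._indexes(item))

def probably_missing(local_ops, remote_filter):
    """Operations the remote side probably hasn't seen (false positives can hide some)."""
    return [op for op in local_ops if not remote_filter.might_contain(op)]

a_ops = {"op1", "op2", "op3"}
b_ops = {"op1", "op4"}
b_filter = BloomFilter()
for op in b_ops:
    b_filter.add(op)
print(sorted(probably_missing(a_ops, b_filter)))  # most likely ['op2', 'op3']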

Returning to the hash graph, Professor Kleppmann also went a bit into detail about blockchains. They’re overall similar to hash graphs, but instead have a total ordering of blocks, which requires Byzantine-fault tolerant consensus. This is needed in blockchain settings to prevent double-spending of currency, but in this decentralized ACL group application we don’t need there to be only one ordering of operations — we just need some sort of convergence. This taught me about the difference between consensus and collaboration: in consensus, nodes decide on one alternative; in collaboration, we can keep all alternatives and need to figure out how to put them together.

With all this background, we then worked through some design decisions around how to handle the ACL group edge cases mentioned above (spoiler alert: handling them is still active research). Some alternative systems were brought up, like Matrix’s power levels or some sort of seniority-based system. We ruled out user voting, since that’s susceptible to Sybil attacks (where a malicious actor controls a coordinated group of fake users) and to social engineering. For our lab, we were given some starter code implementing one such ACL CRDT and had the freedom to explore implementing some of these ideas.

Elisa Gonzalez Boix — CRDT Fundamentals and AmbientTalk

Professor Gonzalez Boix presented their work on AmbientTalk, an actor-based programming language that supports communication across peers without any explicit infrastructure. In particular, it supports volatile connections that might drop any time.

The language is service-based, with events for when services are discovered and when messages are received or sent. You can define custom services in the language and use the asynchronous messaging and lease primitives to communicate between them. The intro page has more examples, but here’s an overview of what it looks like:

whenever: Service
discovered: {
    |ref| // far reference
    when: message_type@FutureMessage()
    becomes: {
        |ref|
        ...rest
    }
}

def service := object: {
    def message() {
        ...
        return value
    }
}

export: service as: Service;

The language has a concept of far references, which as far as I can tell originated in E. These far references are like pointers across machine boundaries that work asynchronously. Messages sent to far references while connectivity is down are buffered in a queue and will eventually be resent. Objects are passed between actors as far references, while primitive data is shared via isolates in a pass-by-copy fashion.
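
Roughly, a far reference behaves like this buffering proxy (a Python analogy I’m using for illustration, not AmbientTalk’s actual implementation):

class FarReference:
    def __init__(self, send):
        self.send = send       # function that actually transmits a message
        self.connected = False
        self.buffer = []

    def tell(self, message):
        if self.connected:
            self.send(message)
        else:
            self.buffer.append(message)  # buffered until connectivity returns

    def on_reconnect(self):
        self.connected = True
        while self.buffer:
            self.send(self.buffer.pop(0))

ref = FarReference(send=lambda m: print("sent:", m))
ref.tell("addItem('milk')")  # connection is down, so this is queued
ref.on_reconnect()           # prints: sent: addItem('milk')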

Most of the talk was about some CRDT fundamentals, which I won’t go into here, as well as AmbientTalk’s unique features, but I also wanted to mention the fun lab we had working in it. We were playing around with an interactive shopping list example app based on a CRDT that was handling its network communication via AmbientTalk’s runtime. It was nice not having to code any of that and seeing things ‘just work’. The idea was that the shopping list app was collaborative, so if you specified the same Service name, you’d be able to dynamically discover peers on the same network and listen to their messages. This ran into a slight challenge, because there were thirty of us in a small room all competing for the same messages, but we worked around this by all declaring services with slightly different names. Some of us managed to get collaboration to work across devices, but something to do with my firewall or the eduroam network we were on wasn’t letting me try that particular feature out. This was still one of my favourite labs of the program, though.

Antonio Fernandez Anta — AMECOS

One of Professor Anta’s students actually presented a poster about his lecture’s research project at our poster session on the first day, so I had seen a bit of the background before the talk, although I’ll admit I still didn’t fully follow. The work presented was called AMECOS: A Modular Event-based framework for Concurrent Object Specification. In a similar vein to Professor Kaki’s talk below, they noted that, currently, concurrent objects are generally specified sequentially and assume some way to keep track of the object’s state.

They define events as ‘opex’es, or ‘operation executions’, then define the various consistency models around the properties that these opexes would hold. For example:

  • Linearizability implies a ‘realtime’ guarantee: a value written anywhere can be read globally immediately after the write finishes, so there can’t be simultaneous opexes and any read will return the latest write opex’s value.
  • For sequential consistency, the process order must be respected, but the action doesn’t necessarily have to take effect instantaneously between its invocation and its response.
  • For causal consistency, we also require process order to be respected, but each process can have its own order of the other read/write opexes as long as it respects the causal order between those events. This allows opexes to be executed locally without requiring communication or agreement with other processes. (A toy example distinguishing these models follows this list.)
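
As a concrete made-up example (mine, not from the talk), here’s a brute-force check over a two-opex history on a single register: the history is sequentially consistent, because some legal ordering exists, but not linearizable, because no legal ordering also respects real time.

from itertools import permutations

# Opexes on a single register x (initially 0), as
# (process, op, value, invoke_time, response_time).
history = [
    ("P1", "write", 1, 0, 1),  # write x := 1, finishes at t=1
    ("P2", "read", 0, 2, 3),   # read x -> 0, starts after the write finished
]

def legal(seq):
    """Would this sequential order of opexes be valid for a register?"""
    x = 0
    for _, op, val, _, _ in seq:
        if op == "write":
            x = val
        elif op == "read" and val != x:
            return False
    return True

def respects_realtime(seq):
    """Does the order place every opex after the ones that finished before it started?"""
    pos = {id(e): i for i, e in enumerate(seq)}
    return all(pos[id(a)] < pos[id(b)]
               for a in history for b in history if a[4] < b[3])

# Sequential consistency: some legal total order (per-process order is trivial here).
print(any(legal(p) for p in permutations(history)))  # True
# Linearizability additionally requires the order to respect real time.
print(any(legal(p) and respects_realtime(p) for p in permutations(history)))  # False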

It’d be difficult to specify these consistency models sequentially, so they specify the system modularly as an execution. An execution is correct if there’s an opex ordering that satisfies the consistency property required. The advantages of this approach are that they don’t assume some omniscient power knowing the total arbitration order, that the object’s state doesn’t need to be tracked, that the object is described only via its interface, and that the object’s specification can then be separated from the consistency definitions it needs.

Gowtham Kaki — Novel Consensus Proof Techniques

Professor Kaki’s talk centred around novel ideas for proving distributed systems correct, particularly for modelling consensus via convergence and monotonicity. The current standard approach to modelling distributed systems is via asynchronous message passing, but it’s hard to, say, specify a safety invariant for a leader election with message passing. A first attempt might look like checking ‘if A has elected B as its leader and C has elected D as its leader, then B = D’, but induction doesn’t work for this. This invariant also allows some invalid state transitions.

The key takeaway I had from his talk was his point that strengthening invariants is a valid proof strategy that can unlock the final proof. Counterintuitively, you now need to prove more properties, but you can also assume more starting points in the inductive step. The final inductive invariant in the leader election might include all of the following properties:

  • if a leader has been assigned to a node, a quorum of votes should exist for that leader
  • if a leader message exists, a quorum of votes should exist for that node
  • a node can only vote for one node
  • if a node has voted for some node and a vote message exists, the message and the leader should be the same
  • if two voting messages came from the same node, they should be the same
  • if a voting message exists, the node should have voted

We can also avoid using an induction approach for the proof and instead rely on the convergence of leadership (it doesn’t matter in what order the voting messages are processed, so long as they vote for the same leader) and the monotonicity of leadership (a node’s vote only ever moves from no one to a single leader, which you can model via a lattice). Together, this monotonicity and convergence give you consensus.
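
A minimal sketch of that lattice view, in my own words rather than Professor Kaki’s formalism: each node’s vote starts at ‘no one’ and can only rise to a single candidate, and merging replicas’ views is a join, so the result converges no matter how the messages are ordered.

from typing import Dict, Optional

def join_vote(a: Optional[str], b: Optional[str]) -> Optional[str]:
    if a is None:
        return b
    if b is None:
        return a
    assert a == b, "a correct node never votes for two different leaders"
    return a

def join_views(x: Dict[str, Optional[str]], y: Dict[str, Optional[str]]):
    """Merge two replicas' views of everyone's votes; merge order doesn't matter."""
    return {n: join_vote(x.get(n), y.get(n)) for n in x.keys() | y.keys()}

def leader(view: Dict[str, Optional[str]], quorum: int) -> Optional[str]:
    """A leader is decided once a quorum of nodes has voted for the same candidate."""
    tally: Dict[str, int] = {}
    for vote in view.values():
        if vote is not None:
            tally[vote] = tally.get(vote, 0) + 1
    return next((cand for cand, n in tally.items() if n >= quorum), None)

v1 = {"n1": "B", "n2": "B", "n3": None}
v2 = {"n2": "B", "n3": "B"}
print(leader(join_views(v1, v2), quorum=2))  # B, regardless of message order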

You can then model consensus with some way of replicating state, and place these two properties on it as invariants. They found that while this replicated-state approach required more messages than the async message passing approach, it performed better in terms of throughput. They’re still working on this, as they mentioned they haven’t implemented garbage collection or crash recovery in the evaluation system, but I think the gist is that this is a viable alternative to the typical message passing implementations of consensus.

I had a bit of trouble following these two specification/verification talks, mostly because of the formality of the details introduced, but I’ll come back to revisit the material should they come in handy.

Carlos Baquero — State-based CRDT Performance

One of the other lectures (I don’t recall who) cited the 2011 CRDT paper that Marc Shapiro wrote. I’ve read the paper for my past research internship around formal verification, and I know the paper’s a foundational one. I didn’t realize, though, that Professor Baquero was also a co-author on the paper — I only put two and two together when the speaker was listing out the authors and gestured towards him. A very small world.

Professor Baquero’s talk was on CRDTs and an optimization that can be applied to reduce the size of state you need to send back and forth. The general idea is that state-based CRDTs can be inefficient. For example, take an add-wins set: it keeps track of a set of tombstones for removed elements to guard against cases where an element’s ‘remove’ operation is processed before its ‘add’. As well, removed elements are typically also kept in the main set, duplicating the storage needed.

This talk had some background about join semi-lattices and partially-ordered logs. A join semi-lattice is a type of structure that defines a join operation that you can apply to two states to get the joint state: the join is a bit like a set union with extra spice. Intuitively, the ‘lattice’ part of the name connects to the fact that when you draw up all the possible joins and states, you end up with a lattice-like shape, with an empty state at the bottom and a final state at the top. This final state is the result of joining all the possible inputs together — in maths it’s called the ‘least upper bound’, but you can think of it as ‘the state that contains all of these other states joined together’. Very frequently, this looks like a union if you squint. These states can be defined by applying the join operation to the operations in a partially-ordered log, or polog for short. If you model the state of a grow-only counter as {A(1), B(2)}, for example, then you could define your join operation to take two states and do a member-wise max. If we had {A(1), B(2)} and {A(3), B(1)} to join, we would get {A(3), B(2)}, then sum across all members to get the final grow-only counter value of 5. Each replica keeps a local view of this polog to derive its state.
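
That grow-only counter fits in a few lines — a minimal sketch matching the worked example above:

def join(a: dict, b: dict) -> dict:
    """Member-wise max: the join of the grow-only counter lattice."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state: dict) -> int:
    return sum(state.values())

s = join({"A": 1, "B": 2}, {"A": 3, "B": 1})
print(s)         # {'A': 3, 'B': 2}
print(value(s))  # 5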

This state ends up getting pretty big, however, when you consider tombstone sets and more complicated CRDTs. Instead of sending over the whole state, the approach presented therefore relies on delta mutations. For example, if you get an update from replica A that it’s been incremented twice, just send that as a delta mutation instead of sending over all the extra state for replicas B, C, and D. You can fragment your state into so-called irredundant join decompositions, turning something like {A(1), B(2)} into {A(1)} and {B(2)}, and send deltas based on these smaller pieces instead.
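
Continuing the toy counter above, a delta mutation is just a tiny state that the receiver folds in with the same join (my own simplification of the delta idea, not the exact algorithm from the talk):

def join(a: dict, b: dict) -> dict:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def increment(state: dict, replica: str, n: int = 1):
    """Apply an increment locally and return (new state, delta to gossip)."""
    delta = {replica: state.get(replica, 0) + n}
    return join(state, delta), delta

state_a = {"A": 1, "B": 2, "C": 4, "D": 7}
state_a, delta = increment(state_a, "A", 2)
print(delta)  # {'A': 3} -- no need to resend the B, C, and D entries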

The PhD thesis of one of his students builds this out further, explaining how you can decompose a state, hash the members to assign their values to buckets, then calculate which buckets to send over to each replica to limit sending duplicated state. You can also apply Bloom filters to efficiently test what needs to be sent over. However, Bloom filters come with false positives, so buckets and Bloom filters need to be combined for a robust system. There’s a whole four-round protocol of sending Bloom filters, their differences, and buckets across that you can read more about here.

Aside from the content of the talk, there were two sort of offhand points about communication times that’ve really stuck with me. One was that the whole world can be connected at a latency supporting FPS games and other real-time applications just because the Earth’s diameter is small. It’s neat to think about: if the Earth were larger, there could’ve been whole classes of apps that would’ve never been invented. The other is that communication round-trips to space colonies (e.g. Mars) will dramatically increase from what we’re used to on Earth, and could be up to a 20-minute RTT. It was very thought-provoking to consider space tech as a field where CRDTs will likely become necessary — when you have such long response times and occasional periods where communication isn’t possible, you can’t really lean too heavily on the classic client-server model. I want to learn more about the state of space tech in the future, especially since a fellow Rise Global Winner has just gotten into YC with their satellite startup. I think there’s a lot of interesting potential with critical systems that’ll need certain consistency and consensus guarantees.

Mira Mezini — Algebraic RDTs

Professor Mezini talked about replicated data types, particularly the concept of an ARDT, or Algebraic Replicated Data Type. Her talk presented how ARDTs can be used for decentralized state management, reactively for propagating changes, and for coordination with various consistency and availability guarantees.

An interesting point she brought up was that Brewer, the inventor of the CAP theorem, now states that the CAP theorem’s application should be relaxed in modern systems to fit with the specific system’s requirements. Sometimes, we can relax consistency a little bit to allow for more availability, and vice versa. It doesn’t have to be an all-or-nothing choice given today’s technologies. I liked this take, particularly because I could see it in my previous internship work — we used serializable transactions just for the parts that needed high consistency, but for better performance left everything else in read-uncommitted transactions.

ARDTs were presented as a standard library of composable, primitive data-type RDTs that you can heavily customize but still have work together. They differ from CRDTs because off-the-shelf CRDTs have fixed design decisions baked into them, like assumptions about various network models, which causes a bit of an impedance mismatch when trying to actually program with data. Using ARDTs allows you to decouple the state of data from its dissemination, so that the communication and network become irrelevant to the actual application logic. They felt a bit like Jacky’s BFT JSON CRDT library, and to be honest I can’t quite articulate the difference between ARDTs and ‘customizable CRDTs’ better than this.

She presented a subset of ARDTs called reactive ARDTs that avoid callback hell as well as having to track data dependencies across systems by hand. Right now, it’s difficult to model data updates and make sure they’re propagated correctly to other places in a system with diverse nodes — I think the final chapter of DDIA covers this a little as well. They designed a language, REScala, that treats these reactive ARDTs and their dependency effect chains as first-class citizens in the runtime. This makes programs easier to understand, since events are modelled as happening instantaneously, and lets the runtime handle making the effect chains strictly serializable and more consistent.

There was also some work on coordination ARDTs mentioned in this vein, to enforce application-level invariants that are otherwise difficult to manage via consensus. They achieved this by adding a verifying compiler, LoRe, to REScala. Their coordination works by making use of locking without requiring offline nodes to agree. The system models interactions as having certain pre/post-conditions and actions, so the compiler can check that the overall invariants hold at each step.

A fun note about her slides: I noticed that one of the images she used on a slide about decentralized collaborative applications looked really familiar. Halfway through, it hit me that it was actually the title image from the Ink and Switch Crosscut article, which ironically is explicitly a ‘personal thinking space’ and not a collaborative tool. Say what you will about me being able to recognize Ink and Switch blog post images on sight.

Annette Bieniusa — Formalizing Broadcast, TLA+, and Erla+

Professor Bieniusa’s talk was listed on the program as something to do with reactive datatypes and Elixir, which I was looking forward to finally learning a bit more about, but she actually spoke on specifying different broadcast models in TLA+! TLA+ is a specification language used in formal verification, and it’s been used in the industry to verify and model many critical systems, notably including AWS’s S3. I’ve worked a little with TLA+ in my previous research internship on an extension to the PGo project, a compiler that translated Modular PlusCal (which itself compiles to TLA+) directly into production-ready Go systems.

So imagine my surprise when Professor Bieniusa started talking about Erla+, a project that’s almost exactly that, just with Go replaced with Erlang! However, their work compiles a subset of PlusCal directly into TLA+ and Erlang, whereas PGo requires the use of our custom Modular PlusCal extension language first, so they’ve cut out the need to learn new syntax. Also, their compiler produces actor-based systems, which I don’t know much about but which seem to map quite naturally to having multiple distributed nodes that one needs to coordinate. It was neat to see how much the two projects naturally mirrored each other.

The bulk of her talk was primarily about broadcast, though. We talked through several variations, including best-effort broadcast, reliable broadcast, and uniform reliable broadcast. Best-effort broadcast is just a broadcast where you try your best to deliver messages, with no retries or other guarantees. For reliable broadcast, you have each node forward the messages it delivers, so that if any correct node delivers a message, all correct nodes eventually do; uniform reliable broadcast strengthens this so that a message delivered by any node, even one that later fails, is delivered by everyone else too. Unfortunately, uniform reliable broadcast can’t be implemented if the majority of nodes fail, for obvious reasons. We also defined a few delivery-ordering properties that we used in these definitions (a toy sketch of FIFO delivery follows the list):

  • FIFO property → if p broadcasts m and then m', then m is delivered before m' at every process
  • causal property → if the broadcast of m happened-before the broadcast of m' (say, q delivered m before broadcasting m'), then m is delivered before m' at every process
  • total order property → if one process delivers m before m', then every process delivers m before m'
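
To make the FIFO property concrete, here’s a toy delivery layer (my own sketch, not the lab’s TLA+ spec): senders tag messages with per-sender sequence numbers, and receivers hold back anything that arrives ahead of the next expected number.

from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)    # sender -> next sequence number expected
        self.held_back = defaultdict(dict)  # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, message):
        self.held_back[sender][seq] = message
        # Deliver as many in-order messages from this sender as we now can.
        while self.next_seq[sender] in self.held_back[sender]:
            msg = self.held_back[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, msg))
            self.next_seq[sender] += 1

r = FifoReceiver()
r.receive("p", 1, "m2")  # arrives out of order, so it's held back
r.receive("p", 0, "m1")
print(r.delivered)       # [('p', 'm1'), ('p', 'm2')]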

We then spent some time learning TLA+ syntax and primitives to formally model these properties. I learned that TLA+ uses something called linear-time logic, in which a property defines the set of executions that are considered correct. You can then define linear-time properties, like safety and liveness properties. A safety property requires that if any execution violates it, then there was a finite prefix of that execution after which no continuation could have satisfied the property — intuitively, if something went wrong, there was a particular ‘turning point’ where things went south. Safety can only be fully satisfied given infinite time, but can be violated in finite time.

On the other hand, a liveness property requires that any finite prefix of an execution can still be extended into an execution that satisfies the property — intuitively, that something good can still eventually happen. Liveness can conversely only be violated in infinite time and is satisfied in finite time.

To model these in TLA+, you need a couple of operators: []F denotes that F is always true, and <>F that F is eventually true. You can also combine these: []<>F states that at every point, F is either true now or will become true again later, i.e. F holds infinitely often, so some progress towards F keeps being made. The reverse, <>[]F, expresses that eventually, F will always hold. Intuitively, this denotes stability.

The final concept I’ll cover is how we apply these to express ‘fairness’: the idea that if an action is enabled ‘often enough’, it must eventually be taken. There are variations: weak fairness of an action A can be expressed as <>[](ENABLED A) → []<>A, which reads as ‘if A is eventually enabled continuously, then A must keep being taken’. Strong fairness, on the other hand, can be expressed as []<>(ENABLED A) → []<>A, which says that A must keep being taken even if it’s only enabled infinitely often rather than continuously. For more intuition about the difference between weak and strong fairness, think of a traffic light. If the traffic light is strongly fair, the car will eventually have to go, because the light is green infinitely often, even though it always switches back to red. However, if the traffic light is only weakly fair, then the car might never go, because there’s never a point after which the light stays green forever, so the weak fairness condition never kicks in. This was really mind-bending to wrap my head around, and I think the concepts of eventual-ness and the timing logic here are fun to dig into.

Another coincidence: I also learned that Professor Bieniusa will be collaborating with my supervisor for my research internship next term, so we might get to connect again soon!

German Efficiency

In addition to the cold, hard, technical details, I also learned about the finer details of European education systems (the Belgian and German ones, in particular) and about Darmstadt and Germany as a whole. I speak no German, so I had to rely on the locals speaking English. I was a little self-conscious about being the classic clueless North American tourist who romps about Europe and is generally a nuisance. Granted, I don’t think I bothered anyone, but I really felt like I was very uncultured and not well-informed before learning about any of this. I’m starting to get why people recommend exchange programs and travelling so much — you learn so much by osmosis and vibes, even from a quick stay where you’re not interacting much with the locals.

Some quick-fire notes:

  • The Frankfurt airport seemed very empty, even though I was arriving on a weekend afternoon, when I’d have expected it to be bustling. Maybe I was in a quieter terminal.
  • On the other hand, the smoking lounges seemed very full. I was mildly shocked to see a smoking lounge right out the gate, especially indoors. It was also odd to see people smoking right outside doors, young people smoking, and other indoor smoking areas. In Vancouver, it tends to be fairly rare and is almost always an older person huddled in an alleyway, not a well-dressed twenty-something strutting by with friends.
  • Darmstadt is literally translated as “colon/intestines-city”. Something about how if Germany was anthropomorphized into a human, Darmstadt would be smack where the bowels were. Apologies if you were enjoying a nice meal at this point in reading.
  • Trains seem to be consistently late. We took a train a bit closer to Frankfurt to hike; our train there was almost ten minutes late, and our train back was closer to fifteen minutes late. I was told that German efficiency only applies to cars.
  • Germans are pretty intense about their hiking. We went up a ‘small hike’ to Auerbach Castle, which was the better part of an hour up a fairly steep hill. I’d assumed since they didn’t ask about accessibility restrictions that the ‘hike’ meant a flat walk, but no, this was really a hike. At some point we saw people biking down the very steep, narrow path, and at the top I was told that most Germans would not consider our trek anything near a hike.
  • Everything initially seems more expensive than Canada — for example, ramen might run you 14 euros. I was told that this was relatively cheaper than other parts of Europe, but the converted equivalent of ~$21 CAD seemed a little steep. Something closer to $12-15 CAD is what I’m used to. However, when you factor in the lack of tip and the already-included tax, it’s not far off from Vancouver prices. It was nice that most prices were round numbers too, which helped with sorting out change.
  • It rains a lot, and people are used to it. One of the highlights of the program was a walking tour around the city on our first day, during which it started thunderstorming and pouring. The lightning and thunder were just a few seconds apart, and we were huddling, trying to recall the conversion from the time between lightning and thunder to how close the storm was, to figure out how screwed we might be. Our guide, completely unfazed, led us into the main city centre castle and continued peppering us with facts.
  • Darmstadt has their own 9/11 story, albeit in 1944. The city was heavily bombed by the RAF during WWII, destroying half the town’s homes overnight. We were actually in town for the 80th anniversary memorial event, and all through Wednesday we heard the bells tolling across town.
  • The cafeteria food at TU Darmstadt is quite good. It felt a bit like Ikea standing in the cafeteria line and grabbing lunch, but there was solid variety. I will warn folks that when they translate something as dumplings, though, they mean American/Western-style dumplings — I was not expecting a dense dough ball.
    • The group I was with was very interested in having Asian food for our free meals, seeing as we had a German-cuisine dinner already scheduled. The two places we tried were great. It was very fun teaching them to use chopsticks and see them tank Szechuan peppercorns for the first time.
    • I feel obligated to also especially shout out this hole-in-the-wall Tibetan dumpling takeaway place, which I tried on my last night. The staff offered me a free sample of mango lassi and were very sweet in explaining everything in English.

Conclusion

Besides all the distributed systems and multicultural learning, I also met many lovely people (who are probably reading this post — I can still see my blog analytics getting a suspicious number of views from Belgium). It was great making some new friends, since I think I was one of the only people who wasn’t with a contingent from their home university and perhaps the only North American. I did my share of cultural exchange too: other than being the de-facto Asian-culture expert, I also taught folks some Canadian and American slang1 and explained how university works here.

I’m very happy that I was able to attend and feel very fulfilled by the new fields I’ve gotten a tour of during DARE. I’m proud to say that this was the longest continuous period in my life where I felt like I fully grasped the difference between strict, sequential, and causal consistency, and I can still mostly reason about the details now. Being exposed to different areas of research that I had no real background in was a challenge — once the LaTeX started flowing in a talk, I admit things generally started going over my head — but it was also great to be able to build an idea of the different subareas within distributed systems. DARE was a shortcut to the frontier of research in a short week, and I’m excited to be able to continue thinking about some of these problems in my current and upcoming research internships.

The other students who took this program for ECTS credits were required to do a further two-week research project following DARE, building on one of these lectures. They’ll be presenting in a few weeks, and I’m really looking forward to seeing what they’ve come up with. I didn’t have to do one since the transfer credits aren’t going to meaningfully affect my courseload next year, but in a way my current internship work is one big extension of Professor Kleppmann’s talk and work, so I’ll say it counts.

I’d very much recommend the program to anyone even tangentially interested in distributed systems. I believe the talks will be different year-to-year, and I’ve heard the next iteration is planned in Porto, Portugal. There’s funding available for European students via the Erasmus program and no fee for the program itself otherwise. Stay on the lookout for DARE 2025 — I’d strongly encourage other students to go for it!


  1. I will never forget when we were standing in a circle with someone else from Portugal explaining Skibidi Toilet to the others and chatting about brainrot, when Professor Baquero joined the group, said something along the lines of, “ah yes, ‘brainrot’, that must be what my daughter has”, and took a look at a Skibidi Toilet episode. Oh, how I love the Internet. ↩︎

