Distributed Systems Programming Has Stalled

Shadaj Laddad · February 26, 2025

Over the last decade, we’ve seen great advancements in distributed systems, but the way we program them has seen few fundamental improvements. While we can sometimes abstract away distribution (Spark, Redis, etc.), developers still struggle with challenges like concurrency, fault tolerance, and versioning.

There are lots of people (and startups) working on this. But nearly all focus on tooling to help analyze distributed systems written in classic (sequential) programming languages. Tools like Jepsen and Antithesis have advanced the state-of-the-art for verifying correctness and fault tolerance, but tooling is no match for programming models that natively surface fundamental concepts. We’ve already seen this with Rust, which provides memory safety guarantees that are far richer than C++ with AddressSanitizer.

If you look online, there are tons of frameworks for writing distributed code. In this blog post, I’ll make the case that they only offer band-aids and sugar over three fixed underlying paradigms: external-distribution, static-location, and arbitrary-location. We’re still missing a programming model that is native to distributed systems. We’ll walk through these paradigms then reflect on what’s missing for a truly distributed programming model.

External-distribution architectures are what the vast majority of “distributed” systems look like. In this model, software is written as sequential logic that runs against a state management system with sequential semantics:

Stateless Services with a Distributed Database (Aurora DSQL, Cockroach)
Services using gossiped CRDT state (Ditto, ElectricSQL, Redis Enterprise)¹11. This may come as a surprise. CRDTs are often marketed as a silver bullet for all distributed systems, but another perspective is they simply accelerate distributed transactions. Software running over CRDTs is still sequential. ↩This may come as a surprise. CRDTs are often marketed as a silver bullet for all distributed systems, but another perspective is they simply accelerate distributed transactions. Software running over CRDTs is still sequential. ↩
Workflows and Step Functions

These architectures are easy to write software in, because none of the underlying distribution is exposed²22. Well that’s the idea, at least. Serializability typically isn’t the default (snapshot isolation is), so concurrency bugs can sometimes be exposed. ↩Well that’s the idea, at least. Serializability typically isn’t the default (snapshot isolation is), so concurrency bugs can sometimes be exposed. ↩ to the developer! Although this architecture results in a distributed system, we do not have a distributed programming model.

There is little need to reason about fault-tolerance or concurrency bugs (other than making sure to opt into the right consistency levels for CRDTs). So it’s clear why developers opt for this option, since it hides the distributed chaos under a clean, sequential semantics. But this comes at a clear cost: performance and scalability.

Serializing everything is tantamount to emulating a non-distributed system, but with expensive coordination protocols. The database forms a single point of failure in your system; you either hope that us-east-1 doesn’t go down or switch to a multi-writer system like Cockroach that comes with its own performance implications. Many applications are at sufficiently low scale to tolerate this, but you wouldn’t want to implement a counter like this.

Static-location architectures are the classic way to write distributed code. You compose several units—each written as local (single-machine) code that communicates with other machines using asynchronous network calls:

Services communicating with API calls, possibly using async / await (gRPC, REST)
Actors (Akka, Ray, Orleans)
Services polling and pushing to a shared pub/sub (Kafka)

These architectures give us full, low-level control. We’re writing a bunch of sequential, single-machine software with network calls. This is great for performance and fault-tolerance because we control what gets run where and when.

But the boundaries between networked units are rigid and opaque. Developers must make one-way decisions on how to break up their application. These decisions have a wide impact on correctness; retries and message ordering are controlled by the sender and unknown to the recipient. Furthermore, the language and tooling have limited insight into how units are composed. Jump-to-definition is often unavailable, and serialization mismatches across services can easily creep in.

Most importantly, this approach to distributed systems fundamentally eliminates semantic co-location and modularity. In sequential code, things that happen one after the other are textually placed one after the other and function calls encapsulate entire algorithms. But with static-location architectures, developers are coerced to modularize code on machine boundaries, rather than on semantic boundaries. In these architectures there is simply no way to encapsulate a distributed algorithm as a single, unified semantic unit.

Although static-location architectures offer developers the most low-level control over their system, in practice they are difficult to implement robustly without distributed systems expertise. There is a fundamental mismatch between implementation and execution: static-location software is written as single-machine code, but the correctness of the system requires reasoning about the fleet of machines as a whole. Teams building such systems often live in fear of concurrency bugs and failures, leading to mountains of legacy code that are too critical to touch.

Arbitrary-location architectures are the foundation of most “modern” approaches to distributed systems. These architectures simplify distributed systems by letting us write code as if it were running on a single machine, but at runtime the software is dynamically executed across several machines³33. Actor frameworks don’t really count even if they support migration, since the developer still has to explicitly define the boundaries of an actor and specify where message passing happens ↩Actor frameworks don’t really count even if they support migration, since the developer still has to explicitly define the boundaries of an actor and specify where message passing happens ↩:

Distributed SQL Engines
MapReduce Frameworks (Hadoop, Spark)
Stream Processing (Flink, Spark Streaming, Storm)
Durable Execution (Temporal, DBOS, Azure Durable Functions)

These architectures elegantly handle the co-location problem since there are no explicit network boundaries in the language/API to split your code across. But this simplicity comes at a significant cost: control. By letting the runtime decide how the code is distributed, we lose the ability to make decisions about how the application is scaled, where the fault domains lie, and when data is sent over the network.

Just like the external-distribution model, arbitrary-location architectures often come with a performance cost. Durable execution systems typically snapshot their state to a persistent store between every step⁴44. With some optimizations when a step is a pure, deterministic function ↩With some optimizations when a step is a pure, deterministic function ↩. Stream processing systems may dynamically persist data and are free to introduce asynchrony across steps. SQL users are at the mercy of the query optimizer, to which they at best can only give “hints” on distribution decisions.

We often need low-level control over where individual logic is placed for performance and correctness. Consider implementing Two-Phase Commit. This protocol has explicit, asymmetric roles for a leader that broadcasts proposals and workers that acknowledge them. To correctly implement such a protocol, we need to explicitly assign specific logic to these roles, since quorums must be determined on a single leader and each worker must atomically decide to accept or reject a proposal. It’s simply not possible to implement such a protocol in an arbitrary-location architecture without introducing unnecessary networking and coordination overhead.

Obligatory LLM Section

If you’ve been following the “agentic” LLM space, you might be wondering: “Are any of these issues relevant in a world where my software is being written by an LLM?” If the static-location model is sufficiently rich to express all distributed systems, who cares if it’s painful to program in!

I’d argue that LLMs actually are a great argument why we need a new programming model. These models famously struggle under scenarios where contextually-relevant information is scattered across large bodies of text⁵55. See the Needle in a Haystack Test; reasoning about distributed systems is even harder. ↩See the Needle in a Haystack Test; reasoning about distributed systems is even harder. ↩. LLMs do best when semantically-relevant information is co-located.

The static-location model forces us to split up our semantically-connected distributed logic across several modules. LLMs aren’t great yet at correctness on a single machine; it is well beyond their abilities to compose several single-machine programs that work together correctly. Furthermore, LLMs make decisions sequentially; splitting up distributed logic across several networked modules is inherently challenging to the very structure of AI models.

LLMs would do far better with a programming model that retains “semantic locality”. In a hypothetical programming model where code that spans several machines can be co-located, this problem becomes trivial. All the relevant logic for a distributed algorithm would be right next to each other, and the LLM can generate distributed logic in a straight-line manner.

The other piece of the puzzle is correctness. LLMs make mistakes, and our best bet is to combine them with tools that can automatically find them⁶66. Lean is a great example of this in action. Teams including Google and Deepseek have been using it for some time. ↩Lean is a great example of this in action. Teams including Google and Deepseek have been using it for some time. ↩. Sequential models have no way to reason about the ways distributed execution might cause trouble. But a sufficiently rich distributed programming model could surface issues arising from network delays and faults (think a borrow-checker, but for distributed systems).

What can we take from these systems?

Although the programming models we’ve discussed each have several limitations, they also demonstrate desirable features that a native programming model for distributed systems should support. What can we learn from each model?

I’m going to skip over external-distribution, which as we discussed is not quite distributed. For applications that can tolerate the performance and semantic restrictions of this model, this is the way to go. But for a general distributed programming model, we can’t keep networking and concurrency hidden from the developer.

The static-location model seems like the right place to start, since it is at least capable of expressing all the types of distributed systems we might want to implement, even if the programming model offers us little help in reasoning about the distribution. We were missing two things that the arbitrary-location model offered:

Writing logic that spans several machines right next to each other, in a single function
Surfacing semantic information on distributed behavior such as message reordering, retries, and serialization formats across network boundaries

Each of these points have a dual, something we don’t want to give up:

Explicit control over placement of logic on machines, with the ability to perform local, atomic computations
Rich options for fault tolerance guarantees and network semantics, without the language locking us into global coordination and recovery protocols

It’s time for a native programming model—a Rust-for-distributed systems, if you will—that addresses all of these.

Thanks to Tyler Hou, Joe Hellerstein, and Ramnivas Laddad for feedback on this post!

This may come as a surprise. CRDTs are often marketed as a silver bullet for all distributed systems, but another perspective is they simply accelerate distributed transactions. Software running over CRDTs is still sequential. ↩
Well that’s the idea, at least. Serializability typically isn’t the default (snapshot isolation is), so concurrency bugs can sometimes be exposed. ↩
Actor frameworks don’t really count even if they support migration, since the developer still has to explicitly define the boundaries of an actor and specify where message passing happens ↩
With some optimizations when a step is a pure, deterministic function ↩
See the Needle in a Haystack Test; reasoning about distributed systems is even harder. ↩
Lean is a great example of this in action. Teams including Google and Deepseek have been using it for some time. ↩

Distributed Systems Programming Has Stalled

Shadaj Laddad · February 26, 2025

Obligatory LLM Section

What can we take from these systems?

Footnotes