Original post

17 May 2020

Something I’ve been thinking about recently is how when I’ve worked on any kind of distributed system, including systems as simple as a web app with frontend and backend code, probably upwards of 80% of my time is spent on things I wouldn’t need to do if it weren’t distributed. I came up with the following description of why I think this kind of programming requires so much effort: Everything is fragile narrow laggy asynchronous mismatched untrusted pipes. I think every programmer who’s worked on a networked system has encountered each of these issues, this is just my effort to coherently describe all of them in one place. I hope to prompt you to consider all the different hassles at once and think about how much harder/easier your job would be if you did/didn’t have to deal with these things. I think this is part of why web companies like Twitter seem to have so much lower impressiveness per engineer productivity than other places like game companies or SpaceX, although there’s other pieces to that puzzle. While part of the difficulty of distributed systems is inherent in physics, I think there’s lots of ideas for making each part of the problem easier, many already in common use, and I’ll try to mention lots of them. I hope that we as programmers continually develop more of these techniques and especially general implementations that simplify a problem. Like serialization libraries reducing the need for hand-written parsers/writers, I think there’s a lot of developer time out there to save by implementing generalized solutions where we currently painstakingly reimplement common patterns. I also think all these costs mean you should try really hard to avoid making your system distributed if you don’t have to.

I’ll go over each piece in detail, but briefly, whenever we introduce a network connection we usually have to deal with something that is:

  • Fragile: The network connection or the other end can have hardware failures, these have different implications but both manifest as just a timeout. Everything needs to handle failure.
  • Narrow: Bandwidth is limited so we need to carefully design protocols to only send what they need.
  • Laggy: Network latency is noticeable so we need to carefully minimize round-trips.
  • Asynchronous: Especially with >2 input sources (UIs count) all sorts of races and edge cases can happen and need to be thought about and handled.
  • Mismatched: It’s often not possible to upgrade all systems atomically, so you need to handle different ends speaking different protocol versions.
  • Untrusted: If you don’t want everything to be taken down by one malfunction you need to defend against invalid inputs and being overwhelmed. Sometimes you also need to defend against actual attackers.
  • Pipes: Everything gets packed as bytes so you need to be able to (de)serialize your data.

All of these things can be mostly avoided when programming things that run on one computer, that is unless you end up optimizing performance and realizing your computer is actually a distributed system of cores and some of them come back. Some domains manage to avoid some of these but I’ve experienced subsets of these problems working on web apps, self-driving cars, a text editor, and high-performance systems, they’re everywhere.

This isn’t even all the problems, just things about the network. Tons of effort is also expended on things like how various bottlenecks often entail a complicated hierarchy of caches that need to be kept in sync with the underlying data store.

One way you can avoid all this is to just not write a distributed system. There are plenty of cases you can do this and I think it’s worthwhile to try way harder than some people do to pack everything into one process. However past a certain point of reliability or scale, physics means you’re going to have to use multiple machines (unless you want to go the mainframe route).

Fragile

As you connect machines or increase reliability goals, the strategy of just crashing everything when one piece crashes (what multi-threaded/multi-core systems do) becomes increasingly unviable. Hardware will fail, wireless connections drop, entire data centers have their power or network taken out by squirrels. Some domains like customers with flaky internet also inevitably entail frequent connection failure.

In practice you need to write code to handle the failure cases and think carefully about what they are and what to do. This gets worse when merely noting the failure would drop important data, and you need to implement redundancy of data storage or transmission. Even worse, both another machine failing and a network connection breaking become visible just as some expected network packet not arriving after “too long”, introducing not only a delay but an ambiguity that can result in split-brain issues. Often something like TCP implements it for you but sometimes you have to implement your own heartbeating to periodically check that another system is still alive.

Attempts to make this easier include exceptions, TCP, concensus protocols and off-the-shelf redundant databases, but no solution eliminates the problem everywhere. One of my favourite attempts is Erlang’s process linking, monitoring and supervising which offers a philosophy that attempts to coalesce all sorts of failures into one easier to handle general case.

Narrow

Network bandwidth is often limited, especially over consumer or cellular internet. It may seem like this isn’t a limitation very often because you rarely hit bandwidth limits, but that’s because limited bandwidth is ingrained into everything you do. Whenever you design a distributed system you need to come up with a communication protocol that communicates on the order of what’s necessary rather than on the order of the total size of your data.

In a multi-threaded program, you might just pass a pointer to gigabytes of immutable or locked data for a thread to read what it wants from and not think anything of it. In a distributed system passing the entire memory representing your database is unthinkable and you need to spend time implementing other approaches.

Although actually multi-core systems are a certain kind of distributed system and they employ protocols behind the scenes to transfer only the data that’s necessary, but involve many more broadcasts and round trips than would be viable with most networks. I actually think trying to apply techniques used to make multi-core machines seamless to distributed systems is a good way to think of neat solutions that might be much more general than you’d otherwise design. Similarly once you really start optimizing systems hard you notice that bandwidth inside your computer becomes a constraint too.

Dealing with low bandwidth usually involves a message type for each query or modification to a shared data structure, and deciding when to ship over more data so local interactions are faster, or less data to avoid terrible bandwidth cases. It often goes further to various types of replicated state machine where each peer updates a model based on a replicated stream of changes, because sending the new model after every update would be too much bandwidth. Examples of this include RTS games to exchange feeds. However maintaining determinism and consistency in how each peer updates its state to avoid desyncs can be tricky, especially if different peers have different languages or software versions. You also often end up implementing a separate protocol for streaming a full snapshot, because replaying events from the beginning of time when connecting isn’t viable.

Attempts to make this easier include RPC libraries just making it easier to send lots of different message types for different queries and updates rather than shipping data structures, caching libraries, and compression. Cool but less commonly used systems include things like Replicant that ensure synchronized state machine code and update streams on many devices to make replicated state machines easier and less fraught.

Laggy

One network round trip can’t be a problematic latency or you need better networking hardware or a different problem to solve. The difficulties come from avoiding implementing your solution in a way that needs too many network round trips. This can lead to needing to implement special combo-messages that do a sequence of operations on the server instead of just providing smaller primitive messages.

The web, with its especially large latencies, has had lots of problems of this type such as only having the font/image URLs after loading the HTML, or REST APIs that require multiple chained calls to get the IDs needed for the next. Lots of things have been built for these problems like resource inlining, HTTP/2 server push and GraphQL.

A cool somewhat general solution is Cap’n Proto promise pipelining and other systems that involve essentially shipping a chain of steps to perform to the other end (like SQL). These systems essentialy send a limited type of program to perform on the server. Unfortunately you often run into the limitations of the language used, like you can’t add 1 to your Cap’n Proto result before passing it to a new call without a round trip. But if you make your language too powerful you can run into problems with the code you’re shipping overloading the server or being too big. Just adding a multi-step message for your use case is pretty easy if you control both ends, but can be harder if the other end is a company’s API for third parties, or even just owned by a different team at a big company, and those are the cases where they tend not to want to run your programs on their server. I think there’s lots more avenue for exploration here in terms of new approaches to sending segments of code while re-using sent code to save bandwidth and limiting the potential for it to do damage.

Another solution that can work in a data center is to use better networking. You can get network cards with 2us latencies and 100Gbps bandwidths or better, but basically only HPC, simulations and finance use them. However these just reduce the constant factor and don’t save you if your approach takes O(n) round trips.

Asynchronous

As soon as you have 2+ sources of events that aren’t synchronized then you start worrying about race conditions. This can be multiple servers, or just a web app with both user input and a channel to the server. There’s always uncommon orderings like the user clicking the “Submit” button a second time before the next page loads. Sometimes you get lucky and the design of your system means that’s fine, other times it’s not and you either fix it to handle that case or get bug reports from customers who were billed twice. The more asynchrony the more cases you have to either think about or solve with an elegant design which precludes bad states.

Depending on your language/framework, asynchrony can also entail a change to the way you normally write code that makes everything bloated and uglier. Lots of systems used to and still do require you to use callbacks everywhere, sometimes without even providing you closures, making your code an enormous mess. Many languages have gotten better at this with features like async/await or coroutines with small stack like Go, or just using threads and blocking I/O. Unfortunately some of these solutions introduce function color problems where introducing asynchrony requires making changes throughout your codebase.

Asynchrony edge cases are a reasonably fundamental problem, but there’s lots of available patterns for solving different kinds of asynchrony. Examples include concurrency primitives like locks and barriers, protocol design ideas like idempotency, and fancier things like CRDTs.

Mismatched

Usually it’s not possible to upgrade every component of a distributed system atomically when you want to change a protocol. This runs from communicating server clusters that must run 24/7 to users who have an old version of your web page loaded in a tab. This means for some time you’ll have systems that want to talk a newer protocol version communicating with systems that only know an older protocol. This is just a problem you need to solve and there’s two broad classes of common solutions with many subtypes:

  • Have the new software version be able to speak both the old and new protocol version and negotiate to use the new version with upgraded peers, either by maintaining both implementations or mapping the old handlers onto the new ones.
  • Use data structures that provide some degree of compatibility for free, then only upgrade your protocol in those ways. For example unrecognized fields in JSON objects are usually ignored so can be used for new functionality when recognized. Migrations can usually add new columns to a database table without it breaking queries. Then you usually go to great lengths to shoehorn every change into being this type of compatible.

The problem with both these cases is the first steps usually accumulate technical debt in the form of code paths to handle cases (for example of missing fields) that will never come up once all peers are upgraded past the protocol change. This usually entails multi-stage rollouts, for example introduce a new field as optional, roll out the new version everywhere, change the field to be mandatory now that all clients send it, do another rollout. I’ve definitely spent a lot of time planning multi-stage rollouts when I’ve wanted to change protocols used by multiple systems without leaving a mess.

There’s lots of things that help with both of these approaches, both serialization systems that provide lots of compatible upgrade paths like Protobufs, to various patterns for deserializing/upgrading old type versions.

Untrusted

Not only can your data fail to arrive but your system can recieve data that might actively harm it. Systems have bugs which cause invalid messages to be sent, so inputs need to be carefully validated and errors returned, not only at the serialization level but the business logic level. Bugs or new loads can cause systems to send messages faster than they can be handled, necessitating backpressure and limits. You may even have to defend against attackers who actively try and subvert your system by sending messages that would never be sent by your usual counterparties and intelligently seek out edge cases.

Here too we have lots of patterns including rate limits, field validation logic and channels with built in backpressure. On the security side we also have a field of things like encryption, certificates and fuzzing. We’ve also gotten better at being general here as we’ve reduced prevalence of manual patterns like ensuring we always escape interpolated strings in SQL and HTML, with more general patterns like ? query parameters and templating systems which always apply escaping.

Pipes

Last and mostly least, everything has to be a stream of bytes or packets of bytes. This means you need to take your nice data structures that your language makes easy to manipulate and pack them into a different form from their in-memory representation in order to send on the wire. Luckily except in very few places easy serialization/RPC libraries have made this pretty easy, if occasionally somewhat slow. You can also sometimes use methods that allow you to pick out exactly the parts you want from the byte buffers without transforming it to a different representation, perhaps by casting your buffer pointer to a C structure pointer (when that’s even close to safe-ish), or using something like Cap’n Proto that can generate accessors.

This is probably the one I’ve spent the least time fighting, but one case I can remember was when I wanted to send a large data structure, but the available serialization system could only serialize it all at once rather than streaming it packet by packet as the socket could accept it, and I didn’t want to block my server for a long time doing the entire thing, creating tail latency. I ended up choosing a different design, but I could also have written custom code to break my data structure up into chunks and send it a little bit at a time.

Conclusion

I suspect many responses to this post will be of the form “Actually {some/all of these problems} are trivial if you just {do some thing that isn’t universally applicable, is time consuming or has its own issues, possibly something I mentioned, if so probably using Erlang} and the real problem is that other people are bad at programming unlike people in the good old days”. There are lots of things that help, and there is a skill component in knowing about good solutions, choosing the right ones, and implementing them effectively. However these are still hard problems and people have to make difficult real tradeoffs because we haven’t solved them effectively enough. Maybe you would have taken a different side of the tradeoff but people make these technology decisions for real reasons and we should strive to reduce the costs, as well as improving decisions over which costs we accept.

I just can’t use Erlang for most projects I do because they require either extremely low latency, integration with some part of a non-Erlang ecosystem, or they’re too computationally intensive (yes I know about NIFs). This means there’s ample opportunity for productivity improvements just by bringing solutions from one domain and implementing them in another domain or making them faster! I love seeing efforts to bring Erlang’s benefits to more areas. And even Erlang doesn’t solve all of these problems to the extent I believe it’s possible to one day address them.

I think one of the real biggest hammers you can take to these problems is just to try really hard to avoid writing a distributed system in the first place. One of my goals for this post is to inspire people to try to develop more general solutions instead of having to repeatedly implement specific patterns, but my other goal is to try and put all the costs in your face at once and say are you sure adding that separate networked system will really make your job easier? Sometimes a distributed system is unavoidable, such as if you want extreme availability or computing power, but other times it’s totally avoidable. To pick specific examples:

  • I think people should be more willing to try and write performance-sensitive code as a (potentially multi-threaded) process on one machine in a fast language if it’ll fit rather than try and distribute a slower implementation over multiple machines. I acknowledge that this takes time and effort to learn how to do and optimize, but it’ll pay off in a simpler system.
    • In particular I think people should be more aggressive about trying to use multi-threading on a really big computer when possible. I personally find multi-threaded programming in Rust way easier than parallelizing with multiple processes when it’s viable. Some problems like asynchrony are similar but others like serialization, latency and bandwidth largely go away except at performance levels way higher than you’d probably get out of a hypothetical distributed version.
  • I think people should be more willing to use C FFI to bind to libraries in other languages rather than putting them in a separate networked service (example picking on users of my own library, although I don’t actually know what their constraints were). Yes you have to learn how to do C FFI and deal with unsafety, but I’d take that trade to avoid the network service.
  • There are reasons people choose to split things into separate services other than availability and parallelism. For example ability to deploy updates quickly without coordinating with another team, fast CI, using a different language, isolation.
    • We should build more alternatives that don’t involve separate systems, like tools for using auto-updating hot-reloaded dynamically linked libraries with sandboxing instead of microservices (eliminating “narrow”, “laggy” and “asynchronous”). I’m pretty sure at least one instance of hot-reloading dylib updates pushed over the network exists (I’d appreciate links!) but we’re far from availability of excellent implementations in many languages and in the mean time it isn’t a viable alternative for most people considering adding a microservice to build this themselves.
    • Better tools for continuous integration, continuous deployment, isolation, and monorepos can reduce the incentive to split off services to reduce iteration cycle time.

I follow Jonathan Blow’s Twitter and streams and end up with mixed feelings. On the one hand I resonate with his feeling that modern software is way more complex than it needs to be and like the aesthetic and focus on performance and compile time power embodied in his language. On the other hand when he rants about how modern programmers just don’t know how to do things the Good Old Ways™ and need to stop making terrible design choices to be productive, I can’t help but think back to how I as one person have both been what he considers terribly unproductive working on web systems, and productive and effective when writing fast systems in his preferred style. It’s not that I just made terrible decisions sometimes but not other times, or was unaware of systems programming or data oriented design, it’s that I was facing different tradeoffs that forced me to make a distributed system and face a bunch of unproductive challenges that aren’t fully solved. The distributed systems I work on nowadays are low level, very fast, minimize layers of complexity, and my coworkers are extremely skilled. If anything, I’m less productive per similar-sounding feature when I work on these distributed systems than I was when I was programming in Ruby on Rails, because there’s less available tooling than for Rails. Most of my effort still goes into addressing the same distributed systems problems, which you just have to deal with less when programming a game. I agree with him that it’s totally possible for things to be better and dramatically less complex, but people decide to use established technologies because they don’t have the luxury of taking the time to write their ideal platform from scratch first. That’s why I’m so excited when people like him work to develop new tools like his language. I think even if everybody suddenly knew all his favorite game developer skills, more people would have the ability to build new types of tools, but until those tools were built, creating distributed systems would remain hard and unproductive. Also to make sure I tick off Blow fans and haters alike I should say that I recommend watching some of his streams, I think he’s really interesting, skilled and worth listening to, despite his abrasiveness and strong opinions. I find “what about this design would Jonathan Blow yell about being terrible” a good lens to help me come up with interesting alternatives.

Anyhow, I hope this leads you to think about the ways that your work could be more productive if you had better tools to deal with distributed systems, and what those might be. Alternatively I hope it prompts you to seriously consider the costs of writing distributed systems and what you can do to bend all the tradeoffs in your area of the programming world more towards non-distributed systems. Also try to think about what reasons people might not appear to you to be as good at developing software as you are with your Chosen Technology™ and how you can understand the constraints and tradeoffs they are dealing with and what solutions might shift the balance.