Breaking the Rules With Stateful Services

Nicholas Tietz-Sokolsky on 2022-02-15

Most web applications have far more reads than writes, and the techniques for optimizing a read-heavy web application are pretty well known. Where things get really fun and interesting is when you step outside of the common case and enter an unusual one: when you need low latency, you have frequent writes which are not cacheable, and you read your writes frequently.

We’re going to go through a case study of how we designed a low-latency write-heavy service to resolve our biggest scaling bottlenecks, and then we’ll wrap things up with what we can take away from this project. Let’s go!

Remesh has unusual scaling characteristics, because our service isn’t your average web application.

At its core, the main Remesh experience is a rapid-fire exchange: the moderator asks a question (“What’s the best programming language?”), participants put in their answers (“Rust,” obviously — fight me), and then participants work through a series of comparison prompts between different answers. While participants are putting in their answers, we’re training a machine learning model in the background so that it’s available to the moderator shortly after the question is done, and they can get insights almost immediately (even while the question is running, from early iterations and partial data).

A diagram showing the flow from frontend submitting to backend, and backend hitting an ML service to train a model.

This interactive experience is part of what makes scaling Remesh a challenge. Each time a participant responds to a prompt (with text, or with a comparison between two pieces of text), that input is used to determine their next prompt. Each time a participant submits a piece of text, we also want it to be considered for all other participants’ prompts, so data can be gathered on it immediately. That means that under a naive implementation, every response produces one write (save the data) and at least one read (retrieve the data needed to generate the next prompt).

Sequence diagram of a participant submitting an exercise, then reading their next prompt. Reads and writes are in an even ratio.
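To make that read/write coupling concrete, here’s a minimal sketch of the naive flow, using in-memory dicts as stand-ins for the real datastore. The names and the prompt-selection logic are purely illustrative, not Remesh’s actual API.

```python
import random

SUBMISSIONS: dict[str, list[str]] = {}  # question_id -> all submitted texts
HISTORY: dict[str, list[str]] = {}      # participant_id -> that participant's texts

def handle_response(participant_id: str, question_id: str, text: str) -> tuple[str, str] | None:
    # One write: persist the answer so it is immediately available for
    # every other participant's comparison prompts.
    SUBMISSIONS.setdefault(question_id, []).append(text)
    HISTORY.setdefault(participant_id, []).append(text)

    # At least one read: fetch the data needed to build this participant's
    # next prompt, here two texts they didn't author, for a comparison.
    others = [t for t in SUBMISSIONS[question_id] if t not in HISTORY[participant_id]]
    if len(others) < 2:
        return None
    first, second = random.sample(others, 2)
    return first, second
```

Every response hits the datastore at least twice, and that cost scales with the number of simultaneous participants.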

We were following a traditional web application structure here: an HTTP request-response API backed by stateless services. This architecture often lets you get a lot of scale pretty easily by throwing a cache in front of it and calling it a day. But that assumes reads are much higher volume than writes, and it also assumes that reads are cacheable.

Neither of those is true for us. The reads-to-writes ratio is closer to 1:1 than to the 10:1 more typical of a web application, and the reads are very unfriendly to the cache, because every single write busts it. We did try updating the cache instead, but concurrency issues made that a challenge to get correct, and the Redis operations we were using were very CPU intensive, a problem exacerbated by Redis being single-threaded. Needless to say, we had outgrown some of our application’s original design decisions and were running headlong into the brick wall of two tough bottlenecks.

Ultimately, we found that we had a hard upper limit on our scaling, and it came down to two bottlenecks, one with our Redis usage and one with our PostgreSQL usage.

Our ML service was using Redis for two purposes: to cache models, and for the prompt generation / submission handling system. Caching models was fine and continues to be a great use case for Redis in our stack. Handling submissions from participants is where things got dicey for us. We hit CPU bottlenecks because we were running set operations and myriad other CPU-intensive commands inside pseudo-transactions. I don’t remember all the details of this system anymore, but whenever we ran a load test, we’d hit 80–100% CPU on Redis.
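I don’t have the original code anymore, but the pattern looked roughly like the sketch below: a pipelined bundle of set operations run on every single submission. The keys and selection logic here are guesses for illustration; the point is that commands like SDIFFSTORE are O(N), and every one of them burns cycles on Redis’s single thread.

```python
import redis

r = redis.Redis()

def record_and_pick(question_id: str, participant_id: str, text_id: str) -> list[bytes]:
    # A "pseudo-transaction": several set operations pipelined together.
    pipe = r.pipeline(transaction=True)
    # Write: add the new text to the pool of all texts for this question.
    pipe.sadd(f"q:{question_id}:texts", text_id)
    # Track which texts this participant has already seen or authored.
    pipe.sadd(f"q:{question_id}:seen:{participant_id}", text_id)
    # O(N) set difference to find texts this participant hasn't seen yet...
    pipe.sdiffstore(
        f"q:{question_id}:unseen:{participant_id}",
        [f"q:{question_id}:texts", f"q:{question_id}:seen:{participant_id}"],
    )
    # ...then sample two of them for the next comparison prompt.
    pipe.srandmember(f"q:{question_id}:unseen:{participant_id}", 2)
    return pipe.execute()[-1]
```

Run once per submission across thousands of concurrent participants, this kind of pipeline adds up fast.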

In the backend monolith, we were also maxing out write IOPS on PostgreSQL in our Google Cloud SQL instance and hitting very high CPU loads there. This ultimately came down to the fact that we were writing every single submission into the database in its own transaction, which caused high write loads and maxed out what the underlying disk was capable of. We also had lots of instances each handling their own connections to the database, and this connection handling consumed a lot of resources. Our database usage was a classic case of under-optimization: patterns which worked fine for a prototype and a launch, but which did not hold up when we hit higher scale.
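In spirit, the write path looked something like this sketch (table and column names invented for illustration): one connection, one transaction, and one commit per submission.

```python
import psycopg2

def save_submission(dsn: str, participant_id: str, question_id: str, text: str) -> None:
    # One transaction (and its commit) per submission: fine for a prototype,
    # punishing once the write volume climbs.
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # opens a transaction, commits on success
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO submissions (participant_id, question_id, text) "
                    "VALUES (%s, %s, %s)",
                    (participant_id, question_id, text),
                )
    finally:
        conn.close()
```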

To address these bottlenecks, the traditional approach would be to shard our Redis instance to double its capacity, scale up our PostgreSQL instance, and batch our writes to that instance. We might have been able to batch on the ML side as well, to squeeze more capacity out of the system. We considered this approach, and we found it lacking. It would significantly increase the complexity of the most complex parts of our system, resulting in a far higher chance of critical, breaking bugs. It would also be a big project that just kicks the can down the road; I’d rather solve this more completely and not have to revisit it for years.
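For reference, the batching half of that rejected fix would look roughly like the sketch below (again with invented table names): buffer submissions in memory and flush them in one transaction with psycopg2’s execute_values, trading per-row commits for extra moving parts in an already-complex write path.

```python
import psycopg2
from psycopg2.extras import execute_values

def flush_batch(dsn: str, rows: list[tuple[str, str, str]]) -> None:
    """Write a buffered batch of (participant_id, question_id, text) rows."""
    if not rows:
        return
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # one transaction and one round trip for the whole batch
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    "INSERT INTO submissions (participant_id, question_id, text) VALUES %s",
                    rows,
                )
    finally:
        conn.close()
```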

We took a step back, looked at things holistically, and realized that we could question some assumptions and undertake a more dramatic restructuring to ensure we wouldn’t have to revisit this problem in a major way for years to come.

The solution ended up being pretty elegant. For each participant and each request, you can split the data into two pieces: data submitted by the participant themselves, and data submitted by other participants. These two parts have different consistency requirements. We need participants to see their own up-to-date data so they don’t get the same prompts twice (for example). But the data about other participants simply needs to be reasonably up to date, so you can get a good sampling across the entire space of submissions.

This leads to a pretty nice solution: if you keep a connection open the whole time, you can hold the participant’s own submissions in memory for that connection to use. In the background, you can periodically refresh a local cache of all the text submissions for communal use across all the connections as well.
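Here’s a minimal sketch of that split, assuming an asyncio-style service (all names are illustrative): per-connection state gives each participant read-your-writes on their own submissions, while one background task per instance keeps a shared, eventually-consistent cache of everyone else’s texts.

```python
import asyncio

SHARED_TEXTS: list[str] = []          # communal cache of everyone's submissions
REFRESH_INTERVAL_SECONDS = 2.0

async def refresh_shared_cache(fetch_all_texts) -> None:
    # Others' data only needs to be "reasonably up to date", so a periodic
    # refresh is good enough; no per-request round trip required.
    global SHARED_TEXTS
    while True:
        SHARED_TEXTS = await fetch_all_texts()
        await asyncio.sleep(REFRESH_INTERVAL_SECONDS)

class ParticipantConnection:
    """State held for one open connection."""

    def __init__(self, participant_id: str) -> None:
        self.participant_id = participant_id
        self.own_submissions: list[str] = []   # strongly consistent, in memory

    def submit(self, text: str) -> None:
        # Read-your-writes for free: the update is local and immediate.
        self.own_submissions.append(text)

    def next_comparison(self) -> tuple[str, str] | None:
        # Sample from the shared cache, skipping the participant's own texts
        # so they never get prompted with something they already wrote.
        candidates = [t for t in SHARED_TEXTS if t not in self.own_submissions]
        if len(candidates) < 2:
            return None
        return candidates[0], candidates[1]
```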

But wait, you say. We need stateless services! After all, The Twelve-Factor App says that applications should be stateless! And that this is not just a general rule, but that statefulness “should never be used or relied upon”. Well, we’ve broken the rules before by rewriting instead of refactoring.

Sometimes you have to break the rules to get big wins. In our case, we realized that we did not have a traditional web application; ours is much more real-time. We were not playing the same game that traditional web apps are, which is why playing by those rules boxed us into poor performance, and why breaking them was critical for unlocking the next level of scale. This can be uncomfortable, and it should be: best practices are usually best practices for a reason. After thorough analysis, we ultimately concluded that we weren’t missing something and that we really were in a situation which merited going outside the norm.

So, we went down that path. We shifted the responsibility for accepting submissions away from the ML system and into a new submissions service. We also shifted recording these submissions entirely out of the monolithic backend and into this new submissions service. These no longer get recorded into Redis or Postgres! (Periodically, both the monolith and the ML service poll the submissions service to retrieve batches of the data they care about — a fraction of the total data — and save it in their own store.)
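The consumer side of that polling is a simple loop, roughly like this sketch, where the endpoint, cursor shape, and function names are assumptions rather than the real API: each downstream system pulls a batch of whatever new submissions it cares about and saves them with one bulk write.

```python
import time
import requests

def poll_submissions(base_url: str, save_batch, interval_seconds: float = 5.0) -> None:
    cursor = None
    while True:
        params = {"after": cursor} if cursor else {}
        resp = requests.get(f"{base_url}/submissions", params=params, timeout=10)
        resp.raise_for_status()
        batch = resp.json()
        if batch["items"]:
            save_batch(batch["items"])   # one batched write instead of one per submission
            cursor = batch["cursor"]     # resume after the newest item next time
        time.sleep(interval_seconds)
```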

This new service serves WebSockets to clients. When a participant starts responding to a message, they open a WebSocket connection to the server, which then holds their exercises in the connection handler. These get written out in the background to Bigtable so that if the connection dies and the client reconnects to a different instance, that new instance can read their previous writes to fill up the initial local cache and maintain consistency. The connection remains open for as long as the participant is responding, giving us a persistent, sticky connection. (We explored sticky sessions with a traditional HTTP API instead, but that brings additional complexity with load shedding, not to mention that our Kubernetes ingress doesn’t support it.)

Simplified sequence diagram representing submissions and requests to the new service, replacing the old flow.
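A stripped-down sketch of that connection lifecycle, using Python’s websockets library; the Bigtable reads and writes are hidden behind stand-in functions (load_previous_submissions, flush_pending) because the actual persistence details aren’t part of this post.

```python
import asyncio
import json
import websockets

async def load_previous_submissions(participant_id: str) -> list[dict]:
    # Stand-in: on reconnect, read the participant's earlier writes back
    # out of the durable store to rebuild the local cache.
    return []

async def flush_pending(participant_id: str, submissions: list[dict]) -> None:
    # Stand-in: background write to the durable store (Bigtable in our case),
    # so another instance can take over if this connection dies.
    pass

async def handle_participant(websocket) -> None:
    participant_id = (await websocket.recv()).strip()
    submissions = await load_previous_submissions(participant_id)

    async def flusher() -> None:
        while True:
            await asyncio.sleep(1.0)
            await flush_pending(participant_id, submissions)

    flush_task = asyncio.create_task(flusher())
    try:
        async for message in websocket:
            # State lives in the handler for the life of the connection.
            submissions.append(json.loads(message))
            await websocket.send(json.dumps({"received": len(submissions)}))
    finally:
        flush_task.cancel()

async def main() -> None:
    async with websockets.serve(handle_participant, "0.0.0.0", 8080):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```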

And with this new service and the shifted responsibilities, the vast majority of the heavy operations simply stopped. Each connection now has a local cache of the participant’s own submissions, which means you avoid a network round trip to retrieve them and can update them immediately on submission. It also has a local cache of all the texts you’ll need to compare between, saving another network round trip. Multiplied across tens of thousands of simultaneous requests, this is a huge savings.

Ultimately, the project was a success. We implemented the new service in a couple of months with two engineers (and one of them 👋 went on paternity leave near the end of the project). The new service was rolled out without causing issues for end users (especially notable as it was used in every single customer project on our platform, and is critical!), and it resulted in huge scaling and performance improvements for us.

So we got way better performance, much higher scale, and we saved money while doing it. That’s a definite win.

After reflecting on this project, there are two main lessons that I took away from it.

Know when to break rules. The dogma is that stateless services are the ideal, the platonic form of a microservice. This is great and all, but it’s not a universal truth: Sometimes a stateful service is just what you need. Best practices often get a cult-like following. If people don’t know why things are a best practice, they won’t know when to break them. But if you do know why something is a best practice, you can break it when necessary.

Measure before you cut. Before we started on this project, we did a significant round of load testing and measurement to figure out where the bottleneck in our system was. We thought there was just one. Oh, the naivety! Once we got past optimizing the bad queries and the low-hanging fruit, we were left with two big bottlenecks which we couldn’t resolve by throwing hardware at the problem. Ultimately, we would never have found this solution if we hadn’t measured to find where the problems lay and experimented with different hypotheses. Hunches can get you started, but by measuring before you commit to an approach, you’ll save yourself a lot of toil.

And that’s it. If you enjoyed this post, please reach out. We have a few open positions, and I generally just love to geek out about software engineering!