Durable Objects in Production - How Linc uses Cloudflare's new serverless real-time data platform


A few weeks ago, Cloudflare announced Durable Objects: a "truly serverless approach to storage and state", which piqued our interest here at Linc. We're already big proponents of Cloudflare Workers (Workers is our recommended deploy target, after all), so we're always interested in what new capabilities that platform is getting. But it was this section that sounded ideal for solving a particularly annoying UX problem with much less code/infrastructure than we'd assumed we would need (see the comparison section at the end).

Historically ... there was no way to control which instance received a request, [so] there was no way to force two clients to talk to the same Worker... Durable Objects change that: requests related to the same topic can be forwarded to the same object, which can then coordinate between them, without any need to touch storage.

This is a big deal. And, after a few days tinkering, we've been able to successfully deploy our solution to production! In this post we'll talk about the problem we solved, explain how Durable Objects work, go through the code and what it's like working with them, and then look at just how much work it saved us.

Note: Durable Objects are in limited beta, and "not recommended for production use" (yet). However, our use-case doesn't push any of the beta limits and we're using it to progressively enhance our existing solution, so we've felt safe deploying it to production already!

The problem—streaming build logs

While Linc is a product focussed on automating deployments and previewing each commit, a big part of the day-to-day usage is building commits from source into a FAB, and viewing the logs of that process is a core piece of the product:

[Image: build log view]

When you're first getting set up, or when something changes in your build or goes awry, you can find yourself poring over these logs to find and fix your issue. And so being able to watch them "live", as they're being built, is a useful tool in reducing your feedback loop (and of course Linc is big on reducing feedback loops!). But that turns out to be an annoyingly difficult feature to implement.

For starters, most of the time a build will start before any clients are watching (but not always, e.g. when restarting a build after a config change). In this case, you need to accumulate a partial build log somewhere so it's ready to send to the first client that connects. Multiple clients might also be watching the same build, and they can disconnect/reconnect at any moment. It's not intractable, but it's just complex enough to be tricky.

Our current compromise: reuse our existing GraphQL subscriptions

Until now, we've taken a pragmatic approach: use GraphQL subscriptions to periodically flush the entire `Build` record, logs included, every few seconds. This actually reuses our existing live-update infrastructure that powers almost every view in the app.

So while updates are triggered from DB events, using a GraphQL subscription server in this way gives us an important guarantee: the clients are always getting the full snapshot of the data that they're interested in. In other words, any change to a record in the DB results in the client getting a full update of any affected queries from the server—there's no need to manage or merge updates on the client, it can simply replace its local cache with the latest update. It makes a fully live UI much easier to implement, at the expense of some redundant data transfer.

As a general pattern, it's worked fantastically for us. But for sending the logs of a running build line-by-line, it's way, way, way too heavy. Not only would we be sending far more data than necessary, we'd also be absolutely thrashing our Dynamo table even if nobody's currently watching the logs. So, we've settled on a compromise: use the existing infrastructure and flush the logs to the DB every ~10 seconds. Most builds take between 2 and 4 minutes so it's only a handful of extra writes, and it feels OK for a stop-gap. We've had an open ticket to build out a dedicated log-streaming solution, but the amount of additional infrastructure put us off. That is, until Durable Objects came along...
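As an aside, the stop-gap boils down to buffering log lines in memory and writing them out on a timer rather than once per line. A minimal sketch of that idea follows; `LogBuffer`, `writeBatch` and `flushesPerBuild` are illustrative names, not the actual Linc builder code:

```javascript
// Hypothetical sketch: buffer log lines and flush them in batches.
// `writeBatch` stands in for whatever persists the logs (e.g. a DB update).
class LogBuffer {
  constructor(writeBatch) {
    this.writeBatch = writeBatch
    this.pending = []
  }

  // Called for every line of build output; cheap, no I/O
  append(line) {
    this.pending.push(line)
  }

  // Called on a timer; writes at most once per tick, and only if there's something new
  flush() {
    if (this.pending.length === 0) return false
    const batch = this.pending
    this.pending = []
    this.writeBatch(batch)
    return true
  }
}

// Upper bound on writes: one flush per interval over the life of a build
const flushesPerBuild = (buildSeconds, intervalSeconds = 10) =>
  Math.ceil(buildSeconds / intervalSeconds)

// Usage (not run here):
//   const buffer = new LogBuffer(saveToDb)
//   setInterval(() => buffer.flush(), 10000)
```

With a 10-second interval, a 2 to 4 minute build comes out at no more than 12 to 24 writes, no matter how many lines it prints.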

What are Durable Objects?

Note: it's worth reading the introductory blog post to get a background, but this is my best attempt at a TL;DR:

Durable Objects are a way of defining a serverless function as an instance of a JS class. They can't be reached by external HTTP requests directly; instead, a "normal" (i.e. stateless) Worker creates/accesses instances using their namespace (usually the name of their class) and an id. For a given namespace & id, there will be exactly one instance, somewhere in the world, and it can store data. Instances communicate with Workers over normal HTTP requests/responses, and support WebSockets.

If you find that a bit confusing, you're not alone! I didn't really "get" it until I'd read an example, so I've reproduced and commented our Worker and Object code in The Solution section below.

Durable Objects? "Stateful workers"? "Materialised actors"?

I hope it's not terribly controversial, but I'm not a big fan of the name "Durable Objects". It conjures a very data-centric model, where it sounds like a way to... maybe define classes that magically run in the cloud, and where instances... maybe get "frozen" and stored when they go idle? It presents it like a type of database, whereas the reality is much closer to the Worker/Serverless model, just with per-worker storage.

That said, the phrase "workers with per-worker storage" massively downplays how incredibly transformative this concept is, so I get that Cloudflare want it to sound like a "whole new thing". But conceptually, you may be better off thinking about these as "materialised actors" or, as we've come to call them internally, "stateful workers".

We'll just call them "Objects" and "instances" in the remainder of this blog post.

Instances as a kind of "state"

In the documentation, there's a lot of focus on the fact that an Object has its own key-value storage using `controller.storage` (the same object is named `state` in our constructor further down, hence `state.storage` there). But that API is available only once you have a live instance of an Object, and getting there is actually a whole type of state in disguise.

The key phrase is here:

Each object has a globally-unique identifier. That object exists in only one location in the whole world at a time. Any Worker running anywhere in the world that knows the object's ID can send messages to it. All those messages end up delivered to the same place.

Combine this with the fact that each Object instance supports WebSockets, and suddenly you can define a system where, as long as two endpoints share some kind of key, they can pass messages to each other through the instance (kind of the way WebRTC connections are possible when two computers can't talk directly, but can both talk to a TURN server). And it turns out, for Linc to stream live build logs to the browser, that's exactly what we need.
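To make that routing guarantee concrete, here's a toy, single-process model of it. This is emphatically not how Cloudflare implements Durable Objects, and `ToyNamespace`, `Relay` and their methods are made up for illustration. The only point is that two parties who agree on a key end up talking to the same instance:

```javascript
// Toy model only: a namespace that hands out exactly one instance per name
class ToyNamespace {
  constructor(ObjectClass) {
    this.ObjectClass = ObjectClass
    this.instances = new Map()
  }

  // Same name in, same instance out; created lazily on first use
  get(name) {
    if (!this.instances.has(name)) {
      this.instances.set(name, new this.ObjectClass())
    }
    return this.instances.get(name)
  }
}

// A hypothetical object that relays messages to everyone who subscribed
class Relay {
  constructor() {
    this.listeners = []
  }
  subscribe(listener) {
    this.listeners.push(listener)
  }
  publish(message) {
    this.listeners.forEach((listener) => listener(message))
  }
}

const builds = new ToyNamespace(Relay)
const received = []

// A "client" and a "builder" that share nothing but a key...
builds.get('abc123').subscribe((msg) => received.push(msg))
builds.get('abc123').publish('npm install')

// ...whereas a different key reaches a different, unrelated instance
builds.get('def456').publish('never seen by the first client')

console.log(received) // [ 'npm install' ]
```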

The solution

We already have an ideal shared key: the git SHA of the current commit being built. We actually use the `tree_id` of the commit, not the commit sha, because a `tree_id` is a hash of only the underlying code, not its history or commit message (an aside: this is how Linc can release instantly when you merge an up-to-date PR, without waiting for a new build).
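For anyone unfamiliar with the distinction: git hashes the full text of a commit object (tree id, parents, author, message), whereas a tree id covers only file names, modes and contents. The sketch below uses Node's `crypto` module and git's object format to show the effect; the file contents, author line and commit messages are made-up sample data:

```javascript
const crypto = require('crypto')

// A git object id is the SHA-1 of "<type> <byte length>\0" followed by the body
function gitHash(type, body) {
  const header = Buffer.from(`${type} ${body.length}\0`)
  return crypto.createHash('sha1').update(Buffer.concat([header, body])).digest('hex')
}

// A tree containing one regular file: "<mode> <name>\0" + the blob's raw 20-byte id
function treeIdForFile(name, contents) {
  const blobId = gitHash('blob', Buffer.from(contents))
  const entry = Buffer.concat([
    Buffer.from(`100644 ${name}\0`),
    Buffer.from(blobId, 'hex'),
  ])
  return gitHash('tree', entry)
}

// A commit points at a tree and adds metadata on top of it
function commitId(treeId, message) {
  const author = 'Sample Dev <dev@example.com> 1600000000 +0000'
  const body = `tree ${treeId}\nauthor ${author}\ncommitter ${author}\n\n${message}\n`
  return gitHash('commit', Buffer.from(body))
}

const treeId = treeIdForFile('index.js', 'console.log("hi")\n')
const first = commitId(treeId, 'Initial commit')
const reworded = commitId(treeId, 'Reword the commit message')

console.log(first === reworded) // false: the commit sha changes with the message
console.log(treeId === treeIdForFile('index.js', 'console.log("hi")\n')) // true: same code, same tree id
```

In a real repository you wouldn't compute this by hand: `git rev-parse HEAD^{tree}` prints the tree id of the current commit.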

The solution has 4 components:

  • The "normal" Cloudflare Worker that receives requests from the client/builder and connects them to the appropriate Object instance.
  • The Durable Object itself, which maintains a list of current clients, a history of logs, and broadcasts new log entries from the builder to each client.
  • The Builder, which is the machine in our AWS cluster that runs `npm run fab:build` on the source code. It hasn't needed to change very much, but now sends logs to the Worker as well as saving them to Dynamo. We sometimes call this "build server" or just "server" in the code.
  • The Client, which is our normal React app, that needs to reconcile "live" logs from the Worker with the existing data from GraphQL.

The Worker

It's fair to say that this is the piece that I was most confused about after reading the guide. It feels like you should just be able to deploy the Object itself and talk to it directly, but the Worker layer makes a lot of sense once it fits into the bigger picture: it's the only way clients (who talk HTTP to their nearest edge location) can talk to Object instances (which run in exactly one place worldwide).

The simplest possible Worker would look like this:

--CODE:language-js--
export default {
  async fetch(request, env) {
    // Convert our key into an Object ID
    const id = env.MyObjectNamespace.idFromName('some-fixed-value')
    // Connect to that instance, booting it if necessary
    const instance = await env.MyObjectNamespace.get(id)
    // Forward the current HTTP request to it
    return instance.fetch(request)
  }
}

Note that this worker uses a single key for all requests (`'some-fixed-value'`) which means _every_ request would be directed to a _single Object instance_. This is almost certainly not what you want in production, but it was handy when getting started (particularly if you change `'some-fixed-value'` once or twice, so you can be sure you're getting a new instance from the last time you deployed).

Our Worker isn't actually much more complex, but parses the route to find the `tree_id` to direct all requests for a particular build to a shared instance:

--CODE:language-js--
export default {
  async fetch(request, env) {
    const { pathname } = new URL(request.url)

    // Pro tip: put the parsing of the route in a static method on the Object so you can use it in both places
    // (note: the Worker and Object have to be in the same file to share the helper function)
    const route = DurableBuildLog.toRouteParams(pathname)
    if (!route.match) return notFound()

    // Validate that the Client has access to this Site's build logs
    if (route.client) {
      if (!await clientHasAccess(route.sitename, request.headers)) {
        return notAuthorized()
      }
    }

    // Validate that the Build Server is definitely one of ours.
    if (route.server) {
      if (!serverIsAuthentic(request.headers)) {
        return notAuthorized()
      }
    }

    // Find the DO for our tree_id
    const objectId = env.BuildObjects.idFromName(route.tree_id)
    const object = await env.BuildObjects.get(objectId)

    // Pass our request through and return
    const response = await object.fetch(request)

    // Logging the response helped a lot with tracking down bugs inside the DO.
    // You can use `wrangler tail` to follow these, but beware: it only logs
    // from inside the worker, not the Object instance.
    console.log(response)
    return response
  },
}

That's it! We do logging, route parsing and authentication/authorization here but otherwise we just pass requests through to the instance ID derived from the `tree_id` of the current commit.

The Object

Once you start thinking of a "Durable Object" as a "stateful worker I can spring into existence anywhere in the world based off its ID", it starts to become a bit easier to imagine use-cases. For us, we just need to create Websocket connections whenever clients and build servers connect to us, then dispatch events as they come in.

The first step is to parse the HTTP request route:

--CODE:language-js--
export class DurableBuildLog {
  // Static method so we can call it from inside and outside the Object
  static toRouteParams(pathname) {
    const match = pathname.match(
      /^\/((client)|(server))\/([\w-]+)\/([a-f0-9]{15})\/ws$/
    )
    if (!match) return { match: false }
    const [_, __, client, server, sitename, tree_id] = match
    return { match: true, client, server, sitename, tree_id }
  }
  // (the class continues in the listings below)
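To see what the parser hands back, here's the same regex as a standalone function run against a few sample paths. The site name and `tree_id` values are made up; note that the pattern expects exactly 15 hex characters for the `tree_id`:

```javascript
// Standalone copy of the route parser so the example runs on its own
function toRouteParams(pathname) {
  const match = pathname.match(
    /^\/((client)|(server))\/([\w-]+)\/([a-f0-9]{15})\/ws$/
  )
  if (!match) return { match: false }
  const [, , client, server, sitename, tree_id] = match
  return { match: true, client, server, sitename, tree_id }
}

// Sample paths only; these values are hypothetical
const fromClient = toRouteParams('/client/my-site/0123456789abcde/ws')
const fromServer = toRouteParams('/server/my-site/0123456789abcde/ws')
const invalid = toRouteParams('/client/my-site/not-a-hash/ws')

console.log(fromClient.client, fromClient.server) // client undefined
console.log(fromServer.client, fromServer.server) // undefined server
console.log(invalid) // { match: false }
```

Exactly one of `client` and `server` is ever set, which is why plain truthiness checks like `if (route.client)` are enough to tell the two kinds of connection apart.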

Note that sharing this helper assumes that the "Worker" and our "Object" are defined in the same file (FYI: this is how the demo works but not the guide). Using the same file means that both the Worker and the Object can use `DurableBuildLog.toRouteParams` to parse the route, which to me feels like a good enough reason to co-locate them.

But this makes the deployment a little confusing: there are actually three steps (I learned about this from the publish.sh file in the demo):

  1. Deploy the script as if it was a normal worker but have the DO class defined and `export`ed inside it. It won't work yet, we just need it to have been published once before the next step works.
  2. Create a namespace from the Object: effectively telling Cloudflare: hey, `DurableBuildLog` is a Durable Object definition, please create a namespace for it and give me the ID.
  3. Deploy the script again with the DO namespace in your Worker's bindings. This tells Cloudflare to inject `BuildObjects` on the `env` argument to the Worker, wiring things up.

Thankfully, after the first time, you only need to do step 3 to update both your Worker and your Object code. Again, that for me is a good reason to colocate your Worker and Object, but again this is early beta DX and I'm sure it'll improve.

Anyway, back to our Object code:

--CODE:language-js--
  constructor(state, env) {
    // We could use state.storage if we needed these logs to persist, but we don't.
    // So we'll just use an instance variable and accept its limitations (see below)
    this.server = null
    this.clients = []
    this.resetLogs()
  }

  resetLogs = () => {
    this.logs = []
  }

Here, `state.storage` is the API the docs talk about, and which we thought we would need, but so far we haven't. It turns out that these instances can hold a small amount of state in memory, so we can just use `this.logs = []` to store our build-logs-so-far. Since our Builder holds a WS connection open while it's building, the effect is that this in-memory instance variable seems to be preserved for the duration of the build, regardless of whether a client or builder reaches the Object first.

Note: In-memory storage is subject to the same eviction rules as normal workers. But in our case, if the Object instance gets evicted during a build, our experience degrades back to our GraphQL fallback. In practice, since our builds are quite short, we've been OK so far, but it's something we could revisit with a more resilient implementation down the track.
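If eviction did become a problem, one option would be to mirror each log entry into the Object's storage and reload it on startup. Below is a rough sketch of that idea, assuming the async `get`/`put` methods on `state.storage`. The `PersistentLogs` class, the `'logs'` key and the in-memory `makeFakeStorage` stand-in are all hypothetical, and none of this is part of the deployed solution:

```javascript
// Hypothetical helper: keeps logs in memory, mirrored to durable storage
class PersistentLogs {
  constructor(storage) {
    this.storage = storage
    this.logs = null // loaded lazily, since storage access is async
  }

  async load() {
    if (this.logs === null) {
      this.logs = (await this.storage.get('logs')) || []
    }
    return this.logs
  }

  async append(entry) {
    const logs = await this.load()
    logs.push(entry)
    // Rewrites the whole array each time: fine for a sketch, wasteful for long logs
    await this.storage.put('logs', logs)
  }
}

// In-memory stand-in for `state.storage` so the sketch runs outside Workers.
// Values are copied on write, as a real store would serialise them.
const makeFakeStorage = () => {
  const data = new Map()
  return {
    get: async (key) => data.get(key),
    put: async (key, value) => {
      data.set(key, JSON.parse(JSON.stringify(value)))
    },
  }
}
```

A fresh `PersistentLogs` built on the same storage would pick up where the evicted one left off. Stored values have a size limit, so a real version would likely need to spread entries across several keys rather than keep one ever-growing array.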

The fetch method is where the bulk of the logic goes:

--CODE:language-js--
  // Entry point for all new clients/build servers
  async fetch(request) {
    // See note on handleErrors below
    return await this.handleErrors(request, async () => {
      const { pathname } = new URL(request.url)
      const { match, client, server } = DurableBuildLog.toRouteParams(pathname)
      if (!match) {
        return new Response('Not found', { status: 404 })
      }

      // The only requests we expect to get this far are WebSocket connections
      if (request.headers.get('Upgrade') !== 'websocket') {
        return new Response('expected websocket', { status: 400 })
      }

      // Note: const [client_ws, our_ws] = new WebSocketPair() explodes
      // as `WebSocketPair` is a native (non-JS) runtime object and hasn't been made iterable.
      // This is a bug and has been reported.
      const pair = new WebSocketPair()
      const [client_ws, our_ws] = [pair[0], pair[1]]

      // Accept our end of the WebSocket. This tells the runtime that we'll be terminating the
      // WebSocket in JavaScript, not sending it elsewhere.
      our_ws.accept()

      // If this is a server connection, kick any previous connections,
      // clear the logs, and wire event handlers up
      if (server) {
        if (this.server) this.server.close(1000, 'You got kicked')
        this.resetLogs()
        this.server = our_ws
        our_ws.addEventListener('message', this.serverMessage)
        our_ws.addEventListener('close', this.closeServer)
        this.broadcast({ meta: 'server connected' })
      }

      // For client connections, send the logs so far, and add them to the client list
      if (client) {
        const initial_payload = [
          { meta: 'connected ok', VERSION },
          ...this.logs,
        ]
        if (this.safeSend(...initial_payload)(our_ws)) {
          this.clients.push(our_ws)
        }
      }

      return new Response(null, { status: 101, webSocket: client_ws })
    })
  }

Note: This took a long time to get right, mainly because debugging is a pain (console logs here don't reach the `wrangler tail` logs on the Worker, despite being in the same file). One absolute little champion nugget of code is the handleErrors helper from the Chat demo. This catches exceptions and sends valid WebSocket packets with the error message, rather than letting the request fail. I wouldn't have been able to get this going without it, but again, this is extremely early-access code, I'm sure the DX story will vastly improve.
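For reference, the idea behind that helper looks roughly like this. It's a sketch modelled on the Chat demo rather than a verbatim copy, written as a standalone function instead of a method, and its WebSocket branch relies on the Workers-only `WebSocketPair` global:

```javascript
// Sketch of an error-catching wrapper, modelled on the Chat demo's helper
async function handleErrors(request, func) {
  try {
    return await func()
  } catch (err) {
    if (request.headers.get('Upgrade') === 'websocket') {
      // A failed upgrade gives the client very little to go on, so complete
      // the upgrade anyway and report the error as a WebSocket message
      const pair = new WebSocketPair()
      pair[1].accept()
      pair[1].send(JSON.stringify({ error: err.stack }))
      pair[1].close(1011, 'Uncaught exception during session setup')
      return new Response(null, { status: 101, webSocket: pair[0] })
    }
    // Plain HTTP request: a 500 carrying the stack trace is enough
    return new Response(err.stack, { status: 500 })
  }
}
```

The upshot is that a bug inside the Object shows up as a readable `{ error: ... }` message in the browser's WebSocket inspector instead of an opaque failed connection.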

The `fetch` method, assuming the request is valid, creates a `WebSocketPair` (which is a non-standard extension to the Fetch API of Cloudflare's invention, btw), keeps hold of one end in memory and sends the other back as part of the Response. Once that connection is established, you can poke any data you want down it. Easy as.

It makes the code for handling new logs really simple:

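--CODE:language-js--
  // When the server sends us an event, broadcast it to the currently-connected clients
  // and save it (in memory) for any clients that connect later.
  // Defined as an arrow-function property (like resetLogs) so that `this` still refers
  // to the Object instance when it's passed straight to addEventListener in fetch()
  serverMessage = (event) => {
    let msg
    try {
      msg = JSON.parse(event.data)
    } catch (e) {
      // Error objects serialise to `{}`, so pass on the message text instead
      msg = { error: e.message }
    }
    this.broadcast(msg)
    this.logs.push(msg)
  }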

As for broadcasting to the clients, the only slight complexity is handling the fact that disconnections in WebSockets aren't always events, sometimes a `.send` call will just fail:
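A minimal sketch of one way to write this, inferred from how `safeSend` and `broadcast` are called in `fetch` above. The method names match those calls, but the bodies, the `Broadcaster` wrapper class and the fake sockets are assumptions rather than the production code:

```javascript
// Hypothetical sketch: the bodies here are assumptions, not the deployed code
class Broadcaster {
  constructor() {
    this.clients = []
  }

  // Returns a function that sends every message to one socket,
  // reporting false (instead of throwing) if the socket has gone away
  safeSend = (...messages) => (ws) => {
    try {
      messages.forEach((msg) => ws.send(JSON.stringify(msg)))
      return true
    } catch (e) {
      return false
    }
  }

  // Send to every client, keeping only the ones that accepted the message
  broadcast = (msg) => {
    this.clients = this.clients.filter(this.safeSend(msg))
  }
}

// Fake sockets standing in for real WebSocket connections
const healthy = { sent: [], send(data) { this.sent.push(data) } }
const dropped = { send() { throw new Error('socket closed') } }

const room = new Broadcaster()
room.clients.push(healthy, dropped)
room.broadcast({ meta: 'server connected' })

console.log(room.clients.length) // 1: the dropped socket was filtered out
console.log(healthy.sent) // [ '{"meta":"server connected"}' ]
```

Because `safeSend` returns a boolean per socket, the same call doubles as the "is this client still alive?" check, so there's no separate bookkeeping for disconnects that never fired a `close` event.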

An aside: code like `this.clients = this.clients.filter(...)` always makes me nervous as it's not threadsafe. But since JS is single-threaded and your instance is guaranteed to be running in exactly one place worldwide, it's fine. And embracing that simplifies your code. A lot.

A couple of small utility functions have been omitted, including a VERSION constant that we find-replace when we deploy, again just so I know 100% that the code has been updated on DO. But other than that, that's the whole Durable Object!

The Builder

The changes on the Build Server side of things are extremely minimal. We use Websocket-Node to connect to `wss://live-build-logs.lincbot.com/server/${sitename}/${tree_id}/ws` with a shared secret, then push each line of logging as it's created, along with some metadata like `messageType` of `start_cmd` or `append_log`, etc, knowing that it will be broadcast to any clients who are watching. Of course, if no clients are connected, the Object instance will come into being, accumulate the logs in-memory, then shut down, happy in the knowledge that it was ready, even if it never got asked to do anything.

But the real win? We didn't need to change our existing logging: we still flush to Dynamo and trigger GraphQL every 10 seconds. So if Durable Objects is ever unavailable, or our implementation ever breaks, our existing fallback will continue to work.

The Client

The final piece is to reconcile the events as they occur over the build log WS with the ones coming from the GraphQL subscription to display in the UI. That ends up involving a little bit of complexity but it's entirely related to the structure of our app and the choices we've made for partitioning data in the past, so I won't go into the details here.

The one thing I will mention is that there are two alternatives we explored: the first is using events from the DO to update the local GraphQL cache, and the second is leaving the GraphQL cache as-is and merging the data within the React component tree. We chose the latter because we're considering the DO "live" build logs very much a progressive enhancement: until we've run them in production for a while, we don't want any code that touches our GraphQL "source of truth". That, and I liked the look of the websocket React hook library we're using, which lent itself to merging the data structures in the view layer.

Demo

Putting it all together, though, it works great:

[Video: demo of the live build logs]

We've had it behind a feature flag for a week or so, and enabled by default for all builds for the last few days, and it's amazing what an impact it has on UX when you're debugging a build problem. It's been a huge success for us, largely because of how little impact it had on the rest of our infrastructure.

Building this without Durable Objects

This feature is a prime example of the way UX suffers depending on your infrastructure choices—with our existing system, it was simply not worth the effort to build out a whole separate system for streaming Build Logs. So I thought it'd be worth mentioning briefly what we were expecting to need to do to set this up. Or, to put it another way: what's hard with AWS might be easy with Durable Objects.

At first blush, the architecture looks similar, with an API Gateway in between the Client and Builder. But in reality, since API Gateway has no state in and of itself (it simply terminates the WS connections and gives us an API to send data down it), our solution becomes a lot more complex.

For starters, figuring out the connection from the client to the builder actually requires several steps, a dedicated DB table, and two Lambdas. This is to enable us to store a list of connectionIds for a given tree_id, for both builder and client, so that when a client connects or a builder starts, they can find each other.

But the bigger problem is that, since our build server is the only stateful part in our whole system, we'd have had to take the management of a list of current clients and broadcasting events and place it there. Which is a very user-centric concern in one of the deepest layers of our infrastructure.

We could try introducing a complete server layer just for this, but that takes on an architectural cost (deployment, upkeep, scaling) that feels even harder to justify. And so we're left with a straight tradeoff: our app's UX and our system's maintainability are now in direct opposition, and we're forced to compromise (by flushing the logs every few seconds).

Final thoughts

I've come away from this experience with a fairly firm belief that the "Stateful Worker" model is genuinely inspired: I can now start to see how it could model virtually every data problem we face, and replace almost every piece of infrastructure we currently use. That's potentially revolutionary, but only time and experience will tell whether it genuinely outperforms existing alternatives. But for our first foray, this was an unbridled success.

Cloudflare's Durable Objects, as the only current implementation of this concept that I'm aware of, absolutely looks like it could be the goods, but there are definitely some rough edges in understanding them, getting started, debugging etc. Those will get improved, and the limitations on the current beta will be relaxed. I'll be watching keenly for others' experiences and other implementations in production.

My feeling is that, at the very least, Durable Objects will be the best way to achieve a decent subset of current server-side data and processing tasks, like streaming Build Logs, or chat apps, etc. But potentially, if they hold up to closer scrutiny and heavier workloads, they might turn out to replace far more than that.