The Unscalable, Deadlock-prone, Thread Pool

By Paul Khuong

Epistemic Status: I’ve seen thread pools fail this way multiple times, am confident the pool-per-state approach is an improvement, and have confirmed with others they’ve also successfully used it in anger. While I’ve thought about this issue several times over ~4 years and pool-per-state seems like a good fix, I’m not convinced it’s undominated and hope to hear about better approaches.

Thread pools tend to only offer a sparse interface: pass a closure or a function and its arguments to the pool, and that function will be called, eventually.1 Functions can do anything, so this interface should offer all the expressive power one could need. Experience tells me otherwise.

The standard pool interface is so impoverished that it is nearly impossible to use correctly in complex programs, and leads us down design dead-ends. I would actually argue it’s better to work with raw threads than to even have generic amorphous thread pools: the former force us to stop and think about resource requirements (and lets the OS’s real scheduler help us along), instead of making us pretend we only care about CPU usage. I claim thread pools aren’t scalable because, with the exception of CPU time, they actively hinder the development of programs that achieve high resource utilisation.

This post comes in two parts. First, the story of a simple program that’s parallelised with a thread pool, then hits a wall as a wider set of resources becomes scarce. Second, a solution I like for that kind of program: an explicit state machine, where each state gets a dedicated queue that is aware of the state’s resource requirements.

Stages of parallelisation

We start with a simple program that processes independent work units, a serial loop that pulls in work (e.g., files in a directory), or wait for requests on a socket, one work unit at a time.

At some point, there’s enough work to think about parallelisation, and we choose threads.2 To keep things simple, we simply spawn a thread per work unit. Load increases further, and we observe that we spend more time switching between threads or contending on shared data than doing actual work. We could use a semaphore to limit the number of work units we process concurrently, but we might as well just push work units to a thread pool and recycle threads instead of wasting resources on a thread-per-request model. We can even start thinking about queueing disciplines, admission control, backpressure, etc. Experienced developers will often jump directly to this stage after the serial loop.

The 80s saw a lot of research on generalising this “flat” parallelism model to nested parallelism, where work units can spawn additional requests and wait for the results (e.g., to recursively explore sub-branches of a search tree). Nested parallelism seems like a good fit for contemporary network services: we often respond to a request by sending simpler requests downstream, before merging and munging the responses and sending the result back to the original requestor. That may be why futures and promises are so popular these days.

I believe that, for most programs, the futures model is an excellent answer to the wrong question. The moment we perform I/O (be it network, disk, or even with hardware accelerators) in order to generate a result, running at scale will have to mean controlling more resources than just CPU, and both the futures and the generic thread pool models fall short.

The issue is that futures only work well when a waiter can help along the value it needs, with task stealing, while thread pools implement a trivial scheduler (dedicate a thread to a function until that function returns) that must be oblivious to resource requirements, since it handles opaque functions.

Once we have futures that might be blocked on I/O, we can’t guarantee a waiter will achieve anything by lending CPU time to its children. We could help sibling tasks, but that way stack overflows lie.

The deficiency of flat generic thread pools is more subtle. Obviously, one doesn’t want to take a tight thread pool, with one thread per core, and waste it on synchronous I/O. We’ll simply kick off I/O asynchronously, and re-enqueue the continuation on the pool upon completion!

Instead of doing

A, I/O, B

in one function, we’ll split the work in two functions and a callback

A, initiate asynchronous I/O
On I/O completion: enqueue B in thread pool
B

The problem here is that it’s easy to create too many asynchronous requests, and run out of memory, DOS the target, or delay the rest of the computation for too long. As soon as the I/O requests has been initiated in A, the function returns to the thread pool, which will just execute more instances of A and initiate even more I/O.

At first, when the program doesn’t heavily utilise any resource in particular, there’s an easy solution: limit the total number of in-flight work units with a semaphore. Note that I wrote work unit, not function calls. We want to track logical requests that we started processing, but for which there is still work to do (e.g., the response hasn’t been sent back yet).

I’ve seen two ways to cap in-flight work units. One’s buggy, the other doesn’t generalise.

The buggy implementation acquires a semaphore in the first stage of request handling (A) and releases it in the last stage (B). The bug is that, by the time we’re executing A, we’re already using up a slot in the thread pool, so we might be preventing Bs from executing. We have a lock ordering problem: A acquires a thread pool slot before acquiring the in-flight semaphore, but B needs to acquire a slot before releasing the same semaphore. If you’ve seen code that deadlocks when the thread pool is too small, this was probably part of the problem.

The correct implementation acquires the semaphore before enqueueing a new work unit, before shipping a call to A to the thread pool (and releases it at the end of processing, in B). This only works because we can assume that the first thing A does is to acquire the semaphore. As our code becomes more efficient, we’ll want to more finely track the utilisation of multiple resources, and pre-acquisition won’t suffice. For example, we might want to limit network requests going to individual hosts, independently from disk reads or writes, or from database transactions.

Resource-aware thread pools

The core issue with thread pools is that the only thing they can do is run opaque functions in a dedicated thread, so the only way to reserve resources is to already be running in a dedicated thread. However, the one resource that every function needs is a thread on which to run, thus any correct lock order must acquire the thread last.

We care about reserving resources because, as our code becomes more efficient and scales up, it will start saturating resources that used to be virtually infinite. Unfortunately, classical thread pools can only control CPU usage, and actively hinder correct resource throttling. If we can’t guarantee we won’t overwhelm the supply of a given resource (e.g., read IOPS), we must accept wasteful overprovisioning.

Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread.

My favourite approach assigns one global thread pool (queue) to each function or processing step. The arguments to the functions will change, but the code is always the same, so the resource requirements are also well understood. This does mean that we incur complexity to decide how many threads or cores each pool is allowed to use. However, I find that the resulting programs are better understandable at a high level: it’s much easier to write code that traverses and describes the work waiting at different stages when each stage has a dedicated thread pool queue. They’re also easier to model as queueing systems, which helps answer “what if?” questions without actually implementing the hypothesis.

In increasing order of annoyingness, I’d divide resources to acquire in four classes.

  1. Resources that may be seamlessly3 shared or timesliced, like CPU.
  2. Resources that are acquired for the duration of a single function call or processing step, like DB connections.
  3. Resources that are acquired in one function call, then released in another thread pool invocation, like DB transactions, or asynchronous I/O semaphores.
  4. Resources that may only be released after temporarily using more of it, or by cancelling work: memory.

We don’t really have to think about the first class of resources, at least when it comes to correctness. However, repeatedly running the same code on a given core tends to improve performance, compared to running all sorts of code on all cores.

The second class of resources may be acquired once our code is running in a thread pool, so one could pretend it doesn’t exist. However, it is more efficient to batch acquisition, and execute a bunch of calls that all need a given resource (e.g., a DB connection from a connection pool) before releasing it, instead of repetitively acquiring and releasing the same resource in back-to-back function calls, or blocking multiple workers on the same bottleneck.4 More importantly, the property of always being acquired and released in the same function invocation, is a global one: as soon as even one piece of code acquires a given resource and releases in another thread pool call (e.g., acquires a DB connection, initiates an asynchronous network call, writes the result of the call to the DB, and releases the connection), we must always treat that resource as being in the third, more annoying, class. Having explicit stages with fixed resource requirements helps us confirm resources are classified correctly.

The third class of resources must be acquired in a way that preserves forward progress in the rest of the system. In particular, we must never have all workers waiting for resources of this third class. In most cases, it suffices to make sure there at least as many workers as there are queues or stages, and to only let each stage run the initial resource acquisition code in one worker at a time. However, it can pay off to be smart when different queued items require different resources, instead of always trying to satisfy resource requirements in FIFO order.

The fourth class of resources is essentially heap memory. Memory is special because the only way to release it is often to complete the computation. However, moving the computation forward will use even more heap. In general, my only solution is to impose a hard cap on the total number of in-flight work units, and to make sure it’s easy to tweak that limit at runtime, in disaster scenarios. If we still run close to the memory capacity with that limit, the code can either crash (and perhaps restart with a lower in-flight cap), or try to cancel work that’s already in progress. Neither option is very appealing.

There are some easier cases. For example, I find that temporary bumps in heap usage can be caused by parsing large responses from idempotent (GET) requests. It would be nice if networking subsystems tracked memory usage to dynamically throttle requests, or even cancel and retry idempotent ones.

Once we’ve done the work of explicitly writing out the processing steps in our program as well as their individual resource requirements, it makes sense to let that topology drive the structure of the code.

Over time, we’ll gain more confidence in that topology and bake it in our program to improve performance. For example, rather than limiting the number of in-flight requests with a semaphore, we can have a fixed-size allocation pool of request objects. We can also selectively use bounded ring buffers once we know we wish to impose a limit on queue size. Similarly, when a sequence (or subgraph) of processing steps is fully synchronous or retires in order, we can control both the queue size and the number of in-flight work units with a disruptor, which should also improve locality and throughput under load. These transformations are easy to apply once we know what the movement of data and resource looks like. However, they also ossify the structure of the program, so I only think about such improvements if they provide a system property I know I need (e.g., a limit on the number of in-flight requests), or once the code is functional and we have load-testing data.

Complex programs are often best understood as state machines. These state machines can be implicit, or explicit. I prefer the latter. I claim that it’s also preferable to have one thread pool5 per explicit state than to dump all sorts of state transition logic in a shared pool. If writing functions that process flat tables is data-oriented programming, I suppose I’m arguing for data-oriented state machines.