This is a follow-up post to Lock-freedom without garbage collection from 2015, which introduced Crossbeam, a Rust library that implements efficient lock-free data structures without relying on a tracing garbage collector.

Crossbeam has gone through a long list of improvements since then, and it’s time to showcase where it’s at today. We’re aiming to provide a rich set of tools for concurrency akin to java.util.concurrent and outdo Go channels in features and performance.

To see what is currently offered by the library, jump to the documentation for an overview.

The following tour through the history of Crossbeam contains something like 150 links. I encourage interested readers to click on them - you may find hidden treasures of useful resources buried in here! 💎

Contents

What is Crossbeam?

Before we get started, it’s helpful to clarify a little bit what exactly Crossbeam is and how it relates to other libraries for concurrency and parallelism.

A common question I get is how Crossbeam differs from Rayon and Tokio. My answer is:

  • Rayon splits your data into distinct pieces, gives each piece to a thread to do some kind of computation on it, and finally aggregates results. Its goal is to distribute CPU-intensive tasks onto a thread pool.

  • Tokio runs tasks which sometimes need to be paused in order to wait for asynchronous events. Handling tons of such tasks is no problem. Its goal is to distribute IO-intensive tasks onto a thread pool.

  • Crossbeam is all about low-level concurrency: atomics, concurrent data structures, synchronization primitives. Same idea as the std::sync module, but bigger. Its goal is to provide tools on top of which libraries like Rayon and Tokio can be built.

2015: Early days

The same year Rust 1.0 was released, aturon published the blog post titled Lock-freedom without garbage collection, which demonstrates that one doesn’t need a language with a tracing garbage collector to write fast lock-free programs. The secret sauce is a technique called epoch-based garbage collection, which is much different from traditional garbage collectors and is easily implemented as a library.

In those early days, Crossbeam had:

Then, in 2017, aturon declared he didn’t have time to continue working on the project and asked the Rust community to take it over. Many people showed interest in contributing and that is how I got involved, too.

2017: New beginnings

At that time, we discovered some pieces of Crossbeam like AtomicOption, epoch-based GC, and scoped threads had soundness holes. All of them were easy to fix but difficult to spot. Low-level concurrency is notoriously tricky and scary, so we first made sure those bugs are ironed out before growing the library.

Organizational changes were put in place. We split the library into multiple subcrates:

The main crossbeam crate re-exported those subcrates. We didn’t want to split crates any further so MPMC queues and the lock-free stack were kept in the main crate for the time being.

Next, we created the RFCs repository and begun discussing the overall direction of the project and new features we should implement. A wiki page with learning resources was set up.

The first RFC we accepted was a roadmap for the next 3-6 months. In hindsight, that was overly optimistic and should’ve been a plan for the year…

Epochs: Rewritten from scratch

The epoch-based GC went through a complete rewrite. The atomics API was first revamped - soundness holes got fixed, pointer tagging was added, more efficient atomic operations were introduced.

Next, we redesigned the core epoch mechanism. Pinning got more than two times faster, garbage collection became incremental in order to reduce pauses, manual garbage flushing was added, and we made it possible to run arbitrary destructors before freeing memory.

Correctness of memory orderings in the garbage collector was proven in an incredible RFC written by jeehoonkang. The proof got us more confidence in the implementation and was a big step forward in the maturity of the project.

He followed up with another RFC that made it possible to create independent GC instances and use crossbeam-epoch in #[no_std] environments. Memory allocator elfmalloc was then able to use it as a dependency.

Another milestone was when we realized guards can be safely used for pinning, so the ugly pin scopes got removed and pinning became more ergonomic.

Channels: Improving on mpsc and Go

In the spring of 2017, I got interested in channels because they’ve become the bread-and-butter synchronization tool in Go, while our channel implementations were lacking in many regards. My observations were:

  • std::sync::mpsc has a number of flaws. Sender is not Sync and Receiver cannot even be cloned. Bounded channels are just deques inside mutexes and therefore slow. The select! macro isn’t moving towards stabilization due to insurmountable obstacles in its design. And there are known long-standing bugs.

  • Like our bounded mpsc, Go channels are protected by big mutexes. A promising design proposal for lock-free channels was published in 2014 and there’s even a pull request for it, but it has been open for years with little progress.

  • The only way to use channels with select on stable Rust was using BurntSushi’s chan crate. It was great but never designed for high performance and was even slower than Go channels, which is unfortunate.

My goal was to build channels that have cloneable and shareable senders and receivers, are faster than both Go channels and mpsc, have select!, support dynamic selection, and fix ergonomic warts in mpsc. I had serious doubts that such a thing is even possible, but it was worth giving a shot to see how far we can go.

After seven months of research and experimentation, I published version 0.1 of crossbeam-channel, which delivered on most of the envisioned goals. A blog post, an RFC, and benchmarks accompanied the release. But it wasn’t a wild success as much as I hoped. For example, some of the responses were:

I still can’t get over the select design. I know it’s performant, and every other part of the library is great, but the conditions seems way too easy to mess up

I also don’t like the global state with the threadlocal variable, why can’t it be a struct with methods?

Not only it is brittle, but I must admit that even after reading the description of the mechanism several times, I am not quite clear on exactly how it works.

Back to the drawing board. After seven more months crafting a complicated select! macro, version 0.2 was released, which also simplified the API surface, basing decisions on the received feedback and conclusions of a lengthy discussion in the issue tracker.

The release was a significant improvement, but still not perfect. It was not possible to do dynamic selection and users complained about dropped receivers not closing channels, which sparked another interesting discussion, where we dug deeper into the design space of channels and reverted some decisions.

It took five months to figure out what to do next and rewrite the selection mechanism from scratch. Finally, all pieces somehow fell into the right places and, with version 0.3, my dream came true - Crossbeam channels now offer everything one could ask for, and I haven’t received any major complaints or feature requests so far. They are fast, MPMC, and have a powerful select. This year might be a good time to publish version 1.0 since I don’t expect API changes anymore.

Servo switched from mpsc to crossbeam-channel, which removed a bunch of unsafe code and the dependence on the unstable mpsc_select feature. That was an important milestone because it proved crossbeam-channel is mature and reliable enough for such a big project.

I’m also intending to write an RFC with a proposal to give mpsc a desperately needed refresh: improve performance, enrich the interface, and fix bugs. There’s still a known bug in it, which is even documented in the standard library! This is not a good look and we should do something about it as soon as possible. One option is to take a subset of crossbeam-channel to replace the whole guts of mpsc, just like we’re replacing mutexes and hash tables with parking_lot and hashbrown. But more on that another time…

Fortunately, crossbeam-channel has had very few bugs during its life, and none have ever been reported by users, thanks to the extensive suite of 400 tests, some of which are borrowed from mpsc and Go’s channels.

This is just a summary of how these channels came to be. Given how much interesting research went into producing this crate, it deserves a blog post of its own so I hope to write more sometime!

Scoped threads: Makeover

Scoped threads are such a simple convenience, yet may be the feature Crossbeam is best known for. In the past year, we fixed soundness issues and bugs, removed some cruft, polished up thread joining, and added support for nested spawns. It’s really just a bunch of small incremental changes.

Some notable breaking changes were:

  • In scope.spawn(f), closure f now accepts a single argument of type &Scope, which can be used for spawning nested scoped threads. Rayon’s scopes use a similar pattern.

  • Function crossbeam::scope() now returns a Result that contains an error if any automatically joined child thread has panicked. Before the change, such panics would get silently ignored.

AtomicCell: Like Cell, but atomic!

The set of atomics provided by the std::sync::atomic module is not particularly easy to use. In many ways, it feels very low-level:

  • There’s only a restricted set of primitive atomic types like AtomicBool and AtomicUsize. But what if we want arbitrary types to be atomic?

  • AtomicPtr<T> can load and store raw pointers only, forcing us to use unsafe. It would be nice to have atomic Box<T> and Arc<T> instead.

  • Every atomic operation needs a memory ordering. Reckless use of Relaxed resulting in UB is worryingly common. Even experienced programmers sometimes fall into this trap.

Enter AtomicCell<T>, which works just like Cell<T>, except it can also be shared among threads. Arbitrary types may be used with AtomicCell<T>, although some operations only work with Copy types. Sane defaults are used for memory orderings so one doesn’t have to worry about them.

Of course, there must be some magic enabling AtomicCell<T> to work with arbitrary types:

  • When type T can be transmuted into AtomicUsize, we internally pretend that T is actually AtomicUsize and perform atomic operations that way.

  • When T cannot be transmuted into AtomicUsize, a hidden global array of spinlocks is used. To perform an atomic operation, we take the pointer address of the AtomicCell<T> and pick one of the spinlocks based on it. Then we lock it, pretend the AtomicCell<T> is just a Cell<T>, and perform the desired operation.

Most implementations of std::atomic<T> in C++ use the same trick, so this is nothing new. However, there is one thing that sets AtomicCell<T> apart: optimistic reads.

Our spinlocks are implemented as sequential locks, which means every lock has a stamp, an atomic integer that gets incremented on every write operation. Read operations load the stamp, optimistically read data, and then check whether the stamp has changed. If not, we’re done, and if it has, we need to retry.

This way read operations don’t contend with each other, meaning they are very fast. Neat!

Synchronization: New primitives

As of recently, Crossbeam also features synchronization primitives. They are tools in the same category as mutexes and conditional variables, except a little bit more exotic.

Currently, we have the following primitives:

  • Parker, the same mechanism behind function thread::park(), but extracted for implementing custom thread notification. Tokio uses it for parking and unparking threads in its thread pool. Rayon might adopt it in the future, too.

  • ShardedLock<T> is like RwLock<T>, except it has an array of small RwLocks called shards. Each thread performing a read operation locks a distinct shard, thus reducing contention and making reads faster. However, writes are slower because they need to lock all shards.

  • WaitGroup, which allows threads to synchronize the beginning or end of some computation. It is inspired by Go’s WaitGroup from its standard library.

Deque: Juggling tasks in schedulers

Almost every work-stealing task scheduler has the same setup:

  • There is a shared global queue of tasks, usually called injector and is the entry point for new tasks. For example, if you call rayon::spawn(task) outside Rayon’s thread pool, the task will be pushed into the global queue. Any worker thread is then allowed to take tasks from it.

  • Each thread in the thread pool has its own worker queue. Only the thread that owns it is allowed to push and pop tasks, but other threads may steal tasks, which is a particular operation optimized for task scheduling.

  • The job of each worker thread is to wait for tasks to appear and run them. To find the next task to run, a thread will first look into its worker queue. If empty, it looks into the global queue or attempts to steal tasks from other threads.

This setup is used in Rayon, Tokio, Go, Thread Building Blocks, you name it. The advantage of work stealing is automatic work balancing among all threads even in presence of skewed workloads.

Crossbeam’s deque originally started with a basic Chase-Lev for work stealing, but it got beefed up since then:

  • We added support for FIFO worker queues in addition to the classic LIFO queue. LIFO order makes sense for Rayon because it prioritizes tasks for cache utilization, while FIFO makes more sense for Tokio because it prioritizes tasks for fairness.

  • Batched steal operations were added, which significantly reduce the total cost of queue operations. They got us nice speedups in Tokio.

  • A special Injector queue was introduced, which integrates nicely with Worker queues and supports similar operations.

Every such improvement in crossbeam-deque has a ripple effect on the library ecosystem. By bumping dependency versions and leveraging new features, Tokio’s thread pool gets faster, and therefore every application using Tokio gets faster, too!

Queues: Revamped

Until very recently, there were just two unbounded MPMC queues in Crossbeam:

  • MsQueue, the classic lock-free Michael-Scott queue. It allocates on every push operation, putting high pressure on the global allocator. An interesting extra feature it has is blocking pop operation.

  • SegQueue, which is like MsQueue, except it allocates segments of nodes. Even though it’s not strictly speaking lock-free, fewer allocations and better cache locality make it quite a bit faster than MsQueue in practically every case.

Since MsQueue offers almost nothing over SegQueue, we’ve decided to remove it. And if one needs blocking pop operations, channels can be used as an alternative.

Then we added ArrayQueue, which is based on dvyukov’s bounded MPMC queue. The original implementation in C++ has two rough edges: it is not linearizable and forces the capacity to always be a power of two. We’ve smoothened both.

A month ago, I had an epiphany and realized the bounded MPMC queue can be generalized to an unbounded queue. By combining segments from SegQueue and turning the sequence field into something functionally akin to hazard pointers, we can replace epoch-based GC with a different garbage collection scheme, bringing two benefits:

  • Garbage collection is entirely eager rather than lazy. Under high concurrency, it can be measured that this new queue uses less memory than the old SegQueue.

  • Epoch-based GC incurs a certain overhead on every operation due to thread pinning and occasional garbage collection. By removing it we get performance wins.

SegQueue was then rewritten from scratch and got us much better benchmark numbers.

Both ArrayQueue and the new SegQueue implementations sprung from the effort of optimizing channels, which were the first to use those queues internally, and then we just ripped them out and moved into crossbeam-queue.

Skiplist: Just around the corner

There’s an experimental crate crossbeam-skiplist featuring maps and sets based on lock-free skiplists. They are very similar to BTreeMap and BTreeSet in functionality and interface, but can be mutated concurrently from multiple threads.

These maps and sets scale very well, much better than mutex-protected B-tree based equivalents. However, in single-threaded workloads, skiplists offer rather underwhelming performance, generally three to four times slower in comparison to B-trees.

The main inspiration here has been Java’s ConcurrentSkipListMap. I’ve always been jealous of Java having concurrent skiplists that are straightforward to use, so this is an effort to bring them to Rust as well. The fact that Rust doesn’t use the JVM to rely on a mature and well-engineered concurrent GC made implementation much more difficult for us. Surely one of my most complex pieces of code!

The crate has been in a coming soon state for a long time now. We’ve just been slacking off on writing additional tests and documentation to push it over the finish line and publish the first version. But really, it’s been completed - you can clone the repository and play with it.

Utilities: Simple and handy

There are two utilities that are difficult to put into any category. While very simple, they are indispensable in low-level concurrency.

Backoff performs exponential backoff in spin loops by executing the PAUSE or YIELD instruction and yielding the current thread to the OS scheduler. Here’s how one might use it to wait for an AtomicBool to become true:

fn spin_wait(ready: &AtomicBool) { let backoff = Backoff::new(); while !ready.load(SeqCst) { backoff.snooze(); }
}

CachePadded<T> wraps a value of type T, padding it and aligning to the length of a cache line. Useful when we want to make sure an atomic variable doesn’t get falsely shared. For example, in a MPMC queue, it’s a good idea to put the head and tail indices into their own cache lines:

struct Queue<T> { head: CachePadded<AtomicUsize>, tail: CachePadded<AtomicUsize>, buffer: Buffer<T>,
}

Where to next?

Crossbeam has come a long way already, but there is still work to do. In summary, there are two sorely missing features we need as soon as possible, and a list of less critical nice-to-haves.

In an interesting blog post comparing concurrency in Java and Rust, hniksic points out we don’t really have a satisfactory equivalent of AtomicReference. The closest one we have today is probably arc-swap, but there is still room for improvement - for example, in some cases, it is an order of magnitude slower than AtomicReference.

Another common request is a concurrent hash table. While we do have wonderful crates like evmap, they tend to be designed for niche uses and often have non-standard interfaces. I’d love to also see a generic concurrent hash table that works well enough across a broad set of applications, with few surprises, and is ideally as similar to HashMap as possible.

So these are the things I’d like to see in Crossbeam this year: AtomicReference and ConcurrentHashMap written in Rust. There is a lot of design and implementation work to do here, but we generally know how to move forward. If you’re interested, take a peek at discussions on atomic references and hash tables in the issue tracker.

Finally, there is a neverending list of lower-priority features we might want to explore, but are not essential to the project:

Thanks

Looking back at the past two years, I’m thrilled with what we’ve accomplished so far! These days Crossbeam is even getting referenced in research papers.

My work on Crossbeam was sponsored by Mozilla and patrons on Patreon, to which I owe a big thank you! Without your support, it just wouldn’t be able to happen. Sustainability of open source work is a complicated topic and, instead of going into a ramble, I’ll just say everything withoutboats wrote about it at the end of their blog post is spot on and rings true for me personally.

I’d also like to thank all contributors who write code, participate in discussions, and share their feedback. This experience made me realize firsthand just how much every single contribution means, even if it’s as simple as, say, a comment in the issue tracker.

On a final note, I feel humbled by technical and leadership knowledge of so many people involved with Rust. And Crossbeam has contributors whose grasp of wide ranges of topics is sometimes way over my head. I’ve learned a lot from you and still am!