YOU could have invented the LMAX Disruptor... if only you were limited enough!

By H. Mijail Antón Quiles

This is about a shocking realization I got while learning about the LMAX Disruptor – a data structure / architecture that got published in 2011 and made some waves across the web because it allowed LMAX to process +6M transactions per second on a single thread. In Java. Which was interesting because Java was (still?) supposed to be too slow for that kind of thing. By then I was already a couple of years out of the Java world and getting into embedded C, so I took a mental note to look into it just out of curiosity. But, as could be expected, the note was promptly forgotten. I'm now looking again into Java, so finally I managed to take a peek at the Disruptor. Shock: this is just a description of what I was implementing in C while I was still learning it! But it's not that I'm a genius; it's that this is probably the only way to do things when in a limited environment. In fact, the basic architecture is the one used typically in drivers and in Linux's NAPI network architecture, which is in place since the early 00's. I see a handful of others across the web having commented as much during these years. So... is THIS what merited such coverage, talks in conferences, etc? I used to think that a surprising part of the embedded people I've met are typically rather short-sighted and closed minded about their part of the stack; now it's looking like the higher levels are also equivalently blindsided. So it sounds like there really is a need for some condensing view, some whole-picture consideration in the industry. But, how? Who? I guess I could take this as a job opportunity :P, but it rather sounds depressing / worrying. If computing is akin to thinking... what does it say if we programmers seemingly are such poor thinkers? And I think that "specialization" would be a really poor excuse here. "Specialization is for insects" – right? Sounds to me more like oblivious churning out. Which might be well and good when making another templated website, but not as good when writing software to control, say, medical equipment. So let's try to analyze how and why the Disruptor is or isn't interesting, and the implications. A big thing in the Disruptor architecture is that it tries to work with the hardware, not against it. Which means: profile, and do more of what the hardware does fast, and less of what makes the hardware go slow [1]. It's kinda funny that LMAX go as far as calling it "mechanical sympathy"; I guess that sounds better than "pay attention to what you're doing". [2] The implicit thing in there is that there is a lot that you can do to make the hardware go slow, and maybe one doesn't even realize. But of course when you're programming a 12 MHz, 8 bit microcontroller with RAM in the KB's, one can't afford to make the hardware go slow. But... no, actually that's what a self-serving embedded programmer would say. Yes, it's true that one can't afford to make the hardware go slow, but it's also true that the only way to make it go slow is to do something boneheaded, like using an O(n^2) algorithm instead of O(n), or using big types when there's no hardware support, or not exploiting arithmetic tricks with powers-of-2 or etc etc. 12 MHz is not a lot, but no-one will steal them from you; one only has to avoid waste, and possibly learn how to cram more in less. Meanwhile, in a 2 GHz multi-core CPU with multiple cache levels, pipelining, vector units, Hyper-Threading, prefetching and speculative execution, tens of registers and tons of memory there are a lot of things that must be kept balanced to avoid throughput scattering all over the place. But all the same, the key (or one of them) is to avoid waste. The best way to optimize a task still is to get rid of it. To put things in numbers: in an AVR microcontroller I sometimes had about 2000 clock cycles for
  1. jumping into the Interrupt Service Routine,
  2. capturing inputs, storing them as the producer, signaling the consumer
  3. returning to the main loop
  4. checking for new data, consuming it and reacting to it (send bus messages, etc)
  5. start again!
1 and 3 are usually negligible; but when the timing budget is tight enough they need to be taken into account! So, how to do this? There is only one good way: a circular buffer in which the ISR drops the data and "announces its availability" to the consumer. By being careful with the data sizes and types (and paddings and declarations and compilation), you'll be done with (1-2-3) in the low hundreds of cycles. [3] An optimization is that if you have a single producer and consumer one can have a queue without locks – so you win some time (no locks to deal with), AND the lack of locking allows the ISR timer to be perfectly regular and jitter-less, because the main loop will never block the ISR. An advantage of this tight context is that one can count on the data being consumed ASAP. So there is no actual need to signal the consumer that new data is waiting, the consumer will just poll waiting for updates to the producer's index. And that's pretty much it. Low budget, but not so much to do with it at the end of the day. [4] Why do I say that this is the only good way? Let's see. I didn't have a cache, which of course is a disadvantage if you want to keep your CPU fed with fast data when the memory is slower. But in my system RAM and CPU were running at the same speed, so no need for a cache, so no need for its complexity. Also an advantage in that the timing limits I was working with were was always crystal clear. Relatedly, this was an all in-memory system – so no need to even wonder about anything else than RAM. (Funny how out of embedded, "in-memory system" sounds kinda exotic/luxurious) This was single-CPU – so no distraction/temptation with trying to be multi-core (apart from the ISR concurrency, of course), which might make you go faster but might also get you into lots of synchronization problems, both regarding correctness and speed. Indeed, that's one of the things the LMAX people learnt: queuing messages between CPUs was killing performance. And I didn't have the luxury to even consider any kind of queue but a circular buffer, since I had very limited memory; and dynamic allocation is typically forbidden in embedded development, so I had to just declare a fixed array at compile time and make do with it: which spells the extreme case of "not generating garbage". Which is not so bad, and actually, I wonder why would LMAX consider anything else. They mention a queue over a linked list, but I don't see how that could ever make sense for this use case. What about other design decisions not that dominated by the hardware? For example, what to do when the producer is too fast for the consumer? In my case, I had a nagging feeling that I should do something explicit. But, what could be done? Nothing really; this could only happen if the consumer got blocked for some reason, which could never happen. There were no pauses, no blocking calls to anything. If the consumer was working at all, it would be faster than the producer, simply because the producer was more rate-limited than the consumer. If the consumer wasn't working, then well, the system is dead anyway, pretty much by definition. So I just turned the case into an error report to whomever could be listening and a reset. Which still felt like cheating. But turns out that the Disruptor basically does just the same supposition: the producer never reaches the consumer. (If it does, it blocks, and they call it "backpressure". They consider it a feature; in my case it wasn't, but again, I had the advantage of having a very concrete goal for the system.)

They also mention a batching effect – but that's again just giving a name to something that happens naturally given the design, and something that has been used in lots of drivers I've seen and written up to now: the initial interrupt notifies you that there's work to do in the circular buffer, and one disables further interrupts and starts draining the queue by polling. Maybe the interesting aspect to this is that Linux' docs highlight this as NAPI as IRQ-vs-polling, while LMAX highlight this as batching: one focuses on the mechanics, the other on the effect (cache). And both concepts get conjoined in DPDK, which is all about polling to enable batching (among other things).

Another aspect is that my case was the simplest possible, in the sense that there was clearly and statically a single producer and a single consumer. The Disruptor seems to be (slightly?) more generic, or at least the authors gave some thought to generalization; multiple readers are there from the beginning. But it still such generalization seems rather straightforward... or rather, not needing any special insight. Looks like the first difficulties would appear for multiple writers, and still sounds relatively mild: use atomic operations and types, and memory barriers to signal others about the state of the queue. So: implementing an embedded-style queue over a circular buffer in Java makes for a fast architecture. Who would have thought?

To be fair, LMAX themselves seemed rather open about this being simply a judiciously designed and used circular buffer.

The implied thing (or did I already make it explicit enough?) is that analyzing the Disruptor like this feels like a vindication of the typical PoV of C/C++ programmers who consider Java (or anything else) unusably bloated/wasteful. It's so wasteful, that being non-wasteful gains you props! But again, that sounds uncomfortably self-serving. I guess that, yes, this kind of "revelation" can happen more easily in Java because Java programmers probably are not used to look at the hardware behavior. But, how many C/C++ programmers do just assume wrongly that they know enough about memory management? I've found a surprising quantity of them who think that manual memory management is a perfectly predictable, kinda-O(1) process. If you don't realize that allocating, freeing and even pointing to memory has a (varying) cost, then you're perfectly primed to recreate all the problems that made the Disruptor "innovative" in the first place. Or maybe even worse, because in C/C++ we don't have a JIT to optimize furiously even at runtime and a garbage collector to defragment memory – so I would fully expect a naïve queue in Java to work better in the long-ish term than a naïve queue in C/C++. And Java programmers have been castigated enough about performance that at least they probably know what kind of problems to look for, while smug C/C++ programmers would probably assume memory can't be a problem – or one that can't be dealt with, anyway. [5] And in this respect, it's interesting to see the various attempts to reimplement the Disruptor in C++. There's a discussion in Google Groups where the original author talks about it and discusses the possible gains and related platform-specific pain – because by 2011, C++ still didn't even have a memory model!

Particularly I saw some implementation which seems to reach 2x performance in C++, though limited to 1 producer. Heh – isn't that cheating? As already mentioned that's the particularly easy case; and as the Disruptor author said in the mentioned thread, to make things faster for a given application he would hardwire a number of things. Anyway, once you get into the path of such simplifications, one has to wonder whether everything else was taken into account as well. 

My queue/circular buffer library was header-only, and later I generalized it to be "parametrizable" by #defining some parameters before #including the libraries. Later I saw that this is a pattern that others had also used to get similar results; which was both disappointing and relieving, because it reduced the "this must be wrong" feeling of doing this kind of unholy, unexplainable, overcomplicated thing. God, what a mess it was. And that's what pushed me to start wondering what was wrong with C and what makes a language better than others...

Also, it's ... interesting... to see how Martin Fowler examines the Disruptor. He seems to be in awe, as in circling it from a respectful distance while poking at it softly with a stick. Also, as someone mentions in Hacker News, he seems to be rather out of his depth regarding the whole memory discussion. Food for thought.

An example sentence: "So what indicates you shouldn't go down this path? This is always a tricky questions for little-known techniques like this, since the profession needs more time to explore its boundaries." Huh... sure. The summary regarding the Disruptor itself is that there seems to be (or there was?) a surprising lot of hype here which could be substituted by "profile and rearchitect". The summary regarding the experience of an embedded developer who programs out of the embedded world is that looks like we could be much more aggressive about how we apply our low-level knowledge to high-level architectures.

I guess it's also interesting that here profiling was used to rip through layers of abstraction, to be able to build something mapeable to the desirable behaviors of the hardware. "Layering considered harmful" – RFC3439

The interesting question, for myself at least, is: would I have created/decided on/stumbled upon this architecture if I was on a platform as rich as Java? And it's a bit of an anticlimax that the question doesn't really make much sense, in that it would have never been posed. The times I programmed in Java, the requisites were about GUI responsiveness or general throughput; no one thought about measuring time budgets, much less in microseconds. Not necessarily because it couldn't be done; rather it just wasn't in the vocabulary. It's a kind of rule of thumb for what a system is good for, that ends up turning into an Overton window. Which segues into cargo cults and "nobody ever got fired for buying IBM". And nobody ever got fired for programming a microcontroller in C. And nobody ever got fired for being slow and bloated in Java. The converse is... write something fast in Java and get published! Profile to blast and optimize across layers, and dumbfound architects! Interestingly, the main developer of the Disruptor, Martin Thompson, mentions in Google Groups that he spent the 90s writing C++, and he seems to know what he's talking about when discussing C++ implementation challenges. Makes me wonder how much that experience helped him design the Disruptor in the first place.

[1] Somewhere the LMAX people mention that the hardware's design and behavior is actually at odds with the common theory of how to make parallel, concurrent processing work. That's actually an interesting insight, and a worrying one; lately I also read some commentary on how systems designed in the 60/70s (Burroughs?) seemed much more amenable to parallel processing than current ones, and how simplistic processor architectures in the 80s killed them (alas, I forget where that was). Sounds to me like current design has tried so hard and so long to keep Moore's Law going that anything that doesn't reach for plain serial speed ends up seemingly disproportionately penalized.

[2] Which is also how I summarize DPDK and Netmap. Mhm, maybe I'm being too radical and I see this too clearly because I've had exposure to many versions of the same idea?

[3] I am being purposefully weasely with these numbers – this was about 6 years ago and don't trust my memory. Fortunately I blogged about part of it in the moment: one surprising source of delays was the way gcc was calculating the indexes into the array backing the circular buffer. At the moment I identified that the array access stride couldn even double the timings; now I'd additionally guess that playing with signed/unsigned indexes might have helped, by allowing the compiler to relax the arithmetic. But probably would be even better to just map the array to a well defined memory range and just calculate my own indexes with a bitmask to force a "mod maxindex" calculation.

Which sounds to me like conceding that compiler's optimizations only can go so far without explicitly involving the language and the programmer – as D.J. Bernstein argued. (Dropping to assembler for something as simple as this is accepting that C is beyond broken)

[4] OK, let's be fair here too: put like that it does sound really easy, and I'm kinda surprised that it can be so summarized. But reaching the point where this was all clear and established and reliable and repeatable took weeks. Hindsight? Experience? Mhm.

[5] Of course, the converse point is that, since a naïve C/C++ queue would probably turn inviable and die sooner than later, only the non-naïve/fixed ones will stay. While a less-than-perfect Java queue might be kept alive by the runtime; instead of dying, it will just underperform.