- jumping into the Interrupt Service Routine,
- capturing inputs, storing them as the producer, signaling the consumer
- returning to the main loop
- checking for new data, consuming it and reacting to it (sending bus messages, etc.)
- start again!
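That cycle maps onto a classic single-producer/single-consumer circular buffer. A minimal sketch (the names, sizes, and the choice of C11 atomics are my own illustration, not anyone's actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimal SPSC ring; BUF_SIZE must be a power of two. */
#define BUF_SIZE 16u

static uint32_t buf[BUF_SIZE];
static atomic_uint head; /* written only by the producer (the ISR) */
static atomic_uint tail; /* written only by the consumer (the main loop) */

/* Called from the ISR: store the captured input, signal the consumer. */
static bool produce(uint32_t sample) {
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == BUF_SIZE) return false;     /* full: drop or count overruns */
    buf[h & (BUF_SIZE - 1)] = sample;
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}

/* Called from the main loop: check for new data and consume it. */
static bool consume(uint32_t *out) {
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t == h) return false;                /* empty: nothing to do yet */
    *out = buf[t & (BUF_SIZE - 1)];
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}
```

On a single-core microcontroller these atomics mostly compile down to plain loads and stores; the acquire/release ordering is what matters once producer and consumer really run concurrently.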
They also mention a batching effect – but that's again just giving a name to something that happens naturally given the design, and something that has been used in lots of drivers I've seen and written up to now: the initial interrupt notifies you that there's work to do in the circular buffer, and then one disables further interrupts and starts draining the queue by polling. Maybe the interesting aspect is that Linux's docs present this (as NAPI) in terms of IRQ-vs-polling, while LMAX present it as batching: one focuses on the mechanics, the other on the effect (cache friendliness). And both concepts get conjoined in DPDK, which is all about polling to enable batching (among other things).

Another aspect is that my case was the simplest possible, in the sense that there was clearly and statically a single producer and a single consumer. The Disruptor seems to be (slightly?) more generic, or at least the authors gave some thought to generalization; multiple readers are there from the beginning. But still, such generalization seems rather straightforward... or rather, not needing any special insight. It looks like the first difficulties would appear with multiple writers, and even that sounds relatively mild: use atomic operations and types, plus memory barriers to signal others about the state of the queue. So: implementing an embedded-style queue over a circular buffer in Java makes for a fast architecture. Who would have thought?
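The batching mechanics described above (one interrupt notifies, then further interrupts are masked and the backlog is drained by polling) can be sketched like this. Everything here is illustrative – the IRQ masking is simulated with a flag, and on real hardware the ring would keep filling via a FIFO or DMA:

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u
static uint32_t ring[RING_SIZE];
static unsigned head, tail;     /* head filled by "hardware", tail by main loop */
static bool irq_masked;
static bool work_pending;
static unsigned irq_count;      /* how many times the "ISR" actually fired */

/* Hardware delivers a sample: the ring fills regardless, but the ISR body
 * (the notification) only runs while the interrupt is unmasked. */
static void deliver_sample(uint32_t sample) {
    ring[head++ & (RING_SIZE - 1)] = sample;  /* overflow unchecked: sketch */
    if (!irq_masked) {                        /* ISR fires once per burst */
        irq_count++;
        work_pending = true;
        irq_masked = true;    /* mask further IRQs until the ring is drained */
    }
}

static unsigned handled;
static void handle(uint32_t sample) { (void)sample; handled++; }

/* Main loop: one IRQ notifies, then the whole backlog is drained by polling,
 * so a burst of inputs is processed as a single cache-friendly batch. */
static void main_loop_step(void) {
    if (!work_pending) return;
    while (tail != head)
        handle(ring[tail++ & (RING_SIZE - 1)]);
    work_pending = false;
    irq_masked = false;       /* re-arm the interrupt */
}
```

The point is visible in the counters: a burst of N samples costs one interrupt, not N.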
To be fair, LMAX themselves seemed rather open about this being simply a judiciously designed and used circular buffer.

The implied thing (or did I already make it explicit enough?) is that analyzing the Disruptor like this feels like a vindication of the typical PoV of C/C++ programmers who consider Java (or anything else) unusably bloated/wasteful. It's so wasteful that being non-wasteful gains you props! But again, that sounds uncomfortably self-serving. I guess that, yes, this kind of "revelation" can happen more easily in Java, because Java programmers are probably not used to looking at the hardware's behavior. But how many C/C++ programmers wrongly assume that they know enough about memory management? I've found a surprising number of them who think that manual memory management is a perfectly predictable, kinda-O(1) process. If you don't realize that allocating, freeing and even pointing to memory has a (varying) cost, then you're perfectly primed to recreate all the problems that made the Disruptor "innovative" in the first place. Or maybe even worse, because in C/C++ we don't have a JIT optimizing furiously even at runtime, nor a garbage collector to defragment memory – so I would fully expect a naïve queue in Java to work better in the long-ish term than a naïve queue in C/C++. And Java programmers have been castigated enough about performance that they at least probably know what kind of problems to look for, while smug C/C++ programmers would probably assume memory can't be a problem – or one that can't be dealt with, anyway.

And in this respect, it's interesting to see the various attempts to reimplement the Disruptor in C++. There's a discussion in Google Groups where the original author talks about it and discusses the possible gains and the related platform-specific pain – because by 2011, C++ still didn't even have a memory model!
In particular, I saw an implementation which seems to reach 2x performance in C++, though limited to 1 producer. Heh – isn't that cheating? As already mentioned, that's the particularly easy case; and as the Disruptor's author said in the mentioned thread, to make things faster for a given application he would hardwire a number of things. Anyway, once you get into the path of such simplifications, you have to wonder whether everything else was taken into account as well.

My queue/circular buffer library was header-only, and later I generalized it to be "parametrizable" by #defining some parameters before #including the libraries. Later still I saw that this is a pattern others had also used to get similar results; which was both disappointing and relieving, because it reduced the "this must be wrong" feeling of doing this kind of unholy, unexplainable, overcomplicated thing. God, what a mess it was. And that's what pushed me to start wondering what was wrong with C and what makes a language better than others...
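For the uninitiated, the #define-before-#include pattern looks roughly like this. A self-contained sketch with the "header" part inlined; all names are mine, not from my old library:

```c
#include <stdbool.h>

/* ---- what the user writes before #include "queue.h" ---- */
#define QUEUE_TYPE int
#define QUEUE_SIZE 4

/* ---- what queue.h would contain: a pseudo-generic queue,
 *      "instantiated" by the macros defined above ---- */
static QUEUE_TYPE q_buf[QUEUE_SIZE];
static unsigned q_head, q_tail;   /* free-running counters */

static bool q_push(QUEUE_TYPE v) {
    if (q_head - q_tail == QUEUE_SIZE) return false;  /* full */
    q_buf[q_head++ % QUEUE_SIZE] = v;
    return true;
}

static bool q_pop(QUEUE_TYPE *out) {
    if (q_head == q_tail) return false;               /* empty */
    *out = q_buf[q_tail++ % QUEUE_SIZE];
    return true;
}

/* A second instantiation would #undef the parameters, redefine them, and
 * #include the header again, with token pasting on a QUEUE_NAME-style
 * prefix to avoid clashes – which is where the "unholy" feeling comes in. */
```

This is essentially poor-man's templates: the preprocessor does what C++ templates or Java generics would do, with none of the safety and all of the mess.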
Also, it's ... interesting... to see how Martin Fowler examines the Disruptor. He seems to be in awe, as if circling it from a respectful distance while poking at it softly with a stick. Also, as someone mentioned on Hacker News, he seems to be rather out of his depth regarding the whole memory discussion. Food for thought.

An example sentence: "So what indicates you shouldn't go down this path? This is always a tricky questions for little-known techniques like this, since the profession needs more time to explore its boundaries." Huh... sure.

The summary regarding the Disruptor itself is that there seems to be (or was?) a surprising amount of hype here which could be substituted by "profile and rearchitect". The summary regarding the experience of an embedded developer programming outside the embedded world is that it looks like we could be much more aggressive about applying our low-level knowledge to high-level architectures.
I guess it's also interesting that here profiling was used to rip through layers of abstraction, to be able to build something mappable to the desirable behaviors of the hardware. "Layering considered harmful" – RFC 3439.

The interesting question, for myself at least, is: would I have created/decided on/stumbled upon this architecture if I had been on a platform as rich as Java? And it's a bit of an anticlimax that the question doesn't really make much sense, in that it would never have been posed. The times I programmed in Java, the requirements were about GUI responsiveness or general throughput; no one thought about measuring time budgets, much less in microseconds. Not necessarily because it couldn't be done; rather, it just wasn't in the vocabulary. It's the kind of rule of thumb about what a system is good for that ends up turning into an Overton window. Which segues into cargo cults and "nobody ever got fired for buying IBM". And nobody ever got fired for programming a microcontroller in C. And nobody ever got fired for being slow and bloated in Java. The converse is... write something fast in Java and get published! Profile to blast and optimize across layers, and dumbfound architects!

Interestingly, the main developer of the Disruptor, Martin Thompson, mentions in Google Groups that he spent the 90s writing C++, and he seems to know what he's talking about when discussing C++ implementation challenges. Makes me wonder how much that experience helped him design the Disruptor in the first place.
 Somewhere the LMAX people mention that the hardware's design and behavior is actually at odds with the common theory of how to make parallel, concurrent processing work. That's an interesting insight, and a worrying one; lately I also read some commentary on how systems designed in the 60s/70s (Burroughs?) seemed much more amenable to parallel processing than current ones, and how simplistic processor architectures in the 80s killed them (alas, I forget where that was). It sounds to me like current design has tried so hard and so long to keep Moore's Law going that anything that doesn't reach for plain serial speed ends up seemingly disproportionately penalized.
 Which is also how I summarize DPDK and Netmap. Mhm, maybe I'm being too radical, and I only see it this clearly because I've had exposure to so many versions of the same idea?
 I am being purposefully weasely with these numbers – this was about 6 years ago and I don't trust my memory. Fortunately I blogged about part of it at the time: one surprising source of delays was the way gcc was calculating the indexes into the array backing the circular buffer. At the time I identified that the array access stride could even double the timings; now I'd additionally guess that playing with signed/unsigned indexes might have helped, by allowing the compiler to relax the arithmetic. But it would probably be even better to just map the array to a well-defined memory range and calculate my own indexes with a bitmask to force a "mod maxindex" calculation. Which sounds to me like conceding that compiler optimizations can only go so far without explicitly involving the language and the programmer – as D.J. Bernstein argued. (Dropping to assembler for something as simple as this is accepting that C is beyond broken.)
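 The bitmask trick amounts to this (a sketch; it only works when the buffer length is a power of two):

```c
#include <stdint.h>

#define BUF_LEN 1024u                 /* must be a power of two */

/* With a power-of-two length, 'i % BUF_LEN' reduces to a single AND:
 * no division instruction, and no sign-extension surprises from
 * signed indexes. */
static inline uint32_t wrap(uint32_t i) {
    return i & (BUF_LEN - 1);         /* same result as i % BUF_LEN */
}
```

 A common companion to this is keeping head/tail as free-running unsigned counters and masking only on array access, which also makes the full/empty checks trivial (`head - tail`).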
 OK, let's be fair here too: put like that it does sound really easy, and I'm kinda surprised that it can be so summarized. But reaching the point where this was all clear and established and reliable and repeatable took weeks. Hindsight? Experience? Mhm.
 Of course, the converse point is that, since a naïve C/C++ queue would probably turn unviable and die sooner rather than later, only the non-naïve/fixed ones will survive. A less-than-perfect Java queue, on the other hand, might be kept alive by the runtime; instead of dying, it will just underperform.