Massive concurrency and hardware parallelism are sexy topics in the 21st century. There are a couple of good reasons for this and one rather unfortunate one.
Two good reasons are the combination of eye-catching uses of Graphics Processing Units (GPUs) in games and their unexpected secondary uses in deep-learning AI – these exploit massive hardware parallelism internally. The unfortunate reason is that single-processor execution speeds hit a physics wall in about 2006. Current leakage and thermal runaway issues now sharply limit increases in clock frequency, and the classic way out of that bind – lowering voltage – is now bumping up against serious quantum-noise issues.
Hardware manufacturers competing for attention have elected to do it by putting ever more processing cores in each chip they ship and touting the theoretical total throughput of the device. But there have also been rapidly increasing amounts of effort put into pipelining and speculative execution techniques that use concurrency under the hood in attempts to make the serial single processors that programmers can see crank instructions more rapidly.
The awkward truth is that many of our less glamorous computing job loads just can’t use visible concurrency very well. There are different reasons for this that have differing consequences for the working programmer, and a lot of confusion abroad among those reasons. In this episode I’m going to draw some distinctions that I hope will help all of us think more clearly.
First, we need to be clear about where harnessing hardware parallelism is easy and why that seems to be the case. We look at computing for graphics, neural nets, signal processing, and Bitcoin mining, and we see a pattern: parallelizing algorithms work best on hardware that is (a) specifically designed to execute them, and (b) can’t do anything else!
We also see that the inputs to the most successful parallel algorithms (sorting, string matching, fast-Fourier transform, matrix operations, image reverse quantization, and the like) all look rather alike. They tend to have a metric structure and an implied distinction between “near” and “far” in the data that allows it to be carved into patches such that coupling between elements far from each other is negligible.
In the terms of an earlier post on semantic locality, parallel methods seem to be applicable mainly when the data has good locality. And they run best on hardware which – like like the systolic-array processors at the heart of GPUs – is designed to support only “near” communication, between close-by elements.
By contrast, writing software that does effective divide-and-conquer for input with bad locality on a collection of general-purpose (Von Neumann architecture) computers is notoriously difficult.
We can sum this up with a heuristic: Your odds of being able to apply parallel-computing techniques to a problem are inversely proportional to the degree of irreducible semantic nonlocality in your input data.
Another limit on parallel computing is that some important algorithms can’t be parallelized at all – provably so. In the blog post where I first explored this territory I coined the term “SICK algorithm”, with the SICK expanded to “Serial, Intrinscally – Cope, Kiddo!” Important examples include but are not limited to: Dijkstra’s n-least-paths algorithm; cycle detection in directed graphs (with implications for 3-SAT solvers); depth first search; computing the nth term in a cryptographic hash chain; network-flow optimization.
Bad locality in the input data is implicated here, too, especially in graph- and tree-structure contexts. Cryptographic hash chains can’t be parallelized because their entries have to be computed in strict time order – a strictness which is actually important for validating the chain against tampering.
There’s a blocking rule here: You can’t parallelize if a SICK algorithm is in the way.
We’re not done. There are at least two other classes of blocker that you will frequently hit.
One is not having the right tools. Most languages don’t support anything but mutex-and-mailbox, which has the advantage that the primitives are easy to implement but the disadvantage that it induces horrible complexity explosions and is nigh-impossible to model accurately in your head at scales over about four interacting locks.
If you are lucky you may get some use out of a more tractable primitive set like Go channels (aka Communicating Sequential Processes) or the ownership/send/sync system in Rust. But the truth is, we don’t really know what the “right” language primitives are for parallelism on von-Neuman-architecture computers. And there may not even be one right set of primitives; there might be two, three, or more different sets of primitive appropriate for different problem domains but as incommensurable as one and the square root of two. At the present state of the art in 2018 nobody actually knows.
Last but not least, the limitations of human wetware. Even given a tractable algorithm, a data representation with good locality, and sharp tools, parallel programming seems to be just plain difficult for human beings even when algorithm being applied is quite simple. Our brains are not all that good at modelling the simpler state spaces of purely serial programs, and much less so at parallel ones.
We know this because there is plenty of real-world evidence that debugging implementations of parallelizing code is worse than merely _difficult_ for humans. Race conditions, deadlocks, livelocks, and insidious data corruption due to subtly unsafe orders of operation plague all such attempts.
Having a grasp on these limits has, I think, has been growing steadily more important since the collapse of Dennard scaling. Due to all of these bottlenecks in the supply of code that can use multiple cores effectively, some percentage of the multicore hardware out there must be running software that will never saturate its cores; or, to look at it from the other end, the hardware is overbuilt for its job load. How much money and effort are we wasting this way?
Processor vendors would love you to overestimate the functional gain from snazzy new silicon with ever larger multi-core counts; however else will they extract enough of your money to cover the eye-watering cost of their chip fabs and still make a profit? So there’s a lot of marketing push out there that aims to distract capacity planners from ever wondering when those gains are real.
And, to be fair, some places they are. The kind of servers that live in rack mounts and handle hundreds of thousands of concurrent transactions per second probably have their core count matched to their job load fairly well. Smartphones or embedded systems, too – in both these extreme cases a lot of effort goes into minimizing build costs and power budgets, and that’s going to exert selective pressure against overprovisioning.
But for typical desktop and laptop users? I have dark suspicions. It’s hard to know, because we’ve been collecting real performance gains due to other technology changes like the shift from spinning-rust to solid-state mass storage. Gains like that are easy to mistake for an effect of more CPU throughput unless you’re profiling carefully.
But here’s the shape of my suspicion:
1. For most desktop/laptop users the only seriously parallel computing that ever takes place on their computers is in their graphics chips.
2. More than two processor cores is usually just wasteful hotrodding. Operating systems may be able to parcel out applications between them, but the general run of application software is unable to exploit parallelism and it is rare for most users to run enough different processor-hungry applications simultaneously to saturate their hardware that way.
3. Consequently, most of the processing units now deployed in 4-core-and-up machines are doing nothing most of the time but generating waste heat.
My regulars include a lot of people who are likely to be able to comment intelligently on this suspicion. It will be interesting to see what they have to say.
UPDATE: A commenter on G+ points out that one interesting use case for multicores is compiling code really quickly. Source for a language like C has good locality – it can be compiled in well-separated units (source files) into object files that are later joined by a linker.