An Intel Programmer Jumps Over the Wall: First Impressions of ARM SIMD Programming

By geofflangdale

buddha-jump-over-the-wall-2280691_1920(the pictured dish is apparently materials for “Buddha Jumps Over The Wall”, named for its ability to seduce away vegetarians – sadly it uses shark fin so has some ethical issues…)

I’ve done a lot of SIMD coding. However, aside from dabbling with a bit of 32-bit ARM coding during the early Hyperscan era (back before the Intel acquisition of Sensory Networks), it’s been nearly all Intel SIMD – SSE2 through to AVX512.

Recently I’ve been working on an ARM port of simdjson, our fast (Gigabytes/second) SIMD parser. I’ll be uploading preliminary ARM SIMD code for this soon. While the experience is fresh in my mind, I thought I’d write up some first impressions of ARM AArch64 SIMD programming (good, bad and just plain ugly).

The Good

First of all, orthogonality. it’s really nice to program with a SIMD instruction set where one (usually) doesn’t have to wonder whether there will be a an operation for a given data size. Every time I went looking for an operation on bytes, I found it (this is by contrast to Intel SIMD programming, where a whole raft of operations don’t exist for bytes and some don’t exist for 16-bit “word” quantities either).

[ many of these missing byte operations will finally appear with SNC, assuming GFNI is fast enough; the catchily named GF2P8AFFINEQB will allow arbitrary bit permutes, thus including rotates and shifts – see Intel® Architecture Instruction Set Extensions and Future Features Programming Reference for details ]

Orthogonality of these operations is a huge relief to the programmer – it’s less work to commit things to memory, and fewer “what the hell” moments later when you realize something that should exist doesn’t. For example, when doing the “bits to indexes” work my original paper notes on the code happily had an operation that didn’t exist (masked 512-bit OR using a byte-granularity kreg).

Second, multiple-table permutes: TBL and TBX can take multiple registers – up to 4 – as inputs. Thus, you can permute over up to 512 bits. This is a leap-frogging race here – with VBMI and Cannonlake, Intel will allow 2 AVX512-bit registers to be used in a VPERMI2B or VPERMT2B. More on latencies later (I would like to praise these ARM SIMD operations more but, despite many happy claims about how fast these ops are in some architectures – e.g. A12 – I can’t find any documentation).

Note for Intel people – TBL/TBX yield “zero” or “no operation” on out of range indexes, which is a contrast to PSHUFB (with its odd behavior governed by the high bit) or VPERM*, where only the low-order bits affect what the permute does. This seems to be a neutral change; sometimes the PSHUFB behavior is annoying, sometimes it’s useful.

Third, horizontal operations and pairwise operationsThis is something that exists spottily on Intel SIMD, but ARM allows a wide variety of operations to be either applied across the whole vector or be done as a pairwise approach. ADD and MAX/MIN are pretty handy in both contexts.

Fourth, multiple vector interleaved load and store. This is pretty awesome, and the latency/throughput numbers aren’t too bad for at least A75.

Some elements of ARM SIMD coding are pleasant but no longer a significant advantage. The comprehensive range of compare operations is now matched by AVX512. Overall, it’s still a very pleasant instruction set to work with.

The Bad

There is no equivalent of PMOVMSKB on ARM. People have been complaining about this for years. I have a couple not-too-terrible workarounds for this – especially if one has lots of these operations to do at once (e.g. 4×128 bulk PMOVMSKB equivalent to a 64-bit register) which will be the topic of a blog post in the near future. There is at least a decent version involving masking and a series of paired-add operations. So this can be worked around.

It’s 2019 and we’re still doing 128-bit SIMD. I’m excited to see Scalar Vector Extensions (SVE), but… it was announced late 2016 and the only place you can use this extension is a Fujitsu supercomputer? The absence of SVE from the Neoverse announcement was worrying; this will be a processor shipping almost 4 years after SVE was announced that seemed like a logical fit for SVE. ARM really needs to announce a roadmap for SVE. Is anyone going to support it?

The Ugly

Documentation. OK, so we have tons of different ARM variants out there supporting AArch64. Why do none of them – aside from ARM itself – publish tables of instruction latency and throughput? Plenty of people complain about Intel’s documentation, but the Software Optimization Guide coupled to all the supplementary information (Agner Fog, uops.info) is a wonderful source of information by comparison.

ARM apparently has made real strides in openness – I can get a lot of information off their site without registering or being force-signed-up for marketing material (there are still some regrettable exceptions to this – there’s a Neon programmers guide that forces you to sign up for marketing emails, then turns out to be old…). However, most of the other vendors (Apple, Marvell, Ampere) seem to provide zero information to the public (there might be a super-secret NDA version?). This is depressing: you guys do understand that it helps you to have people write good code for your architecture, right?

Also, holes in the tables of throughput and latency are never welcome, no matter which vendor you are. I’d like to wax more enthusiastic about TBL and TBX but even ARM’s data on this has holes (no throughput data).

Conclusion

All up it’s been fairly pleasant to port simdjson to ARM. Missing operations are counter-balanced by a lot of nice new tricks. I’d say the biggest source of pain at this stage is how scarce information on non-ARM implementations of the instruction set.

If anyone has better software optimization resources for ARM (or if there’s a secretive ARM equivalent of Agner Fog or uops.info lurking out there), please comment or send me an email.