Too many cores

By directhex

ARM is important for us. It’s important for IOT scenarios, and it provides a reasonable proxy for phone platforms when it comes to developing runtime features.

We have big beefy ARM systems on-site at Microsoft labs, for building and testing Mono – previously 16 Softiron Overdrive 3000 systems with 8-core AMD Opteron A1170 CPUs, and our newest system in provisional production, 4 Huawei Taishan XR320 blades with 2×32-core HiSilicon Hi1616 CPUs.

The HiSilicon chips are, in our testing, a fair bit faster per-core than the AMD chips – a good 25-50%. Which begged the question “why are our Raspbian builds so much slower?”

Blowing a raspberry

Raspbian is the de-facto main OS for Raspberry Pi. It’s basically Debian hard-float ARM, rebuilt with compiler flags better suited to ARM11 76JZF-S (more precisely, the ARMv6 architecture, whereas Debian targets ARMv7). The Raspberry Pi is hugely popular, and it is important for us to be able to offer packages optimized for use on Raspberry Pi.

But the Pi hardware is also slow and horrible to use for continuous integration (especially the SD-card storage, which can be burned through very quickly, causing maintenance headaches), so we do our Raspbian builds on our big beefy ARM64 rack-mount servers, in chroots. You can easily do this yourself – just grab the raspbian-archive-keyring package from the Raspbian archive, and pass the Raspbian mirror to debootstrap/pbuilder/cowbuilder instead of the Debian mirror.

These builds have always been much slower than all our Debian/Ubuntu ARM builds (v5 soft float, v7 hard float, aarch64), but on the new Huawei machines, the difference became much more stark – the same commit, on the same server, took 1h17 to build .debs for Ubuntu 16.04 armhf, and 9h24 for Raspbian 9. On the old Softiron hardware, Raspbian builds would rarely exceed 6h (which is still outrageously slow, but less so). Why would the new servers be worse, but only for Raspbian? Something to do with handwavey optimizations in Raspbian? No, actually.

When is a superset not a superset

Common wisdom says ARM architecture versions add new instructions, but can still run code for older versions. This is, broadly, true. However, there are a few cases where deprecated instructions become missing instructions, and continuity demands those instructions be caught by the kernel, and emulated. Specifically, three things are missing in ARMv8 hardware – SWP (swap data between registers and memory), SETEND (set the endianness bit in the CPSR), and CP15 memory barriers (a feature of a long-gone control co-processor). You can turn these features on via abi.cp15_barrier, abi.setend, and abi.swp sysctl flags, whereupon the kernel fakes those instructions as required (rather than throwing SIGILL).

CP15 memory barrier emulation is slow. My friend Vince Sanders, who helped with some of this analysis, suggested a cost of order 1000 cycles per emulated call. How many was I looking at? According to dmesg, about a million per second.

But it’s worse than that – CP15 memory barriers affect the whole system. Vince’s proposal was that the HiSilicon chips were performing so much worse than the AMD ones, because I had 64 cores not 8 – and that I could improve performance by running a VM, with only one core in it (so CP15 calls inside that environment would only affect the entire VM, not the rest of the computer).

Escape from the Pie Folk

I already had libvirtd running on all my ARM machines, from a previous fit of “hey one day this might be useful” – and as it happened, it was. I had to grab a qemu-efi-aarch64 package, containing a firmware, but otherwise I was easily able to connect to the system via virt-manager on my desktop, and get to work setting up a VM. virt-manager has vastly improved its support for non-x86 since I last used it (once upon a time it just wouldn’t boot systems without a graphics card), but I was easily able to boot an Ubuntu 18.04 arm64 install CD and interact with it over serial just as easily as via emulated GPU.

Because I’m an idiot, I then wasted my time making a Raspbian stock image bootable in this environment (Debian kernel, grub-efi-arm64, battling file-size constraints with the tiny /boot, etc) – stuff I would not repeat. Since in the end I just wanted to be as near to our “real” environment as possible, meaning using pbuilder, this simply wasn’t a needed step. The VM’s host OS didn’t need to be Raspbian.

Point is, though, I got my 1-core VM going, and fed a Mono source package to it.

Time taken? 3h40 – whereas the same commit on the 64-core host took over 9 hours. The “use a single core” hypothesis more than proven.

Next steps

The gains here are obvious enough that I need to look at deploying the solution non-experimentally as soon as possible. The best approach to doing so is the bit I haven’t worked out yet. Raspbian workloads are probably at the pivot point between “I should find some amazing way to automate this” and “automation is a waste of time, it’s quicker to set it up by hand”

Many thanks to the #debian-uk community for their curiosity and suggestions with this experiment!