Cargo builds like:

$ RUSTFLAGS="--emit asm" cargo build --release
$ cargo rustc --release -- --emit asm

Do not always output assembly equivalent to the machine code you’d get from:

Possibly rustc --emit=asm has some uses, like examining a single file with no external dependencies, but it’s not useful for my normal case of wanting to look at the asm for an arbitrary release build.

The long version

Previously I rewrote my ray tracer to use crossbeam::scope and crossbeam::queue instead of rayon. Internally rayon leans heavily on crossbeam::deque for its work-stealing implementation, so my expectation was that this change would neutral or a slight improvement, depending on how good of a job the compiler had been doing to condense rayon’s abstractions.

Instead it was a ~15% regression.

Looking at the asm, pt. 1

The asm output appeared sane. I saw no expensive indirection, calls, etc. - things were getting properly inlined and optimized.

Understanding rayon

I first questioned my understanding of rayon and spent some time digging through its guts. It’s well-engineered, and it’s impressive that clang’s able to condense all of its abstractions down into basically no overhead - but I also didn’t see anything fundamentally novel or surprising going on that would give it a significant performance edge. The splitting/work assignment portion of the vec codepath looked like it would lead to slightly more even partitioning than my hand-built crossbeam method, but not by a lot, and definitely not by 15%. So that was bust. I did notice that crossbeam needed to heap allocate the closure I was using for my thread body, so perhaps that caused some additional overhead, but it should have been negligible.

CPU profiling

At this point I dumped both versions into Instruments and did some basic CPU profiling. rayon’s a bit annoying to poke around in because you end up with extremely deep stacks of join frames, but nothing really stood out. The crossbeam version was simply slower with no major red flags.

More in-depth CPU profiling

I’d been looking for an excuse to try Intel VTune for awhile, but since it’s only supported on Windows and Linux and is best run on bare-metal, it had always been slightly too much effort to stand up for smaller projects. It seemed warranted for this one! I had an existing Windows bootcamp partition, so figured I’d see just how much hassle it was to get everything working in that before I dusted off something to run Linux.

Sidebar: turns out Rust on Windows is… really nice. I’m not a Windows dev. There are things I admire about the ecosystem (like a good first-party debugger and some decent OS APIs), but apart from some Java way back in high school I’ve never even tried to compile software on a Windows machine. It always looked like a nightmare for C/C++ projects - I’m familiar enough with the code side of cross-platform support, but as for actually building things… I think cmake can spit out a Visual Studio project? And I keep hearing about WSL? So I went in with significant trepidation. Turns out it took all of ten minutes to install the VS C++ tools, rustup, a rust toolchain, vtune, and get everything building and working together. Pretty impressive.

VTune itself is a complex beast. Most (all?) of the data in it is stuff you could get out of perf, but the collection and workflow is streamlined - it does a good job of keeping track of previous runs, grouping them in a way so you don’t lose anything, surfacing useful information based on top-level categories (eg. “I want to look at memory access”), and providing a diff view between runs. It looks particularly useful for guiding iterative optimization and refinement. It’s a bit less useful when I’m comparing the performance of two fairly different programs, because many of the stack traces are unique to either the rayon or crossbeam version, so “you have 100% more of these rayon stack traces in this run” is not helpful. Looking through the data I saw that I was getting flagged on uarch perf, retiring instructions maybe 5% worse in the crossbeam version. Thinking that could be stalling waiting on memory, I ran a memory access profile and saw:

Crossbeam version is on the left, rayon version is on the right. Okay, 3s runtime difference - that’s commensurate with the perf regression I’m seeing. Interesting, we’re memory bound twice as frequently. That’s strange because our memory access pattern should be pretty similar. We’re doing over twice as many stores. We’re doing some additional loads. We’re…

Wait.

We’re doing over twice as many stores?! That doesn’t make sense.

Replacing crossbeam::scope

Perhaps heap allocating the closures was more expensive than I thought, or had bad knock-on effects. It’s a long shot, but the whole point of side projects is following some of those random tangents. I set about eliminating crossbeam::scope and using std::thread directly instead. This was a quick and dirty test: the entire point of scope is to create an abstraction that communicates to the borrow checker that threads we’ve spun off have been joined, otherwise it doesn’t know when a thread’s borrow is guaranteed to have ended and requires that data references from a thread’s closure are all static lifetime. In this case I’m manually joining the threads, so I can do a transmute to placate the compiler. Don’t ship code like this, it defeats the purpose of using Rust in the first place - you’d have a better experience with C++. But it can be really handy to circumvent these sorts of checks when doing quick prototyping/performance analysis to decide if it’s worth the time to build out a safe abstraction. I would welcome a “just build this without the borrow checker” mode for cases like this, though I’m probably in the minority and I don’t expect that would be an easy feature to add.

My testing code looked roughly like this:

let pixels = unsafe { mem::transmute::<&mut [V3], &'static mut [V3]>(pixels)
};
let handle = std::thread::spawn(move || { // code that uses &pixels
});
handle.join().unwrap();

As expected, no significant performance gains were had.

Looking at the asm, pt. 2…

Something isn’t adding up so I want to look at the assembly again, but I’d like to clearly distinguish between my unchanged business logic and the rayon/crossbeam coordination code. The majority of my business logic is behind a single function named cast; adding #[inline(never)] to that single ray processing function should give me a nice seam between rayon and my business logic.

Build, run and the rayon version slows down… in fact it runs exactly as slow as the crossbeam version.

I try adding #[inline(always)] to the cast function in the crossbeam version, and lo and behold it speeds up to match the original rayon version, my regression disappears.

But, how’s that possible? The first thing I did was look at inlining. Maybe my quick once-over missed it, maybe I misread and this whole circuitous path is all my fault?

I generated assembly output for both the inlined and noninlined versions of the crossbeam ray tracer:

$ rg ecl_rt4cast inline.s 21293:	.asciz	"_ZN6ecl_rt4cast17hc1100eade04dff75E"
$ rg ecl_rt4cast noinline.s 21293:	.asciz	"_ZN6ecl_rt4cast17hc1100eade04dff75E"

I’m building release with symbols, so that string is expected. But neither version, not even the non-inlined version, is making calls to cast(). Curious.

$ wc -l inline.s 203969 inline.s
$ wc -l noinline.s 203969 noinline.s

Now I feel like I’m being gaslighted. These are the exact same length. A diff shows that the only changes are some arbitrary IDs in debug info. I have a difficult relationship with optimizing compilers, so my first thought is maybe clang’s being clang again and I should go validate the binaries instead…

$ objdump -d ecl_rt_inline | rg ecl_rt4cast
<no output>
$ objdump -d ecl_rt_noinline | rg ecl_rt4cast
0000000100003190 __ZN6ecl_rt4cast17hc1100eade04dff75E:
100003299: e9 af 01 00 00 jmp	431 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x2bd>
1000034a2: eb 1f jmp	31 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x333>
1000034c6: 74 38 je	56 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x370>
1000034e5: 0f 82 f5 00 00 00 jb	245 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x450>
1000034ee: 72 1d jb	29 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x37d>
1000034f0: e9 eb 00 00 00 jmp	235 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x450>
100003503: 0f 83 d7 00 00 00 jae	215 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x450>
100003515: 0f 87 16 03 00 00 ja	790 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6a1>
10000351e: 0f 82 1f 03 00 00 jb	799 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6b3>
100003527: 0f 82 2b 03 00 00 jb	811 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6c8>
100003530: 0f 82 37 03 00 00 jb	823 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6dd>
100003539: 0f 82 40 03 00 00 jb	832 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x6ef>
100003590: 0f 84 1a ff ff ff je	-230 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x320>
1000035db: e9 d0 fe ff ff jmp	-304 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x320>
10000360a: 0f 86 a1 01 00 00 jbe	417 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x621>
100003637: 0f 87 57 02 00 00 ja	599 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x704>
100003668: 0f 86 3d 02 00 00 jbe	573 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x71b>
100003682: 0f 84 41 01 00 00 je	321 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x639>
100003707: 0f 85 93 fb ff ff jne	-1133 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x110>
10000371a: 0f 86 9f 01 00 00 jbe	415 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x72f>
100003723: 0f 86 a8 01 00 00 jbe	424 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x741>
10000372c: 0f 86 b1 01 00 00 jbe	433 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x753>
1000037ac: e9 57 fc ff ff jmp	-937 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x278>
1000037c7: eb 12 jmp	18 <__ZN6ecl_rt4cast17hc1100eade04dff75E+0x64b>
100009740: e8 4b 9a ff ff callq	-26037 <__ZN6ecl_rt4cast17hc1100eade04dff75E>

Bingo - note the callq. Clearly my crossbeam version wasn’t inlining as aggresively as the rayon version, possibly due to the Box::new(closure). Instructing the compiler to do so brought performance in line with expectations. It’s silly that the compiler wasn’t inlining it in the first place, this function has a single callsite and inlining it improves both runtime performance and binary size.

That means --emit=asm does something entirely unexpected. I dug around and sure enough there are reports that running with --emit=asm will build with a different configuration due to interaction with ThinLTO and codegen units.

Fin

It’s not ideal to rely on disassemblers because they’re also fallible. In the same way that going from C to asm loses fidelity and makes decompiling from asm to C difficult, going from asm to machine code also loses fidelity and there can be inconsistenices when disassembling machine code back into asm.

The common disassemblers like objdump are linear sweep and can suffer from mistaking data for code. There’s another family of disassemblers based on recursive traversal that avoid those problems, but come with their own set of tradeoffs.

Note that the learning curve on disassemblers can be steep. These tools are often packaged into a suite and targeted towards reverse engineering and malware analysis, they come with far more features than “give me a good disassembly and make it easy to visualize/browse.” Hopefully it’ll be easier to match the --emit=asm build config to a normal release build config in the future, but until then I’ll be getting comfortable with Ghidra.