24-core CPU and I can’t type an email (part two)

By brucedawson

In my last post I promised to give more details about some rabbit holes that I went down during the investigation, including page tables, locks, WMI, and a vmmap bug. Those details are here, along with updated code samples. But first, a really quick summary of the original issue:

In the last post I talked about how every time a CFG-enabled process allocates executable memory some Control Flow Guard (CFG) memory is allocated as well. Windows never frees the CFG memory so if you keep allocating and freeing executable memory at different addresses then your process can accumulate an arbitrary amount of CFG memory. Chrome was doing this and that was leading to an essentially unbounded waste of memory, and hangs on some machines.

And, I have to say, hangs are hard to avoid if VirtualAlloc starts running more than a million times slower than normal.

In addition to the wasted CFG memory there was some other wasted memory, although it wasn’t as bad as vmmap suggested.

CFG and pages

Both code memory and CFG memory are ultimately allocated in 4-KiB pages (more on that later). Since 4 KiB of CFG memory can describe 256 KiB of code memory (details on that later), if you allocate a 256 KiB block of code memory that is 256 KiB aligned then you will get one 4 KiB CFG page. And if you allocate a 4 KiB block of code memory then you will still get a 4 KiB CFG page, but with most of it unused.

Things get tricky when executable memory is freed. If you VirtualFree a block of executable memory that is not a multiple of 256 KiB or is not 256 KiB aligned then the OS would have to do some bookkeeping to see whether any other executable memory is still using the CFG pages. The CFG authors decided not to bother with this complexity and they just always leave the CFG memory allocated – always. That’s unfortunate. That means that when my test program allocates and then frees 1 GiB of aligned executable memory, this leaves 16 MiB of CFG memory allocated.

And, more practically, this means that when Chrome’s JavaScript engine allocates and then frees 128 MiB of aligned executable memory (not all of it was committed, but the whole address range was allocated and freed at once) then up to 2 MiB of CFG memory will be left allocated, even though it’s trivially provable that it could be freed. Since Chrome was repeatedly allocating/freeing code memory at randomized addresses this was causing the problem that I discovered.
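
To make that concrete, here is a minimal sketch of the allocate-then-free pattern – roughly in the spirit of my test program, but not the actual VAllocStress code. It assumes a 64-bit process built with /guard:cf running on a Windows build that still leaks CFG memory; the base address and block count are arbitrary choices for illustration.

#include <windows.h>
#include <psapi.h>
#include <cstdio>

#pragma comment(lib, "psapi.lib")

// Commit charge of the current process, in bytes.
static SIZE_T GetCommit() {
  PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
  GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));
  return pmc.PagefileUsage;
}

int main() {
  const SIZE_T kBlockSize = 256 * 1024;  // 256 KiB - one CFG page per block
  const int kBlocks = 4096;              // 4096 * 256 KiB = 1 GiB in total
  char* base = reinterpret_cast<char*>(0x100000000000);  // arbitrary aligned start address

  SIZE_T commit_before = GetCommit();
  for (int i = 0; i < kBlocks; ++i) {
    // Allocate executable memory at a fresh 256-KiB-aligned address...
    void* p = VirtualAlloc(base + i * kBlockSize, kBlockSize,
                           MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (!p)
      continue;  // that address was already in use - skip it
    // ...and immediately free it. The executable memory is gone, but on affected
    // Windows versions the matching 4-KiB CFG page stays committed.
    VirtualFree(p, 0, MEM_RELEASE);
  }
  SIZE_T commit_after = GetCommit();

  // With CFG enabled this should report roughly 16 MiB (4096 * 4 KiB of CFG pages).
  printf("Commit grew by %zu KiB\n", (commit_after - commit_before) / 1024);
  return 0;
}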

Additional wasted memory

On any modern operating system each process gets its own virtual memory address space so that the operating system can isolate processes and protect memory. It does that using a memory-management unit (MMU) and page tables. Memory is broken up into 4-KiB pages – that’s the smallest amount of memory that the OS can give you. Each page needs to be pointed to by an eight-byte page-table entry, and those entries are themselves stored in 4-KiB pages. Each of those pages can point to only 512 different memory pages, so we need a hierarchy of page tables. For the 48-bit address space supported on an x64 operating system:

  • The level-one page table can cover 256 TiB (48 bits) by pointing to 512 different level-two page tables
  • Each level-two page table can cover 512 GiB by pointing to 512 different level-three page tables
  • Each level-three page table can cover 1 GiB by pointing to 512 different level-four page tables
  • Each level-four page table can cover 2 MiB by pointing to 512 different physical pages

The MMU indexes into the level-one page table with the first nine (of 48) address bits, then into the pointed-to level-two page table with the next nine address bits, then into the pointed-to level-three page table with the next nine address bits, then into the level-four page table with the next nine address bits. At that point the MMU has used 36 bits, has 12 remaining, and those are used to index into the 4-KiB page whose address was found in the level-four page table. Phew.
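
If it helps, here is that address walk expressed as a few lines of illustrative code – just a sketch, using an arbitrary example address:

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t addr = 0x00007FF6D2A41234;  // an arbitrary example address

  unsigned l1 = (addr >> 39) & 0x1FF;  // level-one index: address bits 47-39
  unsigned l2 = (addr >> 30) & 0x1FF;  // level-two index: address bits 38-30
  unsigned l3 = (addr >> 21) & 0x1FF;  // level-three index: address bits 29-21
  unsigned l4 = (addr >> 12) & 0x1FF;  // level-four index: address bits 20-12
  unsigned offset = addr & 0xFFF;      // offset within the 4-KiB page: bits 11-0

  printf("L1=%u, L2=%u, L3=%u, L4=%u, offset=0x%X\n", l1, l2, l3, l4, offset);
  return 0;
}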

If all of these page-table levels were fully populated you’d need more than 512 GiB of RAM just for them, so they are sparsely populated and filled in as needed. This means that when you allocate a page of memory the OS may need to allocate some page tables – anywhere from zero to three of them, depending on whether your allocation is in a previously unused 2 MiB region, previously unused 1 GiB region, or previously unused 512 GiB region (the level-one page table is always allocated).

In short, sparse allocations can be significantly more expensive than nearby allocations, because page tables can’t be shared as much. The leaking CFG allocations were spread out quite sparsely so when vmmap told me that Chrome was using 412,480 KiB of page tables (edited out of the previous blog post) I assumed that the numbers were correct. Here’s the same screenshot of vmmap showing the chrome.exe memory layout that I had in the last blog post, but with the Page Table row shown:

Some incorrect page table numbers

But something didn’t seem right. I ended up adding a page-table simulator to my VirtualScan tool that would count how many page table pages were needed for all of the committed memory in the process being scanned. This is simply a matter of walking through committed memory, incrementing a counter every time a new multiple of 2 MiB, 1 GiB, or 512 GiB is encountered.
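
Here is a rough sketch of that sort of simulator – a simplified stand-in for the VirtualScan code that scans the current process for brevity. It walks the address space with VirtualQuery and counts a new level-four, level-three, or level-two page table whenever committed memory crosses into a new 2 MiB, 1 GiB, or 512 GiB region:

#include <windows.h>
#include <cstdint>
#include <cstdio>

int main() {
  // Track the most recently seen 2 MiB / 1 GiB / 512 GiB region so that each
  // one is only counted once as we walk the address space in order.
  uint64_t last_2m = UINT64_MAX, last_1g = UINT64_MAX, last_512g = UINT64_MAX;
  uint64_t l4 = 0, l3 = 0, l2 = 0;  // level-four / level-three / level-two table counts

  MEMORY_BASIC_INFORMATION mbi = {};
  char* addr = nullptr;
  while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
    if (mbi.State == MEM_COMMIT) {
      uint64_t begin = reinterpret_cast<uint64_t>(mbi.BaseAddress);
      uint64_t end = begin + mbi.RegionSize;
      // Walk the committed region in 2 MiB steps.
      for (uint64_t p = begin & ~0x1FFFFFull; p < end; p += 0x200000) {
        if ((p >> 21) != last_2m)   { last_2m = p >> 21;     ++l4; }  // new level-four table
        if ((p >> 30) != last_1g)   { last_1g = p >> 30;     ++l3; }  // new level-three table
        if ((p >> 39) != last_512g) { last_512g = p >> 39;   ++l2; }  // new level-two table
      }
    }
    addr = static_cast<char*>(mbi.BaseAddress) + mbi.RegionSize;
  }

  // Each page table is a 4-KiB page; the single level-one table adds one more.
  uint64_t pages = l4 + l3 + l2 + 1;
  printf("Estimated page tables: %llu pages (%.1f MiB)\n",
         static_cast<unsigned long long>(pages), pages * 4096.0 / (1024.0 * 1024.0));
  return 0;
}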

I quickly found that my simulator results matched vmmap on normal processes, but on processes with a large amount of CFG memory the results were off, by roughly the amount of CFG working set. For the process above where vmmap said there were 402.8 MiB (412,480 KiB) of page tables my tool said there were 67.7 MiB.

       Scan time,  Committed, page tables, committed blocks
Total: 41.763s, 1457.7 MiB,    67.7 MiB,  32112, 98 code blocks
CFG:   41.759s,  353.3 MiB,    59.2 MiB,  24866

I was able to independently verify that vmmap was wrong by running VAllocStress which in its default settings causes Windows to commit 2 GiB of CFG memory. vmmap claimed that it had also allocated 2 GiB of page tables:


And yet, when I killed the process Task Manager showed the commit amount going down by just 2 GiB. So, vmmap is wrong, my quick-hack page-table calculations are correct, and after discussion with some helpful twitterati the vmmap bug report has been passed along and should get fixed. The CFG memory still consumes a lot of page table entries (59.2 MiB in the example above) but not as many as vmmap said, and when the fix ships it will consume hardly any.

What is CFG and CFG memory?

I want to step back and give a bit more information on what CFG is.

CFG is short for Control Flow Guard. It is an anti-exploit technique used to halt attacks that overwrite function pointers. When CFG is enabled the compiler and OS can work together to make sure a branch target is valid. First the relevant CFG control byte is loaded from the 2-TiB CFG reservation. A 64-bit process on Windows gets a 128-TiB address space, and 128 TiB divided by 64 is 2 TiB, so dividing the branch target address by 64 finds the relevant CFG byte for that target.

uint8_t cfg_byte = cfg_base[size_t(target_addr) / 64];  // one CFG byte covers 64 bytes of address space

Now we have one byte that is supposed to describe which addresses in a 64-byte range are valid branch targets. To make that work CFG treats the byte as four two-bit values, each one corresponding to a 16-byte range. That two-bit number (whose value goes from zero to three) is interpreted like this:

  • 0 – all targets in this sixteen-byte block are invalid indirect branch targets
  • 1 – the start address in this sixteen-byte block is a valid indirect branch target
  • 2 – all addresses except the start address are valid indirect branch targets (unused?)
  • 3 – all addresses in this sixteen-byte block are valid indirect branch targets

If an indirect branch target is found to be invalid then the process is terminated and an exploit is avoided. Hurray!
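
Expressed as code, the check might look roughly like this. This is just an illustration of the layout described above – it assumes the two-bit values are packed in address order starting at the low bits of each byte, and the real check in ntdll is structured differently:

#include <cstdint>

// Returns true if target_addr is an acceptable indirect branch target according
// to the CFG bits at cfg_base, using the two-bits-per-16-bytes view from above.
bool IsValidCallTarget(const uint8_t* cfg_base, uintptr_t target_addr) {
  uint8_t cfg_byte = cfg_base[target_addr / 64];   // one byte per 64 bytes of address space
  unsigned slot = (target_addr >> 4) & 0x3;        // which 16-byte block within those 64 bytes
  unsigned bits = (cfg_byte >> (slot * 2)) & 0x3;  // the two-bit state for that block

  switch (bits) {
    case 0: return false;                     // no valid targets in this block
    case 1: return (target_addr & 0xF) == 0;  // only the 16-byte-aligned start is valid
    case 2: return (target_addr & 0xF) != 0;  // everything except the start (unused?)
    default: return true;                     // 3 - every address in the block is valid
  }
}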


From this we can tell that indirect branch targets should be 16-byte aligned for maximum security, and we can see why the amount of CFG memory used by a process will be roughly 1/64th the amount of code memory.

The actual CFG code loads 32 bits at a time, but that is just an implementation detail. Many sources describe the CFG memory as one bit per 8 bytes rather than two bits per 16 bytes. My explanation is better.

And that’s why we can’t have nice things

The hang which helped me find this memory problem happened for two reasons. One reason is that the scanning of CFG memory on Windows 10 16299 or earlier is painfully slow. I’ve seen the scan of the address space of a process take 40 or more seconds and literally 99.99% of that time was scanning the CFG memory reservation, even though the CFG reservation was only about three quarters of the committed memory blocks. I don’t know why the scanning was so slow and since it is fixed in Windows 10 17134 I don’t care enough to investigate.

This slow scanning caused a hang because gmail wanted to get the CFG reservation and WMI was holding a lock on it while scanning. But the lock on that memory reservation wasn’t being held for the entire length of the scan. In my sample above there were ~49,000 blocks in the CFG region and the NtQueryVirtualMemory function, which acquires and releases the lock, was being called once for each of them. Therefore the lock was being acquired and released ~49,000 times and was held for, on average, less than one millisecond at a time.

But for some reason, even though the lock was released 49,000 times the Chrome process was never able to acquire it. That’s not fair!

And that is exactly the problem. As I said last time:

This is because Windows locks are, by design, not fair and if a thread releases a lock and then tries to reacquire it immediately then it can, in cases like this, reacquire it every single time.

A fair lock means that two threads fighting over a lock will alternate, with each making progress. However there will be lots of context switches, which are expensive, and the lock might spend a lot of time with no thread inside its critical region.


Unfair locks are cheaper, and they don’t force threads to wait in a line. They can just grab the lock, as Joe Duffy’s article mentions. Joe Duffy also says:

The change to unfair locks clearly has the risk of leading to starvation. But, statistically speaking, timing in concurrent systems tends to be so volatile that each thread will eventually get its turn to run, probabilistically speaking.

So how can we reconcile Joe’s 2006 statement about the rarity of starvation with my experience of a 100% repeatable and long-running starvation problem? I think the main reason is something else that happened in 2006. Intel launched the Core Duo, and multi-core computers began to become ubiquitous.

Because it turns out that my starvation problem will only happen on a multi-core system! On a multi-core system the WMI thread will release the lock, signal the Chrome thread to wake up, and then continue working. Because the WMI thread is already running it has a “head start” on the Chrome thread and it can easily re-enter NtQueryVirtualMemory and reacquire the lock before Chrome has a chance to get there.

On a single-core system you can, obviously, only have a single thread running at a time. Windows usually boosts the priority of a thread that has just been readied and that priority boost means that when the lock is released Chrome’s thread will be readied and will immediately preempt the WMI thread. This gives the Chrome thread lots of time to wake up and acquire the lock and the starvation never happens.

You see? On the multi-core system the priority boost will, in most cases, have no effect on the WMI thread, because it will be on a different core!

This means that a system with extra cores may end up being less responsive than one with the same workload and fewer cores. Curiously enough this means that if my computer had been busier – if it had had threads of the appropriate priority running on all of the CPU cores – then the hang might have been avoided (don’t try this at home).

So, unfair locks have higher throughput but can lead to starvation. I suspect that the solution is what I call “occasionally-fair” locks which would be unfair, say, 99% of the time, but would be fair 1% of the times when a contended lock was released. This would give most of the performance advantages, while avoiding most starvation. Windows locks used to be fair, they probably could be fair again, and maybe being occasionally fair would be the perfect balance. Disclaimer: I am not a locking expert or an OS engineer, but I’d be interested to hear thoughts on this, and at least I’m not the first to suggest something like this.

In summary: Holding a lock for seconds at a time is bad form, and restricts parallelism.
However, on multi-core systems with unfair locks, releasing and then immediately reacquiring the lock behaves almost identically – other threads will have no chance to sneak in.
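
The release-and-immediately-reacquire pattern is easy to play with. Here is a toy sketch – not the WMI or Chrome code – that uses an SRW lock, one of Windows’ unfair lock types. A “greedy” thread hammers the lock the way the WMI scan hammered NtQueryVirtualMemory’s lock, and a waiter thread periodically tries to get in. On a multi-core machine the waiter can end up blocked for a long time; pinning the process to one core (the commented-out SetProcessAffinityMask call) usually lets it in promptly.

#include <windows.h>
#include <cstdio>

SRWLOCK g_lock = SRWLOCK_INIT;
volatile LONG g_waiter_acquisitions = 0;
volatile LONG g_done = 0;

// Stand-in for the WMI scanning thread: hold the lock briefly, release it, and
// immediately grab it again, over and over.
DWORD WINAPI GreedyThread(void*) {
  for (int i = 0; i < 10000000; ++i) {  // hammer the lock for a while
    AcquireSRWLockExclusive(&g_lock);
    for (volatile int spin = 0; spin < 1000; ++spin) {}  // a short chunk of "work"
    ReleaseSRWLockExclusive(&g_lock);
    // No pause here - straight back to AcquireSRWLockExclusive.
  }
  return 0;
}

// Stand-in for the Chrome thread: occasionally try to take the same lock.
DWORD WINAPI WaiterThread(void*) {
  while (!g_done) {
    ULONGLONG start = GetTickCount64();
    AcquireSRWLockExclusive(&g_lock);
    ULONGLONG waited = GetTickCount64() - start;
    ReleaseSRWLockExclusive(&g_lock);
    InterlockedIncrement(&g_waiter_acquisitions);
    if (waited > 1000)
      printf("Waiter was blocked for %llu ms\n", waited);
    Sleep(1);
  }
  return 0;
}

int main() {
  // Uncomment to confine the process to one core, which usually avoids the starvation:
  // SetProcessAffinityMask(GetCurrentProcess(), 1);
  HANDLE waiter = CreateThread(nullptr, 0, WaiterThread, nullptr, 0, nullptr);
  HANDLE greedy = CreateThread(nullptr, 0, GreedyThread, nullptr, 0, nullptr);
  WaitForSingleObject(greedy, INFINITE);
  InterlockedExchange(&g_done, 1);
  WaitForSingleObject(waiter, INFINITE);
  printf("Waiter acquired the lock %ld times\n", g_waiter_acquisitions);
  CloseHandle(waiter);
  CloseHandle(greedy);
  return 0;
}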

Close call with ETW

I rely on ETW tracing for all of these investigations so I was a bit horrified when I started investigating this issue and found that Windows Performance Analyzer (WPA) couldn’t load Chrome’s symbols. It used to work, just last week, I’m certain. What happened…

What happened is that Chrome M68 shipped, and it is linked with lld-link instead of with VC++’s linker, and if you run dumpbin and look at the debug information you see this:

C:\b\c\b\win64_clang\src\out\Release_x64\./initialexe/chrome.exe.pdb

Okay, I guess WPA doesn’t like those slashes, but that doesn’t make sense because I signed off on the switch to lld-link and I know that I tested WPA before doing that so what happened…

What happened is that the 17134 version of WPA had shipped. I had tested lld-link’s output with WPA 16299 and that had worked. What perfect timing! The new linker and the new WPA were incompatible.

I reverted my copy of WPA so I could continue the investigation (xcopy from a machine with the older version works fine) and filed an lld-link bug which the team promptly fixed. I’ll be able to upgrade back to WPA 17134 when M69 ships, built with the fixed linker.

WMI

WMI, the trigger for the hangs, is Windows Management Instrumentation and it is something that I know very little about. I found that somebody had hit a problem with high CPU use in WmiPrvSE.exe inside of perfproc!GetProcessVaData in 2014 or earlier, but they didn’t leave enough information to be useful. At this point I made the mistake of trying to figure out what crazy WMI query could be so expensive that it would cause gmail to hang for seconds at a time. I roped in some experts and wasted a bunch of time trying to find the magical query. I recorded Microsoft-Windows-WMI-Activity in my ETW traces, experimented with powershell to find all of the Win32_Perf queries, and went down a few other rabbit holes that are too tedious to share. I eventually found that the Win32_PerfRawData_PerfProc_ProcessAddressSpace_Costly counter, invokable through a single line of powershell, would trigger the hang in gmail:

measure-command {Get-WmiObject -Query "SELECT * FROM Win32_PerfFormattedData_PerfProc_ProcessAddressSpace_Costly"}

I then went down even more rabbit holes because of the counter name (“Costly”? really?) and because this counter appears and disappears based on factors that I don’t understand.

But the details of WMI didn’t really matter. WMI wasn’t doing anything wrong – not really – it was just scanning memory. Writing my own scanning code was far more useful.

Chores for Microsoft

Chrome has landed a mitigation, and all the remaining work items are on Microsoft:

  1. Speed up the scanning of CFG regions – okay, this one is done
  2. Free up CFG memory when executable memory is released – at least for the 256 KiB aligned case, that’s easy
  3. Consider a flag to allow allocating executable memory without CFG memory, or use PAGE_TARGETS_INVALID for this purpose
  4. Fix vmmap’s calculation of page tables for processes with lots of CFG commit

Code updates

I updated my code samples – especially VAllocStress. VAllocStress now includes twenty lines of code that demonstrates how to find the CFG reservation in your process. I also added test code that uses SetProcessValidCallTargets to validate the meaning of the CFG bits, and to demonstrate the tricks needed to call it successfully (hint: calling it through GetProcAddress will probably hit a CFG violation!).