JDK 16 is out, and as usual, each new release comes with a bunch of new features, enhancements and bug fixes. ZGC received 46 enhancements and 25 bug fixes. Here I’ll cover a few of the more interesting enhancements.
Sub-Millisecond Max Pause Times (a.k.a. Concurrent Thread-Stack Processing)
When we started the ZGC project, our goal was to never have a GC pause take longer than 10ms. At the time, 10ms seemed like an ambitious goal. Other GCs in HotSpot typically offered max pause times several orders of magnitude worse than that, especially when using large heaps. Reaching this goal was to a large extent a matter of doing all the really heavy work, such as relocation, reference processing, and class unloading, in a concurrent phase rather than in a Stop-The-World phase. Back then, HotSpot lacked a lot of the infrastructure needed to do this concurrently, so it took a few years of development to get there.
After reaching that initial 10ms goal, we re-aimed and set our target on something more ambitious: a GC pause should never be longer than 1ms. Starting with JDK 16, I’m happy to report that we’ve reached that goal too. ZGC now has O(1) pause times. In other words, they execute in constant time and do not increase with the heap, live-set, or root-set size (or anything else for that matter). Of course, we’re still at the mercy of the operating system scheduler to give GC threads CPU time. But as long as your system isn’t heavily over-subscribed, you can expect to see average GC pause times of around 0.05ms (50 µs) and max pause times of around 0.5ms (500 µs).
So, what did we do to get here? Well, prior to JDK 16, ZGC pause times still scaled with the size of (a subset of) the root-set. To be more precise, we still scanned thread stacks in a Stop-The-World phase. This meant that if a Java application had a large number of threads, then pause times would increase. The pause times would increase even more if those threads had deep call stacks. Starting with JDK 16, scanning of thread stacks is done concurrently, i.e. while the Java application continues to run.
As you can imagine, poking around in thread stacks, while threads are running, requires a bit of magic. This is accomplished by something called a Stack Watermark Barrier. In short, this is a mechanism that prevents a Java thread from returning into a stack frame without first checking if it’s safe to do so. This is an inexpensive check that is folded into the already existing safe-point check at method return. Conceptually you could think of it as a load barrier for a stack frame, which, if needed, will force the Java thread to take some form of action to bring the stack frame into a safe state before returning into it. Each Java thread has one or more stack watermarks, which tell the barrier how far down the stack it’s safe to walk without any special action. To walk past a watermark, a slow path is taken to bring one or more frames into the currently safe state, and the watermark is updated. The work to bring all thread stacks into a safe state is normally handled by one or more GC threads, but since this is done concurrently, Java threads will sometimes have to fix a few of their own frames, if they want to return into a frame that the GC hasn’t gotten to yet. If you’re interested in more details, please have a look at JEP 376: ZGC: Concurrent Thread-Stack Processing, which describes this work.
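To make the watermark idea a bit more concrete, here is a minimal, single-threaded toy model in Java. It is only a sketch of the concept described above, not HotSpot code: the class `WatermarkedStack` and all of its methods are hypothetical names, frames are reduced to indices, and "fixing" a frame is reduced to moving the watermark.

```java
/** Toy model of one thread stack guarded by a single watermark (illustration only). */
class WatermarkedStack {
    private int depth = 0;      // number of frames; indices run 0 (oldest) .. depth-1 (top)
    private int watermark = 0;  // frames with index >= watermark are in a safe state
    private int slowPaths = 0;  // how often the thread had to fix a frame itself

    /** Method call: a newly pushed frame is assumed to be created in a safe state. */
    void call() {
        depth++;
    }

    /** New GC phase (assumption): only the top frame is safe, everything below is unprocessed. */
    void startGcPhase() {
        watermark = Math.max(0, depth - 1);
    }

    /** A GC thread processes up to n more frames, lowering the watermark. */
    void gcProcessFrames(int n) {
        watermark = Math.max(0, watermark - n);
    }

    /** Method return: check the watermark before returning into the caller frame. */
    void ret() {
        if (depth == 0) {
            throw new IllegalStateException("no frame to return from");
        }
        depth--;                 // pop the callee frame
        int caller = depth - 1;  // frame we are about to return into (-1 if none)
        if (caller >= 0 && caller < watermark) {
            // Slow path: the GC has not reached this frame yet, so the thread
            // brings it into a safe state itself and updates the watermark.
            watermark = caller;
            slowPaths++;
        }
    }

    int depth() { return depth; }
    int watermark() { return watermark; }
    int slowPaths() { return slowPaths; }
}
```

In this model a return only takes the slow path when the caller frame lies below the watermark. Once the GC thread has lowered the watermark far enough, every return is just the cheap comparison.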
With JEP 376 in place, ZGC now scans exactly zero roots in Stop-The-World phases. For many workloads, you saw really low max pause times even before JDK 16. But if you ran on a large machine, and your workload had a large number of threads, you could still see max pause times well above 1ms. To visualize the improvement, here’s an example comparing JDK 15 and JDK 16, running SPECjbb®2015 on a large machine with a couple of thousand Java threads.
In JDK 16, ZGC got support for in-place relocation. This feature helps avoid OutOfMemoryError in situations where the GC needs to collect garbage when the heap is already filled to the brim. Normally, ZGC compacts the heap (and thereby frees up memory) by moving objects from sparsely populated heap regions into one or more empty heap regions where these objects can be densely packed. This strategy is simple and straightforward and lends itself very well to parallel processing. However, it has one drawback: it requires some amount of free memory (at least one empty heap region of each size type) to get the relocation process started. If the heap is full, i.e. all heap regions are already in use, then we have nowhere to move objects to.
Prior to JDK 16, ZGC solved this by having a heap reserve. This heap reserve was a set of heap regions that was set aside and made unavailable for normal allocations from Java threads. Instead only the GC itself was allowed to use the heap reserve when relocating objects. This ensured that empty heap regions were available, even if the heap was full from a Java thread’s perspective, to get the relocation process started. The heap reserve was typically a small fraction of the heap. In a previous blog post, I wrote about how we improved it in JDK 14 to better support tiny heaps.
Still, the heap reserve approach had a few problems. For example, since the heap reserve was not available to Java threads doing relocation, there was no hard guarantee that the relocation process could complete and hence the GC couldn’t reclaim (enough) memory. This was a non-issue for basically all normal workloads, but our testing revealed that it was possible to construct a program that provoked this problem, which in turn resulted in premature OutOfMemoryError. Also, setting aside some (though small) portion of the heap, just in case it was needed during relocation, was a waste of memory for most workloads.
Another approach to free up contiguous chunks of memory is to compact the heap in-place. Other HotSpot collectors (e.g. G1, Parallel and Serial) do some version of this when they do a so called Full GC. The main advantage of this approach is that it doesn’t need memory to free up memory. In other words, it will happily compact a full heap, without needing a heap reserve of some sort.
However, compacting the heap in-place also has some challenges and typically comes with an overhead. For example, the order in which objects are moved now matters a lot, as you otherwise risk overwriting not-yet-moved objects. This requires more coordination between GC threads, doesn’t lend itself as well to parallel processing, and also affects what Java threads can and can’t do when they relocate an object on behalf of the GC.
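The ordering constraint can be illustrated with a small sliding-compaction sketch. This is a toy model under stated assumptions, not ZGC's algorithm: the region is an `int[]`, each non-zero entry stands for one live object, `0` marks a free slot, and the names `InPlaceCompaction` and `compact` are made up for the example.

```java
/** Toy sliding compaction inside a single region (illustration only). */
class InPlaceCompaction {
    /**
     * Slides all non-zero entries toward index 0, preserving their order.
     * Returns the number of live entries, i.e. the index where free space begins.
     */
    static int compact(int[] region) {
        int dst = 0; // next free slot at the compacted end of the region
        // Walking in ascending address order guarantees dst <= src, so a move
        // can never overwrite an object that has not been moved yet.
        for (int src = 0; src < region.length; src++) {
            if (region[src] != 0) {
                if (src != dst) {
                    region[dst] = region[src];
                    region[src] = 0;
                }
                dst++;
            }
        }
        return dst;
    }
}
```

For example, compacting `{0, 7, 0, 0, 3, 9, 0, 4}` leaves `{7, 3, 9, 4, 0, 0, 0, 0}` and returns 4. Processing the same objects in an arbitrary order, or from several threads without coordination, would lose the `dst <= src` guarantee that makes each move safe.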
In summary, both approaches have advantages. Not relocating in-place typically performs better when empty heap regions are available, while relocating in-place can guarantee that the relocation process successfully completes even when no empty heap regions are available.
Starting with JDK 16, ZGC uses both approaches to get the best of both worlds. This allows us to get rid of the need for a heap reserve, while maintaining good relocation performance in the common case, and guaranteeing that relocation always successfully completes in the edge case. By default, ZGC will not relocate in-place as long as there is an empty heap region available to move objects to. Should that not be the case, ZGC will switch to relocating in-place. As soon as an empty heap region becomes available, ZGC will again switch back to not relocating in-place.
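The switching rule can be sketched as a small simulation. Everything here is an assumption made for illustration and does not mirror ZGC's actual data structures: pages have a fixed hypothetical capacity `PAGE_SIZE`, each page's live data fits within one page, a page that was compacted in-place is assumed to serve as the next target for the space it has left, and a page whose objects were moved out counts as empty again.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of choosing between normal and in-place relocation (illustration only). */
class RelocationModeSketch {
    enum Mode { NORMAL, IN_PLACE }

    static final int PAGE_SIZE = 100; // hypothetical page capacity, arbitrary units

    /** Returns the mode used for each page of the relocation set, given the live units per page. */
    static List<Mode> relocate(int[] livePerPage, int emptyPages) {
        List<Mode> modes = new ArrayList<>();
        int targetFree = 0; // free space left in the current target page
        for (int live : livePerPage) {
            if (live > targetFree && emptyPages > 0) {
                emptyPages--;           // take an empty page as the new target
                targetFree = PAGE_SIZE;
            }
            if (live <= targetFree) {
                targetFree -= live;     // move the objects into the target page
                emptyPages++;           // the source page is now empty and reusable
                modes.add(Mode.NORMAL);
            } else {
                // Nowhere to move the objects: compact them within their own page.
                // Assumption: the rest of that page then serves as the next target.
                targetFree = PAGE_SIZE - live;
                modes.add(Mode.IN_PLACE);
            }
        }
        return modes;
    }
}
```

Starting with no empty pages, `relocate(new int[] {30, 40, 50, 20}, 0)` yields `[IN_PLACE, NORMAL, NORMAL, NORMAL]`: the first page has to be compacted in-place, and the space this frees lets the remaining pages be relocated the normal way.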
Switching back and forth between these relocation modes happens seamlessly, and if needed, multiple times in the same GC cycle. However, most workloads will never run into a situation where the need to switch arises in the first place. But knowing that ZGC will cope with these situations well, and never throw a premature OutOfMemoryError because of failure to compact the heap, should give some peace of mind.
The ZGC logging was also extended to show how many heap regions (ZPages) of each size group (Small/Medium/Large) were relocated in-place. Here’s an example where 54MB worth of small objects were relocated, and 3 small pages needed to be relocated in-place.
...
GC(15) Small Pages: 120 / 240M, Empty: 0M, Relocated: 54M, In-Place: 3
GC(15) Medium Pages: 2 / 64M, Empty: 0M, Relocated: 0M, In-Place: 0
GC(15) Large Pages: 1 / 4M, Empty: 0M, Relocated: 0M, In-Place: 0
...
Allocation & Initialization of Forwarding Tables
When ZGC relocates an object, the new address of that object is recorded in a forwarding table, a data structure allocated outside of the Java heap. Each heap region selected to be part of the relocation set (the set of heap regions to compact to free up memory) gets a forwarding table associated with it.
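As a rough picture of what such a table does, here is a minimal open-addressing hash table in Java that maps an object's old offset to its new address. It is a sketch under assumptions, not ZGC's implementation: the real table lives outside the Java heap and is updated concurrently, whereas this version uses plain `long[]` arrays and is single-threaded. The class and method names are hypothetical, offsets are assumed to be non-negative, and at most `expectedEntries` distinct keys may be inserted.

```java
import java.util.Arrays;

/** Toy forwarding table: old offset -> new address (illustration only, single-threaded). */
class ForwardingTableSketch {
    static final long EMPTY = -1L; // sentinel, so valid offsets must be >= 0

    private final long[] fromKeys; // old offsets, EMPTY where the slot is unused
    private final long[] toAddrs;  // new addresses, parallel to fromKeys
    private final int mask;

    ForwardingTableSketch(int expectedEntries) {
        int capacity = 2;
        while (capacity < 2 * expectedEntries) {
            capacity <<= 1; // power of two, at most half full
        }
        fromKeys = new long[capacity];
        toAddrs = new long[capacity];
        mask = capacity - 1;
        Arrays.fill(fromKeys, EMPTY);
    }

    /** Records a new address unless one is already recorded; returns the recorded address. */
    long insert(long from, long to) {
        int i = slot(from);
        while (fromKeys[i] != EMPTY) {
            if (fromKeys[i] == from) {
                return toAddrs[i]; // already forwarded, keep the first entry
            }
            i = (i + 1) & mask;    // linear probing
        }
        fromKeys[i] = from;
        toAddrs[i] = to;
        return to;
    }

    /** Returns the new address for an old offset, or EMPTY if it has not been forwarded. */
    long lookup(long from) {
        int i = slot(from);
        while (fromKeys[i] != EMPTY) {
            if (fromKeys[i] == from) {
                return toAddrs[i];
            }
            i = (i + 1) & mask;
        }
        return EMPTY;
    }

    private int slot(long from) {
        return Long.hashCode(from) & mask;
    }
}
```

The insert-if-absent behaviour is a design choice for this sketch: when the same object is reported twice, the first recorded address is kept and returned to the second caller.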
Prior to JDK 16, the allocation and initialization of forwarding tables could take a significant part of the overall GC cycle time when the relocation set was very large. The size of the relocation set correlates with the number of objects moved during relocation. If you have, for example, a >100GB heap and the workload causes lots of fragmentation, with small holes evenly distributed across the heap, then the relocation set will be large and allocating and initializing its forwarding tables can take a while. Of course, this work has always been done in a concurrent phase, so it has never affected GC pause times. Still, there was room for improvement here.
In JDK 16, ZGC now allocates forwarding tables in bulk. Instead of making numerous calls (potentially many thousands) to allocate memory for each table, we now do a single call to allocate all memory needed by all tables in one go. This helps avoid typical allocation overheads and potential lock contention, and significantly reduces the time it takes to allocate these tables.
The initialization of these tables was another bottleneck. The forwarding table is a hash table, so initializing it means setting up a small header and zeroing out a (potentially large) array of forwarding table entries. Starting with JDK 16, ZGC now does this initialization in parallel using multiple threads, instead of with a single thread.
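The two ideas, one bulk allocation and parallel initialization, can be sketched in Java as follows. This is an analogy rather than ZGC's code: a single `long[]` stands in for the one native allocation, each table is a slice of it, and a parallel stream stands in for the GC worker threads. Because Java already zeroes new arrays, the sketch initializes entries to a hypothetical `EMPTY` sentinel so that there is real work to distribute. All names are made up for the example.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Toy bulk allocation and parallel initialization of many tables (illustration only). */
class BulkForwardingAllocation {
    static final long EMPTY = -1L; // hypothetical "unused entry" marker

    final long[] backing; // one allocation shared by all tables
    final int[] offsets;  // table i occupies backing[offsets[i] .. offsets[i + 1])

    BulkForwardingAllocation(int[] entriesPerTable) {
        int n = entriesPerTable.length;
        int[] off = new int[n + 1];
        for (int i = 0; i < n; i++) {
            off[i + 1] = Math.addExact(off[i], entriesPerTable[i]); // fail loudly on overflow
        }
        // A single allocation covers every table, instead of one allocation per table.
        long[] mem = new long[off[n]];
        // Each table's slice is initialized independently, so the work can be
        // spread across threads without any coordination between them.
        IntStream.range(0, n)
                 .parallel()
                 .forEach(i -> Arrays.fill(mem, off[i], off[i + 1], EMPTY));
        this.backing = mem;
        this.offsets = off;
    }
}
```

Since the slices do not overlap, no two threads ever write to the same entry, which is what makes the initialization step easy to parallelize.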
In summary, these changes significantly reduced the time it takes to allocate and initialize forwarding tables, especially when collecting very large heaps that are sparsely populated, where the reduction can be on the order of one or two orders of magnitude.
With concurrent thread-stack scanning, ZGC now has pause times in the microsecond domain, with average pause times of ~50µs and max pause times of ~500µs. Pause times are unaffected by the heap, live-set and root-set size.
The heap reserve is now gone, and ZGC will instead do in-place relocation when needed. This saves memory, but also guarantees that the heap can be successfully compacted in all situations.
Forwarding tables are now allocated and initialized more efficiently, which shortens the time it takes to complete a GC cycle, especially when collecting large heaps that are sparsely populated.