The following subscription-only content has been made available to you by an LWN subscriber. Thousands of subscribers depend on LWN for the best news from the Linux and free software communities. If you enjoy this article, please consider accepting the trial offer on the right. Thank you for visiting LWN.net!
The Meltdown CPU vulnerability, first disclosed in early January, was frightening because it allowed unprivileged attackers to easily read arbitrary memory in the system. Spectre, disclosed at the same time, was harder to exploit but made it possible for guests running in virtual machines to attack the host system and other guests. Both vulnerabilities have been mitigated to some extent (though it will take a long time to even find all of the Spectre vulnerabilities, much less protect against them). But now the newly disclosed "L1 terminal fault" (L1TF) vulnerability (also going by the name Foreshadow) brings back both threats: relatively easy attacks against host memory from inside a guest. Mitigations are available (and have been merged into the mainline kernel), but they will be expensive for some users.
Understanding L1TF requires an understanding of the x86 page-table entry (PTE) format. Remember that, in a virtual-memory system, the memory addresses used by both user space and the kernel do not point directly into physical memory. Instead, the hierarchical page-table structure is used to translate between virtual and physical addresses. At the bottom level of this structure, the PTE tells the processor whether the page is actually present in physical memory, where it is, and a few other details. It looks like this for a 4KB page on an x86-64 system:
The page-frame number (PFN) tells the processor where to find the page in physical memory. The other bits control which memory protection key is assigned to the page, access permissions, whether and how the page is cached, whether it is dirty, and more. All of this, though, depends on the present ("P") bit in the least-significant position. If that bit is not set, the page is not actually present in physical memory, and any attempt to reference it will generate a page fault.
For non-present pages, none of the other bits in the page-table entry are meant to be used by the processor, so the kernel can use those bits to store useful information; for example, for pages that have been swapped out, the location in the swap area is stored in the PTE. In other cases, the data left in non-present PTEs is essentially random.
Ignoring the present bit
If the present bit in a given PTE is not set, the PFN number field of that PTE has no defined meaning and the CPU has no business trying to use it. So, naturally, Intel CPUs do exactly that during speculative execution (it would appear that Intel is the only vendor to make this particular mistake). During speculative execution, non-present PTEs are treated as if they were valid, so non-present PTEs can be used to speculatively read whatever data lives in the indicated PFN — but, importantly, only if that data is in the processor's L1 cache. The access is speculative only; the processor will eventually notice that the page is not actually present and generate a page fault instead. But, by the time that happens, the usual sorts of covert channels can be used to exfiltrate the data in whatever page the PTE might have pointed to.
Since this attack goes directly to a physical address, it can in theory read any memory in the system. Notably, that includes data kept within an SGX encrypted enclave, which is supposed to be protected from this kind of thing.
Exploiting this vulnerability requires the ability to run code on the target system. Even then, on its face, this bug is somewhat hard to exploit. Attackers cannot directly create non-present PTEs pointing to a page of interest, so they must depend on such PTEs already existing in their address space. By filling the address space with pages that will eventually get reclaimed or by playing tricks with PROT_NONE mappings, an attacker can essentially throw darts at the system and hope that one hits in an interesting place, but it's a non-deterministic process where it's even hard to tell if one has succeeded.
Nonetheless, the potential for the extraction of important secrets exists, and thus this bug must be defended against. The approach taken here is to simply invert all of the bits in a PTE when it is marked as being not present; that will cause that PTE to point into a nonexistent region of memory. The fix is easy, and the performance cost is almost zero. A quick kernel upgrade, and this problem is solved.
At least, the problem is solved on systems where virtualization is not in use. On systems with virtualized guests then, at a minimum, those guests must also run a kernel using the PTE-inversion technique to protect against attacks. If guests are trusted, or if they cannot install their own kernels, the problem stops here.
But if the system is running with untrusted guests and, in particular, if that system allows those guests to provide their own kernels (as many hosting services do), the situation changes. An attacker can then run a kernel that creates arbitrary non-present PTEs on demand, turning a shot-in-the-dark attack into something that can be targeted with precision. To make an attacker's life even easier, the speculative data reference bypasses the extended page tables in the guest, allowing direct access to physical memory. So an attacker who can install a kernel in a guest instance can attack the host (or other guests) with relative ease. In this context, L1TF can be seen as a limited form of Meltdown that can escape virtualization.
Protecting against hostile guests is a harder task, and the correct answer will depend on the specifics of the workload being run. The first step is to take advantage of the fact that L1TF can only read data that is in the processor's L1 cache. If that cache is cleared every time the kernel transfers control to a virtual machine, there will be no data available for the attacker to read. That is indeed what the kernel will do. This mitigation will be rather more costly, needless to say; how much it costs will depend on the workload. On systems where entries into (and exits from) guests are relatively rare, the cost will be low. On systems where those events are common, the cost could approach a 50% performance hit.
Unfortunately, just clearing the L1 cache is not a complete solution if the CPU is running symmetric multi-threading (SMT or "hyperthreads"). The threads running on that processor share the L1 cache. So, while the hostile guest is running in one thread, an unrelated process could be repopulating the L1 cache with interesting data in the other thread. That clearly reopens the can of worms.
The obvious solution here is to disable SMT, which can potentially protect against other security issues as well. But that clearly comes with a significant performance cost of it own. It is not as bad as simply removing half of the system's processors, but, in a virtual sense, that is exactly what is happening. An alternative is to use CPU affinities to restrict guests to specific processors and to not allow anything else (including, for example, kernel functionality like interrupt handling) to run on those processors. This approach might gain back some performance for specific workloads, but it clearly requires a lot of administrator knowledge about what those workloads are and a lot of manual configuration. It also seems somewhat error-prone.
There is another approach that can be taken to protect hosts from hostile guests: rather than do all of the above, simply disable the use of the extended page-table feature. That forces the system back to the older "shadow page table" mechanism, where the hypervisor retains the ultimate control over all PTEs. This, too, will slow things down significantly, but it provides complete protection since the attacker is no longer able to create non-present PTEs pointing to pages of interest.
As an aside, it's worth pointing out an interesting implication of this vulnerability. Virtualization is generally seen as being more secure than containers due to the extra level of isolation used. But, as we see here, virtualization also requires an extra level of processor complexity that can be the source of security problems in its own right. Systems running container workloads will be only lightly affected by L1TF, while those running virtualization will pay a heavy cost.
Kernel settingsPatched kernels will perform the inversion on non-present PTEs automatically. Since there is no real cost to this technique, there is no reason (and no ability) to turn it off. The flushing of the L1 cache on entry to virtual guests will be done if extended page tables are enabled. The disabling of SMT, though, will not be done by default; administrators of systems running untrusted guests will have to examine the tradeoffs and decide what the best approach is to protect their systems. For people faced with this kind of choice, some more information can be found in Documentation/admin-guide/l1tf.rst.
The 4.19 kernel will contain the mitigations, of course. As of this writing, the 4.18.1, 4.17.15, 4.14.63, 4.9.120, and 4.4.148 updates, containing the fixes, are in the review process with release planned on August 16.
As was the case with the previous rounds, the mitigations for L1TF were
worked out under strict embargo. The process appears to have worked a
little better this time around, with no real leakage of information to
force an early disclosure. One can only wonder how many more of these are
known and under embargo now — and how many are yet to be discovered. It
seems likely that we will be contending with speculative-execution
vulnerabilities for some time yet.
Did you like this article? Please accept our trial subscription offer to be able to see more content like it and to participate in the discussion.
(Log in to post comments)