Exploiting null-dereferences in the Linux kernel

Exploiting null-dereferences in the Linux kernel

Original text by Seth Jenkins, Project Zero

For a fair amount of time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap’ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.

Kernel oops overview

At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops — instead the kernel tries to recover as best as it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead which calls do_exit. The kernel will also publish in dmesg a “crash” log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred — however the intention is to allow kernel bugs to more easily be detectable and loggable under the philosophy that a working system is much easier to debug than a dead one.

The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any associated cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were locked at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated remains allocated, etc. However, the process that generated the oops, its associated kernel stack, task struct and derivative members etc. can and often will be freed, meaning that depending on the precise circumstances of the oops, it’s possible that no memory is actually leaked. This becomes particularly important in regards to exploitation later.

Reference count mismanagement overview

Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly doesn’t decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important however, that each increment of the refcount allocates little or no physical memory. Even a single byte allocation is quite expensive if it must be performed 232 times.

Example null-deref bug

When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let’s look at an example — an otherwise unrelated bug I coincidentally discovered in the very recent past:

static int show_smaps_rollup(struct seq_file *m, void *v)
{
        struct proc_maps_private *priv = m->private;
        struct mem_size_stats mss;
        struct mm_struct *mm;
        struct vm_area_struct *vma;
        unsigned long last_vma_end = 0;
        int ret = 0;
        priv->task = get_proc_task(priv->inode); //task reference taken
        if (!priv->task)
                return -ESRCH;
        mm = priv->mm; //With no vma's, mm->mmap is NULL
        if (!mm || !mmget_not_zero(mm)) { //mm reference taken
                ret = -ESRCH;
                goto out_put_task;
        }
        memset(&mss, 0, sizeof(mss));
        ret = mmap_read_lock_killable(mm); //mmap read lock taken
        if (ret)
                goto out_put_mm;
        hold_task_mempolicy(priv);
        for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
                smap_gather_stats(vma, &mss);
                last_vma_end = vma->vm_end;
        }
        show_vma_header_prefix(m, priv->mm->mmap->vm_start,last_vma_end, 0, 0, 0, 0); //the deref of mmap causes a kernel oops here
        seq_pad(m, ' ');
        seq_puts(m, "[rollup]\n");
        __show_smap(m, &mss, true);
        release_task_mempolicy(priv);
        mmap_read_unlock(mm);
out_put_mm:
        mmput(mm);
out_put_task:
        put_task_struct(priv->task);
        priv->task = NULL;
        return ret;
}

This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report reveals a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMA’s mapped at all, the task’s mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply read’ing /proc/[pid]/smaps_rollup on a task with no VMA’s (which itself can be stably created via ptrace):

This kernel oops will mean that the following events occur:

  1. The associated struct file will have a refcount leaked if fdget took a refcount (we’ll try and make sure this doesn’t happen later)
  2. The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
  3. The task struct associated with the smaps_rollup file will have a refcount leaked
  4. The mm_struct’s mm_users refcount associated with the task will be leaked
  5. The mm_struct’s mmap lock will be permanently readlocked (any future write-lock attempts will hang forever)

Each of these conditions is an unintentional side-effect that leads to buggy behaviors, but not all of those behaviors are useful to an attacker. The permanent locking of events 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t which prevents the overflow condition. This leaves condition 4. 


The mm_users refcount still uses an overflow-unsafe atomic_t and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple important roadblocks we need to avoid in order to repeatedly leak this refcount:

  1. We cannot call this syscall from the task with the empty vma list itself — in other words, we can’t call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
  2. We must re-open the smaps_rollup file every time because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex lock which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
  3. If we access the mm through the same pid every time, we will run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks so the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM
  4. We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn’t difficult, just don’t read the file from a multi-threaded process.

Our final refcount leaking overflow strategy is as follows:

  1. Process A forks a process B
  2. Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won’t go away (but rather will enter tracing stop)
  3. Proces B clones with CLONE_VM | CLONE_PTRACE another process C
  4. Process B munmap’s its entire virtual memory address space — this also unmaps process C’s virtual memory address space.
  5. Process A forks new children D and E which will access (B|C)’s smaps_rollup file respectively
  6. (D|E) opens (B|C)’s smaps_rollup file and performs a read which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops
  7. Process A goes back to step 5 ~232 times

The above strategy can be rearchitected to run in parallel (across processes not threads, because of roadblock 4) and improve performance. On server setups that print kernel logging to a serial console, generating 232 kernel oopses takes over 2 years. However on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and be set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.

Exploitation

Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition:

static inline void __mmput(struct mm_struct *mm)
{
        VM_BUG_ON(atomic_read(&mm->mm_users));
        uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm);
        exit_mmap(mm);
        mm_put_huge_zero_page(mm);
        set_mm_exe_file(mm, NULL);
        if (!list_empty(&mm->mmlist)) {
                spin_lock(&mmlist_lock);
                list_del(&mm->mmlist);
                spin_unlock(&mmlist_lock);
        }
        if (mm->binfmt)
                module_put(mm->binfmt->module);
        lru_gen_del_mm(mm);
        mmdrop(mm);
}

Unfortunately, since 64591e8605 (“mm: protect free_pgtables with mmap_lock write lock in exit_mmap”), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm’s mmap_lock is permanently readlocked many times, any calls to __mmput will manifest as a permanent deadlock inside of exit_mmap.

However, before the call permanently deadlocks, it will call several other functions:

  1. uprobe_clear_state
  2. exit_aio
  3. ksm_exit
  4. khugepaged_exit

Additionally, we can call __mmput on this mm from several tasks simultaneously by having each of them trigger an mmget/mmput on the mm, generating irregular race conditions. Under normal execution, it should not be possible to trigger multiple __mmput’s on the same mm (much less concurrent ones) as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, all mmget/mmput’s on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.

This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:

void exit_aio(struct mm_struct *mm)
{
        struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
        struct ctx_rq_wait wait;
        int i, skipped;
        if (!table)
                return;
        atomic_set(&wait.count, table->nr);
        init_completion(&wait.comp);
        skipped = 0;
        for (i = 0; i < table->nr; ++i) {
                struct kioctx *ctx =
                rcu_dereference_protected(table->table[i], true);
                if (!ctx) {
                        skipped++;
                        continue;
                }
                ctx->mmap_size = 0;
                kill_ioctx(mm, ctx, &wait);
        }
        if (!atomic_sub_and_test(skipped, &wait.count)) {
                /* Wait until all IO for the context are done. */
                wait_for_completion(&wait.comp);
        }
        RCU_INIT_POINTER(mm->ioctx_table, NULL);
        kfree(table);
}

While the callee function kill_ioctx is written in such a way to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called in a concurrent way), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function, while only being freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio’s internal context freeing loop. Successful exploitation will trigger the following kernel BUG indicating that a double free has occurred:

Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant as the memory corruption primitive has already occurred before the deadlock occurs. Exploiting the resultant primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition of the reclaimed allocation. It is undoubtedly possible, although I didn’t take this all the way to a completed PoC.

Conclusion

While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is nevertheless highly beneficial for security researchers to carefully evaluate the side-effects of bugs discovered in the future that are similarly “harmless” and ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.

Exploiting CVE-2022-42703 — Bringing back the stack attack

Exploiting CVE-2022-42703 - Bringing back the stack attack

Original text by Seth Jenkins, Project Zero

This blog post details an exploit for CVE-2022-42703 (P0 issue 2351 — Fixed 5 September 2022), a bug Jann Horn found in the Linux kernel’s memory management (MM) subsystem that leads to a use-after-free on struct anon_vma. As the bug is very complex (I certainly struggle to understand it!), a future blog post will describe the bug in full. For the time being, the issue tracker entry, this LWN article explaining what an anon_vma is and the commit that introduced the bug are great resources in order to gain additional context.

Setting the scene

Successfully triggering the underlying vulnerability causes folio->mapping to point to a freed anon_vma object. Calling madvise(…, MADV_PAGEOUT)can then be used to repeatedly trigger accesses to the freed anon_vma in folio_lock_anon_vma_read():

struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
					  struct rmap_walk_control *rwc)
{
	struct anon_vma *anon_vma = NULL;
	struct anon_vma *root_anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	if (!folio_mapped(folio))
		goto out;

	// anon_vma is dangling pointer
	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	// root_anon_vma is read from dangling pointer
	root_anon_vma = READ_ONCE(anon_vma->root);
	if (down_read_trylock(&root_anon_vma->rwsem)) {
[...]
		if (!folio_mapped(folio)) { // false
[...]
		}
		goto out;
	}

	if (rwc && rwc->try_lock) { // true
		anon_vma = NULL;
		rwc->contended = true;
		goto out;
	}
[...]
out:
	rcu_read_unlock();
	return anon_vma; // return dangling pointer
}

One potential exploit technique is to let the function return the dangling anon_vma pointer and try to make the subsequent operations do something useful. Instead, we chose to use the down_read_trylock() call within the function to corrupt memory at a chosen address, which we can do if we can control the root_anon_vma pointer that is read from the freed anon_vma.

Controlling the root_anon_vma pointer means reclaiming the freed anon_vma with attacker-controlled memory. struct anon_vma structures are allocated from their own kmalloc cache, which means we cannot simply free one and reclaim it with a different object. Instead we cause the associated anon_vma slab page to be returned back to the kernel page allocator by following a very similar strategy to the one documented here. By freeing all the anon_vma objects on a slab page, then flushing the percpu slab page partial freelist, we can cause the virtual memory previously associated with the anon_vma to be returned back to the page allocator. We then spray pipe buffers in order to reclaim the freed anon_vma with attacker controlled memory.

At this point, we’ve discussed how to turn our use-after-free into a down_read_trylock() call on an attacker-controlled pointer. The implementation of down_read_trylock() is as follows:

struct rw_semaphore {
	atomic_long_t count;
	atomic_long_t owner;
	struct optimistic_spin_queue osq; /* spinner MCS lock */
	raw_spinlock_t wait_lock;
	struct list_head wait_list;
};

...

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	long tmp;

	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

	tmp = atomic_long_read(&sem->count);
	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    tmp + RWSEM_READER_BIAS)) {
			rwsem_set_reader_owned(sem);
			return 1;
		}
	}
	return 0;
}

It was helpful to emulate the down_read_trylock() in unicorn to determine how it behaves when given different sem->count values. Assuming this code is operating on inert and unchanging memory, it will increment sem->count by 0x100 if the 3 least significant bits and the most significant bit are all unset. That means it is difficult to modify a kernel pointer and we cannot modify any non 8-byte aligned values (as they’ll have one or more of the bottom three bits set). Additionally, this semaphore is later unlocked, causing whatever write we perform to be reverted in the imminent future. Furthermore, at this point we don’t have an established strategy for determining the KASLR slide nor figuring out the addresses of any objects we might want to overwrite with our newfound primitive. It turns out that regardless of any randomization the kernel presently has in place, there’s a straightforward strategy for exploiting this bug even given such a constrained arbitrary write.

Stack corruption…

On x86-64 Linux, when the CPU performs certain interrupts and exceptions, it will swap to a respective stack that is mapped to a static and non-randomized virtual address, with a different stack for the different exception types. A brief documentation of those stacks and their parent structure, the cpu_entry_area, can be found here. These stacks are most often used on entry into the kernel from userland, but they’re used for exceptions that happen in kernel mode as well. We’ve recently seen KCTF entries where attackers take advantage of the non-randomized cpu_entry_area stacks in order to access data at a known virtual address in kernel accessible memory even in the presence of SMAP and KASLR. You could also use these stacks to forge attacker-controlled data at a known kernel virtual address. This works because the attacker task’s general purpose register contents are pushed directly onto this stack when the switch from userland to kernel mode occurs due to one of these exceptions. This also occurs when the kernel itself generates an Interrupt Stack Table exception and swaps to an exception stack — except in that case, kernel GPR’s are pushed instead. These pushed registers are later used to restore kernel state once the exception is handled. In the case of a userland triggered exception, register contents are restored from the task stack.

One example of an IST exception is a DB exception which can be triggered by an attacker via a hardware breakpoint, the associated registers of which are described here. Hardware breakpoints can be triggered by a variety of different memory access types, namely reads, writes, and instruction fetches. These hardware breakpoints can be set using ptrace(2), and are preserved during kernel mode execution in a task context such as during a syscall. That means that it’s possible for an attacker-set hardware breakpoint to be triggered in kernel mode, e.g. during a copy_to/from_user call. The resulting exception will save and restore the kernel context via the aforementioned non-randomized exception stack, and that kernel context is an exceptionally good target for our arbitrary write primitive.

Any of the registers that copy_to/from_user is actively using at the time it handles the hardware breakpoint are corruptible by using our arbitrary-write primitive to overwrite their saved values on the exception stack. In this case, the size of the copy_user call is the intuitive target. The size value is consistently stored in the rcx register, which will be saved at the same virtual address every time the hardware breakpoint is hit. After corrupting this saved register with our arbitrary write primitive, the kernel will restore rcx from the exception stack once it returns back to copy_to/from_user. Since rcx defines the number of bytes copy_user should copy, this corruption will cause the kernel to illicitly copy too many bytes between userland and the kernel.

…begets stack corruption

The attack strategy starts as follows:

  1. Fork a process Y from process X.
  2. Process X ptraces process Y, then sets a hardware breakpoint at a known virtual address [addr] in process Y.
  3. Process Y makes a large number of calls to uname(2), which calls copy_to_user from a kernel stack buffer to [addr]. This causes the kernel to constantly trigger the hardware watchpoint and enter the DB exception handler, using the DB exception stack to save and restore copy_to_user state
  4. Simultaneously make many arbitrary writes at the known location of the DB exception stack’s saved rcx value, which is Process Y’s copy_to_user’s saved length.

The DB exception stack is used rarely, so it’s unlikely that we corrupt any unexpected kernel state via a spurious DB exception while spamming our arbitrary write primitive. The technique is also racy, but missing the race simply means corrupting stale stack-data. In that case, we simply try again. In my experience, it rarely takes more than a few seconds to win the race successfully.

Upon successful corruption of the length value, the kernel will copy much of the current task’s stack back to userland, including the task-local stack cookie and return addresses. We can subsequently invert our technique and attack a copy_from_user call instead. Instead of copying too many bytes from the kernel task stack to userland, we elicit the kernel to copy too many bytes from userland to the kernel task stack! Again we use a syscall, prctl(2), that performs a copy_from_user call to a kernel stack buffer. Now by corrupting the length value, we generate a stack buffer overflow condition in this function where none previously existed. Since we’ve already leaked the stack cookie and the KASLR slide, it is trivially easy to bypass both mitigations and overwrite the return address.

Completing a ROP chain for the kernel is left as an exercise to the reader.

Fetching the KASLR slide with prefetch

Upon reporting this bug to the Linux kernel security team, our suggestion was to start randomizing the location of the percpu cpu_entry_area (CEA), and consequently the associated exception and syscall entry stacks. This is an effective mitigation against remote attackers but is insufficient to prevent a local attacker from taking advantage. 6 years ago, Daniel Gruss et al. discovered a new more reliable technique for exploiting the TLB timing side channel in x86 CPU’s. Their results demonstrated that prefetch instructions executed in user mode retired at statistically significant different latencies depending on whether the requested virtual address to be prefetched was mapped vs unmapped, even if that virtual address was only mapped in kernel mode. kPTI was helpful in mitigating this side channel, however, most modern CPUs now have innate protection for Meltdown, which kPTI was specifically designed to address, and thusly kPTI (which has significant performance implications) is disabled on modern microarchitectures. That decision means it is once again possible to take advantage of the prefetch side channel to defeat not only KASLR, but also the CPU entry area randomization mitigation, preserving the viability of the CEA stack corruption exploit technique against modern X86 CPUs.

There are surprisingly few fast and reliable examples of this prefetch KASLR bypass technique available in the open source realm, so I made the decision to write one.

Implementation

The meat of implementing this technique effectively is in serially reading the processor’s time stamp counter before and after performing a prefetch. Daniel Gruss helpfully provided highly effective and open source code for doing just that. The only edit I made (as suggested by Jann Horn) was to swap to using lfence instead of cpuid as the serializing instruction, as cpuid is emulated in VM environments. It also became apparent in practice that there was no need to perform any cache-flushing routines in order to witness the side-channel effect. It is simply enough to time every prefetch attempt.

Generating prefetch timings for all 512 possible KASLR slots yields quite a bit of fuzzy data in need of analyzing. To minimize noise, multiple samples of each tested address are taken, and the minimum value from that set of samples is used in the results as the representative value for an address. On the Tiger Lake CPU this test was primarily performed on, no more than 16 samples per slot were needed to generate exceptionally reliable results. Low-resolution minimum prefetch time slot identification narrows down the area to search in while avoiding false positives for the higher resolution edge-detection code which finds the precise address at which prefetch dramatically drops in run-time. The result of this effort is a PoC which can correctly identify the KASLR slide on my local machine with 99.999% accuracy (95% accuracy in a VM) while running faster than it takes to grep through kallsyms for the kernel base address:

This prefetch code does indeed work to find the locations of the randomized CEA regions in Peter Ziljstra’s proposed patch. However, the journey to that point results in code that demonstrates another deeply significant issue — KASLR is comprehensively compromised on x86 against local attackers, and has been for the past several years, and will be for the indefinite future. There are presently no plans in place to resolve the myriad microarchitectural issues that lead to side channels like this one. Future work is needed in this area in order to preserve the integrity of KASLR, or alternatively, it is probably time to accept that KASLR is no longer an effective mitigation against local attackers and to develop defensive code and mitigations that accept its limitations.

Conclusion

This exploit demonstrates a highly reliable and agnostic technique that can allow a broad spectrum of uncontrolled arbitrary write primitives to achieve kernel code execution on x86 platforms. While it is possible to mitigate this exploit technique from a remote context, an attacker in a local context can utilize known microarchitectural side-channels to defeat the current mitigations. Additional work in this area might be valuable to continue to make exploitation more difficult, such as performing in-stack randomization so that the stack offset of the saved state changes on every taken IST exception. For now however, this remains a viable and powerful exploit strategy on x86 Linux.

Linux Kernel: Exploiting a Netfilter Use-after-Free in kmalloc-cg

Linux Kernel: Exploiting a Netfilter Use-after-Free in kmalloc-cg

Original text by Sergi Martinez

Overview

It’s been a while since our last technical blogpost, so here’s one right on time for the Christmas holidays. We describe a method to exploit a use-after-free in the Linux kernel when objects are allocated in a specific slab cache, namely the 

kmalloc-cg
 series of SLUB caches used for cgroups. This vulnerability is assigned CVE-2022-32250 and exists in Linux kernel versions 5.18.1 and prior.

The use-after-free vulnerability in the Linux kernel netfilter subsystem was discovered by NCC Group’s Exploit Development Group (EDG). They published a very detailed write-up with an in-depth analysis of the vulnerability and an exploitation strategy that targeted Linux Kernel version 5.13. Additionally, Theori published their own analysis and exploitation strategy, this time targetting the Linux Kernel version 5.15. We strongly recommend having a thorough read of both articles to better understand the vulnerability prior to reading this post, which almost exclusively focuses on an exploitation strategy that works on the latest vulnerable version of the Linux kernel, version 5.18.1.

The aforementioned exploitation strategies are different from each other and from the one detailed here since the targeted kernel versions have different peculiarities. In version 5.13, allocations performed with either the 

GFP_KERNEL
 flag or the 
GFP_KERNEL_ACCOUNT
 flag are served by the 
kmalloc-*
 slab caches. In version 5.15, allocations performed with the 
GFP_KERNEL_ACCOUNT
 flag are served by the 
kmalloc-cg-*
 slab caches. While in both 5.13 and 5.15 the affected object, 
nft_expr,
 is allocated using 
GFP_KERNEL,&nbsp;
the difference in exploitation between them arises because a commonly used heap spraying object, the System V message structure (
struct msg_msg)
, is served from 
kmalloc-*
 in 5.13 but from 
kmalloc-cg-*
 in 5.15. Therefore, in 5.15, 
struct msg_msg
 cannot be used to exploit this vulnerability.

In 5.18.1, the object involved in the use-after-free vulnerability, 

nft_expr,&nbsp;
is itself allocated with 
GFP_KERNEL_ACCOUNT
 in the 
kmalloc-cg-*
 slab caches. Since the exploitation strategies presented by the NCC Group and Theori rely on objects allocated with  
GFP_KERNEL,&nbsp;
they do not work against the latest vulnerable version of the Linux kernel.

The subject of this blog post is to present a strategy that works on the latest vulnerable version of the Linux kernel.

Vulnerability

Netfilter sets can be created with a maximum of two associated expressions that have the 

NFT_EXPR_STATEFUL
 flag. The vulnerability occurs when a set is created with an associated expression that does not have the 
NFT_EXPR_STATEFUL
 flag, such as the 
dynset
 and 
lookup
 expressions. These two expressions have a reference to another set for updating and performing lookups, respectively. Additionally, to enable tracking, each set has a bindings list that specifies the objects that have a reference to them.

During the allocation of the associated 

dynset
 or 
lookup
 expression objects, references to the objects are added to the bindings list of the referenced set. However, when the expression associated to the set does not have the 
NFT_EXPR_STATEFUL
 flag, the creation is aborted and the allocated expression is destroyed. The problem occurs during the destruction process where the bindings list of the referenced set is not updated to remove the reference, effectively leaving a dangling pointer to the freed expression object. Whenever the set containing the dangling pointer in its bindings list is referenced again and its bindings list has to be updated, a use-after-free condition occurs.

Exploitation

Before jumping straight into exploitation details, first let’s see the definition of the structures involved in the vulnerability: 

nft_set
nft_expr
nft_lookup
, and 
nft_dynset
.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L502

struct nft_set {
        struct list_head           list;                 /*     0    16 */
        struct list_head           bindings;             /*    16    16 */
        struct nft_table *         table;                /*    32     8 */
        possible_net_t             net;                  /*    40     8 */
        char *                     name;                 /*    48     8 */
        u64                        handle;               /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u32                        ktype;                /*    64     4 */
        u32                        dtype;                /*    68     4 */
        u32                        objtype;              /*    72     4 */
        u32                        size;                 /*    76     4 */
        u8                         field_len[16];        /*    80    16 */
        u8                         field_count;          /*    96     1 */

        /* XXX 3 bytes hole, try to pack */

        u32                        use;                  /*   100     4 */
        atomic_t                   nelems;               /*   104     4 */
        u32                        ndeact;               /*   108     4 */
        u64                        timeout;              /*   112     8 */
        u32                        gc_int;               /*   120     4 */
        u16                        policy;               /*   124     2 */
        u16                        udlen;                /*   126     2 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        unsigned char *            udata;                /*   128     8 */

        /* XXX 56 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        const struct nft_set_ops  * ops __attribute__((__aligned__(64))); /*   192     8 */
        u16                        flags:14;             /*   200: 0  2 */
        u16                        genmask:2;            /*   200:14  2 */
        u8                         klen;                 /*   202     1 */
        u8                         dlen;                 /*   203     1 */
        u8                         num_exprs;            /*   204     1 */

        /* XXX 3 bytes hole, try to pack */

        struct nft_expr *          exprs[2];             /*   208    16 */
        struct list_head           catchall_list;        /*   224    16 */
        unsigned char              data[] __attribute__((__aligned__(8))); /*   240     0 */

        /* size: 256, cachelines: 4, members: 29 */
        /* sum members: 176, holes: 3, sum holes: 62 */
        /* sum bitfield members: 16 bits (2 bytes) */
        /* padding: 16 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
} __attribute__((__aligned__(64)));

The 

nft_set
 structure represents an nftables set, a built-in generic infrastructure of nftables that allows using any supported selector to build sets, which makes possible the representation of maps and verdict maps (check the corresponding nftables wiki entry for more details).

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L347

/**
 *	struct nft_expr - nf_tables expression
 *
 *	@ops: expression ops
 *	@data: expression private data
 */
struct nft_expr {
	const struct nft_expr_ops	*ops;
	unsigned char			data[]
		__attribute__((aligned(__alignof__(u64))));
};

The 

nft_expr
 structure is a generic container for expressions. The specific expression data is stored within its 
data
 member. For this particular vulnerability the relevant expressions are 
nft_lookup
 and 
nft_dynset
, which are used to perform lookups on sets or update dynamic sets respectively.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nft_lookup.c#L18

struct nft_lookup {
        struct nft_set *           set;                  /*     0     8 */
        u8                         sreg;                 /*     8     1 */
        u8                         dreg;                 /*     9     1 */
        bool                       invert;               /*    10     1 */

        /* XXX 5 bytes hole, try to pack */

        struct nft_set_binding     binding;              /*    16    32 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 48, cachelines: 1, members: 5 */
        /* sum members: 43, holes: 1, sum holes: 5 */
        /* paddings: 1, sum paddings: 4 */
        /* last cacheline: 48 bytes */
};

nft_lookup
 expressions have to be bound to a given set on which the lookup operations are performed.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nft_dynset.c#L15

struct nft_dynset {
        struct nft_set *           set;                  /*     0     8 */
        struct nft_set_ext_tmpl    tmpl;                 /*     8    12 */

        /* XXX last struct has 1 byte of padding */

        enum nft_dynset_ops        op:8;                 /*    20: 0  4 */

        /* Bitfield combined with next fields */

        u8                         sreg_key;             /*    21     1 */
        u8                         sreg_data;            /*    22     1 */
        bool                       invert;               /*    23     1 */
        bool                       expr;                 /*    24     1 */
        u8                         num_exprs;            /*    25     1 */

        /* XXX 6 bytes hole, try to pack */

        u64                        timeout;              /*    32     8 */
        struct nft_expr *          expr_array[2];        /*    40    16 */
        struct nft_set_binding     binding;              /*    56    32 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 88, cachelines: 2, members: 11 */
        /* sum members: 81, holes: 1, sum holes: 6 */
        /* sum bitfield members: 8 bits (1 bytes) */
        /* paddings: 2, sum paddings: 5 */
        /* last cacheline: 24 bytes */
};

nft_dynset
 expressions have to be bound to a given set on which the add, delete, or update operations will be performed.

When a given 

nft_set
 has expressions bound to it, they are added to the 
nft_set.bindings
 double linked list. A visual representation of an 
nft_set
 with 2 expressions is shown in the diagram below.

The 

binding
 member of the 
nft_lookup
 and 
nft_dynset
 expressions is defined as follows:

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L576

/**
 *	struct nft_set_binding - nf_tables set binding
 *
 *	@list: set bindings list node
 *	@chain: chain containing the rule bound to the set
 *	@flags: set action flags
 *
 *	A set binding contains all information necessary for validation
 *	of new elements added to a bound set.
 */
struct nft_set_binding {
	struct list_head		list;
	const struct nft_chain		*chain;
	u32				flags;
};

The important member in our case is the 

list
 member. It is of type 
struct list_head
, the same as the 
nft_lookup.binding
 and 
nft_dynset.binding
 members. These are the foundation for building a double linked list in the kernel. For more details on how linked lists in the Linux kernel are implemented refer to this article.

With this information, let’s see what the vulnerability allows to do. Since the UAF occurs within a double linked list let’s review the common operations on them and what that implies in our scenario. Instead of showing a generic example, we are going to use the linked list that is build with the 

nft_set
 and the expressions that can be bound to it.

In the diagram shown above, the simplified pseudo-code for removing the 

nft_lookup
 expression from the list would be:

nft_lookup.binding.list->prev->next = nft_lookup.binding.list->next
nft_lookup.binding.list->next->prev = nft_lookup.binding.list->prev

This code effectively writes the address of 

nft_dynset.binding
 in 
nft_set.bindings.next
, and the address of 
nft_set.bindings
 in 
nft_dynset.binding.list-&gt;prev
. Since the 
binding
 member of 
nft_lookup
 and 
nft_dynset
 expressions are defined at different offsets, the write operation is done at different offsets.

With this out of the way we can now list the write primitives that this vulnerability allows, depending on which expression is the vulnerable one:

  • nft_lookup
    : Write an 8-byte address at offset 24 (
    binding.list-&gt;next
    ) or offset 32 (
    binding.list-&gt;prev
    ) of a freed 
    nft_lookup
     object.
  • nft_dynset
    : Write an 8-byte address at offset 64 (
    binding.list-&gt;next
    ) or offset 72 (
    binding.list-&gt;prev
    ) of a freed 
    nft_dynset
     object.

The offsets mentioned above take into account the fact that 

nft_lookup
 and 
nft_dynset
 expressions are bundled in the 
data
 member of an 
nft_expr
 object (the data member is at offset 8).

In order to do something useful with the limited write primitves that the vulnerability offers we need to find objects allocated within the same slab caches as the 

nft_lookup
 and 
nft_dynset
 expression objects that have an interesting member at the listed offsets.

As mentioned before, in Linux kernel 5.18.1 the 

nft_expr
 objects are allocated using the 
GFP_KERNEL_ACCOUNT
 flag, as shown below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nf_tables_api.c#L2866

static struct nft_expr *nft_expr_init(const struct nft_ctx *ctx,
				      const struct nlattr *nla)
{
	struct nft_expr_info expr_info;
	struct nft_expr *expr;
	struct module *owner;
	int err;

	err = nf_tables_expr_parse(ctx, nla, &expr_info);
	if (err < 0)
            goto err1;
        err = -ENOMEM;

        expr = kzalloc(expr_info.ops->size, GFP_KERNEL_ACCOUNT);
	if (expr == NULL)
	    goto err2;

	err = nf_tables_newexpr(ctx, &expr_info, expr);
	if (err < 0)
            goto err3;

        return expr;
err3:
        kfree(expr);
err2:
        owner = expr_info.ops->type->owner;
	if (expr_info.ops->type->release_ops)
	    expr_info.ops->type->release_ops(expr_info.ops);

	module_put(owner);
err1:
	return ERR_PTR(err);
}

Therefore, the objects suitable for exploitation will be different from those of the publicly available exploits targetting version 5.13 and 5.15.

Exploit Strategy

The ultimate primitives we need to exploit this vulnerability are the following:

  • Memory leak primitive: Mainly to defeat KASLR.
  • RIP control primitive: To achieve kernel code execution and escalate privileges.

However, neither of these can be achieved by only using the 8-byte write primitive that the vulnerability offers. The 8-byte write primitive on a freed object can be used to corrupt the object replacing the freed allocation. This can be leveraged to force a partial free on either the 

nft_set
nft_lookup
 or the 
nft_dynset
 objects.

Partially freeing 

nft_lookup
 and 
nft_dynset
 objects can help with leaking pointers, while partially freeing an 
nft_set
 object can be pretty useful to craft a partial fake 
nft_set
 to achieve RIP control, since it has an 
ops
 member that points to a function table.

Therefore, the high-level exploitation strategy would be the following:

  1. Leak the kernel image base address.
  2. Leak a pointer to an 
    nft_set
     object.
  3. Obtain RIP control.
  4. Escalate privileges by overwriting the kernel’s 
    MODPROBE_PATH
     global variable.
  5. Return execution to userland and drop a root shell.

The following sub-sections describe how this can be achieved.

Partial Object Free Primitive

A partial object free primitive can be built by looking for a kernel object allocated with 

GFP_KERNEL_ACCOUNT
 within kmalloc-cg-64 or kmalloc-cg-96, with a pointer at offsets 24 or 32 for kmalloc-cg-64 or at offsets 64 and 72 for kmalloc-cg-96. Afterwards, when the object of interest is destroyed, 
kfree()
 has to be called on that pointer in order to partially free the targeted object.

One of such objects is the 

fdtable
 object, which is meant to hold the file descriptor table for a given process. Its definition is shown below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/fdtable.h#L27

struct fdtable {
        unsigned int               max_fds;              /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        struct file * *            fd;                   /*     8     8 */
        long unsigned int *        close_on_exec;        /*    16     8 */
        long unsigned int *        open_fds;             /*    24     8 */
        long unsigned int *        full_fds_bits;        /*    32     8 */
        struct callback_head       rcu __attribute__((__aligned__(8))); /*    40    16 */

        /* size: 56, cachelines: 1, members: 6 */
        /* sum members: 52, holes: 1, sum holes: 4 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

The size of an 

fdtable
 object is 56, is allocated in the kmalloc-cg-64 slab and thus can be used to replace 
nft_lookup
 objects. It has a member of interest at offset 24 (
open_fds
), which is a pointer to an unsigned long integer array. The allocation of 
fdtable
 objects is done by the kernel function 
alloc_fdtable()
, which can be reached with the following call stack.

alloc_fdtable()
 |  
 +- dup_fd()
    |
    +- copy_files()
      |
      +- copy_process()
        |
        +- kernel_clone()
          |
          +- fork() syscall

Therefore, by calling the 

fork()
 system call the current process is copied and thus the currently open files. This is done by allocating a new file descriptor table object (
fdtable
), if required, and copying the currently open file descriptors to it. The allocation of a new 
fdtable
 object only happens when the number of open file descriptors exceeds 
NR_OPEN_DEFAULT
, which is defined as 64 on 64-bit machines. The following listing shows this check.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/fs/file.c#L316

/*
 * Allocate a new files structure and copy contents from the
 * passed in files structure.
 * errorp will be valid only when the returned files_struct is NULL.
 */
struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp)
{
        struct files_struct *newf;
        struct file **old_fds, **new_fds;
        unsigned int open_files, i;
        struct fdtable *old_fdt, *new_fdt;

        *errorp = -ENOMEM;
        newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
        if (!newf)
                goto out;

        atomic_set(&newf->count, 1);

        spin_lock_init(&newf->file_lock);
        newf->resize_in_progress = false;
        init_waitqueue_head(&newf->resize_wait);
        newf->next_fd = 0;
        new_fdt = &newf->fdtab;

[1]

        new_fdt->max_fds = NR_OPEN_DEFAULT;
        new_fdt->close_on_exec = newf->close_on_exec_init;
        new_fdt->open_fds = newf->open_fds_init;
        new_fdt->full_fds_bits = newf->full_fds_bits_init;
        new_fdt->fd = &newf->fd_array[0];

        spin_lock(&oldf->file_lock);
        old_fdt = files_fdtable(oldf);
        open_files = sane_fdtable_size(old_fdt, max_fds);

        /*
         * Check whether we need to allocate a larger fd array and fd set.
         */

[2]

        while (unlikely(open_files > new_fdt->max_fds)) {
                spin_unlock(&oldf->file_lock);

                if (new_fdt != &newf->fdtab)
                        __free_fdtable(new_fdt);

[3]

                new_fdt = alloc_fdtable(open_files - 1);
                if (!new_fdt) {
                        *errorp = -ENOMEM;
                        goto out_release;
                }

[Truncated]

        }

[Truncated]

        return newf;

out_release:
        kmem_cache_free(files_cachep, newf);
out:
        return NULL;
}

At [1] the 

max_fds
 member of 
new_fdt
 is set to 
NR_OPEN_DEFAULT
. Afterwards, at [2] the loop executes only when the number of open files exceeds the 
max_fds
 value. If the loop executes, at [3] a new 
fdtable
 object is allocated via the 
alloc_fdtable()
 function.

Therefore, to force the allocation of 

fdtable
 objects in order to replace a given free object from kmalloc-cg-64 the following steps must be taken:

  1. Create more than 64 open file descriptors. This can be easily done by calling the 
    dup()
     function to duplicate an existing file descriptor, such as the 
    stdout
    . This step should be done before triggering the free of the object to be replaced with an 
    fdtable
     object, since the 
    dup()
     system call also ends up allocating 
    fdtable
     objects that can interfere.
  2. Once the target object has been freed, fork the current process a large number of times. Each 
    fork()
     execution creates one 
    fdtable
     object.

The free of the 

open_fds
 pointer is triggered when the 
fdtable
 object is destroyed in the 
__free_fdtable()
 function.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/fs/file.c#L34

static void __free_fdtable(struct fdtable *fdt)
{
        kvfree(fdt->fd);
        kvfree(fdt->open_fds);
        kfree(fdt);
}

Therefore, the partial free via the overwritten 

open_fds
 pointer can be triggered by simply terminating the child process that allocated the 
fdtable
 object.

Leaking Pointers

The exploit primitive provided by this vulnerability can be used to build a leaking primitive by overwriting the vulnerable object with an object that has an area that will be copied back to userland. One such object is the System V message represented by the 

msg_msg
structure, which is allocated in 
kmalloc-cg-*
 slab caches starting from kernel version 5.14.

The 

msg_msg
 structure acts as a header of System V messages that can be created via the userland 
msgsnd()
 function. The content of the message can be found right after the header within the same allocation. System V messages are a widely used exploit primitive for heap spraying.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/msg.h#L9

struct msg_msg {
        struct list_head           m_list;               /*     0    16 */
        long int                   m_type;               /*    16     8 */
        size_t                     m_ts;                 /*    24     8 */
        struct msg_msgseg *        next;                 /*    32     8 */
        void *                     security;             /*    40     8 */

        /* size: 48, cachelines: 1, members: 5 */
        /* last cacheline: 48 bytes */
};

Since the size of the allocation for a System V message can be controlled, it is possible to allocate it in both kmalloc-cg-64 and kmalloc-cg-96 slab caches.

It is important to note that any data to be leaked must be written past the first 48 bytes of the message allocation, otherwise it would overwrite the 

msg_msg
 header. This restriction discards the 
nft_lookup
 object as a candidate to apply this technique to as it is only possible to write the pointer either at offset 24 or offset 32 within the object. The ability of overwriting the 
msg_msg.m_ts
 member, which defines the size of the message, helps building a strong out-of-bounds read primitive if the value is large enough. However, there is a check in the code to ensure that the 
m_ts
 member is not negative when interpreted as a signed long integer and heap addresses start with 
0xffff
, making it a negative long integer. 

Leaking an 
nft_set
 Pointer

Leaking a pointer to an 

nft_set
 object is quite simple with the memory leak primitive described above. The steps to achieve it are the following:

1. Create a target set where the expressions will be bound to.

2. Create a rule with a lookup expression bound to the target set from step 1.

3. Create a set with an embedded 

nft_dynset
 expression bound to the target set. Since this is considered an invalid expression to be embedded to a set, the 
nft_dynset
 object will be freed but not removed from the target set bindings list, causing a UAF.

4. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed 

nft_dynset
 object (via 
msgsnd()
 function). Tag all the messages at offset 24 so the one corrupted with the 
nft_set
 pointer can later be identified.

5. Remove the rule created, which will remove the entry of the 

nft_lookup
 expression from the target set’s bindings list. Removing this from the list effectively writes a pointer to the target 
nft_set
 object where the original 
binding.list.prev
 member was (offset 72). Since the freed 
nft_dynset
 object was replaced by a System V message, the pointer to the 
nft_set
 will be written at offset 24 within the message data.

6. Use the userland 

msgrcv()
 function to read the messages and check which one does not have the tag anymore, as it would have been replaced by the pointer to the 
nft_set
.

Leaking a Kernel Function Pointer

Leaking a kernel pointer requires a bit more work than leaking a pointer to an 

nft_set
 object. It requires being able to partially free objects within the target set bindings list as a means of crafting use-after-free conditions. This can be done by using the partial object free primitive using 
fdtable
 object already described. The steps followed to leak a pointer to a kernel function are the following.

1. Increase the number of open file descriptors by calling 

dup()
 on 
stdout
 65 times.

2. Create a target set where the expressions will be bound to (different from the one used in the `

nft_set
` adress leak).

3. Create a set with an embedded 

nft_lookup
 expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the 
nft_lookup
 object will be freed but not removed from the target set bindings list, causing a UAF.

4. Spray 

fdtable
 objects in order to replace the freed 
nft_lookup
 from step 3.

5. Create a set with an embedded 

nft_dynset
 expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the 
nft_dynset
 object will be freed but not removed from the target set bindings list, causing a UAF. This addition to the bindings list will write the pointer to its binding member into the 
open_fds
 member of the 
fdtable
 object (allocated in step 4) that replaced the 
nft_lookup
 object.

6. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed 

nft_dynset
 object (via 
msgsnd()
 function). Tag all the messages at offset 8 so the one corrupted can be identified.

7. Kill all the child processes created in step 4 in order to trigger the partial free of the System V message that replaced the 

nft_dynset
 object, effectively causing a UAF to a part of a System V message.

8. Spray 

time_namespace
 objects in order to replace the partially freed System V message allocated in step 7. The reason for using the 
time_namespace
 objects is explained later.

9. Since the System V message header was not corrupted, find the System V message whose tag has been overwritten. Use 

msgrcv()
 to read the data from it, which is overlapping with the newly allocated 
time_namespace
 object. The offset 40 of the data portion of the System V message corresponds to 
time_namespace.ns-&gt;ops
 member, which is a function table of functions defined within the kernel core. Armed with this information and the knowledge of the offset from the kernel base image to this function it is possible to calculate the kernel image base address.

10. Clean-up the child processes used to spray the 

time_namespace
 objects.

time_namespace
 objects are interesting because they contain an 
ns_common
 structure embedded in them, which in turn contains an 
ops
 member that points to a function table with functions defined within the kernel core. The 
time_namespace
 structure definition is listed below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/time_namespace.h#L19

struct time_namespace {
        struct user_namespace *    user_ns;              /*     0     8 */
        struct ucounts *           ucounts;              /*     8     8 */
        struct ns_common           ns;                   /*    16    24 */
        struct timens_offsets      offsets;              /*    40    32 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        struct page *              vvar_page;            /*    72     8 */
        bool                       frozen_offsets;       /*    80     1 */

        /* size: 88, cachelines: 2, members: 6 */
        /* padding: 7 */
        /* last cacheline: 24 bytes */
};

At offset 16, the 

ns
 member is found. It is an 
ns_common
 structure, whose definition is the following.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/ns_common.h#L9

struct ns_common {
        atomic_long_t              stashed;              /*     0     8 */
        const struct proc_ns_operations  * ops;          /*     8     8 */
        unsigned int               inum;                 /*    16     4 */
        refcount_t                 count;                /*    20     4 */

        /* size: 24, cachelines: 1, members: 4 */
        /* last cacheline: 24 bytes */
};

At offset 8 within the 

ns_common
 structure the 
ops
 member is found. Therefore, 
time_namespace.ns-&gt;ops
 is at offset 24.

Spraying 

time_namespace
 objects can be done by calling the 
unshare()
 system call and providing the 
CLONE_NEWUSER
 and 
CLONE_NEWTIME
. In order to avoid altering the execution of the current process the 
unshare()
 executions can be done in separate processes created via 
fork()
.

clone_time_ns()
  |
  +- copy_time_ns()
    |
    +- create_new_namespaces()
      |
      +- unshare_nsproxy_namespaces()
        |
        +- unshare() syscall

The 

CLONE_NEWTIME
 flag is required because of a check in the function 
copy_time_ns()
 (listed below) and 
CLONE_NEWUSER
 is required to be able to use the 
CLONE_NEWTIME
 flag from an unprivileged user.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/kernel/time/namespace.c#L133

/**
 * copy_time_ns - Create timens_for_children from @old_ns
 * @flags:      Cloning flags
 * @user_ns:    User namespace which owns a new namespace.
 * @old_ns:     Namespace to clone
 *
 * If CLONE_NEWTIME specified in @flags, creates a new timens_for_children;
 * adds a refcounter to @old_ns otherwise.
 *
 * Return: timens_for_children namespace or ERR_PTR.
 */
struct time_namespace *copy_time_ns(unsigned long flags,
        struct user_namespace *user_ns, struct time_namespace *old_ns)
{
        if (!(flags & CLONE_NEWTIME))
                return get_time_ns(old_ns);

        return clone_time_ns(user_ns, old_ns);
}

RIP Control

Achieving RIP control is relatively easy with the partial object free primitive. This primitive can be used to partially free an 

nft_set
 object whose address is known and replace it with a fake 
nft_set
 object created with a System V message. The 
nft_set
 objects contain an 
ops
 member, which is a function table of type 
nft_set_ops
. Crafting this function table and triggering the right call will lead to RIP control.

The following is the definition of the 

nft_set_ops
 structure.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L389

struct nft_set_ops {
        bool                       (*lookup)(const struct net  *, const struct nft_set  *, const u32  *, const struct nft_set_ext  * *); /*     0     8 */
        bool                       (*update)(struct nft_set *, const u32  *, void * (*)(struct nft_set *, const struct nft_expr  *, struct nft_regs *), const struct nft_expr  *, struct nft_regs *, const struct nft_set_ext  * *); /*     8     8 */
        bool                       (*delete)(const struct nft_set  *, const u32  *); /*    16     8 */
        int                        (*insert)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *, struct nft_set_ext * *); /*    24     8 */
        void                       (*activate)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *); /*    32     8 */
        void *                     (*deactivate)(const struct net  *, const struct nft_set  *, cstimate *); /*    88     8 */
        int                        (*init)(const struct nft_set  *, const struct nft_set_desc  *, const struct nlattr  * const *); /*    96     8 */
        void                       (*destroy)(const struct nft_set  *); /*   onst struct nft_set_elem  *); /*    40     8 */
        bool                       (*flush)(const struct net  *, const struct nft_set  *, void *); /*    48     8 */
        void                       (*remove)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *); /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void                       (*walk)(const struct nft_ctx  *, struct nft_set *, struct nft_set_iter *); /*    64     8 */
        void *                     (*get)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *, unsigned int); /*    72     8 */
        u64                        (*privsize)(const struct nlattr  * const *, const struct nft_set_desc  *); /*    80     8 */
        bool                       (*estimate)(const struct nft_set_desc  *, u32, struct nft_set_e104     8 */
        void                       (*gc_init)(const struct nft_set  *); /*   112     8 */
        unsigned int               elemsize;             /*   120     4 */

        /* size: 128, cachelines: 2, members: 16 */
        /* padding: 4 */
};

The 

delete
 member is executed when an item has to be removed from the set. The item removal can be done from a rule that removes an element from a set when certain criteria is matched. Using the 
nft
 command, a very simple one can be as follows:

nft add table inet test_dynset
nft add chain inet test_dynset my_input_chain { type filter hook input priority 0\;}
nft add set inet test_dynset my_set { type ipv4_addr\; }
nft add rule inet test_dynset my_input_chain ip saddr 127.0.0.1 delete @my_set { 127.0.0.1 }

The snippet above shows the creation of a table, a chain, and a set that contains elements of type 

ipv4_addr
 (i.e. IPv4 addresses). Then a rule is added, which deletes the item 
127.0.0.1
 from the set 
my_set
 when an incoming packet has the source IPv4 address 
127.0.0.1
. Whenever a packet matching that criteria is processed via nftables, the 
delete
 function pointer of the specified set is called.

Therefore, RIP control can be achieved with the following steps. Consider the target set to be the 

nft_set
 object whose address was already obtained.

  1. Add a rule to the table being used for exploitation in which an item is removed from the target set when the source IP of incoming packets is 
    127.0.0.1
    .
  2. Partially free the 
    nft_set
     object from which the address was obtained.
  3. Spray System V messages containing a partially fake 
    nft_set
     object containing a fake 
    ops
     table, with a given value for the 
    ops-&gt;delete
     member.
  4. Trigger the call of 
    nft_set-&gt;ops-&gt;delete
     by locally sending a network packet to 
    127.0.0.1
    . This can be done by simply opening a TCP socket to 
    127.0.0.1
     at any port and issuing a 
    connect()
     call.

Escalating Privileges

Once the control of the RIP register is achieved and thus the code execution can be redirected, the last step is to escalate privileges of the current process and drop to an interactive shell with root privileges.

A way of achieving this is as follows:

  1. Pivot the stack to a memory area under control. When the 
    delete
     function is called, the RSI register contains the address of the memory region where the nftables register values are stored. The values of such registers can be controlled by adding an 
    immediate
     expression in the rule created to achieve RIP control.
  2. Afterwards, since the nftables register memory area is not big enough to fit a ROP chain to overwrite the 
    MODPROBE_PATH
     global variable, the stack is pivoted again to the end of the fake 
    nft_set
     used for RIP control.
  3. Build a ROP chain to overwrite the 
    MODPROBE_PATH
     global variable. Place it at the end of the 
    nft_set
     mentioned in step 2.
  4. Return to userland by using the KPTI trampoline.
  5. Drop to a privileged shell by leveraging the overwritten 
    MODPROBE_PATH
     global variable
    .

The stack pivot gadgets and ROP chain used can be found below.

// ROP gadget to pivot the stack to the nftables registers memory area

0xffffffff8169361f: push rsi ; add byte [rbp+0x310775C0], al ; rcr byte [rbx+0x5D], 0x41 ; pop rsp ; ret ;


// ROP gadget to pivot the stack to the memory allocation holding the target nft_set

0xffffffff810b08f1: pop rsp ; ret ;

When the execution flow is redirected, the RSI register contains the address otf the nftables’ registers memory area. This memory can be controlled and thus is used as a temporary stack, given that the area is not big enough to hold the entire ROP chain. Afterwards, using the second gadget shown above, the stack is pivoted towards the end of the fake 

nft_set
 object.

// ROP chain used to overwrite the MODPROBE_PATH global variable

0xffffffff8148606b: pop rax ; ret ;
0xffffffff8120f2fc: pop rdx ; ret ;
0xffffffff8132ab39: mov qword [rax], rdx ; ret ;

It is important to mention that the stack pivoting gadget that was used performs memory dereferences, requiring the address to be mapped. While experimentally the address was usually mapped, it negatively impacts the exploit reliability.

Wrapping Up

We hope you enjoyed this reading and could learn something new. If you are hungry for more make sure to check our other blog posts.

We wish y’all a great Christmas holidays and a happy new year! Here’s to a 2023 with more bugs, exploits, and write ups!

New Linux malware evades detection using multi-stage deployment

New Linux malware evades detection using multi-stage deployment

original text by Bill Toulas

A new stealthy Linux malware known as Shikitega has been discovered infecting computers and IoT devices with additional payloads.

The malware exploits vulnerabilities to elevate its privileges, adds persistence on the host via crontab, and eventually launches a cryptocurrency miner on infected devices.

Shikitega is quite stealthy, managing to evade anti-virus detection using a polymorphic encoder that makes static, signature-based detection impossible.

An intricate infection chain

While the initial infection method is not known at this time, researchers at AT&T who discovered Shikitega say the malware uses a multi-step infection chain where each layer delivers only a few hundred bytes, activating a simple module and then moving to the next one.

«Shiketega malware is delivered in a sophisticated way, it uses a polymorphic encoder, and it gradually delivers its payload where each step reveals only part of the total payload.,» explains AT&T’s report.

The infection begins with a 370 bytes ELF file, which is the dropper containing encoded shellcode.

The ELF file that initiates the infection chain (AT&T)

The encoding is performed using the polymorphic XOS additive feedback encoder ‘Shikata Ga Nai,’ previously analyzed by Mandiant.

“Using the encoder, the malware runs through several decode loops, where one loop decodes the next layer until the final shellcode payload is decoded and executed,” continues the report.

“The encoder stud is generated based on dynamic instruction substitution and dynamic block ordering. In addition, registers are selected dynamically.”

Shikata Ga Nai decryption loops (AT&T)

After the decryption is completed, the shellcode is executed to contact the malware’s command and control servers (C2) and receive additional shellcode (commands) stored and run directly from memory.

One of these commands downloads and executes ‘Mettle,’ a small and portable Metasploit Meterpreter payload that gives the attackers further remote control and code execution options on the host.

Downloaded shellcode fetching Mettle (AT&T)

Mettle fetches yet a smaller ELF file, which exploits CVE-2021-4034 (aka PwnKit) and CVE-2021-3493 to elevate privileges and download the final stage payload, a cryptocurrency miner, as root.

Exploiting PwnKit to elevate privileges to root (AT&T)

Persistence for the crypto miner is achieved by downloading five shell scripts that add four cronjobs, two for the root user and two for the current user.

The five shell scripts and their functions (AT&T)

The crontabs are an effective persistence mechanism, so all downloaded files are wiped to reduce the likelihood of the malware being discovered.

The crypto miner is XMRig version 6.17.0, focusing on mining the anonymity-focused and hard-to-trace Monero.


Shikitega infection chain overview
 (AT&T)

To further reduce the chances of raising alarms on network security products, the threat actors behind Shikitega use legitimate cloud hosting services to host their command and control infrastructure.

This choice costs more money and puts the operators at risk of being traced and identified by law enforcement but offers better stealthiness in the compromised systems.

The AT&T team reports a sharp rise in Linux malware this year, advising system admins to apply the available security updates, use EDR on all endpoints, and take regular backups of most important data.

For now, Shikitega appears focused on Monero mining, but the threat actors may decide that other, more potent payloads can be more profitable in the long run.