Analysis of VirtualBox CVE-2023-21987 and CVE-2023-21991

Analysis of VirtualBox CVE-2023-21987 and CVE-2023-21991

Original text by qriousec


Hi, I am Trung (xikhud). Last month, I joined Qrious Secure team as a new member, and my first target was to find and reproduce the security bugs that @bienpnn used at the Pwn2Own Vancouver 2023 to escape the VirtualBox VM.

Since VirtualBox is an open-source software, I can just download the source code from their homepage. The version of VirtualBox at the time of the Pwn2Own competition was 7.0.6.

Exploring VirtualBox

Building VirtualBox

The very first thing I did is to build the VirtualBox and to have a debugging environment. VirtualBox’s developers have published a very detail guide to build it. My setup is below:

  • Host: Windows 10
  • Guest: Windows 10. VirtualBox will be built on this machine.
  • Guest 2 (the guest inside the VirtualBox VM): LUbuntu 18.04.3

If you are new to VirtualBox exploitation, you may wonder why I need to install a nested VM. The reason is that VirtualBox contains both kernel mode and user mode components, so I have to install it inside a VM to debug its kernel things.

The official building guide offers using VS2010 or VS2019 to build VirtualBox, but you have to use VS2019 to build the version 7.0.6.

You can use any other operating system for Guest 2. I choose LUbuntu because it is lightweight. (I have a potato computer lol).

Learning VirtualBox source code

VirtualBox source code is large, I can’t just read all of them in a short amount of time. Instead, I find blog posts about pwning VirtualBox on Google and read them. These posts not only show how to exploit VirtualBox but also describe how VirtualBox works, its architecture and stuff like that. These are the very good write-ups that I also recommend you to read if you want to start learning VirtualBox exploitation:

The VirtualBox architecture is as follow (the picture is taken from Chen Nan’s slide at HITB2021)

The simple rule I learned is that when the guest wants to emulate a device, it send a request to the host’s kernel drivers (R0) first. The host’s kernel have two choices:

  • It can handle that request
  • Or it can return 
    . This value means that it doesn’t want to handle the request and the request will be handled by the host’s user mode components (R3).

The source code for R0 and R3 is usually in the same file, the only different thing is the preprocessors.

  • #define IN_RING3
     corresponds to R3 components
  • #define IN_RING0
     corresponds to R0 components
  • #define IN_RC
    : I don’t know what this is, maybe someone knows can tell me …

For example, let’s look at the code in the 


In the image above, when the R0 component receives this request, it will pass to R3 component. The return code (

) is 
. According to the source code comment, it is “Reason for leaving RZ: MMIO write”. There are other similar values: 
, …

If you want to know more detail about VirtualBox architechture, I suggest you to read the slide by Chen Nan. You can also watch his video here.

After having a basic understanding about VirtualBox, the next thing I did is to find some attack vectors. Usually, with VirtualBox, the attack scenario will be an untrusted code running within the guest machine. It will communicate with the host to compromise it. There are two methods a guest OS can talk to the host:

  • Using memory mapped I/O
  • Using port I/O

These are usually the entry points of an attack, so I look at them first when auditing.

The memory mapped region can be created by these functions:

  • PDMDevHlpMmioCreateAndMap
  • PDMDevHlpMmioCreateExAndMap
  • ...

The IO port can be created by:

  • PDMDevHlpIoPortCreateFlagsAndMap
  • PDMDevHlpPCIIORegionCreateIo
  • PDMDevHlpPCIIORegionCreateMmio2Ex
  • ...

With memory mapped, we can use the 

 or similar instructions to communicate with the host. Meanwhile, we use 
 instruction when we work with IO port.

Now I have more understanding about VirtualBox, I can start to look for bugs now. To reduce the time, @bienpnn gave me 2 hints:

  • The OOB write bug is in the TPM components
  • The OOB read bug is in the VGA components

Knowing that, I open the source code and read files in 


The OOB write bug

At Pwn2Own, the TPM 2.0 is enabled. It is required to run Windows 11 inside VirtualBox. You will have to enable it manually in the VirtualBox GUI, if you don’t, then the exploit here won’t work.

The TPM module is initialized by the two functions 

 (R3) and 
 (R0). Both functions register 
 to handle read/write to memory mapped region.

    rc = PDMDevHlpMmioCreateAndMap(pDevIns, pThis->GCPhysMmio, TPM_MMIO_SIZE, tpmMmioWrite, tpmMmioRead,
                                   "TPM MMIO", &pThis->hMmio);

The memory region is at 
, which is 
 by default.
To confirm the communication works as expected, I put a (R0) breakpoint at 
 and write a small C code to run inside the VirtualBox.
void *map_mmio(void *where, size_t length)
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd == -1) { /* error */ }
    void *addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (off_t)where);
    if (addr == NULL) { /* error */ }
    return addr;

int main()
    volatile uint8_t* mmio_tpm = (uint8_t *)map_mmio((void *)0xfed40000, 0x5000);
    mmio_tpm[0x200] = 0xFF;
    return 0;

The breakpoint hit! It works. This is the signature of the 


static DECLCALLBACK(VBOXSTRICTRC) tpmMmioWrite(PPDMDEVINS pDevIns, void *pvUser, RTGCPHYS off, void const *pv, unsigned cb);

At the time the breakpoint hit, 

 (which is the offset from the start of the memory mapped buffer), 
 is the number of byte to read, in this case it is 
 since we only write 1 byte, 
 is the host buffer contains the values supplied by the guest OS, in this case it contains 
 only. If we want to write more bytes, we can write it in C like this:

*(uint32_t*)(mmio_tpm + 0x200) = 0xFFAABBCC;

In assembly form, it will be something like this:

mov dword ptr [rdx], 0xFFAABBCC 

In this case, 

 will be 


 function looks fine, after confirming the is no bug in it, I look at 

static DECLCALLBACK(VBOXSTRICTRC) tpmMmioRead(PPDMDEVINS pDevIns, void *pvUser, RTGCPHYS off, void *pv, unsigned cb)
    /* truncated */
    uint64_t u64;
    /* truncated */
        rc = tpmMmioFifoRead(pDevIns, pThis, pLoc, bLoc, uReg, &u64, cb);
    /* truncated */
    return rc;

                                    uint8_t bLoc, uint32_t uReg, uint64_t *pu64, size_t cb)
    /* ... */
    /* Special path for the data buffer. */
    if (   (   (   uReg >= TPM_FIFO_LOCALITY_REG_DATA_FIFO
               && uReg < TPM_FIFO_LOCALITY_REG_DATA_FIFO + sizeof(uint32_t))
            || (   uReg >= TPM_FIFO_LOCALITY_REG_XDATA_FIFO
                && uReg < TPM_FIFO_LOCALITY_REG_XDATA_FIFO + sizeof(uint32_t)))
        && bLoc == pThis->bLoc
        && pThis->enmState == DEVTPMSTATE_CMD_COMPLETION)
        if (pThis->offCmdResp <= pThis->cbCmdResp - cb)
            memcpy(pu64, &pThis->abCmdResp[pThis->offCmdResp], cb);
            pThis->offCmdResp += (uint32_t)cb;
            memset(pu64, 0xff, cb);
        return VINF_SUCCESS;

You can see that there is a branch of code that does a 

 into the 
, which is a stack variable of 
 function. To be able to reach this branch, 
 must have appropriate values. But don’t worry because we can control all of them, we can also control 
. There is no check to make sure 
cb &lt;= sizeof(uint64_t)
, so maybe there is a stack buffer overflow here? Now I have to find a way to make 
 larger than 
 (8). I google and found that some AVX-512 instructions can read up to 512 bits (64 bytes) memory. Since my CPU doesn’t support AVX-512, I try AVX2 instead:

__m256 z = _mm256_load_ps((const float*)off);

Indeed, it works! 

 is now 
 and I can overwrite 0x18 bytes after 
 variable. But the is a problem: 
 is behind the return address of 
. Let’s look at the stack when RIP is at the very first instruction of 

2: kd> dq @rsp
ffffbb80`814920a8  fffff804`d2432993 ffff8901`0ecc0000
ffffbb80`814920b8  000fffff`fffff000 ffff8901`0edc6760
ffffbb80`814920c8  fffff804`d243418b 00000000`00000020
ffffbb80`814920d8  fffff804`d2458b1d ffffe289`26bf7000
ffffbb80`814920e8  00000000`00000020 ffff8901`0ede4000
ffffbb80`814920f8  00000000`00000080 ffff8901`0ecc0000
ffffbb80`81492108  fffff804`d243313f ffffe289`26b87188
ffffbb80`81492118  fffff804`d2451c8b 00000000`fed40080

Remember that the return address is at 

. Now let’s run until RIP is at 
call VBoxDDR0!tpmMmioFifoRead

2: kd> dq @rsp
ffffbb80`81492060  ffff8901`0ecc0000 00000000`00000000
ffffbb80`81492070  00000000`00000060 fffff804`d253ba1b
ffffbb80`81492080  00000000`00000080 ffffbb80`814920b0 <-- pu64
ffffbb80`81492090  00000000`00000020 00000000`00000018
ffffbb80`814920a0  ffff8901`0ede4140 fffff804`d2432993 <-- R.A
ffffbb80`814920b0  ffff8901`0ecc0000 ffffe289`26b87188
ffffbb80`814920c0  00000000`00000080 fffff804`d243418b
ffffbb80`814920d0  00000000`00000020 fffff804`d2458b1d

Based on the x64 Windows calling convention, the 5th argument is at 

, so the address of 
, which is behind the return address (
). Why does this happen? I don’t really know, but I guess this is some kind of compiler optimization. Let’s check 
 in IDA:

unsigned __int64 pu64; // [rsp+50h] [rbp+8h] BYREF

The assembly code:

.text:000000014002BA10 000   mov     [rsp+10h], rbx
.text:000000014002BA15 000   mov     [rsp+18h], rsi
.text:000000014002BA1A 000   push    rdi
.text:000000014002BA1B 008   sub     rsp, 40h
.text:000000014002BA1F 048   mov     rdx, [rcx+10h]  ; pThis

 is at 
, but the function only allocate 0x48 bytes for the stack. Clearly, 
 is outside of the stack frame range. So in which function stack frame does this variable belong to? Well, it is right next to the return address, so it is in the shadow space. Turned out that, the shadow space is used to make debugging easier. But we are using the “Release” build, so it will use the shadow space as if it is a normal space. We can overwrite 0x18 bytes after the 
 variable. Unfortunately, there is no data after 
 so we can’t do anything. I’m stuck now. Maybe if my CPU supports AVX-512, I can do something? Until now, @bienpnn told me that there is an instruction which can read up to 512 bytes. It is 
, which is used to restore x87 FPU, MMX, XMM, and MXCSR state. Knowing this, I tried this code:


And then, VirtualBox.exe crashed! That’s good. But wait, why does it crash without first hitting the breakpoint at 

? Turned out that all the request with 
cb >= 0x80
 will be handled by R3 code. This is the comment in 

 * If someone is doing FXSAVE, FXRSTOR, XSAVE, XRSTOR or other stuff dealing with
 * large amounts of data, just go to ring-3 where we don't need to deal with partial
 * successes.  No chance any of these will be problematic read-modify-write stuff.
 * Also drop back if the ring-0 registration entry isn't actually used.

Let’s trigger this bug again. But this time we will set a breakpoint at 

 instead. And now I can see a stack buffer overflow.

Really nice. Now we have RIP controlled, but don’t know where to jump. We need a leak.

The OOB read bug

The OOB read bug is inside 

 module. There are a lot of files belong to this module, but I choose to read 
 first, since the name looks like the main file of VGA module. I look at the 2 construction functions to see which IO port or memory mapped is used. I found that the 
 will handle the MMIO request, it will then call 
. And inside this function, I found the code below (we can control 

pThis->latch = !pThis->svga.fEnabled            ? ((uint32_t *)pThisCC->pbVRam)[addr]
             : addr < VMSVGA_VGA_FB_BACKUP_SIZE ? ((uint32_t *)pThisCC->svga.pbVgaFrameBufferR3)[addr] : UINT32_MAX;

 so we only care about this line:

addr < VMSVGA_VGA_FB_BACKUP_SIZE ? ((uint32_t *)pThisCC->svga.pbVgaFrameBufferR3)[addr] : UINT32_MAX;

 is the size of 
. Maybe you can see what’s wrong here.

((uint32_t *)pThisCC->svga.pbVgaFrameBufferR3)[addr]

*(uint32_t *)(pThisCC->svga.pbVgaFrameBufferR3 + sizeof(uint32_t) * addr)
// note that the type of pThisCC->svga.pbVgaFrameBufferR3 is uint8_t[]

he code checks if 

, but actually uses 
4 * addr
 for indexing. It means that we have an OOB read here. Untill now, I thought that it will be easy because with a leak and a stack buffer overflow, I would easily do a ROP chain. But I regret soon when I see that the heap layout is not static, it changes everytime I open a new VirtualBox process. The reason for this is because VirtualBox is a very complex software, so heap allocations are made everywhere, which changes the shape of the heap.


Now I need a reliable way to have a leak. For this, I will use heap spraying technique. So my plan is to poison the heap with a lot of objects that I control, and (hopefully) some of the objects will be right behind the 

 buffer so that I can use the OOB read to leak information. sauercl0ud team had already written a nice blog post about exploiting VirtualBox. Inside the post, they sprayed the heap with 
 objects, I will just use 
 too, because why not 😀 ?

What is HGCM?

HGCM is an abbreviation for “Host/Guest Communication Manager”. This is the module used for communication between the host and the guest. For example, they need to talk to each other in order to implement the “Shared Clipboard”, “Shared Folder”, “Drag and drop” services.

Here’s how it works. The guest inside VirtualBox will have to install additional drivers, a.k.a the guest additions. When the guest wants to use one of the service above, it will send a message to the host through IO port, the message is represented by the 


0:035> dt VBoxC!HGCMMsgCall
   +0x000 __VFN_table : Ptr64 
   +0x008 m_cRefs          : Int4B
   +0x00c m_enmObjType     : HGCMOBJ_TYPE
   +0x010 m_u32Version     : Uint4B
   +0x014 m_u32Msg         : Uint4B
   +0x018 m_pThread        : Ptr64 HGCMThread
   +0x020 m_pfnCallback    : Ptr64     int 
   +0x028 m_pNext          : Ptr64 HGCMMsgCore
   +0x030 m_pPrev          : Ptr64 HGCMMsgCore
   +0x038 m_fu32Flags      : Uint4B
   +0x03c m_rcSend         : Int4B
   +0x040 pCmd             : Ptr64 VBOXHGCMCMD
   +0x048 pHGCMPort        : Ptr64 PDMIHGCMPORT
   +0x050 pcCounter        : Ptr64 Uint4B
   +0x058 u32ClientId      : Uint4B
   +0x05c u32Function      : Uint4B
   +0x060 cParms           : Uint4B
   +0x068 paParms          : Ptr64 VBOXHGCMSVCPARM
   +0x070 tsArrival        : Uint8B

This object is perfect because:

  • It has a 
     pointer -> We can leak a library address
  • It has 
    , which points to next and previous 
     in a doubly linked list -> Also good, can be used to leak heap address.

Now I will spray the heap with a lot of 

. This code is just copied from Sauercl0ud blog:

void spray()
    int rc;
    for (int i = 0; i < 64; ++i)
        int32_t clientId;
        rc = hgcm_connect("VBoxGuestPropSvc", &clientId);
        for (int j = 0; j < 16 - 1; ++j)
            char pattern[0x70];
            char out[2];
            rc = wait_prop(clientId, pattern, strlen(pattern) + 1, out, sizeof(out)); // call VBoxGuestPropSvc HGCM service, this will allocate a HGCMMsgCall

After some observation, I realize that the 

 is usually 
. Only the 
 part is randomized, I will use this information to identify a 
 on the heap. My approach is simple: I just keep reading a qword (8 bytes) each time, called 
. I will then check if 
(X &amp; 0xFFFF) == 0xAD90
(X &gt;&gt; 40) == 0x7F
. If this is true, we likely to reach a 
, and X is the 
 pointer. To leak heap address, I will do like this (this idea is also taken from Sauercl0ud blog):

  • Find a 
     on the heap. Let’s call this object 
     and let’s call 
     the offset from the 
     buffer to this object.
  • Find another 
     are the same as above, and 
    b &gt; a
  • If 
    A-&gt;m_pNext - B-&gt;m_pPrev == b - a
    , then it’s likely that 
     is the address of 
    . It means that 
    A-&gt;m_pNext - b
     is the address of 

Actually I don’t need a heap leak to make a ROP chain, only a DLL leak is enough. But I want to show you this method so that you can make a longer ROP chain in case you need it.

Now I have enough information to write an exploit.

Testing out the exploitation idea

I implement the idea above, and run the exploit for 20 times and not a single time success. That’s 0% of success rate, very bad. Most of the time, VirtualBox just crashes. I attached a debugger and ran the exploit again, the crash happened when trying to read an address that had not been mapped. Turned out that I could read up to 

 bytes (1.5MB) after the 
 buffer, but most of the time there is only about ~ 
 bytes that had been mapped. Another crash I found is when the exploit was trying to read an address inside a guard page. Another problem I had is that the exploit run really slow, because the OOB bug only lets me read 1 byte at a time. I need to improve the speed of the exploit as well.

Parsing heap header to avoid unmmaped pages and increase speed

Until now, I have a new idea: parsing the heap chunk headers on the heap to gain more information. First thing I want to do is to read some information about a chunk, for example, the size of the chunk, is it freed or in used? If I can do this, maybe I will be able to skip some unnecessary chunks. To make this idea come true, I have to learn some Windows heap internal. I recommend you to read these:

Basically, a heap chunk is represented by 


0:035> dt _HEAP_ENTRY
   +0x008 Size             : Uint2B
   +0x00a Flags            : UChar
   +0x00c PreviousSize     : Uint2B

 is the size of the chunk (include the header itself), 
 is the size of the previous chunk (in memory), and 
 contains extra information about a chunk, for example: is it free or in used.


) is the size of a chunk in blocks, not in bytes. 1 block is 16 bytes in length.

Parsing heap header is easy. 16 bytes after 

 is the chunk header of a chunk, so I can read it, get the 
 and just do it again … But there is a problem: the chunk header is encoded, it is xorred with 
. I will give you an example.

0:042> !heap -i 26855e40000             
Heap context set to the heap 0x0000026855e40000
0:042> db 0000026859f1f010 L0x10
00000268`59f1f010  0c 00 02 02 2e 2e 00 00-dd e0 1a 38 cd be b2 10  ...........8....
0:042> !heap -i 0000026859f1f010 
Detailed information for block entry 0000026859f1f010
Assumed heap       : 0x0000026855e40000 (Use !heap -i NewHeapHandle to change)
Header content     : 0x381AE0DD 0x10B2BECD (decoded : 0x08010801 0x10B28201)
Owning segment     : 0x00000268593f0000 (offset b2)
Block flags        : 0x1 (busy )
Total block size   : 0x801 units (0x8010 bytes)
Requested size     : 0x8000 bytes (unused 0x10 bytes)
Previous block size: 0x8201 units (0x82010 bytes)
Block CRC          : OK - 0x8  
Previous block     : 0x0000026859e9d000
Next block         : 0x0000026859f27020

The output of 

!heap -i
 said that the header content is 
0x381AE0DD 0x10B2BECD
, but after decoded it is 
0x08010801 0x10B28201
. Let’s confirm this

0:042> db 0000026859f1f010 L0x10
00000268`59f1f010  0c 00 02 02 2e 2e 00 00-dd e0 1a 38 cd be b2 10  ...........8....
0:042> dt _HEAP
   +0x000 Segment          : _HEAP_SEGMENT
   +0x080 Encoding         : _HEAP_ENTRY // <-- The key to decode
0:042> dq 26855e40000+0x80 L2
00000268`55e40080  00000000`00000000 00003ccc`301be8dc

So the key is 

0x00003ccc301be8dc ^ 0x10b2becd381ae0dd = 0x10b2820108010801
, which is exactly what shown in 
!heap -i

So to parse a chunk header, we need to leak the 

. I can’t not directly leak it but I have an idea to calculate it. 
 has the size of 
 bytes (include the header), so the chunk right behind it (let’s call this chunk 
) must have 
 equals to 
. Knowing this, I can calculate 2 bytes key to decode 

KeyToDecodePreviousSize = A->PreviousSize ^ 0x8001

Next, I find another chunk after chunk 

, let’s call this chunk 
 is likely to be a valid chunk if:

(B->PreviousSize ^ KeyToDecodePreviousSize) << 4 == Distance between A and B

B->PreviousSize ^ KeyToDecodePreviousSize
 is also the value of 
A->Size ^ KeyToDecodeSize
, so:

KeyToDecodeSize = B->PreviousSize ^ KeyToDecodePreviousSize ^ A->Size

Now I am able to decode 

. What about the 
? I don’t know a good way to decode it, so I just run VirtualBox multiple times and observe that most of the time chunk 
 is in used. So if any other chunk has the LSB bit of 
 equals to the LSB bit of 
, then it is also in used and vice versa. With this informaton, I can walk the heap easily, the algorithm looks like this:

uint32_t curOffset = 0x80000;
while (curOffset < 0x200000) { // 0x200000 is the maximum we can touch
    readHeapEntry(curOffset, &hE);
    if (isInUsed(&hE))
        findSprayedObjects(curOffset, &hE);
    curOffset += ((hE.Size ^ KeyToDecodeSize) << 4); // 1 block is 16 bytes

Now my exploit runs a lot faster, also the success rate is increased a little. But sometime the exploit still crashed VirtualBox. I attach Windbg and see that it was trying to access a guard page, and this guard page is inside an in-used chunk. After a few days of researching, I finally knew that chunk was a 


How does LFH work and what is a 

Quoted from Microsoft:

Heap fragmentation is a state in which available memory is broken into small, noncontiguous blocks. When a heap is fragmented, memory allocation can fail even when the total available memory in the heap is enough to satisfy a request, because no single block of memory is large enough. The low-fragmentation heap (LFH) helps to reduce heap fragmentation. The LFH is not a separate heap. Instead, it is a policy that applications can enable for their heaps. When the LFH is enabled, the system allocates memory in certain predetermined sizes

When an application makes more than 17 allocations of the same size, the LFH will be turned on (for that size only). We spray a lot of objects (more than 17), so they will all be served by the LFH. Basically this is how LFH works:

  • A big chunk will be allocated, this is a 
     struct. A 
     contains some metadata and a lot of small chunks.
  • Any heap allocation after that will return a (freed) small chunk in the 

 is represented by a 

   +0x000 SFreeListEntry   : _SINGLE_LIST_ENTRY
   +0x000 SubSegment       : Ptr64 _HEAP_SUBSEGMENT
   +0x008 Reserved         : Ptr64 Void
   +0x010 SizeIndexAndPadding : Uint4B
   +0x010 SizeIndex        : UChar
   +0x011 GuardPagePresent : UChar
   +0x012 PaddingBytes     : Uint2B
   +0x014 Signature        : Uint4B
   +0x018 EncodedOffsets   : _HEAP_USERDATA_OFFSETS
   +0x020 BusyBitmap       : _RTL_BITMAP_EX
   +0x030 BitmapData       : [1] Uint8B

There is 2 important things to note here:

  • The 
     is not encoded like the 
    . A 
     is also a regullar chunk, so it is in the “user data” part of another 
  • Signature
     is always 
    , so we can easily find it in the heap.
  • GuardPagePresent
    : if this is non zero, the 
     has a page guard at the end, so we can skip 
     bytes at the end, preventing crashes.
  • BusyBitmap
     contains the address of 
    . This can be used as a reliable way to leak heap address too

Knowing that all the 

 sprayed by us will be served by LFH, I will only find them in a 
. This makes the exploit run a lot of faster than the first exploit I made.

More success rate, more speed

I also noted that each time I want to send a HGCM message, I have to create a 

. Since there are many 
s being allocated when spraying, I also look for their 

One more thing is that every chunk address will have to be the multiple of 

, so I only read qwords at these locations to find 
. This will also increase the speed of my exploit.


I would like to give a special thanks to my mentor @bienpnn, who was actively helping me throughout the project. This is my first time exploiting a real Windows software so it is really fun. After this project, I learned more about Windows heap internal, how a hypervisor works, how to debug Windows kernel and a ton of other knowledge. I hope this post can help you if you are about to target VirtualBox, and see you in another blog post!

Exploiting CVE-2021-3490 for Container Escapes

Exploiting CVE-2021-3490 for Container Escapes

Original text by Karsten König

Today, containers are the preferred approach to  deploy software or create build environments in CI/CD lifecycles. However, since the emergence of container solutions and environments like Docker and Kubernetes, security researchers have consistently found ways to escape from containers once they are compromised. Most attacks are based on configuration errors. But it is also possible to escalate privileges and escape to the container’s host system by exploiting vulnerabilities in the host’s operating system.

This blog shows how to modify an existing Linux kernel exploit in order to use it for container escapes and how the CrowdStrike Falcon® platform can help to prevent and hunt for similar threats.

Original Technique

Before we outline the modifications required to turn the exploit into a container escape, we first look at what the original exploit achieved.

Valentina Palmiotti published a full exploit for CVE-2021-3490 that can be used to locally escalate privileges to root on affected systems. The vulnerability was rooted in the eBPF subsystem of the Linux kernel and fixed in version 5.10.37. eBPF allows user space processes to load custom programs into the kernel and attach them to so-called events, thus giving user space the ability to observe kernel internals and, in specifically supported cases, to implement custom logic for networking, access control and other tasks. These eBPF programs have to pass a verifier before being loaded, which is supposed to guarantee that the code does not contain loops and does not write to memory outside of its dedicated area. This step should ensure that eBPF programs terminate and are not able to manipulate kernel memory, which would potentially allow attackers to escalate privileges. However, this verifier contained several vulnerabilities in the past. CVE-2021-3490  is one of them and can ultimately be used to achieve a kernel read and write primitive.

Building on the kernel read primitive, it is possible to leak a kernel pointer. eBPF programs can communicate with processes running in user space using so-called “eBPF maps.”  Every eBPF map is described by a 

struct bpf_map
 object, which contains a  field 
 pointing to a 
struct bpf_map_ops
. That struct contains several function pointers for working with the eBPF map. eBPF maps come in different kinds with different definitions for 
 stored at known offsets. For array maps, 
 will be set to point to the kernel symbol 
. The exploit will leak that address and then use it as a starting point to further scan the kernel’s memory space and read pointers from the kernel’s symbol table.

The kernel exports pointers to certain variables, objects and functions in a symbol table to make them accessible by kernel modules. This table is called 

. In order to look up the actual name of a stored symbol address, a second table, called 
, is utilized. A pointer to the string in 
 that contains the name is stored as part of every 
 entry, right after the pointer to the symbol itself. To find the address of a kernel symbol, the exploit first reads memory from kernel space starting at the leaked address of 
 using the arbitrary read primitive. This is done until the string containing the symbol name of interest is found in 


 is mapped after 
, the previously read memory region should contain the pointer to the string in 
. Therefore, the exploit then proceeds to search for that pointer, and the pointer to the actual symbol is stored right before it.

One clarification has to be made about the above code excerpt: There are two different formats of 

. Which one is used is decided when the Linux kernel is built from source by the configuration parameter 
. In one case, the actual addresses of the symbols and 
 entries are stored. However, in many kernel builds it does not actually contain pointers but offsets such that the address where the offset is stored plus the offset itself is the symbol’s address or the string’s address in 

Nevertheless, using this technique, the exploit identifies the address of 

. This object is a 
struct pid_namespace
 and describes the default process ID namespace new processes are started in.

Namespaces have become a  fundamental feature of Linux and are crucial to the idea of container environments. They allow separating system resources between different processes such that one process can observe a completely different set than others. For example, mount namespaces control observable filesystem mount points such that two processes can have different views of the filesystem. This allows a container’s filesystem to have a different root directory than the host. Process ID namespaces on the other side give processes a completely unique process tree. The first process in a process ID namespace always has the identifier (PID) 1. It is considered as the 

 process that initializes the operating system and from which new processes originate. Therefore, if this process is stopped, all other processes in the particular process ID namespace are stopped as well.

By identifying 

, it is possible to enumerate all 
struct task
 objects of the processes running in that namespace as those are stored in a traversable radix tree in the field 
. The exploit identifies the correct 
 object by its PID. Those 
 objects store a pointer to a 
struct cred
 object that contains the UID and GID (user and group identifier) associated with the process and therefore holds the granted permissions. By overwriting the 
 object of a process, it is possible to escalate privileges by setting the UID and GID to 0, which is associated with the 

However, this approach does not work if a container was compromised and the attacker’s goal is to escape into the container’s host environment.

Why This Doesn’t Work in Containers

Linux kernel exploits are an alternative method to escape container environments to the host in case no mistakes in the container configuration were made. They can be used because containers share the host’s kernel and therefore its vulnerabilities, regardless of the Linux distribution the container is based on. However, exploit developers have to pay attention to some obstacles compared to privilege escalation outside of container environments.

First, container solutions are able to restrict the capabilities of processes running inside a container. For example, the capability 

 is normally not granted to processes running in containers, which can therefore not mount file systems or execute various other privileged actions. Moreover, it is possible to restrict the set of syscalls a userland process can call by utilizing 
. For example, in the default configuration of Docker, an exploit would not be able to use eBPF at all. Nevertheless, in the default Kubernetes configuration, 
 does not restrict the available syscalls at all. For the remainder of this post, though, we will assume that the container is configured such that eBPF could be used by userland processes.

Second, on a more practical note, the techniques of the original exploit described above will not work out-of-the-box. As already described, containers rely heavily on namespaces. Because containers typically have their own associated process ID namespace, it is not as straightforward to identify the exploit process running in the container by its PID, because, for example, the exploit may have PID 42 from the container’s perspective but PID 1337 from the host’s perspective. However, the parent namespace can still observe all processes running in child namespaces. Therefore, those processes have a PID in both parent and child namespace. Ultimately, the initial process ID namespace described by 

 can observe any process running on a particular system. Nevertheless, even if we identify the 
 structure of our exploit process within a container, overwriting its 
 object as described previously will simply elevate privileges within the container but not allow container escape.

Changes for Container Escapes

It is possible to modify the exploit so that a container escape is conducted and privileges are escalated to 

 on the host. To easily find the exploit process in a container, an exploit can search for the symbol 
 symbols in 
 stores the offset to the running process’s 
 object based on the address stored in 
. Because 
 is unique per CPU core, the process must be pinned before on one core using the 
 system call

Using this technique it is possible to identify the correct 

 object without traversing the radix tree of all processes stored in 

This allows the attacker to overwrite the correct 

 object and therefore obtain 
privileges. Due to the usage of namespaces, the observable file system is still that of the container, though. Nevertheless, it is possible to overcome that obstacle as well. The 
object contains a pointer to a 
struct fs_struct
 object. This object contains information about the observable file system, i.e.,which directory is considered as the processes’ file system root. Using the leaked pointer to 
, it is possible to traverse the process radix tree and identify the host’s 
 process, which has PID 1. Next, it is possible to retrieve the 
pointer from this process’s 
 object. Lastly, while overwriting the 
 object of the exploit process, the 
 pointer must be overwritten as well using the 
 pointer. The exploit process can then observe the complete host file system.

One last addition must be made. As stated above, containers normally have limited capabilities. Capabilities are used to restrict the permissions of processes running in containers. To obtain full privileges, the exploit also has to overwrite the capabilities mask of the exploit’s process in the 

 object. How exactly the values must be set to obtain full capabilities without any restrictions can be investigated in the definition of the 
 process’ credentials

The technique described in this blog to identify the 

 object of the exploit’s process only works on Linux kernel version before 5.15, as 
 is no longer exported as a symbol to 
. Nevertheless, alternative methods exist to find the correct 
 object, e.g., by traversing the radix tree of all processes from 
 and matching on features of the exploit process other than the PID, such as the 
 member of 
struct task
 that contains the executable name.

Container Escape Mitigations

Detecting this and similar exploits is very hard as they are data-only and misuse only legitimate system calls. The CrowdStrike Falcon platform can assist in preventing attacks using similar techniques for privilege escalation. As a defense-in-depth strategy, the following steps can be taken to harden Linux hosts and container environments to prevent exploitation of CVE-2021-3490 and future attacks.

  1. Upgrade the kernel version. With a critical kernel vulnerability like CVE-2021-3490, it is paramount that available fixes are applied by upgrading the kernel version.
  2. Provide only required capabilities to the container. By limiting the capabilities of the container, the root account of the container becomes limited in its capabilities, which significantly reduces the chances of container escape and exploitation of kernel vulnerabilities. For example, to exploit the CVE-2021-3490 using the described technique, the attacker needs CAP_BPF or CAP_SYS_ADMIN granted. Note that privileged containers have those capabilities. Therefore, you should monitor your environment for such containers with CrowdStrike Falcon® Cloud Workload Protection (CWP), as discussed in point 4 below.
  3. Use a seccomp profile. While Kubernetes does not apply a seccomp profile without configuration, Docker’s default seccomp profiles protect against a number of dangerous system calls that can help attackers to break out of the container environment. Correct Seccomp profiles can help significantly reduce the container attack surface. CVE-2021-3490 requires the 
     system call to exploit the vulnerability, which is blocked in Docker’s default seccomp profile. Hence, exploitation of CVE-2021-3490 in a container environment using a strong seccomp profile would fail.
  4. Monitor host and containerized environment for a breach. In case a privileged workload or a host is compromised by attackers, the organization needs state-of-the-art monitoring and detection capabilities to prevent and detect advanced persistent threats (APTs), eCrime and nation-state actors. CrowdStrike can help with this. Falcon Cloud Workload Protection identifies any indicators of misconfiguration (IOMs) in your containerized environment to uncover a weakness. Falcon Cloud Workload Protection prevents and detects malicious activity on your host and containers to prevent and detect — in real time — breaches by eCrime and nation-state adversaries. For example, if a privileged container or a container without a seccomp profile is executed, the following notifications would appear:

Also, Falcon CWP helps to hunt for threats using the eBPF subsystem to escalate privileges by logging if the 

 system call was used by a process.


Container technology is a good solution to separate and fine-tune resources to different processes. However, while existing solutions add another layer of security due to the restriction of capabilities and available syscalls, the available attack surface inside a container still contains the host’s kernel. Every eased restriction — for example, allowing the use of eBPF — will increase the attack surface. If a threat actor is able to take advantage of a vulnerability inside the host’s kernel and an exploit is available, the host can be compromised, regardless of other security layers and restrictions such as namespaces.

This blog showed exactly that: Not much effort is needed to turn a full exploit chain for a local privilege escalation into one that is able to escape containers as well. The basic rules of network hygiene (patch early and often) not only apply to containers but to the hosts that deploy those in a cloud environment as well. Moreover, solutions such as Docker and Kubernetes can reduce the attack surface drastically if configured properly. CrowdStrike Falcon Cloud Workload Protectioncan assist in identifying and hunting for weaknesses in the deployed configuration that could lead to a compromise.

Additional Resources