CVE-2021-1647: Windows Defender mpengine remote code execution

Microsoft Defender Remote Code Execution Vulnerability

Original text by Maddie Stone

The Basics

Disclosure or Patch Date: 12 January 2021

Product: Microsoft Windows Defender

Advisory: https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-1647

Affected Versions: Version 1.1.17600.5 and previous

First Patched Version: Version 1.1.17700.4

Issue/Bug Report: N/A

Patch CL: N/A

Bug-Introducing CL: N/A

Reporter(s): Anonymous

The Code

Proof-of-concept:

Exploit sample: 6e1e9fa0334d8f1f5d0e3a160ba65441f0656d1f1c99f8a9f1ae4b1b1bf7d788

Did you have access to the exploit sample when doing the analysis? Yes

The Vulnerability

Bug class: Heap buffer overflow

Vulnerability details:

There is a heap buffer overflow when Windows Defender (mpengine.dll) processes the section table while unpacking an ASProtect-packed executable. Each section entry has two values: the virtual address and the size of the section. The code in CAsprotectDLLAndVersion::RetrieveVersionInfoAndCreateObjects only checks whether the next section entry’s address is lower than the previous one, not whether they are equal. This means that if you have a section table such as the one used in this exploit sample, [ (0,0), (0,0), (0x2000,0), (0x2000,0x3000) ], 0 bytes are allocated for the section at address 0x2000, but when the code sees the next entry at 0x2000, it simply skips over it without exiting or updating the size of the section. 0x3000 bytes will then be copied to that section during decompression, leading to the heap buffer overflow.


if ( next_sect_addr > sect_addr ) // current va is greater than prev (not also eq)
{
    sect_addr = next_sect_addr;
    sect_sz = (next_sect_sz + 0xFFF) & 0xFFFFF000;
}
// if next_sect_addr <= sect_addr we continue on to next entry in the table

[...]
            new_sect_alloc = operator new[](sect_sz + sect_addr); // allocate new section
[...]
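To make the skip concrete, here is a small, self-contained C model of the flawed walk over the sample’s section table (names and structure are simplified for illustration; this is not the decompiled mpengine code):

#include <stdio.h>

// Simplified model of the section-table walk shown above:
// a (virtual address, size) pair per section entry.
struct section { unsigned int addr, size; };

int main(void) {
    struct section table[] = {
        {0, 0}, {0, 0}, {0x2000, 0}, {0x2000, 0x3000}
    };
    unsigned int sect_addr = 0, sect_sz = 0;

    for (int i = 0; i < 4; i++) {
        // Bug: strictly-greater comparison. An entry whose address
        // EQUALS the previous one is silently skipped, so its size
        // never feeds into the allocation.
        if (table[i].addr > sect_addr) {
            sect_addr = table[i].addr;
            sect_sz = (table[i].size + 0xFFF) & 0xFFFFF000;
        }
    }
    // The buffer is sized from the last entry that passed the check:
    // 0 bytes for the section at 0x2000, yet decompression later
    // writes 0x3000 bytes into it.
    printf("alloc size: 0x%x\n", sect_sz + sect_addr); // prints 0x2000
    return 0;
}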

Patch analysis: There are quite a few changes to the function CAsprotectDLLAndVersion::RetrieveVersionInfoAndCreateObjects between version 1.1.17600.5 (vulnerable) and 1.1.17700.4 (patched). The directly related change was to add an else branch to the comparison, so that if any entry in the section array has an address less than or equal to the previous entry’s, the code errors out and exits rather than continuing to decompress.
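A sketch of what the patched comparison plausibly looks like, based on that description (the exact error path is assumed, not decompiled):

if ( next_sect_addr > sect_addr )
{
    sect_addr = next_sect_addr;
    sect_sz = (next_sect_sz + 0xFFF) & 0xFFFFF000;
}
else
{
    // New in 1.1.17700.4 (sketch): a non-increasing virtual address is
    // treated as a malformed section table; bail out instead of
    // continuing on to decompression.
    goto error_exit; // exact error handling assumed for illustration
}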

Thoughts on how this vuln might have been found (fuzzing, code auditing, variant analysis, etc.):

It seems possible that this vulnerability was found through fuzzing or manual code review. If the ASProtect unpacking code was included from an external library, that would have made finding this vulnerability even more straightforward, for fuzzing and manual review alike.
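If the unpacker were reachable in isolation, a fuzzing harness would be short. The sketch below is entirely hypothetical; asprotect_unpack and its signature are invented for illustration:

#include <stddef.h>
#include <stdint.h>

// Hypothetical standalone entry point into the ASProtect unpacking code.
int asprotect_unpack(const uint8_t *buf, size_t len);

// libFuzzer harness: feed mutated packed-executable bytes (including
// malformed section tables) straight into the parser.
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    asprotect_unpack(data, size);
    return 0;
}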

(Historical/present/future) context of bug:

The Exploit

(The terms exploit primitive, exploit strategy, exploit technique, and exploit flow are defined here.)

Exploit strategy (or strategies):

  1. The heap buffer overflow is used to overwrite the data in an object stored as the first field of the lfind_switch object, which is allocated in the lfind_switch::switch_out function.
  2. The two fields that were overwritten in the object pointed to by the lfind_switch object are used as indices in lfind_switch::switch_in. Because there is no bounds checking on these indices, another out-of-bounds write can occur.
  3. The out-of-bounds write in step 2 performs an or operation on the field in the VMM_context_t struct (the virtual memory manager within Windows Defender) that stores the length of a table tracking the virtually mapped pages. This field usually equals the number of pages mapped * 2. The or operations increase the value in that field (for example, from 0x0000000C to 0x0003030C; see the sketch below). The inflated length allows an additional out-of-bounds read and write, used to modify the memory-management struct and gain arbitrary read/write.

Exploit flow:

The exploit uses “primitive bootstrapping”: the original buffer overflow is used to cause two additional out-of-bounds writes, ultimately gaining arbitrary read/write.

Known cases of the same exploit flow: Unknown.

Part of an exploit chain? Unknown.

The Next Steps

Variant analysis

Areas/approach for variant analysis (and why):

  • Review ASProtect unpacker for additional parsing bugs.
  • Review and/or fuzz other unpacking code for parsing and memory issues.

Found variants: N/A

Structural improvements

What are structural improvements such as ways to kill the bug class, prevent the introduction of this vulnerability, mitigate the exploit flow, make this type of vulnerability harder to exploit, etc.?

Ideas to kill the bug class:

  • Building mpengine.dll with ASAN enabled should allow bugs of this class to be caught.
  • Open-sourcing the unpackers could allow more folks to find issues in this code and potentially catch issues like this more readily.

Ideas to mitigate the exploit flow:

  • Add bounds checking wherever indices are used (a sketch follows below). For example, if there had been bounds checking on the indices used in lfind_switch::switch_in, it would have prevented the second out-of-bounds write, which allowed this exploit to modify the VMM_context_t structure.
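A minimal sketch of the kind of check meant here (the function and field names are invented; the real structures are internal to mpengine):

// Validate an attacker-influenced index before using it to write.
int switch_in_checked(unsigned int idx, unsigned int table_len,
                      unsigned int *table, unsigned int value) {
    if (idx >= table_len) {
        return -1; // reject out-of-range index instead of writing OOB
    }
    table[idx] |= value;
    return 0;
}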

Other potential improvements:

It appears that by default the Windows Defender emulator runs outside of a sandbox. In 2018, Microsoft published an article stating that Windows Defender Antivirus can now run in a sandbox. The article states that when sandboxing is enabled, you will see a content process MsMpEngCp.exe running in addition to MsMpEng.exe. By default, on Windows 10 machines, I only see MsMpEng.exe running as SYSTEM. Sandboxing the anti-malware emulator by default would make this vulnerability more difficult to exploit, because a sandbox escape would then be required in addition to this vulnerability.

0-day detection methods

What are potential detection methods for similar 0-days? Meaning are there any ideas of how this exploit or similar exploits could be detected as a 0-day?

  • Detecting these types of 0-days will be difficult, because the sample simply drops a new file with the characteristics needed to trigger the vulnerability, such as a section table that includes the same virtual address twice. The exploit method also did not require anything that especially stands out.

Other References

Compiling C to WebAssembly without Emscripten

Original text by Surma

A compiler is just a part of Emscripten. What if we stripped away all the bells and whistles and used just the compiler?

Emscripten is a compiler toolchain for C/C++ targeting WebAssembly. But it does so much more than just compiling. Emscripten’s goal is to be a drop-in replacement for your off-the-shelf C/C++ compiler and make code that was not written for the web run on the web. To achieve this, Emscripten emulates an entire POSIX operating system for you. If your program uses fopen(), Emscripten will bundle the code to emulate a filesystem. If you use OpenGL, Emscripten will bundle code that creates a C-compatible GL context backed by WebGL. That requires a lot of work and also amounts to a lot of code that you need to send over the wire. What if we just… didn’t?

The compiler in Emscripten’s toolchain, the program that translates C code to WebAssembly byte-code, is LLVM. LLVM is a modern, modular compiler framework. LLVM is modular in the sense that it never compiles one language straight to machine code. Instead, it has a front-end compiler that compiles your code to an intermediate representation (IR). This IR is called LLVM, as it is modeled around a low-level virtual machine, hence the name of the project. The back-end compiler then takes care of translating the IR to the host’s machine code. The advantage of this strict separation is that adding support for a new architecture “merely” requires adding a new back-end compiler. WebAssembly, in that sense, is just one of many targets that LLVM supports, and it has been available behind a flag for a while. Since version 8 of LLVM, the WebAssembly target is available by default. If you are on macOS, you can install LLVM using homebrew:


$ brew install llvm
$ brew link --force llvm

To make sure you have WebAssembly support, we can go and check the back-end compiler:


$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-apple-darwin18.5.0
  Host CPU: skylake

  Registered Targets:
    # … OMG so many architectures …
    systemz    - SystemZ
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    wasm32     - WebAssembly 32-bit # 🎉🎉🎉
    wasm64     - WebAssembly 64-bit
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64
    xcore      - XCore

Seems like we are good to go!

Compiling C the hard way

Note: We’ll be looking at some low-level file formats like raw WebAssembly here. If you are struggling with that, that is ok. You don’t need to understand this entire blog post to make good use of WebAssembly. If you are here for the copy-pastables, look at the compiler invocation in the “Optimizing” section. But if you are interested, keep going! I also wrote an introduction to Raw Webassembly and WAT previously which covers the basics needed to understand this post.

Warning: I’ll bend over backwards here for a bit and use human-readable formats for every step of the process (as much as possible). Our program for this journey is going to be super simple to avoid edge cases and distractions:


// Filename: add.c
int add(int a, int b) {
  return a*a + b;
}

What a mind-boggling feat of engineering! Especially because it’s called “add” but doesn’t actually add. More importantly: This program makes no use of C’s standard library and only uses `int` as a type.

Turning C into LLVM IR

The first step is to turn our C program into LLVM IR. This is the job of the front-end compiler clang that got installed with LLVM:


clang \
  --target=wasm32 \ # Target WebAssembly
  -emit-llvm \ # Emit LLVM IR (instead of host machine code)
  -c \ # Only compile, no linking just yet
  -S \ # Emit human-readable assembly rather than binary
  add.c

And as a result we get add.ll containing the LLVM IR. I’m only showing this here for completeness’ sake. When working with WebAssembly, or even with clang when developing C, you never come into contact with LLVM IR.


; ModuleID = 'add.c'
source_filename = "add.c"
target datalayout = "e-m:e-p:32:32-i64:64-n32:64-S128"
target triple = "wasm32"

; Function Attrs: norecurse nounwind readnone
define hidden i32 @add(i32, i32) local_unnamed_addr #0 {
  %3 = mul nsw i32 %0, %0
  %4 = add nsw i32 %3, %1
  ret i32 %4
}

attributes #0 = { norecurse nounwind readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 8.0.0 (tags/RELEASE_800/final)"}

LLVM IR is full of additional meta data and annotations, allowing the back-end compiler to make more informed decisions when generating machine code.

Turning LLVM IR into object files

The next step is invoking LLVM’s back-end compiler llc to turn the LLVM IR into an object file:


llc \
  -march=wasm32 \ # Target WebAssembly
  -filetype=obj \ # Output an object file
  add.ll

The output, add.o, is effectively a valid WebAssembly module and contains all the compiled code of our C file. However, most of the time you won’t be able to run object files as essential parts are still missing.

If we omitted -filetype=obj we’d get LLVM’s assembly format for WebAssembly, which is human-readable and somewhat similar to WAT. However, the tool that can consume these files, llvm-mc, does not fully support this text format yet and often fails to consume the output of llc. So instead we’ll disassemble the object files after the fact. Object files are target-specific and therefore need a target-specific tool to inspect them. In the case of WebAssembly, the tool is wasm-objdump, which is part of the WebAssembly Binary Toolkit, or wabt for short.


$ brew install wabt # in case you haven’t
$ wasm-objdump -x add.o

add.o:  file format wasm 0x1

Section Details:

Type[1]:
 - type[0] (i32, i32) -> i32
Import[3]:
 - memory[0] pages: initial=0 <- env.__linear_memory
 - table[0] elem_type=funcref init=0 max=0 <- env.__indirect_function_table
 - global[0] i32 mutable=1 <- env.__stack_pointer
Function[1]:
 - func[0] sig=0 <add>
Code[1]:
 - func[0] size=75 <add>
Custom:
 - name: "linking"
  - symbol table [count=2]
   - 0: F <add> func=0 binding=global vis=hidden
   - 1: G <env.__stack_pointer> global=0 undefined binding=global vis=default
Custom:
 - name: "reloc.CODE"
  - relocations for section: 3 (Code) [1]
   - R_WASM_GLOBAL_INDEX_LEB offset=0x000006(file=0x000080) symbol=1 <env.__stack_pointer>

The output shows that our add() function is in this module, but it also contains custom sections filled with metadata and, surprisingly, a couple of imports. In the next phase, called linking, the custom sections will be analyzed and removed and the imports will be resolved by the linker.

Linking

Traditionally, the linker’s job is to assemble multiple object files into an executable. LLVM’s linker is called lld, but it has to be invoked via one of the target-specific symlinks. For WebAssembly there is wasm-ld.


wasm-ld \
  --no-entry \ # We don’t have an entry function
  --export-all \ # Export everything (for now)
  -o add.wasm \
  add.o

The output is a 262-byte WebAssembly module.

Running it

Of course the most important part is to see that this actually works. As we did in the previous blog post, we can use a couple lines of inline JavaScript to load and run this WebAssembly module.


<!DOCTYPE html>

<script type="module">
  async function init() {
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch("./add.wasm")
    );
    console.log(instance.exports.add(4, 1));
  }
  init();
</script>

If nothing went wrong, you should see a 17 in your DevTools console. We just successfully compiled C to WebAssembly without touching Emscripten. It’s also worth noting that there is no glue code required to set up and load the WebAssembly module.

Compiling C the slightly less hard way

The number of steps we currently have to take to get from C code to WebAssembly is a bit daunting. As I said, I was bending over backwards for educational purposes. Let’s stop doing that, skip all the human-readable intermediate formats, and use the C compiler as the Swiss Army knife it was designed to be:


clang \
  --target=wasm32 \
  -nostdlib \ # Don’t try and link against a standard library
  -Wl,--no-entry \ # Flags passed to the linker
  -Wl,--export-all \
  -o add.wasm \
  add.c

This will produce the same .wasm file as before, but with a single command.

Optimizing

Let’s take a look at the WAT of our WebAssembly module by running wasm2wat:


(module
  (type (;0;) (func))
  (type (;1;) (func (param i32 i32) (result i32)))
  (func $__wasm_call_ctors (type 0))
  (func $add (type 1) (param i32 i32) (result i32)
    (local i32 i32 i32 i32 i32 i32 i32 i32)
    global.get 0
    local.set 2
    i32.const 16
    local.set 3
    local.get 2
    local.get 3
    i32.sub
    local.set 4
    local.get 4
    local.get 0
    i32.store offset=12
    local.get 4
    local.get 1
    i32.store offset=8
    local.get 4
    i32.load offset=12
    local.set 5
    local.get 4
    i32.load offset=12
    local.set 6
    local.get 5
    local.get 6
    i32.mul
    local.set 7
    local.get 4
    i32.load offset=8
    local.set 8
    local.get 7
    local.get 8
    i32.add
    local.set 9
    local.get 9
    return)
  (table (;0;) 1 1 anyfunc)
  (memory (;0;) 2)
  (global (;0;) (mut i32) (i32.const 66560))
  (global (;1;) i32 (i32.const 66560))
  (global (;2;) i32 (i32.const 1024))
  (global (;3;) i32 (i32.const 1024))
  (export "memory" (memory 0))
  (export "__wasm_call_ctors" (func $__wasm_call_ctors))
  (export "__heap_base" (global 1))
  (export "__data_end" (global 2))
  (export "__dso_handle" (global 3))
  (export "add" (func $add)))

Wowza, that’s a lot of WAT. To my surprise, the module uses memory (indicated by the i32.load and i32.store operations), 8 local variables and a couple of globals. If you think you’d be able to write a shorter version by hand, you’d probably be right. The reason this program is so big is that we didn’t have any optimizations enabled. Let’s change that:


 clang \
   --target=wasm32 \
+  -O3 \ # Aggressive optimizations
+  -flto \ # Add metadata for link-time optimizations
   -nostdlib \
   -Wl,--no-entry \
   -Wl,--export-all \
+  -Wl,--lto-O3 \ # Aggressive link-time optimizations
   -o add.wasm \
   add.c

Note: Technically, link-time optimizations don’t bring us any gains here as we are only linking a single file. In bigger projects, LTO will help you keep your file size down.

After running the commands above, our .wasm file went down from 262 bytes to 197 bytes, and the WAT is much easier on the eye, too:


(module
  (type (;0;) (func))
  (type (;1;) (func (param i32 i32) (result i32)))
  (func $__wasm_call_ctors (type 0))
  (func $add (type 1) (param i32 i32) (result i32)
    local.get 0
    local.get 0
    i32.mul
    local.get 1
    i32.add)
  (table (;0;) 1 1 anyfunc)
  (memory (;0;) 2)
  (global (;0;) (mut i32) (i32.const 66560))
  (global (;1;) i32 (i32.const 66560))
  (global (;2;) i32 (i32.const 1024))
  (global (;3;) i32 (i32.const 1024))
  (export "memory" (memory 0))
  (export "__wasm_call_ctors" (func $__wasm_call_ctors))
  (export "__heap_base" (global 1))
  (export "__data_end" (global 2))
  (export "__dso_handle" (global 3))
  (export "add" (func $add)))

Calling into the standard library

Now, C without the standard library (called “libc”) is pretty rough. It seems logical to look into adding a libc as a next step, but I’m going to be honest: it’s not going to be easy. I actually won’t link against any libc in this blog post. There are a couple of libc implementations out there that we could grab, most notably glibc, musl and dietlibc. However, most of these libraries expect to run on a POSIX operating system, which implements a specific set of syscalls (calls to the system’s kernel). Since we don’t have a kernel interface in JavaScript, we’d have to implement these POSIX syscalls ourselves, probably by calling out to JavaScript. This is quite the task and I am not going to do that here. The good news is: this is exactly what Emscripten does for you.

Not all of libc’s functions rely on syscalls, of course. Functions like strlen(), sin() or even memset() are implemented in plain C. That means you could use these functions or even just copy/paste their implementation from one of the libraries above.
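For example, a syscall-free strlen() is just a loop over memory, so it is easy to bring along. A minimal sketch (not copied from any particular libc):

// strlen() needs no operating system: it only walks memory
// until it finds the terminating NUL byte.
unsigned long strlen(const char *s) {
    const char *p = s;
    while (*p) {
        p++;
    }
    return (unsigned long)(p - s);
}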

Dynamic memory

With no libc at hand, fundamental C APIs like malloc() and free() are not available. In our unoptimized WAT above we have seen that the compiler will make use of memory if necessary. That means we can’t just use the memory however we like without risking corruption. We need to understand how that memory is used.

LLVM’s memory model

The way the WebAssembly memory is segmented is dictated by wasm-ld and might take C veterans a bit by surprise. Firstly, address 0 is technically valid in WebAssembly, but a lot of C code will still treat it as an error case. Secondly, the stack comes first and grows downwards (towards lower addresses), and the heap comes after and grows upwards. The reason for this is that WebAssembly memory can grow at runtime, so there is no fixed end at which to place the stack or the heap.

The layout that wasm-ld uses is the following:

[Figure: a depiction of the memory layout produced by wasm-ld.]
The stack grows downwards and the heap grows upwards. The stack starts at __data_end, the heap starts at __heap_base. Because the stack is placed first, it is limited to a maximum size set at compile time, which is __heap_base - __data_end.
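Since wasm-ld defines both symbols, we can even compute that limit from C by declaring them extern (a small sketch using the same trick we’ll use for the allocator below):

extern unsigned char __data_end;
extern unsigned char __heap_base;

// Maximum stack size as laid out by wasm-ld: the gap between the
// end of static data and the start of the heap.
unsigned long max_stack_size(void) {
    return (unsigned long)(&__heap_base - &__data_end); // 65536 by default
}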

If we look back at the globals section in our WAT, we can find these symbols defined: __heap_base is 66560 and __data_end is 1024. This means that the stack can grow to a maximum of 64KiB, which is not a lot. Luckily, wasm-ld allows us to configure this value:


 clang \
   --target=wasm32 \
   -O3 \
   -flto \
   -nostdlib \
   -Wl,--no-entry \
   -Wl,--export-all \
   -Wl,--lto-O3 \
+  -Wl,-z,stack-size=$[8 * 1024 * 1024] \ # Set maximum stack size to 8MiB
   -o add.wasm \
   add.c

Building an allocator

We now know that the heap region starts at __heap_base, and since there is no malloc() function, we know that the memory region from there on upwards is ours to control. We can place data in there however we like and don’t have to fear corruption, as the stack grows the other way. Leaving the heap as a free-for-all can get hairy quickly, though, so usually some sort of dynamic memory management is needed. One option is to pull in a full malloc() implementation like Doug Lea’s malloc implementation, which is used by Emscripten today. There are also a couple of smaller implementations with different tradeoffs.

But why don’t we write our own malloc()? We are in this deep, we might as well. One of the simplest allocators is a bump allocator. The advantages: it’s super fast, extremely small and simple to implement. The downside: you can’t free memory. While this seems incredibly useless at first sight, I have encountered use-cases while working on Squoosh where this would have been an excellent choice. The concept of a bump allocator is that we store the start address of unused memory as a global. If the program requests n bytes of memory, we advance that marker by n and return the previous value:


extern unsigned char __heap_base;

unsigned char* bump_pointer = &__heap_base;

void* malloc(int n) {
  unsigned char* r = bump_pointer;
  bump_pointer += n;
  return (void*)r;
}

void free(void* p) {
  // lol
}

The globals we saw in the WAT are actually defined by wasm-ld, which means we can access them from our C code as normal variables if we declare them as extern. We just wrote our own malloc() in, like, 5 lines of C 😱

Note: Our bump allocator is not fully compatible with C’s malloc(). For example, we don’t make any alignment guarantees. But it’s good enough and it works, so 🤷‍♂️.
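If alignment ever does matter, rounding the bump pointer up is a one-line change. A sketch, assuming 8-byte alignment is enough for our purposes:

extern unsigned char __heap_base;

unsigned char* bump_pointer = &__heap_base;

void* aligned_malloc(unsigned long n) {
  // Round the current pointer up to the next multiple of 8
  // before handing it out.
  unsigned long p = (unsigned long)bump_pointer;
  p = (p + 7) & ~7UL;
  bump_pointer = (unsigned char*)(p + n);
  return (void*)p;
}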

Using dynamic memory

To prove that this actually works, let’s build a C function that takes an arbitrary-sized array of numbers and calculates the sum. Not very exciting, but it does force us to use dynamic memory, as we don’t know the size of the array at build time:


int sum(int a[], int len) {
  int sum = 0;
  for(int i = 0; i < len; i++) {
    sum += a[i];
  }
  return sum;
}

The sum() function is hopefully straightforward. The more interesting question is how we can pass an array from JavaScript to WebAssembly, since WebAssembly only understands numbers. The general idea is to use malloc() from JavaScript to allocate a chunk of memory, copy the values into that chunk, and pass the address (a number!) of where the array is located:


<!DOCTYPE html>

<script type="module">
  async function init() {
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch("./add.wasm")
    );

    const jsArray = [1, 2, 3, 4, 5];
    // Allocate memory for 5 32-bit integers
    // and get the starting address.
    const cArrayPointer = instance.exports.malloc(jsArray.length * 4);
    // Turn that sequence of 32-bit integers
    // into a Uint32Array, starting at that address.
    const cArray = new Uint32Array(
      instance.exports.memory.buffer,
      cArrayPointer,
      jsArray.length
    );
    // Copy the values from JS to C.
    cArray.set(jsArray);
    // Run the function, passing the starting address and length.
    console.log(instance.exports.sum(cArrayPointer, cArray.length));
  }
  init();
</script>

When running this you should see a very happy 15 in the DevTools console, which is indeed the sum of all the numbers from 1 to 5.

You made it to the end. Congratulations! Again, if you feel a bit overwhelmed, that’s okay: This is not required reading. You do not need to understand all of this to be a good web developer or even to make good use of WebAssembly. But I did want to share this journey with you as it really makes you appreciate all the work that a project like Emscripten does for you. At the same time, it gave me an understanding of how small purely computational WebAssembly modules can be. The Wasm module for the array summing ended up at just 230 bytes, including an allocator for dynamic memory. Compiling the same code with Emscripten would yield 100 bytes of WebAssembly accompanied by 11K of JavaScript glue code. It took a lot of work to get there, but there might be situations where it is worth it.