Did you have access to the exploit sample when doing the analysis? Yes
The Vulnerability
Bug class: Heap buffer overflow
Vulnerability details:
There is a heap buffer overflow when Windows Defender (mpengine.dll) processes the section table while unpacking an ASProtect-packed executable. Each section entry has two values: the virtual address and the size of the section. The unpacking code only checks whether the next section entry’s address is lower than the previous one, not whether they are equal. This means that if you have a section table such as the one used in this exploit sample:
[ (0,0), (0,0), (0x2000,0), (0x2000,0x3000) ]
0 bytes are allocated for the section at address 0x2000, but when the code sees the next entry at 0x2000, it simply skips over it without exiting or updating the size of the section. 0x3000 bytes will then be copied to that section during decompression, leading to the heap buffer overflow.
if ( next_sect_addr > sect_addr )      // current va is greater than prev (not also eq)
{
    sect_addr = next_sect_addr;
    sect_sz = (next_sect_sz + 0xFFF) & 0xFFFFF000;
}
// if next_sect_addr <= sect_addr we continue on to the next entry in the table
The fix is visible when diffing mpengine.dll between version 1.1.17600.5 (vulnerable) and 1.1.17700.4 (patched). The directly related change was to add an else branch to the comparison so that if any entry in the section array has an address less than or equal to the previous entry’s, the code errors out and exits rather than continuing to decompress.
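Based on that description, the patched check plausibly behaves like the sketch below, following the snippet above (the else body and error value are assumptions inferred from the described behavior, not the actual decompiled code):
if ( next_sect_addr > sect_addr )      // strictly increasing VAs are still accepted
{
    sect_addr = next_sect_addr;
    sect_sz = (next_sect_sz + 0xFFF) & 0xFFFFF000;
}
else                                   // new in 1.1.17700.4
{
    // A duplicate or decreasing address is treated as a malformed
    // section table: error out and stop unpacking instead of
    // continuing with the stale (too small) section size.
    return E_FAIL;                     // illustrative error value
}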
Thoughts on how this vuln might have been found (fuzzing, code auditing, variant analysis, etc.):
It seems possible that this vulnerability was found through fuzzing or manual code review. If the ASProtect unpacking code was included from an external library, that would have made the process of finding this vulnerability even more straightforward for both fuzzing & review.
(Historical/present/future) context of bug:
The Exploit
(The terms exploit primitive, exploit strategy, exploit technique, and exploit flow are defined here.)
Exploit strategy (or strategies):
1. The heap buffer overflow is used to overwrite the data in an object stored as the first field in the lfind_switch object, which is allocated in the lfind_switch::switch_out function.
2. The two fields that were overwritten in the object pointed to by the lfind_switch object are used as indices in lfind_switch::switch_in. Because there is no bounds checking on these indices, another out-of-bounds write can occur.
3. The out-of-bounds write in step 2 performs an or operation on the field in the VMM_context_t struct (the virtual memory manager within Windows Defender) that stores the length of a table tracking the virtually mapped pages. This field usually equals the number of pages mapped * 2. The or operations increase the value in that field (for example, from 0x0000000C to 0x0003030C; see the sketch below), which allows an additional out-of-bounds read and write, used to modify the memory-management struct and gain arbitrary read/write.
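To make the arithmetic in step 3 concrete, here is a minimal standalone illustration of why or-ing attacker-controlled bits into a length field can only ever grow it. The variable names and the second operand are illustrative, reusing the example values above:
#include <stdio.h>

int main(void) {
    unsigned int table_len = 0x0000000C;  // e.g. 6 pages mapped * 2
    unsigned int oob_bits  = 0x00030300;  // bits set by the out-of-bounds or-write

    table_len |= oob_bits;                // OR can only set bits, never clear them
    printf("0x%08X\n", table_len);        // prints 0x0003030C
    return 0;
}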
Exploit flow:
The exploit uses “primitive bootstrapping”: the original buffer overflow is used to cause two additional out-of-bounds writes, ultimately gaining arbitrary read/write.
Known cases of the same exploit flow: Unknown.
Part of an exploit chain? Unknown.
The Next Steps
Variant analysis
Areas/approach for variant analysis (and why):
Review ASProtect unpacker for additional parsing bugs.
Review and/or fuzz other unpacking code for parsing and memory issues.
Found variants: N/A
Structural improvements
What are structural improvements such as ways to kill the bug class, prevent the introduction of this vulnerability, mitigate the exploit flow, make this type of vulnerability harder to exploit, etc.?
Ideas to kill the bug class:
Building mpengine.dll with ASAN enabled should allow for this bug class to be caught.
Open sourcing unpackers could allow more folks to find issues in this code, which could help catch bugs like this more readily.
Ideas to mitigate the exploit flow:
Add bounds checking anywhere indices are used. For example, if there had been bounds checking on the indices used in lfind_switch::switch_in, it would have prevented the second out-of-bounds write that allowed this exploit to modify the VMM_context_t structure.
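As a sketch of that mitigation (the types and names here are illustrative, not Defender’s actual internals), every attacker-influenced index would be validated against the size of the table it indexes before any write:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t *entries;
    size_t    count;     // number of valid entries
} table_t;

// Refuses the write instead of corrupting memory when idx is out of range.
bool table_set(table_t *t, size_t idx, uint32_t value) {
    if (idx >= t->count)
        return false;
    t->entries[idx] = value;
    return true;
}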
Other potential improvements:
It appears that by default the Windows Defender emulator runs outside of a sandbox. In 2018, an article announced that Windows Defender Antivirus can now run in a sandbox. The article states that when sandboxing is enabled, you will see a content process, MsMpEngCp.exe, running in addition to MsMpEng.exe. By default, on Windows 10 machines, I only see MsMpEng.exe running as SYSTEM. Sandboxing the anti-malware emulator by default would make this vulnerability more difficult to exploit, because a sandbox escape would then be required in addition to this vulnerability.
0-day detection methods
What are potential detection methods for similar 0-days? Meaning are there any ideas of how this exploit or similar exploits could be detected as a 0-day?
Detecting these types of 0-days will be difficult: the sample simply drops a new file with the characteristics needed to trigger the vulnerability, such as a section table that includes the same virtual address twice, and the exploit method did not require anything that especially stands out.
A compiler is just a part of Emscripten. What if we stripped away all the bells and whistles and used just the compiler?
Emscripten is a compiler toolchain for C/C++ targeting WebAssembly. But it does so much more than just compiling. Emscripten’s goal is to be a drop-in replacement for your off-the-shelf C/C++ compiler and make code that was not written for the web run on the web. To achieve this, Emscripten emulates an entire POSIX operating system for you. If your program uses filesystem APIs, Emscripten will bundle the code to emulate a filesystem. If you use OpenGL, Emscripten will bundle code that creates a C-compatible GL context backed by WebGL. That requires a lot of work and also results in a lot of code that you need to send over the wire. What if we just… didn’t?
The compiler in Emscripten’s toolchain, the program that translates C code to WebAssembly byte-code, is LLVM. LLVM is a modern, modular compiler framework. LLVM is modular in the sense that it never compiles one language straight to machine code. Instead, it has a front-end compiler that compiles your code to an intermediate representation (IR). This IR is called LLVM, as the IR is modeled around a Low-level Virtual Machine, hence the name of the project. The back-end compiler then takes care of translating the IR to the host’s machine code. The advantage of this strict separation is that adding support for a new architecture “merely” requires adding a new back-end compiler. WebAssembly, in that sense, is just one of many targets that LLVM supports and has been available behind a flag for a while. Since version 8 of LLVM, the WebAssembly target is available by default. If you are on macOS, you can install LLVM using Homebrew:
$ brew install llvm
$ brew link --force llvm
To make sure you have WebAssembly support, we can go and check the back-end compiler:
$ llc --version
Registered Targets:
# … OMG so many architectures …
systemz - SystemZ
thumb - Thumb
thumbeb - Thumb (big endian)
wasm32 - WebAssembly 32-bit # 🎉🎉🎉
wasm64 - WebAssembly 64-bit
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
xcore - XCore
Seems like we are good to go!
Compiling C the hard way
Note: We’ll be looking at some low-level file formats like raw WebAssembly here. If you are struggling with that, that is ok. You don’t need to understand this entire blog post to make good use of WebAssembly. If you are here for the copy-pastables, look at the compiler invocation in the “Optimizing” section. But if you are interested, keep going! I also wrote an introduction to raw WebAssembly and WAT previously, which covers the basics needed to understand this post.
Warning: I’ll bend over backwards here for a bit and use human-readable formats for every step of the process (as much as possible). Our program for this journey is going to be super simple to avoid edge cases and distractions:
// Filename: add.c
int add(int a, int b) {
return a*a + b;
}
What a mind-boggling feat of engineering! Especially because it’s called “add” but doesn’t actually add. More importantly: This program makes no use of C’s standard library and only uses `int` as a type.
Turning C into LLVM IR
The first step is to turn our C program into LLVM IR. This is the job of the front-end compiler clang that got installed with LLVM:
clang \
--target=wasm32 \ # Target WebAssembly
-emit-llvm \ # Emit LLVM IR (instead of host machine code)
-c \ # Only compile, no linking just yet
-S \ # Emit human-readable assembly rather than binary
add.c
And as a result we get add.ll containing the LLVM IR. I’m only showing this here for completeness’ sake: when working with WebAssembly, or even with clang when developing C, you never come into contact with LLVM IR.
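Stripped of metadata and attributes, the IR for our add() function boils down to something like this (a hand-simplified sketch, not clang’s verbatim output):
define i32 @add(i32 %a, i32 %b) {
entry:
  %mul = mul nsw i32 %a, %a
  %sum = add nsw i32 %mul, %b
  ret i32 %sum
}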
Next we use LLVM’s back-end compiler, llc, to turn the LLVM IR into an object file:
llc \
-march=wasm32 \ # Target WebAssembly
-filetype=obj \ # Emit a binary object file instead of assembly
add.ll
The output, add.o, is effectively a valid WebAssembly module and contains all the compiled code of our C file. However, most of the time you won’t be able to run object files as essential parts are still missing.
If we omitted -filetype=obj we’d get LLVM’s assembly format for WebAssembly, which is human-readable and somewhat similar to WAT. However, the tool that can consume these files, llvm-mc, does not fully support this text format yet and often fails to consume the output of llc. So instead we’ll disassemble the object files after the fact. Object files are target-specific and therefore need a target-specific tool to inspect them. In the case of WebAssembly, that tool is wasm-objdump, which is part of the WebAssembly Binary Toolkit (WABT). Inspecting add.o with it confirms that our add() function is in this module, but it also contains custom sections filled with metadata and, surprisingly, a couple of imports. In the next phase, called linking, the custom sections will be analyzed and removed, and the imports will be resolved by the linker.
Linking
Traditionally, the linker’s job is to assemble multiple object files into an executable. LLVM’s linker is called lld, but it has to be invoked with one of the target-specific symlinks. For WebAssembly there is wasm-ld.
wasm-ld \
--no-entry \ # We don’t have an entry function
--export-all \ # Export everything (for now)
-o add.wasm \
add.o
The output is a 262-byte WebAssembly module.
Running it
Of course the most important part is to see that this actually works. As we did in the previous blog post, we can use a couple lines of inline JavaScript to load and run this WebAssembly module.
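A minimal loader along those lines looks something like this (a sketch: the page must be served over HTTP for fetch() and instantiateStreaming() to work, and the arguments to add() are just an example):
<!DOCTYPE html>
<script type="module">
  async function init() {
    // Fetch, compile and instantiate the module in one step.
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch("./add.wasm")
    );
    // add(4, 1) computes 4*4 + 1 = 17.
    console.log(instance.exports.add(4, 1));
  }
  init();
</script>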
You should see a 17 in your DevTools console. We just successfully compiled C to WebAssembly without touching Emscripten. It’s also worth noting that there is no glue code required to set up and load the WebAssembly module.
Compiling C the slightly less hard way
The number of steps we currently have to do to get from C code to WebAssembly is a bit daunting. As I said, I was bending over backwards for educational purposes. Let’s stop doing that, skip all the human-readable intermediate formats, and use the C compiler as the swiss-army knife it was designed to be:
clang \
--target=wasm32 \
-nostdlib \ # Don’t try and link against a standard library
-Wl,--no-entry \ # Flags passed to the linker
-Wl,--export-all \
-o add.wasm \
add.c
This will produce the same .wasm file as before, but with a single command.
Optimizing
Let’s take a look at the WAT of our WebAssembly module by running wasm2wat add.wasm (wasm2wat is also part of WABT):
Wowza, that’s a lot of WAT. To my surprise, the module uses memory (indicated by the i32.load and i32.store operations), 8 local variables and a couple of globals. If you think you’d be able to write a shorter version by hand, you’d probably be right. The reason this program is so big is that we didn’t have any optimizations enabled. Let’s change that:
clang \
--target=wasm32 \
-O3 \ # Aggressive optimizations
-flto \ # Add metadata for link-time optimizations
-nostdlib \
-Wl,--no-entry \
-Wl,--export-all \
-Wl,--lto-O3 \ # Aggressive link-time optimizations
-o add.wasm \
add.c
Note: Technically, link-time optimizations don’t bring us any gains here as we are only linking a single file. In bigger projects, LTO will help you keep your file size down.
After running the commands above, our .wasm file went down from 262 bytes to 197 bytes and the WAT is much easier on the eye, too:
Now, C without the standard library (called “libc”) is pretty rough. It seems logical to look into adding a libc as a next step, but I’m going to be honest: It’s not going to be easy. I actually won’t link against any libc in this blog post. There are a couple of libc implementations out there that we could grab, most notably glibc, musl and dietlibc. However, most of these libraries expect to run on a POSIX operating system, which implements a specific set of syscalls (calls to the system’s kernel). Since we don’t have a kernel interface in JavaScript, we’d have to implement these POSIX syscalls ourselves, probably by calling out to JavaScript. This is quite the task and I am not going to do that here. The good news is: This is exactly what Emscripten does for you.
Not all of libc’s functions rely on syscalls, of course. Functions like strlen(), sin() or even memset() are implemented in plain C. That means you could use these functions or even just copy/paste their implementation from one of the libraries above.
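For example, a textbook strlen() needs nothing but the language itself (an illustrative implementation, not copied from any particular libc):
#include <stddef.h>

// Count bytes until the terminating NUL.
size_t strlen(const char *s) {
  const char *p = s;
  while (*p)
    p++;
  return (size_t)(p - s);
}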
Dynamic memory
With no libc at hand, fundamental C APIs like malloc() and free() are not available. In our unoptimized WAT above we have seen that the compiler will make use of memory if necessary. That means we can’t just use the memory however we like without risking corruption. We need to understand how that memory is used.
LLVM’s memory model
The way the WebAssembly memory is segmented is dictated by wasm-ld and might take C veterans a bit by surprise. Firstly, address 0 is technically valid in WebAssembly, but will often still be handled as an error case by a lot of C code. Secondly, the stack comes first and grows downwards (towards lower addresses), and the heap comes after and grows upwards. The reason for this is that WebAssembly memory can grow at runtime, so there is no fixed end to place the stack or the heap at.
The layout that wasm-ld uses is the following: the stack grows downwards and the heap grows upwards, with the stack starting at __data_end and the heap starting at __heap_base. Because the stack is placed first, it is limited to a maximum size set at compile time, namely __heap_base - __data_end.
If we look back at the globals section in our WAT we can find these symbols defined: __heap_base is 66560 and __data_end is 1024. This means that the stack can grow to a maximum of 64KiB, which is not a lot. Luckily, wasm-ld lets you raise that limit at link time (via its -z stack-size option) if you need more. Since __heap_base tells us where the heap starts, we know that the memory region from there on upwards is ours to control. We can place data in there however we like and don’t have to fear corruption as the stack is growing the other way. Leaving the heap as a free-for-all can get hairy quickly, though, so usually some sort of dynamic memory management is needed. One option is to pull in a full malloc() implementation, such as the one Emscripten uses today. There are also a couple of smaller implementations with different tradeoffs.
But why don’t we write our own
malloc()
? We are in this deep, we might as well. One of the simplest allocators is a bump allocator. The advantages: It’s super fast, extremely small and simple to implement. The downside: You can’t free memory. While this seems incredibly useless at first sight, I have encountered use-cases while working on Squoosh where this would have been an excellent choice. The concept of a bump allocator is that we store the start address of unused memory as a global. If the program requests n bytes of memory, we advance that marker by n and return the previous value:
extern unsigned char __heap_base;

// The next unused heap address; starts where the heap begins.
unsigned char *bump_pointer = &__heap_base;

void *malloc(int n) {
  unsigned char *r = bump_pointer;
  bump_pointer += n;  // advance the marker by n bytes
  return (void *)r;
}

void free(void *p) {
  // lol
}
The globals we saw in the WAT are actually defined by wasm-ld, which means we can access them from our C code as normal variables if we declare them as extern. We just wrote our own malloc() in, like, 5 lines of C 😱
Note: Our bump allocator is not fully compatible with C’s malloc(). For example, we don’t make any alignment guarantees. But it’s good enough and it works, so 🤷‍♂️.
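If alignment ever did matter, the usual fix is tiny: round each request up to a multiple of 8 before bumping. A sketch, reusing the bump_pointer global from above and assuming __heap_base itself is 8-byte aligned:
void *malloc(int n) {
  n = (n + 7) & ~7;  // round up to a multiple of 8 to keep returned addresses aligned
  unsigned char *r = bump_pointer;
  bump_pointer += n;
  return (void *)r;
}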
Using dynamic memory
To prove that this actually works, let’s build a C function that takes an arbitrary-sized array of numbers and calculates the sum. Not very exciting, but it does force us to use dynamic memory, as we don’t know the size of the array at build time:
int sum(int a[], int len) {
int sum = 0;
for(int i = 0; i < len; i++) {
sum += a[i];
}
return sum;
}
The sum() function is hopefully straightforward. The more interesting question is how we can pass an array from JavaScript to WebAssembly — after all, WebAssembly only understands numbers. The general idea is to use malloc() from JavaScript to allocate a chunk of memory, copy the values into that chunk and pass the address (a number!) of where the array is located:
<!DOCTYPE html>
<script type="module">
  async function init() {
    // Load and instantiate the module compiled above
    // (assuming it was built to add.wasm and now also
    // contains malloc() and sum()).
    const { instance } = await WebAssembly.instantiateStreaming(
      fetch("./add.wasm")
    );
    const jsArray = [1, 2, 3, 4, 5];
    // Allocate memory for 5 32-bit integers
    // and get the starting address.
    const cArrayPointer = instance.exports.malloc(jsArray.length * 4);
    // Turn that sequence of 32-bit integers
    // into a Uint32Array, starting at that address.
    const cArray = new Uint32Array(
      instance.exports.memory.buffer,
      cArrayPointer,
      jsArray.length
    );
    // Copy the values from JS to C.
    cArray.set(jsArray);
    // Run the function, passing the starting address and length.
    console.log(instance.exports.sum(cArrayPointer, cArray.length));
  }
  init();
</script>
When running this you should see a very happy 15 in the DevTools console, which is indeed the sum of all the numbers from 1 to 5.
You made it to the end. Congratulations! Again, if you feel a bit overwhelmed, that’s okay: This is not required reading. You do not need to understand all of this to be a good web developer or even to make good use of WebAssembly. But I did want to share this journey with you as it really makes you appreciate all the work that a project like Emscripten does for you. At the same time, it gave me an understanding of how small purely computational WebAssembly modules can be. The Wasm module for the array summing ended up at just 230 bytes, including an allocator for dynamic memory. Compiling the same code with Emscripten would yield 100 bytes of WebAssembly accompanied by 11K of JavaScript glue code. It took a lot of work to get there, but there might be situations where it is worth it.