BlobRunner is a simple tool to quickly debug shellcode extracted during malware analysis.
BlobRunner allocates memory for the target file and jumps to the base (or offset) of the allocated memory. This allows an analyst to quickly debug into extracted artifacts with minimal overhead and effort.
To use BlobRunner, you can download the compiled executable from the releases page or build your own using the steps below.
Building the executable is straight forward and relatively painless. Requirements
Download and install Microsoft Visual C++ Build Tools or Visual Studio
Open Visual Studio Command Prompt
Navigate to the directory where BlobRunner is checked out
Build the executable by running:
Building BlobRunner x64
Building the x64 version is virtually the same as above, but simply uses the x64 tooling.
Open x64 Visual Studio Command Prompt
Navigate to the directory where BlobRunner is checked out
Add a breakpoint before the jump into the shellcode
Step into the shellcode
Debug into file at a specific offset.
BlobRunner.exe shellcode.bin --offset 0x0100
Debug into file and don’t pause before the jump. Warning: Ensure you have a breakpoint set before the jump.
BlobRunner.exe shellcode.bin --nopause
Debugging x64 Shellcode
Inline assembly isn’t supported by the x64 compiler, so to support debugging into x64 shellcode the loader creates a suspended thread which allows you to place a breakpoint at the thread entry, before the thread is resumed.
Remote Debugging Shell Blobs (IDAPro)
The process is virtually identical to debugging shellcode locally — with the exception that the you need to copy the shellcode file to the remote system. If the file is copied to the same path you are running win32_remote.exe from, you just need to use the file name for the parameter. Otherwise, you will need to specify the path to the shellcode file on the remote system.
You can quickly generate shellcode samples using the Metasploit tool msfvenom.
Generating a simple Windows exec payload.
msfvenom -a x86 --platform windows -p windows/exec cmd=calc.exe -o test2.bin
Feedback / Help
Any questions, comments or requests you can find us on twitter: @seanmw or @herrcore
Hello Friends, this is the 2nd part of the Linux x86 Buffer Overflows. First of all I want to apologize for such a long delay after the First blog of this series, there were personal and professional issues going on in my life (Well who hasn’t got them? Meh?).
In the previous part we learned about the Process Memory, Stack Region, Stack Operations, Stack Registers, Attempting Buffer Overflow, Overwriting Buffer, ESP and EIP. In case you missed it, here is the link to the Part 1:
-> We successfully exploited the insufficient bound checking by giving input which was larger than the buffer size, which led to the overflow.
-> We successfully attempted to overwrite the buffer and the EIP (Instruction Pointer).
-> We used GDB debugger to examine the buffer for our input and where it was placed on the stack.
Till now, we already know how to attempt a buffer overflow on a vulnerable binary. We already have the control of EIP and ESP.
So, Whats Next ?
Till now we have only overwritten the buffer. We haven’t really exploit anything yet. You may have read articles or PoC where Buffer Overflow is used to escalate privileges or it is used to get a reverse shell over the network from a vulnerable HTTP Server. Buffer Overflow is mainly used to execute arbitrary commands on the vulnerable system.
So, we will try to execute arbitrary commands by using this exploit technique.
In this blog, we will mainly focus on how to overwrite the return address to manipulate the flow of control and program execution.
Lets see this in more detail….
We will use a simple program to manipulate its intended output by changing the RETURN address.
(The following example is same as the AlephOne’s famous article on Stack Smashing with some modifications for a better and easy understanding)
printf(«The value of random is: %d\n»,random);
This is a very simple program where the variable random is assigned a value ‘0’, then a function named function is called, the flow is transferred to the function()and it returns to the main() after executing its instructions. Then the variable random is assigned a value ‘1’ and then the current value of random is printed on the screen.
Simple program, we already know its output i.e 1.
But, what if we want to skip the instruction random=1; ?
That means, when the function() call returns to the main(), it will not perform the assignment of value ‘1’ to random and directly print the value of random as ‘0’.
So how can we achieve this?
Lets fire up GDB and see how the stack looks like for this program.
(I won’t be explaining each and every instruction, I have already done this in the previous blog. I will only go through the instructions which are important to us. For better understanding of how the stack is setup, refer to the previous blog.)
In GDB, first we disassemble the main function and look where the flow of the program is returned to main() after the function() call is over and also the address where the assignment of value ‘1’ to random takes place.
-> At the address 0x08048430, we can see the value ‘0’ is assigned to random.
-> Then at the address 0x08048437 the function() is called and the flow is redirected to the actual place where the function() resides in memory at 0x804840b address.
-> After the execution of function() is complete, it returns the flow to the main()and the next instruction which gets executed is the assignment of value ‘1’ to random which is at address 0x0804843c.
->If we want to skip past the instruction random=1; then we have to redirect the flow of program from 0x08048437 to 0x08048443. i.e after the completion of function() the EIP should point to 0x08048443 instead of 0x0804843c.
To achieve this, we will make some slight modifications to the function(), so that the return address should point to the instruction which directly prints the randomvariable as ‘0’.
Lets take a look at the execution of the function() and see where the EIP points.
-> As of now there is only a char array of 5 bytes inside the function(). So, nothing important is going on in this function call.
-> We set a breakpoint at function() and step through each instruction to see what the EIP points after the function() is exectued completely.
-> Through the command info regsiters eip we can see the EIP is pointing to the address 0x0804843c which is nothing but the instruction random=1;
Lets make this happen!!!
From the previous blog, we came to know about the stack layout. We can see how the top of the stack is aligned, 8 bytes for the buffer, 4 bytes for the EBP, 4 bytes for the RET address and so on (If there is a value passed as parameter in a function then it gets pushed first onto the stack when the function is called).
-> When the function() is called, first the RET address gets pushed onto the stack so the program flow can be maintained after the execution of function().
->Then the EBP gets pushed onto the stack, so the EBP can act as the offset reference point for other instructions to access and calculate the memory addresses.
-> We have char array buffer of 5 bytes, but the memory is allocated in terms of WORD. Basically 1 word=4 bytes. Even if you declare a variable or array of 2 bytes it will be allocated as 4 bytes or 1 word. So, here 5 bytes=2 words i.e 8 bytes.
-> So, in our case when inside the function(), the stack will probably be setup as buffer + EBP + RET
-> Now we know that after 12 bytes, the flow of the program will return to main()and the instruction random = 1; will be executed. We want to skip past this instruction by manipulating the RET address to somehow point to the address after 0x0804843c.
->We will now try to manipulate the RET address by doing something like below :-
Here, we are introducing an int pointer ret, and we are assigning it the starting address of buffer + 12 .i.e (wait lets see this in a diagram)
-> So, the ret pointer now ends exactly at the start where the return value is actually stored. Now, we just want to add ret pointer with some value which will change the return value in such a manner that it will skip past the return = 1;instruction.
Lets just go back to the disassembled main function and calculate how much we have to add to the ret pointer to actually skip past the assignment instruction.
-> So, we already know we want to skip pass the instruction at address 0x0804843c to address 0x08048443. By subtracting these two address we get 7. So we will add 7 to the return address so that it will point to the instruction at address 0x08048443.
Since we are adding code and re-compiling the program, the memory addresses will get changed.
Lets take a look at the new memory addresses.
-> Now the assignment instruction random = 1; is at address 0x08048452 and the next address which we want the return address to point is 0x08048459. The difference between these addresses is still 7. So we are good till now.
Hopefully by executing this we will get the output as “The value of random is: 0”
Why did it gave the Segmentation fault (core dumped)? That means maybe the process is trying to access a memory that doesn’t exist or something.
Note:- Well this was really tricky for me atleast. Computers works in mysterious ways, it is not just mathematics, numbers and science ( Yeah I am being dramatic)
Lets fire up our binary in GDB and see whats wrong.
-> As you can see we set a breakpoint at line 4 and run the program. The breakpoint hits and the program halts. Then we examine the stack top and what we see, the distance between the start of the buffer and the start of return address is not 12, its 13. (It doesn’t even make sense mathematically). Three full block adds up for 12 bytes (4 bytes each) and 1 byte for that ‘A’ i.e 41.
-> Our buffer had value “ABCDE” in it. And A in ASCII hex is 41. From 41 to just right before of the return address 0x08048452, the count is 13 which traditionally is not correct. I still don’t know why the difference is 13 instead of 12, if anyone knows kindly tell me in the comments section.
-> So we just change our code from ret = buffer + 12; to ret = buffer + 13;
Now the final code will look like this:
printf(«The value of random is: %d\n»,random);
Now, we re-compile it and run and hope its now going to give a desired output.
Before continuing lets see this again in GDB:
> So, we can see now after 7 is added to the ret pointer, the return address is now 0x08048459 which means we have successfully skipped past the assignment instruction and overwritten the return address to manipulate our way around the program flow to alter the program execution.
Till now we learnt a very important part of buffer overflow i.e how to manipulate and overwrite the return address to alter the program execution flow.
Now, lets another example where we can overwrite the return address in more similar ‘buffer overflow’ fashion.
printf(«calling function pointer, jumping to 0x%08x\n»,fp);
Small overview of a program:-
->There is a function win() which has some message saying “code flow successfully changed”. So we assume this is the goal.
-> Then there is the main(), int pointer fp is declared and defined to ‘0’ along with char array buffer with size 64. The program asks for user input using the getsfunction and the input is stored in the array buffer.
-> Then there is an if statement where if fp is not equal to zero then it calls the function pointer jumping to some memory location.
Now we already know that gets function is vulnerable to buffer overflow (visit previous blog) and we have to execute the win() function. So by not wasting time lets find out the where the overflow happens.
We know that the char array buffer is of 64 bytes. So lets feed input accordingly.
So, as we can see after the 64th byte the overflow occurs and the 65th ‘A’ i.e 41 in ASCII Hex is overwriting the EIP.
Lets confirm this in GDB:
-> As we see the buffer gets overflowed and the EIP is overwritten with ‘A’ i.e 41.
Now lets find the address of the win() function. We can find this with GDB or by objdump.
Objdump is a tool which is used to display information about object files. It comes with almost every Linux distribution.
The -d switch stands for disassemble. When we use this, every function used in a binary is listed, but we just wanted the address of win() function so we just used grep to find our required string in the output.
So it tells us that the win() function’s starting address is 0804846b. To achieve our goal we just have to pass this address after the 64th byte so that the EIP will get overwritten by this address and execute the win() function for us.
But since our machine is little endian (Find more about it here)we have to pass the address in a slightly different manner (basically in reverse). So the address 0804846b will become \x6b\x84\x04\x08 where the \x stands for HEX value.
Hello Friends, this series of blog posts will purely focus on Buffer Overflows. When I started my journey in Infosec, this one topic fascinated me as much as it frightened me. When I read some of the blogs related to Buffer Overflows, it really seemed as some High-level gibber jabber containing C code, Assembly and some black terminals (Yeah I am talking about GDB). Over the period of time and preparing for OSCP, I started to learn about Buffer Overflows in detail referring to the endless materials on Web scattered over different planets. So, I will try to explain Buffer Overflow in depth and detail so everyone reading this blog can understand what actually a Buffer Overflow is.
In this blog, we will understand the basic fundamentals behind the Buffer Overflow vulnerability. Buffer Overflow is a memory corruption attack which involves memory, stack, buffers to name a few. We will go through each of this and understand why really Buffer Overflows takes place in the first place. We will focus on 32-bit architecture.
A little heads up: This blog is going to be a lengthy one because to understand the concepts of Buffer Overflow, we have to understand the process memory, stack, stack operations, assembly language and how to use a debugger. I will try to explain as much as I can. So I strongly suggest to stick till the end and it will surely clear your concepts. Also, from the next blog I will try to keep it short :p :p
So lets get started….
Before getting to stack and buffers, it is really important to understand the Process Memory Organization. Process Memory is the main playground where it all happens. In theory, the Process memory where the program/process resides is quite a complex place. We will see the basic part which we need for Buffer Overflow. It consists of three main regions: Text, Data and Stack.
Text: The Text region contains the Program Code (basically instructions) of the executable or the process which will be executed. The Text area is marked as read-only and writing any data to this area will result into Segmentation Violation (Memory Protection Mechanism).
Data: The Data region consists of the variables which are declared inside the program. This area has both initialized (data defined while coding) and uninitialized (data declared while coding) data. Static variables are stored in this section.
Stack: While executing a program, there are many function and procedure calls along with many JUMP instructions which after the functions work is done has to return to its next intended place. To carry out this operation, to execute a program, the memory has an area called Stack. Whenever the CALL instruction is used to call a function, the stack is used. The Stack is basically a data structure which works on the LIFO (Last In First Out) principle. That means the last object entering the stack is the first object to get out. We will see how a stack works in detail below.
So Lets see an Assembly Code Skeleton to understand more about how the program is executed in Process Memory.
Here, as we can see the assembly program skeleton has three sections: .data, .bss and .txt
-> .txt section contains the assembly code instructions which resides in the Text section of the process memory.
-> .data section contains the initialized data or lets say defined variables or data types which resides in the Data section of the process memory.
-> .bss section contains the uninitialized data or lets say declared variables that will be used later in the program execution which also resides in the Data section of the process memory
However in a traditionally compiled program there may be many sections other than this.
Now the most important part….
As we discussed earlier all the dirty work is done on the Stack. Stack is nothing but a region of memory where data is temporarily stored to carry out some operations. There are mainly two operations performed by stack: PUSH and POP. PUSH is used to add the object on top of the stack, POP is used to remove the object from top of the stack.
But why stack is used in the first place?
Most of the programs have functions and procedures in them. So during the program execution flow, when the function is called, the flow of control is passed to the stack. That means when the function is called, all the operations which will take place inside the function will be carried out on the Stack. Now, when we talk about flow of control, after when the execution of the function is done, the stack has to return the flow of control to the next instruction after which the function was called in the program. This is a very important feature of the stack.
So, lets see how the Stack works. But before that, lets get familiar with some stack pointers and registers which actually carries out everything on the Stack.
ESP (Extended Stack Pointer): ESP is a stack register which points to the top of the stack.
EBP (Extended Base Pointer): EBP is a stack register which points to the top of the current frame when a function is called. EBP generally points to the return address. EBP is really essential in the stack operations because when the function is called, function arguments and local variables are stored onto the stack. As the stack grows the offset of both the function arguments and variables changes with respect to ESP. So ESP is changed many times and it is difficult for the compiler to keep track, hence EBP was introduced. When the function is called, the value of ESP is copied into EBP, thus making EBP the offset reference point for other instructions to access and calculate the memory addresses.
EIP (Extended Instruction Pointer): EIP is a stack register which tells the processor about the address of the next instruction to execute.
RET (return) address: Return address is basically the address to which the flow of control has to be passed after the stack operation is finished.
Stack Frame: A stack frame is a region on stack which contains all the function parameters, local variables, return address, value of instruction pointer at the time of a function call.
Okay, so now lets see the stack closely by executing a C program. For this blog post we will be using the program challenge ‘stack0’ from Protostar exploit series, which is a stack based buffer overflow challenge series.
You can find more about Protostar exploit series here -> https://exploit-exercises.com/protostar/
printf(«you have changed the ‘modified’ variable\n»);
Whats the program about ?
An int variable modified is declared and a char array buffer of 64 bytes is declared. Then modified is set to 0 and user input is accepted in buffer using gets() function. The value of modified is checked, if its anything other than 0 then the message “you have changed the ‘modified’ variable” will be printed or else “Try again?” will be printed.
Lets execute the program once and see what happens.
So after executing the program, it asks for user input, after giving the string “IamGrooooot” it displays “Try again?”. By seeing the output it is clear that modified variable is still 0.
Now lets try to debug this executable in a debugger, we will be using GDB throughout this blogpost series (Why? Because its freaking awesome).
Just type gdb ./executable_name to execute the program in the debugger.
The first thing we do after firing up gdb is disassembling the main() function of our program.
We can see the main() function now, its in assembly language. So lets try to understand what actually this means.
From the address 0x080483f4 the main() function is getting started.
-> Since there is no arguments passed in the main function, directly the EBP is pushed on the stack.
As we know EBP is the base pointer on the stack, the Stack pushes some starting address onto the EBP and saves it for later purposes.
-> Next, the value of ESP is moved into EBP. The value of ESP is saved into EBP. This is done so that most of the operations carried out by the function arguments and the variable changes the ESP and it is difficult for the compiler to keep track of all the changes in ESP.
-> Many times the compiler add some instructions for better alignment of the stack. The instruction at the address 0x080483f7 is masking the ESP and adding some trailing zeros to the ESP. This instruction is not important to us.
-> The next instruction at address 0x080483fa is subtracting 0x60 hex value from the ESP. Now the ESP is pointing to a far lower address from the EBP(As we know the stack grows down the memory). This gap between the ESP and EBP is basically the Stack frame, where the operations needed to execute the program is done.
-> Now the instruction at address 0x080483fd is from where all our C code will make sense. Here we can see the instruction movl $0x0,0x5c(%esp) where the value 0 is moved into ESP + the offset 0x5c, that means 0 is moved to the address [ESP+0x5c]. This is same as in our program, modified=0.
-> From address 0x08048405 to 0x0804840c are the instructions with accepts the user input.
The instruction lea 0x1c(%esp),%eax is loading the effective address i.e [ESP+0x1c] into EAX register. This address is pointing to the char array buffer on the stack. The lea and mov instructions are almost same, the only difference is the leainstruction copies the address of the register and offset into the destination instead of the content(which the mov instruction does).
-> The instruction mov %eax,(%esp) is copying the address stored in EAX register into ESP. So the top of the stack is pointing to the address 0x1c(%esp). Another important thing is that the function parameters are stored in the ESP. Assuming that the next instruction is calling gets function, the gets function will write the data to a char array. In this case the ESP is pointing to the address of the char buffer and the gets function writes the data to ESP(i.e the char array at the address 0x1c(%esp) ).
-> From the address 0x08048411 to 0x0804842e, the if condition is carried out. The instruction mov 0x5c(%esp),%eax is copying the value of modified variable i.e 0 into EAX. Then the test instruction is checking whether the value of modifiedis changed or not. If the value is not changed i.e the je instruction’s output is equal then the flow of control will jump to 0x08048427 where the message “Try again?” will be printed. If the value is changed then flow of control will be normal and the next instruction will be executed, thus printing the message “you have changed the ‘modified’ variable”.
-> Now all the opertions are done. But the stack is as it is and for the program to be completed the flow control has to be passed to the RET address which is stored on the stack. But till now only variables and addresses has been pushed on the stack but nothing has been popped. At the address 0x08048433 the leave instruction is executed. The leave instruction is used to “free” the stack frame. If we see the disassemble main() the first two instructions push %ebp and push %esp,%ebp, these instructions basically sets up the stack frame. Now the leave instruction does exactly the opposite of what the first two instructions did, mov %ebp,%esp and pop %ebp. So these two instructions free ups stack frame and the EIP points to the RET address which will give the flow of control to the address which was next after the called function. In this case the RET value is not pointing to anything because our program ends here.
So till now we have seen what our C code in Assembly means, it is really important to understand these things because when we debug or lets say Reverse Engineer some binary and stuff, this understanding of how closely the memory and stack works really comes in handy.
Now we will see the stack operations in GDB. For those who will be doing debugging and reversing for the first time it may feel overwhelming seeing all these instructions(believe me, I used to go nuts sometimes), so for the moment we will focus only on the part which is required for this series to understand, like where is our input being written, how the memory addresses can be overwritten and all those stuffs….
Here comes the mighty GDB !!!!
In GDB we will list the program, so we can know where we want to set a breakpoint.
We will set the breakpoint for line number 7,11,13Lets run it in GDB..-> After we run it in GDB, we can see the breakpoints which we set is now hit, basically it is interrupting the program execution and halting the program flow at the given breakpoint.
-> We step to next instruction by typing s in the prompt, we can again see the next breakpoint gets hit.
-> At this point we check both our stack registers ESP and EIP. We look these two registers in two different ways. We check the ESP using the x switch which is for examine memory. We can see the ESP is pointing to the stack address 0xbffff0b0and the value it contains is 0x00000000 i.e 0. We can easily assume that, this 0 is the same 0 which gets assigned to the modified variable.
-> We check the EIP by the command info registers eip (you can see info about all the registers by simply typing info registers). We see the EIP is pointing to the memory address 0x8048405 which is nothing but the address of next instruction to be executed.
Now we step next and see what happens.-> When we step through the next instruction it asks us for the input. Now we give the input as random sets of ‘A’ and then we can see our third breakpoint which we set earlier is hit.
-> We again check the ESP and EIP. We clearly see the ESP is changed and pointing to different stack address. EIP now is pointing to the next instruction which is going to be executed.
Now lets see, the input which we gave where does it goes?-> By using the examine memory switch i.e x we see the contents of ESP. We are viewing the 24 words of content on the stack in Hex (thats why x/24xw $esp). We can see our input ‘A’ that is 41 in hex (according to the ASCII standards) on the stack(highlighted portion). So we can see that our input is being written on the stack.
Again lets step through the next instruction.-> In the previous step we saw that the instruction if(modified !=0) is going to be executed. Now lets go back to the section where we saw the assembly instructions equivalent of the C program in detail. We can see the instruction test %eax,%eaxwill be executed. So we already know it will compare if the modified value has changed or not.
How do we see that ?
-> We simply check the EAX register by typing info registers eax. We can see that the value of EAX is 0x0 i.e 0, that means the value of modified variable is still 0. Now we know that the message “Try again?” is going to be printed. The EIP also confirms this, the output of EIP points to 0x8048427 which if you look at the disassembled main function, you can see that it is calling the second puts function which has the message “Try again?”.
Lets step and move towards the exit of the program.-> When we step through the next instruction, it gives us the message “Try again?”. Then we check the EIP it points to 0x8048433, which is nothing but the leaveinstruction. Again we step through, we can the program exiting and terminated with the exit system call that is in the libc.s0.6 shared library files.
Till now, we saw the how the process memory works, how the programs gets loaded into the memory and how the stack operations are done when the program gets executed. Now this was more than enough to understand what Buffer Overflow really is.
Finally we can get started with the topic. Lets ask ourselves two very simple questions:
What is a Buffer?
-> Buffer is a temporary storage place in memory to store data.
What is Buffer Overflow?
-> When a data written to a buffer that is larger than the actual buffer size and due to improper bounds checking it gets overflowed and overwrites the adjacent memory addresses/locations.
So, its time to get our hands dirty by smashing the stack. But before that, for this blogpost series we will only focus on the Stack based overflows. Also the examples which we are going to see may not be vulnerable to buffer overflow because the newer system kernels handle all these things in a very effifcient way. If you are using a newer system, for e.g Ubuntu 16.04 LTS you have to go and disable the ASLR bit to off as it is set to protect from the Memory Corruption attacks.
To disable it simply type : echo 0 > /proc/sys/kernel/randomize_va_space in your terminal.
Also you have to compile the program using gcc with stack smashing detect feature disabled. To do this compile your program using:
We will use the earlier program which we used for understanding the stack. This program is vulnerable to buffer overflow. As we can see the the program is using gets function to accept the user input. Now, this gets function has some serious security issues. Lets see the man page for gets.As, we can see the highlighted part it says “Never use gets(). Because it is impossible to tell without knowing the data in advance how many characters gets() will read, and because gets() will continue to store characters past the end of the buffer, it is extremely dangerous to use. It has been used to break computer security. Use fgets() instead”.
From here we can understand there is actually no bound checks happening when the user input is taken through the gets function (Extremely Dangerous Right?).
Now lets look what the program is and what the challenge of the program is all about.
-> We already discussed that if the modified variable is not 0, then the message “you have changed the ‘modified’ variable” will be printed. But if we look at the program, there is no way the modified variable’s value can be changed. So how it can be done?
-> The line where it takes user input and writes into char array buffer is actually our way to go and change the value of modified. The gets(buffer); is the vulnerable code.
In the code we can see the modified variable assignment and the input of char array buffer is next to each other,
This means when this two instructions will be executed by the processor the modified variable and buffer array will be adjacent to each other in the stack frame.
So, what does this means? -> When we feed input to the buffer more than it is capable of, the extra input which we feed will get overwritten to the adjacent memory location, in this case the memory location pointing to the modified variable. Thus, the modified variable will be no longer be 0 and the success message will be printed.
Due to buffer overflow the above scenario was possible. Lets see in more detail.
First we will try to execute the program with some random input and see where the overflow happens.-> We already know the char array buffer is of 64 bytes. So we try to enter 60, 62, 64 times ‘A’ to our program. As, we can see the modified variable is not changed and the failure message is printed.
->But when we enter 65 A’s to our program, the value of modified variable mysteriously changes and the success message is printed.
->The buffer overflow has happened after the 64th byte of input and it overwrites the memory location after that i.e where the modified variable is stored.
Lets load our program in GDB and see how the modified variable’s memory location got overwritten.-> As we can see, the breakpoints 1 and 2 got hit, then we check the value of modified by typing x/xw $esp+0x5c (stack register + offset). If we see in the disassembled main function we can see the value 0 gets assigned to modifiedvariable through this instruction: movl $0x0,0x5c(%esp), that’s why we checked the value of modified variable by giving the offset along with the stack register ESP. The value of modified variable at stack location 0xbffff10c is 0x00000000.
->After stepping through, its asking us to enter the input. Now we know 65th byte is the point where the buffer gets overflowed. So we enter 65 A’s and then check the stack frame.
-> As we can see our input A i.e 41 is all over the stack. But we are only concerned with the adjacent memory location where the modified variable is there. By quickly checking the modified variable we can see the value of the stack address pointing to the modified variable 0xbffff10c is changed from 0x00000000 to 0x00000041.
-> This means when the buffer overflow took place it overwritten the adjacent memory location 0xbffff10c to 0x00000041.
-> As we step through the next instruction we can see the success message “you have changed the ‘modified’ variable” printed on the screen.
This was all possible because there was insufficient bounds checking when the user input was being written in the char array buffer. This led to overflow and the adjacent memory location (modified variable) got overwritten.
We successfully learned the fundamentals of process memory, Stack operations and Buffer Overflow in detail. Now, this was only the concept of how buffer overflow takes place. We still haven’t exploited this vulnerability to actually exploit the system. In the next blog we will see how to execute arbitrary commands through Shellcode using this Buffer Overflow vulnerablity.
Till then, go and learn as much as possible about Assembly and GDB, because we are going to use this extensively in the future blogposts.
WoW64 — aka Windows (32-bit) on Windows (64-bit) — is a subsystem that enables 32-bit Windows applications to run on 64-bit Windows. Most people today are familiar with WoW64 on Windows x64, where they can run x86 applications. WoW64 has been with us since Windows XP, and x64 wasn’t the only architecture where WoW64 has been available — it was available on IA-64architecture as well, where WoW64 has been responsible for emulating x86. Newly, WoW64 is also available on ARM64, enabling emulation of both x86 and ARM32 appllications.
wow64.dll: translation of Nt* system calls (ntoskrnl.exe / ntdll.dll)
wow64win.dll: translation of NtGdi*, NtUser* and other GUI-related system calls (win32k.sys / win32u.dll)
Emulation support DLLs:
wow64cpu.dll: support for running x86 programs on x64
wowarmhw.dll: support for running ARM32 programs on ARM64
xtajit.dll: support for running x86 programs on ARM64
Besides Nt* system call translation, the wow64.dll provides the core emulation infrastructure.
If you have previous experience with reversing WoW64 on x64, you can notice that it shares plenty of common code with WoW64 subsystem on ARM64. Especially if you peeked into WoW64 of recent x64 Windows, you may have noticed that it actually contains strings such as SysArm32 and that some functions check against IMAGE_FILE_MACHINE_ARMNT (0x1C4) machine type:
WoW on x64 systems cannot emulate ARM32 though — it just apparently shares common code. But SysX8664 and SysArm64 sound particularly interesting!
Those similarities can help anyone who is fluent in x86/x64, but not that much in ARM. Also, HexRays decompiler produce much better output for x86/x64 than for ARM32/ARM64.
Initially, my purpose with this blogpost was to get you familiar with how WoW64 works for ARM32 programs on ARM64. But because WoW64 itself changed a lot with Windows 10, and because WoW64 shares some similarities between x64 and ARM64, I decided to briefly get you through how WoW64 works in general.
Everything presented in this article is based on Windows 10 — insider preview, build 18247.
Througout this article I’ll be using some terms I’d like to explain beforehand:
ntdll or ntdll.dll — these will be always refering to the native ntdll.dll (x64 on Windows x64, ARM64 on Windows ARM64, …), until said otherwise or until the context wouldn’t indicate otherwise.
ntdll32 or ntdll32.dll — to make an easy distinction between native and WoW64 ntdll.dll, any WoW64 ntdll.dll will be refered with the *32 suffix.
emu or emu.dll — these will represent any of the emulation support DLLs (one of wow64cpu.dll,wowarmhw.dll, xtajit.dll)
module!FunctionName — refers to a symbol FunctionName within the module. If you’re familiar with WinDbg, you’re already familiar with this notation.
CHPE — “compiled-hybrid-PE”, a new type of PE file, which looks as if it was x86 PE file, but has ARM64 code within them. CHPE will be tackled in more detail in the x86 on ARM64 section.
The terms emulation and binary-translation refer to the WoW64 workings and they may be used interchangeably.
This section shows some points of interest in the ntoskrnl.exe regarding to the WoW64 initialization. If you’re interested only in the user-mode part of the WoW64, you can skip this part to the Initialization of the WoW64 process.
Initalization of WoW64 begins with the initialization of the kernel:
nt!PsLocateSystemDlls routine takes a pointer named nt!PspSystemDlls, and then calls nt!PspLocateSystemDll in a loop. Let’s figure out what’s going on here:
nt!PspSystemDlls appears to be array of pointers to some structure, which holds some NTDLL-related data. The order of these NTDLLs corresponds with this enum (included in the PDB):
Remember this enum, we’ll be needing it again in a while.
Now, let’s look how such structure looks like:
The nt!PspLocateSystemDll function intializes fields of this structure. The layout of this structure isn’t unfortunatelly in the PDB, but you can find a reconstructed version in the appendix.
Now let’s get back to the nt!Phase1Initialization — there’s more:
nt!PspInitializeSystemDlls routine takes a pointer named nt!NtdllExportInformation. Let’s look at it:
It looks like it’s some sort of array, again, ordered by the enum _SYSTEM_DLL_TYPE mentioned earlier. Let’s examine NtdllExports:
Nothing unexpected — just tuples of function name and function pointer. Did you notice the difference in the number after the NtdllExports field? On x64 there is 19 meanwhile on ARM64 there is 14. This number represents number of items in NtdllExports — and indeed, there is slightly different set of them:
We can see that ARM64 is missing Ums (User-Mode Scheduling) and Enclave functions. Also, we can see that ARM64 has one extra function: KiUserCallbackDispatcherReturn.
On the other hand, all NtdllWow*Exports contain the same set of function names:
Notice names of second fields of these “structures”: PsWowX86SharedInformation,PsWowChpeX86SharedInformation, … If we look at the address of those fields, we can see that they’re part of another array:
Those addresses are actually targets of the pointers in the NtdllWow*Exports structure. Also, those functions combined with PsWow*SharedInformation might give you hint that they’re related to this enum (included in the PDB):
Notice how the order of the SharedNtdll32BaseAddress corellates with the empty field in the previous screenshot (highlighted). The set of WoW64 NTDLL functions is same on both x64 and ARM64.
(The C representation of this data can be found in the appendix.)
Now we can tell what the nt!PspInitializeSystemDlls function does — it gets image base of each NTDLL (nt!PsQuerySystemDllInfo), resolves all Ntdll*Exports for them (nt!RtlFindExportedRoutineByName). Also, only for all WoW64 NTDLLs (if ((SYSTEM_DLL_TYPE)SystemDllType > PsNativeSystemDll)) it assigns the image base to the SharedNtdll32BaseAddress field of the PsWow*SharedInformation array (nt!PspWow64GetSharedInformation).
Kernel (create process)
Let’s talk briefly about process creation. As you probably already know, the native ntdll.dll is mapped as a first DLL into each created process. This applies for all architectures — x86, x64 and also for ARM64. The WoW64 processes aren’t exception to this rule — the WoW64 processes share the same initialization code path as native processes.
If you ever wondered how is the first user-mode instruction of the newly created process executed, now you know the answer — a “synthetic” user-mode exception is dispatched, with ExceptionRecord.ExceptionAddress = &PspLoaderInitRoutine, where PspLoaderInitRoutinepoints to the ntdll!LdrInitializeThunk. This is the first function that is executed in every process — including WoW64 processes.
Initialization of the WoW64 process
The fun part begins!
NOTE: Initialization of the wow64.dll is same on both x64 and ARM64. Eventual differences will be mentioned.
The ntdll!LdrpLoadWow64 function is called when the ntdll!UseWOW64 global variable is TRUE, which is set when NtCurrentTeb()->WowTebOffset != NULL.
It constructs the full path to the wow64.dll, loads it, and then resolves following functions:
NOTE: The resolution of these pointers is wrapped between pair of ntdll!LdrProtectMrdata calls, responsible for protecting (1) and unprotecting (0) the .mrdata section — in which these pointers reside. MRDATA (Mutable Read Only Data) are part of the CFG (Control-Flow Guard) functionality. You can look at Alex’s slides for more information.
When these functions are successfully located, the ntdll.dll finally transfers control to the wow64.dll by calling wow64!Wow64LdrpInitialize. Let’s go through the sequence of calls that eventually bring us to the entry-point of the “emulated” application.
Wow64InfoPtr is the first initialized variable in the wow64.dll. It contains data shared between 32-bit and 64-bit execution mode and its structure is not documented, although you can find this structure partialy restored in the appendix.
RtlWow64GetCpuAreaInfo is an internal ntdll.dll function which is called a lot during emulation. It is mainly used for fetching the machine type and architecture-specific CPU context (the CONTEXTstructure) of the emulated process. This information is fetched into an undocumented structure, which we’ll be calling WOW64_CPU_AREA_INFO. Pointer to this structure is then given to the ProcessInit function.
Wow64DetectMachineTypeInternal determines the machine type of the executed process and returns it. Wow64SelectSystem32PathInternal selects the “emulated” System32 directory based on that machine type, e.g. SysWOW64 for x86 processes or SysArm32 for ARM32 processes.
You can also notice calls to CpuNotifyMapViewOfSection function. As the name suggests, it is also called on each “emulated” call of NtMapViewOfSection. This function:
If these checks pass, CpupResolveReverseImports function is called. This function checks if the mapped image exports the Wow64Transition symbol and if so, it assigns there a 32-bit pointer value returned by emu!BTCpuGetBopCode.
The Wow64Transition is mostly known to be exported by SysWOW64\ntdll.dll, but there are actually multiple of Windows’ WoW DLLs which exports this symbol (will be mentioned later). You might be already familiar with the term “Heaven’s Gate” — this is where the Wow64Transition will point to on Windows x64 — a simple far jump instruction which switches into long-mode (64-bit) enabled code segment. On ARM64, the Wow64Transition points to a “nop” function.
NOTE: Because there are no checks on the ImageName, the Wow64Transition symbol is resolved for all executable images that passes the checks mentioned earlier. If you’re wondering whether Wow64Transition would be resolved for your custom executable or DLL — it indeed would!
The initialization then continues with thread-specific initialization by calling ThreadInit. This is followed by pair of calls ThunkStartupContext64TO32(CpuArea.MachineType, CpuArea.Context, NativeContext) and Wow64SetupInitialCall(&CpuArea) — these functions perform the necessary setup of the architecture-specific WoW64 CONTEXT structure to prepare start of the execution in the emulated environment. This is done in the exact same way as if ntoskrnl.exe would actually executed the emulated application — i.e.:
setting the instruction pointer to the address of ntdll32!LdrInitializeThunk
setting the stack pointer below the WoW64 CONTEXT structure
setting the 1st parameter to point to that CONTEXT structure
setting the 2nd parameter to point to the base address of the ntdll32
Finally, the RunCpuSimulation function is called. This function just calls BTCpuSimulate from the binary-translator DLL, which contains the actual emulation loop that never returns.
wow64.dll has also it’s own .mrdata section and ProcessInit begins with unprotecting it. It then tries to load the wow64log.dll from the constructed system directory. Note that this DLL is never present in any released Windows installation (it’s probably used internally by Microsoft for debugging of the WoW64 subsystem). Therefore, load of this DLL will normally fail. This isn’t problem, though, because no critical functionality of the WoW64 subsystem depends on it. If the load would actually succeed, the wow64.dll would try to find following exported functions there:
If any of these functions wouldn’t be exported, the DLL would be immediately unloaded.
If we’d drop custom wow64log.dll (which would export functions mentioned above) into the %SystemRoot%\System32 directory, it would actually get loaded into every WoW64 process. This way we could drop a custom logging DLL, or even inject every WoW64 process with native DLL!
Then, certain important values are fetched from the LdrSystemDllInitBlock array. These contains base address of the ntdll32.dll, pointer to functions like ntdll32!KiUserExceptionDispatcher, ntdll32!KiUserApcDispatcher, …, control flow guard information and others.
Finally, the Wow64pInitializeFilePathRedirection is called, which — as the name suggests — initializes WoW64 path redirection. The path redirection is completely implemented in the wow64.dll and the mechanism is basically based on string replacement. The path redirection can be disabled and enabled by calling kernel32!Wow64DisableWow64FsRedirection & kernel32!Wow64RevertWow64FsRedirection function pairs. Both of these functions internally call ntdll32!RtlWow64EnableFsRedirectionEx, which directly operates on NtCurrentTeb()->TlsSlots[/* 8 */ WOW64_TLS_FILESYSREDIR] field.
Next, a ServiceTables array is initialized. You might be already familiar with the KSERVICE_TABLE_DESCRIPTOR from the ntoskrnl.exe, which contains — among other things — a pointer to an array of system functions callable from the user-mode. ntoskrnl.exe contains 2 of these tables: one for ntoskrnl.exe itself and one for the win32k.sys, aka the Windows (GUI) subsystem. wow64.dll has 4 of them!
The WOW64_SERVICE_TABLE_DESCRIPTOR has the exact same structure as the KSERVICE_TABLE_DESCRIPTOR, except that it is extended:
NOTE:wow64.dll directly depends (by import table) on two DLLs: the native ntdll.dll and wow64win.dll. This means that wow64win.dll is loaded even into “non-Windows-subsystem” processes, that wouldn’t normally load user32.dll.
These two symbols mentioned above are the only symbols that wow64.dll requires wow64win.dll to export.
Let’s have a look at sdwhnt32 service table:
There is nothing surprising for those who already dealt with service tables in ntoskrnl.exe.sdwhnt32JumpTable contains array of the system call functions, which are traditionaly prefixed. WoW64 “system calls” are prefixed with wh*, which honestly I don’t have any idea what it stands for — although it might be the case as with Zw* prefix — it stands for nothing and is simply used as an unique distinguisher.
The job of these wh* functions is to correctly convert any arguments and return values from the 32-bit version to the native, 64-bit version. Keep in mind that that it not only includes conversion of integers and pointers, but also content of the structures. Interesting note might be that each of the wh* functions has only one argument, which is pointer to an array of 32-bit values. This array contains the parameters passed to the 32-bit system call.
As you could notice, in those 4 service tables there are “system calls” that are not present in the ntoskrnl.exe. Also, I mentioned earlier that the Wow64Transition is resolved in multiple DLLs. Currently, these DLLs export this symbol:
kernel32.dll and kernelbase.dll
The ntdll.dll and win32u.dll are obvious and they represent the same thing as their native counterparts. The service tables used by kernel32.dll and user32.dll contain functions for transformation of particular csrss.exe calls into their 64-bit version.
It’s also worth noting that at the end of the ntdll.dll system table, there are several functions with NtWow64* calls, such as NtWow64ReadVirtualMemory64, NtWow64WriteVirtualMemory64 and others. These are special functions which are provided only to WoW64 processes.
One of these special functions is also NtWow64CallFunction64. It has it’s own small dispatch table and callers can select which function should be called based on its index:
NOTE: I’ll be talking about one of these functions — namely Wow64CallFunctionTurboThunkControl — later in the Disabling Turbo thunks section.
This function is similar to the kernel’s nt!KiSystemCall64 — it does the dispatching of the system call. This function is exported by the wow64.dll and imported by the emulation DLLs. Wow64SystemServiceEx accepts 2 arguments:
The system call number
Pointer to an array of 32-bit arguments passed to the system call (as mentioned earlier)
The system call number isn’t just an index, but also contains index of a system table which needs to be selected (this is also true for ntoskrnl.exe):
This function then selects ServiceTables[ServiceTableIndex] and calls the appropriate wh*function based on the SystemCallNumber.
NOTE: In case the wow64log.dll has been successfully loaded, the Wow64SystemServiceEx function calls Wow64LogSystemServiceWrapper (wrapper around wow64log!Wow64LogSystemService function): once before the actual system call and one immediately after. This can be used for instrumentation of each WoW64 system call! The structure passed to Wow64LogSystemService contains every important information about the system call — it’s table index, system call number, the argument list and on the second call, even the resulting NTSTATUS! You can find layout of this structure in the appendix(WOW64_LOG_SERVICE).
Finally, as have been mentioned, the WOW64_SERVICE_TABLE_DESCRIPTOR structure differs from KSERVICE_TABLE_DESCRIPTOR in that it contains ErrorCase table. The code mentioned above is actually wrapped in a SEH__try/__except block. If whService raise an exception, the __exceptblock calls Wow64HandleSystemServiceError function. The function looks if the corresponding service table which raised the exception has non-NULLErrorCase and if it does, it selects the appropriate WOW64_ERROR_CASE for the system call. If the ErrorCase is NULL, the values from ErrorCaseDefault are used. The NTSTATUS of the exception is then transformed according to an algorithm which can be found in the appendix.
As you’ve probably guessed, this function constructs path to the binary-translator DLL, which is — on x64 — known as wow64cpu.dll. This DLL will be responsible for the actual low-level emulation.
We can see that there is no wow64cpu.dll on ARM64. Instead, there is xtajit.dll used for x86 emulation and wowarmhw.dll used for ARM32 emulation.
NOTE: The CpuGetBinaryTranslatorPath function is same on both x64 and ARM64 except for one peculiar difference: on Windows x64, if the \Registry\Machine\Software\Microsoft\Wow64\x86 key cannot be opened (is missing/was deleted), the function contains a fallback to load wow64cpu.dll. On Windows ARM64, though, it doesn’t have such fallback and if the registry key is missing, the function fails and the WoW64 process is terminated.
wow64.dll then loads one of the selected DLL and tries to find there following exported functions:
Interestingly, not all functions need to be found — only those marked with the “(!)”, the rest is optional. As a next step, the resolved BTCpuProcessInit function is called, which performs binary-translator-specific process initialization. We’ll cover that in later section.
At the end of the ProcessInit function, wow64!Wow64ProtectMrdata(1) is called, making .mrdatanon-writable again.
For WOW64_CPUFLAGS_SOFTWARE emulators, it creates an event, which added intoAlertByThreadIdEventHashTable and set to NtCurrentTeb()->TlsSlots. This event is used for special emulation of NtAlertThreadByThreadId and NtWaitForAlertByThreadId.
NOTE: The WOW64_CPUFLAGS_MSFT64 (1) or the WOW64_CPUFLAGS_SOFTWARE (2) flag is stored in the NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO], in the WOW64INFO.CpuFlags field. One of these flags is always set in the emulator’s BTCpuProcessInit function (mentioned in the section above):
wow64cpu.dll sets WOW64_CPUFLAGS_MSFT64 (1)
wowarmhw.dll sets WOW64_CPUFLAGS_MSFT64 (1)
xtajit.dll sets WOW64_CPUFLAGS_SOFTWARE (2)
x86 on x64
Entering 32-bit mode
RunSimulatedCode runs in a loop and performs transitions into 32-bit mode either via:
jmp fword ptr[reg] — a “far jump” that not only changes instruction pointer (RIP), but also the code segment register (CS). This segment usually being set to 0x23, while 64-bit code segment is 0x33
synthetic “machine frame” and iret — called on every “state reset”
NOTE: Explanation of segmentation and “why does it work just by changing a segment register” is beyond scope of this article. If you’d like to know more about “long mode” and segmentation, you can start here.
Far jump is used most of the time for the transition, mainly because it’s faster. iret on the other hand is more powerful, as it can change CS, SS, EFLAGS, RSP and RIP all at once. The “state reset” occurs when WOW64_CPURESERVED.Flags has WOW64_CPURESERVED_FLAG_RESET_STATE (1) bit set. This happens during exception (see wow64!Wow64PrepareForException and wow64cpu!BTCpuResetToConsistentState). Also, this flag is cleared on every emulation loop (using btr — bit-test-and-reset).
You can see the simplest form of switching into the 32-bit mode. Also, at the beginning you can see that TurboThunkDispatch address is moved into the r15 register. This register stays untouched during the whole RunSimulatedCode function. Turbo thunks will be explained in more detail later.
Leaving 32-bit mode
The switch back to the 64-bit mode is very similar — it also uses far jumps. The usual situation when code wants to switch back to the 64-bit mode is upon system call:
The Wow64SystemServiceCall is just a simple jump to the Wow64Transition:
If you remember, the Wow64Transition value is resolved by the wow64cpu!BTCpuGetBopCodefunction:
It selects either KiFastSystemCall or KiFastSystemCall2 based on the CpupSystemCallFast value.
The KiFastSystemCall looks like this (used when CpupSystemCallFast != 0):
[x86] jmp 33h:$+9 (jumps to the instruction below)
[x64] jmp qword ptr [r15+offset] (which points to CpupReturnFromSimulatedCode)
The KiFastSystemCall2 looks like this (used when CpupSystemCallFast == 0):
[x86] push 0x33
[x86] push eax
[x86] call $+5
[x86] pop eax
[x86] add eax, 12
[x86] xchg eax, dword ptr [esp]
[x86] jmp fword ptr [esp] (jumps to the instruction below)
[x64] add rsp, 8
[x64] jmp wow64cpu!CpupReturnFromSimulatedCode
Clearly, the KiFastSystemCall is faster, so why it’s not used used every time?
It turns out, CpupSystemCallFast is set to 1 in the wow64cpu!BTCpuProcessInit function if the process is not executed with the ProhibitDynamicCode mitigation policy and if NtProtectVirtualMemory(&KiFastSystemCall, PAGE_READ_EXECUTE) succeeds.
This is because KiFastSystemCall is in a non-executable read-only section (W64SVC) whileKiFastSystemCall2 is in read-executable section (WOW64SVC).
But the actual reason why is KiFastSystemCall in non-executable section by default and needs to be set as executable manually is, honestly, unknown to me. My guess would be that it has something to do with relocations, because the address in the jmp 33h:$+9 instruction must be somehow resolved by the loader. But maybe I’m wrong. Let me know if you know the answer!
I hope you didn’t forget about the TurboThunkDispatch address hanging in the r15 register. This value is used as a jump-table:
There are 32 items in the jump-table.
CpupReturnFromSimulatedCode is the first code that is always executed in the 64-bit mode when 32-bit to 64-bit transition occurs. Let’s recapitulate the code:
Stack is swapped,
Non-volatile registers are saved
eax — which contains the encoded service table index and system call number — is moved into the ecx
it’s high-word is acquired via ecx >> 16.
the result is used as an index into the TurboThunkDispatch jump-table
You might be confused now, because few sections above we’ve defined the service number like this:
Based on our new definition of WOW64_SYSTEM_SERVICE, we can conclude that:
NtMapViewOfSection uses turbo thunk with index 0 (TurboDispatchJumpAddressEnd)
NtWaitForSingleObject uses turbo thunk with index 13 (Thunk3ArgSpNSpNSpReloadState)
NtDeviceIoControlFile uses turbo thunk with index 27 (DeviceIoctlFile)
Let’s finally explain “turbo thunks” in proper way.
Turbo thunks are an optimalization of WoW64 subsystem — specifically on Windows x64 — that enables for particular system calls to never leave the wow64cpu.dll — the conversion of parameters and return value, and the syscall instruction itself is fully performed there. The set of functions that use these turbo thunks reveals, that they are usually very simple in terms of parameter conversion — they receive numerical values or handles.
The notation of Thunk* labels is as follows:
The number specifies how many arguments the function receives
Sp converts parameter with sign-extension
NSp converts parameter without sign-extension
ReloadState will return to the 32-bit mode using iret instead of far jump, if WOW64_CPURESERVED_FLAG_RESET_STATE is set
QuerySystemTime, ReadWriteFile, DeviceIoctlFile, … are special cases
Let’s take the NtWaitForSingleObject and its turbo thunk Thunk3ArgSpNSpNSpReloadState as an example:
it receives 3 parameters
1st parameter is sign-extended
2nd parameter isn’t sign-extended
3rd parameter isn’t sign-extended
it can switch to 32-bit mode using iret if WOW64_CPURESERVED_FLAG_RESET_STATE is set
When we cross-check this information with its function prototype, it makes sense:
The sign-extension of HANDLE makes sense, because if we pass there an INVALID_HANDLE_VALUE, which happens to be 0xFFFFFFFF (-1) on 32-bits, we don’t want to convert this value to 0x00000000FFFFFFFF, but 0xFFFFFFFFFFFFFFFF.
On the other hand, if the TurboThunkNumber is 0, the call will end up in theTurboDispatchJumpAddressEnd which in turn calls wow64!Wow64SystemServiceEx. You can consider this case as the “slow path”.
Disabling Turbo thunks
On Windows x64, the Turbo thunk optimization can be actually disabled!
In one of the previous sections I’ve been talking about ntdll32!NtWow64CallFunction64 andwow64!Wow64CallFunctionTurboThunkControl functions. As with any other NtWow64* function, NtWow64CallFunction64 is only available in the WoW64 ntdll.dll. This function can be called with an index to WoW64 function in the Wow64FunctionDispatch64 table (you could see earlier).
NOTE: This function prototype has been reconstructed with the help of thewow64!Wow64CallFunction64Nop function code, which just logs the parameters.
We can see that wow64!Wow64CallFunctionTurboThunkControl can be called with an index of 2. This function performs some sanity checks and then passes callswow64cpu!BTCpuTurboThunkControl(*(ULONG*)InputBuffer).
wow64cpu!BTCpuTurboThunkControl then checks the input parameter.
If it’s 0, it patches every target of the jump table to point to TurboDispatchJumpAddressEnd(remember, this is the target that is called when WOW64_SYSTEM_SERVICE.TurboThunkNumber is 0).
If it’s non-0, it returns STATUS_NOT_SUPPORTED.
This means 2 things:
Calling wow64cpu!BTCpuTurboThunkControl(0) disables the Turbo thunks, and every system call ends up taking the “slow path”.
It is not possible to enable them back.
With all this in mind, we can achieve disabling Turbo thunks by this call:
What it might be good for? I can think of 3 possible use-cases:
If we deploy custom wow64log.dll (explained earlier), disabling Turbo thunks guarantees that we will see every WoW64 system call in our wow64log!Wow64LogSystemService callback. We wouldn’t see such calls if the Turbo thunks were enabled, because they would take the “fast path” inside of the wow64cpu.dll where the syscall would be executed.
If we decide to hook Nt* functions in the native ntdll.dll, disabling Turbo thunks guarantees that for each Nt* function called in the ntdll32.dll, the correspondint Nt* function will be called in the native ntdll.dll. (This is basically the same point as the previous one.)
NOTE: Keep in mind that this only applies on system calls, i.e. on Nt* or Zw* functions. Other functions are not called from the 32-bit ntdll.dll to the 64-bit ntdll.dll. For example, if we hooked RtlDecompressBuffer in the native ntdll.dll of the WoW64 process, it wouldn’t be called on ntdll32!RtlDecompressBuffer call. This is because the full implementaion of the Rtl*functions is already in the ntdll32.dll.
We can “harmlessly” patch high-word moved to the eax in every WoW64 system call stub to 0. For example we could see in NtWaitForSingleObject there is mov eax, 0D0004h. If we patched appropriate 2 bytes in that instruction so that the instruction would become mov eax, 4h, the system call would still work.This approach can be used as an anti-hooking technique — if there’s a jump at the start of the function, the patch will break it. If there’s not a jump, we just disable the Turbo thunk for this function.
x86 on ARM64
Emulation of x86 applications on ARM64 is handled by an actual binary translation. Instead of wow64cpu.dll, the xtajit.dll (probably shortcut for “x86 to ARM64 JIT”) is used for its emulation. As with other emulation DLLs, this DLL is native (ARM64).
The x86 emulation on Windows ARM64 consists also of other “XTA” components:
xtac.exe — XTA Compiler
XtaCache.exe — XTA Cache Service
Execution of x86 programs on ARM64 appears to go way behind just emulation. It is also capable of caching already binary-translated code, so that next execution of the same application should be faster. This cache is located in the Windows\XtaCache directory which contains files in format FILENAME.EXT.HASH1.HASH2.mp.N.jc. These files are then mapped to the user-mode address space of the application. If you’re asking whether you can find an actual ARM64 code in these files — indeed, you can.
The whole “XTA” and its internals are not in the focus of this article, but they would definitely deserve a separate article.
Unfortunatelly, Microsoft doesn’t provide symbols to any of these xta* DLLs or executables. But if you’re feeling adventurous, you can find some interesting artifacts, like this array of structures inside of the xtajit.dll, which contains name of the function and its pointer. There are thousands of items in this array:
With a simple Python script, we can mass-rename all functions referenced in this array:
I’d like to thank Milan Boháček for providing me this script.
Windows\SyCHPE32 & Windows\SysWOW64
One thing you can observe on ARM64 is that it contains two folders used for x86 emulation. The difference between them is that SyCHPE32 contains small subset of DLLs that are frequently used by applications, while contents of the SysWOW64 folder is quite identical with the content of this folder on Windows x64.
The CHPE DLLs are not pure-x86 DLLs and not even pure-ARM64 DLLs. They are “compiled-hybrid-PE”s. What does it mean? Let’s see:
After opening SyCHPE32\ntdll.dll, IDA will first tell us — unsurprisingly — that it cannot download PDB for this DLL. After looking at randomly chosen Nt* function, we can see that it doesn’t differ from what we would see in the SysWOW64\ntdll.dll. Let’s look at some non-Nt* function:
We can see it contains regular x86 function prologue, immediately followed by x86 function epilogue and then jump somewhere, where it looks like that there’s just garbage.
My guess is that the reason for this prologue is probably compatibility with applications that check whether some particular functions are hooked or not — by checking if the first bytes of the function contain real prologue.
NOTE: Again, if you’re feeling adventurous, you can patch FileHeader.Machine field in the PE header to IMAGE_FILE_MACHINE_ARM64 (0xAA64) and open this file in IDA. You will see a whole lot of correctly resolved ARM64 functions. Again, I’d like to thank to Milan Boháček for this tip.
SVC instruction is sort-of equivalent of SYSCALL instruction on ARM64 — it basically enters the kernel mode. But there is a small difference between SYSCALL and SVC: while on Windows x64 the system call number is moved into the eax register, on ARM64 the system call number can be encoded directly into the SVC instruction.
Let’s peek for a moment into the kernel to see how is this SVC instruction handled:
We can see that:
MRS X30, ELR_EL1 — current interrupt-return address (stored in ELR_EL1 system register) will be moved to the register X30 (link register — LR).
MSR ELR_EL1, X15 — the interrupt-return address will be replaced by value in the register X15(which is aliased to the instruction pointer register — PC — in the 32-bit mode).
ORR X16, X16, #0b10000 — bit  is being set in X16 which is later moved to the SPSR_EL1register. Setting this bit switches the execution mode to 32-bits.
Simply said, in the X15 register, there is an address that will be executed once we leave the kernel-mode and enter the user-mode — which happens with the ERET instruction at the end.
nt!KiExit32BitMode / UND #0xF8
Alright, we’re in the 32-bit ARM mode now, how exactly do we leave? Windows solves this transition via UND instruction — which is similar to the UD2 instruction on the Intel CPUs. If you’re not familiar with it, you just need to know that it is instruction that basically guarantees that it’ll throw “invalid instruction” exception which can OS kernel handle. It is defined-“undefined instruction”. Again there is the same difference between the UND and UD2 instruction in that the ARM can have any 1-byte immediate value encoded directly in the instruction.
Let’s look at the NtMapViewOfSection system call in the SysArm32\ntdll.dll:
Let’s peek into the kernel again:
Keep in mind that meanwhile the 32-bit code is running, it cannot modify the value of the previously stored X30 register — it is not visible in 32-bit mode. It stays there the whole time. Upon UND #0xF8 execution, following happens:
the KiFetchOpcodeAndEmulate function moves value of X30 into X24 register (not shown on the screenshot).
AND X19, X16, #0xFFFFFFFFFFFFFFC0 — bit  (among others) is being cleared in the X19register, which is later moved to the SPSR_EL1 register. Clearing this bit switches the execution mode back to 64-bits.
KiExit32BitMode then moves the value of X24 register into the ELR_EL1 register. That means when this function finishes its execution, the ERET brings us back to the 64bit code, right after the SVC 0xFFFF instruction.
NOTE: It can be noticed that Windows uses UND instruction for several purposes. Common example might also be UND #0xFE which is used as a breakpoint instruction (equivalent of __debugbreak() / int3)
As you could spot, 3 kernel transitions are required for emulation of the system call (SVC 0xFFFF, system call itself, UND 0xF8). This is because on ARM there doesn’t exist a way how to switch between 32-bit and 64-bit mode only in user-mode.
If you’re looking for “ARM Heaven’s Gate” — this is it. Put whatever function address you like into the X15 register and execute SVC 0xFFFF. Next instruction will be executed in the 32-bit ARM mode, starting with that address. When you feel you’d like to come back into 64-bit mode, simply execute UND #0xF8 and your execution will continue with the next instruction after the SVC 0xFFFF.
PE-sieve is a light-weight tool that helps to detect malware running on the system, as well as to collect the potentially malicious material for further analysis. Recognizes and dumps variety of implants within the scanned process: replaced/injected PEs, shellcodes, hooks, and other in-memory patches.
Detects inline hooks, Process Hollowing, Process Doppelgänging, Reflective DLL Injection, etc.
example: classic unmapping (2) vs remapping (3) — with remapping full virtual content of the section is preserved, so it helps i.e. if the full section was unpacked in memory, or if virtual caves were used
VT-x is name of CPU virtualisation technology by Intel. KVM is component of Linux kernel which makes use of VT-x. And QEMU is a user-space application which allows users to create virtual machines. QEMU makes use of KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don’t expect an in-depth exposition about all aspects here, although in future, I might follow this up with more focused posts about some specific parts.
Something About Virtualisation First
Let’s first touch upon some theory before going into main discussion. Related to virtualisation is concept of emulation – in simple words, faking the hardware. When you use QEMU or VMWare to create a virtual machine that has ARM processor, but your host machine has an x86 processor, then QEMU or VMWare would emulate or fake ARM processor. When we talk about virtualisation we mean hardware assisted virtualisation where the VM’s processor matches host computer’s processor. Often conflated with virtualisation is an even more distinct concept of containerisation. Containerisation is mostly a software concept and it builds on top of operating system abstractions like process identifiers, file system and memory consumption limits. In this post we won’t discuss containers any more.
A typical VM set up looks like below:
At the lowest level is hardware which supports virtualisation. Above it, hypervisor or virtual machine monitor (VMM). In case of KVM, this is actually Linux kernel which has KVM modules loaded into it. In other words, KVM is a set of kernel modules that when loaded into Linux kernel turn the kernel into hypervisor. Above the hypervisor, and in user space, sit virtualisation applications that end users directly interact with – QEMU, VMWare etc. These applications then create virtual machines which run their own operating systems, with cooperation from hypervisor.
Finally, there is “full” vs. “para” virtualisation dichotomy. Full virtualisation is when OS that is running inside a VM is exactly the same as would be running on real hardware. Paravirtualisation is when OS inside VM is aware that it is being virtualised and thus runs in a slightly modified way than it would on real hardware.
VT-x is CPU virtualisation for Intel 64 and IA-32 architecture. For Intel’s Itanium, there is VT-I. For I/O virtualisation there is VT-d. AMD also has its virtualisation technology called AMD-V. We will only concern ourselves with VT-x.
Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to real, protected, long etc, and also orthogonal to privilege rings (0-3). They form a new “plane” so to speak. Hypervisor runs in root mode and VMs run in non-root mode. When in non-root mode, CPU-bound code mostly executes in the same way as it would if running in root mode, which means that VM’s CPU-bound operations run mostly at native speed. However, it doesn’t have full freedom.
Privileged instructions form a subset of all available instructions on a CPU. These are instructions that can only be executed if the CPU is in higher privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions – those which affect the overall state of CPU. Examples are those instructions which modify clock or interrupt registers, or write to control registers in a way that will change the operation of root mode. This smaller subset of sensitive instructions are what the non-root mode can’t execute.
VMX and VMCS
Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.
VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.
VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.
VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.
VMRESUME: Enters non-root mode for an existing VM instance.
When a VM attempts to execute an instruction that is prohibited in non-root mode, CPU immediately switches to root mode in a trap-like way. This is called a VM exit.
Let’s synthesise the above information. CPU starts in a normal mode, executes VMXON to start virtualisation in root mode, executes VMLAUNCH to create and enter non-root mode for a VM instance, VM instance runs its own code as if running natively until it attempts something that is prohibited, that causes a VM exit and a switch to root mode. Recall that the software running in root mode is hypervisor. Hypervisor takes action to deal with the reason for VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.
Of course the above description leaves some gaps. For example, how does hypervisor know why VM exit happened? And what makes one VM instance different from another? This is where VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4KiB part of physical memory which contains information needed for the above process to work. This information includes reasons for VM exit as well as information unique to each VM instance so that when CPU is in non-root mode, it is the VMCS which determines which instance of VM it is running.
As you may know, in QEMU or VMWare, we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU or vCPU. For each vCPU there is one VMCS. This means that VMCS stores information on CPU-level granularity and not VM level. To read and write a particular VMCS, VMREAD and VMWRITE instructions are used. They effectively require root mode so only hypervisor can modify VMCS. Non-root VM can perform VMWRITE but not to the actual VMCS, but a “shadow” VMCS – something that doesn’t concern us immediately.
There are also instructions that operate on whole VMCS instances rather than individual VMCSs. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.
Here it’s worth noting that all the VMX instructions above require CPL level 0, so they can only be executed from inside the Linux kernel (or other OS kernel).
VMCS basically stores two types of information:
Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
Control info which determines behaviour of the VM inside non-root mode.
More specifically, VMCS is divided into six parts.
Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
Host-state stores host CPU state on VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
VM exit information fields tell hypervisor why the exit happened and provide additional information.
There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Virtual to physical address conversion inside VM is done using a VT-x feature called Extended Page Tables (EPT). Translation Lookaside Buffer (TLB) is used to cache virtual to physical mappings in order to save page table lookups. TLB semantics also change to accommodate virtual machines. Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In VM this too is virtualised and there are virtual interrupts which can be controlled by one of the control fields in VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.
Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that when loaded, turn Linux kernel into hypervisor. Linux continues its normal operations as OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: core module and machine specific modules. kvm.ko is the core module which is always needed. Depending on the host machine CPU, a machine specific module, like kvm-intel.ko or kvm-amd.ko will be needed. As you can guess, kvm-intel.ko uses the functionality we described above in VT-x section. It is KVM which executes VMLAUNCH/VMRESUME, sets up VMCS, deals with VM exits etc. Let’s also mention that AMD’s virtualisation technology AMD-V also has its own instructions and they are called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`. These contain code which deals with virtualisation facilities of AMD and Intel respectively.
KVM interacts with user space – in our case QEMU – in two ways: through device file `/dev/kvm` and through memory mapped pages. Memory mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory mapped pages per vCPU and they are used for high volume data transfer between QEMU and the VM in kernel.
`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU. Everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.
System Level: Calls this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
vCPU Level: This is lowest granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see QEMU section below), calls to this API are done in the same thread that was used to create the vCPU.
After creating vCPU QEMU continues interacting with it using the ioctls and memory mapped pages.
Quick Emulator (QEMU) is the only user space component we are considering in our VT-x/KVM/QEMU stack. With QEMU one can run a virtual machine with ARM or MIPS core but run on an Intel host. How is this possible? Basically QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware. So it can make itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with Tiny Code Generator (TCG). This can be thought if as a sort of high-level language VM, like JVM. It takes for instance, MIPS code, converts it to an intermediate bytecode which then gets executed on the host hardware.
The other mode of QEMU – as a virtualiser – is what achieves the type of virtualisation that we are discussing here. As virtualiser it gets help from KVM. It talks to KVM using ioctl’s as described above.
QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates impression of multiple CPUs for the software running inside its VM. Given QEMU’s roots in emulation, it can emulate I/O which is something that KVM may not fully support – take example of a VM with particular serial port on a host that doesn’t have it. Now, when software inside VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with pointer to info about the I/O request. QEMU emulates the I/O device for that requests – thus fulfilling it for software inside VM – and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.
In the end, let us summarise the overall picture in a diagram: